Workflow

This directory contains the scripts and notebooks used to produce the results in our pre-print, "Pre-trained molecular representations enable antimicrobial discovery".

Below is a brief description of what is done in each step.
- 01.prepare_training_data.ipynb: SMILES are obtained for the chemicals used in the study by Maier, 2018. All molecular representations are then gathered for compounds in the Maier and MedChemExpress (MCE) libraries. The ECFP4 and MolE representations are also obtained for a random selection of molecules from PubChem.
- 02.model_training.ipynb: Here we train XGBoost models to predict antimicrobial activity of compounds using the data from Maier, 2018.
- 03.model_evaluation.ipynb: The results from 02.model_training.ipynb are read and the best model for each molecular representation is collected. Performance metrics and precision-recall curves are calculated, and optimal thresholds for growth-inhibition prediction are determined. Test-set predictions are also analyzed in this notebook.
- 04.new_predictions.ipynb: The models evaluated in 03.model_evaluation.ipynb are used to make predictions for compounds not present in the library used by Maier, 2018. Predictions are made for Halicin and Abaucin, as well as for molecules from the MedChemExpress library. A literature search is then performed for molecules predicted to have broad-spectrum activity.
- 05.analyze_mce_predictions.Rmd: The predictions made in 04.new_predictions.ipynb are explored. Results from the literature search, the ranking of known antibiotics, and the molecules chosen for experimental validation are highlighted. The predictions are also compared with those made by a model using ECFP4.
- 06.experimental_validation.Rmd: Analysis of the results from the experimental validation of the chosen compounds. MIC curves, growth curves, and growth parameters are analyzed.
- 07.pubchem_exploration.ipynb: The representations of a set of 100K randomly selected molecules from PubChem are explored. The molecules most similar to a given query, according to different representations, are also retrieved.
- 08.compare_mce_maier.ipynb: A comparison of the chemical space of the Maier and MCE libraries is performed.
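
As a rough illustration of the threshold step described for 03.model_evaluation.ipynb, the sketch below derives precision-recall points from predicted scores and picks a cutoff. It is not taken from the notebooks: the function names, the toy data, and the choice of maximizing F1 as the "optimal" criterion are all assumptions made for this example, and the notebooks may use a different criterion.

```python
from typing import List, Tuple


def precision_recall_points(
    labels: List[int], scores: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each distinct score.

    A compound is predicted as growth-inhibiting when score >= threshold.
    """
    total_pos = sum(labels)
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= thr and y == 0)
        # tp + fp >= 1 because thr is always one of the observed scores.
        precision = tp / (tp + fp)
        recall = tp / total_pos if total_pos else 0.0
        points.append((thr, precision, recall))
    return points


def best_f1_threshold(
    labels: List[int], scores: List[float]
) -> Tuple[float, float]:
    """Pick the threshold with the highest F1 (assumed criterion)."""
    best_thr, best_f1 = 1.0, -1.0
    for thr, p, r in precision_recall_points(labels, scores):
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1


# Hypothetical toy data: 1 = inhibits growth, 0 = does not.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

threshold, f1 = best_f1_threshold(labels, scores)
print(f"threshold={threshold}, F1={f1:.3f}")
```

On real model outputs the same idea is usually applied to held-out predictions, for example with scikit-learn's `precision_recall_curve`, so that the chosen threshold is not tuned on the training data.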