1 Introduction
1. Data Augmentation: ALMERIA employs pairwise molecular contrasts as a form of data augmentation and accounts for variations in molecular conformations, improving prediction accuracy.
2. Scalability: The methodology has been implemented using scalable software and methods, allowing it to handle large volumes of data (up to several terabytes). Even when processing a substantial batch of queries, ALMERIA provides rapid responses.
3. Model Evaluation: Detailed data split criteria have been applied to evaluate the models on different data partitions. This rigorous evaluation ensures that the models generalize well to new compounds.
4. State-of-the-Art Performance: Experimental results demonstrate state-of-the-art performance in molecular activity prediction, with ROC AUC values of 0.99, 0.96, and 0.87 across the different data partitions.
5. Robustness and Generalization: The chosen data representation and modelling techniques exhibit good generalization properties. Additionally, a sensitivity analysis of molecular conformations further validates the robustness of ALMERIA’s predictions.
2 Materials and Methods
Fig. 1
2.1 Data Collection
Table 1
Block name | Number of descriptors |
Constitutional descriptors | 43 |
Ring descriptors | 32 |
Topological indices | 75 |
Walk and path counts | 46 |
Connectivity indices | 37 |
Information indices | 48 |
2D matrix-based descriptors | 550 |
2D autocorrelations | 213 |
Burden eigenvalues | 96 |
P-VSA-like descriptors | 45 |
ETA indices | 23 |
Edge adjacency indices | 324 |
Geometrical descriptors | 38 |
3D matrix-based descriptors | 90 |
3D autocorrelations | 80 |
RDF descriptors | 210 |
3D-MoRSE descriptors | 224 |
WHIM descriptors | 114 |
GETAWAY descriptors | 273 |
Randic molecular profiles | 41 |
Functional group counts | 154 |
Atom-centered fragments | 115 |
Atom-type E-state indices | 170 |
CATS 2D | 150 |
2D Atom Pairs | 1,596 |
3D Atom Pairs | 36 |
Charge descriptors | 15 |
Molecular properties | 20 |
Drug-like indices | 27 |
2.2 Data Preparation
1. The set of conformations for every compound is reduced to a single representative sample by averaging their descriptor values, i.e. grouping by molecule and then averaging the descriptors column-wise. This representative sample is thus built considering the different conformations the molecule may adopt and, as a side effect, improves the efficiency of model building. The main benefit sought with this step is a fairer guidance of the model optimization and evaluation, since activity data is usually labelled per molecule, whereas conformation generation may leave certain molecules overrepresented with respect to others, biasing both the optimization and the evaluation metrics (a risk of frequency bias). In any case, this step is optional for ALMERIA within the proposed methodology shown in Fig. 1, and all conformations could be used for modelling. However, since we have included this step in the current work and know that it can be controversial outside the field of machine learning, Section 3.4 within the experimentation analyses its impact on the set of generated conformations.
2. Instead of building a separate model for every target, as often found in the literature, we opt for building a single model that considers the specific contrast between compounds that correlates with biological activity. We achieve this by computing the absolute difference between the descriptors of every pair of target and ligand molecules. This procedure has a data augmentation effect, and it aims to improve the generalization performance on compounds not yet seen during model fitting, while making the ALMERIA methodology more efficient with a single model and without sacrificing interpretability. Both preparation steps are illustrated in the sketch below.
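The following is a minimal sketch of these two preparation steps; the column names ("molecule_id", "activity") are illustrative assumptions and do not come from the original pipeline, which operates on Dragon descriptor tables with one row per conformation.

```python
# Illustrative sketch only: "molecule_id" and "activity" are assumed column
# names for a table of descriptors with one row per conformation.
import pandas as pd

def average_conformations(df: pd.DataFrame) -> pd.DataFrame:
    """Step 1: collapse all conformations of a molecule into one
    representative sample by averaging its descriptor columns."""
    descriptor_cols = df.columns.difference(["molecule_id", "activity"])
    averaged = df.groupby("molecule_id")[descriptor_cols].mean()
    labels = df.groupby("molecule_id")["activity"].first()
    return averaged.join(labels)

def pairwise_contrast(target_desc: pd.Series, ligand_desc: pd.Series) -> pd.Series:
    """Step 2: pairwise molecular contrast as the absolute difference
    between the target and ligand descriptor vectors."""
    return (target_desc - ligand_desc).abs()
```

In the actual methodology, the contrast is computed for every target–ligand pair, so each pair yields one labelled sample for the single activity model.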
2.3 Data Split
2.4 Modelling
2.4.1 The Main Proposal: Gradient Boosting
• Complex and non-linear mapping from feature to output space.
• Structured input data in a high-dimensional space and at large volume. This requires an efficient approach that can also scale to accommodate increasing volumes of data, possibly leveraging more hardware in a distributed environment.
• Expensive but valuable annotated data, thus leveraging a supervised learning approach to get the most out of prior efforts.
• The importance of having annotated data also lies in being able to assess the performance of the model on new out-of-sample molecules not seen during model fitting. For this reason, having a high-capacity model is as important as having tools to avoid memorization and overfitting, a potential pitfall in the field (Wallach and Heifets, 2018).
• Ease of interpreting both the model output and the factors that most influence its decisions.
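A minimal sketch of such a gradient-boosting classifier is shown below, assuming XGBoost as the implementation (as in the reported results) but with placeholder hyperparameters and synthetic data standing in for the pairwise contrast features.

```python
# Sketch only: hyperparameters are placeholders, and make_classification
# stands in for the real |descriptor(target) - descriptor(ligand)| matrix.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.05,
    tree_method="hist",   # histogram-based splits scale to large descriptor matrices
    eval_metric="auc",    # ROC AUC, the metric reported in the results
    n_jobs=-1,
)
model.fit(X_train, y_train)
print("ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

Tree ensembles of this kind also expose per-feature importance scores, which supports the interpretability requirement listed above.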
2.4.2 Baselines
1. Replace missing numerical data entries with a simple imputation strategy using the mean value of the corresponding feature.
2. Drop feature columns whose variance is almost zero, i.e. constant values.
3. Apply Z-score normalization to transform the different features onto the same scale, with mean 0 and standard deviation 1. The statistics used to apply the normalization are calculated from the training data partition to avoid data leakage. A sketch of this preprocessing chain is given below.
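As an illustration, these three steps map naturally onto a scikit-learn pipeline; the variance threshold and the final estimator here (logistic regression, one of the baselines in Table 2) are illustrative choices rather than the exact configuration used.

```python
# Illustrative pipeline mirroring the three preprocessing steps; fitting it on
# the training partition only keeps the normalization free of data leakage.
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

baseline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),           # 1. mean imputation
    ("drop_constant", VarianceThreshold(threshold=0.0)),  # 2. drop zero-variance columns
    ("zscore", StandardScaler()),                          # 3. Z-score normalization
    ("clf", LogisticRegression(max_iter=1000)),
])
# baseline.fit(X_train, y_train); baseline.predict_proba(X_test)
```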
Table 2
Model | Drop zero-variance columns | Impute missing data | Z-score normalization
Logistic Regression | Yes | Yes | Yes |
SVM | Yes | Yes | Yes |
Random Forest | Yes | Yes | – |
Deep Learning | Yes | Yes | * |
3 Results and Discussion
3.1 Experiment Setup
Fig. 2
• Data partition A: 96 out of the 102 target proteins along with their associated ligand compounds. The list of target proteins for this partition is: ACES, ADA, ADRB1, ADRB2, AKT2, ALDR, AMPC, AOFB, BACE1, BRAF, CAH2, CASP3, CDK2, COMT, CP2C9, CP3A4, CSF1R, CXCR4, DEF, DHI1, DPP4, DRD3, DYR, EGFR, ESR1, ESR2, FA10, FA7, FABP4, FAK1, FGFR1, FKB1A, FNTA, FPPS, GCR, GLCM, GRIA2, GRIK1, HDAC2, HDAC8, HIVINT, HIVPR, HIVRT, HMDH, HS90A, HXK4, IGF1R, INHA, ITAL, JAK2, KIF11, KIT, KITH, KPCB, LCK, LKHA4, MAPK2, MCR, MET, MK01, MK10, MK14, MMP13, MP2K1, NOS1, NRAM, PA2GA, PARP1, PDE5A, PGH1, PGH2, PLK1, PNPH, PPARA, PPARD, PPARG, PRGR, PTN1, PUR2, PYGM, PYRD, RENI, ROCK1, RXRA, SAHH, SRC, TGFR1, THB, THRB, TRY1, TRYB1, TYSY, UROK, VGFR2, WEE1, XIAP.
– Data partition A.1: 70% of data partition A has been used to train the models in a K = 10 cross-validation setting.
– Data partition A.2: 30% of data partition A has been reserved for testing the models’ accuracy after they have been trained with partition A.1. This sub-partitioning implies that target proteins from partition A are mixed between partitions A.1 and A.2: they may appear in both or in just one of them, but every ligand compound is either in partition A.1 or in A.2. This allows assessing the model with new ligands not seen during training.
• Data partition B: 6 out of the 102 target proteins and their associated ligand compounds. The list of target proteins for this partition is: AKT1, ACE, AA2AR, ABL1, ANDR, ADA17. This selection has been made by hand to include one target protein per DUD-E subset: Diverse, Dud38, GPCR, Kinase, Nuclear, and Protease. This partition allows the assessment of the model with new targets and ligand compounds not seen before during training. A sketch of this split logic is given below.
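The split logic can be sketched as follows; the column names ("target", "ligand") and the use of scikit-learn's grouped splitting are illustrative assumptions, not the authors' exact code.

```python
# Illustrative sketch: the held-out targets form partition B, the remaining
# pairs are split 70/30 by ligand into A.1 / A.2, and A.1 is used with 10-fold CV.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit, KFold

HELD_OUT_TARGETS = {"AKT1", "ACE", "AA2AR", "ABL1", "ANDR", "ADA17"}  # partition B

def split_partitions(pairs: pd.DataFrame):
    part_b = pairs[pairs["target"].isin(HELD_OUT_TARGETS)]
    part_a = pairs[~pairs["target"].isin(HELD_OUT_TARGETS)]

    # 70/30 split grouped by ligand, so each ligand lands entirely in A.1 or A.2.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=0)
    train_idx, test_idx = next(splitter.split(part_a, groups=part_a["ligand"]))
    part_a1, part_a2 = part_a.iloc[train_idx], part_a.iloc[test_idx]

    cv = KFold(n_splits=10, shuffle=True, random_state=0)  # K = 10 folds on A.1
    return part_a1, part_a2, part_b, cv
```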
3.2 Hyperparameter Optimization
3.3 Activity Modelling Results
Table 3
Model | AUC (partition A.1) | AUC (partition A.2) | AUC (partition B)
LR | 0.74073 | 0.73816 | 0.57002 |
SVM-e | 0.83517 | 0.82335 | 0.70706 |
RF | 0.99958 | 0.98542 | 0.78419 |
DNN-Z | 0.98848 | 0.93944 | 0.65947 |
DNN | 0.84338 | 0.82481 | 0.82999 |
XGB | 0.99933 | 0.96384 | 0.87539 |
Fig. 3
Fig. 4
Fig. 5
3.4 Sensitivity Analysis for Molecular Conformations
Fig. 6
Fig. 7
Fig. 8
3.5 Molecular Similarity Results
4 Performance Measurement
Table 6
Task | Time |
Data preparation | |
Generating a maximum of 100 conformations for each of the 32,909 compounds using Omega software | 5 h 30 m
Generating molecular descriptors for each conformation from all 32,909 compounds (2,594,901 data samples) using Dragon software | 14 h
Data size for the single aa2ar crystal with a maximum of 100 conformations per compound: 2,594,901 data samples, 4,885 columns, 47.27 GB using 32-bit precision. For the entire DUD-E database: 264,679,902 data samples, 4.7 TB |
Reading database | 10 m |
Reduce conformations to a single representative sample | < 2 m |
Compound pair data transformation | < 1 m |
Model building | |
CV folds creation | < 1 m |
Hyperparameter optimization with 100 trials and using 10-fold CV per trial (CPU) | 14 h |
Hyperparameter optimization with 100 trials and using 10-fold CV per trial (GPU) | 6 h |
Final model training | < 1 m |
Model inference | |
Activity and similarity prediction on full dataset with all compound pairs | < 1 s |
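For reference, a 100-trial hyperparameter search with 10-fold cross-validation per trial, the budget reported in Table 6, can be organised along the following lines. The use of Optuna, the search space, and the synthetic stand-in for the A.1 training partition are all illustrative assumptions, not the configuration actually used.

```python
# Hedged sketch of a 100-trial search with 10-fold CV per trial; Optuna and
# the search space are assumptions, and X_a1 / y_a1 stand in for partition A.1.
import optuna
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X_a1, y_a1 = make_classification(n_samples=2000, n_features=50, random_state=0)

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "tree_method": "hist",
    }
    model = XGBClassifier(**params, n_jobs=-1)
    return cross_val_score(model, X_a1, y_a1, cv=10, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```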
5 Conclusion
1. Reducing the multiple conformations of a given molecule to a single representative sample using the averaged descriptor values, thus reducing frequency bias in the model optimization process. Experiments and the sensitivity analysis in Section 3.4 show that the model response is consistent across multiple combinations of conformation pairs for different molecules.
2. Transforming the molecules’ descriptors into pairwise molecular contrasts using the absolute difference between their descriptor values, which has a data augmentation effect. In this way, a single model may fit the entire database and therefore enjoys better generalization properties, as shown by the experiments in Section 3.3. Both steps operate on numerical molecular descriptors, such as those generated by the Dragon software (Mauri et al., 2006) for each conformation of every molecule; these conformations were generated using OpenEye Scientific Omega software (Hawkins et al., 2010) with a limit of 100 conformations per molecule.
• Applications:
– Virtual screening: ALMERIA helps narrow down vast databases of potential drug candidates by predicting their activity and similarity to known drugs. This saves time and money in the early stages of drug development.
– Identifying new leads: By analysing large datasets, ALMERIA can identify promising new molecules with potential drug-like properties that might not have been considered before.
• End users:
– Pharmaceutical companies: These companies are constantly searching for new drugs. ALMERIA can significantly accelerate their drug discovery pipelines.
– Biotech startups: Smaller companies often lack the resources for large-scale drug discovery efforts. ALMERIA’s efficiency and scalability can be a valuable asset for them.
– Academic researchers: Researchers can use ALMERIA to explore potential drug candidates for specific diseases or biological targets.