Informatica

ALMERIA: Boosting Pairwise Molecular Contrasts with Scalable Methods
Volume 35, Issue 3 (2024), pp. 617–648
Rafael Mena-Yedra, Juana López Redondo, Horacio Pérez-Sánchez, Pilar Martinez Ortigosa

https://doi.org/10.15388/24-INFOR558
Pub. online: 14 May 2024      Type: Research Article      Open Access

Received: 1 June 2023
Accepted: 1 April 2024
Published: 14 May 2024

Abstract

This work introduces ALMERIA, a decision-support tool for drug discovery. It estimates compound similarities and predicts activity, considering conformation variability. The methodology spans from data preparation to model selection and optimization. Implemented using scalable software, it handles large data volumes swiftly. Experiments were conducted on a distributed computer cluster using the DUD-E database. Models were evaluated on different data partitions to assess generalization ability with new compounds. The tool demonstrates excellent performance in molecular activity prediction (ROC AUC: 0.99, 0.96, 0.87), indicating good generalization properties of the chosen data representation and modelling. Molecular conformation sensitivity is also evaluated.
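The ROC AUC values quoted in the abstract can be read as the probability that a randomly chosen active compound is scored above a randomly chosen decoy. The following is a minimal sketch of that definition in plain Python, not the authors' evaluation code; the function name `roc_auc`, the labels, and the scores are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the chance that a random active outranks a random decoy.

    A tie between an active and a decoy counts as half a win.
    """
    actives = [s for y, s in zip(labels, scores) if y == 1]
    decoys = [s for y, s in zip(labels, scores) if y == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


# Hypothetical toy data (not from the paper): 1 = active, 0 = decoy.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"ROC AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the number of compounds, so it is suitable only for small illustrations; a rank-based formulation or a library routine would be the usual choice for large screening sets.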

Biographies

Mena-Yedra Rafael
https://orcid.org/0000-0002-7046-6741
rafael.mena@ual.es

R. Mena-Yedra is a senior data scientist. He holds a PhD in computing from the Universitat Politècnica de Catalunya (UPC) in Barcelona, Spain. His expertise lies in industrial data science, where he has applied AI/ML techniques across various domains including energy demand modelling, automatic control of microalgae photobioreactors, transportation and mobility research, and has made contributions to the financial industry and cheminformatics. With his diverse experience, he currently works in the AI industry creating innovative solutions to complex problems.

López Redondo Juana
https://orcid.org/0000-0003-2826-1635
jlredondo@ual.es

J. López Redondo is a distinguished full professor in the Department of Computer Science at the University of Almería, Spain. With a PhD in advanced computer techniques, she focuses her research on high-performance computing, optimization applications, and the design of solar energy systems. Her significant contributions to these fields have earned her recognition at international conferences and publications in various international journals. She is also an integral part of the research group TIC 146 Supercomputación-Algoritmos (SAL), established in 1995.

Pérez-Sánchez Horacio
https://orcid.org/0000-0003-4468-7898
hperez@ucam.edu

H. Pérez-Sánchez is a distinguished researcher currently serving as principal investigator of the Structural Bioinformatics and High Performance Computing research group at the Catholic University of Murcia (UCAM) in Murcia, Spain. He earned his PhD in computational chemistry from the University of Murcia and now holds a full professorship at UCAM, where he teaches in various disciplines including biotechnology, pharmacy, medicine, and computer engineering.

Martinez Ortigosa Pilar
https://orcid.org/0000-0001-6514-6543
ortigosa@ual.es

P. Martínez Ortigosa is a distinguished full professor in the Department of Computer Science at the University of Almeria, Spain, where she also earned her PhD in computer science. Her research is primarily centred on creating high performance computing software and tools, designed to tackle a broad spectrum of issues and applications originating from various scientific and technological fields. She contributes her expertise to the research group TIC 146 Supercomputación-Algoritmos (SAL), a group that has been active since 1995.



Copyright
© 2024 Vilnius University
Open access article under the CC BY license.

Keywords
virtual screening, decision tool, data modelling, data augmentation, distributed computing

Funding
This work has been partially supported by Grant PID2021-123278OB-I00 funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”, and by J. Andalucía through Projects UAL18-TIC-A020-B and P18-RT-1193.

INFORMATICA

  • Online ISSN: 1822-8844
  • Print ISSN: 0868-4952