Deriving Homogeneous Subsets from Gene Sets by Exploiting the Gene Ontology

Stier, Quirin; Thrun, Michael C.

doi:10.15388/23-INFOR517

Informatica

Volume 34, Issue 2 (2023), pp. 357–386

Pub. online: 22 May 2023

Type: Research Article

Open Access

Received: 1 June 2022

Accepted: 1 May 2023

Published: 22 May 2023

Abstract

The Gene Ontology (GO) knowledge base provides a standardized vocabulary of GO terms for describing gene functions and attributes. It consists of three directed acyclic graphs that represent the hierarchical relationships between GO terms. Annotating genes to specific GO terms makes it possible to organize genes by their functional attributes. We propose a distance between genes that is derived from information retrieval and computed from their GO annotations. We applied the proposed methodology to four gene sets with causal associations. The homogeneous subsets discovered in these gene sets are semantically related, in contrast to comparable works. The relevance of the clusters found can be described with the help of ChatGPT by asking for their biological meaning. The R package BIDistances, available on CRAN, lets researchers calculate the distance for any given gene set.
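The abstract does not spell out how the distance is defined, so the following is only a minimal sketch of one common information-retrieval construction, not the authors' method and not the BIDistances API. It assumes each gene is treated as a "document" whose "words" are its annotated GO terms, weights each term by inverse document frequency, and takes the cosine distance between the weighted vectors. The gene names, term identifiers, and function names are hypothetical placeholders, and the sketch ignores the hierarchical structure of the ontology.

```python
import math
from collections import Counter

# Hypothetical toy annotations: gene -> set of term IDs (placeholders, not real GO data).
ANNOTATIONS = {
    "geneA": {"T1", "T2", "T3"},
    "geneB": {"T1", "T2"},
    "geneC": {"T4"},
}


def idf_weights(annotations):
    """Inverse document frequency of each term, treating genes as documents."""
    n_genes = len(annotations)
    doc_freq = Counter(term for terms in annotations.values() for term in terms)
    return {term: math.log(n_genes / count) for term, count in doc_freq.items()}


def cosine_distance(terms_a, terms_b, idf):
    """1 - cosine similarity of the idf-weighted binary term vectors."""
    if terms_a == terms_b:
        return 0.0  # identical annotation sets are at distance zero
    dot = sum(idf[t] ** 2 for t in terms_a & terms_b)
    norm_a = math.sqrt(sum(idf[t] ** 2 for t in terms_a))
    norm_b = math.sqrt(sum(idf[t] ** 2 for t in terms_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # a gene without informative terms is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def distance_matrix(annotations):
    """Return the sorted gene names and the pairwise distance matrix."""
    idf = idf_weights(annotations)
    genes = sorted(annotations)
    matrix = [
        [cosine_distance(annotations[g], annotations[h], idf) for h in genes]
        for g in genes
    ]
    return genes, matrix


if __name__ == "__main__":
    genes, matrix = distance_matrix(ANNOTATIONS)
    for gene, row in zip(genes, matrix):
        print(gene, [round(value, 3) for value in row])
```

In this toy setting, genes that share rarely used terms end up closer than genes that share only widely used ones. A distance matrix of this kind could then be passed to a distance-based clustering method to look for homogeneous subsets.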

Biographies

Stier, Quirin

Q. Stier received his bachelor's degree in mathematics at the University of Erlangen in 2017 and his master's degree in data science at the University of Marburg in 2021. His master's thesis investigated time series forecasting using wavelet analysis and compared it with popular state-of-the-art methods. Currently, he is pursuing a PhD in artificial intelligence at the University of Marburg, focusing on interpretable techniques applicable to human-in-the-loop processes.

Thrun, Michael C.

https://orcid.org/0000-0001-9542-5543

mthrun@informatik.uni-marburg.de

Priv.-Doz. Dr. habil. M.C. Thrun received his diploma in physics (2014) and his doctorate in data science (2017) at the Philipps-University Marburg, at the chair of Databionics under Prof. Dr. habil. Alfred H.G. Ultsch. Afterwards, he worked for almost two years as a Big Data Scientist for an international manufacturer. He is the author of the book “Projection-Based Clustering through Self-Organization and Swarm Intelligence”. His team specializes in explainable artificial intelligence, time series prediction, and knowledge discovery using methods borrowed from nature. Additionally, they are researching the recognition and explanation of diseases. In 2022, he received his habilitation in informatics at the Philipps-University Marburg with a thesis on explainable artificial intelligence and a colloquium on reinforcement learning in practice. Currently, Thrun holds a position lecturing on databionic methods of artificial intelligence, time series analysis, and knowledge discovery in the Data Science program at the Philipps-University Marburg.

Open access article under the CC BY license.

Keywords

gene ontology, gene analysis, cluster analysis, knowledge base, ChatGPT
