Informatica

Automatically Assembling a Custom-Built Training Corpus for Improving the Learning of In-Domain Word/Document Embeddings
Volume 34, Issue 3 (2023), pp. 491–527
Yolanda Blanco-Fernández, Alberto Gil-Solla, José J. Pazos-Arias, Diego Quisi-Peralta

https://doi.org/10.15388/23-INFOR527
Pub. online: 8 September 2023    Type: Research Article    Open Access

Received: 1 March 2023
Accepted: 1 August 2023
Published: 8 September 2023

Abstract

Embedding models map words and documents to real-valued vectors based on co-occurrence statistics gathered from large collections of unrelated texts. Crafting domain-specific embeddings from such general corpora is challenging because they contain little domain vocabulary. Existing solutions retrain general-purpose models on small domain datasets, overlooking the potential of automatically gathering rich in-domain texts. We exploit Named Entity Recognition and Doc2Vec to assemble an in-domain training corpus autonomously. Our experiments compare embedding models learned from general and in-domain corpora, showing that training on the domain-specific corpus attains the best results.
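
To make the approach outlined in the abstract more concrete, the short Python sketch below shows one plausible way to combine the two ingredients it names: an off-the-shelf Named Entity Recognizer (here spaCy) filters candidate web texts into an in-domain corpus, and gensim's Doc2Vec is then trained on the documents that survive the filter. This is a minimal illustration under stated assumptions, not the authors' implementation: the seed entity list, the candidate documents, and the acceptance threshold are hypothetical placeholders, and it assumes gensim >= 4 and spaCy with the en_core_web_sm model installed.

# Minimal sketch (NOT the authors' code): NER-based filtering of candidate texts
# into an in-domain corpus, followed by Doc2Vec training on that corpus.
# Assumptions: gensim >= 4 and spaCy with the "en_core_web_sm" model installed;
# seed entities, candidate documents and the threshold are toy placeholders.
import spacy
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

nlp = spacy.load("en_core_web_sm")

# Hypothetical seed entities that characterise the target domain.
SEED_ENTITIES = {"world war", "treaty of versailles", "league of nations"}

def is_in_domain(text: str, min_hits: int = 1) -> bool:
    """Accept a document if NER recognises at least min_hits seed entities in it."""
    mentions = " ".join(ent.text.lower() for ent in nlp(text).ents)
    return sum(seed in mentions for seed in SEED_ENTITIES) >= min_hits

# Toy stand-ins for texts gathered from the web.
candidate_docs = [
    "The Treaty of Versailles was signed in 1919, after World War I ended.",
    "This page lists the best pizza restaurants in town, ranked by our readers.",
]
corpus = [text for text in candidate_docs if is_in_domain(text)]

# Train document embeddings on the automatically assembled in-domain corpus.
tagged = [TaggedDocument(simple_preprocess(text), [i]) for i, text in enumerate(corpus)]
model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(tagged)
model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

# Infer a vector for a new in-domain snippet and inspect its nearest neighbours.
query = simple_preprocess("The armistice and the later peace treaty ended the war.")
print(model.dv.most_similar([model.infer_vector(query)], topn=3))

In the paper's setting the candidate documents would come from automated web gathering rather than a hard-coded list, and the entity filter and Doc2Vec hyperparameters would be tuned to the target domain; the sketch only illustrates how NER-based filtering and Doc2Vec training fit together.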

Biographies

Blanco-Fernández Yolanda
https://orcid.org/0000-0002-1816-1377
yolanda@det.uvigo.es

Y. Blanco-Fernández obtained her PhD in telecommunications engineering from the University of Vigo in 2007 and currently serves as an associate professor at the same institution. Her research focuses on semantic reasoning in personalization systems, wireless ad hoc networks for mobile devices, machine learning, deep learning models, and natural language processing. She has authored 50+ JCR-indexed journal articles and 13 book chapters, and has presented 93 communications at international conferences. She has also advised 5 doctoral theses, contributed to 30+ competitively funded projects at national and international level (including H2020 and FP7), and engaged in 4 technology transfer contracts. Since 2021, she has held the position of deputy director at the Research Center for Telecommunication Technologies (atlanTTic).

Gil-Solla Alberto
agil@det.uvigo.es

A. Gil-Solla holds a degree in telecommunication engineering (1991) and earned his PhD in telecommunication (2000) from the University of Vigo. Currently, he holds the position of professor at the same institution, where he teaches in the Telecommunication programme. He has advised 5 PhD theses and supervised more than 20 undergraduate theses. He is a member of the Group of Services of the Information Society, which is part of the Department of Telematic Engineering at the University of Vigo. Throughout his career, he has been involved in over 40 national and international research projects, including FP7 and H2020 initiatives, with many of them being carried out in collaboration with industrial partners. His research interests focus on the design and development of intelligent systems for personalization of Internet and mobile applications. This includes automatic content recommendation, particularly utilizing Natural Language Processing techniques and other Machine Learning approaches involving neural networks. He has authored over 50 publications in journals indexed in the JCR, as well as more than 60 presentations at international conferences.

Pazos-Arias José J.
jose@det.uvigo.es

J.J. Pazos-Arias is a telecommunications engineer (1987) and holds a PhD in telecommunications engineering (1995) from the Polytechnic University of Madrid. He joined the University of Vigo in 1988 and has held the position of professor since 2009 in the Department of Telematics Engineering. Since June 2016, he has been a Numerary Academician of the Royal Academy of Sciences of Galicia. He co-authored 75 articles in JCR-indexed journals, contributed to 21 chapters in internationally recognized books, and presented over 150 communications at international congresses. Additionally, he has edited 2 books of research monographs and served as the principal investigator in 4 out of 5 projects of the National R&D Plan in which he participated. He has also advised 9 doctoral theses. He has been involved in numerous projects funded through competitive calls. Over the past decade, he has participated or is currently participating in two projects of the EU H2020 program, one project of the 7th EU Framework Program, one project under the Erasmus+ program, 4 projects funded through competitive national calls in collaboration with European partners, several regional projects, and 13 collaborative projects with companies funded through competitive national calls. He assumed the role of principal investigator in many of these projects. In terms of transferring research results, he has contributed to more than 30 technology transfer contracts and 17 contracts for training courses. Additionally, he has been the person responsible for overseeing many of these activities. In 1995, he founded the Information Society Services Group (GSSI) and continues to serve as its head. This group has attained the classification of a Reference Group within the R&D system of the Galician region, securing significant stable funding not tied to specific projects. He is also a member of the Research Center for Telecommunication Technologies (atlanTTic).

Quisi-Peralta Diego
dquisi@ups.edu.ec

D. Quisi-Peralta received a degree in computer systems engineering from the Universidad Politécnica Salesiana (Ecuador) in 2013. He then obtained a master's degree in advanced computer technologies from the University of Castilla-La Mancha (Spain) in 2015, as well as a master's degree in strategic management of communication technologies from the University of Cuenca (Ecuador) in 2017. Currently, he is a PhD student at the School of Telecommunications Engineering at the University of Vigo (Spain). He works as the director of the ICT department at Livingnet and serves as an external researcher for PUCE (Pontificia Universidad Católica del Ecuador) and UPS (Universidad Politécnica Salesiana). His research interests encompass the application of AI technologies, data mining, ontologies, large language models, computer vision, and application development.



Copyright
© 2023 Vilnius University
Open access article under the CC BY license.

Keywords
embedding models, Named Entity Recognition, Doc2Vec, ad hoc corpus

