1 Introduction
The learning of word embeddings has gained momentum in many Natural Language Processing (NLP) applications, ranging from text document summarisation (Mohd et al., 2020), fake news detection (Faustini and Covões, 2017; Silva et al., 2020), and term similarity measures (Lastra et al., 2019; Gali et al., 2019) to sentiment classification (Rezaeinia et al., 2019; Giatsoglou et al., 2017; Park et al., 2021), edutainment (Blanco et al., 2020), Named Entity Recognition (Turian et al., 2010; Gutiérrez-Batista et al., 2018), classification tasks (Jung et al., 2022) and personalization systems (Valcarce et al., 2019), to name just a few. The most popular methods consider a large corpus of texts and represent each word with a real-valued dense vector that captures its meaning, under the assumption that words sharing common contexts in the input corpus are semantically related to each other (and consequently their respective word vectors are close in the vector space) (Mikolov et al., 2013b). Drawing inspiration from such word representations, document embeddings have emerged in recent years as a natural extension of word embeddings, mapping variable-length documents (sentences, paragraphs or full documents) to vector representations. Their effectiveness has been remarkable in a wide diversity of tasks, such as text classification and sentiment analysis (Fu et al., 2018; Le and Mikolov, 2014; Bansal and Srivastava, 2019), multi-document summarisation (Lamsiyah et al., 2021; Rani and Lobiyal, 2022), forum question duplication (Lau and Baldwin, 2016), document similarity (Dai et al., 2020), sentence pair similarity (Chen et al., 2019), and even semantic relatedness and paraphrase detection (Logeswaran and Lee, 2018).
The most widely adopted approaches to word and document embeddings leverage unsupervised learning methods applied to large collections of unlabelled documents, which serve as training corpora for capturing word-word co-occurrences. In the literature, commonly-used corpora compile a huge number of unrelated texts, such as a full collection of English Wikipedia, the Associated Press English news articles released from 2009 to 2015, or a dataset of high-quality English paragraphs containing over three billion words (Han et al., 2013). Such collections lead to general-domain embedding models that do not perform well in a very specific domain (for example, a particular historical event or a concrete medical discipline), whose common vocabulary is unlikely to be included in a generic corpus. As stated in Nooralahzadeh et al. (2018), “domain-specific terms are challenging for general domain embeddings since there are few statistical clues in the underlying corpora for these items”. This idea also stems from the results attained in Bollegala et al. (2015) and Pilehvar and Collier (2016).
Bearing this limitation in mind, many researchers leveraged findings from fields such as multi-task learning and transfer learning (Axelrod et al., 2011) and adopted a mixed-domain training in two phases: first, a general-domain corpus is used to train an embedding model, which is then trained incrementally with specialised documents related to the particular domain. Thanks to this continual training, the general knowledge acquired in the first phase can be transferred to the second one and exploited along with the lexical and semantic specificities of that domain (Liu et al., 2015; Xu et al., 2019). However, other works concluded that, in domains with abundant unlabelled texts, domain-specific training does not benefit from the transfer from general domains. As a running example in biomedicine, the authors of Gu et al. (2021) showed that “domain-specific training from scratch substantially outperforms continual pretraining of generic language models, thus demonstrating that the prevailing assumption in support of mixed-domain pretraining is not always applicable”.
The benefits of resorting only to domain-specific training from scratch have also been confirmed in other works (Nooralahzadeh et al., 2018; Kim et al., 2018; Lau and Baldwin, 2016). In particular, the authors of Nooralahzadeh et al. (2018) concluded that models learned from ad hoc corpora provide “better results than general domain models for a domain-specific benchmark”, demonstrating besides that “constructing domain-specific word embeddings is beneficial even with limited input data”. Actually, these approaches rely on small-sized ad hoc corpora whose documents are gathered by hand, and indiscriminately, from publicly available sources on the Internet (Chiu et al., 2016; Nooralahzadeh et al., 2018; Gu et al., 2021; Cano and Morisio, 2017). Specifically, to the best of the authors’ knowledge, existing approaches do not consider the relevance of a document (in the particular domain) when deciding whether or not that text should be included in the ad hoc corpus. This process is obviously costly and clearly unfeasible without automatic assistance, in view of the myriad of possible domains/topics (and the huge amount of available documents on each of them). However, according to the results achieved in Chiu et al. (2016), the relevance of the specialised texts chosen as training documents is a critical parameter when learning embedding models. Specifically, these researchers handled two ad hoc corpora, each including a different number of in-domain documents about biomedicine. Their results confirmed that the highest-quality embedding models were learned from the smallest ad hoc corpus, proving that “bigger corpora do not necessarily produce better biomedical domain word embeddings”. In other words, disregarding the suitability of the considered documents in the particular domain may distort the training.
Taking into account the above conditions, the interest and main contributions of the proposed approach can be summarised as follows:
• Ad hoc corpora enable learning successful embedding models in very specific domains (e.g. medicine, history or chemical engineering, to name a few), where the huge public generic datasets that are usually adopted fail to accurately model the peculiarities (and the particular vocabulary) of these specialised domains.
• The approach described in the paper automatically builds such corpora by retrieving large amounts of candidate training texts from Internet sources and incorporating only those that are relevant in the context under consideration. For that purpose, NER facilities (to identify named entities in the input text) and Doc2Vec models (to assess the relationship between the input text and each of the retrieved candidate documents) are exploited.
• The only input required by the approach is a fragment of text representative of the context, making human assistance during the creation of the ad hoc corpus unnecessary and allowing any topic or domain to be dealt with, however specific it may be.
• The automatic procedure for building the resulting corpus greatly simplifies the hard work associated with traditional manual collection procedures, while providing more and better domain-specific documents.
This paper is organized as follows: Section 2 explores relevant works within the context of our approach to automatic custom training corpus creation. Section 3 focuses on the procedure devised to gather a set of in-domain candidate texts from the Internet, using a particular historical event (the Battle of Thermopylae between Greeks and Persians in 480 BC) to illustrate the approach. The mechanism to validate the selection of relevant candidates to be incorporated into the automatically-built tailor-made corpus is detailed in Section 4. In this section, we also describe the tests conducted to evaluate the consistency of several Doc2Vec models, including a model trained on a general-domain corpus sourced from news articles of the Associated Press (AP), as well as in-domain models learned from ad hoc corpora. Finally, Section 5 concludes the paper and highlights further research directions.
2 Related Work
Our literature review is organized into three main sections. In Section 2.1, the focus is on prominent embedding models that lay the foundation for our research, whose learning in specialised scopes requires domain-specific training documents. Since our research contributes to the automatic generation of such kind of ad hoc corpora, Section 2.2 describes relevant related approaches for constructing custom datasets, emphasizing the key distinctions from our procedure. Our algorithm for selecting in-domain training documents starts by identifying named entities in the input text that contextualizes the specific theme for building the ad hoc dataset. Since named entities can have multiple possible interpretations, accurately distinguishing the associated meaning for each entity is crucial in this process. To achieve this, the commonly-adopted approaches to named entity disambiguation will be thoroughly reviewed in Section 2.3.
2.1 Embedding Models
The germ of learning language representations using models pre-trained on large collections of unlabelled texts springs from word embeddings such as Word2Vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014). Word2Vec trains a model on the context of each word such that similar words have similar vector representations. Considering word-word co-occurrences, these embeddings capture semantics and meaning-related relationships (enabling, for instance, detecting that two words are similar or opposite, or that the pair of words Spain and Madrid holds an analogous relation to Canada and Ottawa), along with syntax and grammar-related relationships (have and had are at the same level as are and were). Word2Vec is a feed-forward neural network that learns vectors to improve its predictive ability, offering two different models: CBOW (where the goal is to predict a word based on the words in its context) and Skip-Gram (where the aim is to predict surrounding words given an input word) (Khatua et al., 2019). This model suffers from two main weaknesses, briefly related to the impossibility of (i) dealing with words that do not appear in the training corpus, and (ii) considering different meanings for the same word.
• Regarding the first limitation, this kind of approach is not able to embed Out-Of-Vocabulary (OOV) words unseen in the training corpus, which makes it impossible to deal with rare/unusual terms and misspellings. The embedding model FastText circumvents the OOV problem by working at the character-n-gram level (Armand et al., 2017).
• On the other hand, more sophisticated models have emerged in the literature that capture the contextualized meaning of words. In particular, models like ELMo (Peters et al., 2018), GPT (Radford et al., 2019) and BERT (Devlin et al., 2019) make it possible to learn contextual relationships among words. For instance, in the sentences “I have hit my head” and “The Head of school sent a message to the students”, the meaning of the word head depends on its left context in the first sentence (I have hit my) and on the right context in the second one (of school sent a message to the students). Bearing this motivating example in mind, some approaches moved from fixed word embeddings to contextualized models that consider the sense of both the previous and the following words.
The main differences between the most popular contextualized embeddings (BERT, ELMo and GPT) are related to architectural internals: while ELMo relies on a Long Short-Term Memory (LSTM) model, GPT, BERT and its variants resort to a Transformer-based architecture (Reimers and Gurevych, 2019; Wang and Jay-Kuo, 2020). Details of both architectures, out of the scope of this paper, can be found in Ethayarajh (2019). Before the emergence of the latest BERT-based document embedding models, the commonly adopted approach was Paragraph Vector (also called Doc2Vec), the natural extension of Word2Vec for learning vector representations of variable-length pieces of text (sentences, paragraphs and documents) (Le and Mikolov, 2014; Lau and Baldwin, 2016; Kim et al., 2018; Bhattacharya et al., 2022). Similar to Word2Vec, Doc2Vec works with two different models: the Distributed Memory version of Paragraph Vector (PV-DM) and the Distributed Bag Of Words version of Paragraph Vector (PV-DBOW). In both models, training learns a vector for the initial text (called the paragraph vector in Le and Mikolov, 2014), considering or not the Word2Vec embeddings of the individual words depending on the variant; the same procedure is replicated in the prediction phase to provide a vector representing the paragraph/document.
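As a simple illustration of the two variants, the following sketch trains both with the Gensim library; the toy corpus and the hyper-parameters are illustrative assumptions, not values taken from the paper.

```python
# Sketch: training the PV-DM and PV-DBOW variants of Doc2Vec with Gensim.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

raw_docs = ["Leonidas led the Greek forces at Thermopylae.",
            "Xerxes I ruled the Achaemenid Empire."]
corpus = [TaggedDocument(simple_preprocess(d), [i]) for i, d in enumerate(raw_docs)]

pv_dm = Doc2Vec(corpus, dm=1, vector_size=100, epochs=40, min_count=1)    # PV-DM
pv_dbow = Doc2Vec(corpus, dm=0, vector_size=100, epochs=40, min_count=1)  # PV-DBOW

# In the prediction phase, a vector is inferred for an unseen paragraph/document.
vector = pv_dm.infer_vector(simple_preprocess("A battle between Greeks and Persians"))
```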
The results described in Kim et al. (2018) highlighted that Doc2Vec outperformed previous sentence embedding methods, ranging from simple approaches that use a weighted average of all the words in the document (Grefenstette et al., 2013; Mikolov et al., 2013a) to more sophisticated models like Skip-Thought (Kiros et al., 2015) and Quick-Thoughts (Kiros et al., 2018) that have attained good performance on diverse NLP tasks, such as semantic relatedness, paraphrase detection, image-sentence ranking and question-type classification (Kiros et al., 2018). The main features of the above models are summarised in Table 1.
Table 1
Embedding models defined in the literature and their main features.

Embedding model | Word-level model | Sentence/document level model | Contextualized embeddings | Architecture
Word2Vec        | ✓ | × | × | Feed-forward neural network
GloVe           | ✓ | × | × | Count-based model
FastText        | ✓ | × | × | Feed-forward neural network
ELMo            | ✓ | × | ✓ | LSTM
GPT             | ✓ | × | ✓ | Transformer
BERT            | ✓ | × | ✓ | Transformer
Doc2Vec         | ✓ | ✓ | ✓ | Feed-forward neural network
Skip-Thought    | ✓ | ✓ | ✓ | Encoder-decoder
Quick-Thoughts  | ✓ | ✓ | ✓ | Encoder-decoder
As noted in Section 1, many existing embedding models allow the training process to be fine-tuned on a new in-domain dataset for a concrete NLP task. However, as far as the authors of this paper know, there are no relevant approaches in the literature that automatically assemble such a dataset as a custom-built training corpus containing a large amount of relevant documents to capture meaningful domain-specific information. Such a corpus would lead to more accurate embedding representations, which could improve the performance of existing (word-level and document-level) models. In this regard, it should be noted that this paper is not about training existing models with an ad hoc corpus in order to compare their respective performances and assess their strengths. Instead, the goal of this research is to devise a semantics-driven mechanism to automatically collect an in-domain dataset and verify that it can lead to models with better performance than the ones trained with a generic corpus. To do so, the presented research considers a particular model (in this case, Doc2Vec) and a very specific application domain (a historical event, the Battle of Thermopylae, as will be described in the validation scenario presented in Section 4).
Having tested this hypothesis with a Doc2Vec model, the viability and utility of the proposed semantics-driven mechanism are confirmed, and the door is open for its application in different scenarios involving other more sophisticated Transformer-based embeddings defined in the literature. This further goal is beyond the scope of the experimental validation presented in this paper, where the focus is on assessing the quality of a tailor-made training corpus rather than on quantifying the performance of the multiple models that could be learned from it.
2.2 Automatic Creation of Corpora
The web has long been considered a mega corpus with the potential to uncover new information across diverse fields (Crystal, 2011; Gatto, 2014). Consequently, numerous works in the literature focus on processing web-based corpora to find, extract, or transform information. This trend has intensified with the explosive growth of NLP research over the past decade, leading to extensive efforts in gathering both labelled and raw text corpora for discovering linguistic regularities, generating embeddings, and performing downstream tasks like classification, sentiment analysis, summarisation, or Q&A. Despite this importance, relevant results regarding automating the generation of such corpora remain scarce; there is hardly any research work related to automatic corpus creation.
The corpus construction approaches from the web that can be found in the literature primarily aim to support research and professional training in linguistic and translation fields. Typically, their objective is to create corpora for gaining an overview of a given language. Thus, these approaches mainly involve exploring the web to study pages based on their language rather than their content, resulting in general corpora rather than specialised ones. The most prevalent approach in these initiatives, the BootCat tool (Baroni and Bernardini, 2004), followed by successive refinements known as the WebBootCat web application (Baroni et al., 2006) and later marketed as SketchEngine (Kilgarriff et al., 2014), relies on issuing a general set of search-engine queries involving domain-specific keywords to obtain focused collections of documents. This approach requires users to provide initial seeds (keywords) to begin the search. Subsequently, a large number of queries combining these seeds are issued against search engines like Google, Yahoo, or Bing, and the most relevant results are recovered to form the corpus.
However, these approaches primarily rely on web crawling followed by some linguistic processing to filter inadequate results. They work with literal keywords, querying for documents (web pages) containing those keywords, and repeat the process with the links contained in each page. They do not explore categories to classify the pages nor similarities among documents to establish their relevance for inclusion in the corpus or to trigger new searching processes or URL selection beyond those explicitly contained in the recovered documents. The text gathering procedure is somewhat coarse, and manual refinement is often necessary to guide these tools in the search for appropriate texts for the corpus to achieve a quality output. As a result, these projects lead to relatively small corpora, suitable for studying a language in a teaching environment but insufficient for training neural networks.
Similar approaches are used in projects creating corpora for linguistic and translation purposes, such as specialised health corpora (Symseridou, 2018) for instructing translation professionals in required abilities like locating terms, studying collocations, grammar, and syntax. The same approach is followed in Castagnoli (2015) to build corpora for health, law, and cell phone scenarios, and in Lynn et al. (2015), aimed at constructing linguistic corpora for less commonly used languages (e.g. Irish). In all these cases, the core algorithm is based on web crawling, with language being the main driving constraint for selecting pages to add or continue searching.
The evolution of these approaches has been significantly influenced by the explosive growth of the web and subsequent restrictions imposed by search engines regarding massive querying of the web (Barbaresi, 2013a). This led to the exploration of alternative ways to discover documents for the corpus, such as exploring social networks and blogging platforms (Barbaresi, 2013b), the Open Directory Project and Wikipedia (Barbaresi, 2014), or the Common Crawl platform, a free Internet crawling initiative (Smith et al., 2013).
There are also automatic corpus-construction experiences centred not on the whole web but on closed repositories, aimed at filtering relevant documents for specific queries. These projects typically involve structured and well-known repositories, where the goal is to create information corpora restricted to characteristics specified by users, resembling a database query more than a web exploration. Their objectives focus on information analysis to guarantee compliance with restrictions rather than on finding texts similar to the current one. An example is Primpeli et al. (2019), which focuses on recovering text resources about e-commerce items from the WDC Product Data Corpus, extracted from the Common Crawl data repository. The compiled data, originally attached to product pages by e-commerce companies, forms an automatically created corpus used to group similar products in clusters. The quality of this corpus is confirmed by Peeters et al. (2020) and Peeters and Bizer (2021), where embedding models were trained using such corpora. Zhang and Song (2022) utilize information extracted from the same sources to feed various processes in the field of NLP, such as embedding generation or model training. Both scenarios share some similarities with ours, as a corpus is created for training Machine Learning models. However, the documents collected in their research are located in a specific source, originating from an already available corpus containing semantic annotations; the relevance of such documents to a given subject is not measured, and no new texts are discovered from the analysed ones.
Numerous other research works claim automatic corpus creation in Machine Learning settings. For example, Zhang et al. (2021) create a new corpus from the Amazon product dataset to train a new BERT model; Abacha and Dina (2016) automatically create a corpus of equivalent pairs of questions (textual entailments) from the National Library of Medicine's database of questions (USA); Zanzotto and Pennacchiotti (2010) extract pairs of entailments from Wikipedia by studying the successive historic revisions of some articles; and Zhou et al. (2022) build a new corpus for paraphrase detection by refining existing corpora like the Stanford Natural Language Inference corpus (SNLI) and the Multi-Genre Natural Language Inference corpus (MNLI). However, to our knowledge, all these approaches focus on processing closed repositories to obtain a new corpus, with no exploration of the web (or large repositories like Wikipedia) to search for new unknown documents. Moreover, no examples exist of studying the relevance of documents for a given subject openly specified by the user; the working theme is already fixed at the creation of the project. In the first group, new documents are discovered simply by crawling (with the language constraint), while no new previously unknown documents are discovered in the second group.
In summary, the literature includes several works on the automatic creation of corpora. On the one hand, there is an older research line centred on linguistic and translation fields, where approaches are relatively simple, exploring the web from user-provided seeds and generally involving simple web crawlers primarily driven by the language of the pages. On the other hand, more elaborate approaches are found in specific fields like health or e-commerce, as well as a significant number of cases also centred on Machine Learning model training; to our knowledge, these latter approaches are focused on processing closed repositories to create a new corpus, with no exploration of the web (or large repositories like Wikipedia) to search for new unknown documents. In neither research line are there examples of studying the relevance of documents for a given subject openly specified by the user, and no new documents are discovered beyond the initial corpus.
2.3 Named Entity Disambiguation
In the literature, named entity disambiguation (NED) is commonly defined as the process of determining the precise meaning or sense of a named entity within a given context. These named entities can be identified by well-known named entity recognition tools like DBpedia Spotlight (Mendes et al., 2011). More specifically, the goal of NED is to resolve ambiguity by associating the named entity with a specific concept within a semantic knowledge base. Previous studies have addressed the challenge of entity disambiguation through the utilization of statistical methods and rule-based approaches. These works take into account the contextual words surrounding the target named entity during the disambiguation process. However, they often neglect the semantic nuances of words and lack generalizability, since the rules are typically specific to certain domains (An et al., 2020; Songa et al., 2019).
Subsequently, more advanced mechanisms emerged, such as methods based on entity features. In these approaches, when an entity possesses multiple interpretations, inconsistent entities are filtered out by assessing their semantic similarity. The disambiguation process considers the semantic attributes of the entity, the contextual information surrounding the entity, and even its frequency of occurrence in the processed text. Notably, these methods leverage contextual embedding models, which assign different vector representations to entities with the same spelling based on their specific meanings within each context. To achieve this, entity features-based disambiguation methods typically obtain the contextual embedding vector of the target entity. They then calculate the semantic distance between this vector and the embedding vectors of each candidate entity to effectively disambiguate and remove any ambiguous entities (Barrena et al., 2015; Zwicklbauer et al., 2016). However, despite the efficacy of this approach, it disregards the structural characteristics of the knowledge base in which the target entity is situated, such as the interconnections between entities. Consequently, it fails to capture the global semantic features of each entity (Adjali et al., 2020). Additionally, this disambiguation method requires a large training corpus to learn an embedding model.
To address this challenge, recent studies have turned to deep neural networks for entity disambiguation. Specifically, neural network-based approaches have gained popularity by incorporating the subgraph structure features of knowledge bases. These features are utilized as inputs to graph neural networks, enabling the disambiguation of entities within the knowledge base (Ma et al., 2021). Various methods have been explored, including convolutional and recurrent neural networks, as well as LSTM networks, to disambiguate entities based on extracted associations among them (Geng et al., 2021; Phan et al., 2017).
Transformer-based language models have also demonstrated significant promise in capturing complex linguistic knowledge, leading researchers to employ attention mechanisms to obtain contextual embedding vectors for each entity and to consider coherence between entities for joint disambiguation (Ganea and Hofmann, 2017). Furthermore, graph neural networks have been trained to acquire entity graph embeddings that encode global semantic features, which are subsequently transferred to statistical models to address entity ambiguity. While these approaches demonstrate potential in achieving human-level performance in entity disambiguation, they often require substantial amounts of training data and computational resources (Hu et al., 2020). Therefore, challenges persist in optimizing these models and reducing their reliance on in-domain training datasets, which may not always be readily available, especially in highly specific or specialised domains like those handled in our ad hoc corpus generation approach. In simple terms, these models are not suitable for our purposes because they require training with ad hoc corpora, which is precisely what our work seeks to achieve, that is, automatically gathering collections of in-domain texts that were previously absent in the literature.
While we acknowledge the positive outcomes achieved by existing approaches to entity disambiguation in recent literature, their complexity and requirements, such as domain-specific training datasets and high computational demands, surpass the needs of our ad hoc corpus generation algorithm. In contrast, we employ a simpler yet effective mechanism, as evidenced by the obtained results, to identify the right entities in the given initial text. Details will be given in Section 3.2.
3 How to Build a Domain-Specific Training Corpus
The candidate documents to be incorporated into the automatically-built ad hoc corpus (for training the embedding models) are gathered from the Internet by a procedure that has been implemented in Python and made freely available in a GitHub repository (https://github.com/gssi-uvigo/Plethora). Specifically, the approach takes as input an initial text and retrieves Wikipedia articles that have a meaningful relationship with it, through an algorithm that can be outlined as follows (a high-level sketch of the pipeline is given after this list):
• First, the NER facilities provided by the DBpedia Spotlight tool are exploited to identify DBpedia named entities present in the input text (denoted as DB-SL entities). In addition, the approach searches for other semantically related entities that share some common features with these DB-SL entities (e.g. semantic topics and categories or wikicats). This step is addressed in Sections 3.1, 3.2 and 3.3.
• Next, the goal is to retrieve (and preprocess to remove irrelevant information) Wikipedia articles in which the previously identified entities are mentioned, as described in Sections 3.4 and 3.5.
• Finally, the retrieved texts that are actually relevant (according to the relatedness measured between each of them and the input text) are incorporated into the ad hoc corpus. As detailed in Section 3.6, this stage of the algorithm is driven by a semantic similarity metric based on a Doc2Vec embedding model.
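The sketch below gives a high-level view of this pipeline; every step is passed in as a placeholder callable, since the actual function names in the Plethora repository are not reproduced here.

```python
# Placeholder skeleton of the three-stage pipeline; each step is injected as a callable.
def build_ad_hoc_corpus(t0, identify_entities, get_wikicats, discover_urls,
                        fetch_and_clean, similarity, threshold):
    """Mirrors Sections 3.2-3.6: entities -> wikicats -> URLs -> cleaned texts -> relevance filter."""
    entities = identify_entities(t0)                                   # Section 3.2
    wikicats = get_wikicats(entities)                                  # Section 3.3
    urls = discover_urls(wikicats)                                     # Section 3.4
    candidates = filter(None, (fetch_and_clean(u) for u in urls))      # Section 3.5
    return [c for c in candidates if similarity(t0, c) >= threshold]   # Section 3.6
```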
Before delving into each phase of our algorithm, it is essential to justify the usage of Wikipedia in our research. Specifically, we prioritize retrieving texts from this source due to several compelling advantages: (i) Wikipedia serves as an extensive repository encompassing information about any subject, and it includes entries for relevant individuals, places, or events; (ii) DBpedia provides a wealth of semantic information about these entries, enhancing the depth and context of our analysis; and (iii) there is a well-established and reliable mechanism to follow links between these repositories, and even connect them to others, which facilitates the discovery and retrieval of new documents.
While our approach to constructing ad hoc corpora is equally effective for texts from Wikipedia or any other source, there are additional remarkable features of this information repository that make it particularly suitable: documents cover a wide range of topics and disciplines; these articles are written and reviewed by a committed community of volunteer contributors; and it is constantly updated, reflecting recent advances in different fields of knowledge. Further evidence supporting the quality, representativeness, and significance of the texts within this online encyclopedia is demonstrated by the use of large corpora of articles extracted from Wikipedia in Transformer architectures. These architectures have garnered remarkable achievements in the field of NLP by employing such corpora for pre-training their base models, enabling them to acquire extensive language knowledge and a broad contextual understanding. This initial pre-training phase primes the models before they are fine-tuned for specific NLP tasks, underscoring the significance and value of the texts extracted from Wikipedia in fostering the advancement of sophisticated language models.
3.1 Strategy and Sources of Information
As the aim is to build an ad hoc corpus, it is necessary to define some way of characterising the topic on which this tailor-made dataset should be based (i.e. a seed describing the context of interest). For that purpose, a short initial text is used, representative of the theme to which all the documents in the corpus should be more or less related. In other words, the goal is to search the Internet for documents with some kind of relationship to this initial text.
Throughout this document, the following initial text will be used.
This 1926-character-long text (hereafter denoted as ${T_{0}}$) is related to the Battle of Thermopylae between Greeks and Persians in 480 BC (Blanco et al., 2020). That is, the aim is to compose a corpus related to the Greco-Persian wars, starting from a brief text related to the second Persian invasion of Greece and, specifically, to the famous and inspiring Battle of Thermopylae.
The approach conducted in this paper to discover documents leverages the Semantic Web and Linked Open Data (LOD) (Oliveira et al., 2017) initiatives, which form a global repository of interrelated knowledge with a multitude of structured data, making it possible to study their relationships and obtain new information from the available data. Thus, the core of the procedure is based on identifying relevant entities in the initial text (e.g. people, locations, events…), discovering categories in which these entities are classified and, finally, gathering other entities also classified in those categories. For each entity discovered, its description can be retrieved from the LOD repositories, becoming a new candidate text to be included in the domain-specific corpus.
To delimit this work, the initial source of data considered is Wikipedia and its structured counterpart, DBpedia (Lehmann et al., 2012). Given the vast amount of information available in these repositories, we leverage the existing capability to freely query them through well-known endpoints using SPARQL (SPARQL Protocol And RDF Query Language) queries. In particular, SPARQL is a language explicitly designed for retrieving data stored in RDF format through queries to repositories like DBpedia. DBpedia is not an isolated information repository but allows establishing links to other well-known datasets to enhance query results, such as YAGO (Pellissier et al., 2020) and WikiData (Ismayilov et al., 2015), both extensively used in this work. In sum, SPARQL plays a crucial role in the Semantic Web and LOD initiatives due to its remarkable capabilities in pattern searching and result filtering based on specified conditions, enabling efficient access to information within semantic repositories.
Regarding the categories in which to classify the DB-SL entities identified in ${T_{0}}$, it is interesting to highlight two properties that are frequently used in metadata descriptions to link Wikipedia pages to categories (and their members).
• On the one hand, the dct:subject property links pages to categories in the Wikipedia categorization system. Each category is usually composed of several words joined by an underscore character (Traitors_in_history, People_of_the_Greco-Persian_Wars) and may represent classifications by page contents or by administrative goals (Wikipedia_administration_templates, Articles_with_broken_or_outdated_citations…).
• On the other hand, through the rdf:type property, pages (and the entities they represent) are associated with YAGO wikicats, classes of the YAGO ontology reflecting the Wikipedia categorization system. The format of such wikicats is WikicatW1W2$\dots$Wn, that is, the string Wikicat followed by a set of concatenated words, each starting with an uppercase letter and continuing with lowercase ones (e.g. BattlesInvolvingAthens, LocationsInGreekMythology).
As will be described in the next sections, both properties are exploited in the paper as they are significant sources of information on the subject of a document, thus helping to discover new data and to assess its relevance.
3.2 Identifying DBpedia Entities in the Initial Text
The identification of the relevant entities present in the initial text ${T_{0}}$ is based on the Named Entity Recognition capabilities provided by DBpedia Spotlight (DB-SL), which make it possible to move from raw text (strings referred to as surface forms) to structured data (the URLs of the corresponding DBpedia entities). Through several sophisticated procedures (such as entity detection, name resolution, candidate disambiguation…), DB-SL establishes the association between surface forms and DBpedia entities depending on the context: the same text can lead to different entities, and different surface forms can lead to the same entity.
To this aim, DB-SL provides both a web application interface available online and a well-known endpoint running an API to programmatically access the service remotely. This last option is the most interesting one, since the aim of this work is to develop an automatic service that should be as autonomous as possible. However, this official service rejects bulk queries, as it is only provided for testing purposes, so to speed up the execution it is advisable to install a local copy of DBpedia Spotlight using, for instance, a Docker image provided by its creators.
So, the text ${T_{0}}$ is sent to the local DB-SL deployment to identify the relevant DBpedia entities contained in it. As a result, a set of DBpedia entities is retrieved, including the surface forms, links to the DBpedia pages associated with them, and some additional information (e.g. the candidates for disambiguation and their rankings). The set of DBpedia entities obtained is denoted as $\textit{DE}({T_{0}})$ in Eq. (1) (the mathematical notation adopted throughout the description of the approach can be found in Appendix A):
In this example, 18 entities were detected in the text ${T_{0}}$.
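A minimal sketch of such a call to a local Spotlight deployment is shown below; the /rest/annotate endpoint is the standard Spotlight API, while the port and the confidence threshold are assumptions.

```python
# Sketch of annotating T0 with a local DBpedia Spotlight instance via its REST API.
import requests

def annotate(text, endpoint="http://localhost:2222/rest/annotate", confidence=0.5):
    """Returns (DBpedia URI, surface form) pairs for the entities spotted in `text`."""
    response = requests.get(endpoint,
                            params={"text": text, "confidence": confidence},
                            headers={"Accept": "application/json"})
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    return [(r["@URI"], r["@surfaceForm"]) for r in resources]
```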
Sometimes, DB-SL is not able to properly disambiguate candidates and provides some wrong entity associations in the results. In the example, for instance, the First_French_Empire entity is incorrectly identified by DB-SL from the word empire in the initial text. In other cases, entities denoting not real persons, locations or events but mere concepts are identified (e.g. Battle). These cases can lead to a huge amount of useless data that will be discarded later, but this usually introduces a heavy computational load and disk requirements that it would be convenient to avoid. In these circumstances, a named entity disambiguation method becomes necessary.
In spite of the notable performance achieved by the existing named entity disambiguation approaches described in Section 2.3, their complexity and requirements, such as the necessity of domain-specific training datasets that are hard to find and their high computational demands, go beyond the needs of our ad hoc corpus generation algorithm. Depending on such custom in-domain datasets for disambiguating named entities would be impractical in a work like ours, which specifically aims to construct tailor-made ad hoc corpora that are missing in the literature. In these circumstances, we have opted to employ a simpler yet effective mechanism to identify the correct entities in the initial text ${T_{0}}$.
In particular, our mechanism leverages the semantic attributes, such as wikicats and subjects, associated with each entity in DBpedia and other linked repositories. Indeed, our approach specifically targets the identification of overlaps between the attributes of the target entity and those of other entities within ${T_{0}}$. The underlying assumption is that the correct interpretation of a target named entity will be categorized under the same wikicats and subjects as the other named entities identified in ${T_{0}}$. By considering these shared attributes, the mechanism increases the likelihood of accurately disambiguating the target entity within its given context.
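A rough sketch of this filtering heuristic is given below; the data layout and the minimum-overlap threshold are illustrative assumptions, not details taken from the paper.

```python
# Sketch: keep an entity only if its wikicats/subjects overlap with those of the
# other entities detected in T0; otherwise discard it as a likely wrong association.
def filter_entities(attributes_by_entity, min_shared=1):
    """attributes_by_entity: maps each DB-SL entity to the set of its wikicats and subjects."""
    kept = set()
    for entity, attrs in attributes_by_entity.items():
        other_attrs = set().union(*(a for e, a in attributes_by_entity.items() if e != entity))
        if len(attrs & other_attrs) >= min_shared:
            kept.add(entity)      # e.g. Leonidas_I shares wikicats with Sparta, Xerxes_I, ...
        # e.g. First_French_Empire shares nothing with the rest and is discarded
    return kept
```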
Applying this procedure, some entities are discarded, such as First_French_Empire or Battle. Finally, after this filtering, only 10 entities remained and were used in the following phases (Themistocles, Ephialtes_of_Trachis, Thespiae, Battle_of_Artemisium, Battle_of_Marathon, Leonidas_I, Darius_I, Sparta, Xerxes_I, Battle_of_Thermopylae). As can be seen, all of them have a strong relationship with the initial text. So, Eq. (1) is refined, resulting in Eq. (2):
This approach has proven to be sufficient, as any errors in the disambiguation process only have a limited impact on our algorithm. As explained throughout the paper, if an entity is misidentified, documents only tangentially related to ${T_{0}}$ may be considered as candidates, but they are subsequently discarded in the following steps of the algorithm. Essentially, misidentifying an entity may delay the construction of the ad hoc corpus (which is not critical, since real-time requirements are absent), but it will not result in irrelevant documents (according to the specific theme of ${T_{0}}$) being included in this custom dataset.
3.3 Selecting All Relevant Wikicats that Characterise the Initial Text
With the goal of finding new documents that are significantly related to the initial text, ${T_{0}}$ is characterised with the set of wikicats linked to all the entities discovered in the previous step. As shown in Eq. (3), the set of wikicats that characterise a given entity e is denoted as $\textit{WK}(e)$.
To carry out this characterisation process, it is necessary to analyse the rdf:type property of each entity e, as wikicats are the values of this property that belong to the yago namespace and start with the string “Wikicat”, as depicted in the next example:
This rdf:type property is returned by DB-SL, but most of the time incompletely, so it is necessary to resort to the original information repository to collect extended structural descriptions of the identified entities, including the categories to which they belong. In particular, the well-known properties rdf:type and dct:subject are used to discover the categories with which a given entity is associated. As an example, note the SPARQL query launched against the well-known DBpedia endpoint to complete the information related to the entity Leonidas_I:
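The exact query used by the authors is not reproduced here; a sketch of a query of roughly this form, issued with the SPARQLWrapper Python package, could be:

```python
# Sketch (not necessarily the authors' exact query) of retrieving the rdf:type and
# dct:subject values of Leonidas_I from the public DBpedia endpoint.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?prop ?value WHERE {
  VALUES ?prop { rdf:type dct:subject }
  dbr:Leonidas_I ?prop ?value .
}
"""

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery(QUERY)
endpoint.setReturnFormat(JSON)
bindings = endpoint.query().convert()["results"]["bindings"]

# Wikicats are the yago types whose local name starts with "Wikicat".
wikicats = {b["value"]["value"].rsplit("/", 1)[-1] for b in bindings
            if b["value"]["value"].startswith("http://dbpedia.org/class/yago/Wikicat")}
subjects = {b["value"]["value"] for b in bindings
            if b["prop"]["value"].endswith("subject")}
```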
Simple wikicats consisting of a single word are eliminated – e.g. WikicatKings or WikicatBattle – as they mostly lead to very general concepts, not sufficiently related to the initial text ${T_{0}}$. As mentioned before, these cases are likely to lead to a large number of URLs that would be ranked lower later and discarded, introducing unnecessarily huge computation requirements. Any relevant URLs reached from such wikicats are likely to be obtained from other, more significant wikicats as well. So, Eq. (3) is refined, resulting in Eq. (4):
Next, the set $\textit{WK}({T_{0}})$ is created as the aggregation of the wikicats (removing duplicates) coming from the different entities contained in $\textit{DE}({T_{0}})$. This is the set of relevant wikicats that characterise ${T_{0}}$, as shown in Eq. (5):
In the example, 42 different wikicats were detected, associated with the 10 entities identified in the input text.
Sometimes, depending on the entities found by DB-SL, a large number of wikicats are collected in this phase. In order not to disperse the search, at this point the user can discard some of those wikicats (see Fig. 1) if they are not meaningful in the target context, keeping only the selected set of wikicats for the following steps.
Fig. 1
Snapshot of the corpus builder tool developed to explore and identify wikicats that are relevant in the context of the initial text ${T_{0}}$ (available at https://github.com/gssi-uvigo/Plethora).
3.4 Discovering New URLs Associated to the Relevant Wikicats
So far, a number of entities have been identified in ${T_{0}}$ and some of their properties (wikicats and subjects) have been obtained. Now, it is time to exploit the rich capabilities of the LOD infrastructure to perform the reverse operation, that is, to collect objects (identified by URLs) that meet some requirements. For this purpose, well-known repositories will be explored to discover web pages characterised with the same tags that describe the initial text ${T_{0}}$.
First, for each wikicat in $\textit{WK}({T_{0}})$, the DBpedia repository is queried to gather all the known DBpedia entities that are associated with it (denoted as ${U_{\textit{DB}}}({w_{k}})$ in Eq. (6)). As each DBpedia entity is linked to a Wikipedia page, the set of Wikipedia URLs about entities tagged with that wikicat is also retrieved in this process. That is:
To fetch this set of URLs, the following SPARQL query is sent to the well-known DBpedia endpoint (Wikicat5th-centuryBCRulers being the wikicat searched in this example):
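An assumed form of such a query (not necessarily the authors' exact one) is sketched below, gathering the Wikipedia pages of all DBpedia entities typed with the given wikicat through the primaryTopic link discussed next.

```python
# Sketch: Wikipedia pages of the DBpedia entities typed with a given wikicat.
from SPARQLWrapper import SPARQLWrapper, JSON

WIKICAT_QUERY = """
PREFIX yago: <http://dbpedia.org/class/yago/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT DISTINCT ?page WHERE {
  ?entity rdf:type yago:Wikicat5th-centuryBCRulers .
  ?page foaf:primaryTopic ?entity .
}
"""

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery(WIKICAT_QUERY)
endpoint.setReturnFormat(JSON)
pages = {b["page"]["value"]                       # Wikipedia article URLs
         for b in endpoint.query().convert()["results"]["bindings"]}
```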
The approach considers the primaryTopic property, as this relationship is the one that leads directly to the Wikipedia page corresponding to the DBpedia entity (if any). If this property is not defined, the URL is discarded, as it does not correspond to a Wikipedia page (since the interest does not lie in the URL of the DBpedia entity but in the text of its corresponding Wikipedia page). This text is the training document that will be added to the ad hoc corpus if it is related enough to the initial text.
In addition, Wikidata (Vrandeĉić and Krötzsch, 2014; Yoo and Jeong, 2020) is also queried to gather all the Wikipedia pages related to the components of the wikicat name. Wikidata is the central repository of structured information for all the projects of the Wikimedia Foundation, storing more than 92 million data items (text, images, dates, …) accessible through SPARQL queries. Like Wikipedia, Wikidata is aimed at crowdsourced data acquisition, being freely editable by people or programs, not only regarding contents but also data structure. To fetch this second set of URLs, the following SPARQL query is made to the well-known Wikidata access endpoint (this time Greco-PersianWars being the wikicat searched in the query):
This query provides a second set of URLs (denoted as ${U_{\textit{WK}}}({w_{k}})$ in Eq. (7)), usually larger than the first one, although composed of less reliable documents.
The second query makes it possible to collect some interesting documents tightly related to the application scenario that, for whatever reason, have not been tagged by users with the set of characterising wikicats (maybe they are even tagged with some similar wikicat that was not retrieved in the first phase, e.g. NavalBattlesInvolvingGreece). In any case, less reliable documents will obtain a low rank in the following steps and will be discarded.
Finally, both sets of URLs (represented in Eqs. (6) and (7)) are joined (removing duplicates), resulting in Eq. (8):
At this point, $U({T_{0}})$ is defined as the set of URLs that are associated (in DBpedia or Wikidata) with some wikicat included in $\textit{WK}({T_{0}})$ (that is, URLs of pages that may have a strong relationship with ${T_{0}}$), as shown in Eq. (9):
In the example, 90735 different URLs were identified from the 42 wikicats collected. All of them corresponded to Wikipedia pages that were likely related to the context defined by the initial text ${T_{0}}$.
3.5 Fetching and Cleaning Discovered URLs
Every URL ${u_{j}}$ included in $U({T_{0}})$ is now downloaded and cleaned (markup, styling, references, graphical items, etc., are removed) to become a candidate text to be included in the corpus. Actually, only those documents with a minimum text length (currently 300 bytes) are considered as candidate texts, simply because short texts usually denote meaningless pages, unlikely to contain DBpedia entities (for instance, disambiguation pages). In this regard, note that in the example considered in the paper, 83919 texts were downloaded whose length was above the mentioned threshold. At this point, several thousand documents whose content is likely to be similar to some extent to the initial text ${T_{0}}$ are available, and they are candidates to be incorporated into the ad hoc corpus pursued in the approach. This set of candidate texts is denoted as $\textit{CT}({T_{0}})$ in Eq. (10):
Of course, a large number of these documents might have a tangential relationship to the initial text (e.g. a battle involving Athens corresponding to a different historical stage). It is therefore necessary to measure their similarity to the particular context, order them according to that metric, and discard the irrelevant ones.
3.6 Assessing Relevance of Each Candidate Text
By simple visual inspection, it is easy to notice that a significant amount of the documents obtained in the previous phase do not have a strong enough relationship to the proposed domain, and for that reason they should not be incorporated into a domain-specific corpus. Of course, the retrieved wikicats may be quite specific (e.g. Greco-PersianWars), leading to documents that are likely to belong to the theme of the initial text. But they can also be broad in scope, mixing similar URLs with unrelated ones (e.g. PeopleFromAthens leading both to Themistocles – leader of the Greeks in the scenario under consideration – and Queen Sofia of Spain, currently alive and probably irrelevant in the context of the Battle of Thermopylae).
To detect and discard these uninteresting documents, the similarity between the initial text ${T_{0}}$ and each of the candidate texts ${T_{j}}$ is measured. Depending on these values, the relationship between ${T_{j}}$ and ${T_{0}}$ can be relevant (and thus the document is incorporated into the ad hoc corpus) or irrelevant (the document is discarded), as depicted in Eq. (11):
The different similarity metrics that have been taken into account to detect the relationship between each of the candidate texts and ${T_{0}}$ are explained in Section 3.6.1. The way in which the best metric has been identified is detailed in Section 3.6.2. Finally, the relevance criteria considered to evaluate the candidate texts are discussed in Section 4.
3.6.1 How to Measure Similarity Between Each Candidate ${T_{j}}$ and ${T_{0}}$
When it comes to detecting resemblance between each candidate text and the input one (${T_{0}}$), part of the significant information bound to them (discovered by the procedures described in previous sections) can be leveraged, such as their common wikicats and subjects. Besides, it is also possible to resort to approaches defined in the literature – based on using word-level and sentence-level embeddings – that have attained good performance for measuring semantic similarity between texts (Le and Mikolov, 2014; Lau and Baldwin, 2016; Fu et al., 2018; Gali et al., 2019; Dai et al., 2020; Chen et al., 2019). Details of the four metrics adopted in this work and the results of the experiments justifying the adoption of a Doc2Vec-based solution are presented next.
1. Wikicats Jaccard similarity (${\textbf{\textit{Sim}}_{W}}$). This metric measures the similarity between the initial text ${T_{0}}$ and a candidate one ${T_{c}}$ according to the coincidence of the wikicats that characterise each of them. That is:
2. Subjects Jaccard similarity (${\textbf{\textit{Sim}}_{S}}$). This metric is similar to the previous one but uses the common subjects between ${T_{0}}$ and ${T_{c}}$ instead of the wikicats.
3. spaCy similarity (${\textbf{\textit{Sim}}_{C}}$). The computation of similarity values by means of this metric is driven by spaCy, a Python package for natural language processing. As usual in similar packages, it provides functions for tokenizing texts, removing stopwords and punctuation, classifying words according to grammatical categories… In addition, it also implements mechanisms to assign a vector to each word. For this task, it uses algorithms like GloVe or Word2Vec (a variant of the latter by default) to assign a word embedding vector to each word of the vocabulary. And, naturally, it provides functions to measure similarity between words, comparing vectors through the traditional cosine-based similarity. Directly derived from this, spaCy also provides a simple mechanism to measure similarity between texts, generating a vector for each text (the average of the corresponding vectors of each word of the text) and computing the cosine-based similarity between the text vectors.
4. Doc2Vec similarity (${\textbf{\textit{Sim}}_{\textbf{\textit{AP}}}}$). Some works in the literature have shown that Doc2Vec performs robustly in measuring document similarity when trained on large external corpora (Lau and Baldwin, 2016; Dai et al., 2020). Bearing these results in mind, this embedding model has been explored to select the candidate documents most similar to the initial text ${T_{0}}$. Since the aim is to build an ad hoc corpus, at this point there was no dataset from which to learn a custom-trained Doc2Vec model. Therefore, a model pre-trained on a publicly available generic corpus has been adopted. In particular, this paper uses the well-known Doc2Vec model trained on a large corpus of news from the Associated Press (AP). Using this model and the Gensim implementation of the Doc2Vec algorithm, the approach obtains the characteristic vector of both each candidate text ${T_{c}}$ and ${T_{0}}$, and then uses the cosine-based similarity to measure the similarity between vectors (and so between documents). For the training process, a simple pre-processing has been carried out on every text using the Gensim API: tokenize to obtain a list of words, lowercase them, remove punctuation, and remove stop-words (a sketch of some of these metrics is given after this list).
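For concreteness, the following sketch shows how three of the metrics above can be computed in Python with Gensim and spaCy; the spaCy model and the AP model file names are illustrative assumptions, not values taken from the repository.

```python
# Sketch of Sim_W (wikicats Jaccard), Sim_C (spaCy) and Sim_AP (Doc2Vec, AP pre-trained model).
import spacy
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess
from numpy import dot
from numpy.linalg import norm

nlp = spacy.load("en_core_web_md")  # medium English model with word vectors (assumption)


def sim_wikicats(wikicats_t0: set, wikicats_tc: set) -> float:
    """Jaccard coefficient between the wikicat sets of T0 and a candidate Tc."""
    union = wikicats_t0 | wikicats_tc
    return len(wikicats_t0 & wikicats_tc) / len(union) if union else 0.0


def sim_spacy(t0: str, tc: str) -> float:
    """spaCy text similarity: cosine between the averaged word vectors of each text."""
    return nlp(t0).similarity(nlp(tc))


def sim_doc2vec(model: Doc2Vec, t0: str, tc: str) -> float:
    """Cosine similarity between the vectors inferred by the pre-trained model."""
    v0 = model.infer_vector(simple_preprocess(t0))  # tokenize, lowercase, drop punctuation
    vc = model.infer_vector(simple_preprocess(tc))
    return float(dot(v0, vc) / (norm(v0) * norm(vc)))


# Illustrative usage (the AP model file name is an assumption):
# ap_model = Doc2Vec.load("doc2vec_ap_news.model")
# score = sim_doc2vec(ap_model, t0_text, candidate_text)
```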
3.6.2 How to Select the Best Similarity Metric in the Approach
In order to decide which of the four metrics described in the previous section (${\textit{Sim}_{W}}$, ${\textit{Sim}_{S}}$, ${\textit{Sim}_{C}}$ and ${\textit{Sim}_{\textit{AP}}}$) is the best for the intended purpose, a supervised approach will be followed to measure how well they can identify the similarity between ${T_{0}}$ and a set of documents recognized as highly similar.
• First, the similarity between the initial text and each candidate text is computed using each of the similarity metrics. This allows the set of candidate texts in $\textit{CT}({T_{0}})$ (Eq. (10)) to be sorted in decreasing order according to their similarity value ${\textit{Sim}_{x}}$ with respect to ${T_{0}}$; that is, each ${\textit{Sim}_{x}}$ will lead to a different ${\textit{CT}_{x}}({T_{0}})$. In this regard, note that the starting point is the input text ${T_{0}}$ containing several DBpedia entities detected by DBpedia Spotlight ($\textit{DE}({T_{0}})$ in Eq. (2)). As such entities represent Wikipedia pages, their corresponding cleaned texts (included as candidates in $\textit{CT}({T_{0}})$) are also available, and they will be the testing set, as it is clear that all of them are texts closely related to the subject of ${T_{0}}$.
• Each entity ${e_{i}}$ appearing in ${T_{0}}$ will be located at a different position ${P_{{x_{i}}}}$ in each ${\textit{CT}_{x}}({T_{0}})$ (so that the higher the similarity, the lower the position). Thus, the best similarity metric ${\textit{Sim}_{x}}$ is the one that locates all of ${T_{0}}$’s entities as low as possible in the set ${\textit{CT}_{x}}({T_{0}})$. In particular, the approach selects the similarity metric with the lowest average position for all the entities of ${T_{0}}$ (denoted as $\textit{Avrg}({P_{x}})$). The results of this procedure are shown in the snapshot of the developed tool depicted in Fig. 2.
Fig. 2
Interface of the corpus builder tool developed by the authors of the paper, which aims to evaluate the similarity between the initial text ${T_{0}}$ and the selected candidate texts through the four metrics considered in the approach (${\textit{Sim}_{W}}$, ${\textit{Sim}_{S}}$, ${\textit{Sim}_{C}}$ and ${\textit{Sim}_{\textit{AP}}}$).
Returning to the example illustrated throughout this section, Table 2 shows the positions occupied by the 10 DB-SL entities (discovered in ${T_{0}}$) in the ordered sets that have been obtained using the four metrics considered in the tests (${\textit{Sim}_{W}}$, ${\textit{Sim}_{S}}$, ${\textit{Sim}_{C}}$ and ${\textit{Sim}_{\textit{AP}}}$). Note that ${\textit{Sim}_{\textit{AP}}}$ is the Doc2Vec similarity metric computed with the AP pre-trained model.
Table 2
Positions occupied by the 10 DB-SL entities (discovered in the initial text ${T_{0}}$) in a set of candidate documents that have been ordered (as per their similarity to that text) using each of the four metrics considered in the approach. The best metric is the one that finds the entities in the lowest positions in the ordered set, that is, in the documents most similar to the input text (${\textit{Sim}_{\textit{AP}}}$ in this case).
These data are not deterministic for the Doc2Vec-based similarity ${\textit{Sim}_{\textit{AP}}}$, because the Doc2Vec model provides slightly different similarity values for any pair of documents in every execution. This is expected behaviour, as inferring a vector for a new document is not a deterministic process but an iterative one, with some randomization involved in the negative sampling used during training. These slight variations are not usually important when estimating a similarity but, in the example illustrated in the paper, 83919 documents are sorted, and such variations significantly affect the final order among them (and the positions of the entities).
To address this issue and allow Doc2Vec to be compared with the other similarity metrics, the proposed algorithm (outlined in Algorithm 1) computes the results of 5 different executions and uses the average position of every entity. As depicted in Table
2, the Doc2Vec-driven metric was found to be the best one according to the defined criterion (the average is 100.3). In light of the results, the Doc2Vec similarity metric (denoted by
${\textit{Sim}_{\textit{AP}}}$) is adopted to order the candidate texts in
$\textit{CT}({T_{0}})$, hereafter denoted as
${\textit{CT}_{\textit{AP}}}({T_{0}})$.
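A minimal sketch of this averaging step is shown below. It assumes a GenSim Doc2Vec model (such as the AP one) and tokenized texts; variable names and the model path are illustrative, and the actual Algorithm 1 may differ in its details.

import numpy as np
from gensim.models.doc2vec import Doc2Vec
from gensim.utils import simple_preprocess

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def averaged_entity_positions(model, t0_tokens, candidates, entities, runs=5):
    # candidates: dict {candidate_id: token list}; entities: ids of T0's entity pages.
    # infer_vector() is stochastic, so the position of each entity is averaged over runs.
    totals = {e: 0.0 for e in entities}
    for _ in range(runs):
        v0 = model.infer_vector(t0_tokens)
        scores = {cid: cosine(v0, model.infer_vector(toks))
                  for cid, toks in candidates.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        for e in entities:
            totals[e] += ranked.index(e) + 1
    return {e: totals[e] / runs for e in entities}

# Example: model = Doc2Vec.load("doc2vec_ap.model")  # hypothetical path to the AP model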
As indicated in Algorithm 1, a subset of the candidate documents occupying the top positions in ${\textit{CT}_{\textit{AP}}}({T_{0}})$ should be selected to be incorporated into the ad hoc corpus. As can be guessed, there is no sharp line marking which candidates should be included in the corpus and which should not: the differences among candidates on either side of any threshold are negligible. Of course, the more candidates are added, the more stable the resulting embedding model becomes, but the added documents will be less and less similar to the initial text. It also seems clear that, in the end, any consistency measure can be influenced by the final application in which the corpus is used.
What can be done, however, is to analyse different scenarios and study their results according to a common criterion. This makes it possible to assess the performance of the models obtained by training on different subsets of ${\textit{CT}_{\textit{AP}}}({T_{0}})$, and to observe how the selected size affects them. This is the criterion adopted in the next section to validate the proposed approach for selecting the documents that will finally be incorporated into the tailor-made corpus.
4 Validating the Selection of In-Domain Documents for the Corpus
The starting point is the collection of candidate documents that have been gathered from Wikipedia and ordered according to their similarity to
${T_{0}}$. Recall that such ordered set was denoted as
${\textit{CT}_{\textit{AP}}}({T_{0}})$ as it was the result of using the best metric (
${\textit{Sim}_{\textit{AP}}}$), which is based on the Doc2Vec model pre-trained on the AP news collection. The documents in this set are ordered from most to least similar to
${T_{0}}$, but it is likely that not all of them are strongly enough related to the context, since they have been collected from wikicats that may be only tangentially related to the initial text. Therefore, only a subset of these documents needs to be considered; in particular, only the texts occupying the first positions of the ordered set should be incorporated into the ad hoc corpus being pursued. As there are no clear guidelines on this threshold (the similarity values are quite continuous over the 83919 computed scores), different thresholds were chosen so that the resulting corpora could be used in a common scenario and the quality of the results evaluated. In this regard, it was decided to select percentages of the set
${\textit{CT}_{\textit{AP}}}({T_{0}})$ ranging from 1% to 10%, and to train the corresponding Doc2Vec models on that set of ad hoc corpora (resulting in sizes 839, 1678, …). The consistency of such models (denoted as
${M_{1}}$ to
${M_{10}}$) was then analysed individually (Section
4.1) and compared with each other and with the generic AP model (Section
4.2).
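For illustration, the construction of these ten models (${M_{1}}$ to ${M_{10}}$) can be sketched with GenSim as follows. This is a hedged example: the function name is hypothetical and the training hyperparameters shown are illustrative, not necessarily those used by the authors.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

def train_percentage_models(ct_ap_texts, percentages=range(1, 11)):
    # ct_ap_texts: candidate texts already ordered by Sim_AP (most similar first).
    models = {}
    for p in percentages:
        top = ct_ap_texts[: max(1, len(ct_ap_texts) * p // 100)]   # top-p% of CT_AP(T0)
        tagged = [TaggedDocument(simple_preprocess(text), [i])
                  for i, text in enumerate(top)]
        models[p] = Doc2Vec(tagged, vector_size=300, epochs=20, min_count=2)
    return models   # models[1] ... models[10] play the role of M_1 ... M_10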
The comparison with the generic AP model should serve to answer a relevant outstanding question: “Are any of these ad hoc models better than the generic AP model?” Note that this is the cornerstone of the research described in the paper, which aims to devise a procedure to build an ad hoc corpus (composed of documents significantly related to a given application scenario), since several references in the literature state that the model derived from such a corpus should be better than a model trained on a generic corpus of documents (not particularly related to the scenario under consideration). Therefore, evidence in this respect should be provided.
4.1 Consistency Tests for Evaluating Embedding Models Learned from Ad Hoc Corpora
One simple consistency test for the resulting models is to compute the self-rank of each training document when searching for similar documents. That is, for each training file, the model is asked for the N most similar documents to it, as if such file were new and not already in the model. Obviously, the model should select that same file as the most similar to itself (1-rank), that is, the first in the list of most similar documents. Sometimes, however, the model makes a mistake and finds another document that appears even more similar. The fewer the mistakes, the better the model.
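A minimal sketch of this self-rank check is shown below, assuming a GenSim Doc2Vec model and the list of TaggedDocument objects it was trained on; the function name is illustrative.

def one_rank_percentage(model, tagged_docs, topn=10):
    # Percentage of training documents whose own tag comes back as the most similar
    # when their vector is re-inferred as if the document were unseen.
    hits = 0
    for doc in tagged_docs:
        inferred = model.infer_vector(doc.words)
        most_similar = model.dv.most_similar([inferred], topn=topn)
        if most_similar[0][0] == doc.tags[0]:
            hits += 1
    return 100.0 * hits / len(tagged_docs)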
Table
3 depicts the results for the 10 Doc2Vec models (trained on the 10 ad hoc corpora), where the ${\textit{Rank}_{i}}$ row shows the percentage of training documents that obtained a 1-rank for each model ${M_{i}}$. The rather high resulting figures (close to 100%) confirm that the behaviour of all learned ad hoc models is consistent.
Table 3
1-rank results obtained for 10 ad hoc Doc2Vec models generated in the approach (${M_{1}}$ to ${M_{10}}$). The results confirm that, given a training document, these models were almost always correct in identifying that document as the most similar to itself (on average, this was true for 98.5% of the training documents).
${M_{i}}$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
${\textit{Rank}_{i}}$ (%) | 94.5 | 98.9 | 99.2 | 99 | 98.9 | 98.8 | 98.8 | 99 | 98.9 | 98.8
Another simple consistency check for the models consists of observing how well they discriminate between similar and dissimilar documents. To this end, a simple experiment was carried out in which two lists of documents in ${\textit{CT}_{\textit{AP}}}({T_{0}})$ were selected:
-
• ${L_{1}}$: the 100 most similar documents to the initial text, according to the ${\textit{Sim}_{\textit{AP}}}$ similarity metric.
-
• ${L_{2}}$: the 100 least similar documents to the initial text, as per the same metric.
Each of these 200 documents was divided into two parts of equal size, and several similarity values were computed for each one of the 10 ad hoc Doc2Vec models:
-
• ${S_{1}}$: similarity values between both parts of each file in ${L_{1}}$ (these values should be high as the document is similar to ${T_{0}}$ and both parts are about the same theme). A list of 100 similarity values was obtained for each model.
-
• ${S_{2}}$: similarity values between both parts of each file in ${L_{2}}$ (these should also be high, as both parts are about the same theme). A list of 100 values was computed for each model.
-
• ${S_{3}}$: similarity values between the first part of each file in ${L_{1}}$ and the first part of the file occupying the same index in ${L_{2}}$ (these values should be low since, most of the time, the paired documents are unrelated, as illustrated by the example mentioned earlier on Themistocles and Queen Sofia of Spain).
The average of the similarities
${S_{1}}$ to
${S_{3}}$ was finally computed for each model
${M_{i}}$, as depicted in Table
4. Once again, the results confirm that the 10 ad hoc models are fully consistent with what was expected (similarities between the two parts of a document, whether similar or dissimilar to ${T_{0}}$, are very high, while cross similarities are low).
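The split-half check can be sketched as follows; this is a hedged Python example with illustrative helper names, and the implementation actually used by the authors may differ.

import numpy as np
from gensim.utils import simple_preprocess

def doc_sim(model, tokens_a, tokens_b):
    u, v = model.infer_vector(tokens_a), model.infer_vector(tokens_b)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def halves(text):
    words = simple_preprocess(text)
    mid = len(words) // 2
    return words[:mid], words[mid:]

def split_half_averages(model, l1_texts, l2_texts):
    # S1/S2: similarity between the two halves of each document in L1 / L2.
    # S3: similarity between the first halves of documents paired across L1 and L2.
    s1 = [doc_sim(model, *halves(t)) for t in l1_texts]
    s2 = [doc_sim(model, *halves(t)) for t in l2_texts]
    s3 = [doc_sim(model, halves(a)[0], halves(b)[0]) for a, b in zip(l1_texts, l2_texts)]
    avg = lambda xs: sum(xs) / len(xs)
    return avg(s1), avg(s2), avg(s3)   # the three rows of Table 4 for this model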
Table 4
Similarity values that our 10 Doc2Vec models have measured between documents that are related to ${T_{0}}$ to different extents. The results confirm that the 10 ad hoc models are able to detect high similarity between strongly related documents (values close to 1 included in the rows corresponding to the average of ${S_{1}}$ and ${S_{2}}$), and low similarity between unrelated texts (very low values referring to the average of ${S_{3}}$).
${M_{i}}$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
$\textit{Avg}$ ${S_{1}}$ | 0.93 | 0.93 | 0.93 | 0.93 | 0.94 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93
$\textit{Avg}$ ${S_{2}}$ | 0.83 | 0.84 | 0.83 | 0.84 | 0.84 | 0.83 | 0.84 | 0.84 | 0.84 | 0.85
$\textit{Avg}$ ${S_{3}}$ | 0.18 | 0.19 | 0.19 | 0.19 | 0.18 | 0.21 | 0.21 | 0.18 | 0.19 | 0.22
4.2 Comparison to the Doc2Vec-AP Model
The previous tests showed that the models generated in this approach work well, but they did not reveal which one is the best or whether they are better than the generic Doc2Vec AP model. To shed light on this issue, the performance of the models in a concrete usage scenario has to be evaluated. The methodology adopted in the validation and the discussion of the results obtained are detailed in Section
4.2.1 and Section
4.2.2, respectively.
4.2.1 Experimental Methodology
The proposed validation scenario is inspired by the procedure that was adopted to select the similarity metric based on the Doc2Vec AP model (
${\textit{Sim}_{\textit{AP}}}$) as the best one among the 4 possibilities explored in Section
3.6.2. In particular, the procedure is organized as follows:
-
1. First, the ${M_{1}}$ to ${M_{10}}$ models were used to measure the Doc2Vec similarity between ${T_{0}}$ and each candidate text (included in the set ${\textit{CT}_{\textit{AP}}}({T_{0}})$). This led to 10 sets whose documents were arranged in decreasing order according to the similarity values measured with these models (denoted as ${\textit{CT}_{1}}({T_{0}})$ to ${\textit{CT}_{10}}({T_{0}})$).
-
2. Next, the focus was put on the positions of ${T_{0}}$’s entities in the ordered sets. These position values were averaged to obtain a consistency indicator that allowed the 10 ad hoc models to be compared with each other and with the generic $\textit{AP}$ one.
Table 5
Set of 18 DB-SL entities considered in the experimental validation in the context of the Battle of Thermopylae.
The 10 DB-SL entities initially identified from the initial text ${T_{0}}$
Events | People | Places
Battle_of_Thermopylae | Leonidas_I | Thespiae
Battle_of_Artemisium | Xerxes_I | Sparta
Battle_of_Marathon | Themistocles |
 | Ephialtes_of_Trachis |
 | Darius_I |
The new 8 DB-SL entities about the Battle of Thermopylae which are not present in ${T_{0}}$
Events | People
Battle_of_Salamis | Mardonius
Battle_of_Mycale | Hydarnes
Battle_of_Plataea | Hydarnes_II
 | Immortals_(Achaemenid_Empire)
 | Herodotus
-
3. For a more robust comparison in the application scenario linked to the
Battle of Thermopylae, the list of 10 DB-SL entities initially discovered was extended with new entities that were actually significant in the context of the second Persian invasion of Greece (but were not mentioned in the initial text
${T_{0}}$), as depicted in Table
5. This way, a very descriptive set of 18 DBpedia entities that characterised the context under consideration was defined. Of course, all of them were included in the set of candidates collected in the initial phases described in Section
3. Moreover, their similarity to
${T_{0}}$ could be calculated with the Doc2Vec algorithm using the testing models
${M_{1}}$ to
${M_{10}}$, and they were also present in the aforementioned ordered sets
${\textit{CT}_{x}}({T_{0}})$ calculated from these models (with
$x\in [1,10]$).
-
4. Lastly, the positions of those 18 DBpedia entities were searched in the sets
${\textit{CT}_{1}}({T_{0}})$ to
${\textit{CT}_{10}}({T_{0}})$. The results of a typical run (recall that they change slightly from run to run due to the randomness inherent in the Doc2Vec algorithm), obtained with both the 10 ad hoc models and the generic AP one, are depicted in Table
6.
Table 6
Positions occupied by the 18 DB-SL entities considered in the validation within a set of candidate texts that have been ordered (as per their similarity with ${T_{0}}$) considering the generic AP model and our 10 in-domain Doc2Vec models.
4.2.2 Discussion on Experimental Results
As shown in Table
6, most of the entities occupy relevant positions in all of the studied orderings. This was always the case, even though the actual numbers change between different trainings on the same subset of candidates. As mentioned before, the Doc2Vec training algorithm involves some randomness in its steps (for instance, some data is randomly discarded to accelerate convergence without significantly influencing the final results). This is not important when calculating a given similarity, as a value of 0.865 instead of 0.863 should not make any difference. But when ordering 83919 documents, these small differences can make every entity shift its position slightly up or down. For instance, the
Battle of Plataea (the final battle of the second Persian invasion of Greece) moves between position 1 and position 6, both of which are, in any case, very high rankings when ordering 83919 documents.
In addition, some discordant values have been highlighted with a gray background in Table
6. They are clearly outliers that must be handled in order to conduct a more appropriate analysis. In this regard, note that word embeddings are machine learning techniques that can be influenced by many factors, which sometimes lead to unexpected results. For example, it may happen that a key figure of the historical event under consideration is described in the candidate text (retrieved from Wikipedia) in a way that is not rich enough for these methods to yield the expected similarity values. This is the case for the candidate text of the entity
Herodotus: although he is the main source of information about the
Battle of Thermopylae (hence his mention in the candidate document), in reality this battle is only a small part of his work as a historian. Experiments have also confirmed the influence of other aspects. In particular, even considering the same set of texts, there are several training parameters that can greatly affect the resulting model. With different training sets, these differences can be even more pronounced.
For the above reasons, it is natural that outliers appear. These are documents that should have been rated high, but are not. This happens both with small ad hoc corpora and with the huge generic AP corpus. But, as shown in Table
6, these outliers are not the same in all cases (although they are quite similar across the ad hoc corpora). Because of this, in order to properly compare the ad hoc models with each other and with the AP model, it is convenient to remove such outliers, since the overall quality of a model should be assessed on the majority of its results and not be influenced by a small number of irregular items.
This way, the Z-score (Jiang
et al.,
2009; Aggarwal,
2017) and the IQR (Tukey,
1977; Sunitha
et al.,
2014) methods were adopted to identify outliers. Both methods confirmed that the highlighted values are discordant with the remaining ones. Only values above position 400 were analysed, since low values may be discordant from a purely statistical point of view while still being significant in terms of similarity.
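For reference, the two outlier criteria can be sketched as follows. This is a minimal sketch using standard textbook thresholds; the concrete thresholds applied by the authors are not reported, so the values below are only illustrative.

import numpy as np

def discordant_positions(positions, z_thresh=2.0, min_position=400):
    # Flags a position as an outlier when both the Z-score and the IQR (Tukey) criteria
    # mark it, and only if it exceeds the minimum position analysed in the paper.
    pos = np.asarray(positions, dtype=float)
    z = np.abs((pos - pos.mean()) / pos.std())
    q1, q3 = np.percentile(pos, [25, 75])
    upper_fence = q3 + 1.5 * (q3 - q1)
    flagged = (z > z_thresh) & (pos > upper_fence) & (pos > min_position)
    return pos[flagged].tolist()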
When removing discordant outliers, it should be borne in mind that averages are being compared and, therefore, the same number of elements must be included in every calculation. This is because outliers occupy the highest positions in any ranking, and simply discarding them from their individual rankings would lead to results that are not directly comparable. Thus, the different rankings were inspected to find the one with the highest number of outliers (3, in the case of AP), and that amount was removed from all models, so that the 15 entities ranked in the lowest positions were considered for every model. It is worth noting that this is the best-case scenario for the AP model (the 3 highest-ranked entities were removed from that ordering, thus reducing its mean and making it all the more significant when another model still outperforms AP).
The results obtained with the 10 ad hoc models learned by taking as training datasets different percentages (from 1% to 10%) of the candidate texts in
$\textit{CT}({T_{0}})$ are shown in the top row of Table
7, together with the outcome of the existing AP Doc2Vec model. Note that the results are averaged over several runs, considering only the entities found in the 15 lowest positions for each model.
Table 7
Average positions occupied by the 15 best ranked entities, considering both the AP model and 20 Doc2Vec in-domain models learned by taking between 1% and 20% of the initially-selected candidate documents. The results confirm that the ${M_{5}}$ model obtains the best outcome as it allows finding the 15 entities in the candidate documents that have been selected as the most similar to the input text (i.e. those occupying the lowest positions in the set ${\textit{CT}_{5}}({T_{0}})$).
${M_{i}}$ | $\textit{AP}$ | ${M_{1}}$ | ${M_{2}}$ | ${M_{3}}$ | ${M_{4}}$ | ${M_{5}}$ | ${M_{6}}$ | ${M_{7}}$ | ${M_{8}}$ | ${M_{9}}$ | ${M_{10}}$
${\textit{Avrg}_{i}}$ | $\textbf{78}$ | 58 | 46 | 52 | 47 | $\textbf{43}$ | 44 | 46 | 55 | 58 | $\textbf{83}$
${M_{i}}$ | ${M_{11}}$ | ${M_{12}}$ | ${M_{13}}$ | ${M_{14}}$ | ${M_{15}}$ | ${M_{16}}$ | ${M_{17}}$ | ${M_{18}}$ | ${M_{19}}$ | ${M_{20}}$
${\textit{Avrg}_{i}}$ | 68 | 60 | 61 | 74 | 66 | 75 | $\textbf{93}$ | $\textbf{105}$ | $\textbf{92}$ | $\textbf{90}$
Based on the results, when compared to the 78-score of the
$\textit{AP}$ model, the ad hoc models show excellent average positions for most percentages in the 1% to 10% range. This outcome is reasonable because, when training with a small percentage of candidate documents, the selected texts are on the topic of the initial one (i.e. they are very similar to
${T_{0}}$). Consequently, the models learned from these ad hoc corpora are effective at measuring similarities between
${T_{0}}$ and any of these documents. Particularly, as shown in Table
7, the ad hoc models
${M_{1}}$ to
${M_{9}}$ outperform the generic
$\textit{AP}$ model. This can be observed from the consistently lower average positions measured by each of these models, which are always significantly below 78. Notably, among them, model
${M_{5}}$ achieves the best results.
To examine the evolution of the consistency across a larger number of models derived from different percentages of
$\textit{CT}({T_{0}})$, we replicated the above procedure for values ranging from 11% to 20%, as shown in the bottom row of Table
7. The results indicate that the quality of the learned ad hoc models starts to degrade as an increasing number of documents is included in the training. This weakening is attributed to the weaker relation of the added documents to the theme of
${T_{0}}$, leading to the resulting model being less focused on the target scenario.
Based on Table
7, it can be observed that models trained with high percentages (17% to 20%) of candidate documents perform less effectively than the generic model
$\textit{AP}$. The inclusion of a large number of out-of-domain training texts leads to the development of generic models. Consequently, models
${M_{17}}$ to
${M_{20}}$ are inferior to the
$\textit{AP}$ model, as the latter has been trained with a more extensive document set, making it better suited for detecting similarities between two generic texts rather than being specifically tailored to the topic of
${T_{0}}$.
To sum up, the results presented in Table
7 confirm that the initial ad hoc models significantly outperform the generic AP model. For the domain studied in this paper, a corpus containing only 5% of the candidate texts yielded the best results with an average position of 43, as opposed to the AP model’s 78-score. This implies that, based on the criteria employed in this study, such ad hoc models (which have been constructed fully automatically, without requiring human contributions except for optimizing corpus creation) are more effective than the generic AP model in identifying the documents most relevant to a specific context defined in the initial text
${T_{0}}$.
5 Conclusions and Further Work
In this paper, an automatic procedure based on the Linked Open Data infrastructure has been proposed, which makes it easy to obtain ad hoc corpora (from a user-specified short input text) that bring benefits to existing word-level and document-level embedding models. So far, such models have been fine-tuned on small collections of in-domain documents in order to improve their performance. These documents are often compiled manually, without assessing in any way the relevance of each text to the particular domain. In contrast, the approach described in this paper automatically gathers numerous in-domain training texts by relying on NER tools and state-of-the-art embedding models, in order to guarantee a meaningful relationship between each possible training document and the initial text.
On the one hand, DBpedia Spotlight is used to recognize DBpedia named entities in the initial text and drive the process of building an initial ad hoc corpus. On the other hand, Doc2Vec models make it possible to identify new relevant in-domain texts (to be incorporated into the ad hoc training dataset). This way, the final tailor-made corpus brings together a large number of meaningful and precise information sources, which lead to learning high-quality domain-specific embeddings.
These dense vector representations accurately model domain peculiarities, which is especially critical for exploiting the language representation capabilities of embedding models in very particular fields (e.g. medicine, history or mechanical engineering, just to name a few). Such fields are not adequately covered either by the huge publicly available generic training datasets or by the small hand-collected domain datasets adopted in some existing approaches. Unlike these works, our approach is able to build, without human assistance, a training corpus on any topic and domain that can be exploited by existing models, requiring only a variable-length piece of text as input.
In line with this, well-known models (like Word2Vec and GloVe) could take advantage of this approach to fight the Out-Of-Vocabulary problem, which stems from the usage of generic training corpora. This limitation has traditionally been alleviated in other models (like FastText) by working at the subword level, which introduces excessive computational costs and memory requirements (Armand
et al.,
2017). In contrast, our approach makes it possible to embed in-domain words that would be rare or unusual in a publicly available generic corpus, and therefore impossible to learn from general-domain datasets or even from imprecise/incomplete domain-specific datasets.
Our procedure for custom corpus construction shows several advantages over the approaches presented in the related work section that address similar objectives: the approach described here is more autonomous and less dependent on user feedback to guarantee the quality of its outputs; the relevance of the documents to the given subject is measured before including them in the corpus; and the search process is based on a large open repository where new, previously unknown documents can be discovered, analysed and included in the corpus, besides being used to trigger new search processes, thus leading to larger custom corpora composed of thousands of documents.
As an acknowledged limitation, the paper concentrates on evaluating the performance of the proposed method using Doc2Vec embeddings and does not consider Transformer-based models. Despite recognizing the potential and the remarkable results achieved by sophisticated Transformer architectures like BERT, such approaches have been omitted because certain characteristics of the available models do not align well with the specific objectives of our validation. In particular, our experimental validation has demonstrated that the performance of the custom-built in-domain corpus, when compared to a generic training dataset, is superior within the context of the specific embedding model used (Doc2Vec in this case). To achieve this, we trained a Doc2Vec model from scratch on the tailored collection and compared it to a generic model. However, in the case of Transformers, training a BERT model from scratch is not feasible for us due to its unaffordable computational requirements. Instead, it is common practice to start from a model pre-trained on a massive collection of generic texts (referred to as the base model) and then fine-tune it for specific NLP tasks and custom corpora.
Given that our approach aimed to compare a model trained from scratch on the ad hoc collection against one trained on a generic collection, we found the use of Doc2Vec more suitable for the purposes of our research. This choice was driven by the need for a more equitable comparison scenario between the general and ad hoc approaches, which is made possible by comparing models trained from scratch with Doc2Vec. The proposed experimental validation has tested the research hypothesis considered in the approach and has demonstrated that the performance of the automatically-built in-domain corpus is better than that of a generic training dataset in the context of a particular embedding model (Doc2Vec in this case). In reality, replicating or even improving this behaviour with other recent and sophisticated Transformer-based embedding models (such as GPT, BERT and their multiple variants) does not invalidate the results obtained in this work with Doc2Vec. In particular, apart from the aforementioned reasons, Doc2Vec presents two additional compelling advantages. Firstly, its good performance compared with related document-level models (Bhattacharya
et al.,
2022; Kim
et al.,
2018; Grefenstette
et al.,
2013; Mikolov
et al.,
2013a; Kiros
et al.,
2015,
2018) and secondly, the existence of a mature implementation of it through GenSim (Rehürek and Sojka,
2010), which allowed us to train our own models from scratch and to evaluate the effect of the training corpus on the quality of the resulting models (rather than simply using pre-trained models learned from inaccessible documents).
Regarding further work, having experimentally validated that in-domain corpora outperform generic training datasets in a very specific domain, our short-term research plans involve exploring the performance of models that are first learned from a generic corpus and then fine-tuned on a collection of in-domain texts (which will be automatically retrieved by the proposed algorithm). The goal of these experiments is to incorporate a diverse array of models, encompassing advanced Transformer-based approaches like BERT, along with the many other models that are continually emerging in the literature.