<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "JATS-journalpublishing1.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">INFORMATICA</journal-id>
<journal-title-group><journal-title>Informatica</journal-title></journal-title-group>
<issn pub-type="epub">1822-8844</issn><issn pub-type="ppub">0868-4952</issn><issn-l>0868-4952</issn-l>
<publisher>
<publisher-name>Vilnius University</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">INFOR562</article-id>
<article-id pub-id-type="doi">10.15388/24-INFOR562</article-id>
<article-categories><subj-group subj-group-type="heading">
<subject>Research Article</subject></subj-group></article-categories>
<title-group>
<article-title>Online Detection and Infographic Explanation of Spam Reviews with Data Drift Adaptation</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>de Arriba-Pérez</surname><given-names>Francisco</given-names></name><email xlink:href="farriba@gti.uvigo.es">farriba@gti.uvigo.es</email><xref ref-type="aff" rid="j_infor562_aff_001">1</xref><bio>
<p><bold>F. de Arriba-Pérez</bold> received a BS degree in telecommunication technologies engineering in 2013, an MS degree in telecommunication engineering in 2014, and a PhD in 2019 from the University of Vigo, Spain. He is currently a researcher in the Information Technologies Group at the University of Vigo, Spain. His research includes the development of machine learning solutions for different domains like finance and health.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>García-Méndez</surname><given-names>Silvia</given-names></name><email xlink:href="sgarcia@gti.uvigo.es">sgarcia@gti.uvigo.es</email><xref ref-type="aff" rid="j_infor562_aff_001">1</xref><xref ref-type="corresp" rid="cor1">∗</xref><bio>
<p><bold>S. García-Méndez</bold> received a PhD in information and communication technologies from the University of Vigo in 2021. Since 2015, she has worked as a researcher with the Information Technologies Group at the University of Vigo. She is collaborating with foreign research centres as part of her postdoctoral stage. Her research interests include natural language processing techniques and machine learning algorithms.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Leal</surname><given-names>Fátima</given-names></name><email xlink:href="fatimal@upt.pt">fatimal@upt.pt</email><xref ref-type="aff" rid="j_infor562_aff_002">2</xref><bio>
<p><bold>F. Leal</bold> holds a PhD in information and communication technologies from the University of Vigo, Spain. She is an auxiliary professor at Universidade Portucalense in Porto, Portugal, and a researcher at REMIT (Research on Economics, Management, and Information Technologies). Her research is based on crowdsourced information, including trust and reputation, big data, data streams, and recommendation systems. Recently, she has been exploring blockchain technologies for responsible data processing.</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Malheiro</surname><given-names>Benedita</given-names></name><email xlink:href="mbm@isep.ipp.pt">mbm@isep.ipp.pt</email><xref ref-type="aff" rid="j_infor562_aff_003">3</xref><xref ref-type="aff" rid="j_infor562_aff_004">4</xref><bio>
<p><bold>B. Malheiro</bold> is a coordinator professor at Instituto Superior de Engenharia do Porto, the School of Engineering of the Polytechnic of Porto, and senior researcher at <sc>inesc tec</sc>, Porto, Portugal. She holds a PhD and an MSc in electrical engineering and computers and a five-year graduation in electrical engineering from the University of Porto. Her research interests include artificial intelligence, computer science, and engineering education. She is a member of the Association for the Advancement of Artificial Intelligence (<sc>aaai</sc>), the Portuguese Association for Artificial Intelligence (<sc>appia</sc>), the Association for Computing Machinery (<sc>acm</sc>), and the Professional Association of Portuguese Engineers (<sc>oe</sc>).</p></bio>
</contrib>
<contrib contrib-type="author">
<name><surname>Burguillo</surname><given-names>Juan C.</given-names></name><email xlink:href="J.C.Burguillo@uvigo.es">J.C.Burguillo@uvigo.es</email><xref ref-type="aff" rid="j_infor562_aff_001">1</xref><bio>
<p><bold>J.C. Burguillo</bold> received an MSc degree in telecommunication engineering and a PhD degree in telematics at the University of Vigo, Spain. He is currently a full professor at the Department of Telematic Engineering and a researcher at the AtlanTTic Research Center in Telecom Technologies at the University of Vigo. He is the area editor of the journal <italic>Simulation Modelling Practice and Theory</italic> (<italic>SIMPAT</italic>), and his topics of interest are intelligent systems, evolutionary game theory, self-organization, and complex adaptive systems.</p></bio>
</contrib>
<aff id="j_infor562_aff_001"><label>1</label>Information Technologies Group, <institution>atlanTTic, University of Vigo</institution>, <country>Spain</country></aff>
<aff id="j_infor562_aff_002"><label>2</label>Research on Economics, Management and Information Technologies, <institution>Universidade Portucalense</institution>, <country>Portugal</country></aff>
<aff id="j_infor562_aff_003"><label>3</label>ISEP, <institution>Polytechnic of Porto</institution>, Rua Dr. António Bernardino de Almeida, 4249-015 Porto, <country>Portugal</country></aff>
<aff id="j_infor562_aff_004"><label>4</label><institution>Institute for Systems and Computer Engineering, Technology and Science</institution>, <country>Portugal</country></aff>
</contrib-group>
<author-notes>
<corresp id="cor1"><label>∗</label>Corresponding author.</corresp>
</author-notes>
<pub-date pub-type="ppub"><year>2024</year></pub-date><pub-date pub-type="epub"><day>17</day><month>6</month><year>2024</year></pub-date><volume>35</volume><issue>3</issue><fpage>483</fpage><lpage>507</lpage><history><date date-type="received"><month>6</month><year>2023</year></date><date date-type="accepted"><month>6</month><year>2024</year></date></history>
<permissions><copyright-statement>© 2024 Vilnius University</copyright-statement><copyright-year>2024</copyright-year>
<license license-type="open-access" xlink:href="http://creativecommons.org/licenses/by/4.0/">
<license-p>Open access article under the <ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">CC BY</ext-link> license.</license-p></license></permissions>
<abstract>
<p>Spam reviews are a pervasive problem on online platforms due to their significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern is the need for transparency in spam detection. Consequently, this paper addresses those problems by proposing an online solution for identifying and explaining spam reviews, incorporating data drift adaptation. It integrates (<italic>i</italic>) incremental profiling, (<italic>ii</italic>) data drift detection &amp; adaptation, and (<italic>iii</italic>) identification of spam reviews employing Machine Learning. The explainable mechanism displays a visual and textual prediction explanation in a dashboard. The best results reached a spam <italic>F</italic>-measure of up to 87%.</p>
</abstract>
<kwd-group>
<label>Key words</label>
<kwd>data drift</kwd>
<kwd>interpretability and explainability</kwd>
<kwd>Natural Language Processing</kwd>
<kwd>online machine learning</kwd>
<kwd>spam detection</kwd>
</kwd-group>
<funding-group><funding-statement>This work was partially supported by: (<italic>i</italic>) Xunta de Galicia grants ED481B-2021-118 and ED481B-2022-093, Spain; and (<italic>ii</italic>) Portuguese national funds through FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) – as part of project UIDP/50014/2020 (<uri>https://doi.org/10.54499/UIDP/50014/2020</uri>).</funding-statement></funding-group>
</article-meta>
</front>
<body>
<sec id="j_infor562_s_001">
<label>1</label>
<title>Introduction</title>
<p>Online reviews are a valuable source of information that influences public opinion and directly impacts customers’ decision to acquire a product or service (Zhang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_061">2018</xref>). However, some reviews are fabricated to promote or undervalue goods and services artificially, i.e. creating spam data (Reyes-Menendez <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_041">2019</xref>; Hutama and Suhartono, <xref ref-type="bibr" rid="j_infor562_ref_024">2022</xref>). Spammers can be humans or bots dedicated to creating deceptive reviews (García-Méndez <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_018">2022b</xref>; Hamida <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_021">2022</xref>). In this context, spam detection is a critical task in online systems. Spam negatively impacts the user experience and the performance and security of the system (Wang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_057">2021</xref>).</p>
<p>Consequently, a broad set of Machine Learning (<sc>ml</sc>) methods has been explored for spam detection, mainly supervised learning (Crawford <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_009">2015</xref>). In recent years, Natural Language Processing (<sc>nlp</sc>) techniques (García-Méndez <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_017">2022a</xref>) have been adopted to improve the accuracy of spam detection (Garg and Girdhar, <xref ref-type="bibr" rid="j_infor562_ref_019">2021</xref>). Given the dynamic nature of the language and behaviour of spammers, the challenge is to maintain the effectiveness of spam detection over time by integrating the detection of model drifts, such as data and concept drifts, in a stream-based environment (Wang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_058">2019</xref>). While data drifts are related to changes in the input data, concept drifts are reflected over time in the predicted target (Duckworth <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_011">2021</xref>). Specifically, concept drifts in spam detection refer to the changes in the statistical properties of the spam and non-spam entries over time, which can cause the spam detection system to misclassify reviews. In addition, in a data stream environment, the distribution of input features used to train the spam detection model can change over time, producing data drifts (Barddal <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_003">2017</xref>). Notably, the latter drifts are easier to detect and deal with in a transparent model than in an opaque one (Cano and Krawczyk, <xref ref-type="bibr" rid="j_infor562_ref_005">2019</xref>).</p>
<p>Explainability in spam detection refers to understanding and explaining how a particular text was classified as spam by an automated system (Stites <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_050">2021</xref>). Therefore, in spam detection, an interpretable mechanism for <sc>nlp</sc> and concept drift techniques is required to detect spammers efficiently in real time. According to Crawford <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_009">2015</xref>), the existing data stream spam detection research is scant. Consequently, this paper contributes an interpretable online spam detection framework that combines <sc>nlp</sc> techniques and data drift detectors. The proposed framework achieves high accuracy in spam detection and makes the detection process transparent, allowing users to understand why a review is classified as spam. In the evaluation with two experimental data sets, the framework reaches about 85% in the considered evaluation metrics.</p>
<p>The rest of this paper is organized as follows. Section <xref rid="j_infor562_s_002">2</xref> overviews relevant work concerning profiling, classification, data drifts, and explainability for spam detection tasks. Section <xref rid="j_infor562_s_008">3</xref> introduces the proposed method, detailing the data processing, stream-based classification procedures, and online explainability. Section <xref rid="j_infor562_s_014">4</xref> describes the experimental setup and presents the empirical evaluation results considering the online classification and explanation. Finally, Section <xref rid="j_infor562_s_021">5</xref> highlights the achievements and future work.</p>
</sec>
<sec id="j_infor562_s_002">
<label>2</label>
<title>Related Work</title>
<p>As previously mentioned, online reviews have become an essential source of information for consumers to make purchasing decisions (Zhang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_061">2018</xref>; Al-Otaibi and Al-Rasheed, <xref ref-type="bibr" rid="j_infor562_ref_001">2022</xref>). However, spam reviews, which are fake or biased reviews, have become a significant problem, leading to distrust and confusion among consumers (Bian <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_004">2021</xref>). Accordingly, detecting spam reviews is challenging due to the variety of spamming techniques used by spammers; hence, researchers have proposed various approaches for spam review detection (Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_059">2018</xref>). These techniques are based on <sc>ml</sc> methods (Albayati and Altamimi, <xref ref-type="bibr" rid="j_infor562_ref_002">2019</xref>; Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_031">2019</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_051">2022</xref>) and social network analysis (Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_030">2016</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_051">2022</xref>). A representative example of the latter is the work by Rathore <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_039">2021</xref>) on fake reviewer group detection. Their offline graph-based solution, where nodes and edges represent reviewers and products reviewed, respectively, combines the DeepWalk algorithm with semi-supervised clustering. The authors do not perform textual analysis of the reviews, except sentiment analysis.</p>
<p>Spam detection involves large volumes of data, which can be dynamic and continuously changing (Wang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_058">2019</xref>). In the case of data streams, not only are reviews continuously arriving, but their statistical properties may change over time, leading to concept and data drifts (Karakaşlı <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_026">2019</xref>). On the one hand, the volume and speed of online reviews require the adoption of online spam detection techniques (Miller <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_035">2014</xref>). On the other hand, outcome explainability is crucial for humans to comprehend, trust, and manage the next generation of cyber defense mechanisms such as spam detection (Charmet <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_007">2022</xref>). Therefore, this section compares existing works in terms of (<italic>i</italic>) stream-based profile modelling for spam detection, (<italic>ii</italic>) stream-based classification mechanisms, and (<italic>iii</italic>) transparency and credibility in detection tasks.</p>
<sec id="j_infor562_s_003">
<label>2.1</label>
<title>Profiling and Classification</title>
<p>Profiling is the process of modelling stakeholders according to their contributions and interactions (Kakar <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_025">2021</xref>; García-Méndez <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_018">2022b</xref>). In the case of spam detection, individual profiles are built from the content generated by each stakeholder, humans or bots alike. To overcome information sparsity, the profiles are expected to include side and content information (Faris <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_014">2019</xref>; Mohawesh <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>), since a richer profile impacts the quality of <sc>ml</sc> results (Rustam <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_045">2021</xref>). In particular, with stream-based modelling, profiles are incrementally updated and refined over time (Veloso <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_055">2019</xref>, <xref ref-type="bibr" rid="j_infor562_ref_056">2020</xref>). Concerning online spam detection, the literature considers two primary profiling methodologies:</p>
<def-list><def-item><term><bold>Content-based</bold></term><def>
<p>profiling explores textual features extracted from the text to identify the meaning of the content (Song <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_048">2016</xref>; Henke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_023">2021</xref>; Mohawesh <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>). It can be obtained using linguistic and semantic knowledge or style analysis via <sc>nlp</sc> approaches.</p></def></def-item><def-item><term><bold>User-based</bold></term><def>
<p>profiling focuses on both the demographic and the behavioural activity of the user (Miller <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_035">2014</xref>; Eshraqi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_013">2015</xref>; Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_030">2016</xref>, <xref ref-type="bibr" rid="j_infor562_ref_031">2019</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_051">2022</xref>). It contemplates demographic information, frequency, timing, and content of posts to distinguish legitimate from spammer users. In addition, exploiting the social graph can be relevant since spammers have many followers or friends who are also suspected of being spammers.</p></def></def-item></def-list>
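<p>The incremental update of stream-based profiles can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited work: the class <monospace>UserProfile</monospace>, the function <monospace>process_review</monospace>, and the two features (average word count as a content-based feature, average rating as a user-based one) are hypothetical names and choices introduced only for this example.</p>
<preformat><![CDATA[
```python
from collections import defaultdict


class UserProfile:
    """Running per-user statistics, updated one review at a time.

    The features kept here are hypothetical and chosen for illustration only.
    """

    def __init__(self):
        self.n_reviews = 0
        self.avg_words = 0.0   # content-based feature
        self.avg_rating = 0.0  # user-based (behavioural) feature

    def update(self, text, rating):
        """Fold one new review into the profile without storing past reviews."""
        self.n_reviews += 1
        n_words = len(text.split())
        # Incremental mean: m_n = m_(n-1) + (x_n - m_(n-1)) / n
        self.avg_words += (n_words - self.avg_words) / self.n_reviews
        self.avg_rating += (rating - self.avg_rating) / self.n_reviews


# One profile per user, created lazily when the first review arrives.
profiles = defaultdict(UserProfile)


def process_review(user_id, text, rating):
    """Update and return the profile of the user who wrote the review."""
    profile = profiles[user_id]
    profile.update(text, rating)
    return profile
```
]]></preformat>
<p>Because each update only needs the previous averages and the review count, the memory used per user stays constant regardless of the length of the stream.</p>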
<p>Spam detection is a classification task (Vaitkevicius and Marcinkevicius, <xref ref-type="bibr" rid="j_infor562_ref_054">2020</xref>; Mohawesh <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>). The main classification techniques encompass supervised, semi-supervised, unsupervised, and deep learning approaches (Crawford <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_009">2015</xref>) and can be applied offline or online. While offline or batch processing builds static models from pre-existing data sets, online or stream-based processing computes incremental models from live data streams (Leal <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_028">2021</xref>). This paper focuses on stream-based environments. Regarding transparency, classification models can be divided into interpretable and opaque. Opaque mechanisms behave as black boxes (e.g. deep learning), and interpretable models are self-explainable (e.g. tree- or neighbour-based algorithms) (Carvalho <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_006">2019</xref>). Interpretable classifiers explain classification outcomes, clarifying why a given content is false or misleading (Škrlj <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_046">2021</xref>).</p>
</sec>
<sec id="j_infor562_s_004">
<label>2.2</label>
<title>Stream-Based Spam Detection Approaches</title>
<p>Social networking has increased spam activity (Kaur <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_027">2018</xref>). In this context, spam detection approaches have been explored by social networks (e.g. Twitter,<xref ref-type="fn" rid="j_infor562_fn_001">1</xref><fn id="j_infor562_fn_001"><label><sup>1</sup></label>
<p>Available at <uri>https://twitter.com</uri>, May 2024.</p></fn> or Facebook<xref ref-type="fn" rid="j_infor562_fn_002">2</xref><fn id="j_infor562_fn_002"><label><sup>2</sup></label>
<p>Available at <uri>https://www.facebook.com</uri>, May 2024.</p></fn>) (Miller <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_035">2014</xref>; Eshraqi <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_013">2015</xref>; Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_030">2016</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_051">2022</xref>), email boxes (Henke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_023">2021</xref>), or crowdsourcing platforms (e.g. Wikipedia,<xref ref-type="fn" rid="j_infor562_fn_003">3</xref><fn id="j_infor562_fn_003"><label><sup>3</sup></label>
<p>Available at <uri>https://es.wikipedia.org</uri>, May 2024.</p></fn> Yelp,<xref ref-type="fn" rid="j_infor562_fn_004">4</xref><fn id="j_infor562_fn_004"><label><sup>4</sup></label>
<p>Available at <uri>https://yelp.com</uri>, May 2024.</p></fn> and TripAdvisor<xref ref-type="fn" rid="j_infor562_fn_005">5</xref><fn id="j_infor562_fn_005"><label><sup>5</sup></label>
<p>Available at <uri>https://www.tripadvisor.com</uri>, May 2024.</p></fn>) (Mohawesh <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>). Due to the speed and volume of the data, stream mining has become the most effective spam detection approach. It has been explored in the literature using: 
<list>
<list-item id="j_infor562_li_001">
<label>•</label>
<p><bold>Data stream clustering</bold> approaches. Miller <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_035">2014</xref>) treated spam detection as an anomaly prediction problem. The proposed solution identifies spammers on Twitter using account information and streaming tweets employing stream-based clustering algorithms. Eshraqi <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_013">2015</xref>) followed the same methodology, creating clusters of tweets and considering outliers as spam. Song <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_048">2016</xref>) proposed a new ensemble approach named Dynamic Clustering Forest (<sc>dcf</sc>) for the classification of textual streams, which combines decision trees and clustering algorithms.</p>
</list-item>
<list-item id="j_infor562_li_002">
<label>•</label>
<p><bold>Data stream classification</bold> for spam detection. Sun <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_051">2022</xref>) proposed a near real-time Twitter spam detection system employing multiple classification algorithms and parallel computing.</p>
</list-item>
<list-item id="j_infor562_li_003">
<label>•</label>
<p><bold>Outlier detection for stream data</bold>. The solution proposed by Liu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_031">2019</xref>) identifies outlier reviews, analyses the differences between the patterns of product reviews, and employs an isolation forest algorithm.</p>
</list-item>
</list>
</p>
<sec id="j_infor562_s_005">
<label>2.2.1</label>
<title>Drifts in Spam Detection</title>
<p>Model drift occurs when an <sc>ml</sc> model loses accuracy over time (Ma <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_033">2023</xref>). The literature identifies two types of drifts: (<italic>i</italic>) data drifts and (<italic>ii</italic>) concept drifts. While data drift occurs when the characteristics of the incoming data change, in concept drifts, both input and output distributions present modifications over time (Desale <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_010">2023</xref>). According to Gama <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_016">2014</xref>), concept drift detection methods can be divided into three categories: (<italic>i</italic>) sequential analysis, (<italic>ii</italic>) statistical analysis, and (<italic>iii</italic>) sliding windows. In addition, for Lu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_032">2018</xref>), drift detection involves four stages: (<italic>i</italic>) data retrieval, (<italic>ii</italic>) data modelling, (<italic>iii</italic>) test statistics calculation, and (<italic>iv</italic>) hypothesis test.</p>
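<p>The four stages listed by Lu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_032">2018</xref>) can be illustrated with a minimal window-based detector for a single numeric input feature. The sketch below is an assumption-laden example rather than the procedure of any cited work: the class <monospace>WindowDriftDetector</monospace>, the function <monospace>detect_drifts</monospace>, the window size, the z-score statistic, and the threshold are all hypothetical choices made for illustration.</p>
<preformat><![CDATA[
```python
import math
from collections import deque
from statistics import mean, pstdev


class WindowDriftDetector:
    """Flags a drift when a sliding window departs from a static reference window."""

    def __init__(self, window_size=50, threshold=3.0):
        self.window_size = window_size
        self.threshold = threshold                # hypothetical z-score cut-off
        self.reference = []                       # static window
        self.recent = deque(maxlen=window_size)   # sliding window

    def update(self, value):
        """Consume one feature value; return True if a drift is flagged."""
        # Stage 1, data retrieval: fill the static window, then the sliding one.
        if len(self.reference) < self.window_size:
            self.reference.append(value)
            return False
        self.recent.append(value)
        if len(self.recent) < self.window_size:
            return False
        # Stage 2, data modelling: summarise the windows by mean and spread.
        ref_mean = mean(self.reference)
        ref_std = pstdev(self.reference)
        cur_mean = mean(self.recent)
        # Stage 3, test statistic: standardised difference of the window means.
        std_err = max(ref_std, 1e-9) / math.sqrt(self.window_size)
        z_score = abs(cur_mean - ref_mean) / std_err
        # Stage 4, hypothesis test: compare the statistic with the cut-off.
        if z_score > self.threshold:
            # Simple adaptation: the sliding window becomes the new reference.
            self.reference = list(self.recent)
            self.recent.clear()
            return True
        return False


def detect_drifts(stream, window_size=50, threshold=3.0):
    """Return the stream positions at which a drift is flagged."""
    detector = WindowDriftDetector(window_size, threshold)
    return [i for i, value in enumerate(stream) if detector.update(value)]
```
]]></preformat>
<p>Since the detector only inspects an input feature, it reacts to data drifts; detecting a concept drift would additionally require monitoring the relation between inputs and labels, for example through the error rate of the classifier.</p>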
<p>Liu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_030">2016</xref>) proposed and applied two online drift detection techniques to improve the classification of Twitter spam reviews: (<italic>i</italic>) fuzzy-based redistribution and (<italic>ii</italic>) asymmetric sampling. While the fuzzy-based redistribution technique explores information decomposition, asymmetric sampling balances the size of classes in the training data. Song <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_048">2016</xref>) analysed the distribution of textual information to identify concept drifts in a textual data classification approach. Moreover, Mohawesh <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>) employed a comprehensive analysis to address concept drift in detecting fake Yelp reviews. Finally, Henke <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_023">2021</xref>) monitored feature evolution based on the similarity between feature vectors to detect concept drifts in emails. The solution performs spam classification and concept drift detection as parallel and independent tasks.</p>
<p>In contrast to the previous drift detection works, the current approach adopts self-explainable models to provide explanations, increasing classification quality and user trust.</p>
</sec>
<sec id="j_infor562_s_006">
<label>2.2.2</label>
<title>Explainability</title>
<p>Explainable spam detection refers to explaining why an input was classified as spam. It promotes transparency and clarity, detailing why a particular review was flagged as spam (Stites <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_050">2021</xref>). Accordingly, interpretable models, such as rule-based systems or decision trees, can explain their reasoning, enhancing trust, reducing bias, and helping to discover additional insights (Rudin, <xref ref-type="bibr" rid="j_infor562_ref_044">2019</xref>). In addition, <sc>nlp</sc> enriches the explanations by adding a textual description (Upadhyay <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_053">2021</xref>). Explainable spam detection has been explored in the literature using Local Interpretable Model-agnostic Explanations (<sc>lime</sc>) (Ribeiro <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_042">2016</xref>) and SHapley Additive exPlanations (<sc>shap</sc>) (Reis <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_040">2019</xref>; Han <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_022">2022</xref>; Zhang <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_062">2022</xref>).</p>
<p>The literature shows that existing explainable detectors of fake content in online platforms adopt essentially supervised classification and implement offline processing (Crawford <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_009">2015</xref>; Henke <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_023">2021</xref>). Therefore, this paper intends to address this problem by proposing an online solution for identifying and explaining spam reviews, incorporating data drift detection and adaptation.</p>
</sec>
</sec>
<sec id="j_infor562_s_007">
<label>2.3</label>
<title>Research Contribution</title>
<p>The literature review shows a research gap in detecting data drifts and explaining the classification of textual reviews as spam in real time. In this respect, Rao <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_038">2021</xref>) identify spam drift detection as a challenge requiring more research. Table <xref rid="j_infor562_tab_001">1</xref> provides an overview of the above works considering the data domain, profiling (user- and content-based), spam detection, drift detection, and explainability.</p>
<p>Therefore, this work contributes an online explainable classification method to recognize spam reviews and, thus, promote trust in digital media. The solution employs data stream processing, updating profiles and classifying each incoming event. First, user profiles are built using user- and content-based features engineered through <sc>nlp</sc>. Then, the proposed system monitors the incoming streams to detect data drifts using static and sliding windows. Tree-based classifiers are exploited to obtain an interpretable stream-based classification. Finally, the proposed method provides the user with a dashboard combining visual data and natural language knowledge to explain why an incoming review was classified as spam.</p>
<table-wrap id="j_infor562_tab_001">
<label>Table 1</label>
<caption>
<p>Comparison of stream-based spam and drift detection approaches.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Authorship</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Domain</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Profiling</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Spam detection</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Drift detection</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Explainability</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left">Liu <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_030">2016</xref>)</td>
<td style="vertical-align: top; text-align: left">Twitter</td>
<td style="vertical-align: top; text-align: left">Content, User</td>
<td style="vertical-align: top; text-align: left">Classification (Multiple)</td>
<td style="vertical-align: top; text-align: left">Data</td>
<td style="vertical-align: top; text-align: left">✗</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Song <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_048">2016</xref>)</td>
<td style="vertical-align: top; text-align: left">Spam</td>
<td style="vertical-align: top; text-align: left">Content</td>
<td style="vertical-align: top; text-align: left">Clustering (<sc>dt</sc>)</td>
<td style="vertical-align: top; text-align: left">Concept</td>
<td style="vertical-align: top; text-align: left">✗</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Mohawesh <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>)</td>
<td style="vertical-align: top; text-align: left">Yelp</td>
<td style="vertical-align: top; text-align: left">Content</td>
<td style="vertical-align: top; text-align: left">Classification (<sc>lr</sc>, <sc>pnn</sc>, <sc>svm</sc>)</td>
<td style="vertical-align: top; text-align: left">Concept</td>
<td style="vertical-align: top; text-align: left">✗</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Henke <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_023">2021</xref>)</td>
<td style="vertical-align: top; text-align: left">Email</td>
<td style="vertical-align: top; text-align: left">Content</td>
<td style="vertical-align: top; text-align: left">Classification (<sc>svm</sc>)</td>
<td style="vertical-align: top; text-align: left">Concept</td>
<td style="vertical-align: top; text-align: left">✗</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>Proposed solution</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Yelp</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Content User</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Classification (<sc>dt</sc>, <sc>rf</sc>)</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Data</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">✓</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
 <p>DT – Decision Tree, LR – Logistic Regression, PNN – Perceptron Neural Network, RF – Random Forest, SVM – Support Vector Machine.</p> 
</table-wrap-foot>
</table-wrap>
<p>As previously explained, concept drift refers to changes in the predicted target over time (i.e. changes in the statistical properties of the spam and non-spam entries), while data drift focuses on input data variations (i.e. changes in the input features used to train the spam detection model). This work focuses on data drift detection, considering its relationship with the transparency of the model. Specifically, detecting data drifts and associated characteristics helps provide richer information to end users via the explainability dashboard. Although no other work has explored the Yelp dataset for data drift and spam detection, work on other topics, such as sentiment analysis, indicates its suitability (Chumakov <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_008">2023</xref>; Madaan <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_034">2023</xref>; Wu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_060">2023</xref>).</p>
</sec>
</sec>
<sec id="j_infor562_s_008" sec-type="methods">
<label>3</label>
<title>Method</title>
<p>The proposed method performs stream-based spam classification of online reviews with drift detection. In addition, it relies on self-explainable <sc>ml</sc> models for transparency. Hence, the data stream classification pipeline, represented in Fig. <xref rid="j_infor562_fig_001">1</xref>, comprises: (<italic>i</italic>) feature engineering &amp; incremental profiling (Section <xref rid="j_infor562_s_009">3.1</xref>), (<italic>ii</italic>) feature selection (Section <xref rid="j_infor562_s_010">3.2</xref>), (<italic>iii</italic>) data drift detection &amp; adaptation (Section <xref rid="j_infor562_s_011">3.3</xref>), (<italic>iv</italic>) <sc>ml</sc> classification (Section <xref rid="j_infor562_s_012">3.4</xref>), and (<italic>v</italic>) explainability (Section <xref rid="j_infor562_s_013">3.5</xref>).</p>
<fig id="j_infor562_fig_001">
<label>Fig. 1</label>
<caption>
<p>Data stream classification pipeline.</p>
</caption>
<graphic xlink:href="infor562_g001.jpg"/>
</fig>
<sec id="j_infor562_s_009">
<label>3.1</label>
<title>Feature Engineering &amp; Incremental Profiling</title>
<p>The proposed solution processes the content of the reviews with the help of <sc>nlp</sc> techniques. The content-based features extracted represent relevant linguistic (morphological, syntactical, and semantic) attributes of the reviews. The engineered features are the ratios of adjectives, adverbs, interjections, nouns, pronouns, punctuation marks, and verbs, together with counters of characters, words, difficult words, and <sc>url</sc> instances. Moreover, the system considers the emotional charge of the content (i.e. anger, fear, happiness, sadness, and surprise) and its polarity (negative, neutral, or positive sentiment). More sophisticated linguistic features include readability, using the Flesch readability score, the McAlpine <sc>eflaw</sc> score,<xref ref-type="fn" rid="j_infor562_fn_006">6</xref><fn id="j_infor562_fn_006"><label><sup>6</sup></label>
<p>A value higher than 25 points is unfavorable.</p></fn> and the reading time. Finally, the content itself, i.e. the words, is analysed through word-grams. Char-grams were discarded due to their low scalability in online operation. These content-based features are then used to incrementally build the corresponding user values to update the user profiles. Additionally, incremental relational item features are computed by building a graph of item and user nodes connected by edges containing the corresponding incremental engineered features of the user-reviewed items.</p>
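<p>The incremental profiling step can be illustrated with a minimal sketch that folds the content-based features of each new review into a per-user running average and maximum. The class name <monospace>UserProfile</monospace>, the feature names, and the sample values are hypothetical, and the sketch assumes that every review carries the same set of features; the actual implementation may differ.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


class UserProfile:
    """Running average and maximum of each content feature for one user."""

    def __init__(self):
        self.count = 0                  # reviews seen so far for this user
        self.avg = defaultdict(float)   # feature name -> running average
        self.max = {}                   # feature name -> running maximum

    def update(self, features):
        """Fold the features of one new review into the profile.

        Assumes every review provides the same feature names.
        """
        self.count += 1
        for name, value in features.items():
            # Incremental mean: avg_k = avg_{k-1} + (x_k - avg_{k-1}) / k
            self.avg[name] += (value - self.avg[name]) / self.count
            # Incremental maximum: keep the largest value seen so far
            self.max[name] = max(value, self.max.get(name, value))


# Hypothetical stream of (user id, content features) pairs
profiles = defaultdict(UserProfile)
stream = [
    ("user_a", {"word_count": 10, "adjective_ratio": 0.25}),
    ("user_a", {"word_count": 30, "adjective_ratio": 0.75}),
    ("user_b", {"word_count": 5, "adjective_ratio": 0.5}),
]
for user_id, features in stream:
    profiles[user_id].update(features)
```
]]></preformat>
<p>Because each update only touches the stored average, maximum, and counter, the profile never needs to revisit past reviews, which suits online operation.</p>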
</sec>
<sec id="j_infor562_s_010">
<label>3.2</label>
<title>Feature Selection</title>
<p>Feature selection reduces the feature space dimension by choosing the most relevant features for the classification and contributes to improving the quality of the input data. The adopted selection technique relies on feature variance to discard those with variance lower than a configurable threshold, as suggested by the literature (Engelbrecht <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_012">2019</xref>; Treistman <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_052">2022</xref>). In the case of online classification, where the arriving data may evolve with time, the selection of representative features must be performed continuously or periodically.</p>
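<p>As an illustration, variance-based selection can be maintained over a stream with a running variance per feature. The sketch below uses Welford's online algorithm, which is an assumption made for illustration and not necessarily the technique used by the authors; the names <monospace>RunningVariance</monospace> and <monospace>select_features</monospace>, the sample values, and the threshold are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
class RunningVariance:
    """Welford's online algorithm for the variance of one feature."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; zero until at least one sample is seen
        return self.m2 / self.n if self.n else 0.0


def select_features(trackers, threshold):
    """Keep the features whose variance is not lower than the threshold."""
    return sorted(name for name, t in trackers.items() if t.variance >= threshold)


# Hypothetical samples: url_count never changes, word_count does
samples = [
    {"url_count": 0.0, "word_count": 10.0},
    {"url_count": 0.0, "word_count": 30.0},
    {"url_count": 0.0, "word_count": 20.0},
]
trackers = {}
for sample in samples:
    for name, value in sample.items():
        trackers.setdefault(name, RunningVariance()).update(value)

selected = select_features(trackers, threshold=0.001)
```
]]></preformat>
<p>Calling <monospace>select_features</monospace> periodically on the same trackers re-evaluates the selection as the stream evolves, without storing past samples.</p>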
</sec>
<sec id="j_infor562_s_011">
<label>3.3</label>
<title>Data Drift Detection and Adaptation</title>
<p>The variability of real data over time may affect the performance of <sc>ml</sc> models, namely the values of evaluation metrics (e.g. accuracy, precision, and recall). However, this degradation may stem from data drifts, concept drifts, ineffective hyperparameter optimization, and/or class imbalance.</p>
<p>Thus, the proposed system continuously monitors the incoming stream for data drifts and, periodically, under-samples and optimizes the hyperparameters, using two windows: the past (<sc>p</sc>) static window and the current adaptive (<sc>ca</sc>) sliding window, holding <italic>n</italic> and <italic>w</italic> samples, respectively.</p>
<p>The data drift detector starts operating once the cold start ends and the <sc>p</sc> window is initialized with the expected <italic>n</italic> samples. The detector identifies a data drift whenever: (<italic>i</italic>) the inter-window word-gram <italic>p-value</italic> is lower<xref ref-type="fn" rid="j_infor562_fn_007">7</xref><fn id="j_infor562_fn_007"><label><sup>7</sup></label>
<p>In a modern language, the most frequent words in a text are not expected to vary over time, leading to <italic>p-values</italic> greater than 0.05. However, the contents and the words within spam texts are anticipated to vary over time, resulting in <italic>p-values</italic> below 0.05.</p></fn> than 0.05, and (<italic>ii</italic>) the inter-window <italic>absolute accuracy difference</italic> (<sc>aad</sc>) is higher than 0.05. Algorithm <xref rid="j_infor562_fig_002">1</xref> details the data drift detection and adaptation process. The threshold values of 0.05, 0.1, and 0.5 were inspired by the works of Solari <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_047">2017</xref>), Leo and Sardanelli (<xref ref-type="bibr" rid="j_infor562_ref_029">2020</xref>), and Ritu Aggrawal (<xref ref-type="bibr" rid="j_infor562_ref_043">2021</xref>), respectively. Figure <xref rid="j_infor562_fig_003">2</xref> illustrates this process. The data drift detector works as follows:</p>
<list>
<list-item id="j_infor562_li_004">
<label>•</label>
<p>Calculates the word-gram frequency matrices (i.e. the columns represent the word-grams and the rows, the entries) for the <sc>p</sc> and <sc>ca</sc> windows.</p>
</list-item>
<list-item id="j_infor562_li_005">
<label>•</label>
<p>Transforms these matrices into vectors (one for <sc>p</sc> and one for <sc>ca</sc>) with the sum_wordgrams method, which sums the word-gram frequencies over all entries.</p>
</list-item>
<list-item id="j_infor562_li_006">
<label>•</label>
<p>Discards the columns with a frequency lower than 6 in both sum_wordgrams vectors.</p>
</list-item>
<list-item id="j_infor562_li_007">
<label>•</label>
<p>Computes the <italic>p-value</italic> between the word-gram frequency vectors of the <sc>p</sc> and <sc>ca</sc> windows.</p>
</list-item>
<list-item id="j_infor562_li_008">
<label>•</label>
<p>Computes the inter-window <sc>aad</sc>.</p>
</list-item>
<list-item id="j_infor562_li_009">
<label>•</label>
<p>Updates the size of the <sc>ca</sc>:</p>
<list>
<list-item id="j_infor562_li_010">
<label>–</label>
<p>If the <italic>p-value</italic> ⩽ 0.1, the <sc>ca</sc> window size decrements by one.</p>
</list-item>
<list-item id="j_infor562_li_011">
<label>–</label>
<p>If the <italic>p-value</italic> &gt; 0.1 and <italic>p-value</italic> &lt; 0.5, the <sc>ca</sc> window size remains unchanged.</p>
</list-item>
<list-item id="j_infor562_li_012">
<label>–</label>
<p>If the <italic>p-value</italic> ⩾ 0.5, the <sc>ca</sc> window size increments by one.</p>
</list-item>
</list>
</list-item>
<list-item id="j_infor562_li_013">
<label>•</label>
<p>Identifies a data drift when the inter-window word-gram <italic>p-value</italic> is lower than or equal to 0.05 and the inter-window <sc>aad</sc> is higher than or equal to 0.05. Then, it replaces the <sc>p</sc> window with the <sc>ca</sc> window and recalculates the optimal hyperparameters. The hyperparameter_computation method applies an exhaustive search technique over the configuration parameters listed in Fig. <xref rid="j_infor562_fig_004">3</xref>. Ultimately, the <sc>ml</sc> model is trained using the ml_update function with the selected hyperparameters and the <sc>ca</sc> samples.</p>
</list-item>
</list>
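<p>The steps above can be sketched in Python as follows. This is a simplified illustration and not the authors' code: the statistical test that yields the <italic>p-value</italic> is not named in this section, so the <italic>p-value</italic> is taken as an input; the rule for discarding rare word-grams is read as dropping a column only when its summed frequency is below 6 in both windows; and the lower bound of one sample on the window size is an assumption. Function names other than sum_wordgrams are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
def sum_wordgrams(matrix):
    """Collapse an entries-by-word-grams count matrix into column sums."""
    return [sum(column) for column in zip(*matrix)]


def discard_rare(p_vec, ca_vec, min_freq=6):
    """Drop word-grams whose summed frequency is below min_freq in both windows."""
    kept = [(p, c) for p, c in zip(p_vec, ca_vec) if p >= min_freq or c >= min_freq]
    return [p for p, _ in kept], [c for _, c in kept]


def update_window_size(size, p_value):
    """Shrink, keep, or grow the current adaptive window by one sample."""
    if p_value <= 0.1:
        return max(1, size - 1)  # lower bound of 1 is an assumption
    if p_value >= 0.5:
        return size + 1
    return size


def is_data_drift(p_value, acc_p, acc_ca, threshold=0.05):
    """Drift requires a low p-value and a large absolute accuracy difference."""
    return p_value <= threshold and abs(acc_p - acc_ca) >= threshold


# Toy count matrices: rows are entries, columns are word-grams
p_matrix = [[3, 1, 0], [4, 2, 1]]
ca_matrix = [[1, 0, 5], [2, 1, 6]]
p_vec, ca_vec = discard_rare(sum_wordgrams(p_matrix), sum_wordgrams(ca_matrix))
```
]]></preformat>
<p>In this toy case, the second word-gram is discarded because it is rare in both windows, while the third is kept because it is frequent in the <sc>ca</sc> window. The filtered vectors would then be passed to the chosen statistical test to obtain the <italic>p-value</italic>.</p>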
<fig id="j_infor562_fig_002">
<label>Algorithm 1</label>
<caption>
<p>: <bold>Data drift detection and classification</bold></p>
</caption>
<graphic xlink:href="infor562_g002.jpg"/>
</fig>
<fig id="j_infor562_fig_003">
<label>Fig. 2</label>
<caption>
<p>Data drift detection and adaptation.</p>
</caption>
<graphic xlink:href="infor562_g003.jpg"/>
</fig>
</sec>
<sec id="j_infor562_s_012">
<label>3.4</label>
<title>ML Classification</title>
<p>The following online <sc>ml</sc> algorithms were used as they exhibited good performance in similar classification problems (Liu <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_030">2016</xref>; Song <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_048">2016</xref>; Sun <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_051">2022</xref>).</p>
<list>
<list-item id="j_infor562_li_014">
<label>•</label>
<p><bold>Hoeffding Tree Classifier</bold> (<sc>htc</sc>) (Pham <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_037">2017</xref>) is the basic decision tree model for online learning.</p>
</list-item>
<list-item id="j_infor562_li_015">
<label>•</label>
<p><bold>Hoeffding Adaptive Tree Classifier</bold> (<sc>hatc</sc>) (Stirling <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_049">2018</xref>) monitors branches and replaces them based on their performance.</p>
</list-item>
<list-item id="j_infor562_li_016">
<label>•</label>
<p><bold>Adaptive Random Forest Classifier</bold> (<sc>arfc</sc>) (Gomes <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_020">2017</xref>) is an ensemble of trees with diversity induction through random re-sampling and concept drift detection. The prediction results are obtained using majority voting.</p>
</list-item>
</list>
<p>The algorithmic performance assessment follows the prequential evaluation protocol (Gama <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_015">2013</xref>) and considers accuracy, macro- and micro-averaging <italic>F</italic>-measure, and run-time metrics.</p>
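<p>The prequential (test-then-train) protocol can be illustrated with a minimal sketch in which each sample is first used for prediction and only afterwards for learning. The toy <monospace>MajorityClassifier</monospace>, its method names, and the sample stream are hypothetical stand-ins for the tree-based models listed above.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


class MajorityClassifier:
    """Toy incremental model: predicts the most frequent label seen so far."""

    def __init__(self, default=0):
        self.counts = Counter()
        self.default = default

    def predict_one(self, x):
        # Before any label is seen, fall back to the default class
        return self.counts.most_common(1)[0][0] if self.counts else self.default

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Test-then-train: predict each sample before the model learns from it."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict_one(x) == y)
        total += 1
        model.learn_one(x, y)
    return correct / total if total else 0.0


# Hypothetical labelled stream (1 = spam, 0 = non-spam)
stream = [({"word_count": 12}, 0), ({"word_count": 3}, 1),
          ({"word_count": 20}, 0), ({"word_count": 18}, 0)]
accuracy = prequential_accuracy(MajorityClassifier(), stream)
```
]]></preformat>
<p>Since every prediction is made on a sample the model has not yet seen, the resulting accuracy reflects performance on unseen data without a separate hold-out set.</p>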
</sec>
<sec id="j_infor562_s_013">
<label>3.5</label>
<title>Explainability</title>
<p>This module provides information about the most relevant features for the classification, i.e. those with a frequency of appearance greater than a configurable threshold. This information is extracted from the estimators of the tree models used (see Fig. <xref rid="j_infor562_fig_004">3</xref>): <sc>htc</sc> single-model estimator (Pham <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_037">2017</xref>), <sc>hatc</sc> single-model estimator (Stirling <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_049">2018</xref>) and <sc>arfc</sc> multi-model estimator (Gomes <italic>et al.</italic>, <xref ref-type="bibr" rid="j_infor562_ref_020">2017</xref>). The predictions regarding the most relevant features and data drift detection are described in natural language. Furthermore, the decision tree path followed is also provided, along with an automatic description obtained from a Large Language Model.</p>
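<p>As a rough illustration of the frequency-based relevance criterion, the sketch below counts how often each feature appears in the split nodes of one or more trees and keeps those above a threshold. The nested-dictionary tree representation, the function names, and the threshold value are hypothetical; the estimators of the actual models expose their structure differently.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def count_split_features(node, counts=None):
    """Count how often each feature is used in the split nodes of one tree."""
    counts = Counter() if counts is None else counts
    if "feature" in node:  # leaves carry only a "label"
        counts[node["feature"]] += 1
        for child in node["children"]:
            count_split_features(child, counts)
    return counts


def relevant_features(trees, min_count):
    """Features appearing more than min_count times across all trees."""
    total = Counter()
    for tree in trees:
        count_split_features(tree, total)
    return sorted(name for name, n in total.items() if n > min_count)


# Hypothetical tree: url_count is used in two splits, word_count in one
tree = {
    "feature": "url_count",
    "children": [
        {"label": "non-spam"},
        {"feature": "word_count", "children": [
            {"feature": "url_count",
             "children": [{"label": "spam"}, {"label": "non-spam"}]},
            {"label": "spam"},
        ]},
    ],
}
top = relevant_features([tree], min_count=1)
```
]]></preformat>
<p>Passing a single tree corresponds to the single-model estimators, whereas passing a list of trees mirrors the multi-model case, where the counts are accumulated over the ensemble.</p>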
<fig id="j_infor562_fig_004">
<label>Fig. 3</label>
<caption>
<p>Model hyperparameter configuration (best values in bold).</p>
</caption>
<graphic xlink:href="infor562_g004.jpg"/>
</fig>
</sec>
</sec>
<sec id="j_infor562_s_014">
<label>4</label>
<title>Experimental Results</title>
<p>This section describes the experimental data set (Section <xref rid="j_infor562_s_015">4.1</xref>) and the implementation of the different modules<xref ref-type="fn" rid="j_infor562_fn_008">8</xref><fn id="j_infor562_fn_008"><label><sup>8</sup></label>
<p>Code available at <uri>https://github.com/nlpgti/data_drift</uri></p></fn>: (<italic>i</italic>) feature engineering &amp; incremental profiling (Section <xref rid="j_infor562_s_016">4.2</xref>), (<italic>ii</italic>) feature selection (Section <xref rid="j_infor562_s_017">4.3</xref>), and (<italic>iii</italic>) data drift detection &amp; adaptation (Section <xref rid="j_infor562_s_018">4.4</xref>). The classification and explainability results are detailed in Section <xref rid="j_infor562_s_019">4.5</xref> and Section <xref rid="j_infor562_s_020">4.6</xref>, respectively.</p>
<p>The experiments contemplate four stream classification scenarios, incorporating feature selection, hyperparameter optimization<xref ref-type="fn" rid="j_infor562_fn_009">9</xref><fn id="j_infor562_fn_009"><label><sup>9</sup></label>
<p>Hyperparameter optimization was performed with 0.005% of the experimental samples in Section <xref rid="j_infor562_s_019">4.5</xref>.</p></fn> and incremental accuracy updating.</p>
<def-list><def-item><term/><def>
<p><bold>Scenario 1</bold>. The data stream classification runs on a single processing thread.</p></def></def-item><def-item><term/><def>
<p><bold>Scenario 2</bold>. The data stream classification runs on a range of 10–20 parallel threads based on the workload to reduce the experimental run-time. To preserve the original data distribution, the chronologically ordered data stream was divided into consecutive sub-streams, and then, each sub-stream was processed in a dedicated thread.</p></def></def-item><def-item><term/><def>
<p><bold>Scenario 3</bold>. The data stream classification includes data drift detection &amp; adaptation and runs according to scenario 2.</p></def></def-item><def-item><term/><def>
<p><bold>Scenario 4</bold>. The data stream classification runs on a single processing thread with data drift detection &amp; adaptation.<xref ref-type="fn" rid="j_infor562_fn_010">10</xref><fn id="j_infor562_fn_010"><label><sup>10</sup></label>
<p>Due to time limitations, this scenario is applied only to the best-performing classifier of the previous scenarios.</p></fn></p></def></def-item></def-list>
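<p>The partitioning used in scenario 2 can be sketched as follows: the chronologically ordered stream is cut into consecutive sub-streams, and each one is handed to its own worker thread. The function name <monospace>split_stream</monospace>, the number of parts, and the placeholder per-thread task are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
from concurrent.futures import ThreadPoolExecutor


def split_stream(samples, n_parts):
    """Split an ordered list into n_parts consecutive, near-equal sub-streams."""
    base, extra = divmod(len(samples), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts receive one additional sample
        end = start + base + (1 if i < extra else 0)
        parts.append(samples[start:end])
        start = end
    return parts


stream = list(range(10))  # stand-in for chronologically ordered reviews
sub_streams = split_stream(stream, 3)

# Each sub-stream is processed in its own thread; len() is a placeholder
# for the per-thread classification of that sub-stream.
with ThreadPoolExecutor(max_workers=3) as pool:
    sizes = list(pool.map(len, sub_streams))
```
]]></preformat>
<p>Consecutive slicing keeps the chronological order inside every sub-stream, which is what preserves the original data distribution within each thread.</p>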
<p>All experiments were performed using a server with the following specifications: 
<list>
<list-item id="j_infor562_li_017">
<label>•</label>
<p><bold>Operating System</bold>: Ubuntu 18.04.2 LTS 64-bit</p>
</list-item>
<list-item id="j_infor562_li_018">
<label>•</label>
<p><bold>Processor</bold>: Intel Core i9-10900K 2.80 GHz</p>
</list-item>
<list-item id="j_infor562_li_019">
<label>•</label>
<p><bold>RAM</bold>: 96 GB DDR4</p>
</list-item>
<list-item id="j_infor562_li_020">
<label>•</label>
<p><bold>Disk</bold>: 480 GB NVME + 500 GB SSD</p>
</list-item>
</list>
</p>
<sec id="j_infor562_s_015">
<label>4.1</label>
<title>Experimental Data Set</title>
<p>The Yelp data set<xref ref-type="fn" rid="j_infor562_fn_011">11</xref><fn id="j_infor562_fn_011"><label><sup>11</sup></label>
<p>Available at <uri>https://www.kaggle.com/datasets/abidmeeraj/yelp-labelled-dataset?select=Labelled+Yelp+Dataset.csv</uri>, May 2024.</p></fn> is composed of 359 052 leisure activity entries between October 2004 and January 2015, distributed between 36 885 and 322 167 samples of spam and non-spam content, respectively (see Table <xref rid="j_infor562_tab_002">2</xref>). Moreover, the MediaWiki data set<xref ref-type="fn" rid="j_infor562_fn_012">12</xref><fn id="j_infor562_fn_012"><label><sup>12</sup></label>
<p>Available from the corresponding author on reasonable request.</p></fn> contains contributions to travel wikis between August 2003 and June 2020. It is composed of 319 856 entries, distributed between 24 877 and 294 979 samples of spam and non-spam content, respectively (see Table <xref rid="j_infor562_tab_002">2</xref>).</p>
<table-wrap id="j_infor562_tab_002">
<label>Table 2</label>
<caption>
<p>Distribution of classes in the experimental data sets.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Data set</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Class</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Number of entries</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" style="vertical-align: top; text-align: left">Yelp</td>
<td style="vertical-align: top; text-align: left">Spam</td>
<td style="vertical-align: top; text-align: left">36885</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Non-spam</td>
<td style="vertical-align: top; text-align: left">322167</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><bold>Total</bold></td>
<td style="vertical-align: top; text-align: left">359052</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: top; text-align: left; border-bottom: solid thin">MediaWiki</td>
<td style="vertical-align: top; text-align: left">Spam</td>
<td style="vertical-align: top; text-align: left">24877</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Non-spam</td>
<td style="vertical-align: top; text-align: left">294979</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>Total</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">319856</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor562_s_016">
<label>4.2</label>
<title>Feature Engineering &amp; Incremental Profiling</title>
<table-wrap id="j_infor562_tab_003">
<label>Table 3</label>
<caption>
<p>Content-based features explored per experimental data set.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: center; border-top: solid thin; border-bottom: solid thin">Data set</td>
<td style="vertical-align: top; text-align: center; border-top: solid thin; border-bottom: solid thin">ID</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Name</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Description</td>
<td style="vertical-align: top; text-align: center; border-top: solid thin; border-bottom: solid thin">Type</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="17" style="vertical-align: middle; text-align: center; border-bottom: solid thin">Common</td>
<td style="vertical-align: top; text-align: center">1</td>
<td style="vertical-align: top; text-align: left">Adjective ratio</td>
<td style="vertical-align: top; text-align: left">Ratio of adjectives in the content</td>
<td rowspan="17" style="vertical-align: middle; text-align: center; border-bottom: solid thin">Engineered (Eng.)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">2</td>
<td style="vertical-align: top; text-align: left">Adverb ratio</td>
<td style="vertical-align: top; text-align: left">Ratio of adverbs in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">3</td>
<td style="vertical-align: top; text-align: left">Character count</td>
<td style="vertical-align: top; text-align: left">Number of characters in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">4</td>
<td style="vertical-align: top; text-align: left">Difficult word count</td>
<td style="vertical-align: top; text-align: left">Number of difficult words in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">5</td>
<td style="vertical-align: top; text-align: left">Emotion (anger, fear, happiness, sadness, surprise)</td>
<td style="vertical-align: top; text-align: left">Load of the different emotions in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">6</td>
<td style="vertical-align: top; text-align: left">Flesch readability</td>
<td style="vertical-align: top; text-align: left">Readability score of the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">7</td>
<td style="vertical-align: top; text-align: left">Interjection ratio</td>
<td style="vertical-align: top; text-align: left">Ratio of interjections in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">8</td>
<td style="vertical-align: top; text-align: left">McAlpine <sc>eflaw</sc> readability</td>
<td style="vertical-align: top; text-align: left">Readability score of the content for non-native English speakers</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">9</td>
<td style="vertical-align: top; text-align: left">Noun ratio</td>
<td style="vertical-align: top; text-align: left">Ratio of nouns in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">10</td>
<td style="vertical-align: top; text-align: left">Polarity</td>
<td style="vertical-align: top; text-align: left">Sentiment of the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">11</td>
<td style="vertical-align: top; text-align: left">Pronoun ratio</td>
<td style="vertical-align: top; text-align: left">Ratio of pronouns in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">12</td>
<td style="vertical-align: top; text-align: left">Punctuation ratio</td>
<td style="vertical-align: top; text-align: left">Ratio of punctuation marks in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">13</td>
<td style="vertical-align: top; text-align: left">Reading time</td>
<td style="vertical-align: top; text-align: left">Content reading time</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">14</td>
<td style="vertical-align: top; text-align: left"><sc>url</sc> count</td>
<td style="vertical-align: top; text-align: left">Number of <sc>url</sc> instances in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">15</td>
<td style="vertical-align: top; text-align: left">Verb ratio</td>
<td style="vertical-align: top; text-align: left">Ratio of verbs in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">16</td>
<td style="vertical-align: top; text-align: left">Word count</td>
<td style="vertical-align: top; text-align: left">Number of words in the content</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">17</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Word <italic>n</italic>-grams</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Single-word and two-word grams</td>
</tr>
</tbody><tbody>
<tr>
<td rowspan="2" style="vertical-align: middle; text-align: center; border-bottom: solid thin">Yelp</td>
<td style="vertical-align: top; text-align: center">18</td>
<td style="vertical-align: top; text-align: left">Rating-polarity deviation</td>
<td style="vertical-align: top; text-align: left">Rating deviation concerning the polarity of the content</td>
<td style="vertical-align: top; text-align: center">Eng.</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">19</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Review rating</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Rating of the review</td>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">Raw</td>
</tr>
</tbody><tbody>
<tr>
<td rowspan="8" style="vertical-align: middle; text-align: center; border-bottom: solid thin">MediaWiki</td>
<td style="vertical-align: top; text-align: center">20</td>
<td style="vertical-align: top; text-align: left">Bot flag</td>
<td style="vertical-align: top; text-align: left">The user is a bot</td>
<td rowspan="8" style="vertical-align: middle; text-align: center; border-bottom: solid thin">Raw</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">21</td>
<td style="vertical-align: top; text-align: left">Deleted flag</td>
<td style="vertical-align: top; text-align: left">Part of the revision content is hidden</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">22</td>
<td style="vertical-align: top; text-align: left">New flag</td>
<td style="vertical-align: top; text-align: left">It is the first revision of a page</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">23</td>
<td style="vertical-align: top; text-align: left">Revert flag</td>
<td style="vertical-align: top; text-align: left">The revision was reverted</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">24</td>
<td style="vertical-align: top; text-align: left">Size difference</td>
<td style="vertical-align: top; text-align: left">Difference in the number of characters added and deleted in the revision</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">25</td>
<td style="vertical-align: top; text-align: left">Edit quality</td>
<td style="vertical-align: top; text-align: left">False/true damaging &amp; good faith probability</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center">26</td>
<td style="vertical-align: top; text-align: left">Item quality</td>
<td style="vertical-align: top; text-align: left"><sc>a</sc>, <sc>b</sc>, <sc>c</sc>, <sc>d</sc>, <sc>e</sc> probability</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: center; border-bottom: solid thin">27</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Article quality</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><sc>ok</sc>, attack, vandalism, <sc>wp10b</sc>, <sc>wp10c</sc>, <sc>wp10fa</sc>, <sc>wp10ga</sc>, <sc>wp10start</sc>, <sc>wp10stub</sc> probability</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>This section details the implementations and <sc>nlp</sc> techniques used to create the classification features. Table <xref rid="j_infor562_tab_003">3</xref>, Table <xref rid="j_infor562_tab_004">4</xref>, and Table <xref rid="j_infor562_tab_005">5</xref> detail the content features, the incremental user features, and the incremental item features, respectively, for the Yelp and MediaWiki data sets.</p>
<p>The ratio features in Table <xref rid="j_infor562_tab_003">3</xref> (features 1, 2, 7, 9, 11, 12, and 15) are computed using the spaCy<xref ref-type="fn" rid="j_infor562_fn_013">13</xref><fn id="j_infor562_fn_013"><label><sup>13</sup></label>
<p>Available at <uri>https://spacy.io</uri>, May 2024.</p></fn> tool to gather their grammatical category (<monospace>token.pos_</monospace> feature). The character and word count (features 3 and 16, respectively) have been directly calculated with the Python <monospace>len</monospace> function.<xref ref-type="fn" rid="j_infor562_fn_014">14</xref><fn id="j_infor562_fn_014"><label><sup>14</sup></label>
<p>For feature 16, the text was first separated into word tokens.</p></fn> The <sc>url</sc> count (feature 14) was computed using a regular expression.<xref ref-type="fn" rid="j_infor562_fn_015">15</xref><fn id="j_infor562_fn_015"><label><sup>15</sup></label>
<p>Available at <uri>https://bit.ly/3N4GNM3</uri>, May 2024.</p></fn> The emotion (feature 5) and polarity (feature 10) are calculated using <monospace>Text2emotion</monospace><xref ref-type="fn" rid="j_infor562_fn_016">16</xref><fn id="j_infor562_fn_016"><label><sup>16</sup></label>
<p>Values between 0 and 1. Available at <uri>https://pypi.org/project/text2emotion</uri>, May 2024.</p></fn> and <monospace>TextBlob</monospace>,<xref ref-type="fn" rid="j_infor562_fn_017">17</xref><fn id="j_infor562_fn_017"><label><sup>17</sup></label>
<p>Values between −1 and 1. Available at <uri>https://pypi.org/project/spacytextblob</uri>, May 2024.</p></fn> respectively. The rating-polarity deviation is computed as the difference between those values after moving the polarity to a Likert scale<xref ref-type="fn" rid="j_infor562_fn_018">18</xref><fn id="j_infor562_fn_018"><label><sup>18</sup></label>
<p>Polarity_likert = 2.5*(polarity + 1).</p></fn> (feature 18). The system uses <monospace>Textstat</monospace><xref ref-type="fn" rid="j_infor562_fn_019">19</xref><fn id="j_infor562_fn_019"><label><sup>19</sup></label>
<p>Available at <uri>https://pypi.org/project/textstat</uri>, May 2024.</p></fn> for the readability (features 4, 6 and 8) and reading time (feature 13). Word-grams (single and bi-words, feature 17) are obtained with <monospace>CountVectorizer</monospace><xref ref-type="fn" rid="j_infor562_fn_020">20</xref><fn id="j_infor562_fn_020"><label><sup>20</sup></label>
<p>Available at <ext-link ext-link-type="uri" xlink:href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html">https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html</ext-link>, May 2024.</p></fn> with the <sc>hatc</sc> model as the meta-transformer, and using the following parameters: <monospace>max_df=0.7</monospace>, <monospace>min_df=0.1</monospace>.<xref ref-type="fn" rid="j_infor562_fn_021">21</xref><fn id="j_infor562_fn_021"><label><sup>21</sup></label>
<p>For the MediaWiki data set, min_df=0.01 since the reviews are shorter.</p></fn> For the word-grams generation, the review is pre-processed, removing non-textual characters (numbers, punctuation marks, and subsequent blank spaces), stop words,<xref ref-type="fn" rid="j_infor562_fn_022">22</xref><fn id="j_infor562_fn_022"><label><sup>22</sup></label>
<p>Available at <uri>https://gist.github.com/sebleier/554280</uri>, May 2024.</p></fn> and <sc>url</sc> instances. Then, the review text is lemmatized with spaCy using the <monospace>en_core_web_md</monospace> model.<xref ref-type="fn" rid="j_infor562_fn_023">23</xref><fn id="j_infor562_fn_023"><label><sup>23</sup></label>
<p>Available at <uri>https://spacy.io/models/en</uri>, May 2024.</p></fn> The drift detector exclusively uses the inter-window word-grams <italic>p-value</italic> variations.</p>
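<p>The rating-polarity deviation described above can be illustrated with a minimal sketch. It assumes that the polarity lies in the [−1, 1] interval and that the deviation is the star rating minus the rescaled polarity; the sign convention and the function names are illustrative assumptions, not taken from the system implementation.</p>
<preformat preformat-type="code"><![CDATA[
```python
def polarity_to_likert(polarity: float) -> float:
    """Map a polarity in [-1, 1] onto a 0-5 Likert-like scale."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError("polarity must lie in [-1, 1]")
    # Footnote formula: Polarity_likert = 2.5 * (polarity + 1)
    return 2.5 * (polarity + 1.0)


def rating_polarity_deviation(rating: float, polarity: float) -> float:
    """Difference between the star rating and the rescaled polarity.

    Assumption: the deviation is computed as rating minus polarity;
    the source only states that it is "the difference between those values".
    """
    return rating - polarity_to_likert(polarity)
```
]]></preformat>
<p>For instance, a one-star review with a clearly positive text (polarity 0.6) would yield a deviation of about −3 under these assumptions, signalling a mismatch between rating and content.</p>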
<p>Table <xref rid="j_infor562_tab_004">4</xref> and Table <xref rid="j_infor562_tab_005">5</xref> summarize the user incremental features (58 features) and item incremental features (92 features) generated from the content-based features in Table <xref rid="j_infor562_tab_003">3</xref>. The engineered features in Table <xref rid="j_infor562_tab_004">4</xref> and Table <xref rid="j_infor562_tab_005">5</xref> correspond to the incremental average <inline-formula id="j_infor562_ineq_001"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">avg</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${f_{{\textit{avg}_{{t_{k}}}}}}$]]></tex-math></alternatives></inline-formula> given by equation (<xref rid="j_infor562_eq_001">1</xref>) and the incremental maximum <inline-formula id="j_infor562_ineq_002"><alternatives><mml:math>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub></mml:math><tex-math><![CDATA[${f_{{\textit{max}_{{t_{k}}}}}}$]]></tex-math></alternatives></inline-formula> given by equation (<xref rid="j_infor562_eq_002">2</xref>), where <italic>f</italic> represents the feature and <inline-formula id="j_infor562_ineq_003"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">[</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mo>…</mml:mo>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo fence="true" stretchy="false">]</mml:mo></mml:math><tex-math><![CDATA[$[{f_{{t_{0}}}},{f_{{t_{1}}}},\dots ,{f_{{t_{k}}}}]$]]></tex-math></alternatives></inline-formula> the past feature data per user (or item). <disp-formula-group id="j_infor562_dg_001">
<disp-formula id="j_infor562_eq_001">
<label>(1)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">avg</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo><mml:mstyle displaystyle="true">
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mrow>
<mml:mo largeop="false" movablelimits="false">∑</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:mfrac>
</mml:mstyle>
<mml:mo mathvariant="normal">,</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}& {f_{{\textit{avg}_{{t_{k}}}}}}=\frac{{\textstyle\textstyle\sum _{i=0}^{k}}{f_{{t_{i}}}}}{k},\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
<disp-formula id="j_infor562_eq_002">
<label>(2)</label><alternatives><mml:math display="block">
<mml:mtable displaystyle="true" columnalign="right left" columnspacing="0pt">
<mml:mtr>
<mml:mtd class="align-odd"/>
<mml:mtd class="align-even">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mtext mathvariant="italic">max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">k</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo movablelimits="false">max</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="italic">t</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mtd>
</mml:mtr>
</mml:mtable></mml:math><tex-math><![CDATA[\[\begin{aligned}{}& {f_{{\textit{max}_{{t_{k}}}}}}=\underset{i}{\max }{f_{{t_{i}}}}.\end{aligned}\]]]></tex-math></alternatives>
</disp-formula>
</disp-formula-group></p>
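<p>A minimal sketch of how the incremental average and maximum could be maintained per user or item without storing the past feature data is given next. The class and function names are hypothetical, and the running mean is normalised by the number of observations seen so far, which is an assumption about the intended denominator in equation (<xref rid="j_infor562_eq_001">1</xref>).</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


class IncrementalStats:
    """Running average and maximum of one feature for one user or item."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.maximum = float("-inf")

    def update(self, value):
        """Fold a new observation into the running statistics."""
        self.count += 1
        # Assumption: normalise by the number of observations seen so far.
        self.mean += (value - self.mean) / self.count
        self.maximum = max(self.maximum, value)
        return self.mean, self.maximum


# One tracker per (entity, feature) pair, created on first use.
trackers = defaultdict(IncrementalStats)


def update_feature(entity_id, feature_id, value):
    """Update and return (average, maximum) for an entity and a feature."""
    return trackers[(entity_id, feature_id)].update(value)
```
]]></preformat>
<p>Each incoming review triggers one constant-time update per feature, so memory grows with the number of users and items rather than with the number of reviews.</p>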
<table-wrap id="j_infor562_tab_004">
<label>Table 4</label>
<caption>
<p>User engineered features for both experimental data sets.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">ID</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Name</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Description</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor562_ineq_004"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>28</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>81</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{28,81\}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">User features</td>
<td style="vertical-align: top; text-align: left">Incremental average and maximum per user regarding features 1 to 27 in Table <xref rid="j_infor562_tab_003">3</xref>.</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">82</td>
<td style="vertical-align: top; text-align: left">User post count</td>
<td style="vertical-align: top; text-align: left">Cumulative number of posts per user.</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">83</td>
<td style="vertical-align: top; text-align: left">User spam tendency</td>
<td style="vertical-align: top; text-align: left">Known spamming behaviour per user.</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">84</td>
<td style="vertical-align: top; text-align: left">User posting antiquity</td>
<td style="vertical-align: top; text-align: left">Posting antiquity per user (in weeks).</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">85</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">User posting frequency</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Weekly posting frequency per user.</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="j_infor562_tab_005">
<label>Table 5</label>
<caption>
<p>Item engineered features for both experimental data sets.</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">ID</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Name</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Description</td>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align: top; text-align: left"><inline-formula id="j_infor562_ineq_005"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>86</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>139</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{86,139\}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left">Item features</td>
<td style="vertical-align: top; text-align: left">Incremental average and maximum per item regarding features 1 to 27 in Table <xref rid="j_infor562_tab_003">3</xref>.</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><inline-formula id="j_infor562_ineq_006"><alternatives><mml:math>
<mml:mo fence="true" stretchy="false">{</mml:mo>
<mml:mn>140</mml:mn>
<mml:mo mathvariant="normal">,</mml:mo>
<mml:mn>177</mml:mn>
<mml:mo fence="true" stretchy="false">}</mml:mo></mml:math><tex-math><![CDATA[$\{140,177\}$]]></tex-math></alternatives></inline-formula></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Item and rating features</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Incremental average and maximum per item and rating regarding features 1 to 19 in Table <xref rid="j_infor562_tab_003">3</xref>.</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec id="j_infor562_s_017">
<label>4.3</label>
<title>Feature Selection</title>
<p>To reduce the dimension of the feature space, the variance of the features in Table <xref rid="j_infor562_tab_003">3</xref>, Table <xref rid="j_infor562_tab_004">4</xref> and Table <xref rid="j_infor562_tab_005">5</xref> is analysed using the <monospace>VarianceThreshold</monospace><xref ref-type="fn" rid="j_infor562_fn_024">24</xref><fn id="j_infor562_fn_024"><label><sup>24</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1/api/feature-selection/VarianceThreshold</uri>, May 2024.</p></fn> from River 0.11.1.<xref ref-type="fn" rid="j_infor562_fn_025">25</xref><fn id="j_infor562_fn_025"><label><sup>25</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1</uri>, May 2024.</p></fn> The threshold is set to 0, the default value. In the case of Yelp, only feature 14 in Table <xref rid="j_infor562_tab_003">3</xref> and its incremental versions in Table <xref rid="j_infor562_tab_004">4</xref> and Table <xref rid="j_infor562_tab_005">5</xref> were discarded. The discarded MediaWiki features include features 21 and 22 in Table <xref rid="j_infor562_tab_003">3</xref> and their incremental versions in Table <xref rid="j_infor562_tab_004">4</xref> and Table <xref rid="j_infor562_tab_005">5</xref>, along with the incremental version of feature 20 in Table <xref rid="j_infor562_tab_004">4</xref>. All remaining features passed the threshold and were, thus, considered relevant for the classification.</p>
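<p>The idea behind variance-based selection over a stream can be sketched as follows. This is not the River implementation: it is a simplified illustration that tracks the variance of each feature with Welford's online algorithm and keeps the features whose variance exceeds the threshold. All names are hypothetical.</p>
<preformat preformat-type="code"><![CDATA[
```python
class OnlineVariance:
    """Welford's algorithm for the population variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0


def select_features(stream, threshold=0.0):
    """Return the feature names whose variance exceeds the threshold.

    `stream` is an iterable of dicts mapping feature names to numbers.
    """
    trackers = {}
    for sample in stream:
        for name, value in sample.items():
            trackers.setdefault(name, OnlineVariance()).update(value)
    return {name for name, t in trackers.items() if t.variance > threshold}
```
]]></preformat>
<p>With a threshold of 0, only features that never change across the stream are dropped, which matches the small number of discarded features reported above.</p>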
</sec>
<sec id="j_infor562_s_018">
<label>4.4</label>
<title>Data Drift Detection and Adaptation</title>
<p>While standard online <sc>ml</sc> models can adapt to data changes over time, they are still affected by data drift, also known as covariate shift. To address this issue, scenario 3 incorporates data drift detection &amp; adaptation. Moreover, it sets the following parameters: (<italic>i</italic>) the cold start spans the first 500 samples, corresponding to the initial width of the <sc>p</sc> window; (<italic>ii</italic>) the maximum width of the <sc>ca</sc> sliding windows is 2000 samples. The proposed data drift detector determines the inter-window word-gram <italic>p-value</italic> and the inter-window <sc>aad</sc>, using the <monospace>chi2_contingency</monospace> function<xref ref-type="fn" rid="j_infor562_fn_026">26</xref><fn id="j_infor562_fn_026"><label><sup>26</sup></label>
<p>Available at <uri>https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html</uri>, May 2024.</p></fn> and the <monospace>accuracy_score</monospace> function,<xref ref-type="fn" rid="j_infor562_fn_027">27</xref><fn id="j_infor562_fn_027"><label><sup>27</sup></label>
<p>Available at <uri>https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html</uri>, May 2024.</p></fn> respectively.</p>
<p>Figure <xref rid="j_infor562_fig_005">4</xref> shows the evolution of the inter-window <sc>aad</sc> and word-gram <italic>p-value</italic>. The lens marks a detected data drift, which is flagged when the <italic>p-value</italic> drops to 0.05 and the <sc>aad</sc> is above 0.05.</p>
<fig id="j_infor562_fig_005">
<label>Fig. 4</label>
<caption>
<p>Data drift detection &amp; adaptation based on inter-window <sc>aad</sc> and word-gram <italic>p-value</italic>.</p>
</caption>
<graphic xlink:href="infor562_g005.jpg"/>
</fig>
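<p>The decision rule described above can be sketched as follows. The sketch assumes that the <sc>aad</sc> is the absolute difference between the accuracies of two consecutive windows and that the <italic>p-value</italic> has already been obtained from a chi-square test on the word-gram counts of both windows; both assumptions, as well as the function names, are illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
def accuracy(y_true, y_pred):
    """Fraction of matching labels in one window."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("windows must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def drift_detected(p_value, acc_prev, acc_curr, p_max=0.05, aad_min=0.05):
    """Flag a drift when both conditions hold at the same time.

    Assumption: aad = |accuracy(previous window) - accuracy(current window)|.
    The 0.05 thresholds are the ones quoted in the text.
    """
    aad = abs(acc_prev - acc_curr)
    return p_value <= p_max and aad > aad_min
```
]]></preformat>
<p>Requiring both conditions avoids triggering the costly adaptation step when the vocabulary changes but the classifier accuracy remains stable, or vice versa.</p>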
<p>Once a drift is identified, the hyperparameter optimization starts. This process, which is the most time-consuming step, employs <monospace>GridSearch</monospace><xref ref-type="fn" rid="j_infor562_fn_028">28</xref><fn id="j_infor562_fn_028"><label><sup>28</sup></label>
<p>Available at <uri>https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html</uri>, May 2024.</p></fn> with reduced configuration parameters (see Fig. <xref rid="j_infor562_fig_004">3</xref>).</p>
</sec>
<sec id="j_infor562_s_019">
<label>4.5</label>
<title>ML Classification</title>
<p>The selected classification techniques include <sc>htc</sc>,<xref ref-type="fn" rid="j_infor562_fn_029">29</xref><fn id="j_infor562_fn_029"><label><sup>29</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1/api/tree/HoeffdingTreeClassifier</uri>, May 2024.</p></fn> <sc>hatc</sc>,<xref ref-type="fn" rid="j_infor562_fn_030">30</xref><fn id="j_infor562_fn_030"><label><sup>30</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1/api/tree/HoeffdingAdaptiveTreeClassifier</uri>, May 2024.</p></fn> and <sc>arfc</sc><xref ref-type="fn" rid="j_infor562_fn_031">31</xref><fn id="j_infor562_fn_031"><label><sup>31</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1/api/ensemble/AdaptiveRandomForestClassifier</uri>, May 2024.</p></fn> from River 0.11.1.<xref ref-type="fn" rid="j_infor562_fn_032">32</xref><fn id="j_infor562_fn_032"><label><sup>32</sup></label>
<p>Due to computational and time constraints, results were obtained with a balanced subset composed of 73 770 and 49 754 samples for the Yelp and MediaWiki data sets, respectively.</p></fn></p>
<p>Figure <xref rid="j_infor562_fig_004">3</xref> details all hyperparameter optimization values. Their ranges and best values were defined experimentally. Identifying the best values relied on an <italic>ad hoc</italic> implementation of <monospace>GridSearch</monospace> for data streams.</p>
<p>As the solution operates in streaming mode, no retraining is needed. However, the model’s performance is expected to be lower during the cold start (initial samples) or with very short data streams. Consequently, this solution is intended for domains that continuously produce large volumes of textual data.</p>
<p>The results in Tables <xref rid="j_infor562_tab_006">6</xref>, <xref rid="j_infor562_tab_007">7</xref> and <xref rid="j_infor562_tab_008">8</xref> are computed with an <italic>ad hoc</italic> implementation of the <monospace>progressive_val_score</monospace><xref ref-type="fn" rid="j_infor562_fn_033">33</xref><fn id="j_infor562_fn_033"><label><sup>33</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1/api/evaluate/progressive-val-score</uri>, May 2024.</p></fn> from River 0.11.1. Because the system operates in streaming mode, the validation scheme interleaves prediction and training steps: each sample is first predicted and then used for training. Consequently, the displayed results correspond to the metric values computed after the last incoming sample, that is, the last sample in chronological order.</p>
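<p>The predict-then-train validation scheme can be illustrated with a small self-contained sketch. It does not reproduce the <italic>ad hoc</italic> implementation used in the experiments: the toy majority-class model and the function name are hypothetical, and the method names merely mirror the usual online-learning convention.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


class MajorityClassifier:
    """Toy online model that predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        if not self.counts:
            return None  # no knowledge yet (cold start)
        return self.counts.most_common(1)[0][0]

    def learn_one(self, x, y):
        self.counts[y] += 1


def progressive_accuracy(stream, model):
    """Evaluate on each sample before training on it.

    In this sketch a missing prediction (None) counts as an error.
    The returned value is the accuracy after the last incoming sample.
    """
    correct = 0
    total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # 1) predict with the current model
        correct += int(y_pred == y)    # 2) update the metric
        total += 1
        model.learn_one(x, y)          # 3) train on the revealed label
    return correct / total if total else 0.0
```
]]></preformat>
<p>Since every sample is used for testing before it is used for training, no separate hold-out set is needed, and the reported figure reflects the whole chronologically ordered stream.</p>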
<p>Table <xref rid="j_infor562_tab_006">6</xref> shows the results obtained in the spam versus non-spam review classification in the four scenarios with the Yelp data set.</p>
<table-wrap id="j_infor562_tab_006">
<label>Table 6</label>
<caption>
<p>Online spam prediction results (best values in bold) for the Yelp data set.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Scenario</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Model</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Accuracy</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>F</italic>-measure</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Time (s)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Macro</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Non-spam</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Spam</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" style="vertical-align: middle; text-align: left">1</td>
<td style="vertical-align: top; text-align: left"><sc>htc</sc></td>
<td style="vertical-align: top; text-align: left">61.22</td>
<td style="vertical-align: top; text-align: left">54.48</td>
<td style="vertical-align: top; text-align: left">72.00</td>
<td style="vertical-align: top; text-align: left">36.96</td>
<td style="vertical-align: top; text-align: left">29.20</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>hatc</sc></td>
<td style="vertical-align: top; text-align: left">61.42</td>
<td style="vertical-align: top; text-align: left">55.07</td>
<td style="vertical-align: top; text-align: left">71.96</td>
<td style="vertical-align: top; text-align: left">38.18</td>
<td style="vertical-align: top; text-align: left">32.26</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>arfc</sc></td>
<td style="vertical-align: top; text-align: left">65.96</td>
<td style="vertical-align: top; text-align: left">65.96</td>
<td style="vertical-align: top; text-align: left">66.13</td>
<td style="vertical-align: top; text-align: left">65.78</td>
<td style="vertical-align: top; text-align: left">205.45</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: middle; text-align: left">2</td>
<td style="vertical-align: top; text-align: left"><sc>htc</sc></td>
<td style="vertical-align: top; text-align: left">62.51</td>
<td style="vertical-align: top; text-align: left">57.99</td>
<td style="vertical-align: top; text-align: left">71.77</td>
<td style="vertical-align: top; text-align: left">44.21</td>
<td style="vertical-align: top; text-align: left">5.07</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>hatc</sc></td>
<td style="vertical-align: top; text-align: left">62.17</td>
<td style="vertical-align: top; text-align: left">57.70</td>
<td style="vertical-align: top; text-align: left">71.44</td>
<td style="vertical-align: top; text-align: left">43.97</td>
<td style="vertical-align: top; text-align: left">6.39</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>arfc</sc></td>
<td style="vertical-align: top; text-align: left">60.76</td>
<td style="vertical-align: top; text-align: left">60.75</td>
<td style="vertical-align: top; text-align: left">60.99</td>
<td style="vertical-align: top; text-align: left">60.52</td>
<td style="vertical-align: top; text-align: left">19.93</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: middle; text-align: left">3</td>
<td style="vertical-align: top; text-align: left"><sc>htc</sc></td>
<td style="vertical-align: top; text-align: left">67.88</td>
<td style="vertical-align: top; text-align: left">67.06</td>
<td style="vertical-align: top; text-align: left">72.26</td>
<td style="vertical-align: top; text-align: left">61.87</td>
<td style="vertical-align: top; text-align: left">287.50</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>hatc</sc></td>
<td style="vertical-align: top; text-align: left">69.57</td>
<td style="vertical-align: top; text-align: left">69.55</td>
<td style="vertical-align: top; text-align: left">70.26</td>
<td style="vertical-align: top; text-align: left">68.84</td>
<td style="vertical-align: top; text-align: left">515.32</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>arfc</sc></td>
<td style="vertical-align: top; text-align: left">75.82</td>
<td style="vertical-align: top; text-align: left">75.55</td>
<td style="vertical-align: top; text-align: left">73.00</td>
<td style="vertical-align: top; text-align: left">78.10</td>
<td style="vertical-align: top; text-align: left">2346.75</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><sc>arfc</sc></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>78.75</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>78.44</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>75.85</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>81.03</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">9678.32</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In scenarios 1 and 2, the accuracy lies between roughly 60% and 66% for all models. Unfortunately, the spam <italic>F</italic>-measure in scenario 1 does not reach 40% with <sc>htc</sc> and <sc>hatc</sc>. Scenarios 1 and 2 display similar accuracy results since they differ only in the number of running threads. In contrast, scenario 3 presents a remarkable improvement in the spam <italic>F</italic>-measure (+30.66 percentage points for <sc>hatc</sc> with respect to scenario 1). Scenario 3, with data drift detection &amp; adaptation, reaches a spam <italic>F</italic>-measure of 78.10% and an average run-time per sample of 32 ms with the <sc>arfc</sc> model, detecting an average of 1.75 drifts per thread (35 data drifts in total). This indicates that data drift detection &amp; adaptation improves spam classification (+17.58 percentage points in spam <italic>F</italic>-measure with respect to scenario 2) and that multi-threading with 20 threads can process an average of 31 samples/s. Finally, scenario 4 exploits <sc>arfc</sc>, the best-performing model in scenarios 1, 2, and 3, with data drift detection &amp; adaptation on a single processing thread. It presents top values for all metrics, including a spam <italic>F</italic>-measure of 81.03%, with an average run-time per sample of 130 ms. This last scenario detected 14 drifts and processed 8 samples/s. The difference in the number of data drifts detected in scenario 3 (35) and scenario 4 (14) is caused by the thread cold start, i.e. each of the 20 threads starts with an empty model.</p>
<p>Table <xref rid="j_infor562_tab_007">7</xref> shows the evaluation with the MediaWiki data set. The lower results of scenario 2, caused by parallelization, improve in scenario 3 thanks to data drift detection &amp; adaptation. The promising performance of <sc>arfc</sc> is further enhanced in scenario 4, with a notable increase in the non-spam <italic>F</italic>-measure between scenarios 1 and 4 (+11.84 percentage points). All evaluation metrics are around 85%. The number of data drifts and the sample processing rate are similar to those obtained with the Yelp data set. In scenario 3, the <sc>arfc</sc> model reports an average run-time per sample of 47 ms (21 samples/s) and 38 data drifts (3.8 drifts per thread). The <sc>arfc</sc> model in scenario 4 identified 10 data drifts with an average run-time of 116 ms per sample.</p>
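<p>The per-sample run-times and processing rates quoted for the <sc>arfc</sc> model can be approximately reproduced from the total times in Tables <xref rid="j_infor562_tab_006">6</xref> and <xref rid="j_infor562_tab_007">7</xref> and the subset sizes (73 770 Yelp samples and 49 754 MediaWiki samples). The short sketch below performs this arithmetic; the variable and function names are illustrative.</p>
<preformat preformat-type="code"><![CDATA[
```python
# (data set, scenario) -> (total time in seconds, number of samples)
ARFC_RUNS = {
    ("Yelp", 3): (2346.75, 73770),
    ("Yelp", 4): (9678.32, 73770),
    ("MediaWiki", 3): (2333.31, 49754),
    ("MediaWiki", 4): (5817.78, 49754),
}


def per_sample_ms(total_seconds, n_samples):
    """Average run-time per sample in milliseconds."""
    return 1000.0 * total_seconds / n_samples


def throughput(total_seconds, n_samples):
    """Average number of samples processed per second."""
    return n_samples / total_seconds


if __name__ == "__main__":
    for (data_set, scenario), (seconds, n) in ARFC_RUNS.items():
        print(
            f"{data_set}, scenario {scenario}: "
            f"{per_sample_ms(seconds, n):.1f} ms/sample, "
            f"{throughput(seconds, n):.1f} samples/s"
        )
```
]]></preformat>
<p>The computed values agree with the reported ones up to rounding, which confirms that the run-time figures are averages over the whole balanced subset.</p>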
<table-wrap id="j_infor562_tab_007">
<label>Table 7</label>
<caption>
<p>Online spam prediction results (best values in bold) for the MediaWiki data set.</p>
</caption>
<table>
<thead>
<tr>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Scenario</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Model</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Accuracy</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>F</italic>-measure</td>
<td rowspan="2" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin">Time (s)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Macro</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Non-spam</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Spam</td>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" style="vertical-align: middle; text-align: left">1</td>
<td style="vertical-align: top; text-align: left"><sc>htc</sc></td>
<td style="vertical-align: top; text-align: left">80.78</td>
<td style="vertical-align: top; text-align: left">80.05</td>
<td style="vertical-align: top; text-align: left">76.23</td>
<td style="vertical-align: top; text-align: left">83.87</td>
<td style="vertical-align: top; text-align: left">18.64</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>hatc</sc></td>
<td style="vertical-align: top; text-align: left">80.75</td>
<td style="vertical-align: top; text-align: left">80.02</td>
<td style="vertical-align: top; text-align: left">76.20</td>
<td style="vertical-align: top; text-align: left">83.84</td>
<td style="vertical-align: top; text-align: left">21.28</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>arfc</sc></td>
<td style="vertical-align: top; text-align: left">71.15</td>
<td style="vertical-align: top; text-align: left">71.11</td>
<td style="vertical-align: top; text-align: left">72.18</td>
<td style="vertical-align: top; text-align: left">70.03</td>
<td style="vertical-align: top; text-align: left">65.21</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: middle; text-align: left">2</td>
<td style="vertical-align: top; text-align: left"><sc>htc</sc></td>
<td style="vertical-align: top; text-align: left">79.65</td>
<td style="vertical-align: top; text-align: left">78.95</td>
<td style="vertical-align: top; text-align: left">75.10</td>
<td style="vertical-align: top; text-align: left">82.79</td>
<td style="vertical-align: top; text-align: left">4.40</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>hatc</sc></td>
<td style="vertical-align: top; text-align: left">79.84</td>
<td style="vertical-align: top; text-align: left">79.16</td>
<td style="vertical-align: top; text-align: left">75.40</td>
<td style="vertical-align: top; text-align: left">82.92</td>
<td style="vertical-align: top; text-align: left">5.02</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>arfc</sc></td>
<td style="vertical-align: top; text-align: left">69.75</td>
<td style="vertical-align: top; text-align: left">69.72</td>
<td style="vertical-align: top; text-align: left">70.68</td>
<td style="vertical-align: top; text-align: left">68.76</td>
<td style="vertical-align: top; text-align: left">9.91</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: middle; text-align: left">3</td>
<td style="vertical-align: top; text-align: left"><sc>htc</sc></td>
<td style="vertical-align: top; text-align: left">81.78</td>
<td style="vertical-align: top; text-align: left">81.46</td>
<td style="vertical-align: top; text-align: left">79.03</td>
<td style="vertical-align: top; text-align: left">83.89</td>
<td style="vertical-align: top; text-align: left">373.73</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>hatc</sc></td>
<td style="vertical-align: top; text-align: left">82.23</td>
<td style="vertical-align: top; text-align: left">82.00</td>
<td style="vertical-align: top; text-align: left">79.97</td>
<td style="vertical-align: top; text-align: left">84.03</td>
<td style="vertical-align: top; text-align: left">510.45</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>arfc</sc></td>
<td style="vertical-align: top; text-align: left">84.03</td>
<td style="vertical-align: top; text-align: left">83.80</td>
<td style="vertical-align: top; text-align: left">81.84</td>
<td style="vertical-align: top; text-align: left">85.75</td>
<td style="vertical-align: top; text-align: left">2333.31</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">4</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><sc>arfc</sc></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>86.13</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>85.89</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>84.02</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>87.75</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">5817.78</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The appropriateness of the proposed drift detection algorithm is supported by its comparison with the Early Drift Detection Method (<sc>eddm</sc>)<xref ref-type="fn" rid="j_infor562_fn_034">34</xref><fn id="j_infor562_fn_034"><label><sup>34</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1/api/drift/EDDM</uri>, May 2024.</p></fn> and ADaptive WINdowing (<sc>adwin</sc>)<xref ref-type="fn" rid="j_infor562_fn_035">35</xref><fn id="j_infor562_fn_035"><label><sup>35</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1/api/drift/ADWIN</uri>, May 2024.</p></fn> drift detectors. Table <xref rid="j_infor562_tab_008">8</xref> shows the results of the <sc>arfc</sc> model in scenario 4 with the three drift detectors and the selected experimental data sets. The proposed drift detector attains the best results, followed by <sc>adwin</sc> (23.24 percentage points lower in the <italic>F</italic>-measure for the spam class). Moreover, <sc>eddm</sc> detects many drifts (793 and 161 for the Yelp and MediaWiki data sets, respectively), which increases the number of training sessions, negatively affecting performance. <sc>adwin</sc> identifies only 6 drifts in the Yelp data set and a higher number (38) in the MediaWiki data set.</p>
<table-wrap id="j_infor562_tab_008">
<label>Table 8</label>
<caption>
<p>Online spam prediction results in scenario 4 with different drift detectors (best values in bold).</p>
</caption>
<table>
<thead>
<tr>
<td style="vertical-align: top; text-align: left; border-top: solid thin">Data set</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin">Drift detector</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin">Accuracy</td>
<td colspan="3" style="vertical-align: top; text-align: left; border-top: solid thin; border-bottom: solid thin"><italic>F</italic>-measure</td>
<td style="vertical-align: top; text-align: left; border-top: solid thin">Time (s)</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Macro</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Non-spam</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Spam</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"/>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="3" style="vertical-align: middle; text-align: left">Yelp</td>
<td style="vertical-align: top; text-align: left"><sc>eddm</sc></td>
<td style="vertical-align: top; text-align: left">54.58</td>
<td style="vertical-align: top; text-align: left">54.58</td>
<td style="vertical-align: top; text-align: left">54.53</td>
<td style="vertical-align: top; text-align: left">54.63</td>
<td style="vertical-align: top; text-align: left">373.37</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>adwin</sc></td>
<td style="vertical-align: top; text-align: left">60.56</td>
<td style="vertical-align: top; text-align: left">60.56</td>
<td style="vertical-align: top; text-align: left">60.70</td>
<td style="vertical-align: top; text-align: left">60.42</td>
<td style="vertical-align: top; text-align: left">363.12</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left">Proposed</td>
<td style="vertical-align: top; text-align: left"><bold>78.75</bold></td>
<td style="vertical-align: top; text-align: left"><bold>78.44</bold></td>
<td style="vertical-align: top; text-align: left"><bold>75.85</bold></td>
<td style="vertical-align: top; text-align: left"><bold>81.03</bold></td>
<td style="vertical-align: top; text-align: left">9678.32</td>
</tr>
<tr>
<td rowspan="3" style="vertical-align: middle; text-align: left; border-bottom: solid thin">MediaWiki</td>
<td style="vertical-align: top; text-align: left"><sc>eddm</sc></td>
<td style="vertical-align: top; text-align: left">62.70</td>
<td style="vertical-align: top; text-align: left">62.70</td>
<td style="vertical-align: top; text-align: left">63.22</td>
<td style="vertical-align: top; text-align: left">62.17</td>
<td style="vertical-align: top; text-align: left">1078.25</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left"><sc>adwin</sc></td>
<td style="vertical-align: top; text-align: left">65.09</td>
<td style="vertical-align: top; text-align: left">65.08</td>
<td style="vertical-align: top; text-align: left">65.65</td>
<td style="vertical-align: top; text-align: left">64.51</td>
<td style="vertical-align: top; text-align: left">1178.08</td>
</tr>
<tr>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">Proposed</td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>86.13</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>85.89</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>84.02</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin"><bold>87.75</bold></td>
<td style="vertical-align: top; text-align: left; border-bottom: solid thin">5817.78</td>
</tr>
</tbody>
</table>
</table-wrap>
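<p>The figures in Table 8 can be cross-checked with a short worked example. It assumes that the macro <italic>F</italic>-measure is the unweighted mean of the two per-class values, which the reported numbers agree with up to rounding; the helper names <monospace>macro_f</monospace> and <monospace>spam_gap</monospace> are illustrative only.</p>
<preformat preformat-type="code"><![CDATA[
```python
# Per-class F-measures (non-spam, spam) copied from Table 8.
table8 = {
    ("Yelp", "EDDM"): (54.53, 54.63),
    ("Yelp", "ADWIN"): (60.70, 60.42),
    ("Yelp", "Proposed"): (75.85, 81.03),
    ("MediaWiki", "EDDM"): (63.22, 62.17),
    ("MediaWiki", "ADWIN"): (65.65, 64.51),
    ("MediaWiki", "Proposed"): (84.02, 87.75),
}


def macro_f(non_spam, spam):
    """Unweighted mean of the per-class F-measures (assumed definition)."""
    return (non_spam + spam) / 2


def spam_gap(dataset, baseline):
    """Spam-class gap, in percentage points, between the proposed detector and a baseline."""
    proposed_spam = table8[(dataset, "Proposed")][1]
    baseline_spam = table8[(dataset, baseline)][1]
    return round(proposed_spam - baseline_spam, 2)


for (dataset, detector), scores in table8.items():
    print(f"{dataset:10s} {detector:9s} macro F = {macro_f(*scores):.2f}")

print("Spam gap vs ADWIN (Yelp):", spam_gap("Yelp", "ADWIN"))
print("Spam gap vs ADWIN (MediaWiki):", spam_gap("MediaWiki", "ADWIN"))
```
]]></preformat>
<p>The spam-class gap with respect to <sc>adwin</sc> is 20.61 percentage points for Yelp and 23.24 percentage points for MediaWiki.</p>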
<p>Comparing these spam detection results with those of related works that use the Yelp data set is merely indicative, since it sets incremental online classification against offline classification methods. Nevertheless, the current method outperforms the 62.35% accuracy reported by Mohawesh <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>) by 16.4 percentage points in the Yelp NYC data set with 322 167 reviews. Furthermore, the values obtained with the <sc>adwin</sc> concept drift detection technique by Mohawesh <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>) are aligned with those reported in Table <xref rid="j_infor562_tab_008">8</xref>. This helps to validate the current method, which attains superior performance. Unfortunately, no information is provided for the specific case of the spam class (i.e. micro-averaging evaluation), in which the current incremental method surpasses the 80% barrier in <italic>F</italic>-measure. Moreover, Mohawesh <italic>et al.</italic> (<xref ref-type="bibr" rid="j_infor562_ref_036">2021</xref>) focused on concept drift rather than data drift and did not include explainability capabilities, a distinctive feature of the current method.</p>
</sec>
<sec id="j_infor562_s_020">
<label>4.6</label>
<title>Explainability</title>
<p>Figure <xref rid="j_infor562_fig_006">5</xref> displays the graphical and textual explanation of the classification of an incoming review. The buttons on the left vertical bar enable: (<italic>i</italic>) administrator profile access, (<italic>ii</italic>) searching reviews by textual content, (<italic>iii</italic>) searching reviews by timestamp, (<italic>iv</italic>) access to alerts, (<italic>v</italic>) visualization of the decision tree and its associated natural language description (see Fig. <xref rid="j_infor562_fig_007">6</xref>), (<italic>vi</italic>) saving the results in the cloud, and (<italic>vii</italic>) configuring the colour layout (i.e. dark or light mode). The most representative features for the classification are shown in the top part. The relevance of a feature corresponds to its frequency of appearance in the decision tree path, considering only positive (greater than) bifurcations (see the graph in Fig. <xref rid="j_infor562_fig_007">6</xref>). The white feature navigation panel on the top right displays the most relevant features. The coloured circle that accompanies this drop-down menu represents the level of severity (i.e. green when the feature value is higher than the user’s 50th percentile, yellow when it lies between the 25th and 50th percentiles, and red when it is lower than the 25th percentile). While these selectors only apply to the coloured cards on the left, the review panel at the bottom affects the whole dashboard and enables the analysis of different reviews (i.e. using the previous and next buttons). Finally, two additional buttons allow a manager to indicate whether the prediction is correct, acting as an expert in the loop.
The displayed review exhibits a high charge of anger and a significant deviation between the user rating and the detected polarity. In addition, the editor has been associated with spam content in the past. The sample has been classified as spam with 75% confidence, as returned by the <monospace>predict_proba_one</monospace> function<xref ref-type="fn" rid="j_infor562_fn_036">36</xref><fn id="j_infor562_fn_036"><label><sup>36</sup></label>
<p>Available at <uri>https://riverml.xyz/0.11.1/api/base/Classifier</uri>, May 2024.</p></fn> from River 0.11.1.</p>
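<p>The two dashboard rules described above, the severity colour and the feature relevance, can be sketched as follows. This is a minimal illustration under stated assumptions: the decision path is represented as a list of (feature, operator, threshold) tuples, the feature names are invented, and the 25th and 50th cut points are taken from a list of the user’s past values with Python’s <monospace>statistics.quantiles</monospace>. None of these choices is confirmed as the dashboard’s actual implementation.</p>
<preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from statistics import quantiles


def severity(value, user_history):
    """Map a feature value to a colour using the user's 25th and 50th cut points."""
    q25, q50, _ = quantiles(user_history, n=4)  # three quartile cut points
    if value > q50:
        return "green"
    if value >= q25:
        return "yellow"
    return "red"


def feature_relevance(decision_path):
    """Count feature occurrences in the path, keeping only 'greater than' branches."""
    return Counter(
        feature for feature, operator, _threshold in decision_path if operator == ">"
    )


# Hypothetical decision path: (feature, operator, threshold) per visited node.
path = [
    ("anger", ">", 0.6),
    ("user_spam_ratio", ">", 0.3),
    ("anger", ">", 0.8),
    ("word_count", "<=", 40),
]
relevance = feature_relevance(path)
print(relevance.most_common())

# Hypothetical past values of one feature for one user.
history = [1, 2, 3, 4, 5, 6, 7, 8]
print(severity(5, history), severity(3, history), severity(1, history))
```
]]></preformat>
<p>In this sketch, a feature tested twice along the path with a greater-than condition ranks above one tested once, and features that only appear in less-than-or-equal branches are ignored.</p>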
<fig id="j_infor562_fig_006">
<label>Fig. 5</label>
<caption>
<p>Explainability dashboard (relevant features).</p>
</caption>
<graphic xlink:href="infor562_g006.jpg"/>
</fig>
<p>Finally, the system presents the decision tree path of the prediction and the corresponding natural language description obtained with <sc>gpt</sc>3<xref ref-type="fn" rid="j_infor562_fn_037">37</xref><fn id="j_infor562_fn_037"><label><sup>37</sup></label>
<p>Available at <uri>https://openai.com/product</uri>, May 2024.</p></fn> (see Fig. <xref rid="j_infor562_fig_007">6</xref>). <sc>gpt</sc>3 was configured to use the <monospace>text-davinci-003</monospace> model with the default parameters, except for the <monospace>temperature</monospace> parameter, which was set to 0.7 to generate human-like natural language descriptions. At the top, the administrator can navigate the different decision trees using the previous and next buttons, with the decision path highlighted in blue.</p>
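<p>As an illustration of this configuration, the sketch below assembles a text-completion request body for a decision path without sending it. The prompt wording, the helper <monospace>build_completion_request</monospace>, and the example path are hypothetical, since the prompt used by the system is not given here; only the model name and the temperature value come from the text, and all other request parameters are left at their defaults.</p>
<preformat preformat-type="code"><![CDATA[
```python
import json


def build_completion_request(decision_path, label, confidence):
    """Build a request body asking for a natural language description of a path.

    The prompt template is an assumption made for illustration only.
    """
    rules = "; ".join(
        f"{feature} {operator} {threshold}"
        for feature, operator, threshold in decision_path
    )
    prompt = (
        "Explain in plain English why a review was classified as "
        f"{label} with {confidence:.0%} confidence, "
        f"given this decision tree path: {rules}."
    )
    return {
        "model": "text-davinci-003",  # model named in the text
        "prompt": prompt,
        "temperature": 0.7,           # only non-default parameter
    }


# Hypothetical decision path with invented feature names and thresholds.
path = [("anger", ">", 0.6), ("user_spam_ratio", ">", 0.3)]
request = build_completion_request(path, "spam", 0.75)
print(json.dumps(request, indent=2))
```
]]></preformat>
<p>Keeping the request construction separate from the network call makes the prompt easy to inspect and log alongside each explanation.</p>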
<fig id="j_infor562_fig_007">
<label>Fig. 6</label>
<caption>
<p>Explainability dashboard (decision path and Large Language Model description).</p>
</caption>
<graphic xlink:href="infor562_g007.jpg"/>
</fig>
</sec>
</sec>
<sec id="j_infor562_s_021">
<label>5</label>
<title>Conclusion</title>
<p>The use of crowdsourcing platforms to obtain information about products and services is growing, and customers rely on reviews to make the best decision. However, some individuals submit dishonest and misleading feedback to manipulate the reputation or perception of a product or service. These spam reviews can be created for various reasons, including financial gain, personal grudges, or competitive advantage. To address this problem, this research contributes an online explainable classification engine that identifies and explains spam reviews and, thus, promotes trust in digital media.</p>
<p>Specifically, the proposed method comprises (<italic>i</italic>) stream-based data processing (through feature engineering, incremental profiling, and selection), (<italic>ii</italic>) data drift detection &amp; adaptation, (<italic>iii</italic>) stream-based classification, and (<italic>iv</italic>) explainability. The solution relies on stream-based processing, incrementally updating the profiling and classification models on each incoming event. User profiles are computed using user- and content-based features engineered through <sc>nlp</sc>. Monitoring the incoming streams, the method detects data drifts using static and sliding windows. The classification relies on tree-based classifiers to obtain an interpretable stream-based classification. As a result, the user dashboard includes visual data and natural language knowledge to explain the classification of each incoming event. The experimental classification results of the proposed explainable and stream-based spam detection method show promising performance: 78.75% accuracy and 78.44% macro <italic>F</italic>-measure obtained with the Yelp data set, and 86.13% accuracy and 85.89% macro <italic>F</italic>-measure with the MediaWiki data set. Moreover, the proposed data drift detection &amp; adaptation approach performs better than well-known drift detectors (up to 23.24 percentage points higher in the <italic>F</italic>-measure for spam detection). According to the related work analysis, this proposal is the first to jointly provide stream-based data processing, profiling, classification with data drift detection &amp; adaptation, and explainability.</p>
<p>This solution can be extended to detect orchestrated groups of active spammers thanks to its modular design with <sc>nlp</sc> techniques and <italic>ad hoc</italic> clustering methods for streaming operation. To this end, additional side and content features can be incorporated to cluster contributors by location and temporal affinity. New content-based features can be explored to represent the semantic (e.g. ontology-based like WordNet Domains) and non-semantic similarity (e.g. cosine distance) between reviews. In this regard, the current version of the system already considers sentiment and emotion analysis. The corresponding incremental features can then be designed per user and group of closely related users. The system should, therefore, be able to dynamically adapt to changes in the spamming behaviour of both individuals and groups. Moreover, in future work, the online processing throughput can be further improved by adopting parallelization algorithms that exploit the intrinsic distribution of the data together with elastic hardware solutions. Considering the online processing of reviews, the number of threads and the allocation of incoming samples to threads can be location-based, e.g. employing separate dedicated threads to process the reviews of New York, London, or Paris.</p>
</sec>
<sec id="j_infor562_s_022">
<title>Authors’ Contributions</title>
<p><bold>Francisco de Arriba-Pérez</bold>: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Resources, Data Curation, Writing – Original Draft, Writing – Review &amp; Editing, Visualization, Project Administration, Funding Acquisition. <bold>Silvia García-Méndez</bold>: Conceptualization, Methodology, Software, Validation, Formal Analysis, Investigation, Resources, Data Curation, Writing – Original Draft, Writing – Review &amp; Editing, Visualization, Project Administration, Funding Acquisition. <bold>Fátima Leal</bold>: Conceptualization, Resources, Writing – Original Draft. <bold>Benedita Malheiro</bold>: Conceptualization, Methodology, Validation, Writing – Review &amp; Editing, Supervision. <bold>Juan Carlos Burguillo-Rial</bold>: Conceptualization, Writing – Review &amp; Editing.</p>
</sec>
</body>
<back>
<ref-list id="j_infor562_reflist_001">
<title>References</title>
<ref id="j_infor562_ref_001">
<mixed-citation publication-type="journal"><string-name><surname>Al-Otaibi</surname>, <given-names>S.T.</given-names></string-name>, <string-name><surname>Al-Rasheed</surname>, <given-names>A.A.</given-names></string-name> (<year>2022</year>). <article-title>A review and comparative analysis of sentiment analysis techniques</article-title>. <source>Informatica</source>, <volume>46</volume>(<issue>6</issue>), <fpage>33</fpage>–<lpage>44</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.31449/inf.v46i6.3991" xlink:type="simple">https://doi.org/10.31449/inf.v46i6.3991</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_002">
<mixed-citation publication-type="journal"><string-name><surname>Albayati</surname>, <given-names>M.B.</given-names></string-name>, <string-name><surname>Altamimi</surname>, <given-names>A.M.</given-names></string-name> (<year>2019</year>). <article-title>An empirical study for detecting fake Facebook profiles using supervised mining techniques</article-title>. <source>Informatica</source>, <volume>43</volume>(<issue>1</issue>), <fpage>77</fpage>–<lpage>86</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.31449/inf.v43i1.2319" xlink:type="simple">https://doi.org/10.31449/inf.v43i1.2319</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_003">
<mixed-citation publication-type="journal"><string-name><surname>Barddal</surname>, <given-names>J.P.</given-names></string-name>, <string-name><surname>Gomes</surname>, <given-names>H.M.</given-names></string-name>, <string-name><surname>Enembreck</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Pfahringer</surname>, <given-names>B.</given-names></string-name> (<year>2017</year>). <article-title>A survey on feature drift adaptation: definition, benchmark, challenges and future directions</article-title>. <source>Journal of Systems and Software</source>, <volume>127</volume>, <fpage>278</fpage>–<lpage>294</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.jss.2016.07.005" xlink:type="simple">https://doi.org/10.1016/j.jss.2016.07.005</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_004">
<mixed-citation publication-type="chapter"><string-name><surname>Bian</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Sweetser</surname>, <given-names>P.</given-names></string-name> (<year>2021</year>). <chapter-title>Detecting spam game reviews on Steam with a semi-supervised approach</chapter-title>. In: <source>Proceedings of the International Conference on the Foundations of Digital Games</source>. <publisher-name>Association for Computing Machinery</publisher-name>, pp. <fpage>1</fpage>–<lpage>10</lpage>. <isbn>9781450384223</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/3472538.3472547" xlink:type="simple">https://doi.org/10.1145/3472538.3472547</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_005">
<mixed-citation publication-type="journal"><string-name><surname>Cano</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Krawczyk</surname>, <given-names>B.</given-names></string-name> (<year>2019</year>). <article-title>Evolving rule-based classifiers with genetic programming on GPUs for drifting data streams</article-title>. <source>Pattern Recognition</source>, <volume>87</volume>, <fpage>248</fpage>–<lpage>268</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.patcog.2018.10.024" xlink:type="simple">https://doi.org/10.1016/j.patcog.2018.10.024</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_006">
<mixed-citation publication-type="journal"><string-name><surname>Carvalho</surname>, <given-names>D.V.</given-names></string-name>, <string-name><surname>Pereira</surname>, <given-names>E.M.</given-names></string-name>, <string-name><surname>Cardoso</surname>, <given-names>J.S.</given-names></string-name> (<year>2019</year>). <article-title>Machine learning interpretability: a survey on methods and metrics</article-title>. <source>Electronics</source>, <volume>8</volume>(<issue>8</issue>), <fpage>832</fpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3390/electronics8080832" xlink:type="simple">https://doi.org/10.3390/electronics8080832</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_007">
<mixed-citation publication-type="journal"><string-name><surname>Charmet</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Tanuwidjaja</surname>, <given-names>H.C.</given-names></string-name>, <string-name><surname>Ayoubi</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Gimenez</surname>, <given-names>P.F.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Jmila</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Blanc</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Takahashi</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name> (<year>2022</year>). <article-title>Explainable artificial intelligence for cybersecurity: a literature survey</article-title>. <source>Annals of Telecommunications</source>, <volume>77</volume>, <fpage>789</fpage>–<lpage>812</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s12243-022-00926-7" xlink:type="simple">https://doi.org/10.1007/s12243-022-00926-7</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_008">
<mixed-citation publication-type="journal"><string-name><surname>Chumakov</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kovantsev</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Surikov</surname>, <given-names>A.</given-names></string-name> (<year>2023</year>). <article-title>Generative approach to aspect based sentiment analysis with GPT language models</article-title>. <source>Procedia Computer Science</source>, <volume>229</volume>, <fpage>284</fpage>–<lpage>293</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.procs.2023.12.030" xlink:type="simple">https://doi.org/10.1016/j.procs.2023.12.030</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_009">
<mixed-citation publication-type="journal"><string-name><surname>Crawford</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Khoshgoftaar</surname>, <given-names>T.M.</given-names></string-name>, <string-name><surname>Prusa</surname>, <given-names>J.D.</given-names></string-name>, <string-name><surname>Richter</surname>, <given-names>A.N.</given-names></string-name>, <string-name><surname>Al Najada</surname>, <given-names>H.</given-names></string-name> (<year>2015</year>). <article-title>Survey of review spam detection using machine learning techniques</article-title>. <source>Journal of Big Data</source>, <volume>2</volume>(<issue>1</issue>), <fpage>1</fpage>–<lpage>24</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1186/s40537-015-0029-9" xlink:type="simple">https://doi.org/10.1186/s40537-015-0029-9</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_010">
<mixed-citation publication-type="chapter"><string-name><surname>Desale</surname>, <given-names>K.S.</given-names></string-name>, <string-name><surname>Shinde</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Magar</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Kullolli</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kurhade</surname>, <given-names>A.</given-names></string-name> (<year>2023</year>). <chapter-title>Fake review detection with concept drift in the data: a survey</chapter-title>. In: <source>Proceedings of International Congress on Information and Communication Technology</source>, Vol. <volume>448</volume>. <publisher-name>Springer</publisher-name>, pp. <fpage>719</fpage>–<lpage>726</lpage>. <isbn>9789811916090</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-981-19-1610-6_63" xlink:type="simple">https://doi.org/10.1007/978-981-19-1610-6_63</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_011">
<mixed-citation publication-type="journal"><string-name><surname>Duckworth</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Chmiel</surname>, <given-names>F.P.</given-names></string-name>, <string-name><surname>Burns</surname>, <given-names>D.K.</given-names></string-name>, <string-name><surname>Zlatev</surname>, <given-names>Z.D.</given-names></string-name>, <string-name><surname>White</surname>, <given-names>N.M.</given-names></string-name>, <string-name><surname>Daniels</surname>, <given-names>T.W.V.</given-names></string-name>, <string-name><surname>Kiuber</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Boniface</surname>, <given-names>M.J.</given-names></string-name> (<year>2021</year>). <article-title>Using explainable machine learning to characterise data drift and detect emergent health risks for emergency department admissions during COVID-19</article-title>. <source>Scientific Reports</source>, <volume>11</volume>, <fpage>23017</fpage>–<lpage>23026</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/s41598-021-02481-y" xlink:type="simple">https://doi.org/10.1038/s41598-021-02481-y</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_012">
<mixed-citation publication-type="journal"><string-name><surname>Engelbrecht</surname>, <given-names>A.P.</given-names></string-name>, <string-name><surname>Grobler</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Langeveld</surname>, <given-names>J.</given-names></string-name> (<year>2019</year>). <article-title>Set based particle swarm optimization for the feature selection problem</article-title>. <source>Engineering Applications of Artificial Intelligence</source>, <volume>85</volume>, <fpage>324</fpage>–<lpage>336</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.engappai.2019.06.008" xlink:type="simple">https://doi.org/10.1016/j.engappai.2019.06.008</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_013">
<mixed-citation publication-type="chapter"><string-name><surname>Eshraqi</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Jalali</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Moattar</surname>, <given-names>M.H.</given-names></string-name> (<year>2015</year>). <chapter-title>Detecting spam tweets in Twitter using a data stream clustering algorithm</chapter-title>. In: <source>Proceedings of the International Congress on Technology, Communication and Knowledge</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>347</fpage>–<lpage>351</lpage>. <isbn>978-1-4673-9762-9</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/ICTCK.2015.7582694" xlink:type="simple">https://doi.org/10.1109/ICTCK.2015.7582694</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_014">
<mixed-citation publication-type="journal"><string-name><surname>Faris</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Al-Zoubi</surname>, <given-names>A.M.</given-names></string-name>, <string-name><surname>Heidari</surname>, <given-names>A.A.</given-names></string-name>, <string-name><surname>Aljarah</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Mafarja</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Hassonah</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Fujita</surname>, <given-names>H.</given-names></string-name> (<year>2019</year>). <article-title>An intelligent system for spam detection and identification of the most relevant features based on evolutionary Random Weight Networks</article-title>. <source>Information Fusion</source>, <volume>48</volume>, <fpage>67</fpage>–<lpage>83</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.inffus.2018.08.002" xlink:type="simple">https://doi.org/10.1016/j.inffus.2018.08.002</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_015">
<mixed-citation publication-type="journal"><string-name><surname>Gama</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Sebastião</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Rodrigues</surname>, <given-names>P.P.</given-names></string-name> (<year>2013</year>). <article-title>On evaluating stream learning algorithms</article-title>. <source>Machine Learning</source>, <volume>90</volume>(<issue>3</issue>), <fpage>317</fpage>–<lpage>346</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10994-012-5320-9" xlink:type="simple">https://doi.org/10.1007/s10994-012-5320-9</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_016">
<mixed-citation publication-type="journal"><string-name><surname>Gama</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Žliobaitė</surname>, <given-names>I.</given-names></string-name>, <string-name><surname>Bifet</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Pechenizkiy</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Bouchachia</surname>, <given-names>A.</given-names></string-name> (<year>2014</year>). <article-title>A survey on concept drift adaptation</article-title>. <source>ACM Computing Surveys</source>, <volume>46</volume>(<issue>4</issue>), <fpage>44</fpage>–<lpage>80</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/2523813" xlink:type="simple">https://doi.org/10.1145/2523813</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_017">
<mixed-citation publication-type="journal"><string-name><surname>García-Méndez</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>de Arriba-Pérez</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Barros-Vila</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>González-Castaño</surname>, <given-names>F.J.</given-names></string-name> (<year>2022</year>a). <article-title>Detection of temporality at discourse level on financial news by combining Natural Language Processing and Machine Learning</article-title>. <source>Expert Systems with Applications</source>, <volume>197</volume>(<issue>1</issue>), <fpage>116648</fpage>–<lpage>116656</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2022.116648" xlink:type="simple">https://doi.org/10.1016/j.eswa.2022.116648</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_018">
<mixed-citation publication-type="journal"><string-name><surname>García-Méndez</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Leal</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Malheiro</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Burguillo-Rial</surname>, <given-names>J.C.</given-names></string-name>, <string-name><surname>Veloso</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Chis</surname>, <given-names>A.E.</given-names></string-name>, <string-name><surname>González-Vélez</surname>, <given-names>H.</given-names></string-name> (<year>2022</year>b). <article-title>Simulation, modelling and classification of wiki contributors: Spotting the good, the bad, and the ugly</article-title>. <source>Simulation Modelling Practice and Theory</source>, <volume>120</volume>, <fpage>102616</fpage>–<lpage>102628</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.simpat.2022.102616" xlink:type="simple">https://doi.org/10.1016/j.simpat.2022.102616</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_019">
<mixed-citation publication-type="chapter"><string-name><surname>Garg</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Girdhar</surname>, <given-names>N.</given-names></string-name> (<year>2021</year>). <chapter-title>A systematic review on spam filtering techniques based on natural language processing framework</chapter-title>. In: <source>Proceedings of the International Conference on Cloud Computing, Data Science &amp; Engineering</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>30</fpage>–<lpage>35</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/Confluence51648.2021.9377042" xlink:type="simple">https://doi.org/10.1109/Confluence51648.2021.9377042</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_020">
<mixed-citation publication-type="journal"><string-name><surname>Gomes</surname>, <given-names>H.M.</given-names></string-name>, <string-name><surname>Bifet</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Read</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Barddal</surname>, <given-names>J.P.</given-names></string-name>, <string-name><surname>Enembreck</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Pfahringer</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Holmes</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Abdessalem</surname>, <given-names>T.</given-names></string-name> (<year>2017</year>). <article-title>Adaptive random forests for evolving data stream classification</article-title>. <source>Machine Learning</source>, <volume>106</volume>(<issue>9–10</issue>), <fpage>1469</fpage>–<lpage>1495</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10994-017-5642-8" xlink:type="simple">https://doi.org/10.1007/s10994-017-5642-8</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_021">
<mixed-citation publication-type="journal"><string-name><surname>Hamida</surname>, <given-names>Z.F.</given-names></string-name>, <string-name><surname>Refouf</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Drif</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Giordano</surname>, <given-names>S.</given-names></string-name> (<year>2022</year>). <article-title>Hybrid-MELAu: a hybrid mixing engineered linguistic features based on autoencoder for social bot detection</article-title>. <source>Informatica</source>, <volume>46</volume>(<issue>6</issue>), <fpage>143</fpage>–<lpage>158</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.31449/inf.v46i6.4081" xlink:type="simple">https://doi.org/10.31449/inf.v46i6.4081</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_022">
<mixed-citation publication-type="journal"><string-name><surname>Han</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Li</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhuang</surname>, <given-names>L.</given-names></string-name> (<year>2022</year>). <article-title>Explainable knowledge integrated sequence model for detecting fake online reviews</article-title>. <source>Applied Intelligence</source>, <volume>53</volume>, <fpage>6953</fpage>–<lpage>6965</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10489-022-03822-8" xlink:type="simple">https://doi.org/10.1007/s10489-022-03822-8</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_023">
<mixed-citation publication-type="journal"><string-name><surname>Henke</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Santos</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Souto</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Santin</surname>, <given-names>A.O.</given-names></string-name> (<year>2021</year>). <article-title>Spam detection based on feature evolution to deal with concept drift</article-title>. <source>Journal of Universal Computer Science</source>, <volume>27</volume>, <fpage>364</fpage>–<lpage>386</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.3897/jucs.66284" xlink:type="simple">https://doi.org/10.3897/jucs.66284</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_024">
<mixed-citation publication-type="journal"><string-name><surname>Hutama</surname>, <given-names>L.B.</given-names></string-name>, <string-name><surname>Suhartono</surname>, <given-names>D.</given-names></string-name> (<year>2022</year>). <article-title>Indonesian hoax news classification with multilingual transformer model and BERTopic</article-title>. <source>Informatica</source>, <volume>46</volume>(<issue>8</issue>), <fpage>81</fpage>–<lpage>90</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.31449/inf.v46i8.4336" xlink:type="simple">https://doi.org/10.31449/inf.v46i8.4336</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_025">
<mixed-citation publication-type="journal"><string-name><surname>Kakar</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Dhaka</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Mehrotra</surname>, <given-names>M.</given-names></string-name> (<year>2021</year>). <article-title>Value-based retweet prediction on Twitter</article-title>. <source>Informatica</source>, <volume>45</volume>(<issue>2</issue>), <fpage>267</fpage>–<lpage>276</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.31449/inf.v45i2.3465" xlink:type="simple">https://doi.org/10.31449/inf.v45i2.3465</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_026">
<mixed-citation publication-type="chapter"><string-name><surname>Karakaşlı</surname>, <given-names>M.S.</given-names></string-name>, <string-name><surname>Aydin</surname>, <given-names>M.A.</given-names></string-name>, <string-name><surname>Yarkan</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Boyaci</surname>, <given-names>A.</given-names></string-name> (<year>2019</year>). <chapter-title>Dynamic feature selection for spam detection in Twitter</chapter-title>. In: <source>Lecture Notes in Electrical Engineering</source>, Vol. <volume>504</volume>. <publisher-name>Springer</publisher-name>, pp. <fpage>239</fpage>–<lpage>250</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-981-13-0408-8_20" xlink:type="simple">https://doi.org/10.1007/978-981-13-0408-8_20</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_027">
<mixed-citation publication-type="journal"><string-name><surname>Kaur</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Singh</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>H.</given-names></string-name> (<year>2018</year>). <article-title>Rise of spam and compromised accounts in online social networks: a state-of-the-art review of different combating approaches</article-title>. <source>Journal of Network and Computer Applications</source>, <volume>112</volume>, <fpage>53</fpage>–<lpage>88</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.jnca.2018.03.015" xlink:type="simple">https://doi.org/10.1016/j.jnca.2018.03.015</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_028">
<mixed-citation publication-type="chapter"><string-name><surname>Leal</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Veloso</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Malheiro</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Burguillo</surname>, <given-names>J.C.</given-names></string-name> (<year>2021</year>). <chapter-title>Crowdsourced data stream mining for tourism recommendation</chapter-title>. In: <source>Advances in Intelligent Systems and Computing</source>, Vol. <volume>1365 AIST</volume>. <publisher-name>Springer</publisher-name>, pp. <fpage>260</fpage>–<lpage>269</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-3-030-72657-7_25" xlink:type="simple">https://doi.org/10.1007/978-3-030-72657-7_25</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_029">
<mixed-citation publication-type="journal"><string-name><surname>Leo</surname>, <given-names>G.D.</given-names></string-name>, <string-name><surname>Sardanelli</surname>, <given-names>F.</given-names></string-name> (<year>2020</year>). <article-title>Statistical significance: <italic>p</italic> value, 0.05 threshold, and applications to radiomics—reasons for a conservative approach</article-title>. <source>European Radiology Experimental</source>, <volume>4</volume>, <fpage>1</fpage>–<lpage>8</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1186/s41747-020-0145-y" xlink:type="simple">https://doi.org/10.1186/s41747-020-0145-y</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_030">
<mixed-citation publication-type="chapter"><string-name><surname>Liu</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Xiang</surname>, <given-names>Y.</given-names></string-name> (<year>2016</year>). <chapter-title>Statistical detection of online drifting Twitter spam: invited paper</chapter-title>. In: <source>Proceedings of the Asia Conference on Computer and Communications Security</source>. <publisher-name>Association for Computing Machinery</publisher-name>, pp. <fpage>1</fpage>–<lpage>10</lpage>. <isbn>9781450342339</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/2897845.2897928" xlink:type="simple">https://doi.org/10.1145/2897845.2897928</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_031">
<mixed-citation publication-type="journal"><string-name><surname>Liu</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>He</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Cai</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Yang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Zhu</surname>, <given-names>N.</given-names></string-name> (<year>2019</year>). <article-title>A method for the detection of fake reviews based on temporal features of reviews and comments</article-title>. <source>IEEE Engineering Management Review</source>, <volume>47</volume>, <fpage>67</fpage>–<lpage>79</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/EMR.2019.2928964" xlink:type="simple">https://doi.org/10.1109/EMR.2019.2928964</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_032">
<mixed-citation publication-type="journal"><string-name><surname>Lu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Dong</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Gu</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Gama</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>G.</given-names></string-name> (<year>2018</year>). <article-title>Learning under concept drift: a review</article-title>. <source>IEEE Transactions on Knowledge and Data Engineering</source>, <volume>31</volume>(<issue>12</issue>), <fpage>2346</fpage>–<lpage>2363</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TKDE.2018.2876857" xlink:type="simple">https://doi.org/10.1109/TKDE.2018.2876857</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_033">
<mixed-citation publication-type="journal"><string-name><surname>Ma</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>F.-c.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>S.</given-names></string-name> (<year>2023</year>). <article-title>Research on diversity and accuracy of the recommendation system based on multi-objective optimization</article-title>. <source>Neural Computing and Applications</source>, <volume>35</volume>, <fpage>5155</fpage>–<lpage>5163</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s00521-020-05438-w" xlink:type="simple">https://doi.org/10.1007/s00521-020-05438-w</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_034">
<mixed-citation publication-type="chapter"><string-name><surname>Madaan</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Manjunatha</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Nambiar</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Goel</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Kumar</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Saha</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Bedathur</surname>, <given-names>S.</given-names></string-name> (<year>2023</year>). <chapter-title>DetAIL: a tool to automatically detect and analyze drift in language</chapter-title>. In: <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>, Vol. <volume>37</volume>. <publisher-name>Association for the Advancement of Artificial Intelligence</publisher-name>, pp. <fpage>15767</fpage>–<lpage>15773</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1609/aaai.v37i13.26872" xlink:type="simple">https://doi.org/10.1609/aaai.v37i13.26872</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_035">
<mixed-citation publication-type="journal"><string-name><surname>Miller</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Dickinson</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Deitrick</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Hu</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Wang</surname>, <given-names>A.H.</given-names></string-name> (<year>2014</year>). <article-title>Twitter spammer detection using data stream clustering</article-title>. <source>Information Sciences</source>, <volume>260</volume>, <fpage>64</fpage>–<lpage>73</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.ins.2013.11.016" xlink:type="simple">https://doi.org/10.1016/j.ins.2013.11.016</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_036">
<mixed-citation publication-type="journal"><string-name><surname>Mohawesh</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Tran</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Ollington</surname>, <given-names>R.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>S.</given-names></string-name> (<year>2021</year>). <article-title>Analysis of concept drift in fake reviews detection</article-title>. <source>Expert Systems with Applications</source>, <volume>169</volume>, <elocation-id>114318</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2020.114318" xlink:type="simple">https://doi.org/10.1016/j.eswa.2020.114318</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_037">
<mixed-citation publication-type="chapter"><string-name><surname>Pham</surname>, <given-names>X.C.</given-names></string-name>, <string-name><surname>Dang</surname>, <given-names>M.T.</given-names></string-name>, <string-name><surname>Dinh</surname>, <given-names>S.V.</given-names></string-name>, <string-name><surname>Hoang</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Nguyen</surname>, <given-names>T.T.</given-names></string-name>, <string-name><surname>Liew</surname>, <given-names>A.W.-C.</given-names></string-name> (<year>2017</year>). <chapter-title>Learning from data stream based on random projection and Hoeffding tree classifier</chapter-title>. In: <source>Proceedings of the International Conference on Digital Image Computing: Techniques and Applications</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>1</fpage>–<lpage>8</lpage>. <isbn>978-1-5386-2839-3</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/DICTA.2017.8227456" xlink:type="simple">https://doi.org/10.1109/DICTA.2017.8227456</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_038">
<mixed-citation publication-type="journal"><string-name><surname>Rao</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Verma</surname>, <given-names>A.K.</given-names></string-name>, <string-name><surname>Bhatia</surname>, <given-names>T.</given-names></string-name> (<year>2021</year>). <article-title>A review on social spam detection: challenges, open issues, and future directions</article-title>. <source>Expert Systems with Applications</source>, <volume>186</volume>, <elocation-id>115742</elocation-id>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2021.115742" xlink:type="simple">https://doi.org/10.1016/j.eswa.2021.115742</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_039">
<mixed-citation publication-type="journal"><string-name><surname>Rathore</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Soni</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Prabakar</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Palaniswami</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Santi</surname>, <given-names>P.</given-names></string-name> (<year>2021</year>). <article-title>Identifying groups of fake reviewers using a semisupervised approach</article-title>. <source>IEEE Transactions on Computational Social Systems</source>, <volume>8</volume>(<issue>6</issue>), <fpage>1369</fpage>–<lpage>1378</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/TCSS.2021.3085406" xlink:type="simple">https://doi.org/10.1109/TCSS.2021.3085406</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_040">
<mixed-citation publication-type="chapter"><string-name><surname>Reis</surname>, <given-names>J.C.S.</given-names></string-name>, <string-name><surname>Correia</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Murai</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Veloso</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Benevenuto</surname>, <given-names>F.</given-names></string-name> (<year>2019</year>). <chapter-title>Explainable machine learning for fake news detection</chapter-title>. In: <source>Proceedings of the ACM Conference on Web Science</source>. <publisher-name>Association for Computing Machinery</publisher-name>, pp. <fpage>17</fpage>–<lpage>26</lpage>. <isbn>9781450362023</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/3292522.3326027" xlink:type="simple">https://doi.org/10.1145/3292522.3326027</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_041">
<mixed-citation publication-type="journal"><string-name><surname>Reyes-Menendez</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Saura</surname>, <given-names>J.R.</given-names></string-name>, <string-name><surname>Filipe</surname>, <given-names>F.</given-names></string-name> (<year>2019</year>). <article-title>The importance of behavioral data to identify online fake reviews for tourism businesses: a systematic review</article-title>. <source>PeerJ Computer Science</source>, <volume>5</volume>, <fpage>1</fpage>–<lpage>21</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.7717/peerj-cs.219" xlink:type="simple">https://doi.org/10.7717/peerj-cs.219</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_042">
<mixed-citation publication-type="chapter"><string-name><surname>Ribeiro</surname>, <given-names>M.T.</given-names></string-name>, <string-name><surname>Singh</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Guestrin</surname>, <given-names>C.</given-names></string-name> (<year>2016</year>). <chapter-title>“Why Should I Trust You?”: explaining the predictions of any classifier</chapter-title>. In: <source>Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>. <publisher-name>Association for Computing Machinery</publisher-name>, pp. <fpage>1135</fpage>–<lpage>1144</lpage>. <isbn>9781450342322</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1145/2939672.2939778" xlink:type="simple">https://doi.org/10.1145/2939672.2939778</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_043">
<mixed-citation publication-type="journal"><string-name><surname>Ritu Aggrawal</surname>, <given-names>S.P.</given-names></string-name> (<year>2021</year>). <article-title>Elimination and backward selection of features (P-value technique) in prediction of heart disease by using machine learning algorithms</article-title>. <source>Turkish Journal of Computer and Mathematics Education</source>, <volume>12</volume>(<issue>6</issue>), <fpage>2650</fpage>–<lpage>2665</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.17762/turcomat.v12i6.5765" xlink:type="simple">https://doi.org/10.17762/turcomat.v12i6.5765</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_044">
<mixed-citation publication-type="journal"><string-name><surname>Rudin</surname>, <given-names>C.</given-names></string-name> (<year>2019</year>). <article-title>Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead</article-title>. <source>Nature Machine Intelligence</source>, <volume>1</volume>, <fpage>206</fpage>–<lpage>215</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1038/s42256-019-0048-x" xlink:type="simple">https://doi.org/10.1038/s42256-019-0048-x</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_045">
<mixed-citation publication-type="journal"><string-name><surname>Rustam</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Khalid</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Aslam</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>Rupapara</surname>, <given-names>V.</given-names></string-name>, <string-name><surname>Mehmood</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Choi</surname>, <given-names>G.S.</given-names></string-name> (<year>2021</year>). <article-title>A performance comparison of supervised machine learning models for Covid-19 tweets sentiment analysis</article-title>. <source>PLoS One</source>, <volume>16</volume>(<issue>2</issue>), <fpage>1</fpage>–<lpage>23</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1371/journal.pone.0245909" xlink:type="simple">https://doi.org/10.1371/journal.pone.0245909</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_046">
<mixed-citation publication-type="journal"><string-name><surname>Škrlj</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Martinc</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Lavrač</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Pollak</surname>, <given-names>S.</given-names></string-name> (<year>2021</year>). <article-title>autoBOT: evolving neuro-symbolic representations for explainable low resource text classification</article-title>. <source>Machine Learning</source>, <volume>110</volume>(<issue>5</issue>), <fpage>989</fpage>–<lpage>1028</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/s10994-021-05968-x" xlink:type="simple">https://doi.org/10.1007/s10994-021-05968-x</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_047">
<mixed-citation publication-type="journal"><string-name><surname>Solari</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Egüen</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Polo</surname>, <given-names>M.J.</given-names></string-name>, <string-name><surname>Losada</surname>, <given-names>M.A.</given-names></string-name> (<year>2017</year>). <article-title>Peaks Over Threshold (POT): a methodology for automatic threshold estimation using goodness of fit p-value</article-title>. <source>Water Resources Research</source>, <volume>53</volume>(<issue>4</issue>), <fpage>2833</fpage>–<lpage>2849</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/2016WR019426" xlink:type="simple">https://doi.org/10.1002/2016WR019426</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_048">
<mixed-citation publication-type="journal"><string-name><surname>Song</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Ye</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Lau</surname>, <given-names>R.Y.K.</given-names></string-name>, <string-name><surname>Liu</surname>, <given-names>F.</given-names></string-name> (<year>2016</year>). <article-title>Dynamic clustering forest: an ensemble framework to efficiently classify textual data stream with concept drift</article-title>. <source>Information Sciences</source>, <volume>357</volume>, <fpage>125</fpage>–<lpage>143</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.ins.2016.03.043" xlink:type="simple">https://doi.org/10.1016/j.ins.2016.03.043</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_049">
<mixed-citation publication-type="chapter"><string-name><surname>Stirling</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Koh</surname>, <given-names>Y.S.</given-names></string-name>, <string-name><surname>Fournier-Viger</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Ravana</surname>, <given-names>S.D.</given-names></string-name> (<year>2018</year>). <chapter-title>Concept drift detector selection for Hoeffding adaptive trees</chapter-title>. In: <source>Proceedings of the Australasian Joint Conference on Artificial Intelligence</source>, Vol. <volume>11320</volume>. <publisher-name>Springer</publisher-name>, pp. <fpage>730</fpage>–<lpage>736</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-3-030-03991-2_65" xlink:type="simple">https://doi.org/10.1007/978-3-030-03991-2_65</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_050">
<mixed-citation publication-type="chapter"><string-name><surname>Stites</surname>, <given-names>M.C.</given-names></string-name>, <string-name><surname>Nyre-Yu</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Moss</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Smutz</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Smith</surname>, <given-names>M.R.</given-names></string-name> (<year>2021</year>). <chapter-title>Sage advice? The impacts of explanations for machine learning models on human decision-making in spam detection</chapter-title>. In: <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>, Vol. <volume>12797 LNAI</volume>. <publisher-name>Springer</publisher-name>, pp. <fpage>269</fpage>–<lpage>284</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1007/978-3-030-77772-2_18" xlink:type="simple">https://doi.org/10.1007/978-3-030-77772-2_18</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_051">
<mixed-citation publication-type="journal"><string-name><surname>Sun</surname>, <given-names>N.</given-names></string-name>, <string-name><surname>Lin</surname>, <given-names>G.</given-names></string-name>, <string-name><surname>Qiu</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Rimba</surname>, <given-names>P.</given-names></string-name> (<year>2022</year>). <article-title>Near real-time Twitter spam detection with machine learning techniques</article-title>. <source>International Journal of Computers and Applications</source>, <volume>44</volume>(<issue>4</issue>), <fpage>338</fpage>–<lpage>348</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1080/1206212X.2020.1751387" xlink:type="simple">https://doi.org/10.1080/1206212X.2020.1751387</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_052">
<mixed-citation publication-type="journal"><string-name><surname>Treistman</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Mughaz</surname>, <given-names>D.</given-names></string-name>, <string-name><surname>Stulman</surname>, <given-names>A.</given-names></string-name>, <string-name><surname>Dvir</surname>, <given-names>A.</given-names></string-name> (<year>2022</year>). <article-title>Word embedding dimensionality reduction using dynamic variance thresholding (DyVaT)</article-title>. <source>Expert Systems with Applications</source>, <volume>208</volume>, <fpage>118157</fpage>–<lpage>118170</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.eswa.2022.118157" xlink:type="simple">https://doi.org/10.1016/j.eswa.2022.118157</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_053">
<mixed-citation publication-type="chapter"><string-name><surname>Upadhyay</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Abu-Rasheed</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Weber</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Fathi</surname>, <given-names>M.</given-names></string-name> (<year>2021</year>). <chapter-title>Explainable job-posting recommendations using knowledge graphs and named entity recognition</chapter-title>. In: <source>Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>3291</fpage>–<lpage>3296</lpage>. <isbn>978-1-6654-4207-7</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/SMC52423.2021.9658757" xlink:type="simple">https://doi.org/10.1109/SMC52423.2021.9658757</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_054">
<mixed-citation publication-type="journal"><string-name><surname>Vaitkevicius</surname>, <given-names>P.</given-names></string-name>, <string-name><surname>Marcinkevicius</surname>, <given-names>V.</given-names></string-name> (<year>2020</year>). <article-title>Comparison of classification algorithms for detection of phishing websites</article-title>. <source>Informatica</source>, <volume>31</volume>(<issue>1</issue>), <fpage>143</fpage>–<lpage>160</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.15388/20-INFOR404" xlink:type="simple">https://doi.org/10.15388/20-INFOR404</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_055">
<mixed-citation publication-type="journal"><string-name><surname>Veloso</surname>, <given-names>B.M.</given-names></string-name>, <string-name><surname>Leal</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Malheiro</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Burguillo</surname>, <given-names>J.C.</given-names></string-name> (<year>2019</year>). <article-title>On-line guest profiling and hotel recommendation</article-title>. <source>Electronic Commerce Research and Applications</source>, <volume>34</volume>, <fpage>100832</fpage>–<lpage>100841</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.elerap.2019.100832" xlink:type="simple">https://doi.org/10.1016/j.elerap.2019.100832</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_056">
<mixed-citation publication-type="journal"><string-name><surname>Veloso</surname>, <given-names>B.M.</given-names></string-name>, <string-name><surname>Leal</surname>, <given-names>F.</given-names></string-name>, <string-name><surname>Malheiro</surname>, <given-names>B.</given-names></string-name>, <string-name><surname>Burguillo</surname>, <given-names>J.C.</given-names></string-name> (<year>2020</year>). <article-title>A 2020 perspective on “Online guest profiling and hotel recommendation”: reliability, scalability, traceability and transparency</article-title>. <source>Electronic Commerce Research and Applications</source>, <volume>40</volume>, <fpage>100957</fpage>–<lpage>100958</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.elerap.2020.100957" xlink:type="simple">https://doi.org/10.1016/j.elerap.2020.100957</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_057">
<mixed-citation publication-type="journal"><string-name><surname>Wang</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Han</surname>, <given-names>L.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>M.</given-names></string-name>, <string-name><surname>Qian</surname>, <given-names>W.</given-names></string-name>, <string-name><surname>An</surname>, <given-names>D.</given-names></string-name> (<year>2021</year>). <article-title>Adaptive evaluation model of web spam based on link relation</article-title>. <source>Transactions on Emerging Telecommunications Technologies</source>, <volume>32</volume>(<issue>5</issue>), <fpage>1</fpage>–<lpage>13</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1002/ett.4047" xlink:type="simple">https://doi.org/10.1002/ett.4047</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_058">
<mixed-citation publication-type="journal"><string-name><surname>Wang</surname>, <given-names>X.</given-names></string-name>, <string-name><surname>Kang</surname>, <given-names>Q.</given-names></string-name>, <string-name><surname>An</surname>, <given-names>J.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>M.</given-names></string-name> (<year>2019</year>). <article-title>Drifted Twitter spam classification using multiscale detection test on K-L divergence</article-title>. <source>IEEE Access</source>, <volume>7</volume>, <fpage>108384</fpage>–<lpage>108394</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/ACCESS.2019.2932018" xlink:type="simple">https://doi.org/10.1109/ACCESS.2019.2932018</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_059">
<mixed-citation publication-type="journal"><string-name><surname>Wu</surname>, <given-names>T.</given-names></string-name>, <string-name><surname>Wen</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Xiang</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Zhou</surname>, <given-names>W.</given-names></string-name> (<year>2018</year>). <article-title>Twitter spam detection: survey of new approaches and comparative study</article-title>. <source>Computers &amp; Security</source>, <volume>76</volume>, <fpage>265</fpage>–<lpage>284</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1016/j.cose.2017.11.013" xlink:type="simple">https://doi.org/10.1016/j.cose.2017.11.013</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_060">
<mixed-citation publication-type="chapter"><string-name><surname>Wu</surname>, <given-names>Y.</given-names></string-name>, <string-name><surname>Sharma</surname>, <given-names>K.</given-names></string-name>, <string-name><surname>Seah</surname>, <given-names>C.</given-names></string-name>, <string-name><surname>Zhang</surname>, <given-names>S.</given-names></string-name> (<year>2023</year>). <chapter-title>SentiStream: a co-training framework for adaptive online sentiment analysis in evolving data streams</chapter-title>. In: <source>Proceedings of the Conference on Empirical Methods in Natural Language Processing</source>. <publisher-name>Association for Computational Linguistics</publisher-name>, pp. <fpage>6198</fpage>–<lpage>6212</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.18653/v1/2023.emnlp-main.380" xlink:type="simple">https://doi.org/10.18653/v1/2023.emnlp-main.380</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_061">
<mixed-citation publication-type="journal"><string-name><surname>Zhang</surname>, <given-names>K.Z.K.</given-names></string-name>, <string-name><surname>Xu</surname>, <given-names>H.</given-names></string-name>, <string-name><surname>Zhao</surname>, <given-names>S.</given-names></string-name>, <string-name><surname>Yu</surname>, <given-names>Y.</given-names></string-name> (<year>2018</year>). <article-title>Online reviews and impulse buying behavior: the role of browsing and impulsiveness</article-title>. <source>Internet Research</source>, <volume>28</volume>(<issue>3</issue>), <fpage>522</fpage>–<lpage>543</lpage>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1108/IntR-12-2016-0377" xlink:type="simple">https://doi.org/10.1108/IntR-12-2016-0377</ext-link>.</mixed-citation>
</ref>
<ref id="j_infor562_ref_062">
<mixed-citation publication-type="chapter"><string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>, <string-name><surname>Damiani</surname>, <given-names>E.</given-names></string-name>, <string-name><surname>Hamadi</surname>, <given-names>H.A.</given-names></string-name>, <string-name><surname>Yeun</surname>, <given-names>C.Y.</given-names></string-name>, <string-name><surname>Taher</surname>, <given-names>F.</given-names></string-name> (<year>2022</year>). <chapter-title>Explainable artificial intelligence to detect image spam using convolutional neural network</chapter-title>. In: <source>Proceedings of the International Conference on Cyber Resilience</source>. <publisher-name>IEEE</publisher-name>, pp. <fpage>1</fpage>–<lpage>5</lpage>. <isbn>978-1-6654-6122-1</isbn>. <ext-link ext-link-type="doi" xlink:href="https://doi.org/10.1109/ICCR56254.2022.9995839" xlink:type="simple">https://doi.org/10.1109/ICCR56254.2022.9995839</ext-link>.</mixed-citation>
</ref>
</ref-list>
</back>
</article>
