Informatica


Online Detection and Infographic Explanation of Spam Reviews with Data Drift Adaptation
Volume 35, Issue 3 (2024), pp. 483–507
Francisco de Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan C. Burguillo

https://doi.org/10.15388/24-INFOR562
Pub. online: 17 June 2024. Type: Research Article. Open Access.

Received: 1 June 2023
Accepted: 1 June 2024
Published: 17 June 2024

Abstract

Spam reviews are a pervasive problem on online platforms due to their significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern is the need for transparency in spam detection solutions. Consequently, this paper addresses both problems by proposing an online solution for identifying and explaining spam reviews that incorporates data drift adaptation. It integrates (i) incremental profiling, (ii) data drift detection and adaptation, and (iii) identification of spam reviews employing Machine Learning. The explainability mechanism displays a visual and textual explanation of each prediction in a dashboard. The best results reached up to 87% spam F-measure.
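The three stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature name `caps_ratio`, the synthetic stream, the choice of a Page-Hinkley test on the error stream, the logistic-regression learner, and the reset-on-drift adaptation are all assumptions made for illustration only.

```python
import math
from collections import defaultdict


class IncrementalProfile:
    """Stage (i): running per-user mean of numeric review features."""

    def __init__(self):
        self.count = defaultdict(int)
        self.mean = defaultdict(dict)

    def update(self, user, features):
        self.count[user] += 1
        n = self.count[user]
        profile = self.mean[user]
        for name, value in features.items():
            old = profile.get(name, 0.0)
            profile[name] = old + (value - old) / n  # incremental mean
        return dict(profile)


class PageHinkley:
    """Stage (ii): Page-Hinkley test for an increase in the error rate."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta, self.threshold = delta, threshold
        self.reset()

    def reset(self):
        self.n, self.mean, self.cum, self.min_cum = 0, 0.0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold


class OnlineLogReg:
    """Stage (iii): logistic regression trained one sample at a time."""

    def __init__(self, lr=0.1):
        self.lr, self.w, self.b = lr, defaultdict(float), 0.0

    def predict_proba(self, x):
        z = self.b + sum(self.w[k] * v for k, v in x.items())
        z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, y):
        err = self.predict_proba(x) - y
        for k, v in x.items():
            self.w[k] -= self.lr * err * v
        self.b -= self.lr * err


def synthetic_stream(n=400):
    """Hypothetical stream: the spam concept flips halfway through."""
    for i in range(n):
        caps = 1.0 if i % 2 == 0 else 0.0
        label = int(caps == 1.0)
        if i >= n // 2:
            label = 1 - label  # simulated drift
        yield f"user{i}", {"caps_ratio": caps}, label


def run_stream(stream):
    profile, detector, model = IncrementalProfile(), PageHinkley(), OnlineLogReg()
    drifts, correct, total = [], 0, 0
    for i, (user, features, label) in enumerate(stream):
        x = profile.update(user, features)          # (i) update the profile
        pred = int(model.predict_proba(x) >= 0.5)   # (iii) predict first
        correct += pred == label
        total += 1
        model.learn(x, label)                       # then train on the label
        if detector.update(float(pred != label)):   # (ii) monitor the errors
            drifts.append(i)
            model = OnlineLogReg()                  # adapt: restart the model
            detector.reset()
    return correct / total, drifts


if __name__ == "__main__":
    accuracy, drifts = run_stream(synthetic_stream())
    print(f"accuracy={accuracy:.2f}, drift detected at {drifts}")
```

Each review is scored before its label is used for training, so the error stream fed to the detector reflects performance on unseen data. Resetting the model is the simplest possible adaptation; a production system would more likely replace or reweight parts of an ensemble.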


Biographies

de Arriba-Pérez Francisco
farriba@gti.uvigo.es

F. de Arriba-Pérez received a BS degree in telecommunication technologies engineering in 2013, an MS degree in telecommunication engineering in 2014, and a PhD in 2019 from the University of Vigo, Spain. He is currently a researcher in the Information Technologies Group at the University of Vigo, Spain. His research includes the development of machine learning solutions for different domains like finance and health.

García-Méndez Silvia
sgarcia@gti.uvigo.es

S. García-Méndez received a PhD in information and communication technologies from the University of Vigo in 2021. Since 2015, she has worked as a researcher with the Information Technologies Group at the University of Vigo. She is collaborating with foreign research centres as part of her postdoctoral stay. Her research interests include natural language processing techniques and machine learning algorithms.

Leal Fátima
fatimal@upt.pt

F. Leal holds a PhD in information and communication technologies from the University of Vigo, Spain. She is an auxiliary professor at Universidade Portucalense in Porto, Portugal, and a researcher at REMIT (Research on Economics, Management, and Information Technologies). Her research is based on crowdsourced information, including trust and reputation, big data, data streams, and recommendation systems. Recently, she has been exploring blockchain technologies for responsible data processing.

Malheiro Benedita
mbm@isep.ipp.pt

B. Malheiro is a coordinator professor at Instituto Superior de Engenharia do Porto, the School of Engineering of the Polytechnic of Porto, and a senior researcher at INESC TEC, Porto, Portugal. She holds a PhD and an MSc in electrical engineering and computers and a five-year degree in electrical engineering from the University of Porto. Her research interests include artificial intelligence, computer science, and engineering education. She is a member of the Association for the Advancement of Artificial Intelligence (AAAI), the Portuguese Association for Artificial Intelligence (APPIA), the Association for Computing Machinery (ACM), and the Professional Association of Portuguese Engineers (OE).

Burguillo Juan C.
J.C.Burguillo@uvigo.es

J.C. Burguillo received an MSc degree in telecommunication engineering and a PhD degree in telematics at the University of Vigo, Spain. He is currently a full professor at the Department of Telematic Engineering and a researcher at the AtlanTTic Research Center in Telecom Technologies at the University of Vigo. He is the area editor of the journal Simulation Modelling Practice and Theory (SIMPAT), and his topics of interest are intelligent systems, evolutionary game theory, self-organization, and complex adaptive systems.



Copyright
© 2024 Vilnius University
Open access article under the CC BY license.

Keywords
data drift; interpretability and explainability; Natural Language Processing; online machine learning; spam detection

Funding
This work was partially supported by: (i) Xunta de Galicia grants ED481B-2021-118 and ED481B-2022-093, Spain; and (ii) Portuguese national funds through FCT – Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) – as part of project UIDP/50014/2020 (https://doi.org/10.54499/UIDP/50014/2020).

INFORMATICA

  • Online ISSN: 1822-8844
  • Print ISSN: 0868-4952