Informatica
Open Llama2 Models for the Lithuanian Language
Artūras Nakvosas, Povilas Daniušis, Vytas Mulevičius

https://doi.org/10.15388/25-INFOR592
Pub. online: 25 April 2025 · Type: Research Article · Open Access

Received: 1 September 2024
Accepted: 1 April 2025
Published: 25 April 2025

Abstract

In this paper, we focus on the question of whether efficient Lithuanian large language models (LLMs) can be derived from Llama2 LLMs, which lack Lithuanian-specific components. Although the Llama2 architecture has previously been used successfully to derive various regional LLMs, we propose and describe the first open Llama2 LLMs for the Lithuanian language (7- and 13-billion-parameter versions), an accompanying question/answer (Q/A) dataset, and translations of popular language understanding benchmarks (Arc, Belebele, Hellaswag, MMLU, TruthfulQA, and Winogrande), which contribute to the standardisation of Lithuanian LLM evaluation. We evaluate the proposed models empirically by measuring their perplexity and their performance on the translated language understanding benchmarks. The perplexity experiments show that perplexity decreases consistently during pretraining, reflecting improved next-token prediction. Benchmarking the proposed LLMs on language understanding tasks indicates that high-quality pretraining datasets may be essential for models that perform efficiently on these benchmarks. Comparison of the proposed LLMs with the latest open multilingual LLMs shows that our 13-billion-parameter model ranks 4th of 8 models on tasks such as Arc, Hellaswag, and Winogrande, but is generally outperformed on the other tasks. These benchmarks lead us to hypothesise that more efficient Lithuanian language models can be derived from recent LLMs in the future. The complete realisations of the LLMs and the other contributed components are available in the accompanying open repository https://huggingface.co/neurotechnology.
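The abstract uses perplexity as the measure of pretraining progress. As a minimal illustrative sketch (not taken from the paper), the snippet below shows the standard way token-level perplexity of a Hugging Face causal LM is computed on a piece of Lithuanian text; the model id is a placeholder, since the exact checkpoint names published under https://huggingface.co/neurotechnology are not listed on this page.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model id: the concrete checkpoints live under
# https://huggingface.co/neurotechnology and are not specified here.
MODEL_ID = "neurotechnology/<lithuanian-llama2-checkpoint>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

text = "Lietuvos sostinė yra Vilnius."  # any Lithuanian evaluation text
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # With labels equal to input_ids, the model returns the mean
    # next-token cross-entropy; perplexity is its exponential.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```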

Biographies

Nakvosas Artūras
arturas@neurotechnology.com

A. Nakvosas was born in Lithuania in 1986. He received a bachelor’s degree in 2009 and a master’s degree in 2012 from Šiauliai University. Since 2013, he has been working in R&D at Neurotechnology. His work spans machine learning and biometric systems, including fingerprint, iris, and facial recognition, and has been recognized in NIST evaluations. In recent years, his research has expanded to natural language processing, with a focus on speech-to-text, text-to-speech, and large language models. His interests include deep learning, transformer architectures, signal processing, and software engineering.

Daniušis Povilas
povilasd@neurotechnology.com

P. Daniušis was born in Lithuania in 1983. He received a bachelor’s degree (mathematics) from Šiauliai University in 2005, a master’s degree (mathematics) from Vilnius University in 2007, and a PhD (computer science) from Vilnius University in 2012. He has been working at Neurotechnology since 2010. His research interests include AI, artificial neural networks, adaptive robotics, causal inference, and statistical dependence estimation.

Mulevičius Vytas
vytas.mulevicius@neurotechnology.com

V. Mulevičius (born in 1997) earned his bachelor’s degree in computer science from the University of Birmingham in 2020. During his studies, he developed a strong interest in artificial intelligence and natural language processing (NLP), which led him to join the NLP team at Neurotechnology in 2018. Since then, he has been actively involved in the development of language technologies, contributing to various projects ranging from speech recognition and text analysis to the creation of large-scale language models.



Copyright
© 2025 Vilnius University
Open access article under the CC BY license.

Keywords
Llama2, Regional LLMs, LLMs for the Lithuanian language
