4.1 Task Formulation
The main goal of this research is to develop the first approach for propaganda technique detection in Lithuanian. Propaganda technique detection is typically formulated following the SemEval framework (Da San Martino et al., 2020), which splits the task into two subtasks: (i) span identification and (ii) technique classification. In this pipeline, the algorithm first identifies text fragments containing propaganda and then assigns a specific propaganda technique label to each detected span using multi-class classification.
However, the HALT-PROP corpus differs substantially from the datasets used in SemEval-style propaganda technique detection frameworks. In particular, our analysis shows that propaganda techniques in HALT-PROP frequently overlap and are typically annotated as longer textual fragments, often covering large parts of sentences. In contrast, in the PTC-SemEval20 corpus, overlapping annotations are relatively rare: only about 1.8% of spans are associated with multiple techniques (Da San Martino et al., 2020). Because of this small proportion of overlapping cases, the SemEval task formulation simplified the problem by treating technique classification as a single-label multi-class task. When a span was annotated with multiple techniques, the dataset included duplicate instances of the same span, each associated with one label, effectively avoiding a multi-label formulation.
Another important difference concerns the length of annotated spans. In the PTC-SemEval20 corpus, most techniques are typically expressed through short lexical cues and therefore correspond to relatively short spans. In contrast, the HALT-PROP corpus contains a much higher degree of overlap between techniques, with more than 50% of spans overlapping with other techniques, and annotations often covering larger textual units such as full clauses or sentences. Because of this high overlap and broader span coverage, applying the standard SemEval pipeline directly would be problematic. In particular, reducing the problem to a single-label classification task would fail to capture the multi-technique nature of many annotated fragments. Therefore, the detection approach must be adapted to account for the multi-label and highly overlapping structure of propaganda annotations in the HALT-PROP corpus.
Based on the insights discussed earlier, we adopt the standard two-subtask framework consisting of span identification and technique classification, while incorporating adaptations that account for the nuances of the HALT-PROP corpus. For the span identification task, we follow the same standard formulation used in prior approaches. However, the second subtask, technique classification, is modified to address the high degree of overlap between techniques and the longer annotated fragments present in our corpus. As a result, our approach consists of the following two subtasks:
1. Span Identification. Given a text, the task is to identify spans that contain at least one propaganda technique. This task is formulated as a token-level sequence tagging problem.
2. Technique Classification. Given spans identified as containing propaganda techniques, the task is to determine which specific technique is expressed in each span. Due to the high overlap between techniques and the presence of long annotated fragments that often cover entire sentences in our corpus, we fine-tune a separate model for each technique. Consequently, the task is formulated as a binary sentence classification problem, where a sentence is assigned label 1 if it contains the target technique and 0 otherwise.
In addition to developing models for the selected subtasks, we also investigate several task-specific research questions and modelling decisions. In particular, we examine different sequence tagging approaches for span identification and explore alternative formulations of the technique classification task, including whether sentence classification should be performed on annotated spans or on all sentences, as well as whether a per-technique binary or a joint multi-label formulation is more appropriate. Furthermore, we define the evaluation metrics used to assess the performance of the proposed approaches. The details of these investigations and modelling choices are described in the subsections dedicated to each task.
4.3 Technique Classification Task
This task is formulated as a binary sentence classification problem. We investigate two settings: (i) a setting without span information, in which all sentences in an article are considered, and (ii) a setting with span information, in which only sentences containing annotated propaganda spans, i.e. text fragments labelled with at least one propaganda technique, are considered.
The data for this task is prepared as follows. First, documents are split into sentences. Then, binary labels are assigned for each propaganda technique separately. In the span-sentence setting, sentences that do not overlap with any annotated span are excluded; in other words, sentences that receive zero labels for all techniques are removed. An example of the data preparation process for this task is illustrated in Fig. 6.

Fig. 6
Illustration of converting span-level annotations into sentence-level binary labels. Highlighted spans correspond to annotated techniques. The tables show the resulting sentence-level labels when using all sentences and when using only sentences containing annotated spans. The English translation of the example is provided in Appendix B.
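The labelling procedure described above can be sketched as follows. This is a minimal illustration rather than the preprocessing code used in the study: it assumes that sentences and annotated spans are given as character offsets with exclusive end positions, and the function name `sentence_labels` and the technique names `technique_A` and `technique_B` are hypothetical placeholders.

```python
from typing import Dict, List, Tuple

# Assumed representation: (start_char, end_char, technique), end exclusive.
Span = Tuple[int, int, str]


def sentence_labels(
    sentences: List[Tuple[int, int]],
    spans: List[Span],
    techniques: List[str],
    only_span_sentences: bool = False,
) -> List[Dict[str, int]]:
    """Return one {technique: 0/1} row per kept sentence."""
    rows = []
    for s_start, s_end in sentences:
        row = {t: 0 for t in techniques}
        for a_start, a_end, tech in spans:
            # Two character ranges overlap if each starts before the other ends.
            if a_start < s_end and s_start < a_end:
                row[tech] = 1
        # Span-sentence setting: drop sentences with zero labels for all techniques.
        if only_span_sentences and not any(row.values()):
            continue
        rows.append(row)
    return rows


# Toy document with three sentences and two overlapping annotated spans.
sentences = [(0, 20), (21, 50), (51, 80)]
spans = [(5, 30, "technique_A"), (25, 45, "technique_B")]
techniques = ["technique_A", "technique_B"]

all_rows = sentence_labels(sentences, spans, techniques)
span_rows = sentence_labels(sentences, spans, techniques, only_span_sentences=True)
```

In this toy example the first span crosses a sentence boundary, so both the first and the second sentence receive a positive label for `technique_A`; the third sentence overlaps no span and is kept only in the all-sentences setting.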
In this task, we investigate several research questions to assess whether formulating technique detection as a sentence-level classification task is appropriate for our corpus. First, we evaluate the effect of span-based filtering by comparing models trained on all sentences with models trained only on sentences containing propaganda spans. Second, we examine whether a binary classification formulation is more suitable than a joint formulation over all techniques. To this end, we also fine-tune models in a multi-label setting, where each sentence may receive multiple technique labels.
Overall, for the technique classification task we fine-tune the same transformer-based model under several experimental configurations:
• Fine-tuning the transformer model separately for each technique using all sentences, formulated as a binary classification task.
• Fine-tuning the transformer model separately for each technique using only sentences that contain propaganda spans (i.e. sentences containing at least one propaganda technique). When a span starts or ends in the middle of a sentence, the entire sentence is still used for classification.
• Fine-tuning the transformer model jointly for all techniques as a multi-label classification task using all sentences.
• Fine-tuning the transformer model jointly for all techniques as a multi-label classification task using only sentences containing propaganda spans.
4.3.1 Evaluation Metrics for Sentence-Level Technique Classification
To compare the binary and multi-label approaches, predictions are evaluated separately for each propaganda technique as a binary classification task. We use macro-F1 as the main evaluation metric. For each technique, macro-F1 is computed as the unweighted average of the F1 scores for the positive and negative classes:
$$\text{macro-F1}=\frac{{\mathrm{F}\mathrm{1}_{+}}+{\mathrm{F}\mathrm{1}_{-}}}{2},$$
where ${\mathrm{F}\mathrm{1}_{+}}$ denotes the F1 score for the positive class, corresponding to sentences containing the target technique, and ${\mathrm{F}\mathrm{1}_{-}}$ denotes the F1 score for the negative class, corresponding to sentences not containing the target technique. This metric is more suitable than accuracy in the presence of class imbalance, since it assigns equal importance to both classes.
Accuracy is also reported as a supplementary metric to provide a general indication of the proportion of correctly classified instances.
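The difference between the two metrics under class imbalance can be illustrated with a short sketch. The helper names (`f1_for_class`, `binary_macro_f1`, `accuracy`) and the toy labels are assumptions made for this example only; undefined F1 values (no true or predicted instances of a class) are set to 0 here, which is one common convention.

```python
from typing import List


def f1_for_class(y_true: List[int], y_pred: List[int], cls: int) -> float:
    """F1 score treating `cls` as the class of interest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Assumed convention: F1 is 0 when the class never occurs or is never predicted.
    return 2 * tp / denom if denom else 0.0


def binary_macro_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Unweighted average of the positive-class and negative-class F1 scores."""
    return (f1_for_class(y_true, y_pred, 1) + f1_for_class(y_true, y_pred, 0)) / 2


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Imbalanced toy data: one positive sentence out of ten,
# and a degenerate classifier that always predicts the negative class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc = accuracy(y_true, y_pred)
macro = binary_macro_f1(y_true, y_pred)
```

On this toy data the always-negative classifier reaches an accuracy of 0.9, while its macro-F1 is below 0.5 because the positive-class F1 is zero.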
4.3.2 Multi-Label Case Evaluation
Since we also fine-tune a propaganda technique detection model jointly for all techniques in a multi-label setting, we use a joint evaluation metric during training, specifically for monitoring model performance and selecting the best model. In particular, the performance of the multi-label model is monitored using the macro-F1 score computed over the positive classes, which is defined as:
$$\text{macro-F1}=\frac{1}{K}{\sum \limits_{k=1}^{K}}F{1_{k}},$$
where $F{1_{k}}$ is the F1-score computed for technique k, and K is the total number of techniques.
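A minimal sketch of this monitoring metric is given below. It assumes that gold and predicted labels are stored as 0/1 matrices with one row per sentence and one column per technique; the function names are hypothetical, and a technique with no gold or predicted positives is assigned an F1 of 0 as an assumed convention.

```python
from typing import List


def positive_f1(y_true: List[int], y_pred: List[int]) -> float:
    """F1 score of the positive class for a single technique."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumed convention for empty classes


def macro_f1_over_techniques(
    Y_true: List[List[int]], Y_pred: List[List[int]]
) -> float:
    """Average the positive-class F1 over the K technique columns."""
    K = len(Y_true[0])
    scores = []
    for k in range(K):
        col_true = [row[k] for row in Y_true]
        col_pred = [row[k] for row in Y_pred]
        scores.append(positive_f1(col_true, col_pred))
    return sum(scores) / K


# Four sentences, K = 2 techniques.
Y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
Y_pred = [[1, 0], [0, 1], [0, 1], [0, 0]]
score = macro_f1_over_techniques(Y_true, Y_pred)
```

Here the first technique has one missed positive (F1 of 2/3) and the second is predicted perfectly (F1 of 1), so the averaged score is 5/6.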
4.4 Transformer Model
In this study, we employ the LT-MLKM-modernBERT model (State Digital Solutions Agency, 2025). LT-MLKM-modernBERT is a monolingual, encoder-only transformer based on the ModernBERT-base architecture and specifically pretrained for the Lithuanian language. Pretraining was conducted on approximately 1.87 billion words (around 49 billion tokens) collected from diverse Lithuanian sources, including news, legal, academic, and public sector texts. The model comprises 22 Transformer encoder layers with 12 attention heads and a hidden representation size of 768 dimensions, resulting in approximately 149 million parameters. It employs a custom Lithuanian tokenizer with a vocabulary of 64 000 tokens and supports a maximum input sequence length of 8 192 tokens. We selected this model because it is currently the largest publicly available Lithuanian language model and supports the longest input sequence length among Lithuanian pretrained transformer models, making it particularly suitable for processing longer textual contexts.
To assess whether LT-MLKM-modernBERT indeed provides superior performance, we additionally evaluate two alternative transformer models specifically on the span identification task. These models are included solely for comparative purposes and are not used in other tasks within this study. In particular, we consider two multilingual models that support the Lithuanian language: XLM-RoBERTa (Conneau et al., 2020) and LitLatBERT (Ulčar and Robnik-Šikonja, 2020). This comparison allows us to determine whether the selected monolingual model offers a tangible advantage over widely used multilingual alternatives for span identification.

Fig. 7
Overview of a Transformer encoder architecture and its application. (A) Encoder producing contextual token representations. (B) Sentence classification using the [CLS] representation. (C) Token-level sequence tagging.
Figure 7 illustrates the general Transformer encoder architecture and its adaptations for the tasks analysed in this study: sequence tagging and sentence classification. In both tasks, the input data undergoes the same processing stages, including tokenization, embedding, encoding, and generation of contextual embeddings. The primary difference lies in the final stage, where task-specific classification layers are applied. In the following, we describe the architecture in detail.
Transformer Encoder Architecture. A Transformer encoder maps an input token sequence $x=({x_{0}},{x_{1}},\dots ,{x_{n}},{x_{n+1}})$ into contextualized hidden representations $H=({h_{0}},{h_{1}},\dots ,{h_{n}},{h_{n+1}})$, where ${x_{0}}=[\text{CLS}]$ and ${x_{n+1}}=[\text{SEP}]$ denote special tokens. First, the input text is tokenized and converted into token identifiers. Each token identifier is mapped to a trainable token embedding vector, which is combined with a positional embedding to encode word order information. The resulting sequence of embeddings is then passed through a stack of Transformer encoder layers.
Each encoder layer applies multi-head self-attention followed by a position-wise feed-forward network, together with residual connections and layer normalization. Self-attention allows each token representation to attend to all other tokens in the sequence, enabling the model to capture long-range dependencies and contextual interactions. As a result, the final hidden state ${h_{i}}$ corresponding to token ${x_{i}}$ represents a contextual embedding that incorporates information from the entire input sequence.
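The self-attention step can be illustrated with a small sketch of single-head scaled dot-product attention. This is a simplified illustration under stated assumptions, not the model's actual implementation: it uses one head, toy two-dimensional embeddings, and identity projection matrices, and it omits the feed-forward network, residual connections, and layer normalization. All names (`matmul`, `self_attention`, `w_q`, `w_k`, `w_v`) are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(
    x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix
) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product self-attention over token embeddings x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output vector is a weighted mix of the value vectors of all tokens.
    return matmul(weights, v), weights


# Three toy token embeddings of dimension 2, with identity projections.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, weights = self_attention(x, identity, identity, identity)
```

Each row of `weights` is a probability distribution over all positions in the sequence, which is why every output vector can incorporate information from the entire input.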
Sentence-Level Classification. In the sentence classification setting, the model predicts a single label for the entire input sequence based on the contextual representation of the special classification token $[\text{CLS}]$, i.e. the hidden state ${h_{[\text{CLS}]}}$. This vector serves as a fixed-dimensional representation of the whole sequence.
A classification layer maps this representation to a two-dimensional output space corresponding to binary classes. The resulting vector $z\in {\mathbb{R}^{2}}$ contains unnormalized scores for each class and is computed as:
$$z={W_{\text{sent}}}{h_{[\text{CLS}]}}+{b_{\text{sent}}},$$
where ${W_{\text{sent}}}\in {\mathbb{R}^{2\times d}}$, ${b_{\text{sent}}}\in {\mathbb{R}^{2}}$, and d is the hidden dimension of the encoder.
The predicted label distribution is obtained by applying the softmax function to these scores. During training, the model is optimized using cross-entropy loss with respect to the gold binary label.
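The classification head, softmax, and cross-entropy loss can be sketched numerically as follows. The values are toy numbers chosen for illustration (a hidden dimension of d = 3 instead of the encoder's actual size), and the variable names mirror the symbols ${W_{\text{sent}}}$, ${b_{\text{sent}}}$ and ${h_{[\text{CLS}]}}$ rather than any real implementation.

```python
import math
from typing import List


def linear(w: List[List[float]], b: List[float], h: List[float]) -> List[float]:
    """Compute z = W h + b for a single input vector h."""
    return [sum(wi * hi for wi, hi in zip(row, h)) + bi for row, bi in zip(w, b)]


def softmax(z: List[float]) -> List[float]:
    m = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], gold: int) -> float:
    """Negative log-probability assigned to the gold class."""
    return -math.log(probs[gold])


# Toy [CLS] representation with d = 3 (assumed values, not model outputs).
h_cls = [0.5, -1.0, 2.0]
w_sent = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.4]]  # shape 2 x d
b_sent = [0.0, 0.1]

z = linear(w_sent, b_sent, h_cls)       # unnormalized class scores
probs = softmax(z)                      # predicted label distribution
predicted = probs.index(max(probs))     # most probable class
loss = cross_entropy(probs, gold=1)     # loss if the gold label is 1
```

During fine-tuning, the gradient of this loss is propagated through both the classification layer and the encoder.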
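Sequence Tagging. In the sequence tagging setting, the model predicts a label for each input token in the sequence, excluding special tokens such as $[\text{CLS}]$ and $[\text{SEP}]$. Instead of using only the sequence-level representation ${h_{[\text{CLS}]}}$, the model uses the contextual representations of individual tokens, i.e. ${h_{1}},{h_{2}},\dots ,{h_{n}}$. A token-level classification layer is applied independently to each token representation:
$${z_{i}}={W_{\text{tok}}}{h_{i}}+{b_{\text{tok}}},\hspace{1em}i=1,\dots ,n,$$
where ${z_{i}}$ denotes the vector of unnormalized scores for token i, ${W_{\text{tok}}}\in {\mathbb{R}^{C\times d}}$ and ${b_{\text{tok}}}\in {\mathbb{R}^{C}}$ are the weights and bias of the token-level layer, and C is the number of tags in the tagging scheme.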
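A small sketch of the token-level layer is shown below. It assumes that the hidden states are ordered as $[\text{CLS}]$, the n text tokens, then $[\text{SEP}]$, and that the same weights are shared across positions; the names `w_tok` and `b_tok` and the toy values are hypothetical and chosen only for illustration.

```python
from typing import List


def token_logits(
    hidden: List[List[float]], w_tok: List[List[float]], b_tok: List[float]
) -> List[List[float]]:
    """Apply the same linear layer to every token state except [CLS] and [SEP]."""
    inner = hidden[1:-1]  # h_1, ..., h_n
    return [
        [sum(w * h for w, h in zip(row, h_i)) + b for row, b in zip(w_tok, b_tok)]
        for h_i in inner
    ]


# Toy hidden states with d = 2: [CLS], two text tokens, [SEP].
hidden = [[0.0, 0.0], [2.0, 1.0], [0.5, 3.0], [0.0, 0.0]]
w_tok = [[1.0, 0.0], [0.0, 1.0]]  # shape C x d with C = 2 tags
b_tok = [0.0, 0.0]

logits = token_logits(hidden, w_tok, b_tok)
# Pick the highest-scoring tag for each token.
tags = [scores.index(max(scores)) for scores in logits]
```

Because the layer is applied independently at each position, any dependence between neighbouring tags has to come from the contextual representations themselves.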
In this work, token-level labels are modelled using two different tagging formulations: a binary tagging scheme and the BILOU tagging scheme. In the binary tagging formulation, each token is assigned one of two labels indicating whether the token belongs to the target span or not. Formally, the label set is $\{0,1\}$, where label 1 denotes that the token is part of a target span and label 0 indicates that the token does not belong to any labelled span. This formulation simplifies the sequence labelling problem by focusing only on the presence or absence of the target phenomenon at the token level.
In addition to the binary formulation, we also employ the BILOU tagging scheme. The BILOU label set includes the tag O and structured span labels of the form $B\text{-}t$, $I\text{-}t$, $L\text{-}t$, and $U\text{-}t$, where t denotes a target category. The tag B marks the beginning of a multi-token span, I marks a token inside the span, L marks the last token of the span, and U denotes a single-token span. The tag O indicates that the token does not belong to any labelled span.
Compared to binary tagging, the BILOU scheme explicitly models span boundaries, allowing the model to distinguish between the beginning, inside, and end of multi-token spans, as well as single-token spans. This provides richer structural information about where each span starts and ends.
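The two tagging schemes can be compared on a toy example. The sketch below assumes that spans are given as non-overlapping token index ranges with exclusive end positions; the function names and the category label `PROP` are hypothetical placeholders for the target category t.

```python
from typing import List, Tuple


def binary_tags(n_tokens: int, spans: List[Tuple[int, int]]) -> List[int]:
    """Label 1 for tokens inside any span, 0 otherwise."""
    tags = [0] * n_tokens
    for start, end in spans:
        for i in range(start, end):
            tags[i] = 1
    return tags


def bilou_tags(
    n_tokens: int, spans: List[Tuple[int, int]], label: str = "PROP"
) -> List[str]:
    """BILOU tags for non-overlapping spans (assumption of this sketch)."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = f"U-{label}"          # single-token span
        else:
            tags[start] = f"B-{label}"          # first token
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"          # inside tokens
            tags[end - 1] = f"L-{label}"        # last token
    return tags


# Seven tokens with a three-token span and a single-token span.
spans = [(1, 4), (5, 6)]
binary = binary_tags(7, spans)
bilou = bilou_tags(7, spans)
```

In the binary sequence, two adjacent spans would merge into one run of 1s, whereas the BILOU tags keep their boundaries distinct.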