LITUND: Lithuanian Unreliable News Detection Corpus

Media and Digital Communication

DOI 10.55206/RZER7913

Edgaras Dambrauskas

Vytautas Magnus University – Lithuania,

Sofia University “St. Kliment Ohridski” – Bulgaria

E-mail: edgaras.dambrauskas@vdu.lt

Abstract: This paper introduces LITUND, a dedicated corpus of unreliable news texts in the Lithuanian language, developed to support linguistic and interdisciplinary research on disinformation. Today, tools and corpora for studying unreliable information are mostly available for high-resource languages, and Lithuanian remains underrepresented in this area. The LITUND corpus was compiled using texts sourced from Lithuanian media outlets that were identified as misleading by professional fact-checkers. The compilation process involved a manual search for disinformation across multiple platforms and search engines, as well as critical decisions regarding source selection, categorization, and verification. This paper outlines the methodology behind corpus construction, discusses the challenges encountered and reflects on the implications for future research. LITUND is intended to serve as an open resource for studying the linguistic features of unreliable content and to support the development of NLP tools, media literacy efforts, and cross-disciplinary analyses of disinformation in low-resource language settings.

Keywords: unreliable news, disinformation, unreliable information, cross-disciplinary analyses, LITUND corpus, Lithuanian media.

Introduction

The past few years have been marked by events with a huge global or regional impact, such as the 2016 US presidential election, the Brexit referendum, the COVID-19 pandemic, the war in Ukraine, and others. These events have led to an increase in the amount of textual and visual information on social media and news websites (Celestini et al., 2020 [1]; Gallotti et al., 2020 [2]; García-Marín & Salvat-Martinrey, 2023 [3]; Yang et al., 2020 [4]) that is difficult or impossible to verify at the time of publication; therefore, quality and reliability cannot always be expected.

Alongside this increased amount of information, the problem of disinformation has also become more prominent. In some cases, this issue may have ties to information warfare and cybersecurity, where disinformation is often disseminated through targeted computer tools that are used to launch cyber-attacks to spread false or misleading information. Studies show that misleading content is less prevalent and generates fewer interactions; however, even a relatively small number of social media accounts yields a disproportionate amount of disinformation while also being effective at amplifying already circulating misinformation (Pierri et al., 2023 [5]; Aïmeur et al., 2023 [6]).

In contemporary society, we face an overload of often conflicting information found online, both on social media and on news websites with different political leanings. Ideally, critical thinking and our own research should be enough to determine whether the information we find online is reliable or attempts to influence us; however, given the amount of such information, this is not realistically possible. Despite public confidence in individual media literacy, research indicates that a staggering number of people overestimate their news judgement skills (Lyons et al. 2021) [7], making them susceptible to false information. Consequently, this results in societal challenges, an increase in marginalisation, Euro-scepticism, distorted voting in elections, and other issues on a national or regional scale.

It is also important to mention that, while commonly used, the term fake news is not well established (Gelfert 2018 [8]; Di Domenico et al. 2021 [9]; Šalaševičiūtė 2022 [10]), as other, in many cases overlapping, terms are used interchangeably both in academic discourse and in the media itself. Pennycook et al. (2018) [11] define fake news as ‘entirely fabricated and often partisan content that is presented as factual’, a definition which, being laconic, suffers from imprecision when applied to today’s real-life examples. Baptista and Gradim (2022) [12] take a more nuanced approach by defining it as ‘a type of online disinformation with misleading and/or false statements that may or may not be associated with real events, intentionally designed to mislead and/or manipulate a specific or imagined public through the appearance of a news format with an opportunistic structure (title, image, content) to attract the reader’s attention in order to obtain more clicks and shares and, therefore, greater advertising revenue and/or ideological gain’. The key phrase here is ‘intentionally […] mislead’; this definition therefore sets itself apart from the earlier scholarly understanding of fake news ‘as a form of falsehood intended to primarily deceive people by mimicking the look and feel of real news’ (Tandoc Jr., 2019) [13]. The author also notes that there is an underlying issue with such a view of fake news, as the term can be attributed to various forms of content, including political satire, news satire for entertainment purposes, propaganda, and false advertising. The difference lies in intention and, naturally, it is a matter of debate whether deceiving the reader for entertainment purposes and deceiving with malicious intent can both be considered intentional deception, as there is, indeed, an overlap.

The difficulty of formulating a definition of fake news that encompasses all possible scenarios while excluding non-malicious content is not the sole issue. Legislating against fake news in the EU risks over-censorship and conflicts with freedom of expression, suggesting that regulation is unnecessary and creates more legal challenges, as new laws may overlap with or contradict existing ones (Mazur and Chochia, 2022) [14]. Therefore, rather surprisingly, while the term fake news is used in EU documents, it is not formally defined within the EU, giving way to broader terms.

In EU documents, the preferred term is disinformation, as used in the 2022 Code of Practice on Disinformation. [15] The EU further differentiates between two broader, partially overlapping terms: disinformation and misinformation. Disinformation is defined as ‘false or misleading content that is spread with an intention to deceive or secure economic or political gain and which may cause public harm’ while misinformation is understood as ‘false or misleading content shared without harmful intent though the effects can still be harmful, e.g. when people share false information with friends and family in good faith’. The key distinction between the two terms lies in the intent behind the dissemination of false or misleading content. Disinformation is characterized by deliberate efforts to deceive or manipulate for specific objectives, whereas misinformation arises from unintentional sharing of inaccurate information, often due to a lack of awareness about its veracity. In practice, it is difficult to assess whether the false information encountered online is spread deliberately, especially if the author is unknown to the reader or is anonymous.

For the purposes of this paper, the contemporary understanding of fake news lacks the necessary specificity to distinguish between deliberately misleading content and satirical texts resembling news articles, such as those published by The Onion or The Babylon Bee. At the same time, it presents a challenge for researchers to determine whether false or misleading content is a deliberate feature or an unintended flaw. In other words, there is no definitive method to conclusively prove whether false information was disseminated intentionally or incidentally. Disinformation, by contrast, is an extremely broad term, making it suitable for everyday use but unexpectedly expansive for legal purposes, as it aims to be more inclusive in an ever-changing information landscape. The challenge lies in its function as an umbrella term, encompassing not only text-based content, such as news articles, but also a wide variety of other formats, including social media posts, deepfakes, and manipulated video and audio content.

For this reason, the term unreliable news was adopted, referring to texts that mimic the format and style of news articles but contain factually incorrect information that has been debunked, usually post-publication. This terminology emphasizes a distinction from fake news, as the focus is shifted away from the intent behind the content, an aspect that is nearly impossible to verify, and instead placed on the reliance on objectively false or misleading main claims within the text.

The LITUND corpus provides a resource for understanding and differentiating the linguistic features of unreliable and trustworthy news, helping researchers uncover the subtle and overt differences in language use between these categories. Our motivation stems from the continuing prevalence of unreliable information and its impact on public discourse, trust in media, and societal cohesion. By equipping linguists and other researchers with a detailed, categorized dataset, we aim to facilitate studies that go beyond the scope of our own. This understanding is essential not only for improving automated detection systems but also for fostering media literacy and critical thinking. Through this work, we hope to contribute to addressing the challenge presented by the spread of disinformation while advancing linguistic research into the mechanics of persuasive and deceptive communication.

 

Rationale for creating a Lithuanian unreliable news detection corpus

The language we typically encounter while reading daily news has the power to shape opinion, rally support for a certain cause or create a narrative that influences how we perceive and interpret political events and issues. It is essential to understand the factors that contribute to the spread of disinformation. Currently, at the European level, disinformation and misleading information remain a serious challenge requiring a response from EU institutions, and they are one of the reasons why the EU Code of Practice on Combating Disinformation was introduced. This paper is part of an attempt to tackle this challenge for the Lithuanian language and to fill the gap in research from a purely linguistic standpoint rather than a sociological or technological one. It expands the foundations laid by the DIGIRES COVID-19 Corpus v.1, as LITUND has a broader scope of topics and covers other major events that led to misleading information campaigns.

By spanning a wider range of topics as well as a broader timeframe, LITUND captures disinformation across multiple high-impact sociopolitical crises and, therefore, allows further studies of rhetorical evolution and topic shifts in Lithuanian disinformation discourse. As discussed in the following chapter, similar corpora tend to focus on dynamic and often fragmented social media content, whereas LITUND consists of full-length texts, providing greater opportunities for in-depth linguistic analysis of disinformation. To enable comparative study, LITUND contains two comparable corpora, along with a POS-tagged version:

  1. Unreliable news texts. 147 full-text articles identified as misleading by professional fact-checkers. Each entry includes metadata such as the original source, a link to the text, the topic category, the specific false claim addressed, and a corresponding debunking reference.
  2. LRT corpus. 147 full-text articles published by Lithuania’s national broadcaster (LRT) on topics similar to those in the Unreliable News Corpus. This serves as a baseline for comparing linguistic and rhetorical patterns across reliable and unreliable news sources.
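The metadata listed for each unreliable news entry can be pictured as a simple record. The following is only a minimal sketch: the class name, the field names and the placeholder values are hypothetical and do not reflect the actual file format of LITUND in the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnreliableArticle:
    """One entry of the unreliable news sub-corpus (field names are illustrative)."""
    source: str       # website that originally published the text
    url: str          # link to the text
    topic: str        # topic category, e.g. "Health"
    false_claim: str  # the specific claim addressed by the fact-checkers
    debunk_url: str   # link to the corresponding debunking article
    text: str         # full text of the article


def token_count(article: UnreliableArticle) -> int:
    """Rough whitespace token count; a real pipeline would use a proper tokenizer."""
    return len(article.text.split())


# Placeholder values only; they do not describe a real corpus entry.
example = UnreliableArticle(
    source="example-site.lt",
    url="https://example.org/article",
    topic="Health",
    false_claim="5G is a weapons system designed to kill people",
    debunk_url="https://example.org/debunk",
    text="5G yra ginklų sistema, sukurta žmonėms žudyti",
)
print(token_count(example))
```

Keeping the false claim and the debunking reference next to the text makes it possible to trace every article back to the journalistic investigation that justified its inclusion.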

The relevance is particularly accentuated in times of crisis, such as the COVID-19 pandemic, the beginning of the Russo-Ukrainian war and recent military and political developments in the Middle East. This means that articles that appear to be legitimate news but contain misleading, fabricated or decontextualized information can have serious social or democratic consequences (Bakir and McStay 2018) [17] and even push audiences to make decisions related to personal health. This has increased the demand for robust methods to detect and filter out unreliable information (Zhou et al., 2019 [18]; Singh et al., 2021 [19]) and to ensure that the reader only receives reliable information (Shu et al., 2019). [20]

Existing tools and methodologies for detecting fake or unreliable news based on linguistic features are tailored for English and other major languages, so the possibilities for adapting existing work to Lithuanian are currently quite limited. The value of this work lies in advancing research on the phenomenon in Lithuanian media from a linguistic perspective, as opposed to interdisciplinary or machine learning-focused research. The compiled unreliable and traditional media corpora will support further research both in linguistics and in other, especially interdisciplinary, studies. It should be noted that this categorisation requires further research to be properly implemented, since a website that often contains misleading information may also contain factually correct articles, even though these may be biased. At the same time, traditional media, that is, large media companies that employ professional editors and journalists, may contain false information as well, making it difficult to draw a clear distinction.

The technological solutions that currently exist are based on machine learning techniques, but the explainability of machine learning solutions remains an issue. As the reader is confronted with large amounts of information on a daily basis, it becomes increasingly important not only to identify whether a given text can be labelled as reliable or misleading but also to explain how such a determination is made, particularly in the context of machine learning tools, which often lack transparency regarding the linguistic features or criteria used in classification. Considering that disinformation sites often tailor content (Guess & Lyon, 2020) [21] to resonate with the interests of the target audience, analysing their stylistic features, such as sensational language, emotive appeals, or specific framing choices, offers insight into how such content attracts and influences readers (Przybyla, 2020). [22] These linguistic patterns may enable future models to generalize disinformation detection as topics and authors change over time (Zhang et al., 2019). [23]

Most existing fake news detection corpora are designed for English and mainly serve as a tool to train machine learning systems. For instance, there are 114 datasets labelled as “fake news corpus” on the Hugging Face platform [24]; most of them (70%) are for the English language, and 80% of all corpora have binary labels, ‘real’ vs. ‘fake’. Today, researching unreliable content remains a critical area of study as AI researchers continue improving and refining algorithms for detecting unreliable or misleading information online in a timely manner, while it is also an area of study for those interested in enhancing media literacy and improving public resilience against online manipulation. In both cases, scholars and businesses alike are improving existing techniques and developing methodologies and tools for their cultural and linguistic context. However, progress is constrained by the limited availability of high-quality, manually curated datasets, which are labour-intensive to produce and are subject to ethical and methodological scrutiny.

Besides that, it comes as no surprise that the majority of research and data compilation is done using data in English, making it of limited use for other languages. Some data is available online, for example, on GitHub, such as Fake News Corpus [25], DIGIRES COVID-19 Corpus v.1 [26] developed by Vytautas Magnus University or BuzzFeed-Webis Fake News Corpus 2016. [27] Other research includes corpora for other languages, such as The Spanish Fake News Corpus that includes an extensive methodology of the compilation procedure (Posadas-Durán, 2019) [28], FANG-COVID, a German corpus dedicated to detecting fake news related to COVID-19 (Vogel and Jiang, 2019) [29] or the Portuguese Fake.Br corpus, being one of the first attempts for Brazilian Portuguese (Monteiro et al., 2018). [30]

While these examples represent valuable efforts in addressing the challenge of disinformation by offering insights into data collection and preparation for various languages, they each suffer from certain limitations. Some of these resources focus on particular or specialized topics, making them less suitable for broader linguistic research; these unsurprisingly include texts on politics, controversial events or, more recently, the COVID-19 pandemic. Others are primarily designed for machine learning applications rather than linguistic analysis, which limits their utility in examining the linguistic features and nuances of disinformation in natural language contexts. Additionally, corpora that were created for languages other than Lithuanian have limited applicability for research focused specifically on Lithuanian disinformation and unreliable news. Therefore, there was still a need for a dedicated Lithuanian corpus tailored to the Lithuanian context, covering several key areas where disinformation is most prevalent.

The current version of the LITUND corpus is a stepping stone towards a larger resource that can be used for broader research, as in its current form and size it has only been used for a pilot study. Nevertheless, it is available in the CLARIN-LT [31] repository under an academic licence, representing a contribution towards open science and fostering transparency and reproducibility. The next steps in our work involve expanding and refining the corpus to increase its utility for a broader range of applications. First, we aim to increase the size and variety of the dataset by including additional texts from different sources and websites, to ensure comprehensive coverage of unreliable information in its varied styles and forms. Second, we plan to enrich the corpus with advanced annotations, such as linguistic features and metadata, which will support more nuanced analyses by researchers in linguistics, computational science or social studies. Furthermore, we intend to collaborate with the research community to integrate user feedback, addressing gaps and identifying new areas for improvement.

Corpus compilation

It should be noted that the corpus described in this paper is not the first attempt to create a similar corpus for the Lithuanian language. One such attempt is the open-access DIGIRES COVID-19 ML Dataset v.1, compiled by researchers from Vytautas Magnus University (Amilevičius et al. 2023) [32], which comprises 351 articles on the topic of COVID-19 that are further classified as ‘reliable’ or ‘unreliable’. However, its scope is limited to health-related content, with the majority of texts focused on the COVID-19 pandemic, which, while highly relevant at the time, restricts the dataset’s broader applicability.

Table No 1. Comparison of similar corpora

Corpus | Language(s) | Domain | Size
LITUND | Lithuanian | Multi-topic | ~230k tokens, 294 articles (in progress)
DIGIRES COVID-19 Corpus v.1 [33] | Lithuanian | COVID-19 | 186,649 tokens, 351 articles
Fake.Br [34] | Portuguese | Multi-topic | 7,200 articles
GermanFakeNC [35] | German | Multi-topic | 490 articles
LIAR [36] | English | Multi-topic | 12.8K short statements
NELA-GT-2018 [37] | English | Multi-topic | 713k articles from 194 sources
NELA-GT-2020 [38] | English | Multi-topic | 1.8M articles, 519 sources
FakeNewsNet [39] | English | Political/social | ~23k articles with social metadata
BuzzFeed-Webis [40] | English | U.S. elections of 2016 | 1,627 articles
Fake News Corpus [41] | English | Multi-topic | ~9.4M articles, includes such content as satire and clickbait
PHEME [42] | English | Social media rumors | 5,802 tweets
Weibo Rumor Dataset [43] | Chinese | Social media | 4,664 posts
The Spanish Fake News Corpus [44] | Spanish | Multi-topic | 971 articles
MiDe22 [45] | English, Turkish | Multi-topic, including COVID-19, Ukraine war | 10,348 tweets
HWB Fake News [46] | English | Health and well-being | 1,000 documents, 651k tokens
Misinformation & Fake News text dataset 79k [47] | English | Multi-topic | 79k articles of misinformation, including fake news and propaganda
POLygraph: Polish Fake News Dataset | Polish | Multi-topic | 11,360 pairs of news articles

Corpus design principles

LITUND (Lithuanian Unreliable News Detection corpus) is compiled from texts (100,678 tokens) published on Lithuanian alternative news websites that were deemed misleading by professional fact-checkers from “Delfi Melo Detektorius” [48] (en. Delfi’s Lie Detector) and “Patikrinta 15min” [49] (en. Verified by 15min). On their pages, the fact-checking departments provide detailed methodologies used to select facts and to carry out information collection and verification; therefore, priority was given to these portals during data selection. By relying on professional fact-checkers to detect and evaluate unreliable articles, we minimise subjectivity. The collected dataset ensures that every statement, and therefore every article, is matched by an opposing journalistic investigation, which provides an objective basis for its selection.

Both of the selected fact-checking teams have their own methodology on which they base their claims and insights. In the case of “Verified by 15min”, the team considers fact-checking to be the process of verifying and evaluating whether statements and/or information are true, partially true, taken out of context, or presented as satire. Besides text-based content in the form of social media posts or full articles, the process also includes the assessment of edited or computer-generated photographs or videos, i.e. whether they have been altered for the purpose of misleading rather than for clarity or quality and distributed without indication of this in order to create misleading commentary. To ensure that the entries or articles under scrutiny can be seen as they were at the time of writing, and to avoid increasing the number of clicks on links to erroneous texts, 15min uses links to archived copies of the entries or articles, such as those archived on the Wayback Machine, Perma CC, Archive Today or Ghost Archive. The fact-checkers use different tags to mark verified statements and assign them to the relevant situations. “Verified by 15min” distinguishes several categories based on the conclusion reached about the statement; however, it should be mentioned that such labelling has only been used since March 2023. These categories include lies, partial lies, lack of context, satire, truth and a couple of others.

“Delfi” verifies only claims that can be proven or disproven based on existing sources of information and states that “The lie detector does not check subjective opinions or predictions about the future, nor does it check statements whose truthfulness is readily apparent and does not require further investigation.” The methodology developed by “Delfi” includes searching for information on the internet, in databases, research papers and studies, reviewing documents and more. The team states that articles published in Delfi’s Lie Detector must reflect this research by providing a list of the sources of information used and references in the text. The list of sources identifies all primary and secondary sources and the tools used for the research, such as social network analysis platforms or image search engines. The journalist may contact other experts in the field of research and ask for their comments. In the article, the researcher must identify all informants and provide evidence of their expertise, i.e. their positions or achievements in the field under investigation. Likewise, “Delfi” uses comparable verdict categories, which include lie, partial lie, forgery, true and manipulation.

In both cases, only texts labelled as lie were selected, ensuring the reliability of the corpus.

Each selected text contains a key statement that has been disproved, and articles containing or based on such false statements were searched for using publicly available search engines. For this purpose, three popular search engines were selected, namely Google, DuckDuckGo and Yandex. The results vary even when the same search query is used. Among the selected texts was an article titled 5G yra ginklų sistema, sukurta žmonėms žudyti, sako ginklų ekspertas Markas Steele’as (5G is a weapons system designed to kill people, says weapons expert Mark Steele). Searching for this full statement, or for a shortened version containing only the keywords, i.e. ‘5G weapons system + Steele’, produced markedly different results in each engine.

Google results: The top results on Google prominently feature three links to a widely recognized website that frequently disseminates misleading content. This indicates a tendency to prioritize popularity or site authority over content reliability.

DuckDuckGo results: The top results on DuckDuckGo appear to systematically exclude lesser-known websites and unreliable sources. Although DuckDuckGo claims to deliver “truly private search results without trade-offs in result quality,” this exclusion suggests a significant omission of relevant content, including disinformation, from its search results.

Yandex results: Yandex prioritizes social media posts over comprehensive websites or full texts containing disinformation. However, it also provides links to smaller, lesser-known blog-like pages that contain false claims, thereby facilitating access to disinformation during the search process.

Considering that different search engines provide very different search outputs, each claim had to be checked using all three tools to find texts relevant to this study. The observed differences indicate that obtaining the information is difficult even when we know exactly what we are looking for, as search engines have their own algorithmic priorities.
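Because every claim had to be checked in all three engines, the repetitive part of the manual search can be supported by generating the three query links in one step. The sketch below is an illustration only: the function name and the Lithuanian example query are hypothetical, the URL patterns are those of the engines' public web interfaces at the time of writing and may change, and the results still have to be opened and assessed by hand.

```python
from urllib.parse import urlencode

# Base URL and query parameter name for each engine's public web search page
# (assumed conventions; they are not part of the corpus methodology itself).
SEARCH_ENGINES = {
    "Google": ("https://www.google.com/search", "q"),
    "DuckDuckGo": ("https://duckduckgo.com/", "q"),
    "Yandex": ("https://yandex.com/search/", "text"),
}


def build_search_urls(query: str) -> dict[str, str]:
    """Return one search URL per engine for the same query string."""
    return {
        name: f"{base}?{urlencode({param: query})}"
        for name, (base, param) in SEARCH_ENGINES.items()
    }


# Hypothetical keyword query for the 5G example discussed above.
urls = build_search_urls("5G ginklų sistema Steele")
for engine, url in urls.items():
    print(engine, url)
```

Using the same query string for all engines keeps the comparison fair, since any difference in the results then comes from the engines' ranking rather than from the wording of the query.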

Challenges and considerations

Compiling a comprehensive corpus of Lithuanian news websites for unreliable news detection presents a unique set of challenges. Firstly, the limited availability of annotated datasets in the Lithuanian language makes it difficult to develop robust models, as most existing resources are predominantly in English or other widely spoken languages. This scarcity necessitates the manual collection and annotation of data, which is both time-consuming and resource-intensive. Furthermore, the linguistic nuances and cultural context of Lithuanian news require careful consideration to ensure the corpus accurately represents the stylistic and rhetorical features of genuine versus unreliable news. Some of the issues described in this chapter can be encountered in other languages as well, while others are specific to particular websites.

Another challenge is the dynamic nature of unreliable news itself, which continuously evolves in response to current events and public sentiment. This necessitates frequent updates to the corpus to maintain its relevance and effectiveness. In this case, the corpus is still in its pilot phase, and final conclusions cannot yet be drawn. For this reason, emphasis is put on the corpus development process and relevant issues rather than final conclusions.

Finally, distinguishing between misinformation and disinformation within the corpus can be challenging, as it requires understanding the intent behind the content, which may not always be apparent. Addressing these challenges is crucial for developing a reliable corpus that can support the accurate detection and analysis of unreliable news in the Lithuanian context.

Large quantities of cited text. One significant challenge in creating a corpus of unreliable news from Lithuanian news websites is dealing with highlighted text excerpts that were not written by the author of the article. These excerpts, while being an integral part of the selected texts, are often added by third-party authors or sources and can misrepresent the content and writing style of the original text author. This creates difficulties in accurately categorising and analysing news pieces, as these added excerpts can introduce a different bias or context, potentially skewing the corpus data. Ensuring that the corpus reflects the authentic articles requires careful verification and filtering to exclude such extraneous content. One example, titled “Seimas Parliament makes it compulsory for you to be listed as an organ donor – where to opt out?” [50], includes a response from the Lithuanian National Transplant Bureau that comprises roughly ¼ of the overall text and therefore risks distorting the corpus.
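One way to flag articles with a large proportion of cited text for manual review is to estimate how many tokens fall inside quotation marks. The sketch below is an assumption-laden heuristic rather than the procedure used for LITUND: it only catches passages that are explicitly enclosed in quotation marks, the function name is hypothetical, and the sample sentence is invented for illustration.

```python
import re

# Matches Lithuanian-style („ … “), English-style (“ … ”) and straight (" … ") quotes.
QUOTE_PATTERN = re.compile(r'„[^“”]*[“”]|“[^”]*”|"[^"]*"')


def quoted_share(text: str) -> float:
    """Share of whitespace tokens that lie inside quotation marks (0.0 to 1.0)."""
    total = len(text.split())
    if total == 0:
        return 0.0
    quoted = sum(len(match.group().split()) for match in QUOTE_PATTERN.finditer(text))
    return quoted / total


# Invented sample sentence: 4 of its 7 tokens are quoted.
sample = "Biuras teigia: „tai netiesa ir klaidina“ skaitytojus."
print(round(quoted_share(sample), 2))
```

Articles whose share exceeds a chosen threshold, for instance one quarter as in the example above, could then be inspected manually before deciding whether the cited passage should be excluded.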

Texts translated by machine translation tools. Another challenge in corpus creation arises from the reliance on machine-translated articles, particularly those from the portal SapereAude. [51] The authors of SapereAude explicitly state that their articles are translated using a machine learning tool, which can introduce inaccuracies and distortions in the text. These machine-generated translations often lack the nuances and contextual understanding of human translators, making it difficult to determine whether the resulting content is genuinely misleading or simply a product of translation errors. This ambiguity complicates the classification of such articles within the corpus, as it blurs the line between the original style and writing peculiarities of a native speaker author and unintentional misrepresentation due to translation artifacts.

Same article published on multiple twin websites. Another notable issue arises from the practice of reposting the same article across multiple websites with only minor variations or no changes at all. This can lead to an inflated perception of article prevalence and skew the analysis if not carefully managed. When identical content appears on different sites, it may introduce redundancy, affecting the diversity and representativeness of the corpus. Additionally, this phenomenon may further complicate the task of determining the original source of information, which is crucial for understanding the propagation and impact of unreliable news; however, this is not always possible, especially if the time of publishing is not provided. It also raises questions about the intent behind such reposting, as it might be used to amplify false narratives or create an illusion of widespread consensus. Alternatively, this strategy may be employed to ensure that the content stays online: if one or more of the twin websites are removed by the authorities, the content remains available on the others.
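Reposted articles with minor variations can be detected automatically before manual inspection. The following sketch uses word trigram overlap (Jaccard similarity), which is one common technique and not necessarily the one applied during the compilation of LITUND; the function names, the threshold of 0.8 and the toy texts are assumptions made for illustration.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of overlapping word n-grams of a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the word n-gram sets of two texts."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0  # two texts too short to form any n-gram are treated as identical
    return len(sa & sb) / len(sa | sb)


def near_duplicates(texts: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Pairs of article ids whose similarity reaches the threshold."""
    ids = sorted(texts)
    return [
        (x, y)
        for i, x in enumerate(ids)
        for y in ids[i + 1:]
        if jaccard(texts[x], texts[y]) >= threshold
    ]


# Toy texts standing in for articles from twin websites.
articles = {
    "site_a": "vienas du trys keturi penki",
    "site_b": "vienas du trys keturi penki",
    "site_c": "visai kitas tekstas apie orus",
}
print(near_duplicates(articles))
```

Only one representative of each detected pair would then be kept, while the remaining links could be stored as metadata to document how widely the text was reposted.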

Articles containing large portions of translated text. Some websites attempt to create legitimacy by quoting experts or simply people working in the respective fields, thereby employing the appeal to authority fallacy. This results in extensive quoting of one or more experts, so the article includes a large portion of text that was taken from another website or social media post and translated. The issue here is that it is impossible to determine how these translations were carried out; machine translation may have been used, which risks skewing the final results, as these excerpts were not originally written in Lithuanian.

Corpus Structure

The gathered data consists of 6 distinct categories that were assigned after creating the corpus in order to categorize articles by general area, as it was hypothesized that texts on different topics may show different use of parts of speech. For example, texts on politics and political events do show a significant difference in the use of proper nouns due to extensive mentions of politicians’ names and countries. The chosen categories are as follows:

  • Environment. This category contains texts on environmental issues, climate change, pollution, the chemtrail conspiracy theory and similar topics. These topics are not new, and the related scepticism is no longer widespread enough for this to be a major category. Still, most of the texts in this category focus on the politicisation of global warming or are based on claims that anthropogenic climate change is not real.
  • COVID-19. It comes as no surprise that during the pandemic, COVID-19 restrictions and the vaccination process across Europe were a major target for creators of unreliable news. This can be attributed to a group of factors, including global uncertainty on a previously unseen scale. The fear surrounding the virus created grounds for misinformation to spread rapidly, especially through social media. As a result, the largest share of the unreliable news items is pandemic-related, although the distinction from the politics and health topics was not always clear.
  • Health. Health-related issues comprise 17% of the whole dataset and are closely connected to the COVID-19 category. However, due to the extensive volume of unreliable news regarding the pandemic, it was decided to separate the remaining health and healthcare-related topics, which range from genetically modified products, the spread of AIDS and the effects of 5G on human health to fruit being injected with toxic substances.
  • Politics. This category contains texts of a political nature, with the single exception of those related to the war in Ukraine, which are listed under a separate category. It covers Lithuanian and world or regional politics, but also includes slightly unusual topics such as US biological weapons and Joe Biden’s health and death, as well as local issues, especially the draft process in Lithuania.
  • War in Ukraine. As the name suggests, this is another highly specific topic that suffers from similar issues as the COVID-19 category due to its overlap with politics. It would make equal sense to attribute articles such as “Seimas vote: 100 MPs approve possible troop deployment to Ukraine” or “Ukraine attacked Poland, but the media paints the opposite picture” to politics in general, as they discuss countries other than the warring parties. As with COVID-19, it was decided to treat the war in Ukraine as a separate category due to its specific nature and the separate wave of misleading content it generated.
  • Other. This last category serves as an umbrella for the remaining topics that do not fall under any of the major categories.
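
The hypothesis about part-of-speech differences between categories can be checked once the texts are tagged. The sketch below assumes that each article has already been annotated with Universal Dependencies UPOS labels (in which proper nouns are tagged PROPN) by an external tagger that is not specified here. The function name and the toy input are hypothetical and do not reproduce LITUND data.

```python
from collections import defaultdict


def propn_share_by_category(articles):
    """Compute the share of tokens tagged PROPN for each category.

    `articles` is an iterable of (category, upos_tags) pairs, where
    `upos_tags` is the list of UPOS labels of one article.
    """
    propn = defaultdict(int)
    total = defaultdict(int)
    for category, tags in articles:
        total[category] += len(tags)
        propn[category] += sum(1 for tag in tags if tag == "PROPN")
    # Categories without any tokens are skipped to avoid division by zero.
    return {c: propn[c] / total[c] for c in total if total[c] > 0}


# Invented toy input, for illustration only.
toy_articles = [
    ("Politics", ["PROPN", "VERB", "PROPN", "NOUN"]),
    ("Health", ["NOUN", "VERB", "ADJ", "NOUN"]),
    ("Politics", ["PROPN", "NOUN"]),
]
shares = propn_share_by_category(toy_articles)
```

Pooling tokens per category, as done here, weights long articles more heavily than short ones; averaging per-article shares instead would be an equally defensible design choice.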

As shown in Fig. 1, the categories are not equal in volume. The Health and closely related COVID-19 categories together account for almost half of the corpus (13.6% and 36.1% respectively). Due to the methodology applied in the data search, the oldest journalistic investigations and, consequently, the oldest misleading texts were added to the corpus first. This pilot corpus is only halfway completed, and additional articles appear daily on less reputable websites. In practice, this means that the newest topics may not have been included yet, especially the Israel–Hamas war and events in the Middle East since 7 October 2023 in general. Such topics are very likely to skew the results towards the politics category and may warrant a separate category, depending on the number of misleading texts found. The Russo-Ukrainian war nevertheless receives significantly more attention and coverage in Lithuanian media, except when major events take place in the Middle East. Since public attention is turned towards Ukraine, unreliable news in Lithuanian focuses on the same events as well.

Fig. 1. Topic structure of the LITUND corpus

Sources

The articles were collected from 32 different blogs and websites that publish content with Lithuanian as the original language; thus, the texts should not have been produced with machine translation tools. There is no objective way to verify this. The SapereAude website, however, openly admits that its texts are translated using the DeepL tool without any changes.

A major part of the dataset was collected from the Paranormal.lt website, which focuses heavily on unreliable information on health, vaccination and COVID-19. While other websites change their focus based on the most important world or regional events of the time, Paranormal.lt sticks with the COVID-19 and pandemic narrative, publishing at least several articles on these topics per week while also generating a large amount of content in general.

It is also worth looking at the smaller sources, as they can be slightly misleading. A number of websites are essentially copies of one another. We found 7 websites that shared the same content with some variation, such as a text being posted on only half of them. Due to their misleading nature, the majority of them have been closed by the Office of the Inspector of Journalistic Ethics of Lithuania and can no longer be accessed directly. The only exception is fact-checker investigations, which often contain archived copies of particular articles, so the original content can still be accessed to some extent.

Author-wise, it is impossible for us to determine whether one or several people write these articles. Some pages, such as 77.lt and some smaller ones, do include a journalist or author name; however, it is difficult to assess how reliable a nickname or a given name is. While this does not affect the present research, it would be beneficial to know whether a portal represents one person’s writing style or a group of people, each with their own writing style and vocabulary.

Discussion and results

The LITUND corpus was created as an open-access resource dedicated to supporting future linguistic, computational, and interdisciplinary studies on unreliable news in Lithuania. While the present paper does not include a full statistical or linguistic analysis, several observations and findings emerged during the corpus development process:

  1. The corpus compilation process, which relied on the binary logic of articles or claims being either reliable or unreliable, exposed inherent ambiguities in the concept of misinformation, as many articles contained a mixture of verifiable facts and misleading claims, or employed rhetorical strategies that cast doubt without stating outright falsehoods. While this is addressed by fact-checkers, it should be acknowledged that authors who use decontextualised or “cherry-picked” information avoid the risks related to publishing false information. This raises questions about the limits of binary classification and suggests that future versions of the corpus might benefit from a more nuanced labelling system, such as degrees of reliability or intent to deceive.
  2. A major obstacle was the difficulty of retrieving the original misleading articles. While fact-checking websites usually provided the original links, many sources had been removed from the first pages of search engine results, banned by governmental institutions, or deleted entirely from hosting servers. In these cases, tools such as the Wayback Machine were necessary to access the original texts.
  3. As many sources could no longer be found through mainstream search engines, locating relevant texts required a creative and iterative search strategy. This often involved compiling keyword lists from fact-checker reports or disputed claims (e.g., “5G causes COVID”, “Ukraine biolabs”, “vaccine genocide”), experimenting with phrasing variations, and using alternative search engines, mainly Yandex or DuckDuckGo. The process demanded constant adjustment of queries, especially due to the creative nature of such texts. For instance, some sites deliberately avoid the proper words or terms for vaccines and instead use words that can be translated as “goop”, both in the headline and in the text itself, which creates an obstacle when choosing keywords for search queries.
  4. The creation of LITUND opens research paths not only for computational linguists but also for researchers in communication studies, journalism, political science, and cognitive psychology. Understanding unreliable news in low-resource languages like Lithuanian requires more than text analysis: it demands collaboration across disciplines to study why disinformation spreads, how it is received, and what cultural or historical narratives are employed to make it more believable and viral.
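
Observations 2 and 3 above describe work that was carried out manually. Parts of it could be supported by small scripts, as in the following hedged Python sketch. It generates phrasing variations from keyword groups and prepares a lookup against the Internet Archive's Wayback Machine availability endpoint (`https://archive.org/wayback/available`). The function names and the example terms are hypothetical, the response layout is assumed to follow the `archived_snapshots` / `closest` structure of that endpoint, and the actual HTTP request is deliberately left out so that the example runs offline.

```python
import json
from itertools import product
from urllib.parse import urlencode

WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def query_variants(term_groups):
    """Combine one term from each group into a search query string."""
    return [" ".join(combo) for combo in product(*term_groups)]


def availability_url(original_url, timestamp=None):
    """Build the availability lookup URL for an article that went offline.

    `timestamp` is an optional YYYYMMDD string asking for the snapshot
    closest to that date.
    """
    params = {"url": original_url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_AVAILABILITY}?{urlencode(params)}"


def closest_snapshot(response_text):
    """Return the archived URL from a JSON reply, or None if nothing is archived."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Hypothetical keyword groups; real lists would come from fact-checker reports.
queries = query_variants([["5G", "5G ryšys"], ["COVID", "koronavirusas"]])
lookup = availability_url("example.lt/straipsnis", "20210101")
```

The reply to the lookup URL can be fetched with any HTTP client and passed to `closest_snapshot`. Keeping the keyword groups as plain lists makes it easy to add the euphemisms that some sites use in place of standard terms.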

 

References

[7] Gelfert, A. (2018). Fake news: A definition. Informal Logic, 38(1), 84–117.

[8] Di Domenico, G., Sit, J., Ishizaka, A., & Nunan, D. (2021). Fake news, social media and marketing: A systematic review. Journal of Business Research, 124, 329–341.

[9] Šalaševičiūtė, V. (2022). Netikrų naujienų atpažinimo metodika. http://dspace.kaunokolegija.lt//handle/123456789/5710. Retrieved on 01.05.2025.

[10] Pennycook, G., Cannon, T. D., & Rand, D. G. (2018). Prior exposure increases perceived accuracy of fake news. Journal of Experimental Psychology: General, 147(12), 1865.

[11] Baptista, J. P., & Gradim, A. (2022). A working definition of fake news. Encyclopedia, 2(1).

[13] Tandoc Jr, E. C. (2019). The facts of fake news: A research review. Sociology Compass, 13(9), e12724.

[14] Mazur, V., & Chochia, A. (2022). Definition and Regulation as an Effective Measure to Fight Fake News in the European Union. European Studies, 9(1), 15–40.

[15] European Commission. The 2022 Code of Practice on Disinformation. https://digital-strategy.ec.europa.eu/en/policies/code-practice-disinformation. Retrieved on 01.04.2025.

[16] European Commission. Brussels, 26.5.2021 COM(2021) 262 final. European Commission Guidance on Strengthening the Code of Practice on Disinformation. https://eur-lex.europa.eu/legal-content/EN/TXT/PDF/?uri=CELEX:52021DC0262. Retrieved on 01.04.2025.

[17] Bakir, V., & McStay, A. (2018). Fake news and the economy of emotions: Problems, causes, solutions. Digital Journalism, 6(2), 154–175.

[18] Zhou, X., Jain, A., Phoha, V. V., & Zafarani, R. (2020). Fake news early detection: A theory-driven model. Digital Threats: Research and Practice, 1(2), 1–25.

[19] Singh, V. K., Ghosh, I., & Sonagara, D. (2021). Detecting fake news stories via multimodal analysis. Journal of the Association for Information Science and Technology, 72(1), 3–17.

[20] Shu, K., Mahudeswaran, D., & Liu, H. (2019). FakeNewsTracker: a tool for fake news collection, detection, and visualization. Computational and Mathematical Organization Theory, 25, 60–71.

[22] Przybyla, P. (2020, April). Capturing the style of fake news. In Proceedings of the AAAI conference on artificial intelligence, 34(1), 490–497.

[23] Zhang, C., Gupta, A., Kauten, C., Deokar, A. V., & Qin, X. (2019). Detecting fake news for reducing misinformation risks using analytics approaches. European Journal of Operational Research, 279(3), 1036–1052.

[24] Hugging Face Platform. Hugging Face, Langchain, and more: try genai tools with deeplearning.ai. https://www.coursera.org/collections/deeplearning-ai-genai-tools. Retrieved on 17.11.2024.

[25] Fake News Corpus. https://github.com/several27/fakenewscorpus. Retrieved on 01.05.2025.

[26] DIGIRES COVID-19 Corpus v.1. https://clarin.vdu.lt/xmlui/handle/20.500.11821/53. Retrieved on 01.05.2025.

[27] Buzzfeed-webis fake news corpus 2016. https://zenodo.org/records/1239675. Retrieved on 01.05.2025.

[28] Posadas-Durán, J. P., Gómez-Adorno, H., Sidorov, G., & Escobar, J. J. M. (2019). Detection of fake news in a new corpus for the Spanish language. Journal of Intelligent & Fuzzy Systems, 36(5), 4869–4876.

[29] Vogel, I., & Jiang, P. (2019, August). Fake news detection with the new German dataset “GermanFakeNC”. In International Conference on Theory and Practice of Digital Libraries (pp. 288–295). Cham: Springer International Publishing.

[30] Monteiro, R. A., Santos, R. L., Pardo, T. A., De Almeida, T. A., Ruiz, E. E., & Vale, O. A. (2018). Contributions to the study of fake news in portuguese: New corpus and automatic detection results. In Computational Processing of the Portuguese Language: 13th International Conference, PROPOR 2018, Canela, Brazil, September 24–26, 2018, Proceedings 13 (pp. 324–334). Springer International Publishing.

Pathak, A., & Srihari, R. K. (2019, July). BREAKING! Presenting fake news corpus for automated fact checking. In Proceedings of the 57th annual meeting of the association for computational linguistics: student research workshop (pp. 357–362).

[31] CLARIN-LT. http://clarin-lt.lt/?lang=en. DIGIRES COVID-19 Corpus v.1. Retrieved on 01.05.2025.

[32] Amilevičius, Darius; Utka, Andrius; Meidutė, Aistė and Ruzaitė, Jūratė, 2023.

[33] DIGIRES COVID-19 Corpus v.1. https://clarin.vdu.lt/xmlui/handle/20.500.11821/53. Retrieved on 01.05.2025.

[34] Fake.Br. https://github.com/roneysco/Fake.br-Corpus. Retrieved on 01.05.2025.

[35] GermanFakeNC. https://live.european-language-grid.eu/catalogue/corpus/7564. Retrieved on 01.05.2025.

[36] LIAR. https://paperswithcode.com/dataset/liar. Retrieved on 01.05.2025.

[37] NELA-GT-2018. https://paperswithcode.com/dataset/nela-gt-2018. Retrieved on 01.05.2025.

[38] NELA-GT-2020. https://paperswithcode.com/dataset/nela-gt-2020. Retrieved on 01.05.2025.

[39] FakeNewsNet. https://github.com/KaiDMML/FakeNewsNet. Retrieved on 01.05.2025.

[40] BuzzFeed-Webis. https://paperswithcode.com/dataset/buzzfeed-webis-fake-news-corpus-2016. Retrieved on 01.05.2025.

[41] Fake News Corpus. https://github.com/several27/FakeNewsCorpus. Retrieved on 01.05.2025.

[42] PHEME. https://www.kaggle.com/datasets/usharengaraju/pheme-dataset. Retrieved on 01.05.2025.

[43] Weibo Rumor Dataset. https://www.scidb.cn/en/detail?dataSetId=1085347f720f4cfc97a157e469734a66. Retrieved on 01.05.2025.

[44] The Spanish Fake News Corpus. https://github.com/jpposadas/FakeNewsCorpusSpanish. Retrieved on 01.05.2025.

[45] MiDe22. https://github.com/metunlp/MiDe22. Retrieved on 01.05.2025.

[46] HWB Fake News. https://dcs.uoc.ac.in/cida/resources/hwb.html. Retrieved on 01.05.2025.

[47] Misinformation & Fake News text dataset 79k. https://www.kaggle.com/datasets/stevenpeutz/misinformation-fake-news-text-dataset-79k. Retrieved on 01.05.2025.

[48] Delfi Melo Detektorius. https://www.delfi.lt/puslapis/melo-detektorius/metodologija. Retrieved on 01.05.2025.

[49] Patikrinta 15min. https://www.15min.lt/projektas/patikrinta-15min-metodologija. Retrieved on 01.05.2025.

[50] Seimas Parliament makes it compulsory for you to be listed as an organ donor – where to opt out. https://www.komentaras.lt/laisvalaikis/sveikata/seimas-jus-privalomai-irase-i-organu-donorus-kur-atsisakyti/129028/. Retrieved on 01.05.2025.

[51] SapereAude. https://sapereaude.lt/. Retrieved on 01.05.2025.

 

Bibliography

Alkhair, M., Meftouh, K., Smaïli, K., & Othman, N. (2019). An arabic corpus of fake news: Collection, analysis and classification. In Arabic Language Processing: From Theory to Practice: 7th International Conference, ICALP 2019, Nancy, France, October 16–17, 2019, Proceedings 7 (pp. 292–302). Springer International Publishing.

Amilevičius, D., et al. (2023). DIGIRES COVID-19 Corpus v.1, CLARIN-LT digital library in the Republic of Lithuania. (Amilevičius, Darius; Utka, Andrius; Meidutė, Aistė and Ruzaitė, Jūratė) http://hdl.handle.net/20.500.11821/53. Retrieved on 01.05.2025.

Bakir, V., & McStay, A. (2018). Fake news and the economy of emotions: Problems, causes, solutions. Digital Journalism, 6(2), 154–175.

Baptista, J. P., & Gradim, A. (2022). A working definition of fake news. Encyclopedia, 2(1).

Di Domenico, G., Sit, J., Ishizaka, A., & Nunan, D. (2021). Fake news, social media and marketing: A systematic review. Journal of Business Research, 124, 329–341.

Gelfert, A. (2018). Fake news: A definition. Informal Logic, 38(1), 84–117.

Lyons, B. A., Montgomery, J. M., Guess, A. M., Nyhan, B., & Reifler, J. (2021). Overconfidence in news judgments is associated with false news susceptibility. Proceedings of the National Academy of Sciences, 118(23), e2019527118.

Mazur, V., & Chochia, A. (2022). Definition and Regulation as an Effective Measure to Fight Fake News in the European Union. European Studies, 9(1), 15–40.

Monteiro, R. A., Santos, R. L., Pardo, T. A., De Almeida, T. A., Ruiz, E. E., & Vale, O. A. (2018). Contributions to the study of fake news in portuguese: New corpus and automatic detection results. In Computational Processing of the Portuguese Language: 13th International Conference, PROPOR 2018, Canela, Brazil, September 24–26, 2018, Proceedings 13 (pp. 324–334). Springer International Publishing.

Pathak, A., & Srihari, R. K. (2019, July). BREAKING! Presenting fake news corpus for automated fact checking. In Proceedings of the 57th annual meeting of the association for computational linguistics: student research workshop (pp. 357–362).

Pennycook, G., Cannon, T. D., & Rand, D. G. (2018). Prior exposure increases perceived accuracy of fake news. Journal of Experimental Psychology: General, 147(12), 1865.

Posadas-Durán, J. P., Gómez-Adorno, H., Sidorov, G., & Escobar, J. J. M. (2019). Detection of fake news in a new corpus for the Spanish language. Journal of Intelligent & Fuzzy Systems, 36(5), 4869–4876.

Potthast, M., Gollub, T., Komlossy, K., Schuster, S., Wiegmann, M., Fernandez, E. P. G., … & Stein, B. (2018, August). Crowdsourcing a large corpus of clickbait on twitter. In Proceedings of the 27th international conference on computational linguistics (pp. 1498–1507).

Przybyla, P. (2020, April). Capturing the style of fake news. In Proceedings of the AAAI conference on artificial intelligence, 34(1), 490–497.

Šalaševičiūtė, V. (2022). Netikrų naujienų atpažinimo metodika. http://dspace.kaunokolegija.lt//handle/123456789/5710. Retrieved on 01.05.2025.

Shu, K., Mahudeswaran, D., & Liu, H. (2019). FakeNewsTracker: a tool for fake news collection, detection, and visualization. Computational and Mathematical Organization Theory, 25, 60–71.

Singh, V. K., Ghosh, I., & Sonagara, D. (2021). Detecting fake news stories via multimodal analysis. Journal of the Association for Information Science and Technology, 72(1), 3–17.

Tandoc Jr, E. C. (2019). The facts of fake news: A research review. Sociology Compass, 13(9), e12724.

Vogel, I., & Jiang, P. (2019, August). Fake news detection with the new German dataset “GermanFakeNC”. In International Conference on Theory and Practice of Digital Libraries (pp. 288–295). Cham: Springer International Publishing.

Zhang, C., Gupta, A., Kauten, C., Deokar, A. V., & Qin, X. (2019). Detecting fake news for reducing misinformation risks using analytics approaches. European Journal of Operational Research, 279(3), 1036–1052.

Zhou, X., Jain, A., Phoha, V. V., & Zafarani, R. (2020). Fake news early detection: A theory-driven model. Digital Threats: Research and Practice, 1(2), 1–25.

 

Edgaras Dambrauskas, PhD student at Vytautas Magnus University – Lithuania, Sofia University “St. Kliment Ohridski” – Bulgaria. ORCID: 0000-0001-8546-0564. The topic of the dissertation is “Fake News Recognition: Developing a Model for the Lithuanian Language Using a Specialized Text Corpus”.

Manuscript was submitted: 02.06.2025.

Double Blind Peer Reviews: from 03.06.2025 till 04.07.2025.

Accepted: 05.07.2025.

Брой 64 на сп. „Реторика и комуникации“ (юли 2025 г.) се издава с финансовата помощ на Фонд научни изследвания, договор № КП-06-НП6/48 от 04 декември 2024 г.

Issue 64 of the Rhetoric and Communications Journal (July 2025) is published with the financial support of the Scientific Research Fund, Contract No. KP-06-NP6/48 of December 04, 2024.