Into the Void: Understanding Online Health Information in Low-Web Data Languages
- URL: http://arxiv.org/abs/2509.20245v1
- Date: Wed, 24 Sep 2025 15:35:01 GMT
- Title: Into the Void: Understanding Online Health Information in Low-Web Data Languages
- Authors: Hellina Hailu Nigatu, Nuredin Ali Abdelkadir, Fiker Tewelde, Stevie Chancellor, Daricia Wilkinson
- Abstract summary: We study the characteristics of search results for health queries in Tigrinya and Amharic. We find that search results for health queries in low-web data languages may not always be in the language of search. We show that search results that diverge from their queries in low-resourced languages are due to algorithmic failures, (un)intentional manipulation, or active manipulation by content creators.
- Score: 8.999413477506554
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data voids--areas of the internet where reliable information is scarce or absent--pose significant challenges to online health information seeking, particularly for users operating in low-web data languages. These voids are increasingly encountered not on traditional search engines alone, but on social media platforms, which have gradually morphed into informal search engines for millions of people. In this paper, we introduce the phenomenon of data horizons: a critical boundary where algorithmic structures begin to degrade the relevance and reliability of search results. Unlike the core of a data void, which is often exploited by bad actors to spread misinformation, the data horizon marks the critical space where systemic factors, such as linguistic underrepresentation, algorithmic amplification, and socio-cultural mismatch, create conditions of informational instability. Focusing on Tigrinya and Amharic as languages of study, we evaluate (1) the common characteristics of search results for health queries, (2) the quality and credibility of health information, and (3) characteristics of search results that diverge from their queries. We find that search results for health queries in low-web data languages may not always be in the language of search and may be dominated by nutritional and religious advice. We show that search results that diverge from their queries in low-resourced languages are due to algorithmic failures, (un)intentional manipulation, or active manipulation by content creators. We use our findings to illustrate how a data horizon manifests under several interacting constraints on information availability.
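One finding above is that results for Tigrinya and Amharic health queries are often not in the language of search. A minimal sketch of how such divergence could be flagged, using the fact that both languages are written in the Unicode Ethiopic block (U+1200–U+137F); the data, threshold, and function names are illustrative assumptions, not the paper's actual method.

```python
def ethiopic_ratio(text: str) -> float:
    """Fraction of letters in the Unicode Ethiopic block (U+1200-U+137F),
    the script used for Amharic and Tigrinya."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(0x1200 <= ord(c) <= 0x137F for c in letters) / len(letters)

def diverges_from_query_language(snippet: str, threshold: float = 0.5) -> bool:
    """Flag a result as divergent if most of its text is not in the
    query's script. The 0.5 threshold is a hypothetical choice."""
    return ethiopic_ratio(snippet) < threshold

# Illustrative results for an Amharic health query
results = [
    "የጤና ምክር",            # Amharic snippet: stays in the query's script
    "Top 10 health tips",   # English snippet returned for an Amharic query
]
flags = [diverges_from_query_language(r) for r in results]
# flags -> [False, True]: only the English result is flagged as divergent
```

A real pipeline would need a proper language identifier rather than a script heuristic, since many languages share a script; this only illustrates the shape of the check.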
Related papers
- The Synthetic Web: Adversarially-Curated Mini-Internets for Diagnosing Epistemic Weaknesses of Language Agents [0.0]
Language agents increasingly act as web-enabled systems that search, browse, and synthesize information from diverse sources. These sources can include unreliable or adversarial content, and the robustness of agents to adversarial ranking remains poorly understood. We introduce the Synthetic Web Benchmark, a procedurally generated environment comprising thousands of hyperlinked articles with ground-truth labels for credibility and factuality.
arXiv Detail & Related papers (2026-02-28T20:27:44Z)
- Misspellings in Natural Language Processing: A survey [52.419589623702336]
Misspellings have become ubiquitous in digital communication. We reconstruct a history of misspellings as a scientific problem. We discuss the latest advancements to address the challenge of misspellings in NLP.
arXiv Detail & Related papers (2025-01-28T10:26:04Z)
- Discovery of the Hidden World with Large Language Models [95.58823685009727]
This paper presents Causal representatiOn AssistanT (COAT) that introduces large language models (LLMs) to bridge the gap.
LLMs are trained on massive observations of the world and have demonstrated great capability in extracting key information from unstructured data.
COAT also adopts causal discovery (CD) methods to find causal relations among the identified variables, and provides feedback to the LLMs to iteratively refine the proposed factors.
arXiv Detail & Related papers (2024-02-06T12:18:54Z)
- InfoLossQA: Characterizing and Recovering Information Loss in Text Simplification [60.10193972862099]
This work proposes a framework to characterize and recover simplification-induced information loss in the form of question-and-answer pairs.
QA pairs are designed to help readers deepen their knowledge of a text.
arXiv Detail & Related papers (2024-01-29T19:00:01Z)
- When a Language Question Is at Stake. A Revisited Approach to Label Sensitive Content [0.0]
The article revisits an approach to pseudo-labeling sensitive data, using the example of Ukrainian tweets covering the Russian-Ukrainian war.
We provide a fundamental statistical analysis of the obtained data and an evaluation of the models used for pseudo-labeling, and set out guidelines on how scientists can leverage the corpus.
arXiv Detail & Related papers (2023-11-17T13:35:10Z)
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- KoMultiText: Large-Scale Korean Text Dataset for Classifying Biased Speech in Real-World Online Services [5.03606775899383]
"KoMultiText" is a new comprehensive, large-scale dataset collected from a well-known South Korean SNS platform.
Our approach surpasses human-level accuracy across diverse classification tasks, as measured by various metrics.
Our work can provide solutions for real-world hate speech and bias mitigation, contributing directly to the improvement of online community health.
arXiv Detail & Related papers (2023-10-06T15:19:39Z)
- Fine-Tuning Llama 2 Large Language Models for Detecting Online Sexual Predatory Chats and Abusive Texts [2.406214748890827]
This paper proposes an approach to detecting online sexual predatory chats and abusive language using the open-source pretrained Llama 2 7B-parameter model.
We fine-tune the LLM using datasets with different sizes, imbalance degrees, and languages (i.e., English, Roman Urdu, and Urdu).
Experimental results show a strong performance of the proposed approach, which performs proficiently and consistently across three distinct datasets.
arXiv Detail & Related papers (2023-08-28T16:18:50Z)
- Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation [50.46219766161111]
We develop a self-learning model to explore contrastive learning in the domain of misinformation identification.
Our model shows superior performance in detecting non-matched image-text pairs when the training data is insufficient.
arXiv Detail & Related papers (2023-04-19T02:53:59Z)
- Integrity and Junkiness Failure Handling for Embedding-based Retrieval: A Case Study in Social Network Search [26.705196461992845]
Embedding-based retrieval is used in a variety of search applications, such as e-commerce and social network search.
In this paper, we conduct an analysis of embedding-based retrieval launched in early 2021 on our social network search engine.
We define two main categories of failures it introduces: integrity and junkiness.
arXiv Detail & Related papers (2023-04-18T20:53:47Z)
- COLD: A Benchmark for Chinese Offensive Language Detection [54.60909500459201]
We use COLDataset, a Chinese offensive language dataset with 37k annotated sentences.
We also propose COLDetector to study the offensiveness of outputs from popular Chinese language models.
Our resources and analyses are intended to help detoxify the Chinese online communities and evaluate the safety performance of generative language models.
arXiv Detail & Related papers (2022-01-16T11:47:23Z)
- Exposing Query Identification for Search Transparency [69.06545074617685]
We explore the feasibility of approximate exposing query identification (EQI) as a retrieval task by reversing the role of queries and documents in two classes of search systems.
We derive an evaluation metric to measure the quality of a ranking of exposing queries, and conduct an empirical analysis focusing on various practical aspects of approximate EQI.
arXiv Detail & Related papers (2021-10-14T20:19:27Z)
- Natural language technology and query expansion: issues, state-of-the-art and perspectives [0.0]
Linguistic characteristics that cause ambiguity and misinterpretation of queries, as well as additional factors, affect a user's ability to accurately represent their information needs.
We lay down the anatomy of a generic linguistic-based query expansion framework and propose its module-based decomposition.
For each module we review state-of-the-art solutions in the literature and categorize them in light of the techniques used.
arXiv Detail & Related papers (2020-04-23T11:39:07Z)
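The module-based decomposition described in the last abstract can be pictured as a small pipeline of stages (tokenization, lexical lookup, term expansion). A minimal sketch under assumed names; the synonym table and tokenizer are illustrative stand-ins, not the survey's actual framework.

```python
# Toy module-based query expansion: tokenize -> look up synonyms -> rebuild query.

def tokenize(query: str) -> list[str]:
    """Tokenization module: lowercase and split on whitespace."""
    return query.lower().split()

# Hypothetical lexical resource; a real system would use a thesaurus
# or embedding-based neighbors.
SYNONYMS = {
    "flu": ["influenza"],
    "remedy": ["treatment", "cure"],
}

def expand_terms(tokens: list[str]) -> list[str]:
    """Expansion module: append synonyms after each original term."""
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

def expand_query(query: str) -> str:
    """Compose the modules into an end-to-end expansion step."""
    return " ".join(expand_terms(tokenize(query)))

# expand_query("flu remedy") -> "flu influenza remedy treatment cure"
```

Decomposing the pipeline this way lets each module (tokenizer, lexical resource, expansion policy) be swapped independently, which is the point of the survey's module-based framing.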
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.