Global Patterns of Knowledge: Language, Genre, and the Geography of Knowledge
- URL: http://arxiv.org/abs/2507.22271v1
- Date: Tue, 29 Jul 2025 22:43:01 GMT
- Title: Global Patterns of Knowledge: Language, Genre, and the Geography of Knowledge
- Authors: Akira Matsui, Fujio Toriumi, Mitsuo Yoshida, Taichi Murayama, Shiori Hironaka,
- Abstract summary: We use economic complexity analysis to understand the editing history of Wikipedia platforms. We reveal that different language communities exhibit distinct specializations, particularly in cultural subjects. Our findings suggest that while a common mode of knowledge production exists for standardized topics such as science, it is more diverse for cultural topics or controversial subjects such as conspiracy theories.
- Score: 0.45666156207236525
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Online platforms, particularly Wikipedia, have become critical infrastructures for providing diverse linguistic and cultural contexts. This human-curated knowledge now forms the foundation for modern AI. However, we have not yet fully explored how knowledge production capabilities vary across languages and domains. Here, we address this gap by applying economic complexity analysis to understand the editing history of Wikipedia platforms. This approach allows us to infer the latent mode of "knowledge production" of each language community from the diversity and specialization of its contributed content. We reveal that different language communities exhibit distinct specializations, particularly in cultural subjects. Furthermore, we map the global landscape of these production modes, finding that the structure of knowledge production strongly reflects geopolitical boundaries. Our findings suggest that while a common mode of knowledge production exists for standardized topics such as science, it is more diverse for cultural topics or controversial subjects such as conspiracy theories. The association between differences in knowledge production capability and geopolitical factors indicates how linguistic and cultural dynamics shape our worldview and the biases embedded in Wikipedia data, a unique, massive, and essential dataset for modern AI.
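As a rough illustration of the kind of computation economic complexity analysis usually starts from (not the authors' actual pipeline, which this page does not describe), the sketch below computes the revealed comparative advantage (Balassa index) of each language edition in each genre, then derives per-language diversity and per-genre ubiquity. The edit counts, language codes, and genre names are hypothetical toy values.

```python
import numpy as np

# Hypothetical edit counts: rows = language editions, columns = genres.
languages = ["en", "ja", "de"]
genres = ["science", "culture", "politics"]
edits = np.array([
    [50.0, 30.0, 20.0],
    [20.0, 60.0, 20.0],
    [30.0, 10.0, 60.0],
])


def revealed_comparative_advantage(counts):
    """Balassa index: a language's share of its own edits in a genre,
    divided by that genre's share of all edits."""
    row_share = counts / counts.sum(axis=1, keepdims=True)
    global_share = counts.sum(axis=0) / counts.sum()
    return row_share / global_share


rca = revealed_comparative_advantage(edits)

# A language is treated as "specialized" in a genre when RCA exceeds 1.
specialized = rca > 1.0

# Diversity: number of genres each language specializes in.
diversity = specialized.sum(axis=1)
# Ubiquity: number of languages specialized in each genre.
ubiquity = specialized.sum(axis=0)

for lang, row in zip(languages, specialized):
    print(lang, [g for g, flag in zip(genres, row) if flag])
```

In this toy matrix each edition ends up specialized in a single genre. Real analyses typically go further and iterate on diversity and ubiquity to obtain complexity scores, which is beyond this sketch.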
Related papers
- A Community-driven vision for a new Knowledge Resource for AI [59.29703403953085]
Despite the success of knowledge resources like WordNet, verifiable, general-purpose widely available sources of knowledge remain a critical deficiency in AI infrastructure. This paper synthesizes our findings and outlines a community-driven vision for a new knowledge infrastructure.
arXiv Detail & Related papers (2025-06-19T20:51:28Z)
- MAKIEval: A Multilingual Automatic WiKidata-based Framework for Cultural Awareness Evaluation for LLMs [26.806566827956875]
MAKIEval is an automatic multilingual framework for evaluating cultural awareness in large language models. It automatically identifies cultural entities in model outputs and links them to structured knowledge. We assess 7 LLMs developed from different parts of the world, encompassing both open-source and proprietary systems.
arXiv Detail & Related papers (2025-05-27T19:29:40Z)
- CulFiT: A Fine-grained Cultural-aware LLM Training Paradigm via Multilingual Critique Data Synthesis [41.261808170896686]
CulFiT is a novel training paradigm that leverages multilingual data and fine-grained reward modeling to enhance cultural sensitivity and inclusivity. Our approach synthesizes diverse cultural-related questions, constructs critique data in culturally relevant languages, and employs fine-grained rewards to decompose cultural texts into verifiable knowledge units.
arXiv Detail & Related papers (2025-05-26T04:08:26Z)
- From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs [57.43233760384488]
Adapting cultural values in Large Language Models (LLMs) presents significant challenges. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. In this paper, we investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge.
arXiv Detail & Related papers (2025-05-22T09:00:01Z)
- Language Specific Knowledge: Do Models Know Better in X than in English? [9.923619418000488]
Code-switching is a common phenomenon of alternating between different languages in the same utterance, thought, or conversation. We coin the term Language Specific Knowledge (LSK) to represent this phenomenon. We find that language models can perform better when using chain-of-thought reasoning in some languages other than English.
arXiv Detail & Related papers (2025-05-21T00:31:13Z)
- Risks of Cultural Erasure in Large Language Models [4.613949381428196]
We argue for the need of metricizable evaluations of language technologies that interrogate and account for historical power inequities. We probe representations that a language model produces about different places around the world when asked to describe these contexts. We analyze the cultures represented in the travel recommendations produced by a set of language model applications.
arXiv Detail & Related papers (2025-01-02T04:57:50Z)
- Extrinsic Evaluation of Cultural Competence in Large Language Models [53.626808086522985]
We focus on extrinsic evaluation of cultural competence in two text generation tasks.
We evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts.
We find weak correlations between text similarity of outputs for different countries and the cultural values of these countries.
arXiv Detail & Related papers (2024-06-17T14:03:27Z)
- CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models [59.22460740026037]
The "CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs).
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv Detail & Related papers (2024-05-22T20:19:10Z)
- Elements of World Knowledge (EWoK): A Cognition-Inspired Framework for Evaluating Basic World Knowledge in Language Models [51.891804790725686]
Elements of World Knowledge (EWoK) is a framework for evaluating language models' understanding of conceptual knowledge underlying world modeling. EWoK-core-1.0 is a dataset of 4,374 items covering 11 world knowledge domains. All tested models perform worse than humans, with results varying drastically across domains.
arXiv Detail & Related papers (2024-05-15T17:19:42Z)
- Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)