A Crosslingual Investigation of Conceptualization in 1335 Languages
- URL: http://arxiv.org/abs/2305.08475v2
- Date: Fri, 26 May 2023 18:29:24 GMT
- Title: A Crosslingual Investigation of Conceptualization in 1335 Languages
- Authors: Yihong Liu, Haotian Ye, Leonie Weissweiler, Philipp Wicke, Renhao Pei,
Robert Zangenfeind, Hinrich Schütze
- Abstract summary: We investigate differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus.
We propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings.
In a detailed linguistic analysis across all languages for one concept (`bird') and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Languages differ in how they divide up the world into concepts and words;
e.g., in contrast to English, Swahili has a single concept for `belly' and
`womb'. We investigate these differences in conceptualization across 1,335
languages by aligning concepts in a parallel corpus. To this end, we propose
Conceptualizer, a method that creates a bipartite directed alignment graph
between source language concepts and sets of target language strings. In a
detailed linguistic analysis across all languages for one concept (`bird') and
an evaluation on gold standard data for 32 Swadesh concepts, we show that
Conceptualizer has good alignment accuracy. We demonstrate the potential of
research on conceptualization in NLP with two experiments. (1) We define
crosslingual stability of a concept as the degree to which it has 1-1
correspondences across languages, and show that concreteness predicts
stability. (2) We represent each language by its conceptualization pattern for
83 concepts, and define a similarity measure on these representations. The
resulting measure for the conceptual similarity of two languages is
complementary to standard genealogical, typological, and surface similarity
measures. For four out of six language families, we can assign languages to
their correct family based on conceptual similarity with accuracy between 54%
and 87%.
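To make these definitions concrete, the sketch below shows one way the bipartite concept-to-string alignments, the stability measure, and a language-level conceptual similarity could be realized in code. This is an illustrative reconstruction, not the authors' implementation: the graph layout, the agreement-based similarity, and all identifiers (`alignments`, `stability`, `conceptual_similarity`) are assumptions made for exposition.

```python
from collections import defaultdict

# Illustrative sketch of a Conceptualizer-style bipartite alignment:
# each (concept, language) pair maps to the set of target-language
# strings that realize the concept. Toy data only.
alignments = defaultdict(lambda: defaultdict(set))
alignments["bird"]["eng"] = {"bird"}
alignments["bird"]["deu"] = {"vogel"}
alignments["belly"]["swh"] = {"tumbo"}  # Swahili merges `belly' and `womb'
alignments["womb"]["swh"] = {"tumbo"}

def stability(concept: str) -> float:
    """Fraction of languages in which the concept aligns to exactly one
    string, i.e. a 1-1 correspondence (a proxy for the paper's notion of
    crosslingual stability)."""
    langs = alignments[concept]
    if not langs:
        return 0.0
    return sum(len(strings) == 1 for strings in langs.values()) / len(langs)

def conceptual_similarity(lang_a: str, lang_b: str, concepts: list) -> float:
    """Assumed similarity measure: the fraction of concepts for which the
    two languages show the same conceptualization pattern (here, the same
    number of aligned strings)."""
    agree = sum(
        len(alignments[c].get(lang_a, set())) == len(alignments[c].get(lang_b, set()))
        for c in concepts
    )
    return agree / len(concepts)
```

On this toy graph, `stability("bird")` is 1.0 because every listed language realizes the concept with a single string. In the paper, analogous patterns over 83 concepts are what support assigning languages to the correct family for four of the six families studied.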
Related papers
- Universal Conceptual Structure in Neural Translation: Probing NLLB-200's Multilingual Geometry
We investigate the representation geometry of Meta's NLLB-200, a 200-language encoder-decoder Transformer.
We find that the model's embedding distances significantly correlate with phylogenetic distances from the Automated Similarity Judgment Program.
We release InterpretCognates, an open-source interactive toolkit for exploring these phenomena.
arXiv Detail & Related papers (2026-02-27T22:51:01Z)
- A Geometric Unification of Concept Learning with Concept Cones
Two traditions of interpretability have evolved side by side but have seldom spoken to each other: Concept Bottleneck Models (CBMs) and Sparse Autoencoders (SAEs).
We show that both paradigms instantiate the same geometric structure.
CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs.
arXiv Detail & Related papers (2025-12-08T09:51:46Z)
- Multilingual Vision-Language Models, A Survey
This survey examines multilingual vision-language models that process text and images across languages.
We review 31 models and 21 benchmarks, spanning encoder-only and generative architectures.
arXiv Detail & Related papers (2025-09-26T09:46:13Z)
- Unstable Grounds for Beautiful Trees? Testing the Robustness of Concept Translations in the Compilation of Multilingual Wordlists
We investigate the variation in concept translations in independently compiled wordlists from 10 dataset pairs covering 9 different language families.
On average, only 83% of all translations yield the same word form, while forms that are identical in their phonetic transcriptions are found in only 23% of all cases.
arXiv Detail & Related papers (2025-03-01T12:16:45Z)
- Human-like conceptual representations emerge from language prediction
Large language models (LLMs) trained exclusively through next-token prediction over language data exhibit remarkably human-like behaviors.
Are these models developing concepts akin to humans, and if so, how are such concepts represented and organized?
Our results demonstrate that LLMs can flexibly derive concepts from linguistic descriptions in relation to contextual cues about other concepts.
These findings establish that structured, human-like conceptual representations can naturally emerge from language prediction without real-world grounding.
arXiv Detail & Related papers (2025-01-21T23:54:17Z)
- Are Structural Concepts Universal in Transformer Language Models? Towards Interpretable Cross-Lingual Generalization
We investigate the potential for explicitly aligning conceptual correspondence between languages to enhance cross-lingual generalization.
Using the syntactic aspect of language as a testbed, our analyses of 43 languages reveal a high degree of alignability.
We propose a meta-learning-based method to learn to align conceptual spaces of different languages.
arXiv Detail & Related papers (2023-10-19T14:50:51Z)
- A Geometric Notion of Causal Probing
The linear subspace hypothesis states that, in a language model's representation space, all information about a concept such as verbal number is encoded in a linear subspace.
We give a set of intrinsic criteria which characterize an ideal linear concept subspace.
We find that, for at least one concept across two language models, the concept subspace can be used to manipulate the concept value of the generated word with precision.
arXiv Detail & Related papers (2023-07-27T17:57:57Z)
- From Word Models to World Models: Translating from Natural Language to the Probabilistic Language of Thought
We propose rational meaning construction, a computational framework for language-informed thinking.
We frame linguistic meaning as a context-sensitive mapping from natural language into a probabilistic language of thought.
We show that LLMs can generate context-sensitive translations that capture pragmatically-appropriate linguistic meanings.
We extend our framework to integrate cognitively-motivated symbolic modules.
arXiv Detail & Related papers (2023-06-22T05:14:00Z)
- Multilingual Conceptual Coverage in Text-to-Image Models
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z)
- A study of conceptual language similarity: comparison and evaluation
An interesting line of research in natural language processing (NLP) aims to incorporate linguistic typology to bridge linguistic diversity.
Recent work has introduced a novel approach to defining language similarity based on how they represent basic concepts.
In this work, we study this conceptual similarity in detail and evaluate it extensively on a binary classification task.
arXiv Detail & Related papers (2023-05-22T18:28:02Z)
- Analyzing Encoded Concepts in Transformer Language Models
ConceptX analyses how latent concepts are encoded in representations learned within pre-trained language models.
It uses clustering to discover the encoded concepts and explains them by aligning with a large set of human-defined concepts.
arXiv Detail & Related papers (2022-06-27T13:32:10Z)
- Visual Superordinate Abstraction for Robust Concept Learning
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure of exploring the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Analyzing Gender Representation in Multilingual Models
We focus on the representation of gender distinctions as a practical case study.
We examine the extent to which the gender concept is encoded in shared subspaces across different languages.
arXiv Detail & Related papers (2022-04-20T00:13:01Z)
- A Massively Multilingual Analysis of Cross-linguality in Shared Embedding Space
In cross-lingual language models, representations for many different languages live in the same space.
We compute a task-based measure of cross-lingual alignment in the form of bitext retrieval performance.
We examine a range of linguistic, quasi-linguistic, and training-related features as potential predictors of these alignment metrics.
arXiv Detail & Related papers (2021-09-13T21:05:37Z)
- A Computational Approach to Measuring the Semantic Divergence of Cognates
We investigate semantic divergence across languages by measuring the semantic similarity of cognate sets in multiple languages.
A language-agnostic method facilitates a quantitative analysis of cognate divergence.
We introduce the notion of "soft false friend" and "hard false friend", as well as a measure of the degree of "falseness" of a false friends pair.
arXiv Detail & Related papers (2020-12-02T15:52:38Z)
- BabelEnconding at SemEval-2020 Task 3: Contextual Similarity as a Combination of Multilingualism and Language Models
This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity.
arXiv Detail & Related papers (2020-08-19T13:46:37Z)
- On the Language Neutrality of Pre-trained Multilingual Representations
We investigate the language-neutrality of multilingual contextual embeddings directly and with respect to lexical semantics.
Our results show that contextual embeddings are more language-neutral and, in general, more informative than aligned static word-type embeddings.
We show how to reach state-of-the-art accuracy on language identification and match the performance of statistical methods for word alignment of parallel sentences.
arXiv Detail & Related papers (2020-04-09T19:50:32Z)