ELCC: the Emergent Language Corpus Collection
- URL: http://arxiv.org/abs/2407.04158v2
- Date: Wed, 04 Dec 2024 15:23:54 GMT
- Title: ELCC: the Emergent Language Corpus Collection
- Authors: Brendon Boldt, David Mortensen,
- Abstract summary: The Emergent Language Corpus Collection (ELCC) is a collection of corpora generated from open source implementations of emergent communication systems.
Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus.
- Score: 1.6574413179773761
- License:
- Abstract: We introduce the Emergent Language Corpus Collection (ELCC): a collection of corpora generated from open source implementations of emergent communication systems across the literature. These systems include a variety of signalling game environments as well as more complex environments like a social deduction game and embodied navigation. Each corpus is annotated with metadata describing the characteristics of the source system as well as a suite of analyses of the corpus (e.g., size, entropy, average message length, performance as transfer learning data). Currently, research studying emergent languages requires directly running different systems which takes time away from actual analyses of such languages, makes studies which compare diverse emergent languages rare, and presents a barrier to entry for researchers without a background in deep learning. The availability of a substantial collection of well-documented emergent language corpora, then, will enable research which can analyze a wider variety of emergent languages, which more effectively uncovers general principles in emergent communication rather than artifacts of particular environments. We provide some quantitative and qualitative analyses with ELCC to demonstrate potential use cases of the resource in this vein.
Related papers
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - A Survey of Code-switching: Linguistic and Social Perspectives for
Language Technologies [8.202739294785086]
We offer a survey of code-switching (C-S) covering the literature in linguistics with a reflection on the key issues in language technologies.
From the linguistic perspective, we provide an overview of structural and functional patterns of C-S focusing on the literature from European and Indian contexts.
From the language technologies perspective, we discuss how massive language models fail to represent diverse C-S types due to lack of appropriate training data.
arXiv Detail & Related papers (2023-01-05T09:08:04Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (MS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource language into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and
Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z) - The Grammar of Emergent Languages [19.17358904009426]
We show that UGI techniques are appropriate to analyse emergent languages.
We then study if the languages that emerge in a typical referential game setup exhibit syntactic structure.
Our experiments demonstrate that a certain message length and vocabulary size are required for structure to emerge.
arXiv Detail & Related papers (2020-10-05T15:06:27Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z) - Linguistic Typology Features from Text: Inferring the Sparse Features of
World Atlas of Language Structures [73.06435180872293]
We construct a recurrent neural network predictor based on byte embeddings and convolutional layers.
We show that some features from various linguistic types can be predicted reliably.
arXiv Detail & Related papers (2020-04-30T21:00:53Z) - Bridging Linguistic Typology and Multilingual Machine Translation with
Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.