Related papers: Code-Switched Language Identification is Harder Than You Think

URL: http://arxiv.org/abs/2402.01505v1
Date: Fri, 2 Feb 2024 15:38:47 GMT
Title: Code-Switched Language Identification is Harder Than You Think
Authors: Laurie Burchell, Alexandra Birch, Robert P. Thompson, Kenneth Heafield
Abstract summary: Code switching is a common phenomenon in written and spoken communication. We look at the application of building CS corpora. We make the task more realistic by scaling it to more languages. We reformulate the task as a sentence-level multi-label tagging problem to make it more tractable.
Score: 69.63439391717691
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code switching (CS) is a very common phenomenon in written and spoken communication but one that is handled poorly by many natural language processing applications. Looking to the application of building CS corpora, we explore CS language identification (LID) for corpus building. We make the task more realistic by scaling it to more languages and considering models with simpler architectures for faster inference. We also reformulate the task as a sentence-level multi-label tagging problem to make it more tractable. Having defined the task, we investigate three reasonable models for this task and define metrics which better reflect desired performance. We present empirical evidence that no current approach is adequate and finally provide recommendations for future work in this area.
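As a rough illustration of the sentence-level multi-label formulation described in the abstract, the sketch below treats each sentence's label as a set of language codes and scores predictions with two standard multi-label metrics, exact match ratio and Hamming loss. The abstract does not name the metrics the authors define, so these two are assumptions chosen for illustration, and the language codes and toy data are hypothetical.

```python
from typing import List, Set


def exact_match_ratio(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Fraction of sentences whose predicted language set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def hamming_loss(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> float:
    """Fraction of (sentence, language) decisions that are wrong."""
    # The symmetric difference counts both missed and spurious languages.
    errors = sum(len(g ^ p) for g, p in zip(gold, pred))
    return errors / (len(gold) * len(labels))


# Hypothetical toy data: each sentence is tagged with every language it contains.
labels = ["en", "es", "hi"]
gold = [{"en"}, {"en", "es"}, {"es"}, {"en", "hi"}]
pred = [{"en"}, {"en"}, {"es"}, {"en", "hi", "es"}]

emr = exact_match_ratio(gold, pred)
hl = hamming_loss(gold, pred, labels)
print(f"exact match ratio: {emr:.3f}")
print(f"Hamming loss: {hl:.3f}")
```

In this toy case a code-switched sentence tagged with only one of its two languages counts as a complete miss under exact match but as a single wrong decision under Hamming loss, which is one reason the choice of metric matters for this task.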

Related papers

The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs [54.59207567677249]
Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce.
arXiv Detail & Related papers (2025-05-23T20:28:31Z)
UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface [25.898592418636603]
UFO is a framework that Unifies Fine-grained visual perception tasks through an Open-ended language interface. UFO unifies object-level detection, pixel-level segmentation, and image-level vision-language tasks into a single model. Our framework bridges the gap between fine-grained perception and vision-language tasks, significantly simplifying architectural design and training strategies.
arXiv Detail & Related papers (2025-03-03T09:27:24Z)
Eliciting Better Multilingual Structured Reasoning from LLMs through Code [17.870002864331322]
We introduce a multilingual structured reasoning and explanation dataset, termed xSTREET, that covers four tasks across six languages. xSTREET exposes a gap in base LLM performance between English and non-English reasoning tasks. We propose two methods to remedy this gap, building on the insight that LLMs trained on code are better reasoners.
arXiv Detail & Related papers (2024-03-05T00:48:56Z)
Coupling Large Language Models with Logic Programming for Robust and General Reasoning from Text [5.532477732693001]
We show that a large language model can serve as a highly effective few-shot semantic parser. It can convert natural language sentences into a logical form that serves as input for answer set programs. We demonstrate that this method achieves state-of-the-art performance on several benchmarks, including bAbI, StepGame, CLUTRR, and gSCAN.
arXiv Detail & Related papers (2023-07-15T03:29:59Z)
Simple yet Effective Code-Switching Language Identification with Multitask Pre-Training and Transfer Learning [0.7242530499990028]
Code-switching is the linguistic phenomenon in which, in casual settings, multilingual speakers mix words from different languages in one utterance. We propose two novel approaches toward improving language identification accuracy on an English-Mandarin child-directed speech dataset. Our best model achieves a balanced accuracy of 0.781 on a real English-Mandarin code-switching child-directed speech corpus and outperforms the previous baseline by 55.3%.
arXiv Detail & Related papers (2023-05-31T11:43:16Z)
Call Larisa Ivanovna: Code-Switching Fools Multilingual NLU Models [1.827510863075184]
Novel benchmarks for multilingual natural language understanding (NLU) include monolingual sentences in several languages, annotated with intents and slots. Existing benchmarks lack code-switched utterances, which are difficult to gather and label because of the complexity of their grammatical structure. Our work adopts recognized methods to generate plausible and natural-sounding code-switched utterances and uses them to create a synthetic code-switched test set.
arXiv Detail & Related papers (2021-09-29T11:15:00Z)
Can Machines Read Coding Manuals Yet? -- A Benchmark for Building Better Language Models for Code Understanding [3.98345038769576]
We derive a set of benchmarks that assess code understanding based on tasks such as predicting the best answer to a question in a forum post. We evaluate the performance of current state-of-the-art language models on these tasks and show that there is a significant improvement on each task from fine tuning.
arXiv Detail & Related papers (2021-09-15T17:42:44Z)
Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture. Our models learn to generate labels in text based on the visual and textual inputs. Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)
Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not reached using the general-purpose multilingual text encoders off-the-shelf, but rather by relying on their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z)
Language Models as Few-Shot Learner for Task-Oriented Dialogue Systems [74.8759568242933]
Task-oriented dialogue systems use four connected modules, namely Natural Language Understanding (NLU), Dialogue State Tracking (DST), Dialogue Policy (DP), and Natural Language Generation (NLG). A research challenge is to learn each module with the fewest samples, given the high cost of data collection. We evaluate the priming few-shot ability of language models in the NLU, DP and NLG tasks.
arXiv Detail & Related papers (2020-08-14T08:23:21Z)
Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions [87.33156149634392]
We critically examine RefCOCOg, a standard benchmark for visual referring expression recognition. We show that 83.7% of test instances do not require reasoning on linguistic structure. We propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT.
arXiv Detail & Related papers (2020-05-04T17:09:15Z)
Style Variation as a Vantage Point for Code-Switching [54.34370423151014]
Code-Switching (CS) is a common phenomenon observed in several bilingual and multilingual communities. We present a novel vantage point of CS to be style variations between both the participating languages. We propose a two-stage generative adversarial training approach where the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences.
arXiv Detail & Related papers (2020-05-01T15:53:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.