Related papers: ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings

ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings

URL: http://arxiv.org/abs/2409.00120v1
Date: Wed, 28 Aug 2024 11:27:21 GMT
Title: ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings
Authors: Jangyeong Jeon, Sangyeon Cho, Minuk Ma, Junyoung Kim,
Abstract summary: This paper examines the Code-Switching (CS) phenomenon where two languages intertwine within a single utterance. We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities. We introduce a novel Koglish dataset tailored for English-Korean CS scenarios to mitigate such challenges.
Score: 4.68732641979009
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper examines the Code-Switching (CS) phenomenon where two languages intertwine within a single utterance. There exists a noticeable need for research on the CS between English and Korean. We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities due to the intrinsic grammatical differences between the languages. We introduce a novel Koglish dataset tailored for English-Korean CS scenarios to mitigate such challenges. First, we constructed the Koglish-GLUE dataset to demonstrate the importance and need for CS datasets in various tasks. We found the differential outcomes of various foundation multilingual language models when trained on a monolingual versus a CS dataset. Motivated by this, we hypothesized that SimCSE, which has shown strengths in monolingual sentence embedding, would have limitations in CS scenarios. We construct a novel Koglish-NLI (Natural Language Inference) dataset using a CS augmentation-based approach to verify this. From this CS-augmented dataset Koglish-NLI, we propose a unified contrastive learning and augmentation method for code-switched embeddings, ConCSE, highlighting the semantics of CS sentences. Experimental results validate the proposed ConCSE with an average performance enhancement of 1.77\% on the Koglish-STS(Semantic Textual Similarity) tasks.

Related papers

Optimizing ASR for Catalan-Spanish Code-Switching: A Comparative Analysis of Methodologies [9.224033819309708]
Code-switching (CS), the alternating use of two or more languages, challenges automatic speech recognition (ASR) due to scarce training data and linguistic similarities.<n>We improve ASR for Catalan-Spanish CS by exploring three strategies: (1) generating synthetic CS data, (2) concatenating monolingual audio, and (3) leveraging real CS data with language tokens.<n>Results show that combining a modest amount of synthetic CS data with the dominant language token yields the best transcription performance.
arXiv Detail & Related papers (2025-07-18T12:54:41Z)
Can we train ASR systems on Code-switch without real code-switch data? Case study for Singapore's languages [3.263178944046948]
Code-switching (CS) presents challenges for ASR due to scarce and costly transcribed data.<n>We propose a phrase-level mixing method to generate synthetic CS data that mimics natural patterns.
arXiv Detail & Related papers (2025-06-17T04:37:16Z)
Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data [21.240439045909724]
Code-switching (CS) is still a critical challenge in Natural Language Processing (NLP) This paper presents a novel methodology to generate CS data using Large Language Models (LLMs) We propose back-translating natural CS sentences into monolingual English, and using the resulting parallel corpus to fine-tune LLMs to turn monolingual sentences into CS.
arXiv Detail & Related papers (2025-02-18T15:04:13Z)
Grammatical Error Correction for Code-Switched Sentences by Learners of English [5.653145656597412]
We conduct the first exploration into the use of Grammar Error Correction systems on CSW text. We generate synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora. We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints. Our best model achieves an average increase of 1.57 $F_0.5$ across 3 CSW test sets without affecting the model's performance on a monolingual dataset.
arXiv Detail & Related papers (2024-04-18T20:05:30Z)
Code-Switched Language Identification is Harder Than You Think [69.63439391717691]
Code switching is a common phenomenon in written and spoken communication. We look at the application of building CS corpora. We make the task more realistic by scaling it to more languages. We reformulate the task as a sentence-level multi-label tagging problem to make it more tractable.
arXiv Detail & Related papers (2024-02-02T15:38:47Z)
Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments. We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z)
CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation. We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts. Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$times$.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings [61.77760317554826]
This paper proposes an information-d contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE. We evaluate the proposed InfoCSE on several benchmark datasets w.r.t the semantic text similarity (STS) task. Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large.
arXiv Detail & Related papers (2022-10-08T15:53:19Z)
End-to-End Speech Translation for Code Switched Speech [13.97982457879585]
Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages. We focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation. We show that our ST architectures, and especially our bidirectional end-to-end architecture, perform well on CS speech, even when no CS training data is used.
arXiv Detail & Related papers (2022-04-11T13:25:30Z)
Code-Switching Text Augmentation for Multilingual Speech Processing [36.302629721413155]
Code-switching in spoken content has enforced ASR systems to handle mixed input. Recent ASR studies showed the predominance of E2E-ASR using multilingual data to handle CS phenomena. We propose a methodology to augment the monolingual data for artificially generating spoken CS text to improve different speech modules.
arXiv Detail & Related papers (2022-01-07T17:14:19Z)
KARI: KAnari/QCRI's End-to-End systems for the INTERSPEECH 2021 Indian Languages Code-Switching Challenge [7.711092265101041]
We present the Kanari/QCRI system and the modeling strategies used to participate in the Interspeech 2021 Code-switching (CS) challenge for low-resource Indian languages. The subtask involved developing a speech recognition system for two CS datasets: Hindi-English and Bengali-English. To tackle the CS challenges, we use transfer learning for incorporating the publicly available monolingual Hindi, Bengali, and English speech data.
arXiv Detail & Related papers (2021-06-10T16:12:51Z)
GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations. GCNs struggle to model words with long-range dependencies or are not directly connected in the dependency tree. We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
Style Variation as a Vantage Point for Code-Switching [54.34370423151014]
Code-Switching (CS) is a common phenomenon observed in several bilingual and multilingual communities. We present a novel vantage point of CS to be style variations between both the participating languages. We propose a two-stage generative adversarial training approach where the first stage generates competitive negative examples for CS and the second stage generates more realistic CS sentences.
arXiv Detail & Related papers (2020-05-01T15:53:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.