RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval
- URL: http://arxiv.org/abs/2411.04752v1
- Date: Thu, 07 Nov 2024 14:41:01 GMT
- Title: RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval
- Authors: Aniket Deroy, Subhankar Maity
- Abstract summary: In India, social media users frequently engage in code-mixed conversations using the Roman script.
This paper focuses on the challenges of extracting relevant information from code-mixed conversations.
We develop a mechanism to automatically identify the most relevant answers from code-mixed conversations.
- Abstract: Code-mixing, the integration of lexical and grammatical elements from multiple languages within a single sentence, is a widespread linguistic phenomenon, particularly prevalent in multilingual societies. In India, social media users frequently engage in code-mixed conversations using the Roman script, especially among migrant communities who form online groups to share relevant local information. This paper focuses on the challenges of extracting relevant information from code-mixed conversations, specifically within Roman transliterated Bengali mixed with English. This study presents a novel approach to address these challenges by developing a mechanism to automatically identify the most relevant answers from code-mixed conversations. We have experimented with a dataset comprising queries and documents from Facebook, along with Query Relevance files (QRels), to aid in this task. Our results demonstrate the effectiveness of our approach in extracting pertinent information from complex, code-mixed digital conversations, contributing to the broader field of natural language processing in multilingual and informal text environments. We prompt GPT-3.5 Turbo and exploit the sequential nature of relevant documents to frame a mathematical model that helps detect the documents relevant to a query.
Related papers
- $\mu$PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge [72.64847925450368]
Cross-lingual summarization consists of generating a summary in one language given an input document in a different language.
This work presents $\mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge.
arXiv Detail & Related papers (2023-05-23T16:25:21Z)
- Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages [47.78634360870564]
We explore prompting multilingual models to generate code-mixed data for seven languages in South East Asia (SEA).
We find that publicly available multilingual instruction-tuned models such as BLOOMZ are incapable of producing texts with phrases or clauses from different languages.
ChatGPT exhibits inconsistent capabilities in generating code-mixed texts, wherein its performance varies depending on the prompt template and language pairing.
arXiv Detail & Related papers (2023-03-23T18:16:30Z)
- MUTANT: A Multi-sentential Code-mixed Hinglish Dataset [16.14337612590717]
We propose a novel task of identifying multi-sentential code-mixed text (MCT) from multilingual articles.
As a use case, we leverage multilingual articles and build a first-of-its-kind multi-sentential code-mixed Hinglish dataset.
The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs.
arXiv Detail & Related papers (2023-02-23T04:04:18Z)
- Talk the Walk: Synthetic Data Generation for Conversational Music Recommendation [62.019437228000776]
We present TalkWalk, which generates realistic high-quality conversational data by leveraging encoded expertise in widely available item collections.
We generate over one million diverse conversations in a human-collected dataset.
arXiv Detail & Related papers (2023-01-27T01:54:16Z)
- Retrieval-Augmented Multilingual Keyphrase Generation with Retriever-Generator Iterative Training [66.64843711515341]
Keyphrase generation is the task of automatically predicting keyphrases given a piece of long text.
We call attention to a new setting named multilingual keyphrase generation.
We propose a retrieval-augmented method for multilingual keyphrase generation to mitigate the data shortage problem in non-English languages.
arXiv Detail & Related papers (2022-05-21T00:45:21Z)
- BERTuit: Understanding Spanish language in Twitter through a native transformer [70.77033762320572]
We present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets.
Our motivation is to provide a powerful resource to better understand Spanish Twitter and to be used on applications focused on this social network.
arXiv Detail & Related papers (2022-04-07T14:28:51Z)
- HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text [1.6675267471157407]
We present a corpus (HinGE) for Hinglish, a widely popular code-mixed language (code-mixing of Hindi and English).
HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences.
In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data.
arXiv Detail & Related papers (2021-07-08T11:11:37Z)
- A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z)
- Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text [1.6675267471157407]
Code-mixing is a frequent communication style among multilingual speakers where they mix words and phrases from two different languages in the same utterance of text or speech.
This paper demonstrates several inherent limitations of code-mixing metrics with examples from the already existing datasets that are popularly used across various experiments.
arXiv Detail & Related papers (2021-06-18T13:26:48Z)
- GupShup: An Annotated Corpus for Abstractive Summarization of Open-Domain Code-Switched Conversations [28.693328393260906]
We introduce abstractive summarization of Hindi-English code-switched conversations and develop the first code-switched conversation summarization dataset.
GupShup contains over 6,831 conversations in Hindi-English and their corresponding human-annotated summaries in English and Hindi-English.
We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation.
arXiv Detail & Related papers (2021-04-17T15:42:01Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.