GupShup: An Annotated Corpus for Abstractive Summarization of
Open-Domain Code-Switched Conversations
- URL: http://arxiv.org/abs/2104.08578v1
- Date: Sat, 17 Apr 2021 15:42:01 GMT
- Title: GupShup: An Annotated Corpus for Abstractive Summarization of
Open-Domain Code-Switched Conversations
- Authors: Laiba Mehnaz, Debanjan Mahata, Rakesh Gosangi, Uma Sushmitha Gunturi,
Riya Jain, Gauri Gupta, Amardeep Kumar, Isabelle Lee, Anish Acharya, Rajiv
Ratn Shah
- Abstract summary: We introduce abstractive summarization of Hindi-English code-switched conversations and develop the first code-switched conversation summarization dataset.
GupShup contains over 6,831 conversations in Hindi-English and their corresponding human-annotated summaries in English and Hindi-English.
We train state-of-the-art abstractive summarization models and report their performances using both automated metrics and human evaluation.
- Score: 28.693328393260906
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Code-switching is the communication phenomenon where speakers switch between
different languages during a conversation. With the widespread adoption of
conversational agents and chat platforms, code-switching has become an integral
part of written conversations in many multi-lingual communities worldwide. This
makes it essential to develop techniques for summarizing and understanding
these conversations. Towards this objective, we introduce abstractive
summarization of Hindi-English code-switched conversations and develop the
first code-switched conversation summarization dataset - GupShup, which
contains over 6,831 conversations in Hindi-English and their corresponding
human-annotated summaries in English and Hindi-English. We present a detailed
account of the entire data collection and annotation processes. We analyze the
dataset using various code-switching statistics. We train state-of-the-art
abstractive summarization models and report their performances using both
automated metrics and human evaluation. Our results show that multi-lingual
mBART and multi-view seq2seq models obtain the best performances on the new
dataset.
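The abstract mentions analyzing the dataset with various code-switching statistics. One widely used statistic is the Code-Mixing Index (CMI) of Das and Gambäck; the sketch below is illustrative only, and assumes per-token language tags (`"hi"`, `"en"`, `"other"`) that are hypothetical rather than GupShup's actual annotation scheme.

```python
# Sketch of the Code-Mixing Index (CMI): one of the standard
# code-switching statistics a corpus like GupShup can be analyzed with.
# The per-token language tags here are hypothetical placeholders.

def code_mixing_index(lang_tags):
    """CMI = 100 * (1 - max_lang_count / (n - u)), where n is the total
    number of tokens and u is the count of language-independent tokens
    (tagged "other"). Monolingual utterances score 0; heavier mixing
    scores higher (upper-bounded by 50 for two languages)."""
    n = len(lang_tags)
    content = [t for t in lang_tags if t != "other"]
    u = n - len(content)
    if n == u:  # no language-tagged tokens at all
        return 0.0
    counts = {}
    for t in content:
        counts[t] = counts.get(t, 0) + 1
    return 100.0 * (1 - max(counts.values()) / (n - u))

# Example: 3 Hindi tokens, 1 English token, 1 language-independent token.
print(code_mixing_index(["hi", "hi", "en", "hi", "other"]))  # 25.0
```

A purely monolingual turn (e.g. all `"en"` tags) yields 0.0, so averaging CMI over all utterances gives a simple corpus-level measure of how heavily mixed the conversations are.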
Related papers
- RetrieveGPT: Merging Prompts and Mathematical Models for Enhanced Code-Mixed Information Retrieval [0.0]
In India, social media users frequently engage in code-mixed conversations using the Roman script.
This paper focuses on the challenges of extracting relevant information from code-mixed conversations.
We develop a mechanism to automatically identify the most relevant answers from code-mixed conversations.
arXiv Detail & Related papers (2024-11-07T14:41:01Z)
- Increasing faithfulness in human-human dialog summarization with Spoken Language Understanding tasks [0.0]
We explore how incorporating task-related information can enhance the summarization process.
Results show that integrating models with task-related information improves summary accuracy, even with varying word error rates.
arXiv Detail & Related papers (2024-09-16T08:15:35Z)
- $\mu$PLAN: Summarizing using a Content Plan as Cross-Lingual Bridge [72.64847925450368]
Cross-lingual summarization consists of generating a summary in one language given an input document in a different language.
This work presents $\mu$PLAN, an approach to cross-lingual summarization that uses an intermediate planning step as a cross-lingual bridge.
arXiv Detail & Related papers (2023-05-23T16:25:21Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Multilingual Transfer Learning for Code-Switched Language and Speech Neural Modeling [12.497781134446898]
We address the data scarcity and limitations of linguistic theory by proposing language-agnostic multi-task training methods.
First, we introduce a meta-learning-based approach, meta-transfer learning, in which information is judiciously extracted from high-resource monolingual speech data to the code-switching domain.
Second, we propose a novel multilingual meta-embeddings approach to effectively represent code-switching data by acquiring useful knowledge learned in other languages.
Third, we introduce multi-task learning to integrate syntactic information as a transfer learning strategy to a language model and learn where to code-switch.
arXiv Detail & Related papers (2021-04-13T14:49:26Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
- A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code-text matching.
Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)
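Several of the papers above evaluate summaries with automated metrics, of which ROUGE is the most common. The sketch below is a minimal, illustrative ROUGE-1 F1 over whitespace tokens; it is not the official ROUGE toolkit, which additionally handles stemming, stopword removal, and the ROUGE-2/ROUGE-L variants.

```python
# Minimal ROUGE-1 F1 sketch: unigram overlap between a candidate summary
# and a single reference summary, using plain whitespace tokenization.
from collections import Counter

def rouge1_f(candidate, reference):
    """F1 over unigram counts (clipped overlap, as in ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Example: precision = 1.0, recall = 0.5, F1 = 2/3.
print(round(rouge1_f("the cat sat", "the cat sat on the mat"), 3))  # 0.667
```

In practice, published results use the reference ROUGE implementation (or a faithful port such as `rouge-score`) so that numbers are comparable across papers; this sketch only shows the core computation.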
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.