COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework
- URL: http://arxiv.org/abs/2408.03125v1
- Date: Tue, 6 Aug 2024 11:56:26 GMT
- Title: COMMENTATOR: A Code-mixed Multilingual Text Annotation Framework
- Authors: Rajvee Sheth, Shubh Nisar, Heenaben Prajapati, Himanshu Beniwal, Mayank Singh
- Abstract summary: We introduce a code-mixed multilingual text annotation framework, COMMENTATOR, specifically designed for annotating code-mixed text.
The tool demonstrates its effectiveness in token-level and sentence-level language annotation tasks for Hinglish text.
- Score: 1.114560772534785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the NLP community increasingly addresses challenges associated with multilingualism, robust annotation tools are essential for handling multilingual datasets efficiently. In this paper, we introduce COMMENTATOR, a multilingual text annotation framework specifically designed for annotating code-mixed text. The tool demonstrates its effectiveness in token-level and sentence-level language annotation tasks for Hinglish text. We perform robust qualitative human-based evaluations to show that COMMENTATOR enables 5x faster annotation than the best baseline. Our code is publicly available at https://github.com/lingo-iitgn/commentator. The demonstration video is available at https://bit.ly/commentator_video.
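To make the token-level and sentence-level tasks concrete, below is a minimal sketch of what the two kinds of Hinglish language annotation could look like. The tag set (hi, en) and the data layout are illustrative assumptions, not COMMENTATOR's actual schema.

```python
# Illustrative sketch of the two annotation tasks described above.
# NOTE: the tag set and data layout are assumptions for illustration,
# not COMMENTATOR's actual annotation schema.
from dataclasses import dataclass, field

@dataclass
class AnnotatedSentence:
    text: str
    # Token-level task: one language tag per token.
    token_labels: list[tuple[str, str]] = field(default_factory=list)
    # Sentence-level task: one label for the whole sentence,
    # e.g. its dominant (matrix) language.
    sentence_label: str = ""

sent = AnnotatedSentence(text="yeh framework bahut useful hai")

# Token-level language identification: tag each token as Hindi or English.
for token, tag in [("yeh", "hi"), ("framework", "en"),
                   ("bahut", "hi"), ("useful", "en"), ("hai", "hi")]:
    sent.token_labels.append((token, tag))

# Sentence-level annotation: mark the dominant language of the sentence.
sent.sentence_label = "hi"

print(sent)
```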
Related papers
- COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing [1.3062731746155414]
COMI-LINGUA is the largest manually annotated dataset for code-mixed text, comprising 100,970 instances evaluated by three expert annotators in both Devanagari and Roman scripts.
The dataset supports five fundamental NLP tasks: Language Identification, Matrix Language Identification, Part-of-Speech Tagging, Named Entity Recognition, and Translation.
We evaluate LLMs on these tasks using COMI-LINGUA, revealing limitations in current multilingual modeling strategies and emphasizing the need for improved code-mixed text processing capabilities.
arXiv Detail & Related papers (2025-03-27T16:36:39Z)
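As a rough illustration of the five tasks listed above, a single annotated instance might take a shape like the following. The field names, tag inventories, and example sentence are hypothetical, not COMI-LINGUA's actual schema.

```python
# Hypothetical shape of one Hinglish instance annotated for the five
# COMI-LINGUA tasks; all field names and tags are illustrative guesses.
instance = {
    "text": "mujhe yeh movie pasand aayi",
    # Language Identification: one tag per token.
    "lid": ["hi", "hi", "en", "hi", "hi"],
    # Matrix Language Identification: dominant language of the sentence.
    "matrix_language": "hi",
    # Part-of-Speech Tagging (Universal POS tags assumed).
    "pos": ["PRON", "DET", "NOUN", "ADJ", "VERB"],
    # Named Entity Recognition (BIO scheme assumed; no entities here).
    "ner": ["O", "O", "O", "O", "O"],
    # Translation into English.
    "translation": "I liked this movie",
}
```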
- Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues [56.038123093599815]
Our objective is to translate continuous sign language into spoken language text.
We incorporate additional contextual cues together with the signing video.
We show that our contextual approach significantly enhances the quality of the translations.
arXiv Detail & Related papers (2025-01-16T18:59:03Z)
- Multilingual Synopses of Movie Narratives: A Dataset for Vision-Language Story Understanding [19.544839928488972]
We construct a large-scale multilingual video story dataset named Multilingual Synopses of Movie Narratives (M-SYMON).
M-SYMON contains 13,166 movie summary videos from 7 languages, as well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video.
Training on the human-annotated data from SyMoN outperforms SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU, respectively.
arXiv Detail & Related papers (2024-06-18T22:44:50Z)
- Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods mainly focus on aligning vision encoders with Multimodal Large Language Models (MLLMs).
We introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level.
Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
- MSNER: A Multilingual Speech Dataset for Named Entity Recognition [34.88608417778945]
We introduce MSNER, a freely available, multilingual speech corpus annotated with named entities.
It provides annotations for the VoxPopuli dataset in four languages.
This yields 590 and 15 hours of silver-annotated speech for training and validation, respectively, alongside a 17-hour, manually annotated evaluation set.
arXiv Detail & Related papers (2024-05-19T11:17:00Z)
- Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language Annotation [0.0]
Antarlekhaka is a tool for manual annotation of a comprehensive set of tasks relevant to Natural Language Processing.
The tool is Unicode-compatible, language-agnostic, Web-deployable and supports distributed annotation by multiple simultaneous annotators.
It has been used for two real-life annotation tasks in two different languages, namely Sanskrit and Bengali.
arXiv Detail & Related papers (2023-10-11T19:09:07Z)
- Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation [11.690442820401453]
We introduce Thresh, a unified, customizable and deployable platform for fine-grained evaluation.
Thresh provides a community hub that hosts a collection of fine-grained frameworks and corresponding annotations made and collected by the community.
For deployment, Thresh offers multiple options for annotation projects of any scale, from small manual inspections to large crowdsourcing efforts.
arXiv Detail & Related papers (2023-08-14T06:09:51Z)
- Chinese Open Instruction Generalist: A Preliminary Release [33.81265396916227]
We propose this project as an attempt to create a Chinese instruction dataset using various methods adapted to the intrinsic characteristics of four sub-tasks.
We collect around 200k Chinese instruction tuning samples, which have been manually checked to guarantee high quality.
We summarize the existing English and Chinese instruction corpora and briefly describe some potential applications of the newly constructed Chinese instruction corpora.
arXiv Detail & Related papers (2023-04-17T04:45:06Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
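A minimal sketch of the vokenization idea above: each token in context is matched to its most relevant image (its "voken"), and the retrieved voken IDs can then serve as auxiliary prediction targets for a language model. The random embeddings below are stand-ins for a trained token-image relevance model, so the retrieval logic is the point, not the scores.

```python
# Sketch of vokenization: retrieve the most relevant image ("voken")
# for each token. Random embeddings stand in for a trained
# token-image relevance model.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["a", "dog", "runs", "on", "the", "beach"]
N_IMAGES = 1000

token_emb = {w: rng.normal(size=64) for w in VOCAB}  # contextual token reps (stand-in)
image_emb = rng.normal(size=(N_IMAGES, 64))          # image reps (stand-in)

def vokenize(tokens):
    """Map each token to the ID of its highest-scoring image."""
    vokens = []
    for tok in tokens:
        scores = image_emb @ token_emb[tok]   # relevance of every image to this token
        vokens.append(int(scores.argmax()))   # retrieved voken ID
    return vokens

# A visually-supervised LM would predict these IDs as an auxiliary task.
print(vokenize(VOCAB))
```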
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
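To illustrate the single-decoder idea behind COSTT, here is a sketch of what a consecutive decoding target sequence could look like: the transcript is emitted first, then a separator, then the translation, so transcription conditions the subsequent translation. The tokens and separator symbol are illustrative assumptions, not COSTT's actual vocabulary.

```python
# Sketch of a consecutive decoding target: one decoder emits the
# source transcript, a separator, then the target translation.
# Tokens and the <sep> symbol are illustrative assumptions.
transcript = ["ich", "liebe", "musik"]   # source-language transcript
translation = ["i", "love", "music"]     # target-language text
SEP = "<sep>"

# Single concatenated sequence trained with one decoder; earlier
# transcript tokens condition the later translation tokens.
decoder_target = transcript + [SEP] + translation
print(decoder_target)
# ['ich', 'liebe', 'musik', '<sep>', 'i', 'love', 'music']
```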
- "Listen, Understand and Translate": Triple Supervision Decouples End-to-end Speech-to-text Translation [49.610188741500274]
An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language.
Existing methods are limited by the amount of parallel corpus.
We build a system to fully utilize signals in a parallel ST corpus.
arXiv Detail & Related papers (2020-09-21T09:19:07Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.