Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language
Annotation
- URL: http://arxiv.org/abs/2310.07826v1
- Date: Wed, 11 Oct 2023 19:09:07 GMT
- Title: Antarlekhaka: A Comprehensive Tool for Multi-task Natural Language
Annotation
- Authors: Hrishikesh Terdalkar (1) and Arnab Bhattacharya (1) ((1) Indian
Institute of Technology Kanpur)
- Abstract summary: Antarlekhaka is a tool for manual annotation of a comprehensive set of tasks relevant to Natural Language Processing.
The tool is Unicode-compatible, language-agnostic, Web-deployable and supports distributed annotation by multiple simultaneous annotators.
It has been used for two real-life annotation tasks on two different languages, namely, Sanskrit and Bengali.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the primary obstacles in the advancement of Natural Language
Processing (NLP) technologies for low-resource languages is the lack of
annotated datasets for training and testing machine learning models. In this
paper, we present Antarlekhaka, a tool for manual annotation of a comprehensive
set of tasks relevant to NLP. The tool is Unicode-compatible,
language-agnostic, Web-deployable and supports distributed annotation by
multiple simultaneous annotators. The system sports user-friendly interfaces
for 8 categories of annotation tasks. These, in turn, enable the annotation of
a considerably larger set of NLP tasks. The task categories include two
linguistic tasks not handled by any other tool, namely, sentence boundary
detection and deciding canonical word order, which are important tasks for text
that is in the form of poetry. We propose the idea of sequential annotation
based on small text units, where an annotator performs several tasks related to
a single text unit before proceeding to the next unit. The research
applications of the proposed mode of multi-task annotation are also discussed.
Antarlekhaka outperforms other annotation tools in objective evaluation. It has
been also used for two real-life annotation tasks on two different languages,
namely, Sanskrit and Bengali. The tool is available at
https://github.com/Antarlekhaka/code.
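The sequential, unit-wise annotation workflow described in the abstract can be illustrated with a minimal sketch. All names and data structures here are hypothetical illustrations of the idea, not Antarlekhaka's actual (Web-based) interface:

```python
# Sketch of sequential multi-task annotation over small text units:
# the annotator completes every task for one unit before moving on.
# Task names below are illustrative, not Antarlekhaka's task list.

TASK_CATEGORIES = [
    "sentence_boundary",      # mark where a sentence ends (key for poetry)
    "canonical_word_order",   # reorder words into canonical prose order
    "token_annotation",       # e.g. lemma or part-of-speech labels
]

def annotate_corpus(text_units, annotate):
    """For each small text unit, perform all tasks before the next unit."""
    results = []
    for unit in text_units:
        record = {"unit": unit}
        for task in TASK_CATEGORIES:
            record[task] = annotate(task, unit)  # one annotator decision
        results.append(record)
    return results

# Example with a trivial stand-in "annotator" that echoes its input.
units = ["verse line 1", "verse line 2"]
annotations = annotate_corpus(units, lambda task, unit: f"{task}:{unit}")
```

The point of this ordering is that the annotator keeps a single text unit in working memory across all tasks, rather than re-reading the whole corpus once per task.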
Related papers
- A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
We introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text.
New dataset annotations provide continuous sign-level annotations for six hours of test videos, and will be made publicly available.
Our model significantly outperforms the previous state of the art on both tasks.
arXiv Detail & Related papers (2024-05-16T17:19:06Z)
- Wav2Gloss: Generating Interlinear Glossed Text from Speech
We propose Wav2Gloss, a task in which four linguistic annotation components are extracted automatically from speech.
We provide various baselines to lay the groundwork for future research on Interlinear Glossed Text generation from speech.
arXiv Detail & Related papers (2024-03-19T21:45:29Z)
- EEVEE: An Easy Annotation Tool for Natural Language Processing
We propose EEVEE, an annotation tool focused on simplicity, efficiency, and ease of use.
It can run directly in the browser (no setup required) and uses tab-separated files (as opposed to character offsets or task-specific formats) for annotation.
arXiv Detail & Related papers (2024-02-05T10:24:40Z)
- Thresh: A Unified, Customizable and Deployable Platform for Fine-Grained Text Evaluation
We introduce Thresh, a unified, customizable and deployable platform for fine-grained evaluation.
Thresh provides a community hub that hosts a collection of fine-grained frameworks and corresponding annotations made and collected by the community.
For deployment, Thresh offers multiple options for any scale of annotation projects from small manual inspections to large crowdsourcing ones.
arXiv Detail & Related papers (2023-08-14T06:09:51Z)
- POTATO: The Portable Text Annotation Tool
We present POTATO, a free, fully open-sourced annotation system.
It supports labeling many types of text and multimodal data.
It offers easy-to-configure features to maximize the productivity of both deployers and annotators.
arXiv Detail & Related papers (2022-12-16T17:57:41Z)
- Binding Language Models in Symbolic Languages
Binder is a training-free neural-symbolic framework that maps the task input to a program.
In the parsing stage, Codex identifies the parts of the task input that cannot be answered by the original programming language.
In the execution stage, Codex can perform versatile functionalities given proper prompts in the API calls.
arXiv Detail & Related papers (2022-10-06T12:55:17Z)
- Bridging Cross-Lingual Gaps During Leveraging the Multilingual Sequence-to-Sequence Pretraining for Text Generation
We extend the vanilla pretrain-finetune pipeline with an extra code-switching restoration task to bridge the gap between the pretrain and finetune stages.
Our approach narrows the cross-lingual sentence-representation distance and improves low-frequency word translation at trivial computational cost.
arXiv Detail & Related papers (2022-04-16T16:08:38Z) - Annotation Curricula to Implicitly Train Non-Expert Annotators [56.67768938052715]
Voluntary studies often require annotators to familiarize themselves with the task, its annotation scheme, and the data domain.
This can be overwhelming at first, mentally taxing, and can induce errors in the resulting annotations.
We propose annotation curricula, a novel approach to implicitly train annotators.
arXiv Detail & Related papers (2021-06-04T09:48:28Z) - HIT: A Hierarchically Fused Deep Attention Network for Robust Code-mixed
Language Representation [18.136640008855117]
We propose HIT, a robust representation learning method for code-mixed texts.
HIT is a hierarchical transformer-based framework that captures the semantic relationship among words.
Our evaluation of HIT on one European (Spanish) and five Indic (Hindi, Bengali, Tamil, Telugu, and Malayalam) languages suggests significant performance improvement against various state-of-the-art systems.
arXiv Detail & Related papers (2021-05-30T18:53:33Z) - N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts a multi-task framework built on a shared pre-trained model, which has the advantage of capturing knowledge shared across relevant Chinese tasks.
arXiv Detail & Related papers (2020-09-24T11:45:39Z) - Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning
in NLP [24.981991538150584]
MaChAmp is a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings.
The benefits of MaChAmp are its flexible configuration options, and the support of a variety of natural language processing tasks in a uniform toolkit.
arXiv Detail & Related papers (2020-05-29T16:54:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.