MASALA: Modelling and Analysing the Semantics of Adpositions in
Linguistic Annotation of Hindi
- URL: http://arxiv.org/abs/2205.03955v1
- Date: Sun, 8 May 2022 21:13:33 GMT
- Title: MASALA: Modelling and Analysing the Semantics of Adpositions in
Linguistic Annotation of Hindi
- Authors: Aryaman Arora, Nitin Venkateswaran, Nathan Schneider
- Abstract summary: We use language models to attempt automatic labelling of SNACS supersenses in Hindi.
We look towards upstream applications in semantic role labelling and extension to related languages such as Gujarati.
- Score: 11.042037758273226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a completed, publicly available corpus of annotated semantic
relations of adpositions and case markers in Hindi. We used the multilingual
SNACS annotation scheme, which has been applied to a variety of typologically
diverse languages. Building on past work examining linguistic problems in SNACS
annotation, we use language models to attempt automatic labelling of SNACS
supersenses in Hindi and achieve results competitive with past work on English.
We look towards upstream applications in semantic role labelling and extension
to related languages such as Gujarati.
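To make the supersense-labelling task concrete, the sketch below implements a most-frequent-supersense baseline for adposition/case-marker labelling. The Hindi markers and SNACS labels are illustrative toy examples, not drawn from the MASALA corpus, and this baseline stands in for the paper's actual approach, which uses neural language models.

```python
# Toy most-frequent-supersense baseline for SNACS-style labelling.
# Training data maps adpositions/case markers to supersense labels.
from collections import Counter, defaultdict

def train_baseline(annotated):
    """annotated: list of (marker, supersense) pairs."""
    counts = defaultdict(Counter)
    for marker, label in annotated:
        counts[marker][label] += 1
    # For each marker, keep its single most frequent supersense.
    return {m: c.most_common(1)[0][0] for m, c in counts.items()}

def predict(baseline, marker, default="Theme"):
    """Return the memorised label, or a default for unseen markers."""
    return baseline.get(marker, default)

# Illustrative (not corpus-derived) annotations.
train = [
    ("में", "Locus"), ("में", "Locus"), ("में", "Time"),
    ("से", "Source"), ("से", "Instrument"), ("से", "Source"),
    ("को", "Recipient"),
]
baseline = train_baseline(train)
print(predict(baseline, "में"))  # → Locus (2 of 3 examples)
print(predict(baseline, "से"))   # → Source (2 of 3 examples)
```

A contextual model improves on this baseline precisely because markers like "में" are ambiguous between supersenses (here, Locus vs. Time) depending on the sentence.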
Related papers
- Limpeh ga li gong: Challenges in Singlish Annotations [1.3812010983144802]
We work on a fundamental Natural Language Processing task: Part-of-Speech (POS) tagging of Singlish sentences.
For our analysis, we build a parallel Singlish dataset containing direct English translations and POS tags, with translation and POS annotation done by native Singlish speakers.
Experiments show that automatic transition- and transformer-based taggers perform with only ~80% accuracy when evaluated against human-annotated POS labels.
arXiv Detail & Related papers (2024-10-21T16:21:45Z)
- Understanding Cross-Lingual Alignment -- A Survey [52.572071017877704]
Cross-lingual alignment is the meaningful similarity of representations across languages in multilingual language models.
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field.
arXiv Detail & Related papers (2024-04-09T11:39:53Z)
- Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting
an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z)
- Automatic Readability Assessment for Closely Related Languages [6.233117407988574]
This work focuses on how linguistic aspects such as mutual intelligibility or degree of language relatedness can improve ARA in a low-resource setting.
We collect short stories written in three languages of the Philippines (Tagalog, Bikol, and Cebuano) to train readability assessment models.
Our results show that the inclusion of CrossNGO, a novel specialized feature exploiting n-gram overlap applied to languages with high mutual intelligibility, significantly improves the performance of ARA models.
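As a rough illustration of an n-gram-overlap feature of this kind, the sketch below computes Jaccard similarity over character trigram sets of two text samples; the exact CrossNGO formulation in the paper may differ, and the sample strings are invented.

```python
# Character n-gram overlap as a crude language-relatedness signal.
def char_ngrams(text, n=3):
    text = text.replace(" ", "_")  # mark word boundaries
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(a, b, n=3):
    """Jaccard similarity of the two texts' character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# Invented sample strings; related languages share more n-grams.
print(ngram_overlap("magandang umaga", "maogmang aga"))
```

Identical texts score 1.0 and texts with no shared trigrams score 0.0, so the feature rises with surface-level similarity between languages.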
arXiv Detail & Related papers (2023-05-22T20:42:53Z)
- SanskritShala: A Neural Sanskrit NLP Toolkit with Web-Based Interface
for Pedagogical and Annotation Purposes [13.585440544031584]
We present a neural Sanskrit Natural Language Processing (NLP) toolkit named SanskritShala.
Our systems report state-of-the-art performance on available benchmark datasets for all tasks.
SanskritShala is deployed as a web-based application, which allows a user to get real-time analysis for the given input.
arXiv Detail & Related papers (2023-02-19T09:58:55Z)
- Multilingual Word Sense Disambiguation with Unified Sense Representation [55.3061179361177]
We propose building knowledge-based and supervised Multilingual Word Sense Disambiguation (MWSD) systems.
We build unified sense representations for multiple languages and address the annotation scarcity problem for MWSD by transferring annotations from rich-sourced languages to poorer ones.
Evaluations on the SemEval-13 and SemEval-15 datasets demonstrate the effectiveness of our methodology.
arXiv Detail & Related papers (2022-10-14T01:24:03Z)
- Multilingual Extraction and Categorization of Lexical Collocations with
Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- Utilizing Wordnets for Cognate Detection among Indian Languages [50.83320088758705]
We detect cognate word pairs among ten Indian languages with Hindi.
We use deep learning methodologies to predict whether a word pair is cognate or not.
We report performance improvements of up to 26%.
arXiv Detail & Related papers (2021-12-30T16:46:28Z)
- Harnessing Cross-lingual Features to Improve Cognate Detection for
Low-resource Languages [50.82410844837726]
We demonstrate the use of cross-lingual word embeddings for detecting cognates among fourteen Indian languages.
We evaluate our methods to detect cognates on a challenging dataset of twelve Indian languages.
We observe an improvement of up to 18 percentage points in F-score for cognate detection.
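A minimal sketch of how cross-lingual embeddings can feed cognate detection: score a candidate word pair by cosine similarity of their vectors and threshold it. The vectors and the 0.7 threshold below are toy assumptions; real systems use trained embeddings and a learned classifier.

```python
# Cognate scoring via cosine similarity of cross-lingual embeddings.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def is_cognate(u, v, threshold=0.7):
    """Toy decision rule: high embedding similarity => cognate."""
    return cosine(u, v) >= threshold

# Toy vectors standing in for embeddings of word pairs.
hindi_vec = [0.9, 0.1, 0.3]
marathi_vec = [0.85, 0.15, 0.35]   # near-parallel: likely cognate
unrelated_vec = [-0.2, 0.9, -0.4]  # divergent: likely not

print(is_cognate(hindi_vec, marathi_vec))    # → True
print(is_cognate(hindi_vec, unrelated_vec))  # → False
```

Because cross-lingual embeddings place translations and cognates near each other in a shared space, even this simple threshold rule captures the intuition the paper builds on.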
arXiv Detail & Related papers (2021-12-16T11:17:58Z)
- For the Purpose of Curry: A UD Treebank for Ashokan Prakrit [2.538209532048867]
We present the first linguistically annotated treebank of Ashokan Prakrit.
This is an early Middle Indo-Aryan dialect continuum attested through Emperor Ashoka Maurya's 3rd century BCE rock and pillar edicts.
arXiv Detail & Related papers (2021-11-24T20:30:09Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.