Vy\=akarana: A Colorless Green Benchmark for Syntactic Evaluation in
Indic Languages
- URL: http://arxiv.org/abs/2103.00854v1
- Date: Mon, 1 Mar 2021 09:07:58 GMT
- Title: Vy\=akarana: A Colorless Green Benchmark for Syntactic Evaluation in
Indic Languages
- Authors: Rajaswa Patil, Jasleen Dhillon, Siddhant Mahurkar, Saumitra Kulkarni,
Manav Malhotra and Veeky Baths
- Abstract summary: Indic languages have rich morphosyntax, grammatical genders, free linear word-order, and highly inflectional morphology.
We introduce Vy=akarana: a benchmark of gender-balanced Colorless Green sentences in Indic languages for syntactic evaluation of multilingual language models.
We use the datasets from the evaluation tasks to probe five multilingual language models of varying architectures for syntax in Indic languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While there has been significant progress towards developing NLU datasets and
benchmarks for Indic languages, syntactic evaluation has been relatively less
explored. Unlike English, Indic languages have rich morphosyntax, grammatical
genders, free linear word-order, and highly inflectional morphology. In this
paper, we introduce Vy\=akarana: a benchmark of gender-balanced Colorless Green
sentences in Indic languages for syntactic evaluation of multilingual language
models. The benchmark comprises four syntax-related tasks: PoS Tagging, Syntax
Tree-depth Prediction, Grammatical Case Marking, and Subject-Verb Agreement. We
use the datasets from the evaluation tasks to probe five multilingual language
models of varying architectures for syntax in Indic languages. Our results show
that the token-level and sentence-level representations from the Indic language
models (IndicBERT and MuRIL) do not capture the syntax in Indic languages as
efficiently as the other highly multilingual language models. Further, our
layer-wise probing experiments reveal that while mBERT, DistilmBERT, and XLM-R
localize the syntax in middle layers, the Indic language models do not show
such syntactic localization.
Related papers
- Prompt Engineering Using GPT for Word-Level Code-Mixed Language Identification in Low-Resource Dravidian Languages [0.0]
In multilingual societies like India, text often exhibits code-mixing, blending local languages with English at different linguistic levels.
This paper introduces a prompt based method for a shared task aimed at addressing word-level LI challenges in Dravidian languages.
In this work, we leveraged GPT-3.5 Turbo to understand whether the large language models is able to correctly classify words into correct categories.
arXiv Detail & Related papers (2024-11-06T16:20:37Z) - IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? [14.77467551053299]
Transformer-based models have revolutionized the field of natural language processing.
How robust are these models in encoding linguistic properties when faced with perturbations in the input text?
In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages.
arXiv Detail & Related papers (2024-10-03T15:50:08Z) - Navigating Text-to-Image Generative Bias across Indic Languages [53.92640848303192]
This research investigates biases in text-to-image (TTI) models for the Indic languages widely spoken across India.
It evaluates and compares the generative performance and cultural relevance of leading TTI models in these languages against their performance in English.
arXiv Detail & Related papers (2024-08-01T04:56:13Z) - Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z) - RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs [2.9521383230206966]
This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP)
RuBLiMP includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon.
We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense.
arXiv Detail & Related papers (2024-06-27T14:55:19Z) - Tokenization Impacts Multilingual Language Modeling: Assessing
Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers.
Our findings show that the overlap of vocabulary across languages can be actually detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - Call Larisa Ivanovna: Code-Switching Fools Multilingual NLU Models [1.827510863075184]
Novel benchmarks for multilingual natural language understanding (NLU) include monolingual sentences in several languages, annotated with intents and slots.
Existing benchmarks lack of code-switched utterances, which are difficult to gather and label due to complexity in the grammatical structure.
Our work adopts recognized methods to generate plausible and naturally-sounding code-switched utterances and uses them to create a synthetic code-switched test set.
arXiv Detail & Related papers (2021-09-29T11:15:00Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z) - Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
arXiv Detail & Related papers (2020-10-27T22:24:01Z) - Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual
Lexical Semantic Similarity [67.36239720463657]
Multi-SimLex is a large-scale lexical resource and evaluation benchmark covering datasets for 12 diverse languages.
Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs.
Owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets.
arXiv Detail & Related papers (2020-03-10T17:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.