Benchmarking zero-shot and few-shot approaches for tokenization,
tagging, and dependency parsing of Tagalog text
- URL: http://arxiv.org/abs/2208.01814v1
- Date: Wed, 3 Aug 2022 02:20:10 GMT
- Authors: Angelina Aquino and Franz de Leon
- Abstract summary: We investigate the use of auxiliary data sources for creating task-specific models in the absence of annotated Tagalog data.
We show that these zero-shot and few-shot approaches yield substantial improvements on grammatical analysis of both in-domain and out-of-domain Tagalog text.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The grammatical analysis of texts in any human language typically involves a
number of basic processing tasks, such as tokenization, morphological tagging,
and dependency parsing. State-of-the-art systems can achieve high accuracy on
these tasks for languages with large datasets, but yield poor results for
languages such as Tagalog which have little to no annotated data. To address
this issue for the Tagalog language, we investigate the use of auxiliary data
sources for creating task-specific models in the absence of annotated Tagalog
data. We also explore the use of word embeddings and data augmentation to
improve performance when only a small amount of annotated Tagalog data is
available. We show that these zero-shot and few-shot approaches yield
substantial improvements on grammatical analysis of both in-domain and
out-of-domain Tagalog text compared to state-of-the-art supervised baselines.
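As a rough illustration of the first processing task the abstract names, the sketch below is a minimal regex-based tokenizer; it is a toy stand-in, not the paper's actual system, and the sample sentence and token rules are illustrative assumptions.

```python
import re

def tokenize(text):
    """Minimal tokenizer (toy stand-in for the task described in the
    abstract, not the paper's system): split off punctuation as
    separate tokens, but keep hyphen- and apostrophe-internal forms
    (common in Tagalog reduplication, e.g. "araw-araw") intact."""
    return re.findall(r"[\w'-]+|[^\w\s]", text)

# Illustrative Tagalog sentence ("The child is reading a book.")
print(tokenize("Nagbabasa ang bata ng libro."))
# → ['Nagbabasa', 'ang', 'bata', 'ng', 'libro', '.']
```

Real zero-shot pipelines would instead transfer a trained tokenizer or model from auxiliary data, as the paper investigates; this only shows the shape of the task's input and output.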
Related papers
- End-to-End Speech Translation for Low-Resource Languages Using Weakly Labeled Data [5.950263765640278]
This paper explores the hypothesis that weakly labeled data can be used to build speech-to-text translation models.
We constructed datasets with the help of bitext mining using state-of-the-art sentence encoders.
Results demonstrate that ST systems can be built using weakly labeled data, with performance comparable to massive multi-modal multilingual baselines.
arXiv Detail & Related papers (2025-06-19T12:11:01Z)
- The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project [0.0]
This paper presents UD-NewsCrawl, the largest Tagalog treebank to date, containing 15.6k trees manually annotated according to the Universal Dependencies framework.
We detail our treebank development process, including data collection, pre-processing, manual annotation, and quality assurance procedures.
arXiv Detail & Related papers (2025-05-26T18:25:10Z)
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A, a large drop of more than 25% in performance is observed on redacted queries compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- TabLLM: Few-shot Classification of Tabular Data with Large Language Models [66.03023402174138]
We study the application of large language models to zero-shot and few-shot classification.
We evaluate several serialization methods including templates, table-to-text models, and large language models.
This approach is also competitive with strong traditional baselines like gradient-boosted trees.
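The serialization step this entry mentions can be sketched as a simple row-to-text template; the field names and template wording below are illustrative assumptions, not the TabLLM paper's exact prompts.

```python
def serialize_row(row):
    """Turn one tabular record into a natural-language string that a
    large language model could classify few-shot (template-style
    serialization; the field names here are hypothetical)."""
    return ". ".join(f"The {k} is {v}" for k, v in row.items()) + "."

row = {"age": 42, "occupation": "teacher", "hours per week": 38}
print(serialize_row(row))
# → The age is 42. The occupation is teacher. The hours per week is 38.
```

The entry also mentions table-to-text models and LLMs as alternative serializers; a fixed template like this is the simplest of the methods being compared.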
arXiv Detail & Related papers (2022-10-19T17:08:13Z)
- Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases [0.0]
We propose a method for training effective language-specific sentence encoders without manually labeled data.
Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora.
Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks.
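The mining step this entry describes can be sketched as pairing sentences whose embeddings are sufficiently similar; the toy 3-d vectors and the plain cosine threshold below are illustrative assumptions (real mining systems typically use stronger criteria such as margin-based scoring).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mine_pairs(embeddings_a, embeddings_b, threshold=0.9):
    """Pair up sentences from two corpora whose embedding similarity
    clears a threshold (toy variant of bitext/paraphrase mining)."""
    return [(i, j)
            for i, u in enumerate(embeddings_a)
            for j, v in enumerate(embeddings_b)
            if cosine(u, v) >= threshold]

# Toy 3-d "sentence embeddings" standing in for encoder output.
a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [[0.99, 0.1, 0.0], [0.0, 0.0, 1.0]]
print(mine_pairs(a, b))  # → [(0, 0)]: only the first pair clears 0.9
```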
arXiv Detail & Related papers (2022-07-26T09:08:56Z)
- Multilingual Extraction and Categorization of Lexical Collocations with Graph-aware Transformers [86.64972552583941]
We put forward a sequence tagging BERT-based model enhanced with a graph-aware transformer architecture, which we evaluate on the task of collocation recognition in context.
Our results suggest that explicitly encoding syntactic dependencies in the model architecture is helpful, and provide insights on differences in collocation typification in English, Spanish and French.
arXiv Detail & Related papers (2022-05-23T16:47:37Z)
- On The Ingredients of an Effective Zero-shot Semantic Parser [95.01623036661468]
We analyze zero-shot learning by paraphrasing training examples of canonical utterances and programs from a grammar.
We propose bridging these gaps using improved grammars, stronger paraphrasers, and efficient learning methods.
Our model achieves strong performance on two semantic parsing benchmarks (Scholar, Geo) with zero labeled data.
arXiv Detail & Related papers (2021-10-15T21:41:16Z)
- DT-grams: Structured Dependency Grammar Stylometry for Cross-Language Authorship Attribution [0.20305676256390934]
We present a novel language-independent feature for authorship analysis based on dependency graphs and universal part of speech tags, called DT-grams.
We evaluate DT-grams by performing cross-language authorship attribution on untranslated datasets of bilingual authors.
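The feature idea behind DT-grams (patterns over universal POS tags and dependency structure) can be sketched, in heavily simplified form, as counting POS-tag bigrams; the tag sequence below is an illustrative assumption, and real DT-grams additionally walk the dependency graph.

```python
from collections import Counter

def pos_bigrams(tags):
    """Count bigrams over a universal POS-tag sequence (a simplified,
    dependency-free stand-in for DT-gram features)."""
    return Counter(zip(tags, tags[1:]))

# Hypothetical UPOS tags for a short sentence.
tags = ["VERB", "DET", "NOUN", "ADP", "NOUN"]
print(pos_bigrams(tags))
```

Because such counts depend on grammar rather than vocabulary, they can survive translation or language change, which is what makes this family of features attractive for cross-language authorship attribution.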
arXiv Detail & Related papers (2021-06-10T11:50:07Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study presents an assessment of existing language models in distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
- Sparsely Factored Neural Machine Translation [3.4376560669160394]
A standard approach to incorporating linguistic information into neural machine translation systems consists of maintaining separate vocabularies for each of the annotated features.
We propose a method suited for such a case, showing large improvements on out-of-domain data and comparable quality on in-domain data.
Experiments are performed in morphologically-rich languages like Basque and German, for the case of low-resource scenarios.
arXiv Detail & Related papers (2021-02-17T18:42:00Z)
- Neural Approaches for Data Driven Dependency Parsing in Sanskrit [19.844420181108177]
We evaluate four different data-driven machine learning models, originally proposed for different languages, and compare their performances on Sanskrit data.
We compare the performance of each of the models in a low-resource setting, with 1,500 sentences for training.
We also investigate the impact of word ordering in which the sentences are provided as input to these systems, by parsing verses and their corresponding prose order.
arXiv Detail & Related papers (2020-04-17T06:47:15Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language may fail to handle target languages with different word orders.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.