ALBERTI, a Multilingual Domain Specific Language Model for Poetry
Analysis
- URL: http://arxiv.org/abs/2307.01387v1
- Date: Mon, 3 Jul 2023 22:50:53 GMT
- Title: ALBERTI, a Multilingual Domain Specific Language Model for Poetry
Analysis
- Authors: Javier de la Rosa, Álvaro Pérez Pozo, Salvador Ros, Elena González-Blanco
- Abstract summary: We present Alberti, the first multilingual pre-trained large language model for poetry.
We further trained multilingual BERT on a corpus of over 12 million verses from 12 languages.
Alberti achieves state-of-the-art results for German when compared to rule-based systems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The computational analysis of poetry is limited by the scarcity of tools to
automatically analyze and scan poems. In multilingual settings, the problem
is exacerbated as scansion and rhyme systems only exist for individual
languages, making comparative studies very challenging and time-consuming. In
this work, we present \textsc{Alberti}, the first multilingual pre-trained
large language model for poetry. Through domain-specific pre-training (DSP), we
further trained multilingual BERT on a corpus of over 12 million verses from 12
languages. We evaluated its performance on two structural poetry tasks: Spanish
stanza type classification, and metrical pattern prediction for Spanish,
English and German. In both cases, \textsc{Alberti} outperforms multilingual
BERT and other transformer-based models of similar sizes, and even achieves
state-of-the-art results for German when compared to rule-based systems,
demonstrating the feasibility and effectiveness of DSP in the poetry domain.
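The abstract describes DSP as further training multilingual BERT on a corpus of over 12 million verses, presumably with BERT's masked-language-modeling objective, before evaluating on the structural tasks. As a rough illustration only, a continued-pretraining run with the Hugging Face transformers library might look like the sketch below; the corpus file (verses.txt), checkpoint choice, sequence length, and hyperparameters are assumptions for illustration, not the authors' reported setup.

```python
# Minimal sketch of domain-specific pre-training (DSP): continue masked-language-
# model training of multilingual BERT on a verse corpus. The corpus file
# (verses.txt), sequence length, and hyperparameters are illustrative
# assumptions, not the configuration reported in the paper.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-multilingual-cased"  # standard multilingual BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Hypothetical plain-text corpus with one verse per line.
verses = load_dataset("text", data_files={"train": "verses.txt"})["train"]

def tokenize(batch):
    # Verses are short, so a small max_length keeps batches compact.
    return tokenizer(batch["text"], truncation=True, max_length=64)

tokenized = verses.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masking: 15% of tokens are selected for prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="alberti-dsp",
    per_device_train_batch_size=64,
    num_train_epochs=1,  # illustrative; real DSP would train far longer
    learning_rate=5e-5,
    save_strategy="epoch",
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

A natural next step, matching the evaluation described above, would be to load the resulting checkpoint with a sequence-classification head and fine-tune it on labelled data such as Spanish stanza types or metrical patterns, in the same way the baseline models are fine-tuned.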
Related papers
- Investigating Language-Specific Calibration For Pruning Multilingual Large Language Models [11.421452042888523]
We compare different calibration languages for pruning multilingual models across diverse languages, tasks, models, and state-of-the-art pruning techniques.
Our results offer practical suggestions; for example, calibrating in the target language can efficiently retain the language modeling capability but does not necessarily benefit downstream tasks.
arXiv Detail & Related papers (2024-08-26T16:29:13Z) - Towards a Deep Understanding of Multilingual End-to-End Speech
Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z) - UniMax: Fairer and more Effective Language Sampling for Large-Scale
Multilingual Pretraining [92.3702056505905]
We propose a new sampling method, UniMax, that delivers more uniform coverage of head languages while mitigating overfitting on tail languages.
We find that UniMax outperforms standard temperature-based sampling, and the benefits persist as scale increases.
arXiv Detail & Related papers (2023-04-18T17:45:50Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - Multilingual BERT has an accent: Evaluating English influences on
fluency in multilingual models [23.62852626011989]
We show that grammatical structures in higher-resource languages bleed into lower-resource languages.
We show this bias via a novel method for comparing the fluency of multilingual models to the fluency of monolingual Spanish and Greek models.
arXiv Detail & Related papers (2022-10-11T17:06:38Z) - Analyzing the Mono- and Cross-Lingual Pretraining Dynamics of
Multilingual Language Models [73.11488464916668]
This study investigates the dynamics of the multilingual pretraining process.
We probe checkpoints taken from throughout XLM-R pretraining, using a suite of linguistic tasks.
Our analysis shows that the model achieves high in-language performance early on, with lower-level linguistic skills acquired before more complex ones.
arXiv Detail & Related papers (2022-05-24T03:35:00Z) - Exploring Teacher-Student Learning Approach for Multi-lingual
Speech-to-Intent Classification [73.5497360800395]
We develop an end-to-end system that supports multiple languages.
We exploit knowledge from a pre-trained multi-lingual natural language processing model.
arXiv Detail & Related papers (2021-09-28T04:43:11Z) - Don't Go Far Off: An Empirical Study on Neural Poetry Translation [13.194404923699782]
We present an empirical investigation for poetry translation along several dimensions.
We contribute a parallel dataset of poetry translations for several language pairs.
Our results show that multilingual fine-tuning on poetic text significantly outperforms multilingual fine-tuning on non-poetic text that is 35X larger in size.
arXiv Detail & Related papers (2021-09-07T10:00:44Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Metrical Tagging in the Wild: Building and Annotating Poetry Corpora
with Rhythmic Features [0.0]
We provide large poetry corpora for English and German, and annotate prosodic features in smaller corpora to train corpus-driven neural models.
We show that BiLSTM-CRF models with syllable embeddings outperform a CRF baseline and different BERT-based approaches.
arXiv Detail & Related papers (2021-02-17T16:38:57Z) - Multilingual AMR-to-Text Generation [22.842874899794996]
We create multilingual AMR-to-text models that generate in twenty-one different languages.
For eighteen languages, based on automatic metrics, our multilingual models surpass baselines that generate into a single language.
We analyse the ability of our multilingual models to accurately capture morphology and word order using human evaluation, and find that native speakers judge our generations to be fluent.
arXiv Detail & Related papers (2020-11-10T22:47:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.