Learning language variations in news corpora through differential
embeddings
- URL: http://arxiv.org/abs/2011.06949v1
- Date: Fri, 13 Nov 2020 14:50:08 GMT
- Title: Learning language variations in news corpora through differential
embeddings
- Authors: Carlos Selmo, Julian F. Martinez, Mariano G. Beiró and J. Ignacio
Alvarez-Hamelin
- Abstract summary: We show that a model with a central word representation and a slice-dependent contribution can learn word embeddings from different corpora simultaneously.
We show that it can capture both temporal dynamics in the yearly slices of each corpus and language variations between US and UK English in a curated multi-source corpus.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: There is increasing interest in the NLP community in capturing variations
in the usage of language, whether through time (i.e., semantic drift), across
regions (as dialects or variants), or in different social contexts (e.g.,
professional or media technolects). Several successful dynamical embedding models
have been proposed to track semantic change through time. Here we show
that a model with a central word representation and a slice-dependent
contribution can learn word embeddings from different corpora simultaneously.
This model is based on a star-like representation of the slices. We apply it to
The New York Times and The Guardian newspapers, and we show that it can capture
both temporal dynamics in the yearly slices of each corpus and language
variations between US and UK English in a curated multi-source corpus. We
provide an extensive evaluation of this methodology.
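As a minimal sketch of the core idea (all names and sizes here are hypothetical, and this is not the authors' implementation), each word's embedding in a given slice decomposes into a shared central vector plus a small slice-dependent offset:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, n_slices = 10_000, 100, 5  # hypothetical sizes

# Shared "central" representation of each word: the hub of the star.
central = rng.normal(scale=0.1, size=(vocab_size, dim))
# One slice-dependent contribution per corpus/year slice: the spokes.
delta = rng.normal(scale=0.01, size=(n_slices, vocab_size, dim))

def embed(word_id: int, slice_id: int) -> np.ndarray:
    """Embedding of a word in a slice: central vector plus slice offset."""
    return central[word_id] + delta[slice_id, word_id]

# Placeholders only: in training, both terms would be fitted jointly
# (e.g., with a skip-gram-style loss) over all slices at once.
```

Training both terms jointly over all slices lets the central vector capture shared semantics while the per-slice offsets absorb slice-specific usage.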
Related papers
- Exploring Anisotropy and Outliers in Multilingual Language Models for
Cross-Lingual Semantic Sentence Similarity [64.18762301574954]
Previous work has shown that the representations output by contextual language models are more anisotropic than static type embeddings.
This seems to hold for both monolingual and multilingual models, although much less work has been done in the multilingual setting.
We investigate outlier dimensions and their relationship to anisotropy in multiple pre-trained multilingual language models.
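A common way to quantify anisotropy (in the spirit of prior work on contextual embedding geometry, not necessarily this paper's exact protocol) is the expected cosine similarity between embeddings of randomly sampled tokens; a rough sketch:

```python
import numpy as np

def mean_cosine(embeddings: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Estimate anisotropy as the mean cosine similarity of random vector pairs.

    Values near 0 suggest an isotropic space; values near 1 mean the
    vectors crowd into a narrow cone.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(embeddings), n_pairs)
    j = rng.integers(0, len(embeddings), n_pairs)
    a, b = embeddings[i], embeddings[j]
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(cos.mean())
```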
arXiv Detail & Related papers (2023-06-01T09:01:48Z)
- Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
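Bitext mining with such embeddings typically reduces to nearest-neighbour search under cosine similarity; a hypothetical sketch, assuming sentence vectors have already been computed by some multilingual encoder:

```python
import numpy as np

def mine_bitext(src_vecs: np.ndarray, tgt_vecs: np.ndarray, threshold: float = 0.8):
    """Pair each source sentence with its most similar target sentence.

    src_vecs, tgt_vecs: L2-normalised sentence embeddings, one row per
    sentence, from any multilingual encoder (assumed precomputed).
    """
    sims = src_vecs @ tgt_vecs.T            # cosine similarity matrix
    best = sims.argmax(axis=1)              # best target per source sentence
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]
```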
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Temporal Analysis on Topics Using Word2Vec [0.0]
The present study proposes a novel method of trend detection and visualization; more specifically, it models the change in a topic over time.
The methodology was tested on a group of articles from various media houses present in the 20 Newsgroups dataset.
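One straightforward way to implement this kind of temporal analysis (a sketch under assumed inputs, not necessarily the study's exact pipeline) is to train a separate Word2Vec model per time slice and track how a topic word's neighbourhood shifts:

```python
from gensim.models import Word2Vec

def neighbours_over_time(slices: dict, word: str, topn: int = 10) -> dict:
    """Train one Word2Vec model per slice and list the word's neighbours.

    slices: {year: list of tokenised documents}, assumed preloaded.
    """
    out = {}
    for year, sentences in sorted(slices.items()):
        model = Word2Vec(sentences, vector_size=100, window=5,
                         min_count=5, workers=4, seed=0)
        if word in model.wv:
            out[year] = [w for w, _ in model.wv.most_similar(word, topn=topn)]
    return out
```

Note that independently trained slices live in unaligned spaces, so it is the neighbour lists, not the raw vectors, that are compared across years.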
arXiv Detail & Related papers (2022-09-23T16:51:29Z)
- Fake it Till You Make it: Self-Supervised Semantic Shifts for
Monolingual Word Embedding Tasks [58.87961226278285]
We propose a self-supervised approach to model lexical semantic change.
We show that our method can be used for the detection of semantic change with any alignment method.
We illustrate the utility of our techniques using experimental results on three different datasets.
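One standard alignment method compatible with this setting is orthogonal Procrustes: rotate one period's embedding matrix onto the other's and score each word by the cosine distance between its two vectors (a generic sketch, not the paper's self-supervised component):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def semantic_change(E1: np.ndarray, E2: np.ndarray) -> np.ndarray:
    """Per-word cosine distance after rotating E1 onto E2.

    E1, E2: (vocab, dim) matrices for the same vocabulary in two periods.
    """
    R, _ = orthogonal_procrustes(E1, E2)  # rotation minimising ||E1 @ R - E2||
    A = E1 @ R
    cos = np.sum(A * E2, axis=1) / (np.linalg.norm(A, axis=1) * np.linalg.norm(E2, axis=1))
    return 1.0 - cos                      # larger values = more semantic change
```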
arXiv Detail & Related papers (2021-01-30T18:59:43Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
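A minimal PyTorch sketch of such a dual-objective architecture (dimensions, names, and the exact decoder wiring are assumptions, not the paper's model): a single encoder initialises two decoders, one translating and one reconstructing, and the source embedding table is the by-product of interest:

```python
import torch.nn as nn

class TranslateReconstruct(nn.Module):
    """One LSTM encoder feeding two decoders: translation and reconstruction."""
    def __init__(self, src_vocab: int, tgt_vocab: int, dim: int = 256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)   # embeddings of interest
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.dec_translate = nn.LSTM(dim, dim, batch_first=True)
        self.dec_reconstruct = nn.LSTM(dim, dim, batch_first=True)
        self.out_translate = nn.Linear(dim, tgt_vocab)
        self.out_reconstruct = nn.Linear(dim, src_vocab)

    def forward(self, src, tgt_in, src_in):
        # Encode the source; its final state initialises both decoders.
        _, state = self.encoder(self.src_emb(src))
        trans, _ = self.dec_translate(self.tgt_emb(tgt_in), state)
        recon, _ = self.dec_reconstruct(self.src_emb(src_in), state)
        # Both heads are trained with cross-entropy against tgt/src tokens.
        return self.out_translate(trans), self.out_reconstruct(recon)
```

After training, the learned source-language word embeddings can be read off `model.src_emb.weight`.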
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for
Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
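The described architecture amounts to a token-classification head on top of BERT; a rough sketch with the Hugging Face transformers library (hidden size and tag set are assumptions):

```python
import torch.nn as nn
from transformers import AutoModel

class BertBiLstmSpanTagger(nn.Module):
    """BERT encoder -> BiLSTM -> per-token propaganda/non-propaganda logits."""
    def __init__(self, model_name: str = "bert-base-cased", hidden: int = 256):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, 2)  # 2 tags; BIO would need more

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(hidden)
        return self.classifier(lstm_out)  # shape: (batch, seq_len, 2)
```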
arXiv Detail & Related papers (2020-08-11T16:14:47Z)
- Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer [101.58431011820755]
We study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications.
We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations.
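One simple way to quantify such bias (a generic association score in the spirit of WEAT, not necessarily one of the paper's proposed measures) is to compare how strongly target words associate with two gendered attribute sets:

```python
import numpy as np

def association_bias(vecs: dict, targets, male_words, female_words) -> dict:
    """Mean cosine with male terms minus mean cosine with female terms.

    vecs: word -> unit-normalised embedding; attribute words are assumed
    to be in the vocabulary. Positive scores lean male, negative female.
    """
    def mean_cos(word, group):
        present = [g for g in group if g in vecs]
        return np.mean([vecs[word] @ vecs[g] for g in present])
    return {w: float(mean_cos(w, male_words) - mean_cos(w, female_words))
            for w in targets if w in vecs}
```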
arXiv Detail & Related papers (2020-05-02T04:34:37Z)
- Compass-aligned Distributional Embeddings for Studying Semantic
Differences across Corpora [14.993021283916008]
We present a framework to support cross-corpora language studies with word embeddings.
CADE is the core component of our framework and solves the key problem of aligning the embeddings generated from different corpora.
The results of our experiments suggest that CADE achieves state-of-the-art or superior performance on tasks where several competing approaches are available.
arXiv Detail & Related papers (2020-04-13T15:46:47Z)