Research on Multilingual News Clustering Based on Cross-Language Word
Embeddings
- URL: http://arxiv.org/abs/2305.18880v1
- Date: Tue, 30 May 2023 09:24:55 GMT
- Title: Research on Multilingual News Clustering Based on Cross-Language Word
Embeddings
- Authors: Lin Wu, Rui Li, Wong-Hing Lam
- Abstract summary: We train a cross-lingual model through knowledge distillation that can represent sentence-level bilingual texts in both Chinese and English.
We adapt the Single-Pass clustering algorithm for the news context to make it more applicable.
- Score: 7.401514098389491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Classifying the same event reported by different countries is of significant
importance for public opinion control and intelligence gathering. Due to the
diverse types of news, relying solely on translators would be costly and
inefficient, while depending solely on translation systems would incur
considerable overhead in invoking translation interfaces and
storing translated texts. To address this issue, we mainly focus on the
clustering problem of cross-lingual news. To be specific, we use a combination
of sentence vector representations of news headlines in a mixed semantic space
and the topic probability distributions of news content to represent a news
article. In the training of cross-lingual models, we employ knowledge
distillation techniques to fit two semantic spaces into a mixed semantic space.
We abandon traditional static clustering methods like K-Means and AGNES in
favor of the incremental clustering algorithm Single-Pass, which we further
modify to better suit cross-lingual news clustering scenarios. Our main
contributions are as follows: (1) We adopt a standard English BERT as the
teacher model and XLM-RoBERTa as the student model, training a cross-lingual
model through knowledge distillation that can represent sentence-level
bilingual texts in both Chinese and English. (2) We use the LDA topic model to
represent news as a combination of cross-lingual vectors for headlines and
topic probability distributions for content, introducing concepts such as
topic similarity to address the cross-lingual issue in news content
representation. (3) We adapt the Single-Pass clustering algorithm for the news
context to make it more applicable. Our optimizations of Single-Pass include
adjusting the distance measure between samples and clusters, adding cluster
merging operations, and incorporating a news time parameter.
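
Contribution (1) describes distilling an English teacher into a bilingual student so that Chinese and English sentences land in one mixed semantic space. The following is a minimal sketch of that idea, not the authors' implementation: the checkpoint names (bert-base-uncased, xlm-roberta-base), the mean-pooling function, the MSE objective, and the `parallel_pairs` iterable of (English, Chinese) sentence batches are all illustrative assumptions.

```python
# Hedged sketch of sentence-level knowledge distillation: a frozen English
# teacher supplies target embeddings, and an XLM-RoBERTa student is trained so
# that student(en) and student(zh) both match teacher(en) on parallel pairs.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

teacher_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed teacher checkpoint
teacher = AutoModel.from_pretrained("bert-base-uncased").eval()
student_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")    # assumed student checkpoint
student = AutoModel.from_pretrained("xlm-roberta-base")
optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)

def mean_pool(model, tok, sentences):
    """Masked mean pooling over the last hidden states -> one vector per sentence."""
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state                 # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()      # (B, T, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

for en_batch, zh_batch in parallel_pairs:   # assumed iterable of parallel (English, Chinese) batches
    with torch.no_grad():
        target = mean_pool(teacher, teacher_tok, en_batch)    # teacher stays frozen
    loss = (F.mse_loss(mean_pool(student, student_tok, en_batch), target) +
            F.mse_loss(mean_pool(student, student_tok, zh_batch), target))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Only the student is updated here, so its English and Chinese embeddings are both pulled toward the teacher's English sentence space, which is one common way to obtain the kind of mixed semantic space the abstract refers to.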
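
Contributions (2) and (3) describe representing an article by a headline vector plus an LDA topic distribution and clustering the stream with a modified Single-Pass algorithm (adjusted sample-cluster distance, cluster merging, a news-time parameter). The sketch below shows one way such a clusterer could be wired together; the linear weighting `alpha`/`beta`, the Bhattacharyya-style topic similarity, the exponential time decay `tau_days`, and all thresholds are assumptions for illustration, not the paper's reported design.

```python
# Illustrative Single-Pass-style incremental clusterer: similarity mixes
# headline-vector cosine similarity with topic-distribution similarity and is
# discounted by the time gap between an article and a cluster's latest news.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def topic_similarity(p, q):
    """Bhattacharyya coefficient between two LDA topic distributions (assumed choice)."""
    return float(np.sum(np.sqrt(p * q)))

class SinglePassClusterer:
    def __init__(self, threshold=0.6, alpha=0.7, beta=0.3, tau_days=3.0):
        self.threshold, self.alpha, self.beta, self.tau = threshold, alpha, beta, tau_days
        self.clusters = []   # each: {"vec", "topics", "time", "n"}

    def _sim(self, doc, c):
        s = (self.alpha * cosine(doc["vec"], c["vec"])
             + self.beta * topic_similarity(doc["topics"], c["topics"]))
        return s * np.exp(-abs(doc["time"] - c["time"]) / self.tau)   # older clusters attract less

    def add(self, doc):
        """doc = {"vec": headline embedding, "topics": LDA distribution, "time": days}."""
        if self.clusters:
            sims = [self._sim(doc, c) for c in self.clusters]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                c = self.clusters[best]
                c["vec"] = (c["vec"] * c["n"] + doc["vec"]) / (c["n"] + 1)
                c["topics"] = (c["topics"] * c["n"] + doc["topics"]) / (c["n"] + 1)
                c["time"], c["n"] = max(c["time"], doc["time"]), c["n"] + 1
                return best
        self.clusters.append({"vec": doc["vec"].copy(), "topics": doc["topics"].copy(),
                              "time": doc["time"], "n": 1})
        return len(self.clusters) - 1

    def merge_close(self, merge_threshold=0.8):
        """Periodic cluster-merging pass over highly similar centroids (illustrative)."""
        i = 0
        while i < len(self.clusters):
            j = i + 1
            while j < len(self.clusters):
                a, b = self.clusters[i], self.clusters[j]
                if cosine(a["vec"], b["vec"]) >= merge_threshold:
                    n = a["n"] + b["n"]
                    a["vec"] = (a["vec"] * a["n"] + b["vec"] * b["n"]) / n
                    a["topics"] = (a["topics"] * a["n"] + b["topics"] * b["n"]) / n
                    a["time"], a["n"] = max(a["time"], b["time"]), n
                    del self.clusters[j]
                else:
                    j += 1
            i += 1
```

In use, `add` would be called once per incoming article (headline embedding from the distilled model, LDA topic distribution for the body, publication time in days), with `merge_close` run periodically as the stream grows.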
Related papers
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- InfoCTM: A Mutual Information Maximization Perspective of Cross-Lingual Topic Modeling [40.54497836775837]
Cross-lingual topic models have been prevalent for cross-lingual text analysis by revealing aligned latent topics.
Most existing methods suffer from producing repetitive topics that hinder further analysis and performance decline caused by low-coverage dictionaries.
We propose the Cross-lingual Topic Modeling with Mutual Information (InfoCTM) to produce more coherent, diverse, and well-aligned topics.
arXiv Detail & Related papers (2023-04-07T08:49:43Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - Cross-Align: Modeling Deep Cross-lingual Interactions for Word Alignment [63.0407314271459]
Experiments show that the proposed Cross-Align achieves state-of-the-art (SOTA) performance on four out of five language pairs.
arXiv Detail & Related papers (2022-10-09T02:24:35Z)
- Simplifying Multilingual News Clustering Through Projection From a Shared Space [0.39560040546164016]
The task of organizing and clustering multilingual news articles for media monitoring is essential to follow news stories in real time.
Most approaches to this task focus on high-resource languages (mostly English), with low-resource languages being disregarded.
We present a much simpler online system that is able to cluster an incoming stream of documents without depending on language-specific features.
arXiv Detail & Related papers (2022-04-28T11:32:49Z)
- Cross-language Sentence Selection via Data Augmentation and Rationale Training [22.106577427237635]
The proposed approach uses data augmentation and negative sampling techniques on noisy parallel sentence data to learn a cross-lingual embedding-based query relevance model.
Results show that this approach performs as well as or better than multiple state-of-the-art machine translation + monolingual retrieval systems trained on the same parallel data.
arXiv Detail & Related papers (2021-06-04T07:08:47Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Cross-lingual Spoken Language Understanding with Regularized Representation Alignment [71.53159402053392]
We propose a regularization approach to align word-level and sentence-level representations across languages without any external resource.
Experiments on the cross-lingual spoken language understanding task show that our model outperforms current state-of-the-art methods in both few-shot and zero-shot scenarios.
arXiv Detail & Related papers (2020-09-30T08:56:53Z)
- InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training [135.12061144759517]
We present an information-theoretic framework that formulates cross-lingual language model pre-training.
We propose a new pre-training task based on contrastive learning.
By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models.
arXiv Detail & Related papers (2020-07-15T16:58:01Z)