Generalized Funnelling: Ensemble Learning and Heterogeneous Document
Embeddings for Cross-Lingual Text Classification
- URL: http://arxiv.org/abs/2110.14764v2
- Date: Mon, 7 Feb 2022 21:32:55 GMT
- Title: Generalized Funnelling: Ensemble Learning and Heterogeneous Document
Embeddings for Cross-Lingual Text Classification
- Authors: Alejandro Moreo, Andrea Pedrotti, Fabrizio Sebastiani
- Abstract summary: \emph{Funnelling} (Fun) is a recently proposed method for cross-lingual text classification.
We describe \emph{Generalized Funnelling} (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
- Score: 78.83284164605473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: \emph{Funnelling} (Fun) is a recently proposed method for cross-lingual text
classification (CLTC) based on a two-tier learning ensemble for heterogeneous
transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each
working on a different and language-dependent feature space, return a vector of
calibrated posterior probabilities (with one dimension for each class) for each
document, and the final classification decision is taken by a metaclassifier
that uses this vector as its input. The metaclassifier can thus exploit
class-class correlations, and this (among other things) gives Fun an edge over
CLTC systems in which these correlations cannot be brought to bear. In this
paper we describe \emph{Generalized Funnelling} (gFun), a generalization of Fun
consisting of an HTL architecture in which 1st-tier components can be arbitrary
\emph{view-generating functions}, i.e., language-dependent functions that each
produce a language-independent representation ("view") of the (monolingual)
document. We describe an instance of gFun in which the metaclassifier receives
as input a vector of calibrated posterior probabilities (as in Fun) aggregated
to other embedded representations that embody other types of correlations, such
as word-class correlations (as encoded by \emph{Word-Class Embeddings}),
word-word correlations (as encoded by \emph{Multilingual Unsupervised or
Supervised Embeddings}), and word-context correlations (as encoded by
\emph{multilingual BERT}). We show that this instance of \textsc{gFun}
substantially improves over Fun and over state-of-the-art baselines, by
reporting experimental results obtained on two large, standard datasets for
multilingual multilabel text classification. Our code that implements gFun is
publicly available.
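To make the two-tier architecture concrete, the following is a minimal sketch of the Fun-style flow using scikit-learn; the toy corpora, component choices, and variable names are illustrative assumptions, not the authors' implementation (which is the one linked from the paper). In gFun, further views (Word-Class Embeddings, MUSE, mBERT representations) would be concatenated to the posterior-probability view before reaching the metaclassifier.

```python
# Minimal, illustrative sketch of the Fun/gFun two-tier flow (hypothetical toy data and components).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy monolingual corpora: {language: (documents, class labels)}.
corpora = {
    "en": (["cheap flights to rome", "hotel deals for the weekend",
            "parliament passes the new budget", "the government approves the law"],
           [0, 0, 1, 1]),
    "it": (["voli economici per roma", "offerte di hotel per il weekend",
            "il parlamento approva il bilancio", "il governo approva la nuova legge"],
           [0, 0, 1, 1]),
}

first_tier = {}          # language -> (language-dependent vectorizer, calibrated classifier)
views, labels = [], []   # language-independent views (posterior probabilities) for the 2nd tier

for lang, (docs, y) in corpora.items():
    vec = TfidfVectorizer()                           # language-dependent feature space
    X = vec.fit_transform(docs)
    clf = CalibratedClassifierCV(LinearSVC(), cv=2)   # calibrated posteriors, as in Fun
    clf.fit(X, y)
    first_tier[lang] = (vec, clf)
    views.append(clf.predict_proba(X))                # one dimension per class
    labels.extend(y)

# The metaclassifier is trained on the stacked views of all languages, so it can exploit
# class-class correlations across the whole multilingual training set.
meta = LogisticRegression().fit(np.vstack(views), labels)

# Classifying a new Italian document: funnel its posterior view into the metaclassifier.
vec_it, clf_it = first_tier["it"]
z_new = clf_it.predict_proba(vec_it.transform(["nuova legge sul bilancio"]))
print(meta.predict(z_new))
```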
Related papers
- Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery [50.564146730579424]
We propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples.
Our method unlocks the multi-modal potentials of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks.
arXiv Detail & Related papers (2024-03-15T02:40:13Z)
- T3L: Translate-and-Test Transfer Learning for Cross-Lingual Text Classification [50.675552118811]
Cross-lingual text classification is typically built on large-scale, multilingual language models (LMs) pretrained on a variety of languages of interest.
We propose revisiting the classic "translate-and-test" pipeline to neatly separate the translation and classification stages.
arXiv Detail & Related papers (2023-06-08T07:33:22Z)
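As a rough illustration of the "translate-and-test" idea summarized above (a sketch under stated assumptions, not T3L itself), the snippet below translates a target-language document into the source language and then applies a classifier trained only on source-language data; the MT checkpoint name and the toy training set are assumptions.

```python
# Hedged sketch of a translate-and-test pipeline: stage 1 translates, stage 2 classifies.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from transformers import pipeline  # any MT model can back the translation stage

# Source-language (English) training data for the classification stage.
en_docs = ["cheap flights and hotel deals", "parliament passes the new budget"]
en_labels = ["travel", "politics"]
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(en_docs, en_labels)

# Translation stage; the checkpoint is only an example of an Italian-to-English MT model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-it-en")

def classify_cross_lingual(doc_it: str) -> str:
    """Translate an Italian document into English, then test with the English classifier."""
    doc_en = translator(doc_it)[0]["translation_text"]
    return classifier.predict([doc_en])[0]

print(classify_cross_lingual("il parlamento approva il nuovo bilancio"))
```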
- CLIP-GCD: Simple Language Guided Generalized Category Discovery [21.778676607030253]
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
arXiv Detail & Related papers (2023-05-17T17:55:33Z)
- TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation [53.974228542090046]
Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks.
Existing approaches utilizing CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes.
We propose TagCLIP (Trusty-aware guided CLIP) to address this issue.
arXiv Detail & Related papers (2023-04-15T12:52:23Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Weakly Supervised Text Classification using Supervision Signals from a Language Model [33.5830441120473]
We design a prompt which combines the document itself with "this article is talking about [MASK]".
A masked language model can generate words for the [MASK] token.
The generated words which summarize the content of a document can be utilized as supervision signals.
arXiv Detail & Related papers (2022-05-13T12:57:15Z)
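A minimal sketch of the prompting idea described above, assuming a standard masked language model checkpoint; the seed-word mapping is a purely hypothetical illustration of how the generated words could act as supervision signals.

```python
# Hedged sketch: append a cloze prompt to the document and let a masked LM propose words
# for [MASK]; the top words act as weak supervision signals (e.g., matched to class names).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # example checkpoint

document = "The central bank raised interest rates to curb inflation."
prompt = document + " this article is talking about [MASK]."

# Top predictions for the [MASK] token summarize what the document is about.
candidates = [pred["token_str"].strip() for pred in fill_mask(prompt, top_k=5)]
print(candidates)

# Purely illustrative mapping from generated words to a label; in practice one would match
# against class seed words or label names.
seed_words = {"economy": ["inflation", "money", "economics", "banking", "finance"]}
label = next((c for c, seeds in seed_words.items() if any(w in seeds for w in candidates)), None)
print(label)
```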
- Exploiting Local and Global Features in Transformer-based Extreme Multi-label Text Classification [28.28186933768281]
We propose an approach that combines both the local and global features produced by Transformer models to improve the prediction power of the classifier.
Our experiments show that the proposed model either outperforms or is comparable to the state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2022-04-02T19:55:23Z)
- Cross-lingual Text Classification with Heterogeneous Graph Neural Network [2.6936806968297913]
Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages.
Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks.
We propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification.
arXiv Detail & Related papers (2021-05-24T12:45:42Z)
- DocSCAN: Unsupervised Text Classification via Learning from Neighbors [2.2082422928825145]
We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN).
For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels.
Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels.
arXiv Detail & Related papers (2021-05-09T21:20:31Z)
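A small sketch of the neighbor-mining step that provides DocSCAN's weak learning signal, assuming a generic sentence encoder; the SCAN-style clustering head and its training objective are omitted, and the encoder checkpoint and toy documents are illustrative.

```python
# Hedged sketch: embed documents with a pre-trained encoder, then treat nearest neighbors
# in embedding space as weakly-linked pairs that tend to share a topic label.
from sentence_transformers import SentenceTransformer  # example encoder choice
from sklearn.neighbors import NearestNeighbors

docs = [
    "The stock market rallied after the earnings report.",
    "Shares climbed as quarterly profits beat expectations.",
    "The team won the championship in overtime.",
    "A late goal secured the title for the home side.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # any sentence encoder works here
embeddings = encoder.encode(docs, normalize_embeddings=True)

# For each document, find its nearest neighbor (excluding itself); such pairs serve as the
# weak learning signal for the clustering objective.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings)
neighbor_pairs = [(i, int(row[1])) for i, row in enumerate(indices)]  # row[0] is the doc itself
print(neighbor_pairs)  # the two finance documents pair up, as do the two sports ones
```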
- Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
arXiv Detail & Related papers (2020-06-28T13:11:02Z)
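A simplified, single-head NumPy illustration of the contrast between tied (standard absolute) and untied positional encoding described above; the dimensions, scaling, and random toy inputs are illustrative assumptions and only approximate the paper's formulation.

```python
# Hedged sketch: tied vs. untied positional encoding in a single self-attention head.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8
x = rng.normal(size=(seq_len, d))   # word (content) embeddings
p = rng.normal(size=(seq_len, d))   # absolute positional embeddings

W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # projections for content
U_q, U_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # separate projections for positions

# Tied (standard) encoding: positions are added to words before projection, so the expanded
# score mixes word-word, word-position, and position-position correlations.
tied_scores = ((x + p) @ W_q) @ ((x + p) @ W_k).T / np.sqrt(d)

# Untied encoding: word-word and position-position correlations are computed with their own
# projections and summed, avoiding the mixed cross terms.
untied_scores = ((x @ W_q) @ (x @ W_k).T + (p @ U_q) @ (p @ U_k).T) / np.sqrt(2 * d)

print(tied_scores.shape, untied_scores.shape)  # both (4, 4); a row-wise softmax would follow
```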