Personalized Dictionary Learning for Heterogeneous Datasets
- URL: http://arxiv.org/abs/2305.15311v1
- Date: Wed, 24 May 2023 16:31:30 GMT
- Title: Personalized Dictionary Learning for Heterogeneous Datasets
- Authors: Geyu Liang and Naichen Shi and Raed Al Kontar and Salar Fattahi
- Abstract summary: We introduce a relevant yet challenging problem named Personalized Dictionary Learning (PerDL).
The goal is to learn sparse linear representations from heterogeneous datasets that share some commonality.
In PerDL, we model each dataset's shared and unique features as global and local dictionaries.
- Score: 6.8438089867929905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a relevant yet challenging problem named Personalized Dictionary
Learning (PerDL), where the goal is to learn sparse linear representations from
heterogeneous datasets that share some commonality. In PerDL, we model each
dataset's shared and unique features as global and local dictionaries.
Challenges for PerDL are not only inherited from classical dictionary learning
(DL), but also arise from the unknown nature of the shared and unique
features. In this paper, we rigorously formulate this problem and provide
conditions under which the global and local dictionaries can be provably
disentangled. Under these conditions, we provide a meta-algorithm called
Personalized Matching and Averaging (PerMA) that can recover both global and
local dictionaries from heterogeneous datasets. PerMA is highly efficient; it
converges to the ground truth at a linear rate under suitable conditions.
Moreover, it automatically borrows strength from strong learners to improve the
prediction of weak learners. As a general framework for extracting global and
local dictionaries, we show the application of PerDL in different learning
tasks, such as training with imbalanced datasets and video surveillance.
Related papers
- Designing NLP Systems That Adapt to Diverse Worldviews [4.915541242112533]
We argue that existing NLP datasets often obscure this diversity of worldviews by aggregating labels or filtering out disagreement.
We propose a perspectivist approach: building datasets that capture annotator demographics, values, and justifications for their labels.
arXiv Detail & Related papers (2024-05-18T06:48:09Z)
- Forget NLI, Use a Dictionary: Zero-Shot Topic Classification for Low-Resource Languages with Application to Luxembourgish [6.6635650150737815]
In NLP, zero-shot classification (ZSC) is the task of assigning labels to textual data without any labeled examples for the target classes.
We propose an alternative solution that leverages dictionaries as a source of data for ZSC.
We focus on Luxembourgish, a low-resource language spoken in Luxembourg, and construct two new topic relevance classification datasets.
arXiv Detail & Related papers (2024-04-05T06:35:31Z)
- Scaling Expert Language Models with Unsupervised Domain Discovery [107.08940500543447]
We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora.
Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference.
arXiv Detail & Related papers (2023-03-24T17:38:58Z)
- Towards Understanding and Mitigating Dimensional Collapse in Heterogeneous Federated Learning [112.69497636932955]
Federated learning aims to train models across different clients without the sharing of data for privacy considerations.
We study how data heterogeneity affects the representations of the globally aggregated models.
We propose FedDecorr, a novel method that can effectively mitigate dimensional collapse in federated learning.
arXiv Detail & Related papers (2022-10-01T09:04:17Z)
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build relationships among various concepts to facilitate open-domain learning.
The proposed framework demonstrates strong zero-shot detection performance; for example, on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
- Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have been long devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
- Cross-lingual Transfer for Text Classification with Dictionary-based Heterogeneous Graph [10.64488240379972]
Cross-lingual text classification typically requires task-specific training data in high-resource source languages.
Collecting such training data can be infeasible because of the labeling cost, task characteristics, and privacy concerns.
This paper proposes an alternative solution that uses only task-independent word embeddings of high-resource languages and bilingual dictionaries.
arXiv Detail & Related papers (2021-09-09T16:40:40Z)
- Exploiting Image Translations via Ensemble Self-Supervised Learning for Unsupervised Domain Adaptation [0.0]
We introduce an unsupervised domain adaptation (UDA) strategy that combines multiple image translations, ensemble learning and self-supervised learning in one coherent approach.
We focus on one of the standard tasks of UDA in which a semantic segmentation model is trained on labeled synthetic data together with unlabeled real-world data.
arXiv Detail & Related papers (2021-07-13T16:43:02Z)
- Sparsely Factored Neural Machine Translation [3.4376560669160394]
A standard approach to incorporating linguistic information into neural machine translation systems is to maintain separate vocabularies for each of the annotated features.
We propose a method suited to this setting, showing large improvements on out-of-domain data and comparable quality on in-domain data.
Experiments are performed on morphologically rich languages such as Basque and German in low-resource scenarios.
arXiv Detail & Related papers (2021-02-17T18:42:00Z)
- Structured Prediction as Translation between Augmented Natural Languages [109.50236248762877]
We propose a new framework, Translation between Augmented Natural Languages (TANL), to solve many structured prediction language tasks.
Instead of tackling the problem by training task-specific discriminative models, we frame it as a translation task between augmented natural languages.
Our approach can match or outperform task-specific models on all tasks, and in particular, achieves new state-of-the-art results on joint entity and relation extraction.
arXiv Detail & Related papers (2021-01-14T18:32:21Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of linguistic units at different levels in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.