infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information
- URL: http://arxiv.org/abs/2305.19344v2
- Date: Mon, 12 Jun 2023 10:46:10 GMT
- Title: infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information
- Authors: Jaehyung Kim, Yekyung Kim, Karin de Langis, Jinwoo Shin, Dongyeop Kang
- Abstract summary: infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
- Score: 68.76707843019886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of NLP systems often relies on the availability of large,
high-quality datasets. However, not all samples in these datasets are equally
valuable for learning, as some may be redundant or noisy. Several methods for
characterizing datasets based on model-driven meta-information (e.g., model's
confidence) have been developed, but the relationship and complementary effects
of these methods have received less attention. In this paper, we introduce
infoVerse, a universal framework for dataset characterization, which provides a
new feature space that effectively captures multidimensional characteristics of
datasets by incorporating various model-driven meta-information. infoVerse
reveals distinctive regions of the dataset that are not apparent in the
original semantic space, hence guiding users (or models) in identifying which
samples to focus on for exploration, assessment, or annotation. Additionally,
we propose a novel sampling method on infoVerse to select a set of data points
that maximizes informativeness. In three real-world applications (data pruning,
active learning, and data annotation), the samples chosen on infoVerse space
consistently outperform strong baselines. Our code and demo are publicly
available.
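The idea in the abstract, building a feature space from model-driven meta-information and then selecting samples in it, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract names only the model's confidence as an example, so the other two features (variability and predictive entropy) and the function names `meta_features` and `farthest_point_selection` are assumptions made here. The abstract also does not specify the sampling method, so greedy farthest-point selection is used purely as a stand-in for "pick a spread-out set of points".

```python
import numpy as np


def meta_features(probs, labels):
    """Build a small meta-information feature space (illustrative only).

    probs:  array of shape (epochs, n_samples, n_classes) holding the
            predicted class probabilities saved after each training epoch.
    labels: integer array of shape (n_samples,) with the gold labels.
    """
    idx = np.arange(labels.shape[0])
    gold = probs[:, idx, labels]              # (epochs, n_samples)
    confidence = gold.mean(axis=0)            # mean gold-label probability
    variability = gold.std(axis=0)            # spread across epochs
    mean_probs = probs.mean(axis=0)           # (n_samples, n_classes)
    entropy = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)
    return np.stack([confidence, variability, entropy], axis=1)


def farthest_point_selection(features, k, start=0):
    """Greedily pick k mutually distant points (a stand-in sampler)."""
    if k > features.shape[0]:
        raise ValueError("k cannot exceed the number of samples")
    # Standardize so that no single feature dominates the distance.
    z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-12)
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    dist = np.linalg.norm(z - z[start], axis=1)
    while len(selected) < k:
        nxt = int(dist.argmax())
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(z - z[nxt], axis=1))
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    probs = rng.dirichlet(np.ones(3), size=(4, 10))   # 4 epochs, 10 samples
    labels = rng.integers(0, 3, size=10)
    feats = meta_features(probs, labels)
    print(farthest_point_selection(feats, k=5))
```

Standardizing the features before measuring distances is a design choice of this sketch; without it, whichever feature has the largest numeric range would dominate the selection.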
Related papers
- Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning [3.623224034411137]
Offline multi-agent reinforcement learning (MARL) is an exciting direction of research that uses static datasets to find optimal control policies for multi-agent systems.
Though the field is by definition data-driven, efforts have thus far neglected data in their drive to achieve state-of-the-art results.
We show how the majority of works generate their own datasets without consistent methodology and provide sparse information about the characteristics of these datasets.
arXiv Detail & Related papers (2024-09-18T14:13:24Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Rethinking of Encoder-based Warm-start Methods in Hyperparameter Optimization [0.0]
We introduce a new approach for representation learning on tabular data based on the work of Tomoharu Iwata and Atsutoshi Kumagai.
We show that general representations may not suffice for some meta-tasks where requirements are not explicitly considered during extraction.
arXiv Detail & Related papers (2024-03-07T18:16:29Z)
- Revisiting Table Detection Datasets for Visually Rich Documents [17.846536373106268]
This study revisits some open datasets with high-quality annotations, identifies and cleans the noise, and aligns the annotation definitions of these datasets to merge them into a larger dataset, termed Open-Tables.
To enrich the data sources, we propose a new ICT-TD dataset using the PDF files of Information and Communication Technologies (ICT) commodities, a different domain containing unique samples that hardly appear in open datasets.
Our experimental results show that the domain differences among existing open datasets are minor despite having different data sources.
arXiv Detail & Related papers (2023-05-04T01:08:15Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Learning a Self-Expressive Network for Subspace Clustering [15.096251922264281]
We propose a novel framework for subspace clustering, termed Self-Expressive Network (SENet), which employs a properly designed neural network to learn a self-expressive representation of the data.
Our SENet can not only learn the self-expressive coefficients with desired properties on the training data, but also handle out-of-sample data.
In particular, SENet yields highly competitive performance on MNIST, Fashion MNIST and Extended MNIST and state-of-the-art performance on CIFAR-10.
arXiv Detail & Related papers (2021-10-08T18:06:06Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)