A State-Vector Framework for Dataset Effects
- URL: http://arxiv.org/abs/2310.10955v1
- Date: Tue, 17 Oct 2023 03:05:06 GMT
- Title: A State-Vector Framework for Dataset Effects
- Authors: Esmat Sahak, Zining Zhu, Frank Rudzicz
- Abstract summary: We propose a state-vector framework to enable rigorous studies in this direction.
This framework uses idealized probing test results as the bases of a vector space.
We show that the significant effects of some commonly-used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions.
- Score: 20.255403795164856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The impressive success of recent deep neural network (DNN)-based systems is
significantly influenced by the high-quality datasets used in training.
However, the effects of the datasets, especially how they interact with each
other, remain underexplored. We propose a state-vector framework to enable
rigorous studies in this direction. This framework uses idealized probing test
results as the bases of a vector space and allows us to quantify the effects of
both standalone and interacting datasets. We show that the
significant effects of some commonly-used language understanding datasets are
characteristic and are concentrated on a few linguistic dimensions.
Additionally, we observe some "spill-over" effects: the datasets could impact
the models along dimensions that may seem unrelated to the intended tasks. Our
state-vector framework paves the way for a systematic understanding of the
dataset effects, a crucial component in responsible and robust model
development.
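To make the framework concrete, here is a minimal sketch of the core idea in Python, assuming a handful of hypothetical probing dimensions and made-up scores; the dimension names, the numbers, and the additive-baseline definition of interaction are illustrative, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical probing dimensions (illustrative, not the paper's exact suite).
DIMENSIONS = ["part_of_speech", "dependency_depth", "tense", "coreference"]

def state_vector(scores: dict) -> np.ndarray:
    """Represent a model state as the vector of its probing-test results."""
    return np.array([scores[d] for d in DIMENSIONS])

# Made-up probing scores before and after fine-tuning.
base     = state_vector({"part_of_speech": 0.81, "dependency_depth": 0.64,
                         "tense": 0.72, "coreference": 0.55})
after_a  = state_vector({"part_of_speech": 0.83, "dependency_depth": 0.66,
                         "tense": 0.88, "coreference": 0.56})   # tuned on dataset A
after_b  = state_vector({"part_of_speech": 0.82, "dependency_depth": 0.79,
                         "tense": 0.73, "coreference": 0.57})   # tuned on dataset B
after_ab = state_vector({"part_of_speech": 0.84, "dependency_depth": 0.80,
                         "tense": 0.90, "coreference": 0.70})   # tuned on A and B

# Standalone effect of a dataset: the displacement of the state vector.
effect_a = after_a - base
effect_b = after_b - base

# Interaction effect: deviation of the joint effect from additivity.
interaction = (after_ab - base) - (effect_a + effect_b)

for dim, value in zip(DIMENSIONS, interaction):
    print(f"{dim:>16}: interaction {value:+.2f}")
```

In this picture, a dataset's effect is the displacement it induces in the probing-result space, and a nonzero interaction term flags dataset pairs whose combined effect is not the sum of their standalone effects.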
Related papers
- Beyond Features: How Dataset Design Influences Multi-Agent Trajectory Prediction Performance [37.850085364753845]
This work examines how feature selection, cross-dataset transfer, and geographic diversity influence trajectory prediction accuracy in multi-agent settings. We evaluate a state-of-the-art model using our novel L4 Motion Forecasting dataset, based on our own data recordings in Germany and the US.
arXiv Detail & Related papers (2025-07-07T15:18:51Z)
- Detecting Instruction Fine-tuning Attack on Language Models with Influence Function [6.760293300577228]
Instruction fine-tuning attacks undermine model alignment and pose security risks in real-world deployment.
We present a simple and effective approach to detect and mitigate such attacks using influence functions.
We are the first to apply influence functions for detecting language model instruction fine-tuning attacks on large-scale datasets.
arXiv Detail & Related papers (2025-04-12T00:50:28Z)
- Evaluating Data Influence in Meta Learning [6.757424294625179]
We propose a general data attribution evaluation framework for meta-learning, formulated within bilevel optimization.
This framework comprehensively models data contributions across both the inner and outer training processes.
arXiv Detail & Related papers (2025-01-27T11:14:04Z)
- Impact of Data Breadth and Depth on Performance of Siamese Neural Network Model: Experiments with Three Keystroke Dynamic Datasets [0.9786690381850356]
We study the impact of dataset breadth and depth on deep learning models for behavioral biometrics.
We find that increasing dataset breadth enables training a model that effectively captures more inter-subject variability.
In contrast, the impact of dataset depth depends on the nature of the dataset.
arXiv Detail & Related papers (2025-01-10T17:06:46Z)
- UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z)
- Comparing Importance Sampling Based Methods for Mitigating the Effect of Class Imbalance [0.0]
We compare three techniques that derive from importance sampling: loss reweighting, undersampling, and oversampling.
We find that up-weighting the loss and undersampling have a negligible effect on the performance on underrepresented classes.
Our findings also indicate that there may exist some redundancy in data in the Planet dataset.
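For concreteness, here is a short PyTorch sketch of the three techniques compared above; the toy labels and the inverse-frequency weighting heuristic are assumptions, not necessarily the paper's exact setup.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])   # toy imbalanced binary labels
logits = torch.randn(len(labels), 2)               # stand-in model outputs

# 1) Loss reweighting: weight each class inversely to its frequency.
counts = torch.bincount(labels, minlength=2).float()
class_weights = counts.sum() / (2 * counts)
loss = F.cross_entropy(logits, labels, weight=class_weights)

# 2) Undersampling: keep only as many majority examples as minority ones.
minority = int(counts.min())
keep = torch.cat([torch.nonzero(labels == c).flatten()[:minority] for c in (0, 1)])

# 3) Oversampling: draw examples with probability inverse to class frequency.
sampler = WeightedRandomSampler(weights=1.0 / counts[labels], num_samples=len(labels))
```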
arXiv Detail & Related papers (2024-02-28T22:52:27Z)
- The Impact of Different Backbone Architecture on Autonomous Vehicle Dataset [120.08736654413637]
The quality of the features extracted by the backbone architecture can have a significant impact on the overall detection performance.
Our study evaluates three well-known autonomous vehicle datasets, namely KITTI, NuScenes, and BDD, to compare the performance of different backbone architectures on object detection tasks.
arXiv Detail & Related papers (2023-09-15T17:32:15Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on a source dataset but unavailable on a target dataset during the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
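A rough PyTorch sketch of the temporal-permutation pretext task described above; the clip shape, segment count, and labeling scheme are illustrative assumptions rather than the authors' implementation.

```python
import itertools
import random
import torch

PERMS = list(itertools.permutations(range(3)))      # 3! = 6 possible orderings

def temporal_permutation_task(clip: torch.Tensor):
    """Split a skeleton clip (frames, joints, channels) into 3 temporal
    segments, shuffle them, and return the permutation id as a pretext label."""
    segments = torch.chunk(clip, 3, dim=0)
    label = random.randrange(len(PERMS))             # which ordering was applied
    permuted = torch.cat([segments[i] for i in PERMS[label]], dim=0)
    return permuted, label                           # classifier predicts `label`

clip = torch.randn(90, 25, 3)                        # 90 frames, 25 joints, xyz
x, y = temporal_permutation_task(clip)
```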
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- CARLA-GeAR: a Dataset Generator for a Systematic Evaluation of Adversarial Robustness of Vision Models [61.68061613161187]
This paper presents CARLA-GeAR, a tool for the automatic generation of synthetic datasets for evaluating the robustness of neural models against physical adversarial patches.
The tool is built on the CARLA simulator, using its Python API, and allows the generation of datasets for several vision tasks in the context of autonomous driving.
The paper presents an experimental study to evaluate the performance of some defense methods against such attacks, showing how the datasets generated with CARLA-GeAR might be used in future work as a benchmark for adversarial defense in the real world.
arXiv Detail & Related papers (2022-06-09T09:17:38Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
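As a loose illustration of the idea, the sketch below samples triples from a toy knowledge graph and verbalizes them into synthetic training text; the graph, the templates, and the sampling strategies are all made up for illustration.

```python
import random

# Toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("Paris", "capital_of", "France"),
    ("Berlin", "capital_of", "Germany"),
    ("Seine", "flows_through", "Paris"),
]
TEMPLATES = {"capital_of": "{h} is the capital of {t}.",
             "flows_through": "The {h} flows through {t}."}

def sample_synthetic_data(n: int, strategy: str = "uniform") -> list[str]:
    """Generate n synthetic sentences by sampling triples from the graph."""
    if strategy == "uniform":
        picks = random.choices(TRIPLES, k=n)
    else:  # e.g., bias sampling toward a single relation type
        subset = [t for t in TRIPLES if t[1] == "capital_of"]
        picks = random.choices(subset, k=n)
    return [TEMPLATES[r].format(h=h, t=t) for h, r, t in picks]

print(sample_synthetic_data(2))
```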
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- The Stanford Drone Dataset is More Complex than We Think: An Analysis of Key Characteristics [2.064612766965483]
We discuss the characteristics of the Stanford Drone dataset (SDD).
We demonstrate how this insufficiency reduces the information available to users and can impact performance.
Our intention is to improve the performance of methods applied to this dataset going forward, while also clearly detailing less obvious features of the dataset for new users.
arXiv Detail & Related papers (2022-03-22T13:58:14Z)
- On The State of Data In Computer Vision: Human Annotations Remain Indispensable for Developing Deep Learning Models [0.0]
High-quality labeled datasets play a crucial role in fueling the development of machine learning (ML).
Since the emergence of the ImageNet dataset and the AlexNet model in 2012, the size of new open-source labeled vision datasets has remained roughly constant.
Only a minority of publications in the computer vision community tackle supervised learning on datasets that are orders of magnitude larger than ImageNet.
arXiv Detail & Related papers (2021-07-31T00:08:21Z)
- Deep Structure Learning using Feature Extraction in Trained Projection Space [0.0]
We introduce a network architecture that uses a self-adjusting, data-dependent version of the Radon transform (a linear data projection, also known as x-ray projection) to enable feature extraction via convolutions in lower-dimensional space.
The resulting framework, named PiNet, can be trained end-to-end and shows promising performance on volumetric segmentation tasks.
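As a loose illustration of the general pattern of projecting a volume into a lower-dimensional space with trained weights before convolving (not PiNet's actual architecture), consider this PyTorch sketch:

```python
import torch
import torch.nn as nn

class LearnedProjection(nn.Module):
    """Toy stand-in for a trained linear projection: collapse the depth axis
    of a volume with learnable weights, then extract features with 2D convs."""
    def __init__(self, depth: int, channels: int = 8):
        super().__init__()
        self.weights = nn.Parameter(torch.full((depth,), 1.0 / depth))
        self.conv = nn.Conv2d(1, channels, kernel_size=3, padding=1)

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        # volume: (batch, depth, height, width) -> projected: (batch, height, width)
        projected = torch.einsum("bdhw,d->bhw", volume, self.weights)
        return self.conv(projected.unsqueeze(1))     # 2D features of a 3D input

features = LearnedProjection(depth=16)(torch.randn(2, 16, 64, 64))  # (2, 8, 64, 64)
```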
arXiv Detail & Related papers (2020-09-01T12:16:55Z)
- Influence Functions in Deep Learning Are Fragile [52.31375893260445]
Influence functions approximate the effect of training samples on test-time predictions.
Influence estimates are fairly accurate for shallow networks.
Hessian regularization is important to get high-quality influence estimates.
arXiv Detail & Related papers (2020-06-25T18:25:59Z)
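For reference, the influence-function approximation and the Hessian damping stressed above can be written out on a toy model; the ridge-regression setup and the damping value lam are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ridge-regularized linear regression so the Hessian is explicit.
X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
theta = np.linalg.solve(X.T @ X + np.eye(5), X.T @ y)   # fitted parameters

def grad(x, t):
    """Gradient of the squared loss (x @ theta - t)**2 at one example."""
    return 2 * (x @ theta - t) * x

H = 2 * X.T @ X / len(X)              # empirical Hessian of the training loss
lam = 1e-2                            # damping: (H + lam*I) keeps the inverse stable

x_test, y_test = rng.normal(size=5), 0.5
# Influence of training point i on the test loss:
#   I(z_i, z_test) = -grad(z_test)^T (H + lam*I)^{-1} grad(z_i)
H_inv_g = np.linalg.solve(H + lam * np.eye(5), grad(X[0], y[0]))
influence = -grad(x_test, y_test) @ H_inv_g
print(f"influence of training point 0 on the test loss: {influence:.4f}")
```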