A Data-Driven Novelty Score for Diverse In-Vehicle Data Recording
- URL: http://arxiv.org/abs/2507.04529v1
- Date: Sun, 06 Jul 2025 20:46:19 GMT
- Title: A Data-Driven Novelty Score for Diverse In-Vehicle Data Recording
- Authors: Philipp Reis, Joshua Ransiek, David Petri, Jacob Langner, Eric Sax,
- Abstract summary: Real-world data collection is often biased toward common scenes and objects, leaving novel cases underrepresented.<n>This work presents a real-time data selection method focused on object-level novelty detection.<n>The proposed method supports real-time deployment with 32 frames per second and is constant over time.
- Score: 0.1398098625978622
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: High-quality datasets are essential for training robust perception systems in autonomous driving. However, real-world data collection is often biased toward common scenes and objects, leaving novel cases underrepresented. This imbalance hinders model generalization and compromises safety. The core issue is the curse of rarity. Over time, novel events occur infrequently, and standard logging methods fail to capture them effectively. As a result, large volumes of redundant data are stored, while critical novel cases are diluted, leading to biased datasets. This work presents a real-time data selection method focused on object-level novelty detection to build more balanced and diverse datasets. The method assigns a data-driven novelty score to image frames using a novel dynamic Mean Shift algorithm. It models normal content based on mean and covariance statistics to identify frames with novel objects, discarding those with redundant elements. The main findings show that reducing the training dataset size with this method can improve model performance, whereas higher redundancy tends to degrade it. Moreover, as data redundancy increases, more aggressive filtering becomes both possible and beneficial. While random sampling can offer some gains, it often leads to overfitting and unpredictability in outcomes. The proposed method supports real-time deployment with 32 frames per second and is constant over time. By continuously updating the definition of normal content, it enables efficient detection of novelties in a continuous data stream.
Related papers
- OmniTraj: Pre-Training on Heterogeneous Data for Adaptive and Zero-Shot Human Trajectory Prediction [62.385417528148224]
We present OmniTraj, a Transformer-based model pre-trained on a large-scale, heterogeneous dataset.<n>Experiments show that explicitly conditioning on the frame rate enables OmniTraj to achieve state-of-the-art zero-shot transfer performance.
arXiv Detail & Related papers (2025-07-31T15:37:09Z) - Self-attention-based Diffusion Model for Time-series Imputation in Partial Blackout Scenarios [23.160007389272575]
Missing values in time series data can harm machine learning performance and introduce bias.<n>Previous work has tackled the imputation of missing data in random, complete blackouts and forecasting scenarios.<n>We introduce a two-stage imputation process using self-attention and diffusion processes to model feature and temporal correlations.
arXiv Detail & Related papers (2025-03-03T16:58:15Z) - Enhancing Consistency and Mitigating Bias: A Data Replay Approach for Incremental Learning [93.90047628101155]
Deep learning systems are prone to catastrophic forgetting when learning from a sequence of tasks.<n>To address this, some methods propose replaying data from previous tasks during new task learning.<n>However, it is not expected in practice due to memory constraints and data privacy issues.
arXiv Detail & Related papers (2024-01-12T12:51:12Z) - LARA: A Light and Anti-overfitting Retraining Approach for Unsupervised
Time Series Anomaly Detection [49.52429991848581]
We propose a Light and Anti-overfitting Retraining Approach (LARA) for deep variational auto-encoder based time series anomaly detection methods (VAEs)
This work aims to make three novel contributions: 1) the retraining process is formulated as a convex problem and can converge at a fast rate as well as prevent overfitting; 2) designing a ruminate block, which leverages the historical data without the need to store them; and 3) mathematically proving that when fine-tuning the latent vector and reconstructed data, the linear formations can achieve the least adjusting errors between the ground truths and the fine-tuned ones.
arXiv Detail & Related papers (2023-10-09T12:36:16Z) - STING: Self-attention based Time-series Imputation Networks using GAN [4.052758394413726]
STING (Self-attention based Time-series Imputation Networks using GAN) is proposed.
We take advantage of generative adversarial networks and bidirectional recurrent neural networks to learn latent representations of the time series.
Experimental results on three real-world datasets demonstrate that STING outperforms the existing state-of-the-art methods in terms of imputation accuracy.
arXiv Detail & Related papers (2022-09-22T06:06:56Z) - Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
arXiv Detail & Related papers (2022-07-22T17:52:30Z) - Training Deep Normalizing Flow Models in Highly Incomplete Data
Scenarios with Prior Regularization [13.985534521589257]
We propose a novel framework to facilitate the learning of data distributions in high paucity scenarios.
The proposed framework naturally stems from posing the process of learning from incomplete data as a joint optimization task.
arXiv Detail & Related papers (2021-04-03T20:57:57Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - TadGAN: Time Series Anomaly Detection Using Generative Adversarial
Networks [73.01104041298031]
TadGAN is an unsupervised anomaly detection approach built on Generative Adversarial Networks (GANs)
To capture the temporal correlations of time series, we use LSTM Recurrent Neural Networks as base models for Generators and Critics.
To demonstrate the performance and generalizability of our approach, we test several anomaly scoring techniques and report the best-suited one.
arXiv Detail & Related papers (2020-09-16T15:52:04Z) - Anomaly Detection at Scale: The Case for Deep Distributional Time Series
Models [14.621700495712647]
Main novelty in our approach is that instead of modeling time series consisting of real values or vectors of real values, we model time series of probability distributions over real values (or vectors)
Our method is amenable to streaming anomaly detection and scales to monitoring for anomalies on millions of time series.
We show that we outperform popular open-source anomaly detection tools by up to 17% average improvement for a real-world data set.
arXiv Detail & Related papers (2020-07-30T15:48:55Z) - Evaluating Prediction-Time Batch Normalization for Robustness under
Covariate Shift [81.74795324629712]
We call prediction-time batch normalization, which significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z) - Deep Context-Aware Novelty Detection [6.599344783327053]
A common assumption of novelty detection is that the distribution of both "normal" and "novel" data are static.
This is often not the case - for example scenarios where data evolves over time or scenarios in which the definition of normal and novel depends on contextual information.
This can lead to significant difficulties when attempting to train a model on datasets where the distribution of normal data in one scenario is similar to that of novel data in another scenario.
arXiv Detail & Related papers (2020-06-01T18:02:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.