The Stanford Drone Dataset is More Complex than We Think: An Analysis of
Key Characteristics
- URL: http://arxiv.org/abs/2203.11743v1
- Date: Tue, 22 Mar 2022 13:58:14 GMT
- Title: The Stanford Drone Dataset is More Complex than We Think: An Analysis of
Key Characteristics
- Authors: Joshua Andle, Nicholas Soucy, Simon Socolow, Salimeh Yasaei Sekeh
- Abstract summary: We discuss the characteristics of the Stanford Drone Dataset (SDD).
We demonstrate how this insufficiency reduces the information available to users and can impact performance.
Our intention is to increase the performance and reproducibility of methods applied to this dataset going forward, while also clearly detailing less obvious features of the dataset for new users.
- Score: 2.064612766965483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several datasets exist which contain annotated information of individuals'
trajectories. Such datasets are vital for many real-world applications,
including trajectory prediction and autonomous navigation. One prominent
dataset currently in use is the Stanford Drone Dataset (SDD). Despite its
prominence, discussion surrounding the characteristics of this dataset is
insufficient. We demonstrate how this insufficiency reduces the information
available to users and can impact performance. Our contributions include the
outlining of key characteristics in the SDD, employment of an
information-theoretic measure and custom metric to clearly visualize those
characteristics, the implementation of the PECNet and Y-Net trajectory
prediction models to demonstrate the outlined characteristics' impact on
predictive performance, and lastly a comparison between the SDD and the
Intersection Drone (inD) Dataset. Our analysis of the SDD's key characteristics
is important because, without adequate information about available datasets, a
user's ability to select the most suitable dataset for their methods, to
reproduce one another's results, and to interpret their own results is
hindered. The observations we make through this analysis provide a readily
accessible and interpretable source of information for those planning to use
the SDD. Our intention is to increase the performance and reproducibility of
methods applied to this dataset going forward, while also clearly detailing
less obvious features of the dataset for new users.
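The abstract does not name the specific information-theoretic measure used. As a purely hypothetical illustration of how such a measure can surface trajectory-dataset characteristics, one could compute the Shannon entropy of each trajectory's discretized heading distribution (the function name, binning scheme, and parameters below are assumptions for illustration, not the authors' method):

```python
import numpy as np

def heading_entropy(trajectory, n_bins=16):
    """Shannon entropy (bits) of the discretized heading distribution
    of a single trajectory, given as an (N, 2) array of x/y positions.
    Low entropy: mostly straight motion; high entropy: varied headings."""
    steps = np.diff(trajectory, axis=0)
    headings = np.arctan2(steps[:, 1], steps[:, 0])   # range (-pi, pi]
    counts, _ = np.histogram(headings, bins=n_bins, range=(-np.pi, np.pi))
    p = counts / counts.sum()
    p = p[p > 0]                                      # 0 * log 0 := 0
    return float(-(p * np.log2(p)).sum())

# A straight-line path yields entropy 0; an erratic random walk
# approaches log2(n_bins) bits.
line = np.stack([np.arange(10.0), np.zeros(10)], axis=1)
walk = np.cumsum(np.random.default_rng(1).normal(size=(500, 2)), axis=0)
print(heading_entropy(line), heading_entropy(walk))
```

Aggregating such per-trajectory scores by scene or agent class is one way to visualize dataset complexity differences of the kind the paper discusses.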
Related papers
- DRUPI: Dataset Reduction Using Privileged Information [20.59889438709671]
Dataset reduction (DR) seeks to select or distill samples from large datasets into smaller subsets while preserving performance on target tasks.
We introduce Dataset Reduction Using Privileged Information (DRUPI), which enriches DR by synthesizing privileged information alongside the reduced dataset.
Our findings reveal that effective feature labels must avoid being either overly discriminative or excessively diverse; a moderate level proves optimal for improving the reduced dataset's efficacy.
arXiv Detail & Related papers (2024-10-02T14:49:05Z) - Data Proportion Detection for Optimized Data Management for Large Language Models [32.62631669919273]
We introduce a new topic, data proportion detection, which enables the automatic estimation of pre-training data proportions.
We provide rigorous theoretical proofs, practical algorithms, and preliminary experimental results for data proportion detection.
arXiv Detail & Related papers (2024-09-26T04:30:32Z) - Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z) - Computationally and Memory-Efficient Robust Predictive Analytics Using Big Data [0.0]
This study navigates through the challenges of data uncertainties, storage limitations, and predictive data-driven modeling using big data.
We utilize Robust Principal Component Analysis (RPCA) for effective noise reduction and outlier elimination, and Optimal Sensor Placement (OSP) for efficient data compression and storage.
arXiv Detail & Related papers (2024-03-27T22:39:08Z) - UniTraj: A Unified Framework for Scalable Vehicle Trajectory Prediction [93.77809355002591]
We introduce UniTraj, a comprehensive framework that unifies various datasets, models, and evaluation criteria.
We conduct extensive experiments and find that model performance significantly drops when transferred to other datasets.
We provide insights into dataset characteristics to explain these findings.
arXiv Detail & Related papers (2024-03-22T10:36:50Z) - LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z) - Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z) - LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which includes a total of 8,600 sensors in California with a 5-year time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z) - infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.