Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
- URL: http://arxiv.org/abs/2504.21614v1
- Date: Wed, 30 Apr 2025 13:10:59 GMT
- Title: Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection
- Authors: Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu
- Abstract summary: We present the Mcity Data Engine, which provides modules for the complete data-based development cycle. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process.
- Score: 9.883149193286304
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine
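As an illustration of how such an open-vocabulary selection step can work, the minimal Python sketch below scores unlabeled frames against free-text prompts for rare classes and keeps only high-confidence hits. It is not the Mcity Data Engine's actual implementation; the `detect` callable, the prompt list, and the confidence threshold are assumptions standing in for whichever open-vocabulary detector and rare-class queries a deployment would plug in.

```python
"""Hypothetical sketch of open-vocabulary data selection for rare classes."""
from pathlib import Path
from typing import Callable, Iterable

# Assumed free-text prompts describing rare or novel classes of interest.
RARE_CLASS_PROMPTS = ["construction worker", "overturned vehicle", "animal on the roadway"]

# `detect` stands in for any open-vocabulary detector: given an image and a list of
# text prompts, it returns (prompt, confidence) pairs for the detections it finds.
Detector = Callable[[Path, list[str]], list[tuple[str, float]]]


def select_rare_samples(
    image_paths: Iterable[Path],
    detect: Detector,
    prompts: list[str] = RARE_CLASS_PROMPTS,
    threshold: float = 0.3,
) -> list[tuple[Path, str, float]]:
    """Return (path, prompt, score) for frames whose best rare-class hit exceeds `threshold`."""
    selected = []
    for path in image_paths:
        detections = detect(path, prompts)
        if not detections:
            continue
        prompt, score = max(detections, key=lambda d: d[1])
        if score >= threshold:
            selected.append((path, prompt, score))
    # Highest-confidence rare-class hits go to labeling first.
    return sorted(selected, key=lambda s: s[2], reverse=True)
```

In practice, `detect` could be backed by any zero-shot detector that accepts text queries; the selected frames would then flow into the labeling and retraining stages of the development cycle.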
Related papers
- PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding [126.15907330726067]
We build a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps.
arXiv Detail & Related papers (2025-04-17T17:59:56Z)
- Active Learning from Scene Embeddings for End-to-End Autonomous Driving [30.667451458189902]
Training end-to-end deep learning models requires large amounts of labeled data. We propose an active learning framework that relies on vectorized scene-level features, called SEAD. Experiments show that only 30% of the nuScenes training data is needed to achieve performance close to that of the full dataset. A generic sketch of diversity-based selection over scene embeddings is given after this list.
arXiv Detail & Related papers (2025-03-14T03:56:22Z)
- Cuvis.Ai: An Open-Source, Low-Code Software Ecosystem for Hyperspectral Processing and Classification [0.4038539043067986]
cuvis.ai is an open-source and low-code software ecosystem for data acquisition, preprocessing, and model training.
The package is written in Python and provides wrappers around common machine learning libraries.
arXiv Detail & Related papers (2024-11-18T06:33:40Z)
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models [146.85788712792177]
Molmo is a new family of vision-language models (VLMs) that are state-of-the-art in their class of openness. Our best-in-class 72B model outperforms others in the class of open weight and data models.
arXiv Detail & Related papers (2024-09-25T17:59:51Z)
- GenQA: Generating Millions of Instructions from a Handful of Prompts [67.54980063851605]
Most public instruction finetuning datasets are relatively small compared to the closed source datasets used to train industry models.
In this work, we study methods for generating large instruction datasets from a single prompt.
Our dataset meets or exceeds both WizardLM and Ultrachat on knowledge-intensive leaderboard tasks as well as conversational evaluations.
arXiv Detail & Related papers (2024-06-14T17:44:08Z)
- Modyn: Data-Centric Machine Learning Pipeline Orchestration [1.4448995242976572]
We present Modyn, a data-centric end-to-end machine learning platform.
arXiv Detail & Related papers (2023-12-11T09:50:52Z)
- TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations.
We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
- Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z)
- Diverse Complexity Measures for Dataset Curation in Self-driving [80.55417232642124]
We propose a new data selection method that exploits a diverse set of criteria to quantify the interestingness of traffic scenes.
Our experiments show that the proposed curation pipeline is able to select datasets that lead to better generalization and higher performance.
arXiv Detail & Related papers (2021-01-16T23:45:02Z)
- Have you forgotten? A method to assess if machine learning models have forgotten data [20.9131206112401]
In the era of deep learning, aggregation of data from several sources is a common approach to ensuring data diversity.
In this paper, we want to address the challenging question of whether data have been forgotten by a model.
We establish statistical methods that compare the target model's outputs with the outputs of models trained with different datasets.
arXiv Detail & Related papers (2020-04-21T16:13:45Z)
- From Data to Actions in Intelligent Transportation Systems: a Prescription of Functional Requirements for Model Actionability [10.27718355111707]
This work aims to describe how data, coming from diverse ITS sources, can be used to learn and adapt data-driven models for efficiently operating ITS assets, systems and processes.
Grounded in this described data modeling pipeline for ITS, we define the characteristics, engineering requisites and intrinsic challenges of its three compounding stages, namely data fusion, adaptive learning and model evaluation.
arXiv Detail & Related papers (2020-02-06T12:02:30Z)
- Neural Data Server: A Large-Scale Search Engine for Transfer Learning Data [78.74367441804183]
We introduce Neural Data Server (NDS), a large-scale search engine for finding the transfer learning data most useful for the target domain.
NDS consists of a dataserver which indexes several large popular image datasets, and aims to recommend data to a client.
We show the effectiveness of NDS in various transfer learning scenarios, demonstrating state-of-the-art performance on several target datasets.
arXiv Detail & Related papers (2020-01-09T01:21:30Z)
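Related to the SEAD entry above, the sketch below shows one generic way to pick a diverse subset of scenes from precomputed scene-level embeddings via greedy farthest-point selection. It is an illustration only, not SEAD's published algorithm; the embedding source, labeling budget, and distance metric are all assumptions.

```python
"""Minimal sketch of embedding-based data selection (not SEAD's actual algorithm)."""
import numpy as np


def farthest_point_selection(embeddings: np.ndarray, budget: int, seed: int = 0) -> list[int]:
    """Greedily pick `budget` scene indices that are mutually far apart in feature space."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]  # random seed scene
    # Distance of every scene to its nearest already-selected scene.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < min(budget, n):
        idx = int(np.argmax(dists))    # scene least covered by the current selection
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return selected


if __name__ == "__main__":
    scene_features = np.random.rand(1000, 256)                      # stand-in for scene-level embeddings
    picked = farthest_point_selection(scene_features, budget=300)   # e.g. a 30% labeling budget
    print(f"selected {len(picked)} of {scene_features.shape[0]} scenes")
```

The selected indices would then be sent for labeling, the model retrained, and the loop repeated with fresh embeddings.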
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.