Action100M: A Large-scale Video Action Dataset
- URL: http://arxiv.org/abs/2601.10592v1
- Date: Thu, 15 Jan 2026 17:02:27 GMT
- Title: Action100M: A Large-scale Video Action Dataset
- Authors: Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, Pascale Fung
- Abstract summary: Action100M is a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration). It yields O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions.
- Score: 33.33351591459689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inferring physical actions from visual observations is a fundamental capability for advancing machine intelligence in the physical world. Achieving this requires large-scale, open-vocabulary video action datasets that span broad domains. We introduce Action100M, a large-scale dataset constructed from 1.2M Internet instructional videos (14.6 years of duration), yielding O(100 million) temporally localized segments with open-vocabulary action supervision and rich captions. Action100M is generated by a fully automated pipeline that (i) performs hierarchical temporal segmentation using V-JEPA 2 embeddings, (ii) produces multi-level frame and segment captions organized as a Tree-of-Captions, and (iii) aggregates evidence with a reasoning model (GPT-OSS-120B) under a multi-round Self-Refine procedure to output structured annotations (brief/detailed action, actor, brief/detailed caption). Training VL-JEPA on Action100M demonstrates consistent data-scaling improvements and strong zero-shot performance across diverse action recognition benchmarks, establishing Action100M as a new foundation for scalable research in video understanding and world modeling.
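Step (i) of the pipeline, hierarchical temporal segmentation over frame embeddings, can be illustrated with a toy sketch. The abstract does not say how the hierarchy is built, so the recursive split at the least-similar adjacent frame pair below is an assumption chosen for illustration, not the authors' algorithm. The names `segment_tree`, `leaves`, and `min_len` are hypothetical, and the embeddings are plain Python lists standing in for V-JEPA 2 features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def segment_tree(embeddings, start=0, end=None, min_len=2):
    """Build a nested segment tree over frame indices [start, end).

    Assumption for illustration: each node is split at the boundary where
    two adjacent frame embeddings are least similar, and splitting stops
    once a segment cannot be cut into two parts of at least min_len frames.
    """
    if end is None:
        end = len(embeddings)
    node = {"start": start, "end": end, "children": []}
    if end - start < 2 * min_len:
        return node
    # Candidate cuts that leave at least min_len frames on each side.
    cuts = range(start + min_len, end - min_len + 1)
    cut = min(cuts, key=lambda i: cosine(embeddings[i - 1], embeddings[i]))
    node["children"] = [
        segment_tree(embeddings, start, cut, min_len),
        segment_tree(embeddings, cut, end, min_len),
    ]
    return node


def leaves(node):
    """Collect the finest-level (start, end) segments of the tree."""
    if not node["children"]:
        return [(node["start"], node["end"])]
    return [seg for child in node["children"] for seg in leaves(child)]


# Toy input: four frames of one "action" followed by four of another.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
tree = segment_tree(frames)
top_level = [(c["start"], c["end"]) for c in tree["children"]]
finest = leaves(tree)
```

On this toy input the first split falls at the change between the two groups of frames, and each half is then divided once more. In a pipeline of the kind the abstract describes, every node of such a tree would be a candidate segment to caption, which is what makes a multi-level Tree-of-Captions possible.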
Related papers
- ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries [77.41072125938636]
ARC-Chapter is the first large-scale video chaptering model, trained on million-scale long video chapters. It unifies ASR transcripts, scene text, and visual captions into multi-level annotations, from short titles to long summaries. It establishes a new state of the art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score.
arXiv Detail & Related papers (2025-11-18T10:53:14Z)
- DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding [13.256830504062332]
We introduce DEL, a framework for dense semantic action localization. DEL aims to accurately detect and classify multiple actions at fine-grained temporal resolutions in long untrimmed videos.
arXiv Detail & Related papers (2025-06-29T11:50:19Z)
- Perceive Anything: Recognize, Explain, Caption, and Segment Anything in Images and Videos [53.723410664944566]
We present Perceive Anything Model (PAM), a framework for comprehensive region-level visual understanding in images and videos. Our approach extends the powerful segmentation model SAM 2 by integrating Large Language Models (LLMs), enabling simultaneous object segmentation. A key component, Semantic Perceiver, is introduced to efficiently transform SAM 2's rich visual features into multi-modal tokens.
arXiv Detail & Related papers (2025-06-05T17:51:39Z)
- Open-World Human-Object Interaction Detection via Multi-modal Prompts [26.355054079885463]
MP-HOI is a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions.
MP-HOI could serve as a generalist HOI detector, surpassing the HOI vocabulary of existing expert models by more than 30 times.
arXiv Detail & Related papers (2024-06-11T13:01:45Z)
- Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z)
- Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning [102.54669633984278]
We propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks.
We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization.
arXiv Detail & Related papers (2024-02-18T03:04:38Z)
- Temporal Alignment Networks for Long-term Video [103.69904379356413]
We propose a temporal alignment network that ingests long-term video sequences and associated text sentences.
We train such networks from large-scale datasets, such as HowTo100M, where the associated text sentences have significant noise.
Our proposed model, trained on HowTo100M, outperforms strong baselines (CLIP, MIL-NCE) on this alignment dataset.
arXiv Detail & Related papers (2022-04-06T17:59:46Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)