Video Understanding by Design: How Datasets Shape Architectures and Insights
- URL: http://arxiv.org/abs/2509.09151v1
- Date: Thu, 11 Sep 2025 05:06:30 GMT
- Title: Video Understanding by Design: How Datasets Shape Architectures and Insights
- Authors: Lei Wang, Piotr Koniusz, Yongsheng Gao
- Abstract summary: Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode.
- Score: 47.846604113207206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video understanding has advanced rapidly, fueled by increasingly complex datasets and powerful architectures. Yet existing surveys largely classify models by task or family, overlooking the structural pressures through which datasets guide architectural evolution. This survey is the first to adopt a dataset-driven perspective, showing how motion complexity, temporal span, hierarchical composition, and multimodal richness impose inductive biases that models should encode. We reinterpret milestones, from two-stream and 3D CNNs to sequential, transformer, and multimodal foundation models, as concrete responses to these dataset-driven pressures. Building on this synthesis, we offer practical guidance for aligning model design with dataset invariances while balancing scalability and task demands. By unifying datasets, inductive biases, and architectures into a coherent framework, this survey provides both a comprehensive retrospective and a prescriptive roadmap for advancing general-purpose video understanding.
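The abstract frames 3D CNNs as an architectural response to datasets whose labels depend on motion rather than static appearance. As a minimal illustration of that inductive bias (not code from the survey; the functions and the toy "moving dot" video below are hypothetical), the sketch contrasts per-frame 2D convolution, which cannot mix information across time, with spatiotemporal 3D convolution, where a temporal-difference kernel responds only when something moves:

```python
import numpy as np

def conv2d_per_frame(video, kernel):
    """Apply a 2D kernel independently to each frame: no temporal mixing."""
    T, H, W = video.shape
    kh, kw = kernel.shape
    out = np.zeros((T, H - kh + 1, W - kw + 1))
    for t in range(T):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(video[t, i:i+kh, j:j+kw] * kernel)
    return out

def conv3d(video, kernel):
    """Apply a 3D kernel across time and space: motion-aware features."""
    T, H, W = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[t, i, j] = np.sum(video[t:t+kt, i:i+kh, j:j+kw] * kernel)
    return out

# Toy data: a dot that shifts one pixel per frame vs. a dot that stays put.
video = np.zeros((4, 5, 5))
for t in range(4):
    video[t, 2, t] = 1.0
static_video = np.zeros((4, 5, 5))
static_video[:, 2, 2] = 1.0

# A temporal-difference 3D kernel (out = next frame minus current frame)
# fires on the moving dot but is silent on the static one.
k3d = np.zeros((2, 1, 1))
k3d[0, 0, 0], k3d[1, 0, 0] = -1.0, 1.0
motion_response = conv3d(video, k3d)
assert np.abs(conv3d(static_video, k3d)).sum() == 0  # no motion, no response
assert np.abs(motion_response).sum() > 0             # motion detected

# Per-frame 2D statistics are identical for both videos: a purely
# frame-wise model cannot separate them, which is the dataset pressure
# that motivated spatiotemporal architectures.
k2d = np.ones((1, 1))
moving_stats = conv2d_per_frame(video, k2d).sum(axis=(1, 2))
static_stats = conv2d_per_frame(static_video, k2d).sum(axis=(1, 2))
assert np.allclose(moving_stats, static_stats)
```

The same contrast scales up to real architectures: a 2D backbone applied frame by frame needs an auxiliary motion stream (as in two-stream networks), whereas a 3D kernel encodes temporal locality directly.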
Related papers
- Revisiting the Generic Transformer: Deconstructing a Strong Baseline for Time Series Foundation Models [18.841505010078112]
We investigate the potential of a standard patch Transformer, demonstrating that it achieves state-of-the-art zero-shot forecasting performance. We conduct a comprehensive ablation study that covers model scaling, data composition, and training techniques to isolate the essential ingredients for high performance.
arXiv Detail & Related papers (2026-02-06T18:01:44Z) - CoMa: Contextual Massing Generation with Vision-Language Models [7.943264761730892]
We propose an automated framework for generating building massing based on functional requirements and site context. A primary obstacle to such data-driven methods has been the lack of suitable datasets. We benchmark this dataset by formulating massing generation as a conditional task for Vision-Language Models.
arXiv Detail & Related papers (2026-01-13T11:44:00Z) - Factuality Matters: When Image Generation and Editing Meet Structured Visuals [46.627460447235855]
We construct a large-scale dataset of 1.3 million high-quality structured image pairs. We train a unified model that integrates a VLM with FLUX.1 Kontext. A three-stage training curriculum enables progressive feature alignment, knowledge infusion, and reasoning-augmented generation.
arXiv Detail & Related papers (2025-10-06T17:56:55Z) - Scaling Towards the Information Boundary of Instruction Set: InfinityInstruct-Subject Technical Report [11.70656700216213]
Construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. We propose a systematic instruction data synthesis framework, which integrates a hierarchical labeling system, an informative seed selection algorithm, and a model-deficiency diagnosis. Based on this framework, we construct InfinityInstruct-Subject, a high-quality dataset containing 1.5 million instructions.
arXiv Detail & Related papers (2025-07-09T15:59:02Z) - SimVecVis: A Dataset for Enhancing MLLMs in Visualization Understanding [10.168582728627042]
Current multimodal large language models (MLLMs) struggle with visualization understanding due to their inability to decode the data-to-visual mapping and extract structured information. We propose SimVec, a novel simplified vector format that encodes chart elements such as mark type, position, and size. We build a new visualization dataset, SimVecVis, to enhance the performance of MLLMs in visualization understanding.
arXiv Detail & Related papers (2025-06-26T14:35:59Z) - Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z) - DIPO: Dual-State Images Controlled Articulated Object Generation Powered by Diverse Data [67.99373622902827]
DIPO is a framework for controllable generation of articulated 3D objects from a pair of images. We propose a dual-image diffusion model that captures relationships between the image pair to generate part layouts and joint parameters. We propose PM-X, a large-scale dataset of complex articulated 3D objects, accompanied by rendered images, URDF annotations, and textual descriptions.
arXiv Detail & Related papers (2025-05-26T18:55:14Z) - GridMind: A Multi-Agent NLP Framework for Unified, Cross-Modal NFL Data Insights [0.0]
This paper introduces GridMind, a framework that unifies structured, semi-structured, and unstructured data through Retrieval-Augmented Generation (RAG) and large language models (LLMs). This approach aligns with the evolving field of multimodal representation learning, where unified models are increasingly essential for real-time, cross-modal interactions.
arXiv Detail & Related papers (2025-03-24T18:33:36Z) - Investigating Public Fine-Tuning Datasets: A Complex Review of Current Practices from a Construction Perspective [2.12587313410587]
This paper reviews current public fine-tuning datasets from the perspective of data construction.
The review provides an overview of public fine-tuning datasets from two angles: their evolution and their taxonomy.
arXiv Detail & Related papers (2024-07-11T13:11:16Z) - Defining Neural Network Architecture through Polytope Structures of Dataset [53.512432492636236]
This paper defines upper and lower bounds for neural network widths, which are informed by the polytope structure of the dataset in question.
We develop an algorithm to investigate a converse situation where the polytope structure of a dataset can be inferred from its corresponding trained neural networks.
It is established that popular datasets such as MNIST, Fashion-MNIST, and CIFAR10 can be efficiently encapsulated using no more than two polytopes with a small number of faces.
arXiv Detail & Related papers (2024-02-04T08:57:42Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.