MV-TAP: Tracking Any Point in Multi-View Videos
- URL: http://arxiv.org/abs/2512.02006v1
- Date: Mon, 01 Dec 2025 18:59:01 GMT
- Title: MV-TAP: Tracking Any Point in Multi-View Videos
- Authors: Jahyeok Koo, Inès Hyeonsu Kim, Mungyeom Kim, Junghyun Park, Seohyun Park, Jaeyeong Kim, Jung Yi, Seokju Cho, Seungryong Kim
- Abstract summary: MV-TAP is a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking.
- Score: 34.91357343992975
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-view camera systems enable rich observations of complex real-world scenes, and understanding dynamic objects in multi-view settings has become central to various applications. In this work, we present MV-TAP, a novel point tracker that tracks points across multi-view videos of dynamic scenes by leveraging cross-view information. MV-TAP utilizes camera geometry and a cross-view attention mechanism to aggregate spatio-temporal information across views, enabling more complete and reliable trajectory estimation in multi-view videos. To support this task, we construct a large-scale synthetic training dataset and real-world evaluation sets tailored for multi-view tracking. Extensive experiments demonstrate that MV-TAP outperforms existing point-tracking methods on challenging benchmarks, establishing an effective baseline for advancing research in multi-view point tracking.
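As a rough illustration of the aggregation the abstract describes (a minimal sketch under assumed shapes and module names, not the authors' implementation), per-view point features can be fused by an attention layer that runs over the view axis, while a standard pinhole projection supplies the camera-geometry side:

```python
# Hedged sketch (not MV-TAP's code): one plausible form of cross-view
# aggregation for multi-view point tracking. Names and shapes are assumptions.
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Fuse per-view point features by letting each view attend to all others."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_points, num_views, dim) -- one feature per tracked point
        # per camera view; attention runs over the view axis.
        fused, _ = self.attn(feats, feats, feats)
        return self.norm(feats + fused)  # residual update per view

def project(points_3d: torch.Tensor, K: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Pinhole projection of world points into one camera (the geometry step)."""
    # points_3d: (N, 3); K: (3, 3) intrinsics; E: (3, 4) extrinsics [R|t].
    homog = torch.cat([points_3d, torch.ones(len(points_3d), 1)], dim=1)  # (N, 4)
    cam = (E @ homog.T).T            # world -> camera coordinates
    pix = (K @ cam.T).T              # camera -> pixel (homogeneous)
    return pix[:, :2] / pix[:, 2:3]  # perspective divide -> (N, 2)

# Usage: fuse features of 64 points observed by 4 views.
fused = CrossViewAttention()(torch.randn(64, 4, 128))
print(fused.shape)  # torch.Size([64, 4, 128])
```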
Related papers
- A Comprehensive Survey on Video Scene Parsing:Advances, Challenges, and Prospects [53.15503034595476]
Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes.
arXiv Detail & Related papers (2025-06-16T14:39:03Z) - FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment [7.5152380894919055]
We propose an end-to-end framework that integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.
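For intuition only, a generic association step in this spirit (not FusionTrack's actual method; the cost weights, threshold, and function names are assumptions) might combine re-ID appearance distance with spatial proximity and solve the assignment with the Hungarian algorithm:

```python
# Hedged sketch of the general idea (not FusionTrack's code): combine
# re-identification appearance similarity with spatial proximity into one
# assignment cost for trajectory association. All weights are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_emb, det_emb, track_xy, det_xy, w_app=0.7, max_dist=100.0):
    # Cosine appearance distance between track and detection embeddings.
    t = track_emb / np.linalg.norm(track_emb, axis=1, keepdims=True)
    d = det_emb / np.linalg.norm(det_emb, axis=1, keepdims=True)
    app_cost = 1.0 - t @ d.T                          # (T, D) in [0, 2]
    # Normalized Euclidean distance between predicted and detected positions.
    diff = track_xy[:, None, :] - det_xy[None, :, :]
    geo_cost = np.linalg.norm(diff, axis=2) / max_dist
    cost = w_app * app_cost + (1.0 - w_app) * geo_cost
    rows, cols = linear_sum_assignment(cost)          # optimal 1-to-1 matching
    return list(zip(rows.tolist(), cols.tolist()))

matches = associate(np.random.rand(3, 32), np.random.rand(4, 32),
                    np.random.rand(3, 2) * 100, np.random.rand(4, 2) * 100)
print(matches)
```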
arXiv Detail & Related papers (2025-05-24T14:51:19Z) - NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models [11.184459657989914]
We introduce NuPlanQA-Eval, a multi-view, multi-modal evaluation benchmark for driving scene understanding. We also propose NuPlanQA-1M, a large-scale dataset comprising 1M real-world visual question-answering (VQA) pairs. Our evaluation results reveal key challenges that existing MLLMs face in driving scene-specific perception and spatial reasoning from ego-centric perspectives.
arXiv Detail & Related papers (2025-03-17T03:12:39Z) - Seeing Beyond Views: Multi-View Driving Scene Video Generation with Holistic Attention [61.3281618482513]
We present CogDriving, a novel network designed for synthesizing high-quality multi-view driving videos. CogDriving leverages a Diffusion Transformer architecture with holistic-4D attention modules, enabling simultaneous associations across the spatial, temporal, and viewpoint dimensions. CogDriving demonstrates strong performance on the nuScenes validation set, achieving an FVD score of 37.8, highlighting its ability to generate realistic driving videos.
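A minimal reading of "holistic-4D attention" (an assumption on our part, not CogDriving's released code) is that view, time, height, and width are flattened into a single token sequence, so one attention pass associates all four axes jointly:

```python
# Hedged sketch, not CogDriving itself: flatten view, time, height, and width
# into one token sequence so every token attends across all four axes at once.
# The shapes below are illustrative assumptions.
import torch
import torch.nn as nn

views, frames, H, W, dim = 3, 4, 8, 8, 64
tokens = torch.randn(1, views * frames * H * W, dim)  # (batch, V*T*H*W, dim)

attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
out, _ = attn(tokens, tokens, tokens)  # joint association across all axes
print(out.shape)  # torch.Size([1, 768, 64])
```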
arXiv Detail & Related papers (2024-12-04T18:02:49Z) - MCTR: Multi Camera Tracking Transformer [45.66952089591361]
Multi-Camera Tracking tRansformer (MCTR) is a novel end-to-end approach tailored for multi-object detection and tracking across multiple cameras.
MCTR leverages end-to-end detectors like the DEtection TRansformer (DETR) to produce detections and detection embeddings independently for each camera view.
The framework maintains a set of track embeddings that encapsulate global information about the tracked objects, and updates them at every frame by integrating local information from the view-specific detection embeddings.
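A minimal sketch of that update rule (assumed dimensions; not MCTR's released code) treats the global track embeddings as queries that cross-attend to the concatenated per-view detection embeddings each frame:

```python
# Hedged sketch of the described update (not MCTR's code): global track
# embeddings act as queries that cross-attend to per-view detection
# embeddings each frame. Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class TrackUpdater(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tracks: torch.Tensor, detections: torch.Tensor) -> torch.Tensor:
        # tracks: (batch, num_tracks, dim) global track embeddings.
        # detections: (batch, num_views * num_dets, dim) DETR-style detection
        # embeddings from all cameras, concatenated along the token axis.
        upd, _ = self.cross_attn(tracks, detections, detections)
        return self.norm(tracks + upd)  # residual update keeps track identity

updater = TrackUpdater()
tracks = torch.randn(1, 10, 256)     # 10 tracked objects
dets = torch.randn(1, 4 * 100, 256)  # 4 views x 100 DETR queries
print(updater(tracks, dets).shape)   # torch.Size([1, 10, 256])
```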
arXiv Detail & Related papers (2024-08-23T17:37:03Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - ViewFormer: Exploring Spatiotemporal Modeling for Multi-View 3D Occupancy Perception via View-Guided Transformers [9.271932084757646]
3D occupancy represents the entire scene, without distinguishing between foreground and background, by discretizing the physical space into a grid map.
We propose a learning-first view attention mechanism for effective multi-view feature aggregation.
We present FlowOcc3D, a benchmark built on top of existing high-quality datasets.
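One plausible form of such view-guided aggregation over an occupancy grid (a sketch under assumed camera conventions, not ViewFormer's code) projects each voxel center into every camera and bilinearly samples image features there:

```python
# Hedged sketch of view-guided aggregation (not ViewFormer's code): project
# each voxel center of the occupancy grid into every camera and bilinearly
# sample image features there. Camera parameters and shapes are assumptions.
import torch
import torch.nn.functional as F

def sample_voxel_features(feat_maps, centers, Ks, Es, img_hw):
    # feat_maps: (V, C, Hf, Wf) per-view feature maps
    # centers: (N, 3) voxel centers in world coordinates
    # Ks: (V, 3, 3) intrinsics; Es: (V, 3, 4) extrinsics; img_hw: (H, W)
    V, C, _, _ = feat_maps.shape
    H, W = img_hw
    homog = torch.cat([centers, torch.ones(len(centers), 1)], dim=1)  # (N, 4)
    out = []
    for v in range(V):
        cam = (Es[v] @ homog.T).T  # world -> camera coordinates
        pix = (Ks[v] @ cam.T).T
        uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([uv[:, 0] / W * 2 - 1, uv[:, 1] / H * 2 - 1], dim=1)
        sampled = F.grid_sample(feat_maps[v:v + 1], grid.view(1, 1, -1, 2),
                                align_corners=False)  # (1, C, 1, N)
        out.append(sampled.view(C, -1).T)             # (N, C)
    return torch.stack(out).mean(0)                   # average over views

feats = sample_voxel_features(torch.randn(2, 32, 50, 50), torch.randn(100, 3),
                              torch.eye(3).expand(2, 3, 3),
                              torch.eye(3, 4).expand(2, 3, 4), (200, 200))
print(feats.shape)  # torch.Size([100, 32])
```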
arXiv Detail & Related papers (2024-05-07T13:15:07Z) - MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark [63.878793340338035]
Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras.
Existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting.
We present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments.
arXiv Detail & Related papers (2024-03-29T15:08:37Z) - CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking
and Segmentation [31.167405688707575]
We propose a framework for instance-level visual analysis on video frames.
It can simultaneously conduct object detection, instance segmentation, and multi-object tracking.
We evaluate the proposed method extensively on KITTI MOTS and MOTS Challenge datasets.
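As a schematic of the shared-backbone, multi-head design such a framework implies (head layouts and channel sizes are illustrative assumptions, not CML-MOTS's architecture):

```python
# Hedged sketch of the multi-task idea (not CML-MOTS itself): one shared
# backbone feeds separate heads for detection, instance masks, and tracking
# embeddings. Head designs and channel sizes are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, in_ch: int = 64, num_classes: int = 8, emb_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, in_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1), nn.ReLU())
        self.det_head = nn.Conv2d(in_ch, num_classes + 4, 1)  # class + box
        self.seg_head = nn.Conv2d(in_ch, num_classes, 1)      # mask logits
        self.emb_head = nn.Conv2d(in_ch, emb_dim, 1)          # tracking embedding

    def forward(self, x: torch.Tensor):
        f = self.backbone(x)  # shared features serve all three tasks
        return self.det_head(f), self.seg_head(f), self.emb_head(f)

det, seg, emb = MultiTaskModel()(torch.randn(1, 3, 128, 128))
print(det.shape, seg.shape, emb.shape)
```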
arXiv Detail & Related papers (2023-11-02T04:32:24Z) - Multi-Spectral Image Stitching via Spatial Graph Reasoning [52.27796682972484]
We propose a spatial graph reasoning based multi-spectral image stitching method.
We embed multi-scale complementary features from the same view position into a set of nodes.
By introducing long-range coherence along spatial and channel dimensions, the complementarity of pixel relations and channel interdependencies aids in the reconstruction of aligned multi-view features.
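A minimal sketch of graph reasoning in this spirit (node counts, dimensions, and the single-step update are assumptions, not the paper's module) builds a learned affinity over feature nodes and propagates messages along it:

```python
# Hedged sketch of spatial graph reasoning (not the paper's code): treat
# multi-scale features as graph nodes and propagate information along a
# learned affinity matrix. Node counts and dimensions are assumptions.
import torch
import torch.nn as nn

class GraphReasoning(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (num_nodes, dim), one node per scale/position feature.
        affinity = torch.softmax(
            self.proj_q(nodes) @ self.proj_k(nodes).T / nodes.shape[1] ** 0.5,
            dim=-1)                                   # long-range pairwise relations
        return nodes + self.update(affinity @ nodes)  # one message-passing step

nodes = torch.randn(16, 64)  # e.g., 16 multi-scale feature nodes
print(GraphReasoning()(nodes).shape)  # torch.Size([16, 64])
```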
arXiv Detail & Related papers (2023-07-31T15:04:52Z) - DIVOTrack: A Novel Dataset and Baseline Method for Cross-View
Multi-Object Tracking in DIVerse Open Scenes [74.64897845999677]
We introduce a new cross-view multi-object tracking dataset for DIVerse Open scenes with densely tracked pedestrians.
Our DIVOTrack has fifteen distinct scenarios and 953 cross-view tracks, surpassing all cross-view multi-object tracking datasets currently available.
Furthermore, we provide a novel baseline cross-view tracking method with a unified joint detection and cross-view tracking framework named CrossMOT.
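For intuition, cross-view identity linking from a joint detection-plus-embedding model could look like the following greedy matcher (a hypothetical sketch; the threshold and greedy scheme are assumptions, not CrossMOT's method):

```python
# Hedged sketch of cross-view identity matching (not CrossMOT's code): given
# per-view detection embeddings from a joint model, link identities across
# cameras by embedding similarity above a threshold. Values are assumptions.
import numpy as np

def cross_view_match(emb_a: np.ndarray, emb_b: np.ndarray, thresh: float = 0.6):
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sim = a @ b.T  # cosine similarity between detections in the two views
    pairs = []
    while sim.size and sim.max() > thresh:  # greedy best-first matching
        i, j = np.unravel_index(sim.argmax(), sim.shape)
        pairs.append((int(i), int(j)))
        sim[i, :], sim[:, j] = -1.0, -1.0   # remove the matched row/column
    return pairs

print(cross_view_match(np.random.rand(5, 64), np.random.rand(6, 64)))
```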
arXiv Detail & Related papers (2023-02-15T14:10:42Z)