Seeing without Pixels: Perception from Camera Trajectories
- URL: http://arxiv.org/abs/2511.21681v1
- Date: Wed, 26 Nov 2025 18:57:01 GMT
- Title: Seeing without Pixels: Perception from Camera Trajectories
- Authors: Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
- Abstract summary: We propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, "how you move" can indeed reveal "what you are doing" (egocentric) or "observing" (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.
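The abstract describes aligning trajectory embeddings with natural language through contrastive learning but does not state the exact objective. As a rough illustration only, the sketch below uses a generic CLIP-style symmetric InfoNCE loss in NumPy. The function names, the temperature of 0.07, and the random stand-in embeddings are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(traj_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (assumed, not the paper's stated loss).

    Row i of traj_emb and row i of text_emb form a positive pair;
    every other row in the batch acts as a negative.
    """
    traj = l2_normalize(traj_emb)
    text = l2_normalize(text_emb)
    logits = traj @ text.T / temperature        # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    # Trajectory-to-text: each row should peak on its own column.
    loss_traj_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-trajectory: each column should peak on its own row.
    loss_text_to_traj = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_traj_to_text + loss_text_to_traj)


# Hypothetical stand-ins for encoder outputs (4 pairs, 16-dim embeddings).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 16))
near_copy = text_emb + 0.01 * rng.normal(size=(4, 16))

aligned_loss = symmetric_contrastive_loss(near_copy, text_emb)
mismatched_loss = symmetric_contrastive_loss(text_emb[::-1], text_emb)
print(aligned_loss < mismatched_loss)
```

In this toy setup, correctly paired embeddings give a lower loss than the same embeddings paired in reversed order, which is the behaviour a contrastive objective is meant to reward.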
Related papers
- CamPilot: Improving Camera Control in Video Diffusion Model with Efficient Camera Reward Feedback [43.174121093566264]
We build upon Reward Feedback Learning and aim to further improve camera controllability. Current reward models lack the capacity to assess video-camera alignment. We introduce an efficient camera-aware 3D decoder that decodes video latent into 3D representations for reward quantization.
arXiv Detail & Related papers (2026-01-22T18:59:56Z)
- Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation [49.12018869332346]
InfCam is a camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model.
arXiv Detail & Related papers (2025-12-18T20:03:05Z)
- Towards Understanding Camera Motions in Any Video [89.97247162415158]
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of 3,000 diverse internet videos annotated by experts through a rigorous quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers.
arXiv Detail & Related papers (2025-04-21T18:34:57Z)
- ReCamMaster: Camera-Controlled Generative Rendering from A Single Video [72.42376733537925]
ReCamMaster is a camera-controlled generative video re-rendering framework. It reproduces the dynamic scene of an input video at novel camera trajectories. Our method also finds promising applications in video stabilization, super-resolution, and outpainting.
arXiv Detail & Related papers (2025-03-14T17:59:31Z)
- Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention [62.2447324481159]
Cavia is a novel framework for camera-controllable, multi-view video generation.
Our framework extends the spatial and temporal attention modules, improving both viewpoint and temporal consistency.
Cavia is the first of its kind that allows the user to specify distinct camera motion while obtaining object motion.
arXiv Detail & Related papers (2024-10-14T17:46:32Z)
- Training-free Camera Control for Video Generation [15.79168688275606]
We propose a training-free and robust solution to offer camera movement control for off-the-shelf video diffusion models. Our method does not require any supervised finetuning on camera-annotated datasets or self-supervised training via data augmentation. It can be plug-and-play with most pretrained video diffusion models and generate camera-controllable videos with a single image or text prompt as input.
arXiv Detail & Related papers (2024-06-14T15:33:00Z)
- SVDistNet: Self-Supervised Near-Field Distance Estimation on Surround View Fisheye Cameras [30.480562747903186]
A 360° perception of scene geometry is essential for automated driving, notably for parking and urban driving scenarios.
We present novel camera-geometry adaptive multi-scale convolutions which utilize the camera parameters as a conditional input.
We evaluate our approach on the Fisheye WoodScape surround-view dataset, significantly improving over previous approaches.
arXiv Detail & Related papers (2021-04-09T15:20:20Z)
- Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs by Implicitly Unprojecting to 3D [100.93808824091258]
We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras.
Our approach is to "lift" each image individually into a frustum of features for each camera, then "splat" all frustums into a bird's-eye-view grid.
We show that the representations inferred by our model enable interpretable end-to-end motion planning by "shooting" template trajectories into a bird's-eye-view cost map output by our network.
arXiv Detail & Related papers (2020-08-13T06:29:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.