Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
- URL: http://arxiv.org/abs/2602.18977v1
- Date: Sat, 21 Feb 2026 23:05:53 GMT
- Title: Frame2Freq: Spectral Adapters for Fine-Grained Video Understanding
- Authors: Thinesh Thiyakesan Ponbagavathi, Constantin Seibold, Alina Roitberg
- Abstract summary: We introduce Frame2Freq -- a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs). Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them.
- Score: 9.139773470565556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapting image-pretrained backbones to video typically relies on time-domain adapters tuned to a single temporal scale. Our experiments show that these modules pick up static image cues and very fast flicker changes while overlooking medium-speed motion. Capturing dynamics across multiple timescales is, however, crucial for fine-grained temporal analysis (e.g., opening vs. closing a bottle). To address this, we introduce Frame2Freq -- a family of frequency-aware adapters that perform spectral encoding during image-to-video adaptation of pretrained Vision Foundation Models (VFMs), improving fine-grained action recognition. Frame2Freq applies a Fast Fourier Transform (FFT) along the temporal axis and learns frequency-band-specific embeddings that adaptively highlight the most discriminative frequency ranges. Across five fine-grained activity recognition datasets, Frame2Freq outperforms prior PEFT methods and even surpasses fully fine-tuned models on four of them. These results provide encouraging evidence that frequency-analysis methods are a powerful tool for modeling temporal dynamics in image-to-video transfer. Code is available at https://github.com/th-nesh/Frame2Freq.
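The abstract describes the mechanism concretely enough to sketch: a temporal FFT followed by learned per-band embeddings that re-weight the spectrum. The PyTorch sketch below is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that); the module name SpectralAdapter, the contiguous band partitioning, and the bottleneck shape are all my own assumptions.

import torch
import torch.nn as nn

class SpectralAdapter(nn.Module):
    """Hypothetical adapter: re-weights temporal frequency bands of token features."""

    def __init__(self, dim: int, n_frames: int, n_bands: int = 4):
        super().__init__()
        n_freq = n_frames // 2 + 1                          # rfft output length
        # Map each frequency bin to one of n_bands contiguous bands.
        self.register_buffer("band_id", torch.linspace(0, n_bands - 1e-6, n_freq).long())
        # One learned per-channel gain per frequency band.
        self.band_embed = nn.Parameter(torch.ones(n_bands, dim))
        self.down = nn.Linear(dim, dim // 4)                # adapter-style bottleneck
        self.up = nn.Linear(dim // 4, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) token features from a frozen image backbone.
        spec = torch.fft.rfft(x, dim=1)                     # complex (B, F, D)
        spec = spec * self.band_embed[self.band_id].unsqueeze(0)  # re-weight bands
        y = torch.fft.irfft(spec, n=x.shape[1], dim=1)      # back to the time domain
        return x + self.up(torch.relu(self.down(y)))        # residual adapter update

# Usage: features for a 16-frame clip from a frozen ViT with 768 channels.
feats = torch.randn(2, 16, 768)
out = SpectralAdapter(dim=768, n_frames=16)(feats)          # same shape as input

Grouping the rfft bins into a few contiguous bands keeps the learned parameters small while still letting the adapter suppress the static (DC) band or boost mid-frequency motion, which is exactly the failure mode the abstract highlights.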
Related papers
- Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding [43.587729230845525]
Current methods typically select frames with high relevance to a given query. We introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework. WFS-SB significantly boosts LVLM performance, improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench.
arXiv Detail & Related papers (2026-02-28T07:18:07Z)
- STORM: Token-Efficient Long Video Understanding for Multimodal LLMs [116.4479155699528]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLM. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Enhancing Long Video Generation Consistency without Tuning [92.1714656167712]
We address issues that limit the consistency and coherence of videos generated from either single or multiple prompts. We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which judiciously edits the attention score matrix. For videos generated from multiple prompts, we further uncover key factors, such as the alignment of the prompts, that affect quality. Inspired by our analyses, we propose PromptBlend, an advanced prompt pipeline that systematically aligns the prompts.
arXiv Detail & Related papers (2024-12-23T03:56:27Z)
- FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding [25.21011724370177]
Text-guided Video Temporal Grounding (VTG) aims to localize relevant segments in untrimmed videos based on textual descriptions. We introduce FlashVTG, a framework featuring a Temporal Feature Layering (TFL) module and an Adaptive Score Refinement (ASR) module. FlashVTG achieves state-of-the-art performance on four widely adopted datasets in both Moment Retrieval (MR) and Highlight Detection (HD).
arXiv Detail & Related papers (2024-12-18T02:23:33Z)
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful at generating videos from a single frame, need adaptation for two-frame conditioned generation. We introduce a novel bidirectional sampling strategy to address the resulting off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
- Frame Flexible Network [52.623337134518835]
Existing video recognition algorithms typically require separate training pipelines for inputs with different frame counts.
If the model is evaluated at frame counts not used during training, performance drops significantly.
We propose a general framework, named Frame Flexible Network (FFN), which enables the model to be evaluated at different frame counts to adjust its computation.
arXiv Detail & Related papers (2023-03-26T20:51:35Z)
- Real-time Online Video Detection with Temporal Smoothing Transformers [4.545986838009774]
A good streaming recognition model captures both the long-term dynamics and the short-term changes of a video.
To address this, we reformulate the cross-attention in a video transformer through the lens of kernels.
We build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing overhead (a simplified sketch of this streaming-kernel idea appears after this list).
arXiv Detail & Related papers (2022-09-19T17:59:02Z)
- TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI).
Our method outperforms other state-of-the-art methods on four widely used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z)
- Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario [87.72258480670627]
Existing face forgery detection methods based on the frequency domain find that GAN-forged images exhibit obvious grid-like visual artifacts in the frequency spectrum compared to real images.
This paper proposes a Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation.
arXiv Detail & Related papers (2022-07-05T09:27:53Z)
- Spatiotemporal Augmentation on Selective Frequencies for Video Representation Learning [36.352159541825095]
We propose FreqAug, a data augmentation that filters videos in the frequency domain for representation learning.
FreqAug pushes the model to focus more on dynamic features in the video by dropping spatial or temporal low-frequency components.
To verify the generality of the proposed method, we experiment with FreqAug on multiple self-supervised learning frameworks along with standard augmentations (a minimal sketch of the frequency-filtering idea appears after this list).
arXiv Detail & Related papers (2022-04-08T06:19:32Z)
- Exploring Spatial-Temporal Multi-Frequency Analysis for High-Fidelity and Temporal-Consistency Video Prediction [12.84409065286371]
We propose a video prediction network based on multi-level wavelet analysis to handle spatial and temporal information in a unified manner.
Our model shows significant improvements in fidelity and temporal consistency over state-of-the-art works.
arXiv Detail & Related papers (2020-02-23T13:46:29Z)
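As referenced in the TeSTra entry above, here is a simplified Python sketch of the streaming-kernel idea. The actual method derives temporal smoothing kernels (box and exponential) from a kernel reformulation of cross-attention; this toy version shows only how an exponential forgetting factor turns attention over an unbounded history into an O(1)-per-frame state update. The class name and the per-frame scalar weighting are illustrative assumptions, not the authors' design.

import torch

class ExpSmoothedAttention:
    """Running attention state with exponential forgetting; constant memory."""

    def __init__(self, decay: float = 0.99):
        self.decay = decay       # forgetting factor applied to old frames
        self.num = None          # running sum of weight * value
        self.den = None          # running sum of weights

    def step(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (dim,) features of the newest frame only.
        w = torch.exp(q @ k / k.shape[0] ** 0.5)          # unnormalized weight
        if self.num is None:
            self.num, self.den = w * v, w
        else:
            self.num = self.decay * self.num + w * v      # decay old evidence
            self.den = self.decay * self.den + w
        return self.num / self.den                        # smoothed readout

# Usage: feed frames one at a time; cost per step does not grow with history.
attn = ExpSmoothedAttention(decay=0.95)
for _ in range(32):
    q, k, v = torch.randn(3, 64)
    out = attn.step(q, k, v)                              # (64,) output per frame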
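And the frequency-filtering idea referenced in the FreqAug entry: zeroing low temporal frequencies removes slowly varying, near-static content, pushing the representation toward motion. This sketch covers only the temporal variant; the actual method also drops spatial low frequencies and applies the augmentation stochastically. The function name and cutoff are illustrative assumptions.

import torch

def drop_temporal_low_freq(clip: torch.Tensor, cutoff: int = 2) -> torch.Tensor:
    # clip: (frames, channels, height, width) video tensor.
    spec = torch.fft.rfft(clip, dim=0)                # temporal FFT, complex (F, C, H, W)
    spec[:cutoff] = 0                                 # zero DC and slow components
    return torch.fft.irfft(spec, n=clip.shape[0], dim=0)

# Usage: augment a 16-frame RGB clip before feeding it to the encoder.
aug = drop_temporal_low_freq(torch.randn(16, 3, 112, 112))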