STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
- URL: http://arxiv.org/abs/2508.13470v1
- Date: Tue, 19 Aug 2025 03:03:29 GMT
- Title: STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
- Authors: Tinh-Anh Nguyen-Nhu, Triet Dao Hoang Minh, Dat To-Thanh, Phuc Le-Gia, Tuan Vo-Lan, Tien-Huy Nguyen,
- Abstract summary: This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance. Experimental results on the WTS and BDD datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. Our framework is validated by a test score of 55.655 in the AI City Challenge 2025 Track 2.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) have emerged as powerful tools for enabling automated traffic analysis; however, current approaches often demand substantial computational resources and struggle with fine-grained spatio-temporal understanding. This paper introduces STER-VLM, a computationally efficient framework that enhances VLM performance through (1) caption decomposition to tackle spatial and temporal information separately, (2) temporal frame selection with best-view filtering for sufficient temporal information, (3) reference-driven understanding for capturing fine-grained motion and dynamic context, and (4) curated visual/textual prompt techniques. Experimental results on the WTS \cite{kong2024wts} and BDD \cite{BDD} datasets demonstrate substantial gains in semantic richness and traffic scene interpretation. Our framework is validated by a test score of 55.655 in the AI City Challenge 2025 Track 2, showing its effectiveness in advancing resource-efficient and accurate traffic analysis for real-world applications.
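The abstract's components (1) and (2) can be illustrated with a minimal sketch. This is an illustrative assumption, not the authors' implementation: the prompt wording, function names, and the idea of scoring frames with a sharpness or visibility heuristic are all hypothetical stand-ins for the techniques the abstract names.

```python
# Hypothetical sketch of two STER-VLM ideas: (1) caption decomposition
# into separate spatial and temporal queries, and (2) temporal frame
# selection with best-view filtering. All names and the scoring
# heuristic are illustrative assumptions.

def decompose_caption_queries(base_query: str) -> dict:
    """Split one captioning task into spatial and temporal sub-prompts,
    so each VLM call handles one kind of information."""
    return {
        "spatial": f"{base_query} Describe the static scene layout: "
                   "road type, signals, and positions of each agent.",
        "temporal": f"{base_query} Describe how the agents move over time: "
                    "speed changes, turns, and interactions.",
    }

def select_frames(frame_scores: list[float], k: int) -> list[int]:
    """Best-view filtering: keep the indices of the k highest-scoring
    frames, returned in temporal order. Scores could come from, e.g.,
    a sharpness or subject-visibility heuristic."""
    ranked = sorted(range(len(frame_scores)),
                    key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:k])

scores = [0.2, 0.9, 0.4, 0.8, 0.1]
print(select_frames(scores, 3))  # → [1, 2, 3]
```

Keeping the selected indices in temporal order (the final `sorted`) preserves the event sequence for the downstream temporal caption, which matters more here than the ranking itself.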
Related papers
- STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning [65.36458157092207]
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. We propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. We introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization.
arXiv Detail & Related papers (2026-02-12T08:53:32Z) - Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning [82.39668822222386]
Vision token pruning has proven to be an effective acceleration technique for efficient Vision-Language Models (VLMs). We propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. Experiments demonstrate that Nüwa achieves SOTA performance on multiple VQA benchmarks (from 94% to 95%) and yields substantial improvements on visual grounding tasks (from 7% to 47%).
arXiv Detail & Related papers (2026-02-03T00:51:03Z) - BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion [7.382475458362566]
We present BREATH-VL, a hybrid framework that integrates semantic cues from vision-language models with geometric information from registration methods for accurate 6-DoF pose estimation. Building on this, our BREATH-VL outperforms state-of-the-art vision-only localization methods in both accuracy and generalization, reducing translational error by 25.5% compared with the best-performing baseline.
arXiv Detail & Related papers (2026-01-07T09:00:52Z) - TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs [81.78017865436816]
We present TimeLens, a systematic investigation into building MLLMs with strong video temporal grounding ability. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset.
arXiv Detail & Related papers (2025-12-16T18:59:58Z) - Vision-LLMs for Spatiotemporal Traffic Forecasting [14.700408329373998]
Large Language Models (LLMs) inherently struggle to model the complex spatial dependencies of grid-based traffic data. We propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. We show that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain scenarios.
arXiv Detail & Related papers (2025-10-13T11:15:56Z) - Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling [3.5408685781175016]
Training vision-language models on cognitively-plausible amounts of data requires rethinking how models integrate multimodal information. We propose a lightweight decoder-based architecture with token-wise dynamic gating for adaptive fusion of linguistic and visual cues.
arXiv Detail & Related papers (2025-10-09T17:10:36Z) - Harnessing Vision-Language Models for Time Series Anomaly Detection [9.257985820123]
Time-series anomaly detection (TSAD) has played a vital role in a variety of fields, including healthcare, finance, and industrial monitoring. Prior methods, which mainly focus on training domain-specific models on numerical data, lack the visual-temporal reasoning capacity that human experts have to identify contextual anomalies. We propose a two-stage solution, with ViT4TS, a vision-screening stage built on a relatively lightweight pretrained vision encoder, and VLM4TS, a VLM-based stage that integrates global temporal context and VLM reasoning capacity.
arXiv Detail & Related papers (2025-06-07T15:27:30Z) - Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput [12.996955972977986]
Flash-VL 2B is a novel approach to optimizing Vision-Language Models for real-time applications. We show that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy.
arXiv Detail & Related papers (2025-05-14T15:45:17Z) - SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z) - Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision-language model (VLM) framework to enhance feature extraction, scalability, and efficiency. We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z) - Scenario Understanding of Traffic Scenes Through Large Visual Language Models [2.3302708486956454]
Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries. In this study, we evaluate the capabilities of LVLMs to understand and classify urban traffic scenes on both an in-house dataset and BDD100K. We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling flexible deployment on new datasets.
arXiv Detail & Related papers (2025-01-28T18:23:12Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
arXiv Detail & Related papers (2022-11-21T18:22:39Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Reinforcement Learning with Latent Flow [78.74671595139613]
Flow of Latents for Reinforcement Learning (Flare) is a network architecture for RL that explicitly encodes temporal information through latent vector differences.
We show that Flare recovers optimal performance in state-based RL without explicit access to the state velocity.
We also show that Flare achieves state-of-the-art performance on pixel-based challenging continuous control tasks within the DeepMind control benchmark suite.
arXiv Detail & Related papers (2021-01-06T03:50:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.