StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing
- URL: http://arxiv.org/abs/2601.22738v1
- Date: Fri, 30 Jan 2026 09:19:22 GMT
- Title: StreamSense: Streaming Social Task Detection with Selective Vision-Language Model Routing
- Authors: Han Wang, Deyi Ji, Lanyun Zhu, Jiebo Luo, Roy Ka-Wei Lee
- Abstract summary: StreamSense is a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model expert. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation). Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks.
- Score: 56.32296785595906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Live streaming platforms require real-time monitoring of, and reaction to, social signals, drawing on partial and asynchronous evidence from video, text, and audio. We propose StreamSense, a streaming detector that couples a lightweight streaming encoder with selective routing to a Vision-Language Model (VLM) expert. StreamSense handles most timestamps with the lightweight streaming encoder, escalates hard or ambiguous cases to the VLM, and defers decisions when context is insufficient. The encoder is trained with (i) a cross-modal contrastive term that aligns visual/audio cues with textual signals, and (ii) an IoU-weighted loss that down-weights poorly overlapping target segments, mitigating label interference across segment boundaries. We evaluate StreamSense on multiple social streaming detection tasks (e.g., sentiment classification and hate content moderation); the results show that StreamSense achieves higher accuracy than VLM-only streaming while invoking the VLM only occasionally, thereby reducing average latency and compute. Our results indicate that selective escalation and deferral are effective primitives for understanding streaming social tasks. Code is publicly available on GitHub.
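The routing behavior described in the abstract reduces to a per-timestamp three-way decision. Below is a minimal Python sketch of that primitive, assuming a confidence-thresholded router; the thresholds, the minimum-context rule, and all names (route_timestamp, TAU_DEFER, TAU_ESCALATE, min_context) are illustrative assumptions, not details published in the abstract.

```python
import torch

# Hypothetical thresholds; the abstract does not report actual values.
TAU_DEFER = 0.35      # below this confidence, context is too thin to decide
TAU_ESCALATE = 0.65   # below this confidence, escalate to the VLM expert

def route_timestamp(encoder_logits: torch.Tensor,
                    context_len: int,
                    min_context: int = 8):
    """Per-timestamp decision: answer locally, escalate to the VLM, or defer.

    encoder_logits: 1-D tensor of class logits from the lightweight encoder.
    context_len:    number of frames/tokens accumulated so far.
    """
    probs = encoder_logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)
    if context_len < min_context or conf.item() < TAU_DEFER:
        return "defer", None         # insufficient context: postpone decision
    if conf.item() < TAU_ESCALATE:
        return "escalate", None      # hard/ambiguous: invoke the VLM expert
    return "local", int(pred)        # confident: encoder answers directly
```

Under this reading, average latency and compute drop because the expensive VLM call sits only on the low-confidence path, while deferral trades a short decision delay for additional context.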
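The two training terms can likewise be sketched under natural assumptions: an InfoNCE-style symmetric contrastive loss for the cross-modal alignment term, and a cross-entropy weighted by segment IoU for the boundary-robust term. The abstract does not specify the exact formulations, so both functions below are hypothetical.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(av_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE aligning audio-visual and text embeddings of the
    same timestamps; matching pairs sit on the diagonal of the logit matrix."""
    av = F.normalize(av_emb, dim=-1)
    tx = F.normalize(text_emb, dim=-1)
    logits = av @ tx.t() / temperature
    labels = torch.arange(av.size(0), device=av.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def iou_weighted_ce(logits, targets, ious):
    """Cross-entropy in which each training window is weighted by its IoU
    with the target segment, so poorly overlapping windows near segment
    boundaries contribute less gradient (mitigating label interference)."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    w = ious.clamp(0.0, 1.0)
    return (w * ce).sum() / w.sum().clamp_min(1e-6)
```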
Related papers
- StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling [27.468345201477504]
Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate low-latency actions grounded in language instructions. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with stable low latency, ensuring robustness and efficiency in real-world deployment.
arXiv Detail & Related papers (2025-07-07T17:49:41Z)
- Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization [60.73623588349311]
We propose a universal context-aware contrastive learning framework (UniCaCLF) for temporal forgery localization. Our approach leverages supervised contrastive learning to discover and identify forged instants by means of anomaly detection. An efficient context-aware contrastive coding is introduced to further push the limit of instant feature distinguishability between genuine and forged instants.
arXiv Detail & Related papers (2025-06-10T06:40:43Z)
- StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition [20.608124640950276]
We introduce StreamMind, a video LLM framework that achieves ultra-FPS streaming video processing (100 fps on a single A100). We propose a novel perception-cognition intertemporal paradigm named "event-gated LLM invocation". Experiments on Ego4D and SoccerNet streaming tasks, as well as standard offline benchmarks, demonstrate state-of-the-art performance in both model capability and real-time efficiency.
arXiv Detail & Related papers (2025-03-08T13:44:38Z)
- Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z)
- Real-time Stereo-based 3D Object Detection for Streaming Perception [12.52037626475608]
We introduce StreamDSGN, the first real-time stereo-based 3D object detection framework designed for streaming perception.
StreamDSGN directly predicts the 3D properties of objects in the next moment by leveraging historical information.
Compared with the strong baseline, StreamDSGN significantly improves the streaming average precision by up to 4.33%.
arXiv Detail & Related papers (2024-10-16T09:23:02Z)
- Learn to Compress (LtC): Efficient Learning-based Streaming Video Analytics [3.2872586139884623]
LtC is a collaborative framework between the video source and the analytics server that efficiently learns to reduce the video streams within an analytics pipeline.
LtC uses 28-35% less bandwidth and delivers up to 45% shorter response delay than recently published state-of-the-art streaming frameworks.
arXiv Detail & Related papers (2023-07-22T21:36:03Z)
- Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy in a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z)
- Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z)
- Towards Streaming Perception [70.68520310095155]
We present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception.
The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant (a minimal sketch of this alignment follows the list below).
We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations.
arXiv Detail & Related papers (2020-05-21T01:51:35Z)
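The streaming evaluation idea from Towards Streaming Perception (the last entry above) can be sketched as a time-alignment step: each ground-truth instant is scored against the most recent prediction that had finished computing by that time, so model latency shows up as staleness in the accuracy metric. The function name and signature below are illustrative assumptions, not the benchmark's actual API.

```python
from bisect import bisect_right

def align_streaming_predictions(pred_done_times, preds, gt_times):
    """At each ground-truth instant, select the latest prediction that was
    already available; slow models are thus evaluated on stale outputs.

    pred_done_times: sorted list of times at which each prediction finished.
    preds:           predictions, in the same order as pred_done_times.
    gt_times:        times of the ground-truth annotations to score against.
    """
    aligned = []
    for t in gt_times:
        i = bisect_right(pred_done_times, t) - 1  # last output ready by t
        aligned.append(preds[i] if i >= 0 else None)  # None: nothing ready yet
    return aligned
```

Standard accuracy measures (e.g., average precision) are then computed over these aligned pairs, which is what lets latency and accuracy collapse into a single number.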