While recognizing actions, LMMs struggle to detect core interaction events
- URL: http://arxiv.org/abs/2511.20162v1
- Date: Tue, 25 Nov 2025 10:38:41 GMT
- Title: While recognizing actions, LMMs struggle to detect core interaction events
- Authors: Daniel Harari, Michael Sidorov, Liel David, Chen Shterental, Abrham Kahsay Gebreselasie, Muhammad Haris Khan
- Abstract summary: We introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached. We show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends.
- Score: 18.828641379675243
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked two LMMs (Qwen-2.5VL and GPT-4o) to locate these events in short videos, each with a single event. The results show that although the models can reliably name the target objects, identify the action and provide coherent reasoning, they consistently fail to identify the frame where the interaction begins or ends and cannot localize the event within the scene. Our findings suggest that in struggling to pinpoint the moment and location of physical contact that defines the interaction, the models lack the perceptual grounding required for deeper understanding of dynamic scenes.
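Below is a minimal sketch of the kind of evaluation the abstract describes: querying an LMM for the frame index and image location of a contact/release event and scoring the prediction against a human annotation within a tolerance. The annotation fields, the `query_lmm` stub, and the tolerance values are illustrative assumptions, not the paper's actual protocol.

```python
# Hedged sketch of the evaluation loop described in the abstract.
# Annotation schema, model-query stub, and tolerances are assumptions.
from dataclasses import dataclass

@dataclass
class InteractionAnnotation:
    video_id: str
    event: str    # "contact" or "release"
    frame: int    # ground-truth frame index of the event
    x: float      # normalized image coordinates of the event
    y: float

def query_lmm(video_id: str, event: str) -> tuple[int, float, float]:
    """Placeholder: prompt an LMM (e.g. Qwen-2.5VL or GPT-4o) for the
    frame index and (x, y) location of the contact/release event."""
    raise NotImplementedError

def evaluate(annotations, frame_tol: int = 2, dist_tol: float = 0.1):
    """Fraction of events localized within temporal and spatial tolerances."""
    hits_t = hits_s = 0
    for ann in annotations:
        pred_frame, px, py = query_lmm(ann.video_id, ann.event)
        hits_t += abs(pred_frame - ann.frame) <= frame_tol
        hits_s += ((px - ann.x) ** 2 + (py - ann.y) ** 2) ** 0.5 <= dist_tol
    n = max(len(annotations), 1)
    return hits_t / n, hits_s / n
```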
Related papers
- InterRVOS: Interaction-aware Referring Video Object Segmentation [44.55538737075162]
We introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. We present InterRVOS-127K, a large-scale dataset with over 127K automatically annotated expressions, including interaction expressions annotated with distinct masks for actor and target objects.
arXiv Detail & Related papers (2025-06-03T01:16:13Z) - BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding [18.991160292960277]
BYE is a class-agnostic, per-scene point cloud encoder that removes the need for predefined categories, shape priors, or extensive association datasets. We propose an ensembling scheme combining the semantic strengths of Vision Language Models with the scene-specific expertise of BYE, achieving a 7% improvement and a 95% success rate in object association tasks.
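The summary describes the ensembling scheme only at a high level. A minimal sketch of one way such a fusion could work, assuming precomputed similarity matrices from the VLM and from the scene-specific encoder; the fusion weight and the Hungarian matching are assumptions, not BYE's published formulation.

```python
# Illustrative fusion of semantic (VLM) and scene-specific similarities
# for object association; alpha and the matching rule are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(sem_sim: np.ndarray, scene_sim: np.ndarray, alpha: float = 0.5):
    """sem_sim, scene_sim: (n_query, n_map) similarity matrices.
    Returns one-to-one query->map pairs maximizing the fused score."""
    fused = alpha * sem_sim + (1.0 - alpha) * scene_sim
    rows, cols = linear_sum_assignment(-fused)  # negate to maximize
    return list(zip(rows.tolist(), cols.tolist()))
```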
arXiv Detail & Related papers (2024-12-03T13:34:42Z) - VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying what relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z) - Glance and Focus: Memory Prompting for Multi-Event Video Question Answering [36.00733800536469]
VideoQA has emerged as a vital tool to evaluate agents' ability to understand human daily behaviors.
Humans can easily tackle it by using a series of episode memories as anchors to quickly locate question-related key moments for reasoning.
We propose the Glance-Focus model to mimic this effective reasoning strategy.
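As a toy illustration of the glance-then-focus idea, the snippet below retrieves the event memories closest to a question embedding so they can serve as anchors for frame-level reasoning. The embedding model and retrieval rule are placeholders, not the paper's architecture.

```python
# Toy memory retrieval: pick the k event memories most similar to the question.
import numpy as np

def focus(question_emb: np.ndarray, event_memories: np.ndarray, k: int = 2):
    """event_memories: (n_events, d). Returns indices of the k events whose
    memory embeddings best match the question (cosine similarity)."""
    q = question_emb / np.linalg.norm(question_emb)
    m = event_memories / np.linalg.norm(event_memories, axis=1, keepdims=True)
    return np.argsort(-(m @ q))[:k]
```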
arXiv Detail & Related papers (2024-01-03T03:51:16Z) - Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
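A crude sketch of a sequence-level selection mechanism of the kind mentioned above: score each flow-predicted mask by its overlap with masks in adjacent frames as a proxy for temporal consistency, and keep the top-scoring frames as exemplars. This is an assumption-laden simplification, not the paper's mechanism.

```python
# Select exemplar frames whose masks are temporally consistent with neighbors.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 0.0

def select_exemplars(masks: list, k: int = 5) -> list:
    """masks: per-frame boolean arrays; returns indices of the k frames
    whose masks best agree with their temporal neighbors."""
    scores = [
        np.mean([iou(masks[t], masks[u])
                 for u in (t - 1, t + 1) if 0 <= u < len(masks)])
        for t in range(len(masks))
    ]
    return list(np.argsort(scores)[::-1][:k])
```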
arXiv Detail & Related papers (2023-12-18T18:59:51Z) - Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z) - Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z) - Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Graph Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
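A hedged sketch of the second idea: restrict Transformer attention to a few selected foreground object tokens instead of all spatio-temporal locations. The token-scoring head, dimensions, and top-k rule are illustrative assumptions, not EAN's exact design.

```python
# Attend only among the top-k "foreground" tokens (illustrative).
import torch
import torch.nn as nn

class ForegroundInteraction(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4, k: int = 8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)  # foreground-ness score per token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (batch, n_tokens, dim) candidate object features."""
        scores = self.score(tokens).squeeze(-1)            # (B, N)
        idx = scores.topk(self.k, dim=1).indices           # (B, k)
        sel = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        out, _ = self.attn(sel, sel, sel)  # interactions among selected objects
        return out
```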
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Event-based Motion Segmentation with Spatio-Temporal Graph Cuts [51.17064599766138]
We have developed a method to identify independently moving objects in data acquired with an event-based camera.
The method performs on par or better than the state of the art without having to predetermine the number of expected moving objects.
arXiv Detail & Related papers (2020-12-16T04:06:02Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self- and cross-integration for different sources (video and dense captions), and gates that pass the most relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
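A toy sketch of a gating step like the one described above: a learned sigmoid gate decides how much of each source (video vs. dense captions) to pass forward. The pooled-feature inputs and the fusion rule are assumptions for illustration.

```python
# Gated fusion of two pooled source representations (illustrative).
import torch
import torch.nn as nn

class SourceGate(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, video: torch.Tensor, captions: torch.Tensor):
        """video, captions: (batch, dim) pooled features per source."""
        g = self.gate(torch.cat([video, captions], dim=-1))
        return g * video + (1.0 - g) * captions  # relevance-weighted mix
```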
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.