OSCaR: Object State Captioning and State Change Representation
- URL: http://arxiv.org/abs/2402.17128v4
- Date: Tue, 2 Apr 2024 23:14:42 GMT
- Title: OSCaR: Object State Captioning and State Change Representation
- Authors: Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu
- Abstract summary: This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
- Score: 52.13461424520107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.
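To make the evaluation setting concrete, below is a minimal sketch of how a benchmark like OSCaR might be scored: iterate over annotated video segments, query an MLLM for a description of the object state change, and compare it against the reference caption. The annotation fields (`video`, `start_sec`, `end_sec`, `reference`), the `caption_segment` stub, and the token-overlap metric are illustrative assumptions, not the dataset's actual schema or metrics.

```python
# Minimal evaluation-loop sketch for object state captioning.
# Assumptions (not taken from the paper): annotations are a JSON list with
# "video", "start_sec", "end_sec", and "reference" fields, and
# `caption_segment` stands in for whatever MLLM is being evaluated.
import json
from pathlib import Path


def caption_segment(video_path: str, start_sec: float, end_sec: float) -> str:
    """Placeholder for an MLLM call that describes the object state change."""
    return "the lid of the jar is removed"  # stub prediction


def token_f1(prediction: str, reference: str) -> float:
    """Crude token-overlap F1; a real evaluation would use captioning metrics."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(annotation_file: str) -> float:
    segments = json.loads(Path(annotation_file).read_text())
    scores = [
        token_f1(
            caption_segment(seg["video"], seg["start_sec"], seg["end_sec"]),
            seg["reference"],
        )
        for seg in segments
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Expects a local annotation file in the assumed format described above.
    annotations = Path("oscar_annotations.json")
    if annotations.exists():
        print(f"mean token F1: {evaluate(str(annotations)):.3f}")
```

The released data and evaluation code at the GitHub link above define the actual format and metrics; the sketch only illustrates the overall loop of prediction followed by comparison against reference descriptions.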
Related papers
- Do Pre-trained Vision-Language Models Encode Object States? [13.4206464539947]
We investigate if vision-language models (VLMs) learn to encode object states on web-scale data.
We evaluate nine open-source VLMs, including models trained with contrastive and generative objectives.
We identify three areas for improvement to better encode object states.
arXiv Detail & Related papers (2024-09-16T17:22:18Z) - Learning Multiple Object States from Actions via Large Language Models [15.053419817253145]
We formulate object state recognition as a multi-label classification task that explicitly handles multiple states.
We generate pseudo-labels from the transcribed narrations, capturing the influence of past states.
Our model trained on LLM-generated pseudo-labels significantly outperforms strong vision-language models (a toy sketch of this multi-label setup appears after the related papers list).
arXiv Detail & Related papers (2024-05-02T08:43:16Z) - Learning State-Invariant Representations of Objects from Image Collections with State, Pose, and Viewpoint Changes [0.6577148087211809]
We present a novel dataset, ObjectsWithStateChange, that captures state and pose variations in the object images recorded from arbitrary viewpoints.
The goal of such research would be to train models capable of generating object embeddings that remain invariant to state changes.
We propose a curriculum learning strategy that uses the similarity relationships in the learned embedding space after each epoch to guide the training process.
arXiv Detail & Related papers (2024-04-09T17:17:48Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Universal Instance Perception as Object Discovery and Retrieval [90.96031157557806]
UNI reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm.
It can flexibly perceive different types of objects by simply changing the input prompts.
UNI shows superior performance on 20 challenging benchmarks from 10 instance-level tasks.
arXiv Detail & Related papers (2023-03-12T14:28:24Z) - Generic Event Boundary Captioning: A Benchmark for Status Changes Understanding [22.618840285782127]
We introduce a new dataset called Kinetic-GEBC (Generic Event Boundary Captioning).
The dataset consists of over 170k boundaries associated with captions describing status changes in 12K videos.
We propose three tasks supporting the development of a more fine-grained, robust, and human-like understanding of videos through status changes.
arXiv Detail & Related papers (2022-04-01T14:45:30Z) - COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
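As a deliberately simplified illustration of the multi-label object state formulation summarized in "Learning Multiple Object States from Actions via Large Language Models" above, the sketch below trains a linear multi-label head with a binary cross-entropy loss on pseudo-labels. The state vocabulary, feature dimensionality, frozen feature extractor, and pseudo-label source are all assumptions for illustration; this is not that paper's implementation.

```python
# Toy sketch: multi-label object state classification trained on pseudo-labels.
# Shapes, the state set, and the features/pseudo-labels below are assumptions.
import torch
import torch.nn as nn

NUM_STATES = 8      # e.g. open, closed, empty, full, ... (hypothetical set)
FEATURE_DIM = 512   # assumed dimensionality of frozen video/frame features


class StateClassifier(nn.Module):
    """Linear multi-label head over precomputed visual features."""

    def __init__(self, feature_dim: int, num_states: int):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_states)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features)  # one logit per state


model = StateClassifier(FEATURE_DIM, NUM_STATES)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # each state is an independent binary target

# Dummy batch standing in for frame features and narration-derived pseudo-labels.
features = torch.randn(16, FEATURE_DIM)
pseudo_labels = torch.randint(0, 2, (16, NUM_STATES)).float()

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(features), pseudo_labels)
    loss.backward()
    optimizer.step()

# At inference, sigmoid(logits) > 0.5 marks which states are predicted present.
predicted_states = (torch.sigmoid(model(features)) > 0.5).int()
```

Because each state is treated as an independent binary target, an object can be predicted as, say, both "open" and "empty" at once, which is what distinguishes this setup from single-label state classification.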