OSCaR: Object State Captioning and State Change Representation
- URL: http://arxiv.org/abs/2402.17128v4
- Date: Tue, 2 Apr 2024 23:14:42 GMT
- Title: OSCaR: Object State Captioning and State Change Representation
- Authors: Nguyen Nguyen, Jing Bi, Ali Vosoughi, Yapeng Tian, Pooyan Fazli, Chenliang Xu
- Abstract summary: This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
- Score: 52.13461424520107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The capability of intelligent models to extrapolate and comprehend changes in object states is a crucial yet demanding aspect of AI research, particularly through the lens of human interaction in real-world settings. This task involves describing complex visual environments, identifying active objects, and interpreting their changes as conveyed through language. Traditional methods, which isolate object captioning and state change detection, offer a limited view of dynamic environments. Moreover, relying on a small set of symbolic words to represent changes has restricted the expressiveness of the language. To address these challenges, in this paper, we introduce the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark. OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections. It sets a new testbed for evaluating multimodal large language models (MLLMs). Our experiments demonstrate that while MLLMs show some skill, they lack a full understanding of object state changes. The benchmark includes a fine-tuned model that, despite initial capabilities, requires significant improvements in accuracy and generalization ability for effective understanding of these changes. Our code and dataset are available at https://github.com/nguyennm1024/OSCaR.
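To make the evaluation setting concrete, below is a minimal sketch of how a benchmark like OSCaR might be scored: iterate over annotated video segments, query an MLLM for a description of the object state change, and compare it against the reference caption. The annotation fields (`video`, `start_sec`, `end_sec`, `reference`), the `caption_segment` stub, and the token-overlap metric are illustrative assumptions, not the dataset's actual schema or metrics.

```python
# Minimal evaluation-loop sketch for object state captioning.
# Assumptions (not taken from the paper): annotations are a JSON list with
# "video", "start_sec", "end_sec", and "reference" fields, and
# `caption_segment` stands in for whatever MLLM is being evaluated.
import json
from pathlib import Path


def caption_segment(video_path: str, start_sec: float, end_sec: float) -> str:
    """Placeholder for an MLLM call that describes the object state change."""
    return "the lid of the jar is removed"  # stub prediction


def token_f1(prediction: str, reference: str) -> float:
    """Crude token-overlap F1; a real evaluation would use captioning metrics."""
    pred, ref = set(prediction.lower().split()), set(reference.lower().split())
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(annotation_file: str) -> float:
    segments = json.loads(Path(annotation_file).read_text())
    scores = [
        token_f1(
            caption_segment(seg["video"], seg["start_sec"], seg["end_sec"]),
            seg["reference"],
        )
        for seg in segments
    ]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Expects a local annotation file in the assumed format described above.
    annotations = Path("oscar_annotations.json")
    if annotations.exists():
        print(f"mean token F1: {evaluate(str(annotations)):.3f}")
```

The released data and evaluation code at the GitHub link above define the actual format and metrics; the sketch only illustrates the overall loop of prediction followed by comparison against reference descriptions.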
Related papers
- Do Pre-trained Vision-Language Models Encode Object States? [13.4206464539947]
We investigate if vision-language models (VLMs) learn to encode object states on web-scale data.
We evaluate nine open-source VLMs, including models trained with contrastive and generative objectives.
We identify three areas for improvement to better encode object states.
arXiv Detail & Related papers (2024-09-16T17:22:18Z) - Learning Multiple Object States from Actions via Large Language Models [15.053419817253145]
We formulate object state recognition as a multi-label classification task that explicitly handles multiple states.
We generate pseudo-labels from the transcribed narrations, capturing the influence of past states.
Our model trained on LLM-generated pseudo-labels significantly outperforms strong vision-language models (a toy sketch of this multi-label setup appears after the related papers list).
arXiv Detail & Related papers (2024-05-02T08:43:16Z) - Learning State-Invariant Representations of Objects from Image Collections with State, Pose, and Viewpoint Changes [0.6577148087211809]
We present a novel dataset, ObjectsWithStateChange, that captures state and pose variations in the object images recorded from arbitrary viewpoints.
The goal of such research would be to train models capable of generating object embeddings that remain invariant to state changes.
We propose a curriculum learning strategy that uses the similarity relationships in the learned embedding space after each epoch to guide the training process.
arXiv Detail & Related papers (2024-04-09T17:17:48Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - DesCo: Learning Object Recognition with Rich Language Descriptions [93.8177229428617]
Recent developments in vision-language approaches have instigated a paradigm shift in learning visual recognition models from language supervision.
We propose a new description-conditioned (DesCo) paradigm of learning object recognition models with rich language descriptions.
arXiv Detail & Related papers (2023-06-24T21:05:02Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Universal Instance Perception as Object Discovery and Retrieval [90.96031157557806]
UNI reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm.
It can flexibly perceive different types of objects by simply changing the input prompts.
UNI shows superior performance on 20 challenging benchmarks from 10 instance-level tasks.
arXiv Detail & Related papers (2023-03-12T14:28:24Z) - Generic Event Boundary Captioning: A Benchmark for Status Changes Understanding [22.618840285782127]
We introduce a new dataset called Kinetic-GEBC (Generic Event Boundary Captioning).
The dataset consists of over 170k boundaries associated with captions describing status changes in 12K videos.
We propose three tasks supporting the development of a more fine-grained, robust, and human-like understanding of videos through status changes.
arXiv Detail & Related papers (2022-04-01T14:45:30Z) - COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
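As a deliberately simplified illustration of the multi-label object state formulation summarized in "Learning Multiple Object States from Actions via Large Language Models" above, the sketch below trains a linear multi-label head with a binary cross-entropy loss on pseudo-labels. The state vocabulary, feature dimensionality, frozen feature extractor, and pseudo-label source are all assumptions for illustration; this is not that paper's implementation.

```python
# Toy sketch: multi-label object state classification trained on pseudo-labels.
# Shapes, the state set, and the features/pseudo-labels below are assumptions.
import torch
import torch.nn as nn

NUM_STATES = 8      # e.g. open, closed, empty, full, ... (hypothetical set)
FEATURE_DIM = 512   # assumed dimensionality of frozen video/frame features


class StateClassifier(nn.Module):
    """Linear multi-label head over precomputed visual features."""

    def __init__(self, feature_dim: int, num_states: int):
        super().__init__()
        self.head = nn.Linear(feature_dim, num_states)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.head(features)  # one logit per state


model = StateClassifier(FEATURE_DIM, NUM_STATES)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # each state is an independent binary target

# Dummy batch standing in for frame features and narration-derived pseudo-labels.
features = torch.randn(16, FEATURE_DIM)
pseudo_labels = torch.randint(0, 2, (16, NUM_STATES)).float()

for step in range(100):
    optimizer.zero_grad()
    loss = criterion(model(features), pseudo_labels)
    loss.backward()
    optimizer.step()

# At inference, sigmoid(logits) > 0.5 marks which states are predicted present.
predicted_states = (torch.sigmoid(model(features)) > 0.5).int()
```

Because each state is treated as an independent binary target, an object can be predicted as, say, both "open" and "empty" at once, which is what distinguishes this setup from single-label state classification.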