Learning Multiple Object States from Actions via Large Language Models
- URL: http://arxiv.org/abs/2405.01090v2
- Date: Wed, 06 Nov 2024 11:44:50 GMT
- Title: Learning Multiple Object States from Actions via Large Language Models
- Authors: Masatoshi Tateno, Takuma Yagi, Ryosuke Furuta, Yoichi Sato
- Abstract summary: We formulate object state recognition as a multi-label classification task that explicitly handles multiple states.
We generate pseudo-labels from the transcribed narrations, capturing the influence of past states.
Our model trained on LLM-generated pseudo-labels significantly outperforms strong vision-language models.
- Score: 15.053419817253145
- Abstract: Recognizing the states of objects in a video is crucial for understanding the scene beyond actions and objects. For instance, an egg can be raw, cracked, and whisked while cooking an omelet, and these states can coexist (an egg can be both raw and whisked). However, most existing research assumes a single object state change (e.g., uncracked -> cracked), overlooking the coexisting nature of multiple object states and the influence of past states on the current state. We formulate object state recognition as a multi-label classification task that explicitly handles multiple states. We then propose to learn multiple object states from narrated videos by leveraging large language models (LLMs) to generate pseudo-labels from the transcribed narrations, capturing the influence of past states. The challenge is that narrations mostly describe human actions in the video but rarely explain object states. Therefore, we use the LLMs' knowledge of the relationship between actions and states to derive the missing object states. We further accumulate the derived object states as past-state context when inferring the current object state pseudo-labels. We collect a new dataset, the Multiple Object States Transition (MOST) dataset, which includes manual multi-label annotation for evaluation purposes, covering 60 object states across six object categories. Experimental results show that our model trained on LLM-generated pseudo-labels significantly outperforms strong vision-language models, demonstrating the effectiveness of our pseudo-labeling framework that considers past context via LLMs.
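To make the pseudo-labeling pipeline concrete, below is a minimal Python sketch of the idea described in the abstract: an LLM is prompted with each clip's transcribed narration together with the object states accumulated so far, and its answer is converted into a multi-label target vector. The state vocabulary, the prompt wording, and the `query_llm` helper are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of LLM-based multi-label pseudo-labeling with accumulated
# past-state context. All names (STATE_VOCAB, query_llm, the prompt wording)
# are assumptions for illustration, not the paper's actual code.
from typing import List, Set

# Hypothetical state vocabulary for one object category (the MOST dataset
# covers 60 states across six object categories).
STATE_VOCAB: List[str] = ["raw", "cracked", "whisked", "cooked"]


def query_llm(prompt: str) -> Set[str]:
    """Placeholder for an LLM call returning the states implied by the prompt.
    Replace with any chat-completion API."""
    raise NotImplementedError


def pseudo_label_video(narrations: List[str], obj: str) -> List[List[int]]:
    """Derive multi-label pseudo-labels for each narrated clip, carrying the
    accumulated past states forward as context for the LLM."""
    past_states: Set[str] = set()
    labels: List[List[int]] = []
    for narration in narrations:
        context = ", ".join(sorted(past_states)) or "unknown"
        prompt = (
            f"The object is '{obj}'. States observed so far: {context}.\n"
            f'The narration says: "{narration}".\n'
            f"Which of {STATE_VOCAB} hold for the object after this action? "
            "Answer with a comma-separated list."
        )
        current = query_llm(prompt) & set(STATE_VOCAB)  # keep only known states
        past_states |= current  # accumulate past-state context
        labels.append([1 if s in current else 0 for s in STATE_VOCAB])
    return labels
```

The resulting binary vectors would typically serve as targets for a multi-label video classifier trained with a per-state binary cross-entropy loss.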
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - VOVTrack: Exploring the Potentiality in Videos for Open-Vocabulary Object Tracking [61.56592503861093]
Open-vocabulary multi-object tracking (OVMOT) amalgamates the complexities of open-vocabulary object detection (OVD) and multi-object tracking (MOT).
Existing approaches to OVMOT often merge OVD and MOT methodologies as separate modules, predominantly focusing on the problem through an image-centric lens.
We propose VOVTrack, a novel method that integrates object states relevant to MOT and video-centric training to address this challenge from a video object tracking standpoint.
arXiv Detail & Related papers (2024-10-11T05:01:49Z) - Do Pre-trained Vision-Language Models Encode Object States? [13.4206464539947]
We investigate if vision-language models (VLMs) learn to encode object states on web-scale data.
We evaluate nine open-source VLMs, including models trained with contrastive and generative objectives.
We identify three areas of improvement for better encoding of object states.
arXiv Detail & Related papers (2024-09-16T17:22:18Z) - Learning State-Invariant Representations of Objects from Image Collections with State, Pose, and Viewpoint Changes [0.6577148087211809]
We present a novel dataset, ObjectsWithStateChange, that captures state and pose variations in the object images recorded from arbitrary viewpoints.
The goal of such research would be to train models capable of generating object embeddings that remain invariant to state changes.
We propose a curriculum learning strategy that uses the similarity relationships in the learned embedding space after each epoch to guide the training process.
arXiv Detail & Related papers (2024-04-09T17:17:48Z) - OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z) - Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z) - Exploiting Unlabeled Data with Vision and Language Models for Object Detection [64.94365501586118]
Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets.
We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images.
We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection and semi-supervised object detection.
arXiv Detail & Related papers (2022-07-18T21:47:15Z) - SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z) - COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
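For reference, the contextualized-embedding idea described in the COBE entry above can be sketched as follows: a detector's region feature is projected into the language-embedding space and pulled toward the contextualized embedding of the object word within its narration. The feature dimensions, the projection head, and the cosine loss are assumptions for illustration, not the paper's exact design.

```python
# A minimal sketch, under assumed dimensions, of predicting a contextualized
# word embedding from a visual region feature (the COBE training idea above).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualEmbeddingHead(nn.Module):
    """Projects a detector's region feature into the language-embedding space."""

    def __init__(self, visual_dim: int = 1024, text_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(visual_dim, text_dim)

    def forward(self, region_feat: torch.Tensor) -> torch.Tensor:
        # L2-normalize so training reduces to cosine-similarity matching.
        return F.normalize(self.proj(region_feat), dim=-1)


def embedding_loss(pred: torch.Tensor, target_word_emb: torch.Tensor) -> torch.Tensor:
    """Cosine loss pulling the predicted embedding toward the contextualized
    embedding of the object word in the associated narration."""
    target = F.normalize(target_word_emb, dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()
```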
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.