Zero-Shot Action Recognition from Diverse Object-Scene Compositions
- URL: http://arxiv.org/abs/2110.13479v1
- Date: Tue, 26 Oct 2021 08:23:14 GMT
- Title: Zero-Shot Action Recognition from Diverse Object-Scene Compositions
- Authors: Carlo Bretti and Pascal Mettes
- Abstract summary: This paper investigates the problem of zero-shot action recognition, in the setting where no training videos with seen actions are available.
For this challenging scenario, the current leading approach is to transfer knowledge from the image domain by recognizing objects in videos using pre-trained networks.
Whereas objects provide a local view of the content in videos, in this work we also seek to include a global view of the scene in which actions occur.
We find that scenes on their own are also capable of recognizing unseen actions, albeit more marginally than objects, and that a direct combination of object-based and scene-based scores degrades the action recognition performance.
- Score: 15.942187254262091
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the problem of zero-shot action recognition, in the setting where no training videos with seen actions are available. For this challenging scenario, the current leading approach is to transfer knowledge from the image domain by recognizing objects in videos using pre-trained networks, followed by a semantic matching between objects and actions. Whereas objects provide a local view of the content in videos, in this work we also seek to include a global view of the scene in which actions occur. We find that scenes on their own are also capable of recognizing unseen actions, albeit more marginally than objects, and that a direct combination of object-based and scene-based scores degrades the action recognition performance. To get the best out of objects and scenes, we propose to construct them as a Cartesian product of all possible compositions. We outline how to determine the likelihood of object-scene compositions in videos, as well as a semantic matching from object-scene compositions to actions that enforces diversity among the most relevant compositions for each action. While simple, our composition-based approach outperforms object-based approaches and even state-of-the-art zero-shot approaches that rely on large-scale video datasets with hundreds of seen actions for training and knowledge transfer.
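To make the composition idea concrete, here is a minimal sketch of such a pipeline, assuming per-video object and scene probabilities from pre-trained networks and word embeddings for all class labels. The mean-pooled composition embedding, the no-reuse diversity rule, and every name in the code (p_obj, diverse_top_k, ...) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Stand-in inputs; in practice these come from pre-trained object and
# scene classifiers and from word embeddings of the class labels. ---
O, S, D = 6, 4, 50                       # num. objects, num. scenes, embed dim
p_obj   = rng.dirichlet(np.ones(O))      # object probabilities for one video
p_scene = rng.dirichlet(np.ones(S))      # scene probabilities for one video
obj_embs   = rng.normal(size=(O, D))     # label embeddings for objects
scene_embs = rng.normal(size=(S, D))     # label embeddings for scenes
action_emb = rng.normal(size=D)          # label embedding of an unseen action

# 1) Likelihood of every object-scene composition:
#    the Cartesian product of object and scene scores.
comp_lik = np.outer(p_obj, p_scene)                                # (O, S)

# 2) Semantic matching: cosine similarity between each composition
#    (here: the mean of its two label embeddings) and the action.
comp_emb = 0.5 * (obj_embs[:, None, :] + scene_embs[None, :, :])   # (O, S, D)
comp_emb /= np.linalg.norm(comp_emb, axis=-1, keepdims=True)
sim = comp_emb @ (action_emb / np.linalg.norm(action_emb))         # (O, S)

# 3) Diversity among the most relevant compositions: greedily keep the
#    top-k matches while never reusing an object or a scene.
def diverse_top_k(sim, k=3):
    chosen, used_o, used_s = [], set(), set()
    order = np.argsort(sim, axis=None)[::-1]
    for o, s in zip(*np.unravel_index(order, sim.shape)):
        if o not in used_o and s not in used_s:
            chosen.append((o, s)); used_o.add(o); used_s.add(s)
        if len(chosen) == k:
            break
    return chosen

# 4) Zero-shot action score: composition likelihoods weighted by relevance.
score = sum(comp_lik[o, s] * sim[o, s] for o, s in diverse_top_k(sim))
print(f"score for the unseen action: {score:.4f}")
```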
Related papers
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition [21.655278000690686]
We propose an end-to-end object-centric action recognition framework.
It simultaneously performs Detection And Interaction Reasoning in one stage.
We conduct experiments on two datasets, Something-Else and Ikea-Assembly.
arXiv Detail & Related papers (2024-04-18T05:06:12Z)
- LOCATE: Self-supervised Object Discovery via Flow-guided Graph-cut and Bootstrapped Self-training [13.985488693082981]
We propose a self-supervised object discovery approach that leverages motion and appearance information to produce high-quality object segmentation masks.
We demonstrate the effectiveness of our approach, named LOCATE, on multiple standard video object segmentation, image saliency detection, and object segmentation benchmarks.
arXiv Detail & Related papers (2023-08-22T07:27:09Z)
- Hyperbolic Contrastive Learning for Visual Representations beyond Objects [30.618032825306187]
We focus on learning representations for objects and scenes that preserve the structure among them.
Motivated by the observation that visually similar objects are close in the representation space, we argue that scenes and objects should instead follow a hierarchical structure.
arXiv Detail & Related papers (2022-12-01T16:58:57Z)
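As a side note on the entry above: hierarchy is precisely what hyperbolic space encodes well, since distances grow rapidly toward the ball's boundary. Below is a minimal sketch of the standard Poincaré-ball distance that such methods build on; the example embeddings and their scene/object interpretation are illustrative assumptions, not the paper's actual objective.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-7):
    """Geodesic distance between points inside the unit (Poincare) ball."""
    duv = np.sum((u - v) ** 2, axis=-1)
    den = (1.0 - np.sum(u * u, axis=-1)) * (1.0 - np.sum(v * v, axis=-1))
    return np.arccosh(1.0 + 2.0 * duv / (den + eps))

# Distances blow up near the boundary, so general concepts (e.g., a scene)
# can sit near the origin while specific ones (objects) sit near the edge.
scene = np.array([0.05, 0.02])   # hypothetical "kitchen" embedding
obj   = np.array([0.70, 0.40])   # hypothetical "blender" embedding
print(poincare_distance(scene, obj))
```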
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
- Learning Visual Affordance Grounding from Demonstration Videos [76.46484684007706]
Affordance grounding aims to segment all possible interaction regions between people and objects from an image/video.
We propose a Hand-aided Affordance Grounding Network (HAGNet) that leverages the aided clues provided by the position and action of the hand in demonstration videos.
arXiv Detail & Related papers (2021-08-12T11:45:38Z)
- Motion Guided Attention Fusion to Recognize Interactions from Videos [40.1565059238891]
We present a dual-pathway approach for recognizing fine-grained interactions from videos.
We fuse the bottom-up features in the motion pathway with features captured from object detections to learn the temporal aspects of an action.
We show that our approach can generalize across appearance effectively and recognize actions where an actor interacts with previously unseen objects.
arXiv Detail & Related papers (2021-04-01T17:44:34Z)
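For intuition on the entry above, here is a generic sketch of motion-guided attention over per-frame features, fusing a motion pathway with pooled object-detection features; all shapes, names, and the concatenation-based fusion are assumptions for illustration, not the authors' architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-frame features for one clip of T frames.
T, D = 8, 16
motion_feats = np.random.default_rng(1).normal(size=(T, D))  # motion pathway
object_feats = np.random.default_rng(2).normal(size=(T, D))  # pooled detections

# Motion-guided attention: motion features decide which frames' object
# features matter, then the two pathways are fused by concatenation.
attn = softmax(motion_feats @ object_feats.T / np.sqrt(D))          # (T, T)
attended_objects = attn @ object_feats                              # (T, D)
fused = np.concatenate([motion_feats, attended_objects], axis=-1)   # (T, 2D)
print(fused.shape)  # clip representation handed to the classifier
```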
- Neural Scene Graphs for Dynamic Scenes [57.65413768984925]
We present the first neural rendering method that decomposes dynamic scenes into scene graphs.
We learn implicitly encoded scenes, combined with a jointly learned latent representation to describe objects with a single implicit function.
arXiv Detail & Related papers (2020-11-20T12:37:10Z)
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.