Object-Region Video Transformers
- URL: http://arxiv.org/abs/2110.06915v1
- Date: Wed, 13 Oct 2021 17:51:46 GMT
- Title: Object-Region Video Transformers
- Authors: Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal
Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
- Abstract summary: We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
- Score: 100.23380634952083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evidence from cognitive psychology suggests that understanding
spatio-temporal object interactions and dynamics can be essential for
recognizing actions in complex videos. Therefore, action recognition models are
expected to benefit from explicit modeling of objects, including their
appearance, interaction, and dynamics. Recently, video transformers have shown
great success in video understanding, exceeding CNN performance. Yet, existing
video transformer models do not explicitly model objects. In this work, we
present Object-Region Video Transformers (ORViT), an object-centric
approach that extends video transformer layers with a block that directly
incorporates object representations. The key idea is to fuse object-centric
spatio-temporal representations throughout multiple transformer layers. Our
ORViT block consists of two object-level streams: appearance and dynamics. In
the appearance stream, an "Object-Region Attention" element applies
self-attention over the patches and object regions. In this way, visual
object regions interact with uniform patch tokens and enrich them with
contextualized object information. We further model object dynamics via a
separate "Object-Dynamics Module", which captures trajectory interactions,
and show how to integrate the two streams. We evaluate our model on standard
and compositional action recognition on Something-Something V2, standard action
recognition on Epic-Kitchen100 and Diving48, and spatio-temporal action
detection on AVA. We show strong improvement in performance across all tasks
and datasets considered, demonstrating the value of a model that incorporates
object representations into a transformer architecture. For code and pretrained
models, visit the project page at https://roeiherz.github.io/ORViT/.
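To make the two-stream design concrete, below is a minimal PyTorch sketch of an ORViT-style block. It is not the authors' implementation (see the project page above for the official code): the use of torchvision's roi_align for object-region pooling, the box-coordinate MLP standing in for the Object-Dynamics Module, and all names and hyperparameters (e.g. ORViTBlockSketch, dim=768, num_objects=4) are illustrative assumptions.

# A minimal, illustrative PyTorch sketch of an ORViT-style block; not the authors' code.
# Assumptions (hypothetical): patch tokens come from a T x H x W grid, per-frame object
# boxes are given in patch-grid coordinates, and torchvision's roi_align stands in for
# the paper's object-region pooling. Names and hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class ORViTBlockSketch(nn.Module):
    def __init__(self, dim=768, num_heads=8, num_objects=4):
        super().__init__()
        self.num_objects = num_objects
        # Appearance stream: patch tokens attend over patches plus object-region tokens.
        self.region_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Dynamics stream: a small MLP over box trajectories (x1, y1, x2, y2) per frame.
        self.dynamics = nn.Sequential(nn.Linear(4, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens, boxes, T, H, W):
        # patch_tokens: (B, T*H*W, dim); boxes: (B, T, num_objects, 4).
        B, _, D = patch_tokens.shape
        feat = patch_tokens.reshape(B * T, H, W, D).permute(0, 3, 1, 2)   # (B*T, D, H, W)
        rois = list(boxes.reshape(B * T, self.num_objects, 4))
        obj = roi_align(feat, rois, output_size=1)                        # (B*T*O, D, 1, 1)
        obj_tokens = obj.flatten(1).reshape(B, T * self.num_objects, D)
        # Object-dynamics stream: trajectory information from box coordinates alone.
        obj_tokens = obj_tokens + self.dynamics(boxes.reshape(B, T * self.num_objects, 4))
        # "Object-Region Attention": patch queries over patch + object keys/values.
        context = torch.cat([patch_tokens, obj_tokens], dim=1)
        attended, _ = self.region_attn(patch_tokens, context, context)
        return self.norm(patch_tokens + attended)                         # enriched patch tokens


# Toy usage: B=1, T=2 frames, an 8x8 patch grid, 4 objects with valid (x1, y1, x2, y2) boxes.
xy1 = torch.rand(1, 2, 4, 2) * 4
boxes = torch.cat([xy1, xy1 + torch.rand(1, 2, 4, 2) * 4], dim=-1)
block = ORViTBlockSketch()
out = block(torch.randn(1, 2 * 8 * 8, 768), boxes, T=2, H=8, W=8)
print(out.shape)  # torch.Size([1, 128, 768])

In this sketch the dynamics stream sees only box coordinates, so it captures trajectory information independently of appearance, and the fused object tokens enrich the uniform patch tokens through the joint attention step, mirroring the two-stream description in the abstract.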
Related papers
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z) - ROAM: Robust and Object-Aware Motion Generation Using Neural Pose
Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z) - Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z) - EgoViT: Pyramid Video Transformer for Egocentric Action Recognition [18.05706639179499]
Capturing interaction of hands with objects is important to autonomously detect human actions from egocentric videos.
We present a pyramid video transformer with a dynamic class token generator for egocentric action recognition.
arXiv Detail & Related papers (2023-03-15T20:33:50Z) - Interaction Region Visual Transformer for Egocentric Action Anticipation [18.873728614415946]
We propose a novel way to represent human-object interactions for egocentric action anticipation.
We model interactions between hands and objects using Spatial Cross-Attention.
We then infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens.
Using these tokens, we construct an interaction-centric video representation for action anticipation.
arXiv Detail & Related papers (2022-11-25T15:00:51Z) - Is an Object-Centric Video Representation Beneficial for Transfer? [86.40870804449737]
We introduce a new object-centric video recognition model on a transformer architecture.
We show that the object-centric model outperforms prior video representations.
arXiv Detail & Related papers (2022-07-20T17:59:44Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric
Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.