Is an Object-Centric Video Representation Beneficial for Transfer?
- URL: http://arxiv.org/abs/2207.10075v1
- Date: Wed, 20 Jul 2022 17:59:44 GMT
- Title: Is an Object-Centric Video Representation Beneficial for Transfer?
- Authors: Chuhan Zhang, Ankush Gupta, Andrew Zisserman
- Abstract summary: We introduce a new object-centric video recognition model based on a transformer architecture.
We show that the object-centric model outperforms prior video representations.
- Score: 86.40870804449737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this work is to learn an object-centric video
representation, with the aim of improving transferability to novel tasks, i.e.,
tasks different from the pre-training task of action classification. To this
end, we introduce a new object-centric video recognition model based on a
transformer architecture. The model learns a set of object-centric summary
vectors for the video, and uses these vectors to fuse the visual and
spatio-temporal trajectory "modalities" of the video clip. We also introduce a
novel trajectory contrast loss to further enhance objectness in these summary
vectors. With experiments on four datasets -- SomethingSomething-V2,
SomethingElse, Action Genome and EpicKitchens -- we show that the
object-centric model outperforms prior video representations (both
object-agnostic and object-aware), when: (1) classifying actions on unseen
objects and unseen environments; (2) low-shot learning to novel classes; (3)
linear probe to other downstream tasks; as well as (4) for standard action
classification.
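The fusion mechanism described in the abstract can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation: the shapes, the single cross-attention step, and the InfoNCE-style form of the trajectory contrast loss are assumptions for illustration, but they capture the idea of learned summary vectors attending over visual and trajectory tokens, with matched summary pairs pulled together by a contrastive loss.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, tokens):
    # queries: (S, D) summary vectors; tokens: (N, D) modality tokens.
    # Scaled dot-product attention, one head, no learned projections (sketch only).
    scores = queries @ tokens.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ tokens          # (S, D)

def fuse(summaries, visual_tokens, traj_tokens):
    # Each summary vector gathers evidence from the concatenated
    # visual and trajectory token sets, with a residual connection.
    ctx = np.concatenate([visual_tokens, traj_tokens], axis=0)
    return summaries + cross_attend(summaries, ctx)   # (S, D)

def trajectory_contrast_loss(summ_a, summ_b, tau=0.1):
    # InfoNCE-style contrast (an assumed form): the i-th summary from one
    # view should match the i-th summary from the other view; all other
    # pairings act as negatives.
    a = summ_a / np.linalg.norm(summ_a, axis=-1, keepdims=True)
    b = summ_b / np.linalg.norm(summ_b, axis=-1, keepdims=True)
    logits = a @ b.T / tau                            # (S, S) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Toy usage: 4 summary vectors, 16-d features, two token streams.
rng = np.random.default_rng(0)
summaries = rng.standard_normal((4, 16))
visual = rng.standard_normal((10, 16))
traj = rng.standard_normal((6, 16))
fused = fuse(summaries, visual, traj)
loss = trajectory_contrast_loss(fused, fused)
```

Matched summary pairs yield a low loss, while permuting one side (breaking the correspondence) drives the loss up, which is the objectness-enhancing signal the abstract refers to.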
Related papers
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
- MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare [84.80956484848505]
MegaPose is a method to estimate the 6D pose of novel objects, that is, objects unseen during training.
We present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects.
Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner.
arXiv Detail & Related papers (2022-12-13T19:30:03Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvements in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z)
- Disentangling What and Where for 3D Object-Centric Representations Through Active Inference [4.088019409160893]
We propose an active inference agent that can learn novel object categories over time.
We show that our agent is able to learn representations for many object categories in an unsupervised way.
We validate our system in an end-to-end fashion where the agent is able to search for an object at a given pose from a pixel-based rendering.
arXiv Detail & Related papers (2021-08-26T12:49:07Z)
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.