SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric
Action Recognition
- URL: http://arxiv.org/abs/2204.04796v1
- Date: Sun, 10 Apr 2022 23:27:19 GMT
- Title: SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric
Action Recognition
- Authors: Victor Escorcia, Ricardo Guerrero, Xiatian Zhu, Brais Martinez
- Abstract summary: We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
- Score: 35.4163266882568
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Learning an egocentric action recognition model from video data is
challenging due to distractors (e.g., irrelevant objects) in the background.
Integrating object information into an action model is hence beneficial.
Existing methods often leverage a generic object detector to
identify and represent the objects in the scene. However, several important
issues remain. High-quality object class annotations for the target domain
(dataset) are still required to learn a good object representation. Moreover,
previous methods are deeply coupled with existing action models and must be
retrained jointly with the object representation, leading to costly and
inflexible integration. To overcome both limitations, we introduce Self-Supervised
Learning Over Sets (SOS), an approach to pre-train a generic Objects In Contact
(OIC) representation model from video object regions detected by an
off-the-shelf hand-object contact detector. Instead of augmenting object
regions individually as in conventional self-supervised learning, we view the
action process as a source of natural data transformations with unique
spatio-temporal continuity and exploit the inherent relationships among
per-video object sets. Extensive experiments on two datasets, EPIC-KITCHENS-100
and EGTEA, show that our OIC significantly boosts the performance of multiple
state-of-the-art video classification models.
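As a concrete illustration of the set-level idea, here is a minimal,
hypothetical PyTorch sketch of a contrastive objective over per-video object
sets, assuming object crops have already been extracted by a hand-object
contact detector; the mean-pooled set embedding and the InfoNCE loss are
illustrative choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def set_contrastive_loss(encoder, crops_a, crops_b, temperature=0.07):
    """Contrastive loss over per-video object sets (illustrative sketch).

    crops_a, crops_b: (B, K, C, H, W) tensors holding two sets of K object
    crops per video, e.g. drawn from different time spans of the same clip,
    so the action itself supplies the natural "augmentation".
    encoder: any image encoder mapping (N, C, H, W) -> (N, D).
    """
    B, K = crops_a.shape[:2]
    # Embed every crop, then pool each video's crops into one set embedding.
    za = encoder(crops_a.flatten(0, 1)).view(B, K, -1).mean(dim=1)
    zb = encoder(crops_b.flatten(0, 1)).view(B, K, -1).mean(dim=1)
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    # Sets from the same video are positives; other videos serve as negatives.
    logits = za @ zb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(B, device=logits.device)
    return F.cross_entropy(logits, targets)
```

Because pre-training touches only this object encoder, the resulting OIC
embeddings can later be fed to an existing action classifier without joint
retraining, which is the decoupled integration the abstract argues for.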
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Simultaneous Detection and Interaction Reasoning for Object-Centric Action Recognition [21.655278000690686]
We propose an end-to-end object-centric action recognition framework.
It simultaneously performs Detection And Interaction Reasoning in one stage.
We conduct experiments on two datasets, Something-Else and Ikea-Assembly.
arXiv Detail & Related papers (2024-04-18T05:06:12Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- ROAM: Robust and Object-Aware Motion Generation Using Neural Pose Descriptors [73.26004792375556]
This paper shows that robustness and generalisation to novel scene objects in 3D object-aware character synthesis can be achieved by training a motion model with as few as one reference object.
We leverage an implicit feature representation trained on object-only datasets, which encodes an SE(3)-equivariant descriptor field around the object.
We demonstrate substantial improvements in 3D virtual character motion and interaction quality and robustness to scenarios with unseen objects.
arXiv Detail & Related papers (2023-08-24T17:59:51Z)
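A brief aside on the term used in the ROAM summary above: a descriptor field
f is SE(3)-equivariant if it transforms predictably under rigid 3D motions of
its input. Stated generically (this is the standard definition, not ROAM's
exact construction):

$$ f(T \cdot x) = \rho(T)\, f(x) \qquad \text{for all } T \in \mathrm{SE}(3), $$

where T is a rotation plus translation acting on points x, and \rho(T) is the
corresponding transformation of the descriptor space.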
- InterTracker: Discovering and Tracking General Objects Interacting with Hands in the Wild [40.489171608114574]
Existing methods rely on frame-based detectors to locate interacting objects.
We propose to leverage hand-object interaction to track interactive objects.
Our proposed method outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-06T09:09:17Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
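The PCA localization step mentioned in the entry above can be sketched as
follows; this is a generic, hypothetical rendering assuming dense patch
features such as ViT tokens (the orientation heuristic and threshold are
illustrative, not the paper's exact procedure).

```python
import torch

def pca_object_mask(feats, thresh=0.0):
    """Localize object regions via the first principal component (sketch).

    feats: (H*W, D) dense patch features from a self-supervised backbone.
    Returns a boolean (H*W,) mask marking patches assigned to the object.
    """
    feats = feats - feats.mean(dim=0, keepdim=True)    # center the features
    _, _, v = torch.pca_lowrank(feats, q=1, center=False)
    proj = feats @ v[:, 0]                             # (H*W,) PC1 scores
    # Flip sign so the minority side (usually the object) is positive.
    if (proj > 0).float().mean() > 0.5:
        proj = -proj
    return proj > thresh
```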
- Discovering Objects that Can Move [55.743225595012966]
We study the problem of object discovery -- separating objects from the background without manual labels.
Existing approaches utilize appearance cues, such as color, texture, and location, to group pixels into object-like regions.
We choose to focus on dynamic objects -- entities that can move independently in the world.
arXiv Detail & Related papers (2022-03-18T21:13:56Z)
- Object-to-Scene: Learning to Transfer Object Knowledge to Indoor Scene Recognition [19.503027767462605]
We propose an Object-to-Scene (OTS) method, which extracts object features and learns object relations to recognize indoor scenes.
OTS outperforms the state-of-the-art methods by more than 2% on indoor scene recognition without using any additional streams.
arXiv Detail & Related papers (2021-08-01T08:37:08Z)
- A Deep Learning Approach to Object Affordance Segmentation [31.221897360610114]
We design an autoencoder that infers pixel-wise affordance labels in both videos and static images.
Our model dispenses with object labels and bounding boxes by using a soft-attention mechanism.
We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF.
arXiv Detail & Related papers (2020-04-18T15:34:41Z)
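The soft-attention mechanism mentioned in the affordance entry above can be
sketched generically; the module below is a hypothetical illustration (not
the paper's architecture) of how a learned per-location weight map lets a
model focus on object regions without labels or bounding boxes.

```python
import torch.nn as nn

class SoftSpatialAttention(nn.Module):
    """Generic spatial soft attention: score each location, normalize the
    scores into a weight map, and reweight the features with it (sketch)."""

    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel score

    def forward(self, x):                      # x: (B, C, H, W)
        b, _, h, w = x.shape
        attn = self.score(x).view(b, 1, h * w).softmax(dim=-1)
        return x * attn.view(b, 1, h, w)       # soft focus on salient regions
```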
- Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) by incorporating self-supervision.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)