Learning Visual Affordance Grounding from Demonstration Videos
- URL: http://arxiv.org/abs/2108.05675v1
- Date: Thu, 12 Aug 2021 11:45:38 GMT
- Title: Learning Visual Affordance Grounding from Demonstration Videos
- Authors: Hongchen Luo, Wei Zhai, Jing Zhang, Yang Cao, Dacheng Tao
- Abstract summary: Affordance grounding aims to segment all possible interaction regions between people and objects from an image/video.
We propose a Hand-aided Affordance Grounding Network (HAG-Net) that leverages the aided clues provided by the position and action of the hand in demonstration videos.
- Score: 76.46484684007706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual affordance grounding aims to segment all possible interaction regions
between people and objects from an image/video, which is beneficial for many
applications, such as robot grasping and action recognition. However, existing
methods mainly rely on the appearance features of the objects to segment each
region of the image, and thus face the following two problems: (i) there are
multiple possible regions in an object that people interact with; and (ii)
there are multiple possible human interactions in the same object region. To
address these problems, we propose a Hand-aided Affordance Grounding Network
(HAG-Net) that leverages the aided clues provided by the position and action of
the hand in demonstration videos to eliminate the multiple possibilities and
better locate the interaction regions in the object. Specifically, HAG-Net has
a dual-branch structure to process the demonstration video and object image.
For the video branch, we introduce hand-aided attention to enhance the region
around the hand in each video frame and then use an LSTM network to aggregate
the action features. For the object branch, we introduce a semantic enhancement
module (SEM) to make the network focus on different parts of the object
according to the action classes and utilize a distillation loss to align the
output features of the object branch with those of the video branch and transfer
the knowledge in the video branch to the object branch. Quantitative and
qualitative evaluations on two challenging datasets show that our method
achieves state-of-the-art results for affordance grounding. The source code will
be made available to the public.
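
To make the dual-branch design concrete, below is a minimal PyTorch-style sketch of the pipeline the abstract describes: hand-aided attention re-weighting frame features, an LSTM aggregating them over time, a semantic enhancement module conditioning object features on the action class, and a distillation loss aligning the object branch with the video branch. The module names, tensor shapes, and the specific attention/modulation forms are illustrative assumptions, not the authors' released implementation.

```python
# Sketch only: shapes and modulation forms are assumptions, not the HAG-Net code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HandAidedAttention(nn.Module):
    """Re-weights frame features with a soft mask centred on the hand region."""

    def forward(self, frame_feat, hand_mask):
        # frame_feat: (N, C, h, w); hand_mask: (N, 1, h, w) in [0, 1],
        # assumed already resized to the feature resolution.
        return frame_feat * (1.0 + hand_mask)  # emphasise regions near the hand


class SemanticEnhancement(nn.Module):
    """Conditions object features on the action class (assumed gating-style modulation)."""

    def __init__(self, num_actions, channels):
        super().__init__()
        self.scale = nn.Embedding(num_actions, channels)

    def forward(self, obj_feat, action_id):
        gamma = self.scale(action_id).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return obj_feat * torch.sigmoid(gamma)


class DualBranchAffordanceNet(nn.Module):
    def __init__(self, backbone, num_actions, channels=512):
        super().__init__()
        self.backbone = backbone            # assumed CNN encoder: (N, 3, H, W) -> (N, C, h, w)
        self.hand_attn = HandAidedAttention()
        self.temporal = nn.LSTM(channels, channels, batch_first=True)
        self.sem = SemanticEnhancement(num_actions, channels)

    def video_branch(self, frames, hand_masks):
        # frames: (B, T, 3, H, W); hand_masks: (B, T, 1, h, w)
        B, T = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))            # (B*T, C, h, w)
        feats = self.hand_attn(feats, hand_masks.flatten(0, 1))
        pooled = feats.mean(dim=(-2, -1)).view(B, T, -1)        # (B, T, C)
        _, (h_n, _) = self.temporal(pooled)                     # aggregate action features over time
        return h_n[-1]                                          # (B, C)

    def object_branch(self, image, action_id):
        feat = self.sem(self.backbone(image), action_id)        # (B, C, h, w)
        return feat.mean(dim=(-2, -1))                          # (B, C)

    def distillation_loss(self, obj_feat, vid_feat):
        # Align object-branch features with (detached) video-branch features.
        return F.mse_loss(obj_feat, vid_feat.detach())
```

In this sketch the hand masks are simply treated as given inputs; in practice such cues would come from a hand detector run on the demonstration frames.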
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z) - CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking
and Segmentation [31.167405688707575]
We propose a framework for instance-level visual analysis on video frames.
It can simultaneously conduct object detection, instance segmentation, and multi-object tracking.
We evaluate the proposed method extensively on KITTI MOTS and MOTS Challenge datasets.
arXiv Detail & Related papers (2023-11-02T04:32:24Z) - Grounding 3D Object Affordance from 2D Interactions in Images [128.6316708679246]
Grounding 3D object affordance seeks to locate the "action possibility" regions of objects in 3D space.
Humans possess the ability to perceive object affordances in the physical world through demonstration images or videos.
We devise an Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources.
arXiv Detail & Related papers (2023-03-18T15:37:35Z) - The Second Place Solution for The 4th Large-scale Video Object
Segmentation Challenge--Track 3: Referring Video Object Segmentation [18.630453674396534]
ReferFormer aims to segment, across all frames of a given video, the object instance referred to by a language expression.
This work proposes several tricks to further boost performance, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference.
The improved ReferFormer ranks 2nd place on CVPR2022 Referring Youtube-VOS Challenge.
arXiv Detail & Related papers (2022-06-24T02:15:06Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - Co-segmentation Inspired Attention Module for Video-based Computer
Vision Tasks [11.61956970623165]
We propose a generic module called "Co-Segmentation Module Activation" (COSAM) to promote the notion of co-segmentation based attention among a sequence of video frame features.
We show the application of COSAM in three video based tasks namely 1) Video-based person re-ID, 2) Video captioning, & 3) Video action classification.
arXiv Detail & Related papers (2021-11-14T15:35:37Z) - The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos [59.12750806239545]
We show that a video has different views of the same scene related by moving components, and the right region segmentation and region flow would allow mutual view synthesis.
Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images.
By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically without building them up from low-level edges or optical flows respectively.
arXiv Detail & Related papers (2021-11-11T18:59:11Z) - A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been dedicated to video segmentation and delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z) - Rethinking Cross-modal Interaction from a Top-down Perspective for
Referring Video Object Segmentation [140.4291169276062]
Referring video object segmentation (RVOS) aims to segment video objects with the guidance of natural language reference.
Previous methods typically tackle RVOS through directly grounding linguistic reference over the image lattice.
In this work, we put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected from several sampled frames to the entire video.
Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently.
arXiv Detail & Related papers (2021-06-02T10:26:13Z)