Co-segmentation Inspired Attention Module for Video-based Computer
Vision Tasks
- URL: http://arxiv.org/abs/2111.07370v1
- Date: Sun, 14 Nov 2021 15:35:37 GMT
- Title: Co-segmentation Inspired Attention Module for Video-based Computer
Vision Tasks
- Authors: Arulkumar Subramaniam, Jayesh Vaidya, Muhammed Abdul Majeed Ameen,
Athira Nambiar and Anurag Mittal
- Abstract summary: We propose a generic module called "Co-Segmentation Activation Module" (COSAM) to promote the notion of co-segmentation-based attention among a sequence of video frame features.
We show the application of COSAM in three video-based tasks, namely 1) video-based person re-ID, 2) video captioning, and 3) video action classification.
- Score: 11.61956970623165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computer vision tasks can benefit from estimating the salient object
regions and the interactions between those regions. Identifying the object
regions typically relies on pretrained models for object detection, object
segmentation and/or object pose estimation. However, this is often infeasible
in practice for the following reasons: 1) the object categories in a pretrained
model's training dataset may not exhaustively cover all the categories needed
for general computer vision tasks, 2) the domain gap between a pretrained
model's training dataset and the target task's dataset may negatively impact
performance, and 3) the bias and variance present in a pretrained model may
leak into the target task, leading to an inadvertently biased target model. To
overcome these downsides, we exploit the common rationale that a sequence of
video frames captures a set of common objects and the interactions between
them; thus, a notion of co-segmentation between the video frame features can
equip the model to automatically focus on salient regions and improve the
underlying task's performance in an end-to-end manner. In this regard, we
propose a generic module called "Co-Segmentation Activation Module" (COSAM)
that can be plugged into any CNN to promote co-segmentation-based attention
among a sequence of video frame features. We show the application of COSAM in
three video-based tasks, namely 1) video-based person re-ID, 2) video
captioning, and 3) video action classification, and demonstrate that COSAM
captures salient regions in the video frames, leading to notable performance
improvements along with interpretable attention maps.
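
To make the idea concrete, below is a minimal PyTorch sketch of a co-segmentation-style attention block that could be plugged between two convolutional stages of a frame-level backbone. This is an illustrative assumption, not the authors' exact COSAM formulation: each frame's embedded features are compared against a clip-level descriptor shared across all frames, and the resulting similarity map is used as a spatial attention over that frame. The class name `CoSegAttention`, the 1x1 projection, and the reduced embedding dimension are hypothetical choices made for the sketch.

```python
# Minimal sketch of co-segmentation-style spatial attention over a clip of frame
# features. Assumption: this is NOT the paper's exact COSAM module, only an
# illustration of the idea of attending to content shared across frames.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoSegAttention(nn.Module):
    """Re-weights each frame's features toward regions shared across the clip."""

    def __init__(self, channels: int, reduced: int = 128):
        super().__init__()
        # Hypothetical 1x1 projection into a lower-dimensional comparison space.
        self.embed = nn.Conv2d(channels, reduced, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, C, H, W) -- features of T frames from the same video clip.
        emb = self.embed(feats)                                # (T, R, H, W)
        # Clip-level descriptor: mean of embedded features over frames and space.
        common = emb.mean(dim=(0, 2, 3), keepdim=True)         # (1, R, 1, 1)
        # Spatial attention: cosine similarity of every location to the descriptor.
        sim = F.cosine_similarity(emb, common.expand_as(emb), dim=1)  # (T, H, W)
        attn = torch.sigmoid(sim).unsqueeze(1)                 # (T, 1, H, W)
        # Emphasize locations that resemble the content common to all frames.
        return feats * attn


if __name__ == "__main__":
    clip_feats = torch.randn(8, 256, 14, 14)   # 8 frames, 256-channel feature maps
    cosam_like = CoSegAttention(channels=256)
    out = cosam_like(clip_feats)
    print(out.shape)  # torch.Size([8, 256, 14, 14])
```

In this sketch the clip descriptor is a simple mean over frames and spatial locations; the paper's COSAM module may compute the cross-frame correlation and any channel attention differently, so the above should be read only as a rough illustration of plugging co-segmentation-based attention into a CNN.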
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy, Masked Video Consistency, which enhances spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- The Emergence of Objectness: Learning Zero-Shot Segmentation from Videos [59.12750806239545]
We show that a video has different views of the same scene related by moving components, and the right region segmentation and region flow would allow mutual view synthesis.
Our model starts with two separate pathways: an appearance pathway that outputs feature-based region segmentation for a single image, and a motion pathway that outputs motion features for a pair of images.
By training the model to minimize view synthesis errors based on segment flow, our appearance and motion pathways learn region segmentation and flow estimation automatically without building them up from low-level edges or optical flows respectively.
arXiv Detail & Related papers (2021-11-11T18:59:11Z)
- Learning Visual Affordance Grounding from Demonstration Videos [76.46484684007706]
Affordance grounding aims to segment all possible interaction regions between people and objects from an image/video.
We propose a Hand-aided Affordance Grounding Network (HAGNet) that leverages the aided clues provided by the position and action of the hand in demonstration videos.
arXiv Detail & Related papers (2021-08-12T11:45:38Z)
- Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models relational-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)