Audio-Visual Instance Segmentation
- URL: http://arxiv.org/abs/2310.18709v1
- Date: Sat, 28 Oct 2023 13:37:52 GMT
- Title: Audio-Visual Instance Segmentation
- Authors: Ruohao Guo, Yaru Chen, Yanyu Qi, Wenzhen Yue, Dantong Niu, Xianghua
Ying
- Abstract summary: We propose a new multi-modal task, namely audio-visual instance segmentation (AVIS).
The goal is to simultaneously identify, segment, and track individual sounding object instances in audible videos.
To our knowledge, it is the first time that instance segmentation has been extended into the audio-visual domain.
- Score: 11.25619190194146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new multi-modal task, namely audio-visual
instance segmentation (AVIS), in which the goal is to simultaneously identify,
segment, and track individual sounding object instances in audible videos.
To our knowledge, it is the first time that instance segmentation has been
extended into the audio-visual domain. To better facilitate this research, we
construct the first audio-visual instance segmentation benchmark (AVISeg).
Specifically, AVISeg consists of 1,258 videos with an average duration of 62.6
seconds from YouTube and public audio-visual datasets, where 117 videos have
been annotated using an interactive semi-automatic labeling tool based on
the Segment Anything Model (SAM). In addition, we present a simple baseline
model for the AVIS task. Our new model introduces an audio branch and a
cross-modal fusion module to Mask2Former to locate all sounding objects.
Finally, we evaluate the proposed method using two backbones on AVISeg. We
believe that AVIS will inspire the community towards a more comprehensive
multi-modal understanding.
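The baseline above attaches an audio branch and a cross-modal fusion module to Mask2Former. The abstract does not describe the fusion itself, so the following is only a minimal sketch of one plausible design, scaled dot-product cross-attention with visual features as queries and audio features as keys/values; all function names and shapes here are hypothetical, not from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(visual, audio):
    """Fuse audio cues into visual features with cross-attention.
    visual: (N, d) per-location visual features (queries).
    audio:  (T, d) per-frame audio features (keys/values).
    Returns fused features of shape (N, d) via a residual add."""
    d = visual.shape[-1]
    attn = softmax(visual @ audio.T / np.sqrt(d), axis=-1)  # (N, T)
    return visual + attn @ audio                            # (N, d)

rng = np.random.default_rng(0)
fused = cross_modal_fusion(rng.standard_normal((16, 32)),
                           rng.standard_normal((4, 32)))
print(fused.shape)  # (16, 32)
```

In a Mask2Former-style model, such a module would sit between the pixel decoder and the transformer decoder, so that the mask queries attend over audio-conditioned visual features.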
Related papers
- Extending Segment Anything Model into Auditory and Temporal Dimensions for Audio-Visual Segmentation [17.123212921673176]
We propose a Spatio-Temporal, Bidirectional Audio-Visual Attention (ST-BAVA) module integrated into the middle of SAM's encoder and mask decoder.
It adaptively updates the audio-visual features to convey the temporal correspondence between the video frames and audio streams.
Our proposed model outperforms the state-of-the-art methods on AVS benchmarks, especially with an 8.3% mIoU gain on a challenging multi-sources subset.
arXiv Detail & Related papers (2024-06-10T10:53:23Z) - Discovering Sounding Objects by Audio Queries for Audio Visual
Segmentation [36.50512269898893]
To distinguish the sounding objects from silent ones, audio-visual semantic correspondence and temporal interaction are required.
We propose an Audio-Queried Transformer architecture, AQFormer, where we define a set of object queries conditioned on audio information.
Our method achieves state-of-the-art performance, with 7.1% M_J and 7.6% M_F gains on the MS3 setting.
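AQFormer's central idea is a set of object queries conditioned on audio. As a simplified, hypothetical sketch (the actual model conditions queries inside a transformer decoder; the linear-offset form below is only illustrative), conditioning can be seen as shifting learned query embeddings by a projection of a clip-level audio feature:

```python
import numpy as np

def audio_conditioned_queries(learned_queries, audio_feat, proj):
    """Offset learned object queries by a projected audio embedding.
    learned_queries: (Q, d) learned query embeddings.
    audio_feat:      (d_a,) clip-level audio feature.
    proj:            (d_a, d) projection matrix (hypothetical).
    Returns audio-conditioned queries of shape (Q, d)."""
    return learned_queries + audio_feat @ proj  # (d,) broadcast over (Q, d)

Q, d_a, d = 8, 128, 64
rng = np.random.default_rng(1)
queries = audio_conditioned_queries(rng.standard_normal((Q, d)),
                                    rng.standard_normal(d_a),
                                    rng.standard_normal((d_a, d)))
print(queries.shape)  # (8, 64)
```

The point of the conditioning is that each query then attends to visual features already biased toward regions consistent with the audio, which helps separate sounding objects from silent ones.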
arXiv Detail & Related papers (2023-09-18T05:58:06Z) - AV-SAM: Segment Anything Model Meets Audio-Visual Localization and
Segmentation [30.756247389435803]
Segment Anything Model (SAM) has recently shown its powerful effectiveness in visual segmentation tasks.
We propose a framework based on AV-SAM that can generate sounding object masks corresponding to the audio.
We conduct extensive experiments on Flickr-SoundNet and AVSBench datasets.
arXiv Detail & Related papers (2023-05-03T00:33:52Z) - Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale
Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z) - Object Segmentation with Audio Context [0.5243460995467893]
This project explores the multimodal feature aggregation for video instance segmentation task.
We integrate audio features into our video segmentation model to conduct an audio-visual learning scheme.
arXiv Detail & Related papers (2023-01-04T01:33:42Z) - Audio-Visual Segmentation [47.10873917119006]
We propose to explore a new problem called audio-visual segmentation (AVS).
The goal is to output a pixel-level map of the object(s) that produce sound at the time of the image frame.
We construct the first audio-visual segmentation benchmark (AVSBench), providing pixel-wise annotations for the sounding objects in audible videos.
arXiv Detail & Related papers (2022-07-11T17:50:36Z) - Tag-Based Attention Guided Bottom-Up Approach for Video Instance
Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the minimum run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z) - Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos.
Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching.
We propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z) - End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video.
Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.
For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
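VisTR decodes all (frame, instance) query slots in a single parallel pass, then reads the outputs back out per instance. A toy sketch of that regrouping step, assuming a frame-major slot layout (the layout is an assumption for illustration, not taken from the paper):

```python
import numpy as np

# Parallel sequence decoding: one output row per (frame, instance) slot.
T, Q, d = 3, 2, 4  # frames, instances per frame, feature dim
decoded = np.arange(T * Q * d, dtype=float).reshape(T * Q, d)

# Frame-major layout: instance i in frame t occupies row t*Q + i,
# so stepping by Q collects one instance's trajectory across frames.
per_instance = [decoded[i::Q] for i in range(Q)]  # each (T, d)
print(len(per_instance), per_instance[0].shape)
```

Because each instance keeps the same slot index in every frame, association across frames falls out of the decoding order itself, with no separate tracking or matching stage.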
arXiv Detail & Related papers (2020-11-30T02:03:50Z) - Fast Video Object Segmentation With Temporal Aggregation Network and
Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles, running at 0.14 seconds per frame with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.