Audio-Infused Automatic Image Colorization by Exploiting Audio Scene
Semantics
- URL: http://arxiv.org/abs/2401.13270v1
- Date: Wed, 24 Jan 2024 07:22:05 GMT
- Title: Audio-Infused Automatic Image Colorization by Exploiting Audio Scene
Semantics
- Authors: Pengcheng Zhao, Yanxiang Chen, Yang Zhao, Wei Jia, Zhao Zhang,
Ronggang Wang and Richang Hong
- Abstract summary: This paper proposes to utilize the corresponding audio, which naturally contains extra semantic information about the same scene.
Experiments demonstrate that audio guidance can effectively improve the performance of automatic colorization.
- Score: 54.980359694044566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic image colorization is an inherently ill-posed problem with
uncertainty: estimating reasonable colors for a grayscale image requires an
accurate semantic understanding of the scene. Although recent interaction-based
methods have achieved impressive performance, inferring realistic and accurate
colors fully automatically remains very difficult. To ease the semantic
understanding of grayscale scenes, this paper proposes to utilize the
corresponding audio, which naturally contains extra semantic information about
the same scene. Specifically, a novel audio-infused automatic image colorization
(AIAIC) network is proposed, which consists of three stages. First, color image
semantics are used as a bridge: a colorization network is pretrained under the
guidance of the semantics of the color images themselves. Second, the natural
co-occurrence of audio and video is exploited to learn the color semantic
correlations between audio and visual scenes. Third, the implicit audio semantic
representation is fed into the pretrained network to realize audio-guided
colorization. The whole process is trained in a self-supervised manner without
human annotation. In addition, an audiovisual colorization dataset is
established for training and testing. Experiments demonstrate that audio
guidance effectively improves the performance of automatic colorization,
especially for scenes that are difficult to understand from the visual modality
alone.
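For readers who want a concrete picture of how the three stages compose, the following is a minimal PyTorch sketch. The module names (SemanticEncoder, AudioEncoder, ColorizationNet), the 512-dimensional shared embedding, and the simple L1/MSE losses are illustrative assumptions made for this sketch; the paper's actual architecture and losses are not reproduced here.

```python
# Minimal sketch of the three-stage AIAIC-style training scheme described in the
# abstract. All module designs, dimensions, and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 512  # assumed size of the shared semantic embedding


class SemanticEncoder(nn.Module):
    """Extracts a global semantic vector from a color image (stage-1 guidance)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, FEAT_DIM, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, rgb):
        return self.conv(rgb).flatten(1)           # (B, FEAT_DIM)


class AudioEncoder(nn.Module):
    """Maps a log-mel spectrogram into the same semantic space (stage 2)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, FEAT_DIM, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, mel):
        return self.conv(mel).flatten(1)           # (B, FEAT_DIM)


class ColorizationNet(nn.Module):
    """Predicts ab chroma from the L channel, conditioned on a semantic vector."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(1, 64, 3, padding=1)
        self.cond = nn.Linear(FEAT_DIM, 64)        # injects semantics as a bias
        self.dec = nn.Conv2d(64, 2, 3, padding=1)

    def forward(self, lum, sem):
        h = F.relu(self.enc(lum) + self.cond(sem)[:, :, None, None])
        return torch.tanh(self.dec(h))             # (B, 2, H, W) ab prediction


sem_enc, aud_enc, colorizer = SemanticEncoder(), AudioEncoder(), ColorizationNet()


def stage1_step(lum, ab, rgb):
    """Stage 1: pretrain the colorizer guided by color-image semantics."""
    return F.l1_loss(colorizer(lum, sem_enc(rgb)), ab)


def stage2_step(mel, rgb):
    """Stage 2: align audio embeddings with visual color semantics, exploiting
    the natural co-occurrence of audio and video frames."""
    with torch.no_grad():
        target = sem_enc(rgb)                      # frozen visual semantics
    return F.mse_loss(aud_enc(mel), target)


def stage3_step(lum, ab, mel):
    """Stage 3: drive the pretrained colorizer with audio semantics alone."""
    return F.l1_loss(colorizer(lum, aud_enc(mel)), ab)


# Toy usage with random tensors (L channel, ab channels, RGB frame, log-mel).
lum, ab = torch.rand(2, 1, 64, 64), torch.rand(2, 2, 64, 64) * 2 - 1
rgb, mel = torch.rand(2, 3, 64, 64), torch.rand(2, 1, 128, 128)
print(stage1_step(lum, ab, rgb), stage2_step(mel, rgb), stage3_step(lum, ab, mel))
```

The actual method likely uses a richer audio-visual alignment in stage 2 and a more elaborate conditioning mechanism in stage 3; the sketch only illustrates how the three stages build on one another.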
Related papers
- Diffusing Colors: Image Colorization with Text Guided Diffusion [11.727899027933466]
We present a novel image colorization framework that utilizes image diffusion techniques with granular text prompts.
Our method provides a balance between automation and control, outperforming existing techniques in terms of visual quality and semantic coherence.
Our approach is particularly promising for color enhancement and historical image colorization.
arXiv Detail & Related papers (2023-12-07T08:59:20Z)
- Speech inpainting: Context-based speech synthesis guided by video [29.233167442719676]
This paper focuses on the problem of audio-visual speech inpainting, the task of synthesizing speech in a corrupted audio segment.
We present an audio-visual transformer-based deep learning model that leverages visual cues that provide information about the content of the corrupted audio.
We also show how visual features extracted with AV-HuBERT, a large audio-visual transformer for speech recognition, are suitable for synthesizing speech.
arXiv Detail & Related papers (2023-06-01T09:40:47Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
- TIC: Text-Guided Image Colorization [24.317541784957285]
We propose a novel deep network that takes two inputs (the grayscale image and the respective encoded text description) and tries to predict the relevant color gamut.
As the respective textual descriptions contain color information of the objects present in the scene, the text encoding helps to improve the overall quality of the predicted colors.
We have evaluated our proposed model using different metrics and found that it outperforms the state-of-the-art colorization algorithms both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-08-04T18:40:20Z)
- Semantic-Sparse Colorization Network for Deep Exemplar-based Colorization [23.301799487207035]
Exemplar-based colorization approaches rely on a reference image to provide plausible colors for the target gray-scale image.
We propose Semantic-Sparse Colorization Network (SSCN) to transfer both the global image style and semantic-related colors to the gray-scale image.
Our network can perfectly balance the global and local colors while alleviating the ambiguous matching problem.
arXiv Detail & Related papers (2021-12-02T15:35:10Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Semantic-driven Colorization [78.88814849391352]
Recent colorization works implicitly predict the semantic information while learning to colorize black-and-white images.
In this study, we mimic that human-like process: our network first learns to understand the photo, and then colorizes it.
arXiv Detail & Related papers (2020-06-13T08:13:30Z)