Refined Semantic Enhancement towards Frequency Diffusion for Video
Captioning
- URL: http://arxiv.org/abs/2211.15076v1
- Date: Mon, 28 Nov 2022 05:45:17 GMT
- Title: Refined Semantic Enhancement towards Frequency Diffusion for Video
Captioning
- Authors: Xian Zhong, Zipeng Li, Shuqin Chen, Kui Jiang, Chen Chen and Mang Ye
- Abstract summary: Video captioning aims to generate natural language sentences that describe the given video accurately.
Existing methods achieve favorable generation by exploring richer visual representations in the encoding phase or by improving the decoding ability.
We introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of infrequent tokens.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning aims to generate natural language sentences that describe
the given video accurately. Existing methods achieve favorable generation by
exploring richer visual representations in the encoding phase or by improving
the decoding ability. However, the long-tailed problem hinders these attempts
for low-frequency tokens, which occur rarely but carry critical semantics and
play a vital role in detailed generation. In this paper, we introduce a novel
Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a
captioning model that constantly perceives the linguistic representation of
infrequent tokens. Concretely, a Frequency-Aware Diffusion (FAD) module is
proposed to comprehend the semantics of low-frequency tokens and break through
generation limitations. In this way, the caption is refined by promoting the
absorption of tokens with insufficient occurrence. Based on FAD, we design a
Divergent Semantic Supervisor (DSS) module to compensate for the information
loss in high-frequency tokens brought by the diffusion process, where the
semantics of low-frequency tokens is further emphasized to alleviate the
long-tailed problem. Extensive experiments on two benchmark datasets, MSR-VTT
and MSVD, indicate that RSFD outperforms state-of-the-art methods,
demonstrating that enhancing the semantics of low-frequency tokens yields a
competitive generation effect. Code is available at
https://github.com/lzp870/RSFD.
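The core FAD intuition, perturbing the representations of infrequent tokens so the model is forced to learn their semantics, can be caricatured in a few lines. The sketch below is an illustrative assumption, not the paper's actual formulation: the names `low_freq_mask` and `perturb_embeddings`, the frequency threshold, and the single-step Gaussian noise (standing in for a full diffusion process) are all hypothetical.

```python
from collections import Counter

import numpy as np


def low_freq_mask(captions, threshold=2):
    """Return the set of tokens whose corpus frequency is below `threshold`.

    `captions` is a list of tokenized sentences (lists of strings).
    """
    counts = Counter(tok for sent in captions for tok in sent)
    return {tok for tok, c in counts.items() if c < threshold}


def perturb_embeddings(emb, tokens, rare, noise_scale=0.1, rng=None):
    """Add Gaussian noise to the embeddings of rare tokens only.

    This is a crude, single-step stand-in for a diffusion-style perturbation:
    by corrupting low-frequency token representations during training, the
    model cannot rely on memorization and must absorb their semantics.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    emb = emb.copy()
    for i, tok in enumerate(tokens):
        if tok in rare:
            emb[i] += noise_scale * rng.standard_normal(emb.shape[1])
    return emb


captions = [["a", "man", "rides", "a", "unicycle"],
            ["a", "man", "walks", "a", "dog"]]
rare = low_freq_mask(captions, threshold=2)
# tokens occurring only once ("rides", "unicycle", "walks", "dog") are flagged

emb = np.zeros((5, 8))  # toy embeddings for the first caption
noisy = perturb_embeddings(emb, captions[0], rare)
```

In the real model the perturbation would be scheduled over diffusion steps and paired with a supervisor (DSS in the paper) that restores high-frequency token information; here the point is only the frequency-conditional treatment of tokens.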
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
arXiv Detail & Related papers (2024-08-11T12:24:23Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Denoising-Diffusion Alignment for Continuous Sign Language Recognition [24.376213903941746]
A key challenge of continuous sign language recognition is achieving cross-modality alignment between videos and gloss sequences.
We propose a novel Denoising-Diffusion global alignment (DDA) method.
DDA uses diffusion-based global alignment techniques to align video with gloss sequence, facilitating global temporal context alignment.
arXiv Detail & Related papers (2023-05-05T15:20:27Z) - Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation [41.292644854306594]
We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture).
DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations.
arXiv Detail & Related papers (2023-03-16T07:32:31Z) - Semantic-Conditional Diffusion Networks for Image Captioning [116.86677915812508]
We propose a new diffusion model based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net)
In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence.
Experiments on COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task.
arXiv Detail & Related papers (2022-12-06T16:08:16Z) - PINs: Progressive Implicit Networks for Multi-Scale Neural
Representations [68.73195473089324]
We propose a progressive positional encoding, exposing a hierarchical structure to incremental sets of frequency encodings.
Our model accurately reconstructs scenes with wide frequency bands and learns a scene representation at progressive levels of detail.
Experiments on several 2D and 3D datasets show improvements in reconstruction accuracy, representational capacity and training speed compared to baselines.
arXiv Detail & Related papers (2022-02-09T20:33:37Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
In contrast, the introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.