Factor-Conditioned Speaking-Style Captioning
- URL: http://arxiv.org/abs/2406.18910v1
- Date: Thu, 27 Jun 2024 05:52:10 GMT
- Title: Factor-Conditioned Speaking-Style Captioning
- Authors: Atsushi Ando, Takafumi Moriya, Shota Horiguchi, Ryo Masumura
- Abstract summary: We introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors and then generates the caption, so the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts the speaking-style factors deterministically to guarantee semantic accuracy, then samples the caption conditioned on them to ensure diversity.
- Score: 32.67274840212351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria use the original captions directly, which contain not only speaking-style factor terms but also syntax words, and this hinders the learning of speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.) and then generates a caption, ensuring the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption by factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms original-caption-based training and, with GtS, generates more diverse captions while maintaining style-prediction performance.
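Although the paper itself ships no code, the two-stage decoding is easy to sketch. Below is a minimal illustration of GtS with a generic autoregressive decoder: tokens are chosen greedily (argmax) until the separator that closes the factor phrase, after which the caption body is sampled with a temperature. The `decoder` interface, the special-token ids, and the single-separator convention are assumptions made for illustration, not the authors' implementation.

```python
import torch

@torch.no_grad()
def greedy_then_sampling(decoder, audio_feats, bos_id, sep_id, eos_id,
                         max_len=64, temperature=1.0):
    """GtS sketch: deterministic decoding for the speaking-style factor
    phrase, temperature sampling for the caption that follows it.
    `decoder(audio_feats, prefix_ids)` is a hypothetical interface that
    returns next-token logits of shape (1, len(prefix), vocab)."""
    ids = [bos_id]
    in_factor_phrase = True
    for _ in range(max_len):
        logits = decoder(audio_feats, torch.tensor([ids]))[0, -1]
        if in_factor_phrase:
            next_id = int(logits.argmax())              # greedy: factors stay accurate
        else:
            probs = torch.softmax(logits / temperature, dim=-1)
            next_id = int(torch.multinomial(probs, 1))  # sampling: captions stay diverse
        ids.append(next_id)
        if next_id == sep_id:
            in_factor_phrase = False                    # factor phrase done; switch modes
        elif next_id == eos_id:
            break
    return ids
```

Because the factor phrase is fixed before sampling begins, repeated calls vary the wording of the caption but not the predicted style factors, which is exactly the accuracy/diversity split the abstract describes.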
Related papers
- StyleCap: Automatic Speaking-Style Captioning from Speech Based on Speech and Language Self-supervised Learning Models [17.945821635380614]
StyleCap is a method to generate natural language descriptions of speaking styles appearing in speech.
StyleCap is trained with paired data of speech and natural language descriptions.
arXiv Detail & Related papers (2023-11-28T04:49:17Z)
- Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion [17.99150939602917]
State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training.
We present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused.
arXiv Detail & Related papers (2023-06-20T15:13:02Z)
- Pragmatic Inference with a CLIP Listener for Contrastive Captioning [10.669625017690658]
We propose a method for generating discriminative captions that distinguish target images from very similar alternative distractor images.
Our approach is built on a pragmatic inference procedure that formulates captioning as a reference game between a speaker and a listener.
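The listener half of such a reference game is compact enough to sketch. The snippet below assumes caption and image embeddings have already been produced by a CLIP-style encoder and unit-normalized (that step is omitted); candidate captions are then reranked by how much probability a similarity-based listener puts on the target image.

```python
import torch

def rerank_by_listener(caption_embs: torch.Tensor,
                       image_embs: torch.Tensor,
                       target_idx: int):
    """Pragmatic reranking sketch.
    caption_embs: (C, D) unit-normalized embeddings of candidate captions
    image_embs:   (I, D) unit-normalized embeddings of target + distractors
    Returns the index of the most discriminative caption and the
    per-caption probability the listener assigns to the target image."""
    sims = caption_embs @ image_embs.T        # (C, I) cosine similarities
    listener = torch.softmax(sims, dim=1)     # P(image | caption) per caption
    target_prob = listener[:, target_idx]     # discriminativeness of each caption
    return int(target_prob.argmax()), target_prob
```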
arXiv Detail & Related papers (2023-06-15T02:22:28Z)
- Towards Generating Diverse Audio Captions via Adversarial Training [33.76154801580643]
We propose a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems.
A caption generator and two hybrid discriminators compete and are learned jointly; the generator can be any standard encoder-decoder captioning model.
The results show that our proposed model can generate captions with better diversity as compared to state-of-the-art methods.
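As a rough sketch of how a two-discriminator objective might be wired, the helper below computes standard conditional-GAN losses given discriminator logits for real (human) and generated captions; one critic judges naturalness, the other whether a caption matches the audio. This is a generic formulation rather than the paper's exact hybrid discriminators, and it glosses over the discrete-token problem (backpropagating into a caption generator typically needs a Gumbel-softmax or policy-gradient workaround).

```python
import torch
import torch.nn.functional as F

def cgan_caption_losses(d_nat_real, d_nat_fake, d_sem_real, d_sem_fake):
    """Generic two-critic GAN losses (illustrative naming): 'nat' logits
    judge caption naturalness, 'sem' logits judge audio-caption match;
    *_real are scores on human captions, *_fake on generated ones."""
    bce = F.binary_cross_entropy_with_logits
    # Discriminator objective: real captions -> 1, generated captions -> 0.
    d_loss = (bce(d_nat_real, torch.ones_like(d_nat_real))
              + bce(d_nat_fake, torch.zeros_like(d_nat_fake))
              + bce(d_sem_real, torch.ones_like(d_sem_real))
              + bce(d_sem_fake, torch.zeros_like(d_sem_fake)))
    # Generator objective: make both critics label generated captions as real.
    g_loss = (bce(d_nat_fake, torch.ones_like(d_nat_fake))
              + bce(d_sem_fake, torch.ones_like(d_sem_fake)))
    return d_loss, g_loss
```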
arXiv Detail & Related papers (2022-12-05T05:06:19Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voices.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Diverse Image Captioning with Grounded Style [19.434931809979282]
We propose COCO-based augmentations to obtain varied stylized captions from COCO annotations.
We encode the stylized information in the latent space of a Variational Autoencoder.
Our experiments on the Senticap and COCO datasets show the ability of our approach to generate accurate captions.
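A minimal sketch of the kind of VAE bottleneck that could carry the style information is shown below; the surrounding feature encoder and caption decoder are omitted, and the dimensions are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class StyleLatent(nn.Module):
    """VAE bottleneck sketch: encode a style/caption feature into (mu,
    logvar), sample z with the reparameterization trick, and return the
    KL penalty that keeps the latent space well-behaved for sampling."""
    def __init__(self, feat_dim: int = 512, z_dim: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, z_dim)
        self.to_logvar = nn.Linear(feat_dim, z_dim)

    def forward(self, h: torch.Tensor):
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl.mean()  # z conditions the caption decoder; kl joins the loss
```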
arXiv Detail & Related papers (2022-05-03T22:57:59Z)
- Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision-transformer-based image captioning model, dubbed ViTCAP, which uses grid representations without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
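As a rough illustration of that classification step, the stand-in head below pools the ViT grid tokens and treats concept prediction as multi-label classification, returning the top-k concept ids a caption decoder could consume. The real CTN is itself transformer-based; the linear head, mean pooling, and sizes here are simplifying assumptions.

```python
import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    """Simplified concept predictor: ViT grid tokens -> pooled feature ->
    multi-label logits over a concept vocabulary -> top-k concept ids."""
    def __init__(self, d_model: int = 768, n_concepts: int = 1000, top_k: int = 20):
        super().__init__()
        self.classifier = nn.Linear(d_model, n_concepts)
        self.top_k = top_k

    def forward(self, grid_tokens: torch.Tensor):
        # grid_tokens: (B, N, d_model) patch features from the vision transformer
        pooled = grid_tokens.mean(dim=1)          # crude stand-in for attention pooling
        logits = self.classifier(pooled)          # train with a multi-label BCE loss
        top_ids = logits.topk(self.top_k, dim=-1).indices  # concepts passed downstream
        return logits, top_ids
```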
arXiv Detail & Related papers (2021-12-09T22:05:05Z)
- Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes the video semantic representation as input and conditionally modulates the gates and cells of a long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
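Gate modulation of this kind can be sketched with a hand-rolled LSTM cell whose gate pre-activations are rescaled by a vector derived from the exemplar sentence's syntax embedding. The multiplicative sigmoid form and the shapes below are illustrative assumptions, not the published SMCG equations.

```python
import torch
import torch.nn as nn

class ModulatedLSTMCell(nn.Module):
    """LSTM cell whose gate pre-activations are scaled by factors
    computed from a syntax embedding (SMCG-style sketch)."""
    def __init__(self, x_dim: int, h_dim: int, syn_dim: int):
        super().__init__()
        self.ih = nn.Linear(x_dim, 4 * h_dim)     # input-to-gates
        self.hh = nn.Linear(h_dim, 4 * h_dim)     # hidden-to-gates
        self.mod = nn.Linear(syn_dim, 4 * h_dim)  # syntax -> per-gate scales

    def forward(self, x, state, syn_emb):
        h, c = state
        gamma = torch.sigmoid(self.mod(syn_emb))           # scales in (0, 1)
        i, f, g, o = (gamma * (self.ih(x) + self.hh(h))).chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                      # modulated memory update
        h = o * torch.tanh(c)
        return h, c
```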
arXiv Detail & Related papers (2021-12-02T09:24:45Z)
- Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task, Syntax Customized Video Captioning (SCVC), which aims to generate a caption that not only semantically describes the video content but also syntactically imitates a given exemplar sentence.
We demonstrate our model's capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed metric maintains robust performance and assigns more flexible scores to candidate captions when faced with semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Structural and Functional Decomposition for Personality Image Captioning in a Communication Game [53.74847926974122]
Personality image captioning (PIC) aims to describe an image with a natural language caption given a personality trait.
We introduce a novel formulation for PIC based on a communication game between a speaker and a listener.
arXiv Detail & Related papers (2020-11-17T10:19:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.