Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual
Cross-modal Structure-pivoted Alignment
- URL: http://arxiv.org/abs/2305.12260v2
- Date: Thu, 25 May 2023 04:02:17 GMT
- Authors: Shengqiong Wu, Hao Fei, Wei Ji, Tat-Seng Chua
- Abstract summary: Unpaired cross-lingual image captioning has long suffered from irrelevancy and disfluency issues.
In this work, we propose to address the above problems by incorporating the scene graph (SG) structures and the syntactic constituency (SC) trees.
Our captioner contains the semantic structure-guided image-to-pivot captioning and the syntactic structure-guided pivot-to-target translation.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unpaired cross-lingual image captioning has long suffered from
irrelevancy and disfluency issues, due to inconsistencies of the semantic
scene and syntax attributes during transfer. In this work, we propose to
address these problems by incorporating scene graph (SG) structures and
syntactic constituency (SC) trees. Our captioner contains semantic
structure-guided image-to-pivot captioning and syntactic structure-guided
pivot-to-target translation, the two stages being joined via the pivot
language. We then take the SG and SC structures as pivots, performing
cross-modal semantic structure alignment and cross-lingual syntactic
structure alignment learning. We further introduce cross-lingual and
cross-modal back-translation training to fully align the captioning and
translation stages. Experiments on English-Chinese transfer show that our
model substantially improves captioning relevancy and fluency.
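To make the two-stage design concrete, below is a minimal sketch of the pivoted pipeline the abstract describes: an SG-guided image-to-pivot captioner feeding an SC-guided pivot-to-target translator, with the structure-alignment and back-translation objectives noted only as comments. All module names, feature dimensions, and fusion choices are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the pivoted two-stage pipeline (illustrative, not the
# authors' code): stage 1 captions the image in the pivot language under
# scene-graph (SG) guidance; stage 2 translates pivot -> target under
# syntactic-constituency (SC) guidance.
import torch
import torch.nn as nn

class ImageToPivotCaptioner(nn.Module):
    def __init__(self, dim=512, pivot_vocab=32000):
        super().__init__()
        self.img_proj = nn.Linear(2048, dim)   # e.g. detector region features
        self.sg_proj = nn.Linear(dim, dim)     # scene-graph node embeddings
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, pivot_vocab)

    def forward(self, regions, sg_nodes, pivot_emb):
        # Fuse regions with SG structure into a context, then decode pivot tokens.
        ctx = self.img_proj(regions).mean(1) + self.sg_proj(sg_nodes).mean(1)
        h, _ = self.decoder(pivot_emb, ctx.unsqueeze(0))
        return self.out(h)                     # (B, T, pivot_vocab)

class PivotToTargetTranslator(nn.Module):
    def __init__(self, dim=512, target_vocab=32000):
        super().__init__()
        self.sc_proj = nn.Linear(dim, dim)     # constituency-tree node embeddings
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, target_vocab)

    def forward(self, pivot_states, sc_nodes, target_emb):
        # Condition translation on pivot states plus SC structure.
        ctx = pivot_states.mean(1) + self.sc_proj(sc_nodes).mean(1)
        h, _ = self.decoder(target_emb, ctx.unsqueeze(0))
        return self.out(h)                     # (B, T, target_vocab)

# Training would add (i) cross-modal SG alignment between image and pivot SGs,
# (ii) cross-lingual SC alignment between pivot and target trees, and
# (iii) cross-modal/cross-lingual back-translation losses over both stages.
```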
Related papers
- Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval [40.83470534691711]
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries.
One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs.
We propose LECCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations.
arXiv Detail & Related papers (2024-09-30T05:25:51Z) - Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal
- Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval [57.98555925471121]
Cross-lingual cross-modal retrieval has attracted increasing attention.
Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation.
We propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR.
arXiv Detail & Related papers (2023-09-11T13:44:46Z) - DiffCloth: Diffusion Based Garment Synthesis and Manipulation via
- DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment [124.57488600605822]
Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments.
We introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation.
Experiments on the CM-Fashion benchmark demonstrate that DiffCloth yields state-of-the-art garment synthesis results.
arXiv Detail & Related papers (2023-08-22T05:43:33Z) - Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning [36.14667941845198]
Cross-lingual image captioning is a challenging task that requires addressing both cross-lingual and cross-modal obstacles.
We propose an Embedded Heterogeneous Attention Transformer (EHAT) to establish cross-domain relationships between images and different languages.
We evaluate our approach on the MSCOCO dataset, generating captions in English and Chinese, two languages from markedly different language families.
arXiv Detail & Related papers (2023-07-19T11:35:21Z) - Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
- Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals that phonemic transcription provides essential information beyond orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z) - Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
- Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process.
Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level.
arXiv Detail & Related papers (2021-06-11T17:05:56Z) - UC2: Universal Cross-lingual Cross-modal Vision-and-Language
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z) - Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences across languages, modeling cross-lingual positional relationships might help self-attention networks (SANs) bridge this gap.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)