Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual
Cross-modal Structure-pivoted Alignment
- URL: http://arxiv.org/abs/2305.12260v2
- Date: Thu, 25 May 2023 04:02:17 GMT
- Authors: Shengqiong Wu, Hao Fei, Wei Ji, Tat-Seng Chua
- Abstract summary: Unpaired cross-lingual image captioning has long suffered from irrelevancy and disfluency issues.
In this work, we propose to address the above problems by incorporating the scene graph (SG) structures and the syntactic constituency (SC) trees.
Our captioner contains the semantic structure-guided image-to-pivot captioning and the syntactic structure-guided pivot-to-target translation.
- Score: 81.00183950655924
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Unpaired cross-lingual image captioning has long suffered from irrelevancy
and disfluency issues, due to inconsistencies in the semantic scene and
syntax attributes during transfer. In this work, we propose to address these
problems by incorporating scene graph (SG) structures and syntactic
constituency (SC) trees. Our captioner comprises semantic structure-guided
image-to-pivot captioning and syntactic structure-guided pivot-to-target
translation, which are joined via the pivot language. We then take the SG and
SC structures as pivots, performing cross-modal semantic structure alignment
and cross-lingual syntactic structure alignment learning. We further introduce
cross-lingual & cross-modal back-translation training to fully align the
captioning and translation stages. Experiments on English-Chinese transfer
show that our model greatly improves captioning relevancy and fluency.
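The abstract's two-stage pivot pipeline (image-to-pivot captioning, then pivot-to-target translation, checked by back-translation) can be sketched as follows. This is a minimal, hypothetical illustration with toy dictionary-based stand-ins for the captioner and translator; none of the function names, lexicons, or data come from the paper.

```python
# Hypothetical sketch of the two-stage pivot pipeline: image features are
# captioned into a pivot language, then translated into the target
# language, with back-translation used as a consistency check.
# All "models" below are toy dictionary lookups, invented for illustration.

def image_to_pivot(image_objects):
    """Stage 1 stand-in: semantic structure-guided captioning into the
    pivot language (here, a phrase per detected object)."""
    vocab = {"dog": "a dog", "ball": "a ball"}
    return " and ".join(vocab[obj] for obj in image_objects)

def pivot_to_target(pivot_caption, lexicon):
    """Stage 2 stand-in: syntactic structure-guided translation of the
    pivot caption into the target language (toy word-by-word lookup)."""
    return " ".join(lexicon.get(w, w) for w in pivot_caption.split())

def back_translate(target_caption, reverse_lexicon):
    """Back-translation used to check pivot/target consistency."""
    return " ".join(reverse_lexicon.get(w, w) for w in target_caption.split())

# Toy pivot->target lexicon (English to French here, purely illustrative)
# and its reverse for back-translation.
lexicon = {"a": "un", "dog": "chien", "and": "et", "ball": "ballon"}
reverse_lexicon = {v: k for k, v in lexicon.items()}

pivot = image_to_pivot(["dog", "ball"])       # "a dog and a ball"
target = pivot_to_target(pivot, lexicon)      # "un chien et un ballon"
round_trip = back_translate(target, reverse_lexicon)

# A back-translation consistency signal: the round trip should
# reproduce the pivot caption when the two stages agree.
consistent = (round_trip == pivot)
print(pivot, "|", target, "|", consistent)
```

In the actual model both stages are neural and the back-translation signal is a training loss rather than an equality check; the round-trip comparison above only illustrates the consistency idea.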
Related papers
- Lost in Translation, Found in Embeddings: Sign Language Translation and Alignment [84.39962912136525]
We develop a model for sign language understanding that performs sign language translation (SLT) and sign-subtitle alignment (SSA). Our approach is built upon three components: (i) a lightweight visual backbone that captures manual and non-manual cues from human keypoints and lip-region images; (ii) a Sliding Perceiver mapping network that aggregates consecutive visual features into word-level embeddings; and (iii) a multi-task scalable training strategy that jointly optimises SLT and SSA.
arXiv Detail & Related papers (2025-12-08T21:05:46Z)
- TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation [8.48847068018671]
This paper proposes TFANet, a Three-stage Image-Text Feature Alignment Network. It enhances multimodal alignment through a hierarchical framework comprising three stages: Knowledge Plus Stage (KPS), Knowledge Fusion Stage (KFS), and Knowledge Intensification Stage (KIS). In the KPS, we design the Multiscale Linear Cross-Attention Module (MLAM), which establishes rich and efficient alignment between image regions and different granularities of linguistic descriptions. The KFS further strengthens feature alignment through the Cross-modal Feature Scanning Module (CFSM), which applies multimodal selective scanning to capture long-range dependencies.
arXiv Detail & Related papers (2025-09-16T13:26:58Z)
- Linguistics-Vision Monotonic Consistent Network for Sign Language Production [45.12628941399177]
Sign Language Production (SLP) aims to generate sign videos corresponding to spoken language sentences.
Due to the cross-modal semantic gap, SLP faces huge challenges in linguistics-vision consistency.
We propose a Transformer-based Linguistics-Vision Monotonic Consistent Network (LVMCN) for SLP.
arXiv Detail & Related papers (2024-12-22T09:28:06Z)
- Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval [40.83470534691711]
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries.
One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs.
We propose LE CCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations.
arXiv Detail & Related papers (2024-09-30T05:25:51Z)
- Dual-view Curricular Optimal Transport for Cross-lingual Cross-modal Retrieval [57.98555925471121]
Cross-lingual cross-modal retrieval has attracted increasing attention.
Most CCR methods construct pseudo-parallel vision-language corpora via Machine Translation.
We propose Dual-view Curricular Optimal Transport (DCOT) to learn with noisy correspondence in CCR.
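DCOT's core tool, optimal transport, can be illustrated with a plain entropic Sinkhorn iteration that softly matches two sets of items under a cost matrix. This is a generic OT sketch, not the paper's dual-view or curricular algorithm; the cost values and hyperparameters below are invented for the example.

```python
# Generic entropic optimal transport via Sinkhorn scaling: given a cost
# matrix between two sets (e.g. cross-modal distances), compute a soft
# transport plan with uniform marginals. Purely illustrative toy values.
import math

def sinkhorn(cost, n_iters=200, eps=0.1):
    """Return a transport plan for a square cost matrix with uniform
    row/column marginals, via entropic Sinkhorn iterations."""
    n = len(cost)
    K = [[math.exp(-c / eps) for c in row] for row in cost]  # Gibbs kernel
    u = [1.0] * n
    v = [1.0] * n
    r = [1.0 / n] * n       # uniform row marginals
    c_marg = [1.0 / n] * n  # uniform column marginals
    for _ in range(n_iters):
        u = [r[i] / sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        v = [c_marg[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(n)]
    return [[u[i] * K[i][j] * v[j] for j in range(n)] for i in range(n)]

# Toy cost matrix: item i matches item i cheaply (clean correspondence);
# off-diagonal pairs (noisy correspondences) are expensive.
cost = [[0.1, 1.0, 1.0],
        [1.0, 0.1, 1.0],
        [1.0, 1.0, 0.1]]
plan = sinkhorn(cost)

# The plan concentrates mass on the cheap (diagonal) matches.
best = [max(range(3), key=lambda j: row[j]) for row in plan]
print(best)
```

In OT-based noisy-correspondence learning, the entries of such a plan serve as soft matching weights, down-weighting mismatched pairs instead of trusting the pseudo-parallel data outright.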
arXiv Detail & Related papers (2023-09-11T13:44:46Z)
- DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment [124.57488600605822]
Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments.
We introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation.
Experiments on the CM-Fashion benchmark demonstrate that DiffCloth yields state-of-the-art garment synthesis results.
arXiv Detail & Related papers (2023-08-22T05:43:33Z)
- Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning [36.14667941845198]
Cross-lingual image captioning is a challenging task that requires addressing both cross-lingual and cross-modal obstacles.
We propose an Embedded Heterogeneous Attention Transformer (EHAT) to establish cross-domain relationships between images and different languages.
We evaluate our approach on the MSCOCO dataset to generate captions in English and Chinese, two languages that exhibit significant differences in their language families.
arXiv Detail & Related papers (2023-07-19T11:35:21Z)
- Enhancing Cross-lingual Transfer via Phonemic Transcription Integration [57.109031654219294]
PhoneXL is a framework incorporating phonemic transcriptions as an additional linguistic modality for cross-lingual transfer.
Our pilot study reveals that phonemic transcription provides essential information beyond orthography to enhance cross-lingual transfer.
arXiv Detail & Related papers (2023-07-10T06:17:33Z)
- Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process.
Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level.
arXiv Detail & Related papers (2021-06-11T17:05:56Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve word-order information in natural language processing tasks, generating fixed position indices for input sequences.
Due to word-order divergences across languages, modeling cross-lingual positional relationships may help self-attention networks (SANs) tackle this problem.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.