CISum: Learning Cross-modality Interaction to Enhance Multimodal
Semantic Coverage for Multimodal Summarization
- URL: http://arxiv.org/abs/2302.09934v1
- Date: Mon, 20 Feb 2023 11:57:23 GMT
- Title: CISum: Learning Cross-modality Interaction to Enhance Multimodal
Semantic Coverage for Multimodal Summarization
- Authors: Litian Zhang, Xiaoming Zhang, Ziming Guo, Zhipeng Liu
- Abstract summary: This paper proposes a multi-task cross-modality learning framework (CISum) to improve multimodal semantic coverage.
To obtain the visual semantics, we translate images into visual descriptions based on the correlation with text content.
Then, the visual description and text content are fused to generate the textual summary to capture the semantics of the multimodal content.
- Score: 2.461695698601437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal summarization (MS) aims to generate a summary from multimodal
input. Previous works mainly focus on textual semantic coverage metrics such as
ROUGE, which considers the visual content as supplemental data. Therefore, the
summary is ineffective to cover the semantics of different modalities. This
paper proposes a multi-task cross-modality learning framework (CISum) to
improve multimodal semantic coverage by learning the cross-modality interaction
in the multimodal article. To obtain the visual semantics, we translate images
into visual descriptions based on the correlation with text content. Then, the
visual description and text content are fused to generate the textual summary
to capture the semantics of the multimodal content, and the most relevant image
is selected as the visual summary. Furthermore, we design an automatic
multimodal semantics coverage metric to evaluate the performance. Experimental
results show that CISum outperforms baselines in multimodal semantics coverage
metrics while maintaining the excellent performance of ROUGE and BLEU.
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification [31.011118085494942]
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities.
Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics.
We propose an Embedding and Enriching Explicit Semantics framework to learn semantically rich cross-modality pedestrian representations.
arXiv Detail & Related papers (2024-12-11T14:27:30Z) - Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks [0.8999666725996978]
We propose a novel RSSC framework that integrates text descriptions generated by large vision-language models (VLMs) as an auxiliary modality without incurring expensive manual annotation costs.
Experiments with both quantitative and qualitative evaluation across five RSSC datasets demonstrate that our framework consistently outperforms baseline models.
arXiv Detail & Related papers (2024-12-03T16:24:16Z) - Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval [40.83470534691711]
Cross-lingual cross-modal retrieval ( CCR) aims to retrieve visually relevant content based on non-English queries.
One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs.
We propose LE CCR, a novel solution that incorporates the multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations.
arXiv Detail & Related papers (2024-09-30T05:25:51Z) - Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS)
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - A Multi-Modal Context Reasoning Approach for Conditional Inference on
Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding data sets and experimental results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z) - TLDW: Extreme Multimodal Summarisation of News Videos [76.50305095899958]
We introduce eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR.
XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary.
Our method is trained, without using reference summaries, by optimising the visual and textual coverage from the perspectives of the distance between the semantic distributions under optimal transport plans.
arXiv Detail & Related papers (2022-10-16T08:19:59Z) - Semantics-Consistent Cross-domain Summarization via Optimal Transport
Alignment [80.18786847090522]
We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation.
We evaluated our method on three recent multimodal datasets and demonstrated the effectiveness of our method in producing high-quality multimodal summaries.
arXiv Detail & Related papers (2022-10-10T14:27:10Z) - MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework by interacting visual and language domains.
Our method contains video and textual segmentation and summarization module, respectively.
It formulates a cross-domain alignment objective with optimal transport distance to generate the representative and textual summary.
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.