Semantics-Consistent Cross-domain Summarization via Optimal Transport
Alignment
- URL: http://arxiv.org/abs/2210.04722v1
- Date: Mon, 10 Oct 2022 14:27:10 GMT
- Title: Semantics-Consistent Cross-domain Summarization via Optimal Transport
Alignment
- Authors: Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Franck Dernoncourt, Trung Bui,
Zhaowen Wang, Bo Li, Ding Zhao, Hailin Jin
- Abstract summary: We propose a Semantics-Consistent Cross-domain Summarization model based on optimal transport alignment with visual and textual segmentation.
We evaluated our method on three recent multimodal datasets and demonstrated its effectiveness in producing high-quality multimodal summaries.
- Score: 80.18786847090522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimedia summarization with multimodal output (MSMO) is a recently explored
application in language grounding. It plays an essential role in real-world applications, e.g., automatically generating cover images and titles for news articles or providing introductions to online videos. However, existing methods extract features from the whole video and article and use fusion methods to select the representative content, thus usually ignoring the critical structure and varying semantics. In this work, we propose a Semantics-Consistent Cross-domain Summarization (SCCS) model based on optimal transport alignment with visual and textual segmentation. Specifically, our method first decomposes the video and the article into segments to capture their structural semantics. Then, SCCS applies a cross-domain alignment objective with optimal transport distance, leveraging multimodal interaction to match and select the visual and textual summary. We evaluated our method on three recent multimodal datasets, demonstrating its effectiveness in producing high-quality multimodal summaries.
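To make the alignment step concrete, below is a minimal, illustrative Python (NumPy-only) sketch of cross-domain alignment with an entropy-regularized optimal transport (Sinkhorn) solver. It is not the authors' released implementation: the segment encoders are assumed to be given, and the uniform marginals, cosine cost, regularization strength, and argmax selection rule are illustrative assumptions. Given pre-computed embeddings of video segments and article segments, it computes a transport plan over the cosine-distance cost and reads off the most strongly coupled (video segment, text segment) pair as the multimodal summary.

import numpy as np

def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    # Entropy-regularized optimal transport with uniform marginals
    # (plain Sinkhorn iterations; hyperparameters are illustrative).
    m, n = cost.shape
    a = np.full(m, 1.0 / m)             # uniform mass over video segments
    b = np.full(n, 1.0 / n)             # uniform mass over text segments
    K = np.exp(-cost / reg)             # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]  # transport plan P of shape (m, n)

def align_and_select(video_emb, text_emb):
    # video_emb: (m, d) and text_emb: (n, d) segment embeddings produced by
    # hypothetical visual/textual segment encoders.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    cost = 1.0 - v @ t.T                # cosine distance between segments
    plan = sinkhorn_plan(cost)
    ot_distance = float((plan * cost).sum())
    vi, ti = np.unravel_index(plan.argmax(), plan.shape)
    return vi, ti, ot_distance          # selected visual/textual summary indices

# Toy usage with random vectors standing in for segment features.
rng = np.random.default_rng(0)
v_idx, t_idx, dist = align_and_select(rng.normal(size=(6, 128)),
                                      rng.normal(size=(9, 128)))
print(f"selected video segment {v_idx}, text segment {t_idx}, OT distance {dist:.3f}")

In this sketch the OT distance plays the role of a cross-domain semantic-consistency score between the two segmentations, while the transport plan doubles as a soft matching from which a representative segment pair is selected; the actual SCCS objective and selection procedure may differ in detail.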
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization [2.461695698601437]
This paper proposes a multi-task cross-modality learning framework (CISum) to improve multimodal semantic coverage.
To obtain the visual semantics, we translate images into visual descriptions based on the correlation with text content.
Then, the visual description and text content are fused to generate the textual summary to capture the semantics of the multimodal content.
arXiv Detail & Related papers (2023-02-20T11:57:23Z)
- TLDW: Extreme Multimodal Summarisation of News Videos [76.50305095899958]
We introduce eXtreme Multimodal Summarisation with Multimodal Output (XMSMO) for the scenario of TL;DW - Too Long; Didn't Watch, akin to TL;DR.
XMSMO aims to summarise a video-document pair into a summary with an extremely short length, which consists of one cover frame as the visual summary and one sentence as the textual summary.
Our method is trained, without using reference summaries, by optimising the visual and textual coverage from the perspectives of the distance between the semantic distributions under optimal transport plans.
arXiv Detail & Related papers (2022-10-16T08:19:59Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework that interacts the visual and language domains.
Our method contains video and textual segmentation and summarization modules.
It formulates a cross-domain alignment objective with optimal transport distance to generate the representative visual and textual summary.
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
- Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z)
- Comprehensive Multi-Modal Interactions for Referring Image Segmentation [7.064383217512461]
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the given natural language description.
To solve RIS efficiently, we need to understand each word's relationship with other words, each region in the image to other regions, and cross-modal alignment between linguistic and visual domains.
We propose a Joint Reasoning (JRM) module and a novel Cross-Modal Multi-Level Fusion (CMMLF) module for tackling this task.
arXiv Detail & Related papers (2021-04-21T08:45:09Z)