Fusing Multimodal Signals on Hyper-complex Space for Extreme Abstractive
Text Summarization (TL;DR) of Scientific Contents
- URL: http://arxiv.org/abs/2306.13968v1
- Date: Sat, 24 Jun 2023 13:51:42 GMT
- Title: Fusing Multimodal Signals on Hyper-complex Space for Extreme Abstractive
Text Summarization (TL;DR) of Scientific Contents
- Authors: Yash Kumar Atri, Vikram Goyal, Tanmoy Chakraborty
- Abstract summary: We deal with a novel task of extreme abstractive text summarization (aka TL;DR generation) by leveraging multiple input modalities.
The mTLDR dataset comprises a total of 4,182 instances collected from various academic conference proceedings.
We present mTLDRgen, an encoder-decoder-based model that employs a novel dual-fused hyper-complex Transformer.
- Score: 26.32569293387399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The realm of scientific text summarization has experienced remarkable
progress due to the availability of annotated brief summaries and ample data.
However, the utilization of multiple input modalities, such as videos and
audio, has yet to be thoroughly explored. At present, scientific
multimodal-input-based text summarization systems tend to employ longer target
summaries, such as abstracts, leading to underwhelming performance on the
text summarization task.
In this paper, we deal with a novel task of extreme abstractive text
summarization (aka TL;DR generation) by leveraging multiple input modalities.
To this end, we introduce mTLDR, a first-of-its-kind dataset for the
aforementioned task, comprising videos, audio, and text, along with both
author-composed summaries and expert-annotated summaries. The mTLDR dataset
comprises a total of 4,182 instances collected from various academic
conference proceedings, such as ICLR, ACL, and CVPR. Subsequently, we present
mTLDRgen, an encoder-decoder-based model that employs a novel dual-fused
hyper-complex Transformer combined with a Wasserstein Riemannian Encoder
Transformer, to capture the intricate interactions between different
modalities in a hyper-complex latent geometric space. The hyper-complex
Transformer captures the intrinsic properties between the modalities, while the
Wasserstein Riemannian Encoder Transformer captures the latent structure of the
modalities in the latent space geometry, thereby enabling the model to produce
diverse sentences. mTLDRgen outperforms 20 baselines on mTLDR as well as
another non-scientific dataset (How2) across three Rouge-based evaluation
measures. Furthermore, based on the qualitative metrics, BERTScore and FEQA,
and human evaluations, we demonstrate that the summaries generated by mTLDRgen
are fluent and congruent with the original source material.
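
To make the fusion idea concrete, below is a minimal sketch of a quaternion
(hyper-complex) linear layer that mixes text, video, and audio features via
the Hamilton product. The abstract does not specify mTLDRgen's layers, so all
names and design choices here are illustrative assumptions rather than the
authors' implementation; the Wasserstein Riemannian encoder is not sketched.

```python
# Hypothetical sketch: a quaternion (hyper-complex) linear layer that mixes
# text, video, and audio features via the Hamilton product. Everything below
# is an illustrative assumption, not mTLDRgen's actual architecture.
import torch
import torch.nn as nn

class QuaternionLinear(nn.Module):
    """Linear map on quaternion-valued features (4 components of size d)."""
    def __init__(self, d: int):
        super().__init__()
        # One weight matrix per quaternion component of the weight.
        self.w_r = nn.Linear(d, d, bias=False)
        self.w_i = nn.Linear(d, d, bias=False)
        self.w_j = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)

    def forward(self, r, i, j, k):
        # Hamilton product of (r, i, j, k) with the weight quaternion:
        # each output component blends all four inputs, which is what lets
        # a hyper-complex layer share information across modalities.
        out_r = self.w_r(r) - self.w_i(i) - self.w_j(j) - self.w_k(k)
        out_i = self.w_i(r) + self.w_r(i) + self.w_k(j) - self.w_j(k)
        out_j = self.w_j(r) - self.w_k(i) + self.w_r(j) + self.w_i(k)
        out_k = self.w_k(r) + self.w_j(i) - self.w_i(j) + self.w_r(k)
        return out_r, out_i, out_j, out_k

# Illustrative use: place text in the real component and video/audio in two
# imaginary components (the fourth is zero-initialized).
d, seq = 512, 128
text, video, audio = (torch.randn(seq, d) for _ in range(3))
fuse = QuaternionLinear(d)
fused = fuse(text, video, audio, torch.zeros(seq, d))
```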
Related papers
- SKT5SciSumm -- Revisiting Extractive-Generative Approach for Multi-Document Scientific Summarization [24.051692189473723]
We propose SKT5SciSumm, a hybrid framework for multi-document scientific summarization (MDSS).
We leverage the Sentence-Transformer version of Scientific Paper Embeddings using Citation-Informed Transformers (SPECTER) to encode and represent textual sentences.
We employ the T5 family of models to generate abstractive summaries using extracted sentences.
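
As an illustration of the extract-then-generate pipeline this entry
describes, the sketch below encodes sentences with SPECTER and rewrites the
selected ones with T5; the centroid-based selection is a placeholder
assumption, not SKT5SciSumm's actual strategy.

```python
# Illustrative extract-then-generate pipeline in the spirit of SKT5SciSumm.
# Only the SPECTER encoder and a T5 summarizer come from the entry above;
# the sentence-selection step is an assumption.
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

encoder = SentenceTransformer("sentence-transformers/allenai-specter")
summarizer = pipeline("summarization", model="t5-base")

def extract_then_generate(sentences, k=5):
    # Citation-informed SPECTER embeddings for each candidate sentence.
    emb = encoder.encode(sentences, convert_to_tensor=True)
    # Placeholder selection: keep the k sentences closest to the centroid.
    centroid = emb.mean(dim=0, keepdim=True)
    scores = util.cos_sim(emb, centroid).squeeze(-1)
    top = scores.topk(min(k, len(sentences))).indices.sort().values
    extracted = " ".join(sentences[int(i)] for i in top)
    # Abstractive rewrite of the extracted content with T5.
    return summarizer(extracted, max_length=64)[0]["summary_text"]
```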
arXiv Detail & Related papers (2024-02-27T08:33:31Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
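
A minimal sketch of a dual-direction contrastive objective over aligned
video/text pairs, in the spirit of A2Summ's contrastive losses; the paper's
actual objectives also exploit intra-sample temporal alignment, which this
sketch omits.

```python
# Symmetric (dual-direction) InfoNCE over aligned video/text pairs.
# A simplified stand-in for A2Summ's dual contrastive losses.
import torch
import torch.nn.functional as F

def dual_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Normalize so dot products are cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(v.size(0))     # i-th video matches i-th text
    # Contrast in both directions: video->text and text->video.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```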
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- Hierarchical3D Adapters for Long Video-to-text Summarization [79.01926022762093]
Our experiments demonstrate that multimodal information offers superior performance over more memory-heavy and fully fine-tuned textual summarization methods.
arXiv Detail & Related papers (2022-10-10T16:44:36Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
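
A hypothetical cross-modal integration unit in this spirit: each stream
queries the other modality via cross-attention before the two are merged.
TransCMD's actual units and top-down cascade are more elaborate.

```python
# Sketch of one cross-modal integration unit: RGB tokens attend to depth
# tokens and vice versa, then the streams are merged. Assumes both streams
# share the same token count; names are illustrative.
import torch
import torch.nn as nn

class CrossModalUnit(nn.Module):
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.rgb_to_depth = nn.MultiheadAttention(d, heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(d, heads, batch_first=True)
        self.merge = nn.Linear(2 * d, d)

    def forward(self, rgb, depth):
        # Each stream queries the other modality for complementary cues.
        rgb_ctx, _ = self.rgb_to_depth(rgb, depth, depth)
        depth_ctx, _ = self.depth_to_rgb(depth, rgb, rgb)
        return self.merge(torch.cat([rgb + rgb_ctx, depth + depth_ctx], dim=-1))
```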
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in Rouge F1.
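
The sparsity pattern can be illustrated with a simple mask combining local
sliding-window attention and a few global tokens; HETFORMER's heterogeneous
granularities (tokens, sentences, entities) are richer than this sketch.

```python
# Sketch of a sparse attention mask: local windows for all tokens plus a few
# global tokens that attend everywhere. Illustrates the sparsity pattern
# only, not HETFORMER's multi-granularity design.
import torch

def sparse_attention_mask(seq_len, window=4, global_tokens=(0,)):
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True            # local sliding window
    for g in global_tokens:
        mask[g, :] = True                # global token attends to all
        mask[:, g] = True                # and all tokens attend to it
    return mask                          # True = attention allowed
```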
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization [131.23966358405767]
We adapt TP-TRANSFORMER with the explicitly compositional Tensor-Product Representation (TPR) for the task of abstractive summarization.
A key feature of our model is a structural bias that we introduce by encoding two separate representations for each token.
We show that our TP-TRANSFORMER outperforms the Transformer and the original TP-TRANSFORMER significantly on several abstractive summarization datasets.
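
A minimal sketch of the role/filler binding behind this structural bias,
following the general TP-Transformer recipe: each token gets a content
("filler") vector and a "role" vector, bound by an element-wise product.
The details below are simplified assumptions.

```python
# Role/filler binding for a TPR-style structural bias: two per-token
# representations are bound with a Hadamard product before projection.
import torch
import torch.nn as nn

class TPRBinding(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.filler = nn.Linear(d, d)   # content representation
        self.role = nn.Linear(d, d)     # structural/relational representation
        self.out = nn.Linear(d, d)

    def forward(self, x):
        # Hadamard-product binding of the two per-token representations.
        return self.out(self.filler(x) * self.role(x))
```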
arXiv Detail & Related papers (2021-06-02T17:32:33Z)
- See, Hear, Read: Leveraging Multimodality with Guided Attention for Abstractive Text Summarization [14.881597737762316]
We introduce the first large-scale dataset for abstractive text summarization with videos of diverse duration, compiled from presentations in well-known academic conferences like NDSS, ICML, NeurIPS, etc.
We then propose a factorized multimodal Transformer-based decoder-only language model, which inherently captures the intra-modal and inter-modal dynamics within various input modalities for the text summarization task.
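
A rough sketch of what a factorized multimodal block can look like:
self-attention within each modality (intra-modal), then shared attention
over the concatenated streams (inter-modal). Names and sizes are
illustrative, not the paper's architecture.

```python
# Factorized multimodal attention: per-modality self-attention followed by
# a shared attention layer over the concatenated streams. Illustrative only.
import torch
import torch.nn as nn

class FactorizedMultimodalBlock(nn.Module):
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.intra_text = nn.MultiheadAttention(d, heads, batch_first=True)
        self.intra_av = nn.MultiheadAttention(d, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, text, audiovisual):
        # Intra-modal dynamics within each stream.
        t, _ = self.intra_text(text, text, text)
        av, _ = self.intra_av(audiovisual, audiovisual, audiovisual)
        # Inter-modal dynamics across the concatenated streams.
        joint = torch.cat([text + t, audiovisual + av], dim=1)
        fused, _ = self.inter(joint, joint, joint)
        return fused
```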
arXiv Detail & Related papers (2021-05-20T08:56:33Z)
- Data Augmentation for Abstractive Query-Focused Multi-Document Summarization [129.96147867496205]
We present two QMDS training datasets, which we construct using two data augmentation methods.
These two datasets have complementary properties, i.e., QMDSCNN has real summaries but queries are simulated, while QMDSIR has real queries but simulated summaries.
We build end-to-end neural network models on the combined datasets that yield new state-of-the-art transfer results on DUC datasets.
arXiv Detail & Related papers (2021-03-02T16:57:01Z)