Sample Efficient Multimodal Semantic Augmentation for Incremental
Summarization
- URL: http://arxiv.org/abs/2303.04361v1
- Date: Wed, 8 Mar 2023 03:58:06 GMT
- Title: Sample Efficient Multimodal Semantic Augmentation for Incremental
Summarization
- Authors: Sumanta Bhattacharyya, Ramesh Manuvinakurike, Sahisnu Mazumder, Saurav
Sahay
- Abstract summary: We develop a prompting approach for incremental summarization of task videos.
We leverage an existing model for extracting semantic concepts from images.
We show the results on a relevant dataset and discuss possible directions for the work.
- Score: 13.529904498331673
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we develop a prompting approach for incremental summarization
of task videos. We develop a sample-efficient few-shot approach for extracting
semantic concepts as an intermediate step. We leverage an existing model for
extracting the concepts from images, extend it to videos, and introduce a
clustering and querying approach for sample efficiency, motivated by recent
advances in perceiver-based architectures. Our work provides further evidence
that enriching the input context with relevant entities and actions from the
videos, and using these as prompts, can enhance the summaries generated by the
model. We show results on a relevant dataset and discuss possible directions
for future work.
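The clustering-and-querying idea in the abstract can be illustrated with a minimal sketch (not the authors' code; the feature vectors, cluster count, and concept-extractor call are all hypothetical): cluster per-frame features, query one representative frame per cluster, and run the concept extractor only on those representatives before assembling a prompt.

```python
# Illustrative sketch, assuming per-frame feature vectors are available.
# Not the paper's implementation: the k-means routine, cluster count, and
# "concepts(...)" extractor stand in for unspecified components.
import math
import random


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(features, k, iters=20, seed=0):
    """Tiny k-means over lists of floats; returns centers and labels."""
    rng = random.Random(seed)
    centers = rng.sample(features, k)
    labels = [0] * len(features)
    for _ in range(iters):
        # Assignment step: each frame joins its nearest center.
        for i, f in enumerate(features):
            labels[i] = min(range(k), key=lambda c: dist(f, centers[c]))
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [f for f, l in zip(features, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


def representative_frames(features, k):
    """Query step: pick the frame index closest to each cluster center."""
    centers, labels = kmeans(features, k)
    reps = []
    for c in range(k):
        idxs = [i for i, l in enumerate(labels) if l == c]
        if idxs:
            reps.append(min(idxs, key=lambda i: dist(features[i], centers[c])))
    return sorted(reps)


# Toy per-frame features: two well-separated groups of video frames.
frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
reps = representative_frames(frames, k=2)

# Hypothetical concept extraction on the representative frames only,
# then folded into a richer prompt for the summarizer.
concepts = [f"concepts(frame {i})" for i in reps]
prompt = "Entities/actions: " + "; ".join(concepts) + "\nSummarize the task so far:"
print(reps)
```

The point of the sketch is the cost model: the (expensive) concept extractor is invoked k times rather than once per frame, which is the sample-efficiency claim in the abstract.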
Related papers
- An Integrated Framework for Multi-Granular Explanation of Video Summarization [6.076406622352117]
This framework integrates methods for producing explanations both at the fragment level and at the visual object level.
The performance of the developed framework is evaluated using a state-of-the-art summarization method and two datasets.
arXiv Detail & Related papers (2024-05-16T13:25:36Z) - Explore In-Context Segmentation via Latent Diffusion Models [132.26274147026854]
We find that a latent diffusion model (LDM) is an effective minimalist framework for in-context segmentation.
We build a new and fair in-context segmentation benchmark that includes both image and video datasets.
arXiv Detail & Related papers (2024-03-14T17:52:31Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z) - Learning Summary-Worthy Visual Representation for Abstractive
Summarization in Video [34.202514532882]
We propose a novel approach to learning the summary-worthy visual representation that facilitates abstractive summarization.
Our method exploits summary-worthy information from both the cross-modal transcript data and the knowledge distilled from the pseudo summary.
arXiv Detail & Related papers (2023-05-08T16:24:46Z) - Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z) - Summary-Oriented Vision Modeling for Multimodal Abstractive
Summarization [63.320005222549646]
Multimodal abstractive summarization (MAS) aims to produce a concise summary given multimodal data (text and vision).
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-12-15T09:05:26Z) - AARGH! End-to-end Retrieval-Generation for Task-Oriented Dialog [3.42658286826597]
AARGH is an end-to-end task-oriented dialog system combining retrieval and generative approaches in a single model.
We show that our approach produces more diverse outputs while maintaining or improving state tracking and context-to-response generation performance.
arXiv Detail & Related papers (2022-09-08T08:15:22Z) - Support-set based Multi-modal Representation Enhancement for Video
Captioning [121.70886789958799]
We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
arXiv Detail & Related papers (2022-05-19T03:40:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.