Sample Efficient Multimodal Semantic Augmentation for Incremental Summarization
- URL: http://arxiv.org/abs/2303.04361v1
- Date: Wed, 8 Mar 2023 03:58:06 GMT
- Title: Sample Efficient Multimodal Semantic Augmentation for Incremental Summarization
- Authors: Sumanta Bhattacharyya, Ramesh Manuvinakurike, Sahisnu Mazumder, Saurav Sahay
- Abstract summary: We develop a prompting approach for incremental summarization of task videos.
We leverage an existing model for extracting semantic concepts from images.
We show the results on a relevant dataset and discuss possible directions for the work.
- Score: 13.529904498331673
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we develop a prompting approach for incremental summarization
of task videos. We develop a sample-efficient few-shot approach for extracting semantic concepts as an intermediate step. We leverage an existing model for extracting the concepts from images, extend it to videos, and introduce a clustering and querying approach for sample efficiency, motivated by recent advances in perceiver-based architectures. Our work provides further evidence that enriching the input context with relevant entities and actions from the videos, and using these as prompts, can enhance the summaries generated by the model. We show results on a relevant dataset and discuss possible directions for the work.
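The abstract outlines a pipeline: extract semantic concepts from video frames, cluster and query frames for sample efficiency, and feed the extracted concepts as prompts to a summarizer. The following minimal Python sketch is only an illustration of how such a pipeline could be wired together; the helper names (`extract_concepts`, `summarize`) and the use of k-means over frame features are assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): cluster per-frame features,
# query concepts only for cluster representatives, then prompt a summarizer
# with the extracted entities/actions as extra context.
from typing import Callable, List, Sequence
import numpy as np
from sklearn.cluster import KMeans


def representative_frames(frame_features: np.ndarray, n_clusters: int = 8) -> List[int]:
    """Cluster frame features and return one frame index per cluster
    (the frame closest to each centroid)."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(frame_features)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(frame_features[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dists)]))
    return sorted(reps)


def build_prompt(segment_text: str, concepts: Sequence[str]) -> str:
    """Enrich the input context with entities/actions extracted from the video."""
    return (
        "Relevant entities and actions: " + ", ".join(sorted(set(concepts))) + "\n"
        "Transcript segment: " + segment_text + "\n"
        "Summarize the task step so far:"
    )


def incremental_summary(
    segment_text: str,
    frame_features: np.ndarray,
    extract_concepts: Callable[[int], List[str]],  # hypothetical image -> concepts model
    summarize: Callable[[str], str],               # hypothetical text summarizer (e.g. an LLM call)
) -> str:
    reps = representative_frames(frame_features, n_clusters=min(8, len(frame_features)))
    concepts = [c for idx in reps for c in extract_concepts(idx)]
    return summarize(build_prompt(segment_text, concepts))
```

The design intent assumed here is that querying the concept extractor only for cluster representatives, rather than for every frame, is what makes the approach sample efficient.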
Related papers
- Personalized Video Summarization using Text-Based Queries and Conditional Modeling [3.4447129363520337]
This thesis explores enhancing video summarization by integrating text-based queries and conditional modeling.
Evaluation metrics such as accuracy and F1-score assess the quality of the generated summaries.
arXiv Detail & Related papers (2024-08-27T02:43:40Z)
- An Integrated Framework for Multi-Granular Explanation of Video Summarization [6.076406622352117]
This framework integrates methods for producing explanations both at the fragment level and at the visual object level.
The performance of the developed framework is evaluated using a state-of-the-art summarization method and two datasets.
arXiv Detail & Related papers (2024-05-16T13:25:36Z)
- Explore In-Context Segmentation via Latent Diffusion Models [132.26274147026854]
The latent diffusion model (LDM) is an effective minimalist approach for in-context segmentation.
We build a new and fair in-context segmentation benchmark that includes both image and video datasets.
arXiv Detail & Related papers (2024-03-14T17:52:31Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information (a minimal sketch of the vocabulary-distribution idea follows this entry).
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
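A minimal sketch of the vocabulary-distribution idea mentioned above: a learned linear projection turns each visual feature into logits over a language model's vocabulary, and a softmax yields one distribution per feature. The class name and dimensions below are assumptions for illustration, not the paper's code.

```python
# Illustrative sketch (assumed, not the paper's code): map visual feature
# vectors to probability distributions over a language model's vocabulary.
import torch
import torch.nn as nn


class VisualWordHead(nn.Module):
    def __init__(self, visual_dim: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(visual_dim, vocab_size)  # visual feature -> vocab logits

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (num_patches, visual_dim)
        logits = self.proj(visual_features)
        return logits.softmax(dim=-1)  # one distribution over the vocabulary per patch


# Usage with toy shapes: 16 patch features of size 768, a 32k-token vocabulary.
head = VisualWordHead(visual_dim=768, vocab_size=32000)
probs = head(torch.randn(16, 768))
print(probs.shape, float(probs[0].sum()))  # (16, 32000), each row sums to ~1.0
```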
- Conditional Modeling Based Automatic Video Summarization [70.96973928590958]
The aim of video summarization is to shorten videos automatically while retaining the key information necessary to convey the overall story.
Video summarization methods rely on visual factors, such as visual consecutiveness and diversity, which may not be sufficient to fully understand the content of the video.
A new approach to video summarization is proposed based on insights gained from how humans create ground truth video summaries.
arXiv Detail & Related papers (2023-11-20T20:24:45Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively (a rough sketch of this idea follows this entry).
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
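The RefSAM entry above mentions a lightweight cross-modal MLP and parameter-efficient tuning to fuse language and vision features. The sketch below illustrates that general pattern only: a small trainable MLP maps a referring-expression embedding to prompt embeddings for a frozen segmentation backbone. The module name, dimensions, and shapes are assumptions, not RefSAM's actual interface.

```python
# Rough sketch (assumptions, not RefSAM's code): a small MLP projects a
# referring-expression embedding into prompt embeddings that a frozen
# segmentation model could consume; only the MLP would be trained.
import torch
import torch.nn as nn


class CrossModalMLP(nn.Module):
    def __init__(self, text_dim: int = 512, prompt_dim: int = 256, num_prompts: int = 4):
        super().__init__()
        self.num_prompts = num_prompts
        self.prompt_dim = prompt_dim
        self.mlp = nn.Sequential(
            nn.Linear(text_dim, prompt_dim * num_prompts),
            nn.GELU(),
            nn.Linear(prompt_dim * num_prompts, prompt_dim * num_prompts),
        )

    def forward(self, text_embedding: torch.Tensor) -> torch.Tensor:
        # text_embedding: (batch, text_dim) -> (batch, num_prompts, prompt_dim)
        return self.mlp(text_embedding).view(-1, self.num_prompts, self.prompt_dim)


# Only the MLP's parameters are optimized; the vision backbone stays frozen.
mlp = CrossModalMLP()
prompts = mlp(torch.randn(2, 512))
print(prompts.shape)  # torch.Size([2, 4, 256])
```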
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework based on denoising diffusion models, which shares the same inherent spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions (a schematic sketch of this conditional denoising loop follows this entry).
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
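The entry above describes generating action predictions iteratively from random noise, conditioned on video features. Below is a schematic sketch of such a conditional refinement loop; the `denoiser` callable and the simple interpolation schedule are illustrative stand-ins, not the paper's sampler.

```python
# Schematic sketch (simplified, not the paper's sampler): start from noise over
# per-frame action logits and iteratively refine it, conditioning each step on
# the input video features.
from typing import Callable
import torch


@torch.no_grad()
def denoise_action_sequence(
    video_features: torch.Tensor,  # (num_frames, feat_dim)
    denoiser: Callable[[torch.Tensor, torch.Tensor, int], torch.Tensor],  # hypothetical model
    num_classes: int,
    steps: int = 25,
) -> torch.Tensor:
    """Iteratively refine per-frame action logits starting from random noise."""
    num_frames = video_features.shape[0]
    x = torch.randn(num_frames, num_classes)      # pure noise at the start
    for t in reversed(range(steps)):
        pred = denoiser(x, video_features, t)     # model's estimate of the clean logits
        alpha = t / steps                         # simple interpolation schedule
        x = alpha * x + (1.0 - alpha) * pred      # move toward the prediction
    return x.argmax(dim=-1)                       # per-frame action labels
```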
- Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization [63.320005222549646]
Multimodal abstractive summarization (MAS) aims to produce a concise summary given multimodal data (text and vision).
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-high-, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-12-15T09:05:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.