Support-set based Multi-modal Representation Enhancement for Video
Captioning
- URL: http://arxiv.org/abs/2205.09307v1
- Date: Thu, 19 May 2022 03:40:29 GMT
- Title: Support-set based Multi-modal Representation Enhancement for Video
Captioning
- Authors: Xiaoya Chen, Jingkuan Song, Pengpeng Zeng, Lianli Gao and Heng Tao
Shen
- Abstract summary: We propose a Support-set based Multi-modal Representation Enhancement (SMRE) model to mine rich information in a semantic subspace shared between samples.
Specifically, we propose a Support-set Construction (SC) module to construct a support-set to learn underlying connections between samples and obtain semantic-related visual elements.
During this process, we design a Semantic Space Transformation (SST) module to constrain relative distance and administrate multi-modal interactions in a self-supervised way.
- Score: 121.70886789958799
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning is a challenging task that necessitates a thorough
comprehension of visual scenes. Existing methods follow a typical one-to-one
mapping, which concentrates on a limited sample space while ignoring the
intrinsic semantic associations between samples, resulting in rigid and
uninformative expressions. To address this issue, we propose a novel and
flexible framework, namely Support-set based Multi-modal Representation
Enhancement (SMRE) model, to mine rich information in a semantic subspace
shared between samples. Specifically, we propose a Support-set Construction
(SC) module to construct a support-set to learn underlying connections between
samples and obtain semantic-related visual elements. During this process, we
design a Semantic Space Transformation (SST) module to constrain relative
distance and administrate multi-modal interactions in a self-supervised way.
Extensive experiments on MSVD and MSR-VTT datasets demonstrate that our SMRE
achieves state-of-the-art performance.
Related papers
- Cross-domain Multi-modal Few-shot Object Detection via Rich Text [21.36633828492347]
Cross-modal feature extraction and integration have led to steady performance improvements in few-shot learning tasks.
We study the Cross-Domain few-shot generalization of MM-OD (CDMM-FSOD) and propose a meta-learning based multi-modal few-shot object detection method.
arXiv Detail & Related papers (2024-03-24T15:10:22Z)
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
- Towards More Unified In-context Visual Understanding [74.55332581979292]
We present a new ICL framework for visual understanding with multi-modal output enabled.
First, we quantize and embed both text and visual prompt into a unified representational space.
Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them.
arXiv Detail & Related papers (2023-12-05T06:02:21Z)
- Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
These methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
arXiv Detail & Related papers (2023-08-24T20:46:48Z)
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking (MEL) task aims to resolve ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Few-shot Semantic Segmentation with Support-induced Graph Convolutional Network [28.46908214462594]
Few-shot semantic segmentation (FSS) aims to segment novel objects with only a few annotated samples.
We propose a Support-induced Graph Convolutional Network (SiGCN) to explicitly excavate latent context structure in query images.
arXiv Detail & Related papers (2023-01-09T08:00:01Z)
- Linguistic Structure Guided Context Modeling for Referring Image Segmentation [61.701577239317785]
We propose a "gather-propagate-distribute" scheme to model multimodal context by cross-modal interaction.
Our LSCM module builds a Dependency Parsing Tree Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence.
arXiv Detail & Related papers (2020-10-01T16:03:51Z)