Lightweight Attentional Feature Fusion for Video Retrieval by Text
- URL: http://arxiv.org/abs/2112.01832v1
- Date: Fri, 3 Dec 2021 10:41:12 GMT
- Title: Lightweight Attentional Feature Fusion for Video Retrieval by Text
- Authors: Fan Hu and Aozhu Chen and Ziyue Wang and Fangming Zhou and Xirong Li
- Abstract summary: We aim for feature fusion for both ends within a unified framework.
We propose Lightweight Attentional Feature Fusion (LAFF)
LAFF performs feature fusion at both early and late stages and at both video and text ends.
- Score: 7.042239213092635
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we revisit \emph{feature fusion}, an old-fashioned topic, in
the new context of video retrieval by text. Different from previous research
that considers feature fusion only at one end, be it video or text, we aim
for feature fusion for both ends within a unified framework. We hypothesize
that optimizing the convex combination of the features is preferred to modeling
their correlations by computationally heavy multi-head self-attention.
Accordingly, we propose Lightweight Attentional Feature Fusion (LAFF). LAFF
performs feature fusion at both early and late stages and at both video and
text ends, making it a powerful method for exploiting diverse (off-the-shelf)
features. Extensive experiments on four public datasets, i.e. MSR-VTT, MSVD,
TGIF and VATEX, and the large-scale TRECVID AVS benchmark evaluations (2016-2020)
show the viability of LAFF. Moreover, LAFF is extremely simple to implement,
making it appealing for real-world deployment.
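The hypothesized fusion mechanism, a learned convex combination of (off-the-shelf) features scored by a single linear attention layer rather than multi-head self-attention, can be sketched in a few lines. The sketch below is a minimal illustration based only on the abstract, not the authors' released implementation; the per-feature projection layers, the tanh activation, the common embedding dimension, and the feature shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LAFFSketch(nn.Module):
    """Minimal sketch of lightweight attentional feature fusion (not the official code)."""

    def __init__(self, feat_dims, common_dim=512):
        super().__init__()
        # One linear projection per off-the-shelf feature; dimensions are assumptions.
        self.projections = nn.ModuleList(
            [nn.Linear(d, common_dim) for d in feat_dims]
        )
        # A single linear layer scores each projected feature (no multi-head self-attention).
        self.attention = nn.Linear(common_dim, 1, bias=False)

    def forward(self, features):
        # features: list of tensors, features[i] has shape (batch, feat_dims[i])
        projected = torch.stack(
            [torch.tanh(proj(f)) for proj, f in zip(self.projections, features)],
            dim=1,
        )  # (batch, k, common_dim)
        weights = F.softmax(self.attention(projected).squeeze(-1), dim=1)  # (batch, k)
        # Convex combination: attention weights are non-negative and sum to 1.
        return torch.einsum("bk,bkd->bd", weights, projected)


# Usage sketch: fuse three hypothetical video features into one embedding.
fusion = LAFFSketch(feat_dims=[2048, 1024, 512], common_dim=512)
feats = [torch.randn(4, 2048), torch.randn(4, 1024), torch.randn(4, 512)]
fused = fusion(feats)  # shape (4, 512), ready for cross-modal similarity
```

Per the abstract, the same lightweight block is applied at both the video and text ends and at both early and late fusion stages; those placements are not shown in this single-block sketch.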
Related papers
- A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking [47.312955861553995]
We propose Unified Video Fusion (UniVF), a novel framework for temporally coherent video fusion.
To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks.
arXiv Detail & Related papers (2025-05-26T11:45:10Z) - CFSum: A Transformer-Based Multi-Modal Video Summarization Framework With Coarse-Fine Fusion [22.58710742780161]
CFSum is a transformer-based multi-modal video summarization framework with coarse-fine fusion.
CFSum exploits video, text, and audio modal features as input, and incorporates a two-stage transformer-based feature fusion framework.
arXiv Detail & Related papers (2025-03-01T06:13:13Z) - Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification [60.9670254833103]
Person Re-identification (ReID) aims to retrieve a specific person across non-overlapping cameras.
We propose a novel fusion framework called FusionReID to unify the strengths of CNNs and Transformers for image-based person ReID.
arXiv Detail & Related papers (2024-12-23T03:19:19Z) - Fusion Matters: Learning Fusion in Deep Click-through Rate Prediction Models [27.477136474888564]
We introduce OptFusion, a method that automates the learning of fusion, encompassing both the connection learning and the operation selection.
Our experiments are conducted over three large-scale datasets.
arXiv Detail & Related papers (2024-11-24T06:21:59Z) - AMPLE: Emotion-Aware Multimodal Fusion Prompt Learning for Fake News Detection [0.1499944454332829]
This paper introduces the Emotion-Aware Multimodal Fusion Prompt Learning (AMPLE) framework to address the above issue.
This framework extracts emotional elements from texts by leveraging sentiment analysis tools.
It then employs Multi-Head Cross-Attention (MCA) mechanisms and similarity-aware fusion methods to integrate multimodal data.
arXiv Detail & Related papers (2024-10-21T02:19:24Z) - Fusion-Mamba for Cross-modality Object Detection [63.56296480951342]
Fusing information from different modalities effectively improves cross-modality object detection performance.
We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction.
Our proposed approach outperforms state-of-the-art methods, improving mAP by 5.9% on M3FD and 4.9% on the FLIR-Aligned dataset.
arXiv Detail & Related papers (2024-04-14T05:28:46Z) - An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models [18.184158874126545]
We investigate how different fusion strategies can affect vision-language alignment.
A specially designed intermediate fusion can boost text-to-image alignment with improved generation quality.
Our model achieves a higher CLIP Score and lower FID, with 20% fewer FLOPs and 50% faster training.
arXiv Detail & Related papers (2024-03-25T08:16:06Z) - Unified Coarse-to-Fine Alignment for Video-Text Retrieval [71.85966033484597]
We propose a Unified Coarse-to-fine Alignment model, dubbed UCoFiA.
Our model captures the cross-modal similarity information at different granularity levels.
We apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them.
arXiv Detail & Related papers (2023-09-18T19:04:37Z) - DiffusionRet: Generative Text-Video Retrieval with Diffusion Model [56.03464169048182]
Existing text-video retrieval solutions focus on maximizing the conditional likelihood, i.e., p(candidates|query).
We instead tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates, query).
This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise.
arXiv Detail & Related papers (2023-03-17T10:07:19Z) - CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z) - FF2: A Feature Fusion Two-Stream Framework for Punctuation Restoration [27.14686854704104]
We propose a Feature Fusion two-stream framework (FF2) for punctuation restoration.
Specifically, one stream leverages a pre-trained language model to capture the semantic feature, while another auxiliary module captures the feature at hand.
Without additional data, the experimental results on the popular benchmark IWSLT demonstrate that FF2 achieves new SOTA performance.
arXiv Detail & Related papers (2022-11-09T06:18:17Z) - Semantic-aligned Fusion Transformer for One-shot Object Detection [18.58772037047498]
One-shot object detection aims at detecting novel objects according to merely one given instance.
Current approaches explore various feature fusions to obtain directly transferable meta-knowledge.
We propose a simple but effective architecture named Semantic-aligned Fusion Transformer (SaFT) to resolve these issues.
arXiv Detail & Related papers (2022-03-17T05:38:47Z) - Image Fusion Transformer [75.71025138448287]
In image fusion, images obtained from different sensors are fused to generate a single image with enhanced information.
In recent years, state-of-the-art methods have adopted Convolutional Neural Networks (CNNs) to encode meaningful features for image fusion.
We propose a novel Image Fusion Transformer (IFT) where we develop a transformer-based multi-scale fusion strategy.
arXiv Detail & Related papers (2021-07-19T16:42:49Z) - EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation, which has many applications such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF)
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.