X-Linear Attention Networks for Image Captioning
- URL: http://arxiv.org/abs/2003.14080v1
- Date: Tue, 31 Mar 2020 10:35:33 GMT
- Title: X-Linear Attention Networks for Image Captioning
- Authors: Yingwei Pan and Ting Yao and Yehao Li and Tao Mei
- Abstract summary: We introduce a unified attention block, the X-Linear attention block, which fully employs bilinear pooling to selectively capitalize on visual information or perform multi-modal reasoning.
X-LAN integrates X-Linear attention blocks into the image encoder and sentence decoder of an image captioning model to leverage higher-order intra- and inter-modal interactions.
Experiments on the COCO benchmark demonstrate that X-LAN obtains the best published CIDEr performance to date, 132.0%, on the COCO Karpathy test split.
- Score: 124.48670699658649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress on fine-grained visual recognition and visual question
answering has featured Bilinear Pooling, which effectively models the 2$^{nd}$
order interactions across multi-modal inputs. Nevertheless, there has been little
evidence in support of building such interactions concurrently with the attention
mechanism for image captioning. In this paper, we introduce a unified attention
block, the X-Linear attention block, which fully employs bilinear pooling to
selectively capitalize on visual information or perform multi-modal reasoning.
Technically, the X-Linear attention block simultaneously exploits both the spatial
and channel-wise bilinear attention distributions to capture the 2$^{nd}$ order
interactions between the input single-modal or multi-modal features. Higher and
even infinite order feature interactions are readily modeled, respectively, by
stacking multiple X-Linear attention blocks and by equipping the block with an
Exponential Linear Unit (ELU) in a parameter-free fashion. Furthermore, we
present X-Linear Attention Networks (dubbed X-LAN), which integrate X-Linear
attention blocks into the image encoder and sentence decoder of an image
captioning model to leverage higher order intra- and inter-modal interactions.
Experiments on the COCO benchmark demonstrate that our X-LAN obtains the best
published CIDEr performance to date, 132.0%, on the COCO Karpathy test split.
When a Transformer is further endowed with X-Linear attention blocks, CIDEr is
boosted to 132.8%. Source code is available at
https://github.com/Panda-Peter/image-captioning.
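As a rough illustration of the mechanism the abstract describes, here is a minimal, single-head PyTorch sketch of one 2$^{nd}$ order X-Linear attention step. It is not the authors' released implementation (see the repository linked above); the layer sizes, activation choices, and the exact placement of the ELU are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XLinearAttention(nn.Module):
    """Sketch of a single X-Linear attention block (2nd order, one head).

    One query vector per sample attends over a set of region features.
    Dimensions and activations are illustrative, not the paper's exact config.
    """

    def __init__(self, d_q, d_kv, d_mid=512, d_embed=128):
        super().__init__()
        self.q_k = nn.Linear(d_q, d_mid)      # query branch for the key side
        self.k = nn.Linear(d_kv, d_mid)       # key projection
        self.q_v = nn.Linear(d_q, d_mid)      # query branch for the value side
        self.v = nn.Linear(d_kv, d_mid)       # value projection
        self.embed = nn.Linear(d_mid, d_embed)    # joint query-key embedding
        self.spatial = nn.Linear(d_embed, 1)      # spatial attention logits
        self.channel = nn.Linear(d_embed, d_mid)  # channel-wise excitation

    def forward(self, query, keys, values):
        # query: (B, d_q); keys, values: (B, N, d_kv)
        q = query.unsqueeze(1)                                      # (B, 1, d_q)
        # 2nd-order interactions via bilinear pooling (element-wise product).
        b_k = torch.relu(self.k(keys)) * torch.relu(self.q_k(q))    # (B, N, d_mid)
        b_v = torch.relu(self.v(values)) * torch.relu(self.q_v(q))  # (B, N, d_mid)
        # ELU on the joint feature stands in for the abstract's
        # parameter-free route to higher/infinite-order interactions.
        e_k = F.elu(self.embed(b_k))                                # (B, N, d_embed)
        beta_s = torch.softmax(self.spatial(e_k), dim=1)            # spatial attention (B, N, 1)
        beta_c = torch.sigmoid(self.channel(e_k.mean(dim=1)))       # channel attention (B, d_mid)
        attended = (beta_s * b_v).sum(dim=1)                        # weighted sum over regions
        return beta_c * attended                                    # attended value (B, d_mid)
```

Stacking several such blocks, with the attended output fed back as the next query, is how the abstract obtains the higher-order intra- and inter-modal interactions; the linked repository contains the exact formulation used in X-LAN.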
Related papers
- X-Drive: Cross-modality consistent multi-sensor data synthesis for driving scenarios [105.16073169351299]
We propose a novel framework, X-DRIVE, to model the joint distribution of point clouds and multi-view images.
Considering the distinct geometrical spaces of the two modalities, X-DRIVE conditions the synthesis of each modality on the corresponding local regions.
X-DRIVE allows for controllable generation through multi-level input conditions, including text, bounding box, image, and point clouds.
arXiv Detail & Related papers (2024-11-02T03:52:12Z)
- X-VILA: Cross-Modality Alignment for Large Language Model [91.96081978952283]
X-VILA is an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities.
We propose a visual alignment mechanism with a visual embedding highway module to address the problem of visual information loss.
X-VILA exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins.
arXiv Detail & Related papers (2024-05-29T17:59:58Z)
- Xformer: Hybrid X-Shaped Transformer for Image Denoising [114.37510775636811]
We present a hybrid X-shaped vision Transformer, named Xformer, which performs notably well on image denoising tasks.
Xformer achieves state-of-the-art performance on the synthetic and real-world image denoising tasks.
arXiv Detail & Related papers (2023-03-11T16:32:09Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster at inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state of the art on the widely used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Towards Joint Intent Detection and Slot Filling via Higher-order Attention [47.78365472691051]
Intent detection (ID) and slot filling (SF) are two major tasks in spoken language understanding (SLU).
We propose a Bilinear attention block, which exploits both the contextual and channel-wise bilinear attention distributions.
We show that our approach yields improvements compared with the state-of-the-art approach.
arXiv Detail & Related papers (2021-09-18T09:50:23Z)
- Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks [34.32609892928909]
We propose a novel attention mechanism which we call external attention, based on two external, small, learnable, and shared memories.
Our method provides comparable or superior performance to the self-attention mechanism and some of its variants, with much lower computational and memory costs; a minimal sketch of this mechanism follows below.
arXiv Detail & Related papers (2021-05-05T22:29:52Z)
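For contrast with the bilinear attention above, here is a minimal sketch of the external-attention idea from the last entry: two small, learnable, shared memories (implemented as linear layers) stand in for the query-key-value projections of self-attention. The memory size and the double-normalization step are assumptions based on the paper's description, not a verified reimplementation.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Sketch of external attention with two shared linear memories."""

    def __init__(self, d_model, n_slots=64):
        super().__init__()
        self.key_memory = nn.Linear(d_model, n_slots, bias=False)    # M_k
        self.value_memory = nn.Linear(n_slots, d_model, bias=False)  # M_v

    def forward(self, x):
        # x: (B, N, d_model) -- N tokens, e.g. image patches
        attn = self.key_memory(x)                              # (B, N, n_slots)
        attn = torch.softmax(attn, dim=1)                      # normalize over tokens
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)  # then over memory slots
        return self.value_memory(attn)                         # (B, N, d_model)
```

Because the memories are shared across all inputs and independent of sequence length, the cost is linear in the number of tokens, which is where the quoted efficiency gains come from.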