ExpansionNet: exploring the sequence length bottleneck in the
Transformer for Image Captioning
- URL: http://arxiv.org/abs/2207.03327v1
- Date: Thu, 7 Jul 2022 14:37:02 GMT
- Title: ExpansionNet: exploring the sequence length bottleneck in the
Transformer for Image Captioning
- Authors: Jia Cheng Hu
- Abstract summary: We propose a new method called the "Expansion Mechanism", which transforms the input sequence, either dynamically or statically, into a new one with a different sequence length.
We exploit this method and achieve competitive performance on the MS-COCO 2014 dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Most recent state-of-the-art architectures rely on combinations and
variations of three approaches: convolutional, recurrent, and self-attentive
methods. Our work attempts to lay the basis for a new research direction in
sequence modeling based on the idea of modifying the sequence length. To that
end, we propose a new method called the "Expansion Mechanism", which transforms
the input sequence, either dynamically or statically, into a new one with a
different sequence length. Furthermore, we introduce a novel architecture that
exploits this method and achieves competitive performance on the MS-COCO 2014
dataset, yielding 134.6 and 131.4 CIDEr-D on the Karpathy test split in the
ensemble and single-model configurations, respectively, and 130 CIDEr-D on the
official online testing server, despite being neither recurrent nor fully
attentive. At the same time, we address the efficiency aspect of our design and
introduce a convenient training strategy suitable for most computational
resources, in contrast to the standard one. Source code is available at
https://github.com/jchenghu/ExpansionNet
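
To make the length-transformation idea concrete, here is a minimal sketch of a static expansion block, assuming PyTorch. The class name, the learned-query softmax mixing, and all hyper-parameters are illustrative stand-ins, not the paper's actual operator (the model is described as neither recurrent nor fully attentive; see the linked repository for the real design).

```python
import torch
import torch.nn as nn

class StaticExpansion(nn.Module):
    """Sketch: map a (batch, N, d) sequence to a fixed length M,
    process it there, and map it back to length N."""

    def __init__(self, d_model: int, expansion_len: int):
        super().__init__()
        # Learned "length queries" define the target length M.
        self.length_queries = nn.Parameter(
            torch.randn(expansion_len, d_model) * d_model ** -0.5)
        self.proc = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward expansion: an (M, N) mixing matrix built from the queries.
        scores = torch.einsum("md,bnd->bmn", self.length_queries, x)
        expanded = torch.softmax(scores, dim=-1) @ x          # (batch, M, d)
        expanded = expanded + self.proc(expanded)             # work at length M
        # Backward pass: transpose the mixing to return to length N.
        out = torch.softmax(scores.transpose(1, 2), dim=-1) @ expanded
        return x + out                                        # residual, length N
```

The crux is that the intermediate length M is decoupled from the input length N; a dynamic variant would derive M from the input rather than fixing it in advance.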
Related papers
- Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation [54.50526986788175] (2024-05-26)
Recent advances in efficient sequence modeling have led to attention-free layers, such as Mamba, RWKV, and various gated RNNs.
We present a unified view of these models, formulating such layers as implicit causal self-attention layers.
Our framework compares the underlying mechanisms on similar grounds for different layers and provides a direct means for applying explainability methods.
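
The formulation above is easy to verify in a toy setting: for a scalar gated linear recurrence h_t = g_t * h_{t-1} + x_t, unrolling gives an implicit causal attention matrix A[t, s] = prod_{k=s+1..t} g_k. A sketch (real layers such as Mamba or RWKV use vector-valued, data-dependent states; this is only the one-dimensional skeleton):

```python
import torch

def gated_recurrence(x: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """h_t = g_t * h_{t-1} + x_t, the skeleton of a gated linear RNN."""
    h, out = torch.zeros_like(x[0]), []
    for t in range(x.shape[0]):
        h = g[t] * h + x[t]
        out.append(h)
    return torch.stack(out)

def implicit_attention(g: torch.Tensor) -> torch.Tensor:
    """Causal matrix A[t, s] = prod_{k=s+1..t} g_k for s <= t, so that
    the recurrence above equals A @ x."""
    logc = torch.cumsum(torch.log(g), dim=0)       # log prefix products
    A = torch.exp(logc[:, None] - logc[None, :])   # ratio of prefix products
    return torch.tril(A)

T = 6
x, g = torch.randn(T), torch.rand(T) * 0.9 + 0.05  # positive gates in (0, 1)
assert torch.allclose(gated_recurrence(x, g), implicit_attention(g) @ x, atol=1e-5)
```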
arXiv Detail & Related papers (2024-05-26T09:57:45Z) - Ensemble Quadratic Assignment Network for Graph Matching [52.20001802006391]
Graph matching is a commonly used technique in computer vision and pattern recognition.
Recent data-driven approaches have improved the graph matching accuracy remarkably.
We propose a graph neural network (GNN) based approach to combine the advantages of data-driven and traditional methods.
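
For context, graph matching is classically cast as a quadratic assignment problem (QAP); the data-driven and traditional methods this entry refers to differ mainly in how the affinity matrix is built and how the search is relaxed. A minimal sketch of the shared objective (illustrative, not the paper's code):

```python
import numpy as np

def qap_score(K: np.ndarray, X: np.ndarray) -> float:
    """Quadratic assignment objective vec(X)^T K vec(X), where X is an
    (n1 x n2) matching matrix and K an (n1*n2 x n1*n2) affinity matrix."""
    v = X.reshape(-1)
    return float(v @ K @ v)

# Toy usage: score the identity matching of 3 nodes under a random affinity.
rng = np.random.default_rng(0)
K = rng.random((9, 9))
print(qap_score(K, np.eye(3)))
```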
arXiv Detail & Related papers (2024-03-11T06:34:05Z) - Med-DANet V2: A Flexible Dynamic Architecture for Efficient Medical
Volumetric Segmentation [29.082411035685773]
A dynamic architecture network for medical segmentation (Med-DANet) has achieved a favorable trade-off between accuracy and efficiency.
This paper explores a unified formulation of the dynamic inference framework from the perspective of both the data itself and the model structure.
Our framework improves the model efficiency by up to nearly 4.1 and 17.3 times with comparable segmentation results on BraTS 2019.
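
A rough sketch of the data-side half of such dynamic inference, assuming PyTorch: a tiny policy network routes each slice of a volume to either a cheap or an accurate segmenter. All names and the thresholding rule are hypothetical, not Med-DANet's actual decision unit.

```python
import torch
import torch.nn as nn

class DynamicSliceRouter(nn.Module):
    """Hypothetical per-slice router: easy slices go to a cheap model,
    hard slices to an accurate one. Illustrative only."""

    def __init__(self, policy: nn.Module, small: nn.Module, large: nn.Module,
                 threshold: float = 0.5):
        super().__init__()
        self.policy, self.small, self.large = policy, small, large
        self.threshold = threshold

    def forward(self, volume: torch.Tensor) -> torch.Tensor:
        outs = []
        for s in volume:                                  # volume: (S, C, H, W)
            s = s.unsqueeze(0)
            score = torch.sigmoid(self.policy(s)).item()  # policy emits 1 logit
            net = self.large if score > self.threshold else self.small
            outs.append(net(s))
        return torch.cat(outs)
```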
arXiv Detail & Related papers (2023-10-28T09:57:28Z) - An Empirical Study of Multimodal Model Merging [148.48412442848795]
Model merging is a technique that fuses multiple models trained on different tasks to generate a multi-task solution.
We conduct our study for a novel goal where we can merge vision, language, and cross-modal transformers of a modality-specific architecture.
We propose two metrics that assess the distance between weights to be merged and can serve as an indicator of the merging outcomes.
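
The paper's concrete metrics are not reproduced here, but the general shape of such a weight-space indicator is easy to sketch (assumed helper names; PyTorch state dicts):

```python
import torch
import torch.nn.functional as F

def weight_distance(sd_a: dict, sd_b: dict) -> torch.Tensor:
    """One plausible indicator: mean cosine distance over the parameter
    tensors two checkpoints share (shape-compatible entries only)."""
    dists = []
    for k in sd_a.keys() & sd_b.keys():
        a, b = sd_a[k].flatten().float(), sd_b[k].flatten().float()
        if a.numel() == b.numel():
            dists.append(1 - F.cosine_similarity(a, b, dim=0))
    return torch.stack(dists).mean()

def interpolation_merge(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """The simplest merge: elementwise interpolation of two state dicts."""
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}
```

Intuitively, the closer two checkpoints sit in weight space, the more benign an interpolation merge tends to be, which is what makes such a distance useful as a pre-merge indicator.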
arXiv Detail & Related papers (2023-04-28T15:43:21Z) - DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
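
To illustrate the idea (the exact symbol vocabulary and ordering are the paper's design choices, not reproduced here): fixing a deterministic traversal over the entity-pair relation matrix yields one canonical target string per document.

```python
def linearize_relations(relation_matrix, entities):
    """Toy linearization: traverse the entity-pair matrix in fixed
    row-major order and emit one triple string per relation, so every
    document has exactly one deterministic target sequence."""
    triples = []
    for i, head in enumerate(entities):
        for j, tail in enumerate(entities):
            rel = relation_matrix[i][j]
            if rel is not None:
                triples.append(f"({head}; {rel}; {tail})")
    return " ".join(triples)

# e.g. linearize_relations([[None, "born_in"], [None, None]],
#                          ["Marie Curie", "Warsaw"])
# -> '(Marie Curie; born_in; Warsaw)'
```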
arXiv Detail & Related papers (2022-10-28T11:18:10Z) - Sequential Ensembling for Semantic Segmentation [4.030520171276982]
We benchmark the popular ensembling approach of combining predictions of multiple, independently-trained, state-of-the-art models.
We propose a novel method inspired by boosting to sequentially ensemble networks that significantly outperforms the naive ensemble baseline.
arXiv Detail & Related papers (2022-10-08T22:13:59Z) - Exploiting Multiple Sequence Lengths in Fast End to End Training for
Image Captioning [52.25026952905702]
We introduce a method called the Expansion mechanism that processes the input unconstrained by the number of elements in the sequence.
By doing so, the model can learn more effectively compared to traditional attention-based approaches.
arXiv Detail & Related papers (2022-08-13T02:50:35Z) - Parameter Decoupling Strategy for Semi-supervised 3D Left Atrium
Segmentation [0.0]
We present a novel semi-supervised segmentation model based on parameter decoupling strategy to encourage consistent predictions from diverse views.
Our method achieves competitive results against state-of-the-art semi-supervised methods on the Atrial Challenge dataset.
arXiv Detail & Related papers (2021-09-20T14:51:42Z) - Improving Transformer-Kernel Ranking Model Using Conformer and Query
Term Independence [29.442579683405913]
The Transformer-Kernel (TK) model has demonstrated strong reranking performance on the TREC Deep Learning benchmark.
A variant of the TK model -- called TKL -- has been developed that incorporates local self-attention to efficiently process longer input sequences.
In this work, we propose a novel Conformer layer as an alternative approach to scale TK to longer input sequences.
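
For reference, the local self-attention that TKL relies on can be sketched in a few lines; this chunked, non-overlapping variant is one common realization (assuming PyTorch), and TKL's actual windowing may differ:

```python
import torch
import torch.nn.functional as F

def local_self_attention(q, k, v, window: int):
    """Chunked local self-attention: positions attend only inside their
    own fixed-size window, so cost grows linearly with sequence length.
    (No padding mask here: fine as a sketch, not for production.)"""
    B, T, D = q.shape
    pad = (-T) % window
    def chunk(t):
        t = F.pad(t, (0, 0, 0, pad))       # pad T up to a multiple of window
        return t.view(B, -1, window, D)
    qc, kc, vc = map(chunk, (q, k, v))
    att = torch.softmax(qc @ kc.transpose(-1, -2) / D ** 0.5, dim=-1)
    return (att @ vc).view(B, -1, D)[:, :T]
```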
arXiv Detail & Related papers (2021-04-19T15:32:34Z) - Learning Dynamic Routing for Semantic Segmentation [86.56049245100084]
This paper studies a conceptually new method to alleviate the scale variance in semantic representation, named dynamic routing.
The proposed framework generates data-dependent routes, adapting to the scale distribution of each image.
To this end, a differentiable gating function, called soft conditional gate, is proposed to select scale transform paths on the fly.
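
A simplified sketch of such a gate, assuming PyTorch. Note one deliberate difference: the paper's soft conditional gate can output exact zeros (so pruned paths are skipped at inference), whereas this softmax stand-in only approaches zero.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftGatedRouting(nn.Module):
    """Illustrative gate over three scale-transform paths: keep scale,
    process at half resolution, process at double resolution."""

    def __init__(self, c: int):
        super().__init__()
        self.gate = nn.Linear(c, 3)
        self.keep = nn.Conv2d(c, c, 3, padding=1)
        self.down = nn.Conv2d(c, c, 3, stride=2, padding=1)  # half resolution
        self.up = nn.Conv2d(c, c, 3, padding=1)              # used at 2x size

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, H, W)
        w = F.softmax(self.gate(x.mean(dim=(2, 3))), dim=-1) # per-image weights
        h, wd = x.shape[2], x.shape[3]
        paths = [
            self.keep(x),
            F.interpolate(self.down(x), size=(h, wd)),
            F.interpolate(self.up(F.interpolate(x, scale_factor=2.0)),
                          size=(h, wd)),
        ]
        return sum(w[:, i].view(-1, 1, 1, 1) * p for i, p in enumerate(paths))
```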
arXiv Detail & Related papers (2020-03-23T17:22:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.