HADA: A Graph-based Amalgamation Framework in Image-text Retrieval
- URL: http://arxiv.org/abs/2301.04742v1
- Date: Wed, 11 Jan 2023 22:25:20 GMT
- Title: HADA: A Graph-based Amalgamation Framework in Image-text Retrieval
- Authors: Manh-Duy Nguyen, Binh T. Nguyen, Cathal Gurrin
- Abstract summary: We propose a compact graph-based framework, named HADA, which can combine pretrained models to produce a better result.
Our experiments showed that HADA could increase baseline performance by more than 3.6% on the evaluation metrics of the Flickr30k dataset.
- Score: 2.3013879633693266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many models have been proposed for vision-and-language tasks, especially
image-text retrieval. All state-of-the-art (SOTA) models for this task
contain hundreds of millions of parameters and were pretrained on large
external datasets, which has been shown to improve overall performance
substantially. It is not easy to propose a new model with a novel
architecture, train it intensively on a massive dataset with many GPUs, and
surpass the many SOTA models that are already available on the Internet. In
this paper, we propose a compact graph-based framework, named HADA, which
combines pretrained models to produce a better result rather than building
one from scratch. First, we create a graph structure in which the nodes are
the features extracted from the pretrained models and the edges connect
them. The graph structure is used to capture and fuse information from every
pretrained model. A graph neural network is then applied to update the
connections between the nodes and obtain a representative embedding vector
for each image and text. Finally, we use cosine similarity to match images
with their relevant texts and vice versa, which ensures a low inference
time. Our experiments show that, although HADA contains only a tiny number
of trainable parameters, it improves baseline performance by more than 3.6%
on the evaluation metrics of the Flickr30k dataset. Moreover, the proposed
model was not trained on any external dataset and, owing to its small number
of parameters, required only a single GPU to train. The source code is
available at https://github.com/m2man/HADA.
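The pipeline the abstract describes (extract features from several pretrained models, fuse them through a graph whose nodes are those features, then rank candidates by cosine similarity) can be sketched in miniature as follows. This is not the authors' implementation: the two-node graph, the plain mean-aggregation update (standing in for HADA's learned graph neural network), and the toy feature vectors are all assumptions made for brevity.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse(node_feats, edges, steps=1):
    # Simple mean-aggregation message passing: each node's feature is
    # averaged with its neighbours'. A learned GNN update would replace
    # this in the actual framework.
    for _ in range(steps):
        new = []
        for i, feat in enumerate(node_feats):
            neigh = [node_feats[j] for a, j in edges if a == i]
            group = [feat] + neigh
            new.append([sum(v) / len(group) for v in zip(*group)])
        node_feats = new
    # Concatenate the updated node features into one embedding vector.
    return [v for feat in node_feats for v in feat]

# Toy features for one image, as if produced by two different
# pretrained models (values invented for illustration).
img_nodes = [[1.0, 0.0], [0.8, 0.2]]
edges = [(0, 1), (1, 0)]  # fully connect the two model nodes
img_emb = fuse(img_nodes, edges)

# Toy text embeddings fused the same way; retrieval = rank by cosine.
texts = {
    "a dog": fuse([[0.9, 0.1], [0.7, 0.3]], edges),
    "a car": fuse([[0.0, 1.0], [0.1, 0.9]], edges),
}
best = max(texts, key=lambda t: cosine(img_emb, texts[t]))
print(best)  # -> a dog
```

Because matching reduces to a single cosine similarity between precomputed embeddings, retrieval stays cheap at inference time regardless of how many pretrained models feed the graph.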
Related papers
- Multi-Modal Parameter-Efficient Fine-tuning via Graph Neural Network [2.12696199609647]
This paper proposes a multi-modal parameter-efficient fine-tuning method based on graph networks.
The proposed model achieves test accuracies on the OxfordPets, Flowers102, and Food101 datasets that improve by 4.45%, 2.92%, and 0.23%, respectively.
arXiv Detail & Related papers (2024-08-01T05:24:20Z) - Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is 5 times the size of the previous largest dataset, yet our cost is tens of thousands of GPU hours lower.
arXiv Detail & Related papers (2024-06-13T16:40:39Z) - LiteNeXt: A Novel Lightweight ConvMixer-based Model with Self-embedding Representation Parallel for Medical Image Segmentation [2.0901574458380403]
We propose a new lightweight but efficient model, namely LiteNeXt, for medical image segmentation.
LiteNeXt is trained from scratch with a small number of parameters (0.71M) and low computational cost (0.42 GFLOPs).
arXiv Detail & Related papers (2024-04-04T01:59:19Z) - Cross-Modal Adapter for Text-Video Retrieval [91.9575196703281]
We present a novel Cross-Modal Adapter for parameter-efficient fine-tuning.
Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterized layers.
It achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets.
arXiv Detail & Related papers (2022-11-17T16:15:30Z) - Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z) - A Robust Stacking Framework for Training Deep Graph Models with Multifaceted Node Features [61.92791503017341]
Graph Neural Networks (GNNs) with numerical node features and graph structure as inputs have demonstrated superior performance on various supervised learning tasks with graph data.
However, the best models for IID (non-graph) data in most standard supervised learning settings are not easily incorporated into a GNN.
Here we propose a robust stacking framework that fuses graph-aware propagation with arbitrary models intended for IID data.
arXiv Detail & Related papers (2022-06-16T22:46:33Z) - InvGAN: Invertible GANs [88.58338626299837]
InvGAN, short for Invertible GAN, successfully embeds real images to the latent space of a high quality generative model.
This allows us to perform image inpainting, merging, and online data augmentation.
arXiv Detail & Related papers (2021-12-08T21:39:00Z) - Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
Memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z) - KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation [100.79870384880333]
We propose a knowledge-grounded pre-training (KGPT) to generate knowledge-enriched text.
We adopt three settings, namely fully-supervised, zero-shot, and few-shot, to evaluate its effectiveness.
Under the zero-shot setting, our model achieves over 30 ROUGE-L on WebNLG while all other baselines fail.
arXiv Detail & Related papers (2020-10-05T19:59:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.