MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding
- URL: http://arxiv.org/abs/2001.03712v1
- Date: Sat, 11 Jan 2020 05:50:19 GMT
- Title: MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding
- Authors: Geondo Park, Chihye Han, Wonjun Yoon, Daeshik Kim
- Abstract summary: We propose a novel multi-head self-attention network to capture various components of visual and textual data by attending to important parts of the data.
Our approach achieves new state-of-the-art results on image-text retrieval tasks on the MS-COCO and Flickr30K datasets.
- Score: 6.4901484665257545
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual-semantic embedding enables various tasks such as image-text retrieval,
image captioning, and visual question answering. The key to successful
visual-semantic embedding is to express visual and textual data properly by
accounting for their intricate relationship. While previous studies have
made considerable progress by encoding the visual and textual data into a joint
space where similar concepts are closely located, they often represent data by
a single vector, ignoring the presence of multiple important components in an
image or text. Thus, in addition to the joint embedding space, we propose a
novel multi-head self-attention network to capture various components of visual
and textual data by attending to important parts of the data. Our approach achieves
new state-of-the-art results on image-text retrieval tasks on the MS-COCO and
Flickr30K datasets. Through the visualization of the attention maps that
capture distinct semantic components at multiple positions in the image and the
text, we demonstrate that our method achieves an effective and interpretable
visual-semantic joint space.
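The idea of replacing a single embedding vector with several attention-weighted ones can be illustrated with a minimal sketch. This is one common formulation of multi-head self-attention pooling, not necessarily the authors' exact architecture: the function name `multi_head_attention_pool`, the weight matrices `W1` and `W2`, the helper `random_matrix`, and all dimensions are assumptions made for illustration. Each head computes a softmax distribution over the input items (image regions or words) and returns its own weighted sum of the features.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def multi_head_attention_pool(features, W1, W2):
    """Pool n feature vectors into one embedding per attention head.

    features: n vectors of size d (hypothetical region or word features)
    W1: hidden_dim x d projection (assumed, would be learned in practice)
    W2: num_heads x hidden_dim projection (assumed, would be learned)
    Returns (attention, embeddings): num_heads x n and num_heads x d.
    """
    # Project every item and squash with tanh.
    hidden = [[math.tanh(z) for z in matvec(W1, f)] for f in features]
    # One raw score per head for every item.
    scores = [matvec(W2, h) for h in hidden]
    n, d, num_heads = len(features), len(features[0]), len(W2)
    # Normalize each head's scores across the n items.
    attention = [softmax([scores[i][k] for i in range(n)])
                 for k in range(num_heads)]
    # Each head's embedding is its attention-weighted sum of the features.
    embeddings = [[sum(a * f[j] for a, f in zip(att, features))
                   for j in range(d)] for att in attention]
    return attention, embeddings


def random_matrix(rng, rows, cols):
    """Hypothetical helper: random weights standing in for learned ones."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    feats = random_matrix(rng, 5, 4)   # 5 items, 4-dim features
    W1 = random_matrix(rng, 3, 4)      # hidden_dim = 3
    W2 = random_matrix(rng, 2, 3)      # 2 attention heads
    att, emb = multi_head_attention_pool(feats, W1, W2)
    for k, row in enumerate(att):
        print(f"head {k} attention:", [round(a, 3) for a in row])
```

Because every head's weights sum to one over the items, the rows of `attention` can be inspected directly as attention maps, which is the kind of visualization the abstract refers to. In a trained model the weights would be learned jointly with the embedding space rather than sampled at random.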
Related papers
- Few-Shot Relation Extraction with Hybrid Visual Evidence [3.154631846975021]
We propose a multi-modal few-shot relation extraction model (MFS-HVE).
MFS-HVE includes semantic feature extractors and multi-modal fusion components.
Experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.
arXiv Detail & Related papers (2024-03-01T18:20:11Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Visual Analytics for Efficient Image Exploration and User-Guided Image Captioning [35.47078178526536]
Recent advancements in pre-trained large-scale language-image models have ushered in a new era of visual comprehension.
This paper tackles two well-known issues within the realm of visual analytics: (1) the efficient exploration of large-scale image datasets and identification of potential data biases within them; (2) the evaluation of image captions and steering of their generation process.
arXiv Detail & Related papers (2023-11-02T06:21:35Z)
- Scene Graph Based Fusion Network For Image-Text Retrieval [2.962083552798791]
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion.
Our SGFN performs better than quite a few SOTA image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z)
- ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Language Matters: A Weakly Supervised Pre-training Approach for Scene Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)