I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification
- URL: http://arxiv.org/abs/2209.10304v1
- Date: Wed, 21 Sep 2022 12:18:31 GMT
- Authors: Muhammad Ferjad Naeem, Yongqin Xian, Luc Van Gool, Federico Tombari
- Abstract summary: Online textual documents, e.g., Wikipedia, contain rich visual descriptions about object classes.
We propose I2DFormer, a novel transformer-based ZSL framework that jointly learns to encode images and documents.
Our method leads to highly interpretable results where document words can be grounded in the image regions.
- Score: 123.90912800376039
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the tremendous progress in zero-shot learning (ZSL), the majority of
existing methods still rely on human-annotated attributes, which are difficult
to annotate and scale. An unsupervised alternative is to represent each class
using the word embedding associated with its semantic class name. However, word
embeddings extracted from pre-trained language models do not necessarily
capture visual similarities, resulting in poor zero-shot performance. In this
work, we argue that online textual documents, e.g., Wikipedia, contain rich
visual descriptions about object classes and can therefore be used as powerful
unsupervised side information for ZSL. To this end, we propose I2DFormer, a
novel transformer-based ZSL framework that jointly learns to encode images and
documents by aligning both modalities in a shared embedding space. In order to
distill discriminative visual words from noisy documents, we introduce a new
cross-modal attention module that learns fine-grained interactions between
image patches and document words. Consequently, our I2DFormer not only learns
highly discriminative document embeddings that capture visual similarities but
also gains the ability to localize visually relevant words in image regions.
Quantitatively, we demonstrate that our I2DFormer significantly outperforms
previous unsupervised semantic embeddings under both zero-shot and generalized
zero-shot learning settings on three public datasets. Qualitatively, we show
that our method leads to highly interpretable results where document words can
be grounded in the image regions.
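
The two ideas in the abstract, attention between image patches and document words and classification by similarity in a shared embedding space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the random features, the absence of learned projections, and the mean-pooled image embedding are all assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(patches, words):
    """Scaled dot-product attention with image patches as queries
    and document words as keys and values (hypothetical helper).

    patches: (P, d) array, words: (W, d) array.
    Returns the attended features (P, d) and the weights (P, W).
    """
    d = patches.shape[-1]
    scores = patches @ words.T / np.sqrt(d)
    attn = softmax(scores, axis=-1)  # each patch distributes weight over words
    return attn @ words, attn

def class_scores(image_embedding, doc_embeddings):
    """Cosine similarity between one image embedding (d,) and
    one document embedding per class (C, d) (hypothetical helper)."""
    img = image_embedding / np.linalg.norm(image_embedding)
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    return docs @ img

# Random stand-ins for encoder outputs (assumption: no trained encoders here).
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))         # 4 image patches, dimension 8
words = rng.normal(size=(6, 8))           # 6 document words, dimension 8
doc_embeddings = rng.normal(size=(3, 8))  # 3 candidate class documents

attended, attn = cross_modal_attention(patches, words)
# Assumption: mean-pool the patches to get a global image embedding.
scores = class_scores(patches.mean(axis=0), doc_embeddings)
predicted_class = int(np.argmax(scores))
```

Each row of `attn` shows how strongly one patch attends to each document word, which is the kind of signal that can be inspected to ground words in image regions. At test time, an unseen class only needs a document embedding to be scored against the image.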
 
      
Related papers
- MADS: Multi-Attribute Document Supervision for Zero-Shot Image Classification [13.883913835653711]
 Zero-shot learning aims to train a model on seen classes and recognize unseen classes by knowledge transfer.
Recent studies reveal that documents from encyclopedias provide helpful auxiliary information.
We propose a novel multi-attribute document supervision framework to remove noise at both the document collection and model learning stages.
 arXiv (2025-03-10T02:16:30Z)
- Visual-Semantic Decomposition and Partial Alignment for Document-based Zero-Shot Learning [14.77066147494556]
 We propose a novel network that extracts multi-view semantic concepts from documents and images and aligns only the matching concepts rather than all of them.
We consistently outperform state-of-the-art methods under two document sources in three standard benchmarks for document-based zero-shot learning.
 arXiv (2024-07-22T13:15:04Z)
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
 We propose an end-to-end framework named AddressCLIP to solve the problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
 arXiv (2024-07-11T03:18:53Z)
- CAPro: Webly Supervised Learning with Cross-Modality Aligned Prototypes [93.71909293023663]
 Cross-modality Aligned Prototypes (CAPro) is a unified contrastive learning framework to learn visual representations with correct semantics.
 CAPro achieves new state-of-the-art performance and exhibits robustness to open-set recognition.
 arXiv (2023-10-15T07:20:22Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
 News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
 arXiv (2023-08-16T12:39:39Z)
- DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents [18.080447065002392]
 We propose DocumentCLIP to enforce vision-language pretraining models to comprehend the interaction between images and longer text within documents.
Our model benefits real-world multimodal document understanding for sources such as news articles, magazines, and product descriptions, which contain linguistically and visually richer content.
 arXiv (2023-06-09T23:51:11Z)
- STAIR: Learning Sparse Text and Image Representation in Grounded Tokens [84.14528645941128]
 We show that it is possible to build a sparse semantic representation that is as powerful as, or even better than, dense representations.
We extend the CLIP model and build a sparse text and image representation (STAIR), where the image and text are mapped to a sparse token space.
It significantly outperforms a CLIP model, with +4.9% and +4.3% absolute Recall@1 improvements.
 arXiv (2023-01-30T17:21:30Z)
- Knowing Where and What: Unified Word Block Pretraining for Document Understanding [11.46378901674016]
 We propose UTel, a language model with Unified TExt and layout pre-training.
Specifically, we propose two pre-training tasks: Surrounding Word Prediction (SWP) for the layout learning, and Contrastive learning of Word Embeddings (CWE) for identifying different word blocks.
In this way, the joint training of Masked Layout-Language Modeling (MLLM) and two newly proposed tasks enables the interaction between semantic and spatial features in a unified way.
 arXiv (2022-07-28T09:43:06Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
 We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
 arXiv (2022-03-20T03:49:02Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
 We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
 arXiv (2021-02-11T10:08:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     