EAML: Ensemble Self-Attention-based Mutual Learning Network for Document
Image Classification
- URL: http://arxiv.org/abs/2305.06923v1
- Date: Thu, 11 May 2023 16:05:03 GMT
- Title: EAML: Ensemble Self-Attention-based Mutual Learning Network for Document
Image Classification
- Authors: Souhail Bakkali, Ziheng Ming, Mickael Coustaty, Marçal Rusiñol
- Abstract summary: We design a self-attention-based fusion module that serves as a block in our ensemble trainable network.
It allows the network to simultaneously learn the discriminant features of the image and text modalities throughout the training stage.
To our knowledge, this is the first work to leverage a mutual learning approach together with a self-attention-based fusion module for document image classification.
- Score: 1.1470070927586016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, complex deep neural networks have received
considerable interest in various document understanding tasks such as document
image classification and document retrieval. As many document types have a
distinct visual style, learning only visual features with deep CNNs to classify
document images has encountered the problems of low inter-class discrimination
and high intra-class structural variation among categories. In parallel,
text-level understanding jointly learned with the corresponding visual
properties within a given document image has considerably improved
classification accuracy. In this paper, we design a self-attention-based fusion
module that serves as a block in our ensemble trainable network. It allows the
network to simultaneously learn the discriminant features of the image and text
modalities throughout the training stage. In addition, we encourage mutual
learning by transferring positive knowledge between the image and text
modalities during training. This constraint is realized by adding a truncated
Kullback-Leibler divergence loss (Tr-KLD-Reg) as a new regularization term to
the conventional supervised setting. To the best of our knowledge, this is the
first work to leverage a mutual learning approach together with a
self-attention-based fusion module for document image classification. The
experimental results illustrate the effectiveness of our approach in terms of
accuracy in both single-modal and multi-modal settings. The proposed ensemble
self-attention-based mutual learning model thus outperforms state-of-the-art
classification results on the RVL-CDIP and Tobacco-3482 benchmark datasets.
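The abstract names two mechanisms without specifying them: a self-attention-based fusion of image and text features, and a truncated Kullback-Leibler divergence (Tr-KLD-Reg) regularizer for mutual learning between the two branches. As an illustration only, here is a minimal NumPy sketch of one plausible reading: cross-modal scaled dot-product attention for the fusion step, and a KL divergence restricted to sufficiently probable classes for the mutual-learning term. The function names, the truncation rule, and the loss composition are assumptions, not the paper's actual definitions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(img_feats, txt_feats):
    """Cross-modal scaled dot-product attention (an assumed fusion design):
    text features act as queries over image keys/values, and the attended
    image context is concatenated with the text features."""
    d = img_feats.shape[-1]
    scores = txt_feats @ img_feats.T / np.sqrt(d)   # (T, I) attention scores
    weights = softmax(scores, axis=-1)              # rows sum to 1
    context = weights @ img_feats                   # (T, d) attended image context
    return np.concatenate([txt_feats, context], axis=-1)  # (T, 2d) fused features

def truncated_kl(p, q, eps=1e-3):
    # KL(p || q) with terms dropped where p <= eps. The truncation rule here
    # is a guess at what "truncated" means; the paper's Tr-KLD-Reg may differ.
    mask = p > eps
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_loss(logits_a, logits_b, labels, lam=1.0):
    # Hypothetical total loss for one branch: cross-entropy on its own
    # predictions plus a truncated-KL term pulling it toward the other branch.
    pa, pb = softmax(logits_a), softmax(logits_b)
    n = len(labels)
    ce = -np.mean(np.log(pa[np.arange(n), labels] + 1e-12))
    kl = np.mean([truncated_kl(pb[i], pa[i]) for i in range(n)])
    return ce + lam * kl
```

In a mutual-learning setup each branch (image, text) would minimize its own `mutual_loss` against the other branch's softened predictions, so positive knowledge flows in both directions during training.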
Related papers
- FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction [66.98008357232428]
We propose FineMatch, a new aspect-based fine-grained text and image matching benchmark.
FineMatch focuses on text and image mismatch detection and correction.
We show that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches.
arXiv Detail & Related papers (2024-04-23T03:42:14Z)
- Cross-Modal Concept Learning and Inference for Vision-Language Models [31.463771883036607]
In existing fine-tuning methods, the class-specific text description is matched against the whole image.
We develop a new method called cross-modal concept learning and inference (CCLI).
Our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts.
arXiv Detail & Related papers (2023-07-28T10:26:28Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Multi-Label Image Classification with Contrastive Learning [57.47567461616912]
We show that a direct application of contrastive learning can hardly improve performance in multi-label cases.
We propose a novel framework for multi-label classification with contrastive learning in a fully supervised setting.
arXiv Detail & Related papers (2021-07-24T15:00:47Z)
- Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning [0.0]
In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture.
Our experiments, on image classification tasks according to the labels of the Places dataset, first consider only the visual part.
Taking the texts associated with the images into account can further improve accuracy, depending on the goal.
arXiv Detail & Related papers (2021-07-08T10:54:59Z)
- Multimodal Contrastive Training for Visual Representation Learning [45.94662252627284]
We develop an approach to learning visual representations that embraces multimodal data.
Our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously.
By including multimodal training in a unified framework, our method can learn more powerful and generic visual features.
arXiv Detail & Related papers (2021-04-26T19:23:36Z)
- Learning to Focus: Cascaded Feature Matching Network for Few-shot Image Recognition [38.49419948988415]
Deep networks can learn to accurately recognize objects of a category by training on a large number of images.
A meta-learning challenge, known as the low-shot image recognition task, arises when only a few annotated images are available for learning a recognition model for a category.
Our method, called Cascaded Feature Matching Network (CFMN), is proposed to solve this problem.
Experiments for few-shot learning on two standard datasets, miniImageNet and Omniglot, have confirmed the effectiveness of our method.
arXiv Detail & Related papers (2021-01-13T11:37:28Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
- Self-Supervised Representation Learning on Document Images [8.927538538637783]
We show that patch-based pre-training performs poorly on document images because of their different structural properties and poor intra-sample semantic information.
We propose two context-aware alternatives to improve performance on the Tobacco-3482 image classification task.
arXiv Detail & Related papers (2020-04-18T10:14:06Z)
- Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.