Multimodal Side-Tuning for Document Classification
- URL: http://arxiv.org/abs/2301.07502v1
- Date: Mon, 16 Jan 2023 11:08:03 GMT
- Title: Multimodal Side-Tuning for Document Classification
- Authors: Stefano Pio Zingaro, Giuseppe Lisanti, Maurizio Gabbrielli
- Abstract summary: Side-tuning is a methodology for network adaptation recently introduced to solve some of the problems related to previous approaches.
We show that side-tuning can also be employed successfully when different data sources are considered.
- Score: 3.0229888038442914
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we propose to exploit the side-tuning framework for multimodal
document classification. Side-tuning is a methodology for network adaptation
recently introduced to solve some of the problems related to previous
approaches. This technique makes it possible to overcome the model rigidity
and catastrophic forgetting that affect transfer learning by fine-tuning. The
proposed solution uses off-the-shelf deep learning architectures leveraging the
side-tuning framework to combine a base model with a tandem of two side
networks. We show that side-tuning can also be employed successfully when
different data sources are considered, e.g. text and images in document
classification. The experimental results show that this approach improves
document classification accuracy over the state of the art.
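The abstract says that a base model is combined with a tandem of two side networks, but it does not state the fusion rule. As a rough illustration only, the sketch below assumes a convex combination of the three feature vectors, in the spirit of the alpha blending used in the original side-tuning work. All names here (`blend`, `side_tuned_features`, the toy encoders, and the weight values) are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Encoder = Callable[[object], Vector]


def blend(features: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Return the weighted sum of equally sized feature vectors."""
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all feature vectors must have the same length")
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


def side_tuned_features(
    image: object,
    text: object,
    base: Encoder,        # pretrained network, assumed frozen during adaptation
    image_side: Encoder,  # trainable side network on the image modality
    text_side: Encoder,   # trainable side network on the text modality
    weights: Sequence[float],
) -> Vector:
    """Fuse base and side-network outputs (assumed convex combination)."""
    return blend([base(image), image_side(image), text_side(text)], weights)


# Toy stand-ins for real encoders; each returns a fixed 2-d feature vector.
toy_base = lambda image: [1.0, 0.0]
toy_image_side = lambda image: [0.0, 1.0]
toy_text_side = lambda text: [1.0, 1.0]

fused = side_tuned_features(
    "page.png", "invoice text",
    toy_base, toy_image_side, toy_text_side,
    weights=(0.5, 0.25, 0.25),
)
print(fused)  # [0.75, 0.5]
```

In an actual training setup the fused vector would feed a classification head, and only the side networks (and possibly the blending weights) would be updated, which is what keeps the base model's knowledge intact.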
Related papers
- Task-Specific Adaptation with Restricted Model Access [23.114703555189937]
"Gray-box" fine-tuning approaches, where the model's architecture and weights remain hidden, allow only gradient propagation.
We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model's input and output.
We evaluate our approaches across several backbones on benchmarks such as text-image alignment, text-video alignment, and sketch-image alignment.
arXiv Detail & Related papers (2025-02-02T13:29:44Z)
- Towards Compatible Fine-tuning for Vision-Language Model Updates [114.25776195225494]
Class-conditioned Context Optimization (ContCoOp) integrates learnable prompts with class embeddings using an attention layer before inputting them into the text encoder.
Our experiments over 15 datasets show that our ContCoOp achieves the highest compatibility over the baseline methods, and exhibits robust out-of-distribution generalization.
arXiv Detail & Related papers (2024-12-30T12:06:27Z)
- High-Performance Few-Shot Segmentation with Foundation Models: An Empirical Study [64.06777376676513]
We develop a few-shot segmentation (FSS) framework based on foundation models.
To be specific, we propose a simple approach to extract implicit knowledge from foundation models to construct coarse correspondence.
Experiments on two widely used datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-10T08:04:11Z)
- DocXplain: A Novel Model-Agnostic Explainability Method for Document Image Classification [5.247930659596986]
This paper introduces DocXplain, a novel model-agnostic explainability method specifically designed for generating highly interpretable feature attribution maps.
We extensively evaluate our proposed approach in the context of document image classification, utilizing 4 different evaluation metrics.
To the best of the authors' knowledge, this work presents the first model-agnostic attribution-based explainability method specifically tailored for document images.
arXiv Detail & Related papers (2024-07-04T10:59:15Z)
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Navigating Text-To-Image Customization: From LyCORIS Fine-Tuning to Model Evaluation [6.7311791228366]
This paper introduces LyCORIS, an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion.
We also present a framework for the systematic assessment of varied fine-tuning techniques.
Our work provides essential insights into the nuanced effects of fine-tuning parameters, bridging the gap between state-of-the-art research and practical application.
arXiv Detail & Related papers (2023-09-26T11:36:26Z)
- Switchable Representation Learning Framework with Self-compatibility [50.48336074436792]
We propose a Switchable representation learning Framework with Self-Compatibility (SFSC).
SFSC generates a series of compatible sub-models with different capacities through one training process.
SFSC achieves state-of-the-art performance on the evaluated datasets.
arXiv Detail & Related papers (2022-06-16T16:46:32Z)
- RectiNet-v2: A stacked network architecture for document image dewarping [16.249023269158734]
We propose an end-to-end CNN architecture that produces distortion-free document images from the warped documents it takes as input.
We train this model on synthetically simulated warped document images to compensate for the lack of sufficient natural data.
We evaluate our method on the DocUNet dataset, a benchmark in this domain, and obtain results comparable to state-of-the-art methods.
arXiv Detail & Related papers (2021-02-01T19:26:17Z)
- Unsupervised Neural Domain Adaptation for Document Image Binarization [13.848843012433187]
This paper proposes a method that combines neural networks and Domain Adaptation (DA) in order to carry out unsupervised document binarization.
Results show that our proposal successfully deals with the binarization of new document domains without the need for labeled data.
arXiv Detail & Related papers (2020-12-02T13:42:38Z)
- Self-supervised Deep Reconstruction of Mixed Strip-shredded Text Documents [63.41717168981103]
This work extends our previous deep learning method for single-page reconstruction to a more realistic/complex scenario.
In our approach, the compatibility evaluation is modeled as a two-class (valid or invalid) pattern recognition problem.
The proposed method outperforms the competing ones in complex scenarios, achieving accuracy above 90%.
arXiv Detail & Related papers (2020-07-01T21:48:05Z)
- Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds.
This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly.
Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
arXiv Detail & Related papers (2020-03-23T03:22:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.