VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal
Document Classification
- URL: http://arxiv.org/abs/2205.12029v3
- Date: Thu, 11 May 2023 15:31:06 GMT
- Title: VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal
Document Classification
- Authors: Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol,
Oriol Ramos Terrades
- Abstract summary: Multimodal learning from document data has achieved great success lately, as it allows pre-training semantically meaningful features that serve as a prior for a learnable downstream task.
In this paper, we approach the document classification problem by learning cross-modal representations through language and vision cues.
The proposed method exploits high-level interactions and learns relevant semantic information from effective attention flows within and across modalities.
- Score: 3.7798600249187295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal learning from document data has achieved great success lately, as
it allows pre-training semantically meaningful features that serve as a prior for a
learnable downstream task. In this paper, we approach the document
classification problem by learning cross-modal representations through language
and vision cues, considering intra- and inter-modality relationships. Instead
of merging features from different modalities into a joint representation
space, the proposed method exploits high-level interactions and learns relevant
semantic information from effective attention flows within and across
modalities. The proposed learning objective is devised between intra- and
inter-modality alignment tasks, where the similarity distribution per task is
computed by contracting positive sample pairs while simultaneously contrasting
negative ones in the joint representation space. Extensive experiments on
public document classification datasets demonstrate the effectiveness and the
generality of our model on low-scale and large-scale datasets.
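
To make the alignment objective more concrete, the following is a minimal, hypothetical PyTorch sketch of an InfoNCE-style loss that combines intra-modality alignment (two views of the same modality) with inter-modality alignment (vision vs. language features of the same document). The function names, temperature value, and view construction are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.07):
    """InfoNCE-style loss: pull matched (positive) pairs together while
    contrasting them against all other in-batch (negative) pairs."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)                   # diagonal entries are the positives

def cross_modal_alignment_loss(vision_views, text_views):
    """Combine intra-modality alignment (two views of the same modality)
    with inter-modality alignment (vision vs. language of the same document)."""
    v1, v2 = vision_views   # two projected views of the vision features, each (B, D)
    t1, t2 = text_views     # two projected views of the text features, each (B, D)
    intra = info_nce(v1, v2) + info_nce(t1, t2)
    inter = info_nce(v1, t1) + info_nce(t1, v1)
    return intra + inter
```

In such a setup, the inputs would be pooled features from the vision and language encoders for a batch of document pages, with the diagonal of each similarity matrix acting as the positive pairs.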
Related papers
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework, where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful with classification tasks with little, or non-overlapping annotations.
We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- Knowledge-Enhanced Hierarchical Information Correlation Learning for Multi-Modal Rumor Detection [82.94413676131545]
We propose a novel knowledge-enhanced hierarchical information correlation learning approach (KhiCL) for multi-modal rumor detection.
KhiCL exploits cross-modal joint dictionary to transfer the heterogeneous unimodality features into the common feature space.
It extracts visual and textual entities from images and text, and designs a knowledge relevance reasoning strategy.
arXiv Detail & Related papers (2023-06-28T06:08:20Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
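
As a rough, hypothetical sketch of that idea (assumed PyTorch; the module name, dimensions, and pooling are illustrative, not the paper's implementation), a module projecting features of different modalities into one shared space could look like:

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Hypothetical sketch: project per-modality token features of different
    dimensionality into one shared embedding space."""
    def __init__(self, in_dims, shared_dim=256):
        super().__init__()
        # one linear projection per modality, e.g. {"image": 768, "text": 512}
        self.proj = nn.ModuleDict({name: nn.Linear(d, shared_dim) for name, d in in_dims.items()})

    def forward(self, features):
        # features: dict mapping modality name -> (batch, tokens, in_dim) tensor;
        # modalities absent at inference time can simply be omitted from the dict
        pooled = [self.proj[name](x).mean(dim=1) for name, x in features.items()]
        return torch.stack(pooled, dim=0).mean(dim=0)          # (batch, shared_dim)
```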
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification [1.1470070927586016]
We design a self-attention-based fusion module that serves as a block in our ensemble trainable network.
It allows the network to simultaneously learn the discriminant features of the image and text modalities throughout the training stage.
This is the first work to leverage a mutual learning approach together with a self-attention-based fusion module for document image classification.
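
Purely as an illustration of what such a fusion block might look like (assumed PyTorch; layer sizes and pooling are guesses, not the EAML architecture):

```python
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Illustrative sketch of a self-attention fusion block over concatenated
    image and text token features (not the EAML authors' code)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, text_tokens):
        # image_tokens: (batch, n_img, dim), text_tokens: (batch, n_txt, dim)
        tokens = torch.cat([image_tokens, text_tokens], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)           # joint attention flow
        fused = self.norm(tokens + fused)                      # residual + layer norm
        return fused.mean(dim=1)                               # pooled fused representation
```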
arXiv Detail & Related papers (2023-05-11T16:05:03Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Sequential Cross-Document Coreference Resolution [14.099694053823765]
Cross-document coreference resolution is important for the growing interest in multi-document analysis tasks.
We propose a new model that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings.
Our model incrementally composes mentions into cluster representations and predicts links between a mention and the already constructed clusters.
arXiv Detail & Related papers (2021-04-17T00:46:57Z)
- An End-to-end Model for Entity-level Relation Extraction using Multi-instance Learning [2.111790330664657]
We present a joint model for entity-level relation extraction from documents.
We achieve state-of-the-art relation extraction results on the DocRED dataset.
Our experimental results suggest that a joint approach is on par with task-specific learning, though more efficient due to shared parameters and training steps.
arXiv Detail & Related papers (2021-02-11T12:49:39Z)
- Semantically Driven Sentence Fusion: Modeling and Evaluation [27.599227950466442]
Sentence fusion is the task of joining related sentences into coherent text.
Current training and evaluation schemes for this task are based on single reference ground-truths.
We show that this hinders models from robustly capturing the semantic relationship between input sentences.
arXiv Detail & Related papers (2020-10-06T10:06:01Z)
- Task-Feature Collaborative Learning with Application to Personalized Attribute Prediction [166.87111665908333]
We propose a novel multi-task learning method called Task-Feature Collaborative Learning (TFCL).
Specifically, we first propose a base model with a heterogeneous block-diagonal structure regularizer to leverage the collaborative grouping of features and tasks.
As a practical extension, we extend the base model by allowing overlapping features and differentiating the hard tasks.
arXiv Detail & Related papers (2020-04-29T02:32:04Z)