Generalization algorithm of multimodal pre-training model based on
graph-text self-supervised training
- URL: http://arxiv.org/abs/2302.10315v1
- Date: Thu, 16 Feb 2023 03:34:08 GMT
- Title: Generalization algorithm of multimodal pre-training model based on
graph-text self-supervised training
- Authors: Xiaobing Zhang, Zhenhao Tang, Zi Long, and Xianghua Fu
- Abstract summary: A multimodal pre-training generalization algorithm for self-supervised training is proposed.
We show that when the filtered visual information is used for multimodal machine translation fine-tuning, translation on the Global Voices dataset is 0.5 BLEU higher than the baseline.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, a large number of studies have shown that introducing
visual information can effectively improve neural machine translation (NMT).
Its effectiveness, however, largely depends on the availability of a large
number of bilingual parallel sentence pairs with manual image annotations.
Two problems have remained difficult to solve: the lack of images and the
limited usefulness of the images that are available. In this paper, a
multimodal pre-training generalization algorithm for self-supervised training
is proposed, which overcomes the lack and inaccuracy of visual information and
thus extends the applicability of images to NMT. Specifically, we retrieve
candidate images for existing sentences through a search engine, and then,
exploiting the relationship between visual information and text, perform a
self-supervised image-text training task to obtain more effective visual
information for the text. We show that when the filtered visual information is
used for multimodal machine translation fine-tuning, translation on the Global
Voices dataset is 0.5 BLEU higher than the baseline.
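The retrieve-then-filter step described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the similarity measure, the threshold, and all names are hypothetical stand-ins, since the summary does not specify the paper's actual image-text model or retrieval API.

```python
# Illustrative sketch: retrieve images for a sentence, then keep only
# those whose visual features align with the text features. A real
# system would use a trained cross-modal (image-text) model here.

from dataclasses import dataclass


@dataclass
class Candidate:
    image_id: str
    features: list[float]  # visual features from some pretrained encoder


def cosine(a, b):
    """Plain cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def filter_images(text_features, candidates, threshold=0.5):
    """Keep retrieved images whose features align with the text.

    Stands in for the self-supervised image-text scoring the paper
    describes; the threshold is an arbitrary illustrative value.
    """
    return [c for c in candidates
            if cosine(text_features, c.features) >= threshold]


# Toy usage: one relevant and one irrelevant retrieved image.
text = [1.0, 0.0, 1.0]
cands = [Candidate("img_relevant", [0.9, 0.1, 0.8]),
         Candidate("img_noise", [-1.0, 0.5, -0.7])]
kept = filter_images(text, cands)  # only "img_relevant" survives
```

The surviving images would then supply the visual context for multimodal NMT fine-tuning.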
Related papers
- NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training [6.34265125858783]
We propose a noise-robust framework for efficient vision-language pre-training that requires less pre-training data.
Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer.
We introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning.
arXiv Detail & Related papers (2024-09-15T01:54:17Z)
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [99.9389737339175]
We introduce Self-Training on Image (STIC), which emphasizes a self-training approach specifically for image comprehension.
First, the model self-constructs a preference for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
- Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination [88.74459704391214]
In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup.
We represent the input images and texts with the visual and language scene graphs (SG), where such fine-grained vision-language features ensure a holistic understanding of the semantics.
Several SG-pivoting based learning objectives are introduced for unsupervised translation training.
Our method outperforms the best-performing baseline by a significant BLEU margin on the task and setup.
arXiv Detail & Related papers (2023-05-20T18:17:20Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
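The soft-masking idea summarized above can be illustrated with a small sketch. The attention and masking formulas below are assumptions for illustration only; the summary does not give the paper's exact formulation.

```python
# Illustrative sketch of text-driven soft masking: regions that words
# attend to strongly are down-weighted, yielding a harder example for
# the image-text matching (ITM) task.

import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def soft_mask_regions(word_region_sim):
    """word_region_sim[w][r]: similarity of word w to image region r.

    Returns one soft-mask weight per region. Word-conditional attention
    is averaged over words to get region relevance; relevant regions
    receive low weights (are masked), irrelevant ones stay near 1.
    """
    n_words = len(word_region_sim)
    n_regions = len(word_region_sim[0])
    relevance = [0.0] * n_regions
    for row in word_region_sim:
        att = softmax(row)  # attention of this word over all regions
        for r, a in enumerate(att):
            relevance[r] += a / n_words
    return [1.0 - rel for rel in relevance]
```

Soft weights (rather than hard zeroing) keep the masked features partially visible, which is what makes the perturbed image a "diverse" rather than destroyed training signal.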
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Multimodal Semi-Supervised Learning for Text Recognition [10.33262222726707]
We present semi-supervised learning for multimodal text recognizers (SemiMTR) that leverages unlabeled data at each modality training phase.
Our algorithm starts by pretraining the vision model through a single-stage training that unifies self-supervised learning with supervised training.
In a novel setup, consistency is enforced on each modality separately.
arXiv Detail & Related papers (2022-05-08T13:55:30Z)
- FILIP: Fine-grained Interactive Language-Image Pre-Training [106.19474076935363]
Fine-grained Interactive Language-Image Pre-training achieves finer-level alignment through a cross-modal late interaction mechanism.
We construct a new large-scale image-text pair dataset called FILIP300M for pre-training.
Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-11-09T17:15:38Z)
- TVDIM: Enhancing Image Self-Supervised Pretraining via Noisy Text Data [13.68491474904529]
We propose Text-enhanced Visual Deep InfoMax (TVDIM) to learn better visual representations.
Our core idea of self-supervised learning is to maximize the mutual information between features extracted from multiple views.
TVDIM significantly outperforms previous visual self-supervised methods when processing the same set of images.
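Maximizing mutual information between features from multiple views, as the TVDIM summary describes, is commonly approximated with a contrastive estimator. The sketch below is a generic InfoNCE-style bound, not TVDIM's exact objective; all names and the temperature value are illustrative.

```python
# Generic InfoNCE-style contrastive loss: paired feature vectors from
# two views are positives, all other cross-pairs act as negatives.
# Minimizing this loss maximizes a lower bound on the mutual
# information between the two views' features.

import math


def infonce_loss(view_a, view_b, temperature=0.1):
    """view_a[i] and view_b[i] are paired (positive) feature vectors."""
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    n = len(view_a)
    loss = 0.0
    for i in range(n):
        logits = [dot(view_a[i], view_b[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-softmax of the positive logit.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)
    return loss / n
```

Correctly aligned view pairs should yield a lower loss than misaligned ones, which is what drives the representations of the two views together.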
arXiv Detail & Related papers (2021-06-03T12:36:01Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding [74.33171794972688]
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.