PV2TEA: Patching Visual Modality to Textual-Established Information
Extraction
- URL: http://arxiv.org/abs/2306.01016v1
- Date: Thu, 1 Jun 2023 05:39:45 GMT
- Title: PV2TEA: Patching Visual Modality to Textual-Established Information
Extraction
- Authors: Hejie Cui, Rongmei Lin, Nasser Zalmout, Chenwei Zhang, Jingbo Shang,
Carl Yang, Xian Li
- Abstract summary: We patch the visual modality to the textual-established attribute information extractor.
PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes.
Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1 increase over unimodal baselines.
- Score: 59.76117533540496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Information extraction, e.g., attribute value extraction, has been
extensively studied and formulated based only on text. However, many attributes
can benefit from image-based extraction, like color, shape, pattern, among
others. The visual modality has long been underutilized, mainly due to
multimodal annotation difficulty. In this paper, we aim to patch the visual
modality to the textual-established attribute information extractor. The
cross-modality integration faces several unique challenges: (C1) images and
textual descriptions are loosely paired intra-sample and inter-samples; (C2)
images usually contain rich backgrounds that can mislead the prediction; (C3)
weakly supervised labels from textual-established extractors are biased for
multimodal training. We present PV2TEA, an encoder-decoder architecture
equipped with three bias reduction schemes: (S1) Augmented label-smoothed
contrast to improve the cross-modality alignment for loosely-paired image and
text; (S2) Attention-pruning that adaptively distinguishes the visual
foreground; (S3) Two-level neighborhood regularization that mitigates the label
textual bias via reliability estimation. Empirical results on real-world
e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1
increase over unimodal baselines.
Related papers
- MARS: Paying more attention to visual attributes for text-based person search [6.438244172631555]
This paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive)
It enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss.
Experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements.
arXiv Detail & Related papers (2024-07-05T06:44:43Z) - Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z) - Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image
Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z) - Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning.
CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z) - Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE)
M3AE learns a unified encoder for both vision and language data via masked token prediction.
We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS)
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN)
We show effective features of ViTs are due to flexible receptive and dynamic fields possible via the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z) - Dual-path CNN with Max Gated block for Text-Based Person
Re-identification [6.1534388046236765]
A novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings.
The framework is based on two deep residual CNNs jointly optimized with cross-modal projection matching.
Our approach achieves the rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%.
arXiv Detail & Related papers (2020-09-20T03:33:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.