Related papers: PV2TEA: Patching Visual Modality to Textual-Established Information Extraction

PV2TEA: Patching Visual Modality to Textual-Established Information Extraction

URL: http://arxiv.org/abs/2306.01016v1
Date: Thu, 1 Jun 2023 05:39:45 GMT
Title: PV2TEA: Patching Visual Modality to Textual-Established Information Extraction
Authors: Hejie Cui, Rongmei Lin, Nasser Zalmout, Chenwei Zhang, Jingbo Shang, Carl Yang, Xian Li
Abstract summary: We patch the visual modality to the textual-established attribute information extractor. PV2TEA is an encoder-decoder architecture equipped with three bias reduction schemes. Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1 increase over unimodal baselines.
Score: 59.76117533540496
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Information extraction, e.g., attribute value extraction, has been extensively studied and formulated based only on text. However, many attributes can benefit from image-based extraction, like color, shape, pattern, among others. The visual modality has long been underutilized, mainly due to multimodal annotation difficulty. In this paper, we aim to patch the visual modality to the textual-established attribute information extractor. The cross-modality integration faces several unique challenges: (C1) images and textual descriptions are loosely paired intra-sample and inter-samples; (C2) images usually contain rich backgrounds that can mislead the prediction; (C3) weakly supervised labels from textual-established extractors are biased for multimodal training. We present PV2TEA, an encoder-decoder architecture equipped with three bias reduction schemes: (S1) Augmented label-smoothed contrast to improve the cross-modality alignment for loosely-paired image and text; (S2) Attention-pruning that adaptively distinguishes the visual foreground; (S3) Two-level neighborhood regularization that mitigates the label textual bias via reliability estimation. Empirical results on real-world e-Commerce datasets demonstrate up to 11.74% absolute (20.97% relatively) F1 increase over unimodal baselines.

Related papers

OFFSET: Segmentation-based Focus Shift Revision for Composed Image Retrieval [59.377821673653436]
Composed Image Retrieval (CIR) is capable of expressing users' intricate retrieval requirements flexibly.<n>CIR remains in its nascent stages due to two limitations: 1) inhomogeneity between dominant and noisy portions in visual data is ignored, leading to query feature degradation.<n>This work presents a focus mapping-based feature extractor, which consists of two modules: dominant portion segmentation and dual focus mapping.
arXiv Detail & Related papers (2025-07-08T03:27:46Z)
Beyond Intermediate States: Explaining Visual Redundancy through Language [7.275188652473603]
Multi-modal Large Langue Models (MLLMs) often process thousands of visual tokens. Visual tokens with low ViT-[cls] association and low text-to-image attention scores can contain recognizable information. We develop a more reliable method for identifying and pruning redundant visual tokens.
arXiv Detail & Related papers (2025-03-26T13:38:10Z)
LEGION: Learning to Ground and Explain for Synthetic Image Detection [49.958951540410816]
We introduce SynthScars, a high-quality and diverse dataset consisting of 12,236 fully synthetic images with human-expert annotations. It features 4 distinct image content types, 3 categories of artifacts, and fine-grained annotations covering pixel-level segmentation, detailed textual explanations, and artifact category labels. We propose LEGION, a multimodal large language model (MLLM)-based image forgery analysis framework that integrates artifact detection, segmentation, and explanation.
arXiv Detail & Related papers (2025-03-19T14:37:21Z)
MARS: Paying more attention to visual attributes for text-based person search [6.438244172631555]
This paper presents a novel TBPS architecture named MARS (Mae-Attribute-Relation-Sensitive) It enhances current state-of-the-art models by introducing two key components: a Visual Reconstruction Loss and an Attribute Loss. Experiments on three commonly used datasets, namely CUHK-PEDES, ICFG-PEDES, and RSTPReid, report performance improvements.
arXiv Detail & Related papers (2024-07-05T06:44:43Z)
Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions. We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals. Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions. On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and 6.67% boost in R@50, underscoring its
arXiv Detail & Related papers (2023-06-03T11:50:44Z)
Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework. It learns relations between local visual-textual tokens and enhances global image-text matching. The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z)
Self-supervised Character-to-Character Distillation for Text Recognition [54.12490492265583]
We propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate text representation learning. CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution.
arXiv Detail & Related papers (2022-11-01T05:48:18Z)
Multimodal Masked Autoencoders Learn Transferable Representations [127.35955819874063]
We propose a simple and scalable network architecture, the Multimodal Masked Autoencoder (M3AE) M3AE learns a unified encoder for both vision and language data via masked token prediction. We provide an empirical study of M3AE trained on a large-scale image-text dataset, and find that M3AE is able to learn generalizable representations that transfer well to downstream tasks.
arXiv Detail & Related papers (2022-05-27T19:09:42Z)
CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image framework (CRIS) CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment. Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model. During the training phase, the modality transition network is optimised by the proposed modality loss. Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
Dual-path CNN with Max Gated block for Text-Based Person Re-identification [6.1534388046236765]
A novel Dual-path CNN with Max Gated block (DCMG) is proposed to extract discriminative word embeddings. The framework is based on two deep residual CNNs jointly optimized with cross-modal projection matching. Our approach achieves the rank-1 score of 55.81% and outperforms the state-of-the-art method by 1.3%.
arXiv Detail & Related papers (2020-09-20T03:33:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.