Web Image Context Extraction with Graph Neural Networks and Sentence
Embeddings on the DOM tree
- URL: http://arxiv.org/abs/2108.11629v1
- Date: Thu, 26 Aug 2021 07:49:28 GMT
- Title: Web Image Context Extraction with Graph Neural Networks and Sentence
Embeddings on the DOM tree
- Authors: Chen Dang (QR), Hicham Randrianarivo (QR), Raphaël Fournier-S'niehotta
(CNAM, CEDRIC - VERTIGO), Nicolas Audebert (CNAM, CEDRIC - VERTIGO)
- Abstract summary: Web Image Context Extraction (WICE) consists of obtaining the textual information describing an image.
We introduce a novel WICE approach that combines Graph Neural Networks (GNNs) and Natural Language Processing models.
We show that our approach gives promising results to help address the large-scale WICE problem using only HTML data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Web Image Context Extraction (WICE) consists of obtaining the textual
information describing an image using the content of the surrounding webpage. A
common preprocessing step before performing WICE is to render the content of
the webpage. When done at a large scale (e.g., for search engine indexing),
it may become very computationally costly (up to several seconds per page). To
avoid this cost, we introduce a novel WICE approach that combines Graph Neural
Networks (GNNs) and Natural Language Processing models. Our method relies on a
graph model containing both node types and text as features. The model is fed
through several blocks of GNNs to extract the textual context. Since no labeled
WICE dataset with ground truth exists, we train and evaluate the GNNs on a
proxy task of finding the semantically closest text to the image
caption. We then interpret importance weights to find the most relevant text
nodes and define them as the image context. Thanks to GNNs, our model is able
to encode both structural and semantic information from the webpage. We show
that our approach gives promising results to help address the large-scale WICE
problem using only HTML data.
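The pipeline the abstract describes can be sketched end-to-end: parse the HTML into a graph of DOM nodes, attach text features to text nodes, run a message-passing step over the graph, and select the text node most similar to the image caption (the proxy task). This is a minimal illustrative sketch, not the authors' code: the bag-of-words `embed` is a toy stand-in for a sentence encoder, and the single mean-aggregation step stands in for the paper's GNN blocks. All function and class names are assumptions.

```python
from html.parser import HTMLParser
import math

VOCAB = {}  # shared word index (toy stand-in for a sentence encoder's vocabulary)

def embed(text, dim=64):
    """Toy bag-of-words embedding, L2-normalized."""
    v = [0.0] * dim
    for w in text.lower().split():
        v[VOCAB.setdefault(w, len(VOCAB)) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

class DomGraph(HTMLParser):
    """Collects DOM nodes (tags and text) and parent-child edges."""
    def __init__(self):
        super().__init__()
        self.nodes, self.edges, self._stack = [], [], []
    def _add(self, label, text=""):
        idx = len(self.nodes)
        self.nodes.append((label, text))
        if self._stack:
            self.edges.append((self._stack[-1], idx))
        return idx
    def handle_starttag(self, tag, attrs):
        self._stack.append(self._add(tag))
    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()
    def handle_data(self, data):
        if data.strip():
            self._add("#text", data.strip())

def propagate(graph, dim=64):
    """One mean-aggregation message-passing step (minimal stand-in for the GNN blocks)."""
    feats = [embed(text, dim) if label == "#text" else [0.0] * dim
             for label, text in graph.nodes]
    neighbours = {i: [] for i in range(len(graph.nodes))}
    for a, b in graph.edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    return [[sum(col) / (len(neighbours[i]) + 1)
             for col in zip(*([feats[i]] + [feats[j] for j in neighbours[i]]))]
            for i in range(len(graph.nodes))]

def closest_text(html, caption):
    """Proxy task: return the text node whose propagated feature is closest to the caption."""
    g = DomGraph()
    g.feed(html)
    feats = propagate(g)
    cap = embed(caption)
    scores = {text: sum(a * b for a, b in zip(feats[i], cap))
              for i, (label, text) in enumerate(g.nodes) if label == "#text"}
    return max(scores, key=scores.get)

page = "<div><p>A cat sleeping on a sofa</p><p>Site navigation menu</p></div>"
print(closest_text(page, "cat sleeping sofa"))  # -> A cat sleeping on a sofa
```

Because no rendering is needed, the whole pipeline operates on raw HTML, which is the computational saving the paper targets.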
Related papers
- BATINet: Background-Aware Text to Image Synthesis and Manipulation
Network [12.924990882126105]
We analyzed a novel Background-Aware Text2Image (BAT2I) task in which the generated content matches the input background.
We proposed a Background-Aware Text to Image synthesis and manipulation Network (BATINet), which contains two key components.
We demonstrated through qualitative and quantitative evaluations on the CUB dataset that the proposed model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2023-08-11T03:22:33Z)
- Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z)
- Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path [28.898240725099782]
We propose a new approach, ReXMiner, for zero-shot relation extraction in web mining.
ReXMiner encodes the shortest relative paths in the Document Object Model (DOM) tree.
It also incorporates the popularity of each text node by counting the occurrence of the same text node across different web pages.
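The relative XML paths ReXMiner relies on can be illustrated with a short sketch: given two DOM nodes' absolute tag paths from the root, the shortest relative path climbs to their lowest common ancestor and then descends. The function name and the `"^"` up-step token are illustrative assumptions, not ReXMiner's actual encoding.

```python
def relative_xml_path(path_a, path_b):
    """Shortest relative path between two DOM nodes given their absolute tag
    paths from the root: climb to the lowest common ancestor, then descend."""
    common = 0
    while (common < min(len(path_a), len(path_b))
           and path_a[common] == path_b[common]):
        common += 1
    # "^" marks one step up toward the lowest common ancestor
    return ["^"] * (len(path_a) - common) + path_b[common:]

print(relative_xml_path(["html", "body", "div", "p"], ["html", "body", "span"]))
# -> ['^', '^', 'span']
```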
arXiv Detail & Related papers (2023-05-23T08:16:52Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- GraghVQA: Language-Guided Graph Neural Networks for Graph-based Visual Question Answering [4.673063715963991]
Scene Graph encodes objects as nodes connected via pairwise relations as edges.
We propose GraphVQA, a language-guided graph neural network framework that translates and executes a natural language question.
Our experiments on GQA dataset show that GraphVQA outperforms the state-of-the-art accuracy by a large margin.
arXiv Detail & Related papers (2021-04-20T23:54:41Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Text-Guided Neural Image Inpainting [20.551488941041256]
Inpainting task requires filling the corrupted image with contents coherent with the context.
The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text.
We propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet).
arXiv Detail & Related papers (2020-04-07T09:04:43Z)
- Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)
- Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text [93.08109196909763]
We propose a novel VQA approach, Multi-Modal Graph Neural Network (MM-GNN).
It first represents an image as a graph consisting of three sub-graphs, depicting visual, semantic, and numeric modalities respectively.
It then introduces three aggregators which guide the message passing from one graph to another to utilize the contexts in various modalities.
arXiv Detail & Related papers (2020-03-31T05:56:59Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.