Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
- URL: http://arxiv.org/abs/2407.06723v2
- Date: Wed, 26 Feb 2025 22:54:53 GMT
- Title: Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
- Authors: Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pouransari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi
- Abstract summary: Graph-based captioning (GBC) describes an image using a labeled graph structure, with nodes of various types. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and object detection models. We show that leveraging GBC nodes' annotations significantly boosts the model's performance across various benchmarks.
- Score: 53.069446715005924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans describe complex scenes with compositionality, using simple text descriptions enriched with links and relationships. While vision-language research has aimed to develop models with compositional understanding capabilities, this is not yet reflected in existing datasets which, for the most part, still use plain text to describe images. In this work, we propose a new annotation strategy, graph-based captioning (GBC), which describes an image using a labeled graph structure with nodes of various types. The nodes in GBC are created through a two-stage process: first, identifying and describing entity nodes; second, linking these nodes by highlighting compositions and relations among them. Since all GBC nodes hold plain text descriptions, GBC retains the flexibility found in natural language, but can also encode hierarchical information in its edges. We demonstrate that GBC can be produced automatically, using off-the-shelf multimodal LLMs and object detection models, by building a new dataset, GBC10M, that gathers GBC annotations for about 10M images of the CC12M dataset. Through CLIP training on GBC10M, we show that leveraging GBC nodes' annotations -- particularly those in composition and relation nodes -- significantly boosts the model's performance across various benchmarks compared to when other annotations are used. To further explore the opportunities provided by GBC, we also investigate the use of GBC as middleware for text-to-image generation, and show the extra benefits of incorporating the graph structure in this task. Our code and datasets are released at https://github.com/apple/ml-gbc and https://huggingface.co/graph-based-captions.
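The annotation structure described above lends itself to a compact representation. Below is a minimal Python sketch of a GBC-style graph, assuming illustrative node types (image, entity, composition, relation) and field names; the released dataset's actual schema may differ.

```python
# A minimal sketch of the GBC annotation structure: a labeled graph whose
# nodes all carry plain-text descriptions, with edges encoding hierarchy.
# Node types and field names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import List, Literal

NodeType = Literal["image", "entity", "composition", "relation"]

@dataclass
class GBCNode:
    node_id: str
    node_type: NodeType
    description: str                                    # plain-text caption
    children: List[str] = field(default_factory=list)   # ids of child nodes

# Stage 1: identify and describe entity nodes under the root image node.
root = GBCNode("img", "image", "A dog chasing a ball in a park.")
dog = GBCNode("e1", "entity", "a brown dog in mid-stride")
ball = GBCNode("e2", "entity", "a red rubber ball")

# Stage 2: link entities by highlighting compositions and relations.
rel = GBCNode("r1", "relation", "the dog is chasing the ball",
              children=["e1", "e2"])
root.children = ["e1", "e2", "r1"]

graph = {n.node_id: n for n in (root, dog, ball, rel)}
```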
Related papers
- Open-Vocabulary Octree-Graph for 3D Scene Understanding [54.11828083068082]
Octree-Graph is a novel scene representation for open-vocabulary 3D scene understanding.
An adaptive-octree structure is developed that stores semantics and represents an object's occupancy at a resolution adapted to its shape.
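As a rough illustration of such a structure, here is a minimal sketch of an adaptive octree cell that stores a semantic label and an occupancy estimate, refining only where occupancy is ambiguous; the field names and subdivision rule are assumptions, not the paper's implementation.

```python
# An illustrative adaptive octree cell: each node holds an open-vocabulary
# semantic label and an occupancy estimate, and subdivides only where an
# object's shape demands finer resolution.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class OctreeNode:
    center: tuple                      # (x, y, z) cell centre
    size: float                        # cell edge length
    semantics: Optional[str] = None    # open-vocabulary label, e.g. "chair"
    occupancy: float = 0.0             # fraction of the cell that is occupied
    children: List["OctreeNode"] = field(default_factory=list)

    def subdivide(self):
        """Split into 8 children; called only for partially occupied cells."""
        h = self.size / 2
        offsets = [(dx, dy, dz) for dx in (-h / 2, h / 2)
                   for dy in (-h / 2, h / 2) for dz in (-h / 2, h / 2)]
        self.children = [
            OctreeNode((self.center[0] + dx, self.center[1] + dy,
                        self.center[2] + dz), h)
            for dx, dy, dz in offsets
        ]

root = OctreeNode((0.0, 0.0, 0.0), 1.0, semantics="chair", occupancy=0.4)
if 0.0 < root.occupancy < 1.0:   # adaptive rule: refine ambiguous cells only
    root.subdivide()
```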
arXiv Detail & Related papers (2024-11-25T10:14:10Z)
- DTGB: A Comprehensive Benchmark for Dynamic Text-Attributed Graphs [28.340416573162898]
Dynamic text-attributed graphs (DyTAGs) are prevalent in various real-world scenarios.
Despite their broad applicability, there is a notable scarcity of benchmark datasets tailored to DyTAGs.
We introduce Dynamic Text-attributed Graph Benchmark (DTGB), a collection of large-scale, time-evolving graphs.
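A DyTAG can be pictured as a time-ordered stream of text-attributed edge events, as in the sketch below; the field names are illustrative, not DTGB's actual schema.

```python
# An illustrative dynamic text-attributed graph record: a time-evolving
# edge stream whose interactions carry natural-language descriptions.
from dataclasses import dataclass

@dataclass
class DyTAGEvent:
    src: int            # source node id
    dst: int            # destination node id
    timestamp: float    # event time, enabling temporal ordering
    edge_text: str      # natural-language description of the interaction

stream = [
    DyTAGEvent(0, 1, 1700000000.0, "user 0 reviews product 1: 'great value'"),
    DyTAGEvent(2, 1, 1700000100.0, "user 2 asks a question about product 1"),
]
# Temporal tasks (e.g. future-link prediction) consume events in time order.
stream.sort(key=lambda e: e.timestamp)
```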
arXiv Detail & Related papers (2024-06-17T20:16:12Z)
- UniGLM: Training One Unified Language Model for Text-Attributed Graph Embedding [31.464021556351685]
Unified Graph Language Model (UniGLM) is a graph embedding model that generalizes well to both in-domain and cross-domain TAGs.
UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that is devised to accelerate training.
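As a loose illustration of contrastive training with structurally similar positives, the sketch below approximates structural similarity by embedding similarity; the selection rule and loss are assumptions, not UniGLM's actual procedure.

```python
# Illustrative contrastive step: pull an anchor node toward its most
# similar ("structural positive") node and away from the rest.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss with the positive placed at index 0 of the candidates."""
    logits = torch.cat([positive.unsqueeze(0), negatives]) @ anchor
    return F.cross_entropy((logits / temperature).unsqueeze(0),
                           torch.zeros(1, dtype=torch.long))

emb = F.normalize(torch.randn(10, 64), dim=-1)   # toy node embeddings
anchor_id = 0
sims = emb @ emb[anchor_id]
sims[anchor_id] = -1.0                           # exclude the anchor itself
pos_id = int(sims.argmax())                      # adaptive positive selection
negatives = emb[[i for i in range(10) if i not in (anchor_id, pos_id)]]
loss = info_nce(emb[anchor_id], emb[pos_id], negatives)
```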
arXiv Detail & Related papers (2024-06-17T19:45:21Z)
- TEG-DB: A Comprehensive Dataset and Benchmark of Textual-Edge Graphs [14.437863803271808]
Text-Attributed Graphs (TAGs) augment graph structures with natural language descriptions, facilitating detailed depictions of data and their interconnections.
Existing TAG datasets predominantly feature textual information only at the nodes, with edges typically represented by mere binary or categorical attributes.
To address this gap, we introduce Textual-Edge Graphs datasets featuring rich textual descriptions on nodes and edges.
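Such a textual-edge graph is straightforward to represent; the sketch below uses networkx with assumed attribute names for illustration.

```python
# An illustrative textual-edge graph: both nodes and edges hold free-text
# descriptions rather than binary or categorical attributes.
import networkx as nx

G = nx.DiGraph()
G.add_node("paper_a", text="A survey of graph neural networks.")
G.add_node("paper_b", text="Attention-based message passing on graphs.")
G.add_edge("paper_b", "paper_a",
           text="cites the survey for its taxonomy of GNN architectures")

# Edge text can then be embedded alongside node text for downstream tasks.
for u, v, data in G.edges(data=True):
    print(f"{u} -> {v}: {data['text']}")
```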
arXiv Detail & Related papers (2024-06-14T06:22:47Z)
- Hierarchical Compression of Text-Rich Graphs via Large Language Models [63.75293588479027]
Text-rich graphs are prevalent in data mining contexts like e-commerce and academic graphs.
This paper introduces "Hierarchical Compression" (HiCom), a novel method to align the capabilities of LLMs with the structure of text-rich graphs.
HiCom can outperform both GNNs and LLM backbones for node classification on e-commerce and citation graphs.
arXiv Detail & Related papers (2024-06-13T07:24:46Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- Pretraining Language Models with Text-Attributed Heterogeneous Graphs [28.579509154284448]
We present a new pretraining framework for Language Models (LMs) that explicitly considers the topological and heterogeneous information in Text-Attributed Heterogeneous Graphs (TAHGs).
We propose a topology-aware pretraining task to predict nodes involved in the context graph by jointly optimizing an LM and an auxiliary heterogeneous graph neural network.
We conduct link prediction and node classification tasks on three datasets from various domains.
arXiv Detail & Related papers (2023-10-19T08:41:21Z)
- Empower Text-Attributed Graphs Learning with Large Language Models (LLMs) [5.920353954082262]
We propose a plug-and-play approach to empower text-attributed graphs through node generation using Large Language Models (LLMs).
We employ an edge predictor to capture the structural information inherent in the raw dataset and integrate the newly generated samples into the original graph.
Experiments demonstrate the outstanding performance of our proposed paradigm, particularly in low-shot scenarios.
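The edge-predictor idea can be sketched as a pairwise scorer over node embeddings; the architecture and threshold below are illustrative assumptions, not the paper's design.

```python
# Illustrative edge predictor: score candidate links between LLM-generated
# node embeddings and existing nodes, then attach high-scoring pairs.
import torch
import torch.nn as nn

class EdgePredictor(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, 1))

    def forward(self, src: torch.Tensor, dst: torch.Tensor) -> torch.Tensor:
        # Concatenate endpoint embeddings and score the pair.
        return torch.sigmoid(self.mlp(torch.cat([src, dst], dim=-1)))

dim = 64
predictor = EdgePredictor(dim)
existing = torch.randn(100, dim)   # embeddings of original nodes
generated = torch.randn(5, dim)    # embeddings of LLM-generated nodes

# Link each generated node to existing nodes whose predicted score is high.
for g in generated:
    scores = predictor(g.expand(existing.size(0), -1), existing).squeeze(-1)
    neighbours = (scores > 0.5).nonzero(as_tuple=True)[0]
```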
arXiv Detail & Related papers (2023-10-15T16:04:28Z)
- Learning Multiplex Representations on Text-Attributed Graphs with One Language Model Encoder [55.24276913049635]
We propose METAG, a new framework for learning Multiplex rEpresentations on Text-Attributed Graphs.
In contrast to existing methods, METAG uses one text encoder to model the shared knowledge across relations.
We conduct experiments on nine downstream tasks in five graphs from both academic and e-commerce domains.
arXiv Detail & Related papers (2023-10-10T14:59:22Z)
- Clustering-based Image-Text Graph Matching for Domain Generalization [13.277406473107721]
Domain-invariant visual representations are important to train a model that can generalize well to unseen target task domains.
Recent works demonstrate that text descriptions contain high-level class-discriminative information.
We advocate for the use of local alignment between image regions and corresponding textual descriptions to get domain-invariant features.
arXiv Detail & Related papers (2023-10-04T10:03:07Z)
- KnowGL: Knowledge Generation and Linking from Text [13.407149206621828]
We propose KnowGL, a tool that converts text into structured relational data represented as a set of ABox assertions.
We address this problem as a sequence generation task by leveraging pre-trained sequence-to-sequence language models, e.g. BART.
To showcase the capabilities of our tool, we build a web application consisting of a set of UI widgets that help users to navigate through the semantic data extracted from a given input text.
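Framing extraction as sequence generation might look like the following sketch, which uses a generic BART checkpoint and an assumed linearised triple format rather than KnowGL's actual interface.

```python
# Illustrative sequence-to-sequence extraction: a fine-tuned seq2seq model
# would emit linearised assertions for a given input sentence.
from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large"   # stand-in seq2seq backbone
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

text = "Marie Curie won the Nobel Prize in Physics in 1903."
inputs = tokenizer(text, return_tensors="pt")
# A fine-tuned model would emit a linearised ABox assertion such as
# "(Marie Curie | award received | Nobel Prize in Physics)".
output_ids = model.generate(**inputs, max_length=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```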
arXiv Detail & Related papers (2022-10-25T12:12:36Z)
- Text-to-Image Generation Grounded by Fine-Grained User Attention [62.94737811887098]
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
arXiv Detail & Related papers (2020-11-07T13:23:31Z)
- Say As You Wish: Fine-grained Control of Image Caption Generation with Abstract Scene Graphs [74.88118535585903]
We propose the Abstract Scene Graph (ASG) structure to represent user intention at a fine-grained level.
From the ASG, we propose a novel ASG2Caption model, which is able to recognise user intentions and semantics in the graph.
Our model achieves better controllability conditioning on ASGs than carefully designed baselines on both VisualGenome and MSCOCO datasets.
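An ASG might be represented as follows; the node layout is an illustrative assumption, not the paper's exact schema.

```python
# Illustrative abstract scene graph: user intention expressed as object,
# attribute, and relationship nodes, which a captioner realises as text.
asg = {
    "nodes": [
        {"id": "o1", "type": "object"},                 # ask about an object
        {"id": "a1", "type": "attribute", "of": "o1"},  # ...and one attribute
        {"id": "o2", "type": "object"},
        {"id": "r1", "type": "relationship",
         "subject": "o1", "object": "o2"},
    ]
}
# A graph-conditioned captioner would ground o1/o2 to image regions and emit,
# e.g., "a brown dog chasing a ball", matching the requested detail level.
```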
arXiv Detail & Related papers (2020-03-01T03:34:07Z)
- Modeling Global and Local Node Contexts for Text Generation from Knowledge Graphs [63.12058935995516]
Recent graph-to-text models generate text from graph-based data using either global or local aggregation.
We propose novel neural models which encode an input graph combining both global and local node contexts.
Our approaches lead to significant improvements on two graph-to-text datasets.
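One way to combine the two context types is to concatenate a local neighbourhood aggregation with global self-attention per node, as in this illustrative sketch; the layer choices are assumptions, not the paper's models.

```python
# Illustrative encoder combining local and global node contexts:
# local = aggregation over graph neighbours, global = attention over all nodes.
import torch
import torch.nn as nn

class GlobalLocalEncoder(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.local_proj = nn.Linear(dim, dim)
        self.global_attn = nn.MultiheadAttention(dim, num_heads=4,
                                                 batch_first=True)
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Local context: mean over neighbours (adj is row-normalised).
        local = self.local_proj(adj @ x)
        # Global context: self-attention over every node in the graph.
        glob, _ = self.global_attn(x.unsqueeze(0), x.unsqueeze(0),
                                   x.unsqueeze(0))
        return self.out(torch.cat([local, glob.squeeze(0)], dim=-1))

n, dim = 6, 32
x = torch.randn(n, dim)
adj = torch.softmax(torch.randn(n, n), dim=-1)   # toy normalised adjacency
node_states = GlobalLocalEncoder(dim)(x, adj)
```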
arXiv Detail & Related papers (2020-01-29T18:24:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.