Pretraining Language Models with Text-Attributed Heterogeneous Graphs
- URL: http://arxiv.org/abs/2310.12580v2
- Date: Mon, 23 Oct 2023 01:46:04 GMT
- Title: Pretraining Language Models with Text-Attributed Heterogeneous Graphs
- Authors: Tao Zou, Le Yu, Yifei Huang, Leilei Sun, Bowen Du
- Abstract summary: We present a new pretraining framework for Language Models (LMs) that explicitly considers the topological and heterogeneous information in Text-Attributed Heterogeneous Graphs (TAHGs).
We propose a topology-aware pretraining task to predict nodes involved in the context graph by jointly optimizing an LM and an auxiliary heterogeneous graph neural network.
We conduct link prediction and node classification tasks on three datasets from various domains.
- Score: 28.579509154284448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many real-world scenarios (e.g., academic networks, social platforms),
different types of entities are not only associated with texts but also
connected by various relationships, which can be abstracted as Text-Attributed
Heterogeneous Graphs (TAHGs). Current pretraining tasks for Language Models
(LMs) primarily focus on separately learning the textual information of each
entity and overlook the crucial aspect of capturing topological connections
among entities in TAHGs. In this paper, we present a new pretraining framework
for LMs that explicitly considers the topological and heterogeneous information
in TAHGs. Firstly, we define a context graph as neighborhoods of a target node
within specific orders and propose a topology-aware pretraining task to predict
nodes involved in the context graph by jointly optimizing an LM and an
auxiliary heterogeneous graph neural network. Secondly, based on the
observation that some nodes are text-rich while others have little text, we
devise a text augmentation strategy to enrich textless nodes with their
neighbors' texts for handling the imbalance issue. We conduct link prediction
and node classification tasks on three datasets from various domains.
Experimental results demonstrate the superiority of our approach over existing
methods and the rationality of each design. Our code is available at
https://github.com/Hope-Rita/THLM.
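To make the pretraining objective above more concrete, here is a minimal PyTorch sketch of the topology-aware task as the abstract describes it: a text encoder (standing in for the LM) embeds a target node's text, an auxiliary heterogeneous GNN embeds candidate nodes, and both are jointly optimized to predict which candidates belong to the target's context graph. All module names, dimensions, and the dot-product scoring rule are illustrative assumptions, not the released THLM implementation.

```python
# Hypothetical sketch of a topology-aware pretraining objective:
# an LM stand-in encodes a target node's text, an auxiliary heterogeneous
# GNN stand-in encodes candidate nodes, and both are jointly trained to
# predict which candidates appear in the target's context graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):                  # stand-in for a BERT-style LM
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        return self.emb(token_ids)             # (batch, dim) pooled text embedding

class HeteroGNN(nn.Module):                    # stand-in for the auxiliary heterogeneous GNN
    def __init__(self, num_node_types=3, dim=128):
        super().__init__()
        self.type_emb = nn.Embedding(num_node_types, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats, node_types):  # node_feats: (num_nodes, dim)
        return self.proj(node_feats + self.type_emb(node_types))

lm, gnn = TextEncoder(), HeteroGNN()
opt = torch.optim.Adam(list(lm.parameters()) + list(gnn.parameters()), lr=1e-4)

# Toy batch: one target node's tokens, 8 candidate nodes, and binary labels
# marking which candidates fall inside the target's context graph.
tokens = torch.randint(0, 30522, (1, 32))
cand_feats = torch.randn(8, 128)
cand_types = torch.randint(0, 3, (8,))
labels = torch.tensor([1, 1, 0, 1, 0, 0, 1, 0], dtype=torch.float)

target = lm(tokens)                             # (1, dim)
cands = gnn(cand_feats, cand_types)             # (8, dim)
scores = (cands @ target.t()).squeeze(1)        # dot-product membership scores
loss = F.binary_cross_entropy_with_logits(scores, labels)
loss.backward()
opt.step()
```

In this reading, the LM and GNN share the supervision signal coming from graph topology, which is how textual and structural information get tied together during pretraining.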
Related papers
- Large Language Model-based Augmentation for Imbalanced Node Classification on Text-Attributed Graphs [13.42259312243504]
We propose a novel approach called LA-TAG (LLM-based Augmentation on Text-Attributed Graphs)
It prompts Large Language Models to generate synthetic texts based on existing node texts in the graph.
To integrate these synthetic text-attributed nodes into the graph, we introduce a text-based link predictor.
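As a rough illustration of the mechanism summarized above, the sketch below generates synthetic node texts from existing minority-class texts and attaches the new nodes via a similarity-based, text-only link predictor. `call_llm` and `embed` are hypothetical placeholders, and the whole pipeline is my reading of the abstract, not the LA-TAG code.

```python
# Illustrative sketch of LLM-based augmentation plus a text-based link
# predictor (an interpretation of the LA-TAG abstract, not the authors' code).
import torch
import torch.nn.functional as F

def call_llm(prompt: str) -> str:               # hypothetical LLM call
    return "synthetic node text generated from the prompt"

def embed(texts):                                # stand-in text embedder
    return F.normalize(torch.randn(len(texts), 64), dim=-1)

minority_texts = ["paper on rare topic A", "short note on rare topic A"]
prompt = "Write a new text similar to these examples:\n" + "\n".join(minority_texts)
synthetic_texts = [call_llm(prompt) for _ in range(3)]

# Text-based link predictor: connect each synthetic node to its k most
# similar existing nodes in embedding space.
existing_emb = embed(minority_texts)             # (num_existing, d)
synthetic_emb = embed(synthetic_texts)           # (num_synthetic, d)
sims = synthetic_emb @ existing_emb.t()          # cosine similarities
k = 1
new_edges = [(i, int(j)) for i, row in enumerate(sims) for j in row.topk(k).indices]
print(new_edges)                                 # edges from synthetic to existing nodes
```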
arXiv Detail & Related papers (2024-10-22T10:36:15Z)
- UniGLM: Training One Unified Language Model for Text-Attributed Graphs [31.464021556351685]
Unified Graph Language Model (UniGLM) is a graph embedding model that generalizes well to both in-domain and cross-domain TAGs.
UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that is devised to accelerate training.
arXiv Detail & Related papers (2024-06-17T19:45:21Z)
- Unleashing the Potential of Text-attributed Graphs: Automatic Relation Decomposition via Large Language Models [31.443478448031886]
RoSE (Relation-oriented Semantic Edge-decomposition) is a novel framework that decomposes the graph structure by analyzing raw text attributes.
Our framework significantly enhances node classification performance across various datasets, with improvements of up to 16% on the Wisconsin dataset.
arXiv Detail & Related papers (2024-05-28T20:54:47Z)
- Learning Multiplex Representations on Text-Attributed Graphs with One Language Model Encoder [55.24276913049635]
We propose METAG, a new framework for learning Multiplex rEpresentations on Text-Attributed Graphs.
In contrast to existing methods, METAG uses one text encoder to model the shared knowledge across relations.
We conduct experiments on nine downstream tasks in five graphs from both academic and e-commerce domains.
arXiv Detail & Related papers (2023-10-10T14:59:22Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings [20.25180279903009]
We propose Contrastive Graph-Text pretraining (ConGraT) for jointly learning separate representations of texts and nodes in a text-attributed graph (TAG).
Our method trains a language model (LM) and a graph neural network (GNN) to align their representations in a common latent space using a batch-wise contrastive learning objective inspired by CLIP.
Experiments demonstrate that ConGraT outperforms baselines on various downstream tasks, including node and text category classification, link prediction, and language modeling.
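The batch-wise contrastive objective described above can be illustrated with a short CLIP-style loss over paired LM and GNN embeddings of the same nodes; the encoder choices and temperature value here are assumptions, not ConGraT's exact configuration.

```python
# Minimal sketch of a CLIP-style batch-wise contrastive objective aligning
# LM and GNN embeddings of the same nodes (illustrative only).
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, node_emb, temperature=0.07):
    # text_emb, node_emb: (batch, dim) embeddings of the same nodes from
    # the language model and the graph neural network, respectively.
    text_emb = F.normalize(text_emb, dim=-1)
    node_emb = F.normalize(node_emb, dim=-1)
    logits = text_emb @ node_emb.t() / temperature    # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0))          # matching pairs lie on the diagonal
    # Symmetric cross-entropy: text-to-node and node-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
```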
arXiv Detail & Related papers (2023-05-23T17:53:30Z)
- Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification [60.233529926965836]
We propose a new method called SHINE, which is based on graph neural network (GNN) for short text classification.
First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs.
Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts.
arXiv Detail & Related papers (2021-10-30T05:33:05Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated and original data to reinforce its capability of integrating information on the graph.
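A rough sketch of the nesting idea described above, assuming that a simple graph-aggregation step is interleaved with each transformer layer over the nodes' [CLS]-like states; the aggregation rule and module names are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative GNN-nested transformer layer: graph aggregation over node-level
# states is interleaved with the transformer block, so neighbor information
# flows into the token states layer by layer.
import torch
import torch.nn as nn

class GNNNestedLayer(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.transformer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.graph_agg = nn.Linear(dim, dim)    # simple stand-in for a GNN component

    def forward(self, token_states):
        # token_states: (num_nodes, seq_len, dim); node 0 is the center node,
        # the rest are its neighbors in the textual graph.
        cls = token_states[:, 0, :]                          # (num_nodes, dim)
        neighbor_msg = self.graph_agg(cls.mean(dim=0))       # aggregate node-level states
        token_states = token_states.clone()
        token_states[:, 0, :] = cls + neighbor_msg           # inject graph signal into [CLS]
        return self.transformer(token_states)                # then run the transformer layer

layers = nn.ModuleList([GNNNestedLayer() for _ in range(2)])
states = torch.randn(4, 16, 128)      # center node + 3 neighbors, 16 tokens each
for layer in layers:
    states = layer(states)
center_embedding = states[0, 0]       # final representation of the center node
```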
arXiv Detail & Related papers (2021-05-06T12:20:41Z)
- Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.