Pretraining Language Models with Text-Attributed Heterogeneous Graphs
- URL: http://arxiv.org/abs/2310.12580v2
- Date: Mon, 23 Oct 2023 01:46:04 GMT
- Title: Pretraining Language Models with Text-Attributed Heterogeneous Graphs
- Authors: Tao Zou, Le Yu, Yifei Huang, Leilei Sun, Bowen Du
- Abstract summary: We present a new pretraining framework for Language Models (LMs) that explicitly considers the topological and heterogeneous information in Text-Attributed Heterogeneous Graphs (TAHGs).
We propose a topology-aware pretraining task to predict nodes involved in the context graph by jointly optimizing an LM and an auxiliary heterogeneous graph neural network.
We conduct link prediction and node classification tasks on three datasets from various domains.
- Score: 28.579509154284448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many real-world scenarios (e.g., academic networks, social platforms),
different types of entities are not only associated with texts but also
connected by various relationships, which can be abstracted as Text-Attributed
Heterogeneous Graphs (TAHGs). Current pretraining tasks for Language Models
(LMs) primarily focus on separately learning the textual information of each
entity and overlook the crucial aspect of capturing topological connections
among entities in TAHGs. In this paper, we present a new pretraining framework
for LMs that explicitly considers the topological and heterogeneous information
in TAHGs. Firstly, we define a context graph as neighborhoods of a target node
within specific orders and propose a topology-aware pretraining task to predict
nodes involved in the context graph by jointly optimizing an LM and an
auxiliary heterogeneous graph neural network. Secondly, based on the
observation that some nodes are text-rich while others have little text, we
devise a text augmentation strategy to enrich textless nodes with their
neighbors' texts for handling the imbalance issue. We conduct link prediction
and node classification tasks on three datasets from various domains.
Experimental results demonstrate the superiority of our approach over existing
methods and the rationality of each design. Our code is available at
https://github.com/Hope-Rita/THLM.
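To make the pretraining objective above more concrete, here is a minimal PyTorch sketch of the topology-aware task as the abstract describes it: a text encoder (standing in for the LM) embeds a target node's text, an auxiliary heterogeneous GNN embeds candidate nodes, and both are jointly optimized to predict which candidates belong to the target's context graph. All module names, dimensions, and the dot-product scoring rule are illustrative assumptions, not the released THLM implementation.

```python
# Hypothetical sketch of a topology-aware pretraining objective:
# an LM stand-in encodes a target node's text, an auxiliary heterogeneous
# GNN stand-in encodes candidate nodes, and both are jointly trained to
# predict which candidates appear in the target's context graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextEncoder(nn.Module):                  # stand-in for a BERT-style LM
    def __init__(self, vocab_size=30522, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, token_ids):              # token_ids: (batch, seq_len)
        return self.emb(token_ids)             # (batch, dim) pooled text embedding

class HeteroGNN(nn.Module):                    # stand-in for the auxiliary heterogeneous GNN
    def __init__(self, num_node_types=3, dim=128):
        super().__init__()
        self.type_emb = nn.Embedding(num_node_types, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats, node_types):  # node_feats: (num_nodes, dim)
        return self.proj(node_feats + self.type_emb(node_types))

lm, gnn = TextEncoder(), HeteroGNN()
opt = torch.optim.Adam(list(lm.parameters()) + list(gnn.parameters()), lr=1e-4)

# Toy batch: one target node's tokens, 8 candidate nodes, and binary labels
# marking which candidates fall inside the target's context graph.
tokens = torch.randint(0, 30522, (1, 32))
cand_feats = torch.randn(8, 128)
cand_types = torch.randint(0, 3, (8,))
labels = torch.tensor([1, 1, 0, 1, 0, 0, 1, 0], dtype=torch.float)

target = lm(tokens)                             # (1, dim)
cands = gnn(cand_feats, cand_types)             # (8, dim)
scores = (cands @ target.t()).squeeze(1)        # dot-product membership scores
loss = F.binary_cross_entropy_with_logits(scores, labels)
loss.backward()
opt.step()
```

In this reading, the LM and GNN share the supervision signal coming from graph topology, which is how textual and structural information get tied together during pretraining.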
Related papers
- Large Language Model-based Augmentation for Imbalanced Node Classification on Text-Attributed Graphs [13.42259312243504]
We propose a novel approach called LA-TAG (LLM-based Augmentation on Text-Attributed Graphs)
It prompts Large Language Models to generate synthetic texts based on existing node texts in the graph.
To integrate these synthetic text-attributed nodes into the graph, we introduce a text-based link predictor.
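As a rough illustration of the mechanism summarized above, the sketch below generates synthetic node texts from existing minority-class texts and attaches the new nodes via a similarity-based, text-only link predictor. `call_llm` and `embed` are hypothetical placeholders, and the whole pipeline is my reading of the abstract, not the LA-TAG code.

```python
# Illustrative sketch of LLM-based augmentation plus a text-based link
# predictor (an interpretation of the LA-TAG abstract, not the authors' code).
import torch
import torch.nn.functional as F

def call_llm(prompt: str) -> str:               # hypothetical LLM call
    return "synthetic node text generated from the prompt"

def embed(texts):                                # stand-in text embedder
    return F.normalize(torch.randn(len(texts), 64), dim=-1)

minority_texts = ["paper on rare topic A", "short note on rare topic A"]
prompt = "Write a new text similar to these examples:\n" + "\n".join(minority_texts)
synthetic_texts = [call_llm(prompt) for _ in range(3)]

# Text-based link predictor: connect each synthetic node to its k most
# similar existing nodes in embedding space.
existing_emb = embed(minority_texts)             # (num_existing, d)
synthetic_emb = embed(synthetic_texts)           # (num_synthetic, d)
sims = synthetic_emb @ existing_emb.t()          # cosine similarities
k = 1
new_edges = [(i, int(j)) for i, row in enumerate(sims) for j in row.topk(k).indices]
print(new_edges)                                 # edges from synthetic to existing nodes
```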
arXiv Detail & Related papers (2024-10-22T10:36:15Z)
- UniGLM: Training One Unified Language Model for Text-Attributed Graphs [31.464021556351685]
Unified Graph Language Model (UniGLM) is a graph embedding model that generalizes well to both in-domain and cross-domain TAGs.
UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that is devised to accelerate training.
arXiv Detail & Related papers (2024-06-17T19:45:21Z)
- Unleashing the Potential of Text-attributed Graphs: Automatic Relation Decomposition via Large Language Models [31.443478448031886]
RoSE (Relation-oriented Semantic Edge-decomposition) is a novel framework that decomposes the graph structure by analyzing raw text attributes.
Our framework significantly enhances node classification performance across various datasets, with improvements of up to 16% on the Wisconsin dataset.
arXiv Detail & Related papers (2024-05-28T20:54:47Z)
- Learning Multiplex Representations on Text-Attributed Graphs with One Language Model Encoder [55.24276913049635]
We propose METAG, a new framework for learning Multiplex rEpresentations on Text-Attributed Graphs.
In contrast to existing methods, METAG uses one text encoder to model the shared knowledge across relations.
We conduct experiments on nine downstream tasks in five graphs from both academic and e-commerce domains.
arXiv Detail & Related papers (2023-10-10T14:59:22Z)
- Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z)
- ConGraT: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings [20.25180279903009]
We propose Contrastive Graph-Text pretraining (ConGraT) for jointly learning separate representations of texts and nodes in a text-attributed graph (TAG).
Our method trains a language model (LM) and a graph neural network (GNN) to align their representations in a common latent space using a batch-wise contrastive learning objective inspired by CLIP.
Experiments demonstrate that ConGraT outperforms baselines on various downstream tasks, including node and text category classification, link prediction, and language modeling.
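The batch-wise contrastive objective described above can be illustrated with a short CLIP-style loss over paired LM and GNN embeddings of the same nodes; the encoder choices and temperature value here are assumptions, not ConGraT's exact configuration.

```python
# Minimal sketch of a CLIP-style batch-wise contrastive objective aligning
# LM and GNN embeddings of the same nodes (illustrative only).
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, node_emb, temperature=0.07):
    # text_emb, node_emb: (batch, dim) embeddings of the same nodes from
    # the language model and the graph neural network, respectively.
    text_emb = F.normalize(text_emb, dim=-1)
    node_emb = F.normalize(node_emb, dim=-1)
    logits = text_emb @ node_emb.t() / temperature    # (batch, batch) similarities
    targets = torch.arange(text_emb.size(0))          # matching pairs lie on the diagonal
    # Symmetric cross-entropy: text-to-node and node-to-text directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
```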
arXiv Detail & Related papers (2023-05-23T17:53:30Z)
- Hierarchical Heterogeneous Graph Representation Learning for Short Text Classification [60.233529926965836]
We propose a new method called SHINE, which is based on graph neural network (GNN) for short text classification.
First, we model the short text dataset as a hierarchical heterogeneous graph consisting of word-level component graphs.
Then, we dynamically learn a short document graph that facilitates effective label propagation among similar short texts.
arXiv Detail & Related papers (2021-10-30T05:33:05Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms plain-text pre-training while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated and original data to reinforce its capability of integrating information on the graph.
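A rough sketch of the nesting idea described above, assuming that a simple graph-aggregation step is interleaved with each transformer layer over the nodes' [CLS]-like states; the aggregation rule and module names are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative GNN-nested transformer layer: graph aggregation over node-level
# states is interleaved with the transformer block, so neighbor information
# flows into the token states layer by layer.
import torch
import torch.nn as nn

class GNNNestedLayer(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.transformer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.graph_agg = nn.Linear(dim, dim)    # simple stand-in for a GNN component

    def forward(self, token_states):
        # token_states: (num_nodes, seq_len, dim); node 0 is the center node,
        # the rest are its neighbors in the textual graph.
        cls = token_states[:, 0, :]                          # (num_nodes, dim)
        neighbor_msg = self.graph_agg(cls.mean(dim=0))       # aggregate node-level states
        token_states = token_states.clone()
        token_states[:, 0, :] = cls + neighbor_msg           # inject graph signal into [CLS]
        return self.transformer(token_states)                # then run the transformer layer

layers = nn.ModuleList([GNNNestedLayer() for _ in range(2)])
states = torch.randn(4, 16, 128)      # center node + 3 neighbors, 16 tokens each
for layer in layers:
    states = layer(states)
center_embedding = states[0, 0]       # final representation of the center node
```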
arXiv Detail & Related papers (2021-05-06T12:20:41Z)
- Iterative Context-Aware Graph Inference for Visual Dialog [126.016187323249]
We propose a novel Context-Aware Graph (CAG) neural network.
Each node in the graph corresponds to a joint semantic feature, including both object-based (visual) and history-related (textual) context representations.
arXiv Detail & Related papers (2020-04-05T13:09:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.