Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings
- URL: http://arxiv.org/abs/2602.04630v1
- Date: Wed, 04 Feb 2026 15:02:32 GMT
- Title: Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings
- Authors: Tim Kunt, Annika Buchholz, Imene Khebouri, Thorsten Koch, Ida Litzel, Thi Huong Vu,
- Abstract summary: Large text data sets inherit two types of features: the text itself, with its information conveyed through semantics, and its relationship to other texts through links, references, or shared attributes. We investigate the Web of Science dataset, containing ~56 million scientific publications, through the lens of our proposed embedding method.
- Score: 0.722741581069214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large text data sets, such as publications, websites, and other text-based media, inherit two distinct types of features: (1) the text itself, its information conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and can be handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. Demonstrating these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
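To make the two feature types concrete, the short sketch below builds both views on a toy corpus: LLM-style text embeddings for the semantic view and a citation graph for the relational view. This is a minimal illustration under stated assumptions, not the authors' pipeline; the embedding model (`all-MiniLM-L6-v2`), the field names, and the use of sentence-transformers and networkx are stand-ins chosen for brevity.

```python
# Illustrative sketch (not the paper's released code): the two feature types the
# abstract describes, on a two-paper toy corpus. Model choice, record fields,
# and libraries are assumptions.
from sentence_transformers import SentenceTransformer
import networkx as nx
import numpy as np

papers = [
    {"id": "p1", "abstract": "Graph neural networks for citation analysis.", "cites": ["p2"]},
    {"id": "p2", "abstract": "Transformer embeddings of scientific text.", "cites": []},
]

# (1) Semantic features: text embeddings of each abstract.
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed stand-in embedding model
emb = {p["id"]: encoder.encode(p["abstract"], normalize_embeddings=True) for p in papers}

# (2) Relational features: a citation graph over the same records.
g = nx.DiGraph()
g.add_nodes_from(p["id"] for p in papers)
g.add_edges_from((p["id"], c) for p in papers for c in p["cites"])

# Both views can now be combined, e.g. text similarity alongside graph adjacency.
sim = float(np.dot(emb["p1"], emb["p2"]))
print(f"text similarity p1-p2: {sim:.3f}, citation edge: {g.has_edge('p1', 'p2')}")
```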
Related papers
- A Novel Graph-Sequence Learning Model for Inductive Text Classification [7.129773362505109]
Text classification plays an important role in various downstream text-related tasks, such as sentiment analysis, fake news detection, and public opinion analysis. We propose a Novel Graph-Sequence Learning Model for Inductive Text Classification (TextGSL) to address the previously mentioned issues. TextGSL has been comprehensively compared with several strong baselines.
arXiv Detail & Related papers (2025-12-23T06:49:33Z) - Human Mobility Datasets Enriched With Contextual and Social Dimensions [1.0268257686354103]
We present two datasets of semantically enriched human trajectories, together with a pipeline to build them. The trajectories are publicly available GPS traces retrieved from OpenStreetMap. A novel semantic feature is the inclusion of synthetic, realistic social media posts generated by Large Language Models.
arXiv Detail & Related papers (2025-09-26T07:45:27Z) - Unify Graph Learning with Text: Unleashing LLM Potentials for Session Search [35.20525123189316]
Session search involves a series of interactive queries and actions to fulfill a user's complex information needs. Current strategies typically prioritize sequential modeling for deep semantic understanding, overlooking the graph structure in interactions. We propose Symbolic Graph Ranker (SGR), which aims to take advantage of both text-based and graph-based approaches.
arXiv Detail & Related papers (2025-05-20T10:05:06Z) - MapExplorer: New Content Generation from Low-Dimensional Visualizations [60.02149343347818]
Low-dimensional visualizations, or "projection maps," are widely used to interpret large-scale and complex datasets. These visualizations not only aid in understanding existing knowledge spaces but also implicitly guide exploration into unknown areas. We introduce MapExplorer, a novel knowledge discovery task that translates coordinates within any projection map into coherent, contextually aligned textual content.
arXiv Detail & Related papers (2024-12-24T20:16:13Z) - GT2Vec: Large Language Models as Multi-Modal Encoders for Text and Graph-Structured Data [42.18348019901044]
GT2Vec is a framework that leverages Large Language Models to jointly encode text and graph data. Unlike prior work, we also introduce contrastive learning to align the graph and text spaces more effectively (a minimal sketch of such a contrastive objective appears after this list).
arXiv Detail & Related papers (2024-10-15T03:40:20Z) - Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z) - Large Language Models on Graphs: A Comprehensive Survey [77.16803297418201]
We provide a systematic review of scenarios and techniques related to large language models on graphs.
We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-attributed graphs, and text-paired graphs.
We discuss the real-world applications of such methods and summarize open-source codes and benchmark datasets.
arXiv Detail & Related papers (2023-12-05T14:14:27Z) - Pretraining Language Models with Text-Attributed Heterogeneous Graphs [28.579509154284448]
We present a new pretraining framework for Language Models (LMs) that explicitly considers the topological and heterogeneous information in Text-Attributed Heterogeneous Graphs (TAHGs).
We propose a topology-aware pretraining task to predict nodes involved in the context graph by jointly optimizing an LM and an auxiliary heterogeneous graph neural network.
We conduct link prediction and node classification tasks on three datasets from various domains.
arXiv Detail & Related papers (2023-05-31T03:18:03Z) - Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
arXiv Detail & Related papers (2023-05-31T03:18:03Z) - The Semantic Scholar Open Data Platform [92.2948743167744]
Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings.
arXiv Detail & Related papers (2023-01-24T17:13:08Z) - Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms pre-training on plain text while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z) - Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
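For the contrastive graph-text alignment mentioned in the GT2Vec entry above, the following is a minimal sketch of how such an objective is commonly implemented: a symmetric InfoNCE loss that pulls each node's graph embedding toward the embedding of its own text and away from the other texts in the batch. The encoder architectures, embedding dimensions, and temperature value are illustrative assumptions, not that paper's implementation.

```python
# Hedged sketch of contrastive graph-text alignment: symmetric InfoNCE over a
# batch of paired graph and text embeddings. Dimensions and temperature are
# assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(graph_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """graph_emb, text_emb: [batch, dim] embeddings of the same items."""
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature                        # pairwise similarities
    targets = torch.arange(g.size(0), device=g.device)    # matching pairs on the diagonal
    # Symmetric cross-entropy: graph-to-text and text-to-graph directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Example usage with random stand-in embeddings:
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```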
This list is automatically generated from the titles and abstracts of the papers in this site.