Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
- URL: http://arxiv.org/abs/2602.15540v1
- Date: Tue, 17 Feb 2026 12:44:05 GMT
- Title: Perspectives - Interactive Document Clustering in the Discourse Analysis Tool Suite
- Authors: Tim Fischer, Chris Biemann
- Abstract summary: Perspectives is a tool suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities.
- Score: 20.935269641413694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces Perspectives, an interactive extension of the Discourse Analysis Tool Suite designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities. We showcase how this process can be initially steered by defining analytical lenses through document rewriting prompts and instruction-based embeddings, and further aligned with user intent through tools for refining clusters and mechanisms for fine-tuning the embedding model. The demonstration highlights a typical workflow, illustrating how DH researchers can leverage Perspectives's interactive document map to uncover topics, sentiments, or other relevant categories, thereby gaining insights and preparing their data for subsequent in-depth analysis.
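The workflow described in the abstract (choose an analytical lens, embed, cluster, then refine by hand) can be illustrated with a small sketch. This is not the Perspectives implementation: the keyword filter `rewrite` stands in for the LLM rewriting prompts, the bag-of-words `embed` stands in for instruction-based embeddings, and every name here (`SENTIMENT_LENS`, `rewrite`, `embed`, `cluster`, `reassign`) is hypothetical.

```python
import math
from collections import Counter

# Hypothetical "lens": words relevant to one aspect (here, sentiment).
# Assumption: the real tool uses LLM rewriting prompts instead of a word list.
SENTIMENT_LENS = {"great", "love", "excellent", "terrible", "hate", "awful"}

def rewrite(doc, lens):
    """Keep only the words of a document that belong to the chosen lens."""
    words = (w.strip(".,!?") for w in doc.lower().split())
    return [w for w in words if w in lens]

def embed(tokens, vocab):
    """Bag-of-words vector over a fixed vocabulary (stand-in for a neural embedding)."""
    counts = Counter(tokens)
    return [float(counts[w]) for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster(vectors, seeds, iterations=10):
    """k-means-style loop using cosine similarity; `seeds` are indices of initial centroids."""
    centroids = [vectors[i][:] for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its most similar centroid.
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def reassign(labels, doc_index, new_label):
    """Human-in-the-loop step: the user moves one document to another cluster."""
    updated = list(labels)
    updated[doc_index] = new_label
    return updated

docs = [
    "The museum exhibit was great and I love the curation.",
    "The museum exhibit was terrible, I hate the layout.",
    "An excellent novel, I love the prose.",
    "An awful novel, terrible pacing.",
]
vocab = sorted(SENTIMENT_LENS)
vectors = [embed(rewrite(d, SENTIMENT_LENS), vocab) for d in docs]
labels = cluster(vectors, seeds=[0, 1])
print(labels)  # [0, 1, 0, 1]: grouped by sentiment, not by museum vs. novel
```

With a topic-oriented lens instead of the sentiment one, the same documents would group as museum vs. novel, which is the point of steering the clustering by aspect. Calling `reassign(labels, 2, 1)` mimics a manual correction; in the paper such corrections can additionally feed fine-tuning of the embedding model, which this sketch does not attempt.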
Related papers
- From Reflection to Repair: A Scoping Review of Dataset Documentation Tools [10.124271544484634]
We present a systematic review supported by mixed-methods analysis of 59 dataset documentation publications. Our analysis shows four persistent patterns in dataset documentation conceptualization that potentially impede adoption and standardization. Building on these findings, we propose a shift in Responsible AI tool design toward institutional rather than individual solutions.
arXiv Detail & Related papers (2026-02-17T19:37:16Z) - DREAM: Document Reconstruction via End-to-end Autoregressive Model [53.51754520966657]
We present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM). We establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task.
arXiv Detail & Related papers (2025-07-08T09:24:07Z) - Conceptual Topic Aggregation [0.0]
We propose FAT-CAT, an approach based on Formal Concept Analysis (FCA) to enhance meaningful topic aggregation and visualization. Our approach can handle diverse topics and file types -- grouped by directories -- to construct a concept lattice that offers a structured, hierarchical representation of their topic distribution.
arXiv Detail & Related papers (2025-06-27T15:19:38Z) - From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions [60.733557487886635]
This paper focuses on bridging the comprehension gap between Large Language Models and external tools. We propose a novel framework, DRAFT, aimed at Dynamically Refining tool documentation. This methodology pivots on an innovative trial-and-error approach, consisting of three distinct learning phases.
arXiv Detail & Related papers (2024-10-10T17:58:44Z) - Interactive Topic Models with Optimal Transport [75.26555710661908]
We present EdTM, an approach for label-name-supervised topic modeling.
EdTM frames topic modeling as an assignment problem while leveraging LM/LLM-based document-topic affinities.
arXiv Detail & Related papers (2024-06-28T13:57:27Z) - HADES: Homologous Automated Document Exploration and Summarization [3.3509104620016092]
HADES is designed to streamline the work of professionals dealing with large volumes of documents.
The tool employs a multi-step pipeline that begins with processing PDF documents using topic modeling, summarization, and analysis of the most important words for each topic.
arXiv Detail & Related papers (2023-02-25T15:16:10Z) - Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis [4.920817773181236]
Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis.
We first construct graphs to explicitly describe four main aspects, including syntactic, semantic, density, and appearance/visual information.
We apply graph convolutional networks for representing each aspect of information and use pooling to integrate them.
arXiv Detail & Related papers (2022-08-22T07:22:05Z) - Scholastic: Graphical Human-AI Collaboration for Inductive and Interpretive Text Analysis [20.008165537258254]
Interpretive scholars generate knowledge from text corpora by manually sampling documents, applying codes, and refining and collating codes into categories until meaningful themes emerge.
Given a large corpus, machine learning could help scale this data sampling and analysis, but prior research shows that experts are generally concerned about algorithms potentially disrupting or driving interpretive scholarship.
We take a human-centered design approach to addressing these concerns, using a machine-in-the-loop clustering algorithm to scaffold interpretive text analysis.
arXiv Detail & Related papers (2022-08-12T06:41:45Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - iFacetSum: Coreference-based Interactive Faceted Summarization for Multi-Document Exploration [63.272359227081836]
iFacetSum integrates interactive summarization together with faceted search.
Fine-grained facets are automatically produced based on cross-document coreference pipelines.
arXiv Detail & Related papers (2021-09-23T20:01:11Z) - DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation.
We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner.
Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.