Beyond Text: Characterizing Domain Expert Needs in Document Research
- URL: http://arxiv.org/abs/2504.12495v1
- Date: Wed, 16 Apr 2025 21:24:41 GMT
- Title: Beyond Text: Characterizing Domain Expert Needs in Document Research
- Authors: Sireesh Gururaja, Nupoor Gandhi, Jeremiah Milbauer, Emma Strubell,
- Abstract summary: We ask sixteen domain experts across two domains to understand their processes of document research.<n>We find that our participants processes are idiosyncratic, iterative, and rely extensively on the social context of a document.<n>We call on the NLP community to more carefully consider the role of the document in building useful tools.
- Score: 10.98467955215441
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Working with documents is a key part of almost any knowledge work, from contextualizing research in a literature review to reviewing legal precedent. Recently, as their capabilities have expanded, primarily text-based NLP systems have often been billed as able to assist or even automate this kind of work. But to what extent are these systems able to model these tasks as experts conceptualize and perform them now? In this study, we interview sixteen domain experts across two domains to understand their processes of document research, and compare it to the current state of NLP systems. We find that our participants processes are idiosyncratic, iterative, and rely extensively on the social context of a document in addition its content; existing approaches in NLP and adjacent fields that explicitly center the document as an object, rather than as merely a container for text, tend to better reflect our participants' priorities, though they are often less accessible outside their research communities. We call on the NLP community to more carefully consider the role of the document in building useful tools that are accessible, personalizable, iterative, and socially aware.
Related papers
- Judgement Citation Retrieval using Contextual Similarity [0.0]
We propose a methodology that combines natural language processing (NLP) and machine learning techniques to enhance the organization and utilization of legal case descriptions.
Our methodology addresses two primary objectives: unsupervised clustering and supervised citation retrieval.
Our methodology achieved an impressive accuracy rate of 90.9%.
arXiv Detail & Related papers (2024-05-28T04:22:28Z) - Knowledge-Driven Cross-Document Relation Extraction [3.868708275322908]
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task.
We propose a novel approach, KXDocRE, that embed domain knowledge of entities with input text for cross-document RE.
arXiv Detail & Related papers (2024-05-22T11:30:59Z) - What Can Natural Language Processing Do for Peer Review? [173.8912784451817]
In modern science, peer review is widely used, yet it is hard, time-consuming, and prone to error.
Since the artifacts involved in peer review are largely text-based, Natural Language Processing has great potential to improve reviewing.
We detail each step of the process from manuscript submission to camera-ready revision, and discuss the associated challenges and opportunities for NLP assistance.
arXiv Detail & Related papers (2024-05-10T16:06:43Z) - Understanding Archives: Towards New Research Interfaces Relying on the Semantic Annotation of Documents [0.2302001830524133]
We show how the semantic annotation of the textual content of study corpora of archival documents allow to facilitate their exploitation and valorisation.
First, we present a methodological framework for the construction of new interfaces based on textual semantics, then address the current technological obstacles and their potential solutions.
arXiv Detail & Related papers (2024-03-28T07:55:29Z) - Neural Natural Language Processing for Long Texts: A Survey on Classification and Summarization [6.728794938150435]
The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP)
The ever increasing size of documents uploaded online renders automated understanding of lengthy texts a critical issue.
This article serves as an entry point into this dynamic domain and aims to achieve two objectives.
arXiv Detail & Related papers (2023-05-25T17:13:44Z) - DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task emphDocument-Aware Passage Retrieval (DAPR)
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z) - Synergistic Interplay between Search and Large Language Models for
Information Retrieval [141.18083677333848]
InteR allows RMs to expand knowledge in queries using LLM-generated knowledge collections.
InteR achieves overall superior zero-shot retrieval performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2023-05-12T11:58:15Z) - The Use of NLP-Based Text Representation Techniques to Support
Requirement Engineering Tasks: A Systematic Mapping Review [1.5469452301122177]
The research direction has changed from the use of lexical and syntactic features to the use of advanced embedding techniques.
We identify four gaps in the existing literature, why they matter, and how future research can begin to address them.
arXiv Detail & Related papers (2022-05-17T02:47:26Z) - Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity)
arXiv Detail & Related papers (2022-02-03T17:25:46Z) - A Survey of Deep Learning Approaches for OCR and Document Understanding [68.65995739708525]
We review different techniques for document understanding for documents written in English.
We consolidate methodologies present in literature to act as a jumping-off point for researchers exploring this area.
arXiv Detail & Related papers (2020-11-27T03:05:59Z) - From Standard Summarization to New Tasks and Beyond: Summarization with
Manifold Information [77.89755281215079]
Text summarization is the research area aiming at creating a short and condensed version of the original document.
In real-world applications, most of the data is not in a plain text format.
This paper focuses on the survey of these new summarization tasks and approaches in the real-world application.
arXiv Detail & Related papers (2020-05-10T14:59:36Z) - Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z) - Conversations with Documents. An Exploration of Document-Centered
Assistance [55.60379539074692]
Document-centered assistance, for example, to help an individual quickly review a document, has seen less significant progress.
We present a survey to understand the space of document-centered assistance and the capabilities people expect in this scenario.
We present a set of initial machine learned models that show that (a) we can accurately detect document-centered questions, and (b) we can build reasonably accurate models for answering such questions.
arXiv Detail & Related papers (2020-01-27T17:10:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.