cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers
- URL: http://arxiv.org/abs/2406.08398v1
- Date: Wed, 12 Jun 2024 16:46:12 GMT
- Title: cPAPERS: A Dataset of Situated and Multimodal Interactive Conversations in Scientific Papers
- Authors: Anirudh Sundar, Jin Xu, William Gay, Christopher Richardson, Larry Heck
- Abstract summary: This work introduces Conversational Papers (cPAPERS), a dataset of conversational question-answer pairs from reviews of academic papers.
We present a data collection strategy to collect these question-answer pairs from OpenReview and associate them with contextual information from source files.
- Score: 5.103692331918768
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An emerging area of research in situated and multimodal interactive conversations (SIMMC) includes interactions in scientific papers. Since scientific papers are primarily composed of text, equations, figures, and tables, SIMMC methods must be developed specifically for each component to support the depth of inquiry and interactions required by research scientists. This work introduces Conversational Papers (cPAPERS), a dataset of conversational question-answer pairs from reviews of academic papers grounded in these paper components and their associated references from scientific documents available on arXiv. We present a data collection strategy to collect these question-answer pairs from OpenReview and associate them with contextual information from LaTeX source files. Additionally, we present a series of baseline approaches utilizing Large Language Models (LLMs) in both zero-shot and fine-tuned configurations to address the cPAPERS dataset.
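The abstract says question-answer pairs are associated with contextual information from LaTeX source files, grounded in equations, figures, and tables. The following is a minimal sketch of what locating those components in a LaTeX string could look like. It is not the authors' pipeline: the regular expressions, the `extract_components` helper, and the returned dictionary layout are all assumptions made for illustration, and a real collection would need to handle nested environments, custom macros, and multi-file projects.

```python
import re

# Hypothetical patterns: match equation/figure/table environments,
# including starred variants such as table*.
ENV_PATTERN = re.compile(
    r"\\begin\{(equation|figure|table)(\*?)\}(.*?)\\end\{\1\2\}",
    re.DOTALL,
)
LABEL_PATTERN = re.compile(r"\\label\{([^}]*)\}")


def extract_components(latex_source):
    """Return one dict per equation, figure, or table environment found.

    Illustrative helper only; the name and output format are assumptions.
    """
    components = []
    for match in ENV_PATTERN.finditer(latex_source):
        kind, _star, body = match.groups()
        label = LABEL_PATTERN.search(body)
        components.append(
            {
                "kind": kind,
                # The label, if present, is what in-text references point to.
                "label": label.group(1) if label else None,
                "body": body.strip(),
            }
        )
    return components
```

A question that mentions, say, a specific table could then be paired with the entry whose `kind` is `"table"` and whose `label` matches the reference, which is one plausible way to attach context to a question-answer pair.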
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions [62.0123588983514]
Large Language Models (LLMs) have demonstrated wide-ranging applications across various fields.
We reformulate the peer-review process as a multi-turn, long-context dialogue, incorporating distinct roles for authors, reviewers, and decision makers.
We construct a comprehensive dataset containing over 26,841 papers with 92,017 reviews collected from multiple sources.
arXiv Detail & Related papers (2024-06-09T08:24:17Z)
- Context-Enhanced Language Models for Generating Multi-Paper Citations [35.80247519023821]
We propose a method that leverages Large Language Models (LLMs) to generate multi-citation sentences.
Our approach involves a single source paper and a collection of target papers, culminating in a coherent paragraph containing multi-sentence citation text.
arXiv Detail & Related papers (2024-04-22T04:30:36Z)
- SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval [64.03631654052445]
Current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap.
We develop a specialised scientific MMIR benchmark by leveraging open-access paper collections.
This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents.
arXiv Detail & Related papers (2024-01-24T14:23:12Z)
- Bridging Research and Readers: A Multi-Modal Automated Academic Papers Interpretation System [47.13932723910289]
We introduce an open-source multi-modal automated academic paper interpretation system (MMAPIS) with a three-stage process.
It employs the hybrid modality preprocessing and alignment module to extract plain text, and tables or figures from documents separately.
It then aligns this information based on the section names they belong to, ensuring that data with identical section names are categorized under the same section.
It utilizes the extracted section names to divide the article into shorter text segments, facilitating specific summarizations both within and between sections via LLMs.
arXiv Detail & Related papers (2024-01-17T11:50:53Z)
- QuOTeS: Query-Oriented Technical Summarization [0.2936007114555107]
We propose QuOTeS, an interactive system designed to retrieve sentences related to a summary of the research from a collection of potential references.
QuOTeS integrates techniques from Query-Focused Extractive Summarization and High-Recall Information Retrieval to provide Interactive Query-Focused Summarization of scientific documents.
The results show that QuOTeS provides a positive user experience and consistently provides query-focused summaries that are relevant, concise, and complete.
arXiv Detail & Related papers (2023-06-20T18:43:24Z)
- Topic Segmentation of Research Article Collections [4.0810783261728565]
We perform topic segmentation of a paper data collection that we crawled and produce a multitopic dataset of roughly seven million paper data records.
We construct a taxonomy of topics extracted from the data records and then annotate each document with its corresponding topic from that taxonomy.
It is possible to use this newly proposed dataset in two modalities: as a heterogeneous collection of documents from various disciplines or as a set of homogeneous collections, each from a single research topic.
arXiv Detail & Related papers (2022-05-18T15:19:42Z)
- Revise and Resubmit: An Intertextual Model of Text-based Collaboration in Peer Review [52.359007622096684]
Peer review is a key component of the publishing process in most fields of science.
Existing NLP studies focus on the analysis of individual texts.
Yet editorial assistance often requires modeling interactions between pairs of texts.
arXiv Detail & Related papers (2022-04-22T16:39:38Z)
- NLP Scholar: An Interactive Visual Explorer for Natural Language Processing Literature [31.87319293259599]
We describe several interconnected interactive visualizations (dashboards) that present various aspects of the data.
The interactive visualizations presented here, and the associated dataset of papers mapped to citations, have additional uses as well including understanding how the field is growing.
arXiv Detail & Related papers (2020-05-31T17:12:37Z)
- Explaining Relationships Between Scientific Documents [55.23390424044378]
We address the task of explaining relationships between two scientific documents using natural language text.
In this paper we establish a dataset of 622K examples from 154K documents.
arXiv Detail & Related papers (2020-02-02T03:54:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.