MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting
- URL: http://arxiv.org/abs/2107.00414v1
- Date: Thu, 1 Jul 2021 12:54:23 GMT
- Title: MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting
- Authors: Anne Lauscher, Brandon Ko, Bailey Kuhl, Sophie Johnson, David Jurgens,
Arman Cohan, Kyle Lo
- Abstract summary: We release MultiCite, a new dataset of 12,653 citation contexts from over 1,200 computational linguistics papers.
We show how our dataset, while still usable for training classic CCA models, also supports the development of new types of models for CCA beyond fixed-width text classification.
- Score: 13.493267499658527
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Citation context analysis (CCA) is an important task in natural language
processing that studies how and why scholars discuss each other's work. Despite
being studied for decades, traditional frameworks for CCA have largely relied
on overly-simplistic assumptions of how authors cite, which ignore several
important phenomena. For instance, scholarly papers often contain rich
discussions of cited work that span multiple sentences and express multiple
intents concurrently. Yet, CCA is typically approached as a single-sentence,
single-label classification task, and thus existing datasets fail to capture
this interesting discourse. In our work, we address this research gap by
proposing a novel framework for CCA as a document-level context extraction and
labeling task. We release MultiCite, a new dataset of 12,653 citation contexts
from over 1,200 computational linguistics papers. Not only is it the largest
collection of expert-annotated citation contexts to date, but it also contains
multi-sentence, multi-label citation contexts within full paper texts. Finally,
we demonstrate how our dataset, while still usable for training classic CCA
models, also supports the development of new types of models for CCA beyond
fixed-width text classification. We release our code and dataset at
https://github.com/allenai/multicite.
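To make the task formulation concrete, here is a minimal sketch of what a multi-sentence, multi-label citation context could look like as a data structure. The field names and intent labels are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema illustrating MultiCite's task formulation:
# a citation context may span several sentences and carry several
# intent labels at once. Names are illustrative, not the dataset's
# actual format.

@dataclass
class CitationContext:
    cited_paper_id: str          # identifier of the cited work
    sentence_indices: list[int]  # sentences in the citing paper forming the context
    intents: set[str] = field(default_factory=set)  # multiple labels allowed

@dataclass
class CitingPaper:
    paper_id: str
    sentences: list[str]
    contexts: list[CitationContext] = field(default_factory=list)

# Example: a two-sentence context expressing two intents concurrently.
paper = CitingPaper(
    paper_id="P1",
    sentences=[
        "Smith et al. (2020) introduced a span-based tagger.",
        "We adopt their architecture but replace the encoder.",
        "An unrelated sentence.",
    ],
    contexts=[
        CitationContext(
            cited_paper_id="smith-2020",
            sentence_indices=[0, 1],        # multi-sentence
            intents={"background", "uses"}, # multi-label
        )
    ],
)

for ctx in paper.contexts:
    span = " ".join(paper.sentences[i] for i in ctx.sentence_indices)
    print(f"{ctx.cited_paper_id}: {sorted(ctx.intents)} -> {span}")
```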
Related papers
- HLM-Cite: Hybrid Language Model Workflow for Text-based Scientific Citation Prediction [14.731720495144112]
We introduce the novel concept of core citation, which identifies the critical references that go beyond superficial mentions.
We propose HLM-Cite, a Hybrid Language Model workflow for citation prediction.
We evaluate HLM-Cite across 19 scientific fields, demonstrating a 17.6% performance improvement over SOTA methods.
arXiv Detail & Related papers (2024-10-10T10:46:06Z)
- Context-Enhanced Language Models for Generating Multi-Paper Citations [35.80247519023821]
We propose a method that leverages Large Language Models (LLMs) to generate multi-citation sentences.
Our approach involves a single source paper and a collection of target papers, culminating in a coherent paragraph containing multi-sentence citation text.
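As a rough illustration of this setup, the sketch below assembles a single prompt from one source abstract and several target abstracts. The prompt wording and the downstream LLM call are assumptions, not the paper's actual pipeline.

```python
# Minimal sketch: build a multi-citation generation prompt from one
# source paper and a collection of cited targets. The template text
# is hypothetical; the result would be sent to an LLM API of choice.

def build_multi_citation_prompt(source_abstract: str,
                                targets: dict[str, str]) -> str:
    """targets maps a citation key (e.g. '[1]') to that paper's abstract."""
    lines = [
        "Write a related-work paragraph for the source paper that cites",
        "every target paper below using its citation key.",
        "",
        f"Source abstract: {source_abstract}",
        "",
    ]
    for key, abstract in targets.items():
        lines.append(f"Target {key}: {abstract}")
    lines += ["", "Related-work paragraph:"]
    return "\n".join(lines)

prompt = build_multi_citation_prompt(
    "We study citation intent classification ...",
    {"[1]": "A dataset of citation contexts ...",
     "[2]": "A benchmark for citation text generation ..."},
)
print(prompt)
```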
arXiv Detail & Related papers (2024-04-22T04:30:36Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems in domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
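For readers unfamiliar with the objective, below is a minimal sketch of the in-batch contrastive (InfoNCE) loss typically used to train dual-encoder retrievers such as CCPR. The encoders are omitted; the tensors stand in for query and phrase embeddings, and the temperature value is an assumption.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor,
                  phrase_emb: torch.Tensor,
                  temperature: float = 0.05) -> torch.Tensor:
    """query_emb[i] and phrase_emb[i] are an aligned pair; shape (B, D).
    Other in-batch phrases act as negatives for each query."""
    query_emb = F.normalize(query_emb, dim=-1)
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    logits = query_emb @ phrase_emb.T / temperature  # (B, B) similarities
    labels = torch.arange(logits.size(0))            # positives on the diagonal
    return F.cross_entropy(logits, labels)

# Toy batch: 4 aligned pairs with 128-dimensional embeddings.
q, p = torch.randn(4, 128), torch.randn(4, 128)
print(info_nce_loss(q, p).item())
```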
arXiv Detail & Related papers (2024-03-25T14:46:51Z)
- In-Context Learning for Text Classification with Many Labels [34.87532045406169]
In-context learning (ICL) using large language models for tasks with many labels is challenging due to the limited context window.
We use a pre-trained dense retrieval model to bypass this limitation.
We analyze performance across varying numbers of in-context examples and different model scales.
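A minimal sketch of the retrieval step, under the assumption that the retriever picks the labeled examples most similar to each test input for the prompt; embed() is a placeholder standing in for a pre-trained dense encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder encoder; a real system would use a pre-trained model."""
    return rng.normal(size=(len(texts), 64))

def select_examples(test_text: str,
                    pool_texts: list[str],
                    pool_labels: list[str],
                    k: int = 8) -> list[tuple[str, str]]:
    pool = embed(pool_texts)
    query = embed([test_text])[0]
    # Cosine similarity between the test input and every labeled candidate.
    sims = pool @ query / (np.linalg.norm(pool, axis=1) * np.linalg.norm(query))
    top = np.argsort(-sims)[:k]  # keep only the k nearest examples
    return [(pool_texts[i], pool_labels[i]) for i in top]

demos = select_examples("book a table for two",
                        ["play some jazz", "reserve a restaurant", "set an alarm"],
                        ["music", "booking", "alarm"], k=2)
# demos would be formatted into the prompt ahead of the test input.
print(demos)
```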
arXiv Detail & Related papers (2023-09-19T22:41:44Z)
- CiteBench: A benchmark for Scientific Citation Text Generation [69.37571393032026]
CiteBench is a benchmark for citation text generation.
We make the code for CiteBench publicly available at https://github.com/UKPLab/citebench.
arXiv Detail & Related papers (2022-12-19T16:10:56Z)
- The Fellowship of the Authors: Disambiguating Names from Social Network Context [2.3605348648054454]
Authority lists with extensive textual descriptions for each entity are lacking, and named entities are often ambiguous.
We combine BERT-based mention representations with a variety of graph induction strategies and experiment with supervised and unsupervised cluster inference methods.
We find that in-domain language model pretraining can significantly improve mention representations, especially for larger corpora.
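As a sketch of the unsupervised side of such a pipeline: threshold pairwise cosine similarity between mention embeddings to induce a graph, then read off connected components as entity clusters. The placeholder vectors stand in for BERT-based mention representations, and the threshold is an assumption.

```python
import numpy as np

def cluster_mentions(emb: np.ndarray, threshold: float = 0.8) -> list[int]:
    """Connected components of the thresholded cosine-similarity graph,
    computed with a small union-find over mention indices."""
    n = emb.shape[0]
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] >= threshold:    # edge in the induced graph
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]    # cluster id per mention

rng = np.random.default_rng(1)
mention_embeddings = rng.normal(size=(6, 32))  # placeholder representations
print(cluster_mentions(mention_embeddings))
```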
arXiv Detail & Related papers (2022-08-31T21:51:55Z)
- CORWA: A Citation-Oriented Related Work Annotation Dataset [4.740962650068886]
In natural language processing, literature reviews are usually presented in the form of a "Related Work" section.
We train a strong baseline model that automatically tags the CORWA labels on massive unlabeled related work section texts.
We suggest a novel framework for human-in-the-loop, iterative, abstractive related work generation.
arXiv Detail & Related papers (2022-05-07T00:23:46Z)
- Towards generating citation sentences for multiple references with intent control [86.53829532976303]
We build a novel generation model with the Fusion-in-Decoder approach to cope with multiple long inputs.
Experiments demonstrate that the proposed approaches provide much more comprehensive features for generating citation sentences.
arXiv Detail & Related papers (2021-12-02T15:32:24Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches to problems arising in short text.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Sentiment analysis in tweets: an assessment study from classical to modern text representation models [59.107260266206445]
Short texts published on Twitter have attracted significant attention as a rich source of information.
Their inherent characteristics, such as an informal and noisy linguistic style, remain challenging for many natural language processing (NLP) tasks.
This study assesses existing language models on distinguishing the sentiment expressed in tweets, using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.