Structure-Tags Improve Text Classification for Scholarly Document
Quality Prediction
- URL: http://arxiv.org/abs/2005.00129v2
- Date: Thu, 17 Dec 2020 20:35:14 GMT
- Title: Structure-Tags Improve Text Classification for Scholarly Document
Quality Prediction
- Authors: Gideon Maillette de Buy Wenniger, Thomas van Dongen, Eleri Aedmaa,
Herbert Teun Kruitbosch, Edwin A. Valentijn, and Lambert Schomaker
- Abstract summary: We propose the use of HANs combined with structure-tags which mark the role of sentences in the document.
Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction.
- Score: 4.4641025448898475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training recurrent neural networks on long texts, in particular scholarly
documents, causes problems for learning. While hierarchical attention networks
(HANs) are effective in solving these problems, they still lose important
information about the structure of the text. To tackle these problems, we
propose the use of HANs combined with structure-tags which mark the role of
sentences in the document. Adding tags to sentences, marking them as
corresponding to title, abstract or main body text, yields improvements over
the state-of-the-art for scholarly document quality prediction. The proposed
system is applied to the task of accept/reject prediction on the PeerRead
dataset and compared against a recent BiLSTM-based model and joint
textual+visual model as well as against plain HANs. Compared to plain HANs,
accuracy increases on all three domains. On the computation and language domain
our new model works best overall, and increases accuracy 4.7% over the best
literature result. We also obtain improvements when introducing the tags for
prediction of the number of citations for 88k scientific publications that we
compiled from the Allen AI S2ORC dataset. For our HAN-system with
structure-tags we reach 28.5% explained variance, an improvement of 1.8% over
our reimplementation of the BiLSTM-based model as well as 1.0% improvement over
plain HANs.
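The structure-tag idea in the abstract can be illustrated with a minimal sketch: each sentence is prefixed with a token naming the part of the document it comes from, before being passed to a sentence-level encoder such as a HAN. The tag strings (`<TITLE>`, `<ABSTRACT>`, `<BODY_TEXT>`), the `tag_sentences` helper, and the whitespace tokenization are all assumptions made for illustration; the abstract does not specify the actual tag format or preprocessing.

```python
from typing import Dict, List

# Hypothetical tag tokens; the abstract does not give the exact format.
SECTION_TAGS: Dict[str, str] = {
    "title": "<TITLE>",
    "abstract": "<ABSTRACT>",
    "body": "<BODY_TEXT>",
}


def tag_sentences(document: Dict[str, List[str]]) -> List[List[str]]:
    """Return tokenized sentences, each prefixed with the tag of its section."""
    tagged = []
    for section in ("title", "abstract", "body"):
        for sentence in document.get(section, []):
            # Whitespace tokenization is a simplification for this sketch.
            tagged.append([SECTION_TAGS[section]] + sentence.split())
    return tagged


doc = {
    "title": ["Structure-Tags Improve Text Classification"],
    "abstract": ["We propose HANs combined with structure-tags ."],
    "body": ["Long texts are hard for recurrent networks .", "HANs help ."],
}
tagged = tag_sentences(doc)
for sentence_tokens in tagged:
    print(" ".join(sentence_tokens))
```

Because the tag is an ordinary token at the start of each sentence, a word-level encoder can learn an embedding for it like any other vocabulary item, which is one plausible way the sentence role could reach the model.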
Related papers
- Language Modeling with Editable External Knowledge [90.7714362827356]
This paper introduces ERASE, which improves model behavior when new documents are acquired.
It incrementally deletes or rewrites other entries in the knowledge base each time a document is added.
It improves accuracy relative to conventional retrieval-augmented generation by 7-13% (Mixtral-8x7B) and 6-10% (Llama-3-8B) absolute.
arXiv Detail & Related papers (2024-06-17T17:59:35Z)
- Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z)
- Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models.
We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions.
Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- RGAT: A Deeper Look into Syntactic Dependency Information for Coreference Resolution [8.017036537163008]
We propose an end-to-end resolution model that combines pre-trained BERT with a Syntactic Relation Graph Attention Network (RGAT).
In particular, the RGAT model is first proposed, then used to understand the syntactic dependency graph and learn better task-specific syntactic embeddings.
An integrated architecture incorporating BERT embeddings and syntactic embeddings is constructed to generate blending representations for the downstream task.
arXiv Detail & Related papers (2023-09-10T09:46:38Z)
- Prompt-based Learning for Text Readability Assessment [0.4757470449749875]
We propose the novel adaptation of a pre-trained seq2seq model for readability assessment.
We prove that a seq2seq model can be adapted to discern which of two given texts is more difficult (pairwise).
arXiv Detail & Related papers (2023-02-25T18:39:59Z)
- Text and Code Embeddings by Contrastive Pre-Training [15.099849247795714]
We show that contrastive pre-training on unsupervised data at scale leads to high quality vector representations of text and code.
The same unsupervised text embeddings that achieve new state-of-the-art results in linear-probe classification also display impressive semantic search capabilities.
arXiv Detail & Related papers (2022-01-24T23:36:20Z)
- Meta-Learning Adversarial Domain Adaptation Network for Few-Shot Text Classification [31.167424308211995]
We propose a novel meta-learning framework integrated with an adversarial domain adaptation network.
Our method demonstrates clear superiority over the state-of-the-art models in all the datasets.
In particular, the accuracy of 1-shot and 5-shot classification on the dataset of 20 Newsgroups is boosted from 52.1% to 59.6%.
arXiv Detail & Related papers (2021-07-26T15:09:40Z)
- Incorporating Visual Layout Structures for Scientific Text Classification [31.15058113053433]
We introduce new methods for incorporating VIsual LAyout structures (VILA), e.g., the grouping of page texts into text lines or text blocks, into language models.
We show that the I-VILA approach, which simply adds special tokens denoting boundaries between layout structures into model inputs, can lead to +14.5 F1 Score improvements in token classification tasks.
arXiv Detail & Related papers (2021-06-01T17:59:00Z)
- Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated their promising results on this canonical task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model, hypergraph attention networks (HyperGAT), which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.