Related papers: Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction

URL: http://arxiv.org/abs/2005.00129v2
Date: Thu, 17 Dec 2020 20:35:14 GMT
Title: Structure-Tags Improve Text Classification for Scholarly Document Quality Prediction
Authors: Gideon Maillette de Buy Wenniger, Thomas van Dongen, Eleri Aedmaa, Herbert Teun Kruitbosch, Edwin A. Valentijn, and Lambert Schomaker
Abstract summary: We propose the use of HANs combined with structure-tags which mark the role of sentences in the document. Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction.
Score: 4.4641025448898475
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Training recurrent neural networks on long texts, in particular scholarly documents, causes problems for learning. While hierarchical attention networks (HANs) are effective in solving these problems, they still lose important information about the structure of the text. To tackle these problems, we propose the use of HANs combined with structure-tags which mark the role of sentences in the document. Adding tags to sentences, marking them as corresponding to title, abstract or main body text, yields improvements over the state-of-the-art for scholarly document quality prediction. The proposed system is applied to the task of accept/reject prediction on the PeerRead dataset and compared against a recent BiLSTM-based model and joint textual+visual model as well as against plain HANs. Compared to plain HANs, accuracy increases on all three domains. On the computation and language domain our new model works best overall, and increases accuracy by 4.7% over the best literature result. We also obtain improvements when introducing the tags for prediction of the number of citations for 88k scientific publications that we compiled from the Allen AI S2ORC dataset. For our HAN-system with structure-tags we reach 28.5% explained variance, an improvement of 1.8% over our reimplementation of the BiLSTM-based model as well as a 1.0% improvement over plain HANs.
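
The structure-tag idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag strings in `TAGS`, the function name `tag_sentences`, and the choice to prepend the tag as a leading token are all assumptions made for illustration. The abstract only states that sentences are marked as title, abstract, or main body text.

```python
from typing import Dict, List

# Hypothetical tag tokens; the paper's actual tag vocabulary is not given here.
TAGS = {"title": "<TITLE>", "abstract": "<ABSTRACT>", "body": "<BODY>"}


def tag_sentences(sections: Dict[str, List[str]]) -> List[str]:
    """Prepend a role tag to every sentence, in title/abstract/body order.

    `sections` maps a role name to the list of sentences with that role.
    Roles missing from the input are skipped.
    """
    tagged: List[str] = []
    for role in ("title", "abstract", "body"):
        for sentence in sections.get(role, []):
            # Assumption: the tag is added as a leading token of the sentence.
            tagged.append(f"{TAGS[role]} {sentence}")
    return tagged


if __name__ == "__main__":
    doc = {
        "title": ["Structure-Tags Improve Text Classification"],
        "abstract": ["We propose structure-tags.", "They mark sentence roles."],
        "body": ["Long documents are hard for recurrent models."],
    }
    for line in tag_sentences(doc):
        print(line)
```

The tagged sentences would then be fed to a sentence-level encoder such as a HAN, so that the role of each sentence is visible to the model alongside its words.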

Related papers

Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
Improving Embedding Accuracy for Document Retrieval Using Entity Relationship Maps and Model-Aware Contrastive Sampling [0.0]
APEX-Embedding-7B is a 7-billion parameter decoder-only text feature extraction model. Our approach employs two training techniques that yield an emergent improvement in factual focus. Based on our evaluations, our model establishes a new state-of-the-art standard in text feature extraction for longer context document retrieval tasks.
arXiv Detail & Related papers (2024-10-08T17:36:48Z)
Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored. We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z)
A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media [1.053698976085779]
This study introduces an innovative automatic labeling framework to address the challenges of lexical normalization in social media texts. We propose a framework that integrates semi-supervised learning with weak supervision techniques. Our framework automatically labels raw data, converting non-standard vocabulary into standardized forms.
arXiv Detail & Related papers (2024-09-30T16:26:40Z)
Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification [3.9889306957591755]
We propose a novel framework to boost deep learning models' performance given augmented data/samples in text classification tasks. We propose novel weight-dependent enqueue and dequeue algorithms to utilize augmented samples' weight/quality information effectively. Our framework achieves an average of 1.6%, up to 4.3% absolute improvement on Text-CNN encoders and an average of 1.4%, up to 4.4% absolute improvement on RoBERTa-base encoders.
arXiv Detail & Related papers (2024-09-26T02:19:13Z)
Language Modeling with Editable External Knowledge [90.7714362827356]
This paper introduces ERASE, which improves model behavior when new documents are acquired. It incrementally deletes or rewrites other entries in the knowledge base each time a document is added. It improves accuracy relative to conventional retrieval-augmented generation by 7-13% (Mixtral-8x7B) and 6-10% (Llama-3-8B) absolute.
arXiv Detail & Related papers (2024-06-17T17:59:35Z)
Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models. We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions. Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents. Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
Meta-Learning Adversarial Domain Adaptation Network for Few-Shot Text Classification [31.167424308211995]
We propose a novel meta-learning framework integrated with an adversarial domain adaptation network. Our method demonstrates clear superiority over the state-of-the-art models in all the datasets. In particular, the accuracy of 1-shot and 5-shot classification on the dataset of 20 Newsgroups is boosted from 52.1% to 59.6%.
arXiv Detail & Related papers (2021-07-26T15:09:40Z)
Incorporating Visual Layout Structures for Scientific Text Classification [31.15058113053433]
We introduce new methods for incorporating VIsual LAyout structures (VILA), e.g., the grouping of page texts into text lines or text blocks, into language models. We show that the I-VILA approach, which simply adds special tokens denoting boundaries between layout structures into model inputs, can lead to +14.5 F1 Score improvements in token classification tasks.
arXiv Detail & Related papers (2021-06-01T17:59:00Z)
Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve Micro F1-score by 7% over current state-of-the-art benchmarks. We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)
Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and have shown promising results on text classification. Despite this success, their performance can be largely jeopardized in practice since they are unable to capture high-order interactions between words. We propose a principled model, hypergraph attention networks (HyperGAT), which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.