A Sentence-level Hierarchical BERT Model for Document Classification
with Limited Labelled Data
- URL: http://arxiv.org/abs/2106.06738v1
- Date: Sat, 12 Jun 2021 10:45:24 GMT
- Authors: Jinghui Lu, Maeve Henchion, Ivan Bacher, Brian Mac Namee
- Abstract summary: This work introduces a long-text-specific model -- the Hierarchical BERT Model (HBM) -- that learns sentence-level features of the text and works well in scenarios with limited data.
Various evaluation experiments have demonstrated that HBM can achieve higher performance in document classification than the previous state-of-the-art methods with only 50 to 200 labelled instances.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training deep learning models with limited labelled data is an attractive
scenario for many NLP tasks, including document classification. Since the
emergence of BERT, deep learning language models can achieve reasonably
good performance in document classification with few labelled instances, but
there is little evidence for the utility of applying BERT-like models to long
document classification. This work introduces a long-text-specific model -- the
Hierarchical BERT Model (HBM) -- that learns sentence-level features of the
text and works well in scenarios with limited labelled data. Evaluation
experiments demonstrate that HBM can achieve higher performance in
document classification than the previous state-of-the-art methods with only 50
to 200 labelled instances, especially when documents are long. As an
extra benefit, a user study indicates that the salient sentences identified by
a trained HBM are useful as explanations for labelling documents.
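The abstract does not describe the HBM architecture in detail, so the following is only a generic sketch of the idea it mentions: combine sentence-level features into a document representation and read off which sentences mattered most. It uses plain attention pooling, which is an assumption and not necessarily the authors' method. The hand-written `sentence_vecs` stand in for sentence embeddings that a BERT encoder would normally produce, and `query` is a hypothetical learned parameter.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    sentence_vecs: List[List[float]], query: List[float]
) -> Tuple[List[float], List[float]]:
    """Weight each sentence vector by its dot product with `query`.

    Returns the pooled document vector and the per-sentence weights.
    """
    scores = [sum(v * q for v, q in zip(vec, query)) for vec in sentence_vecs]
    weights = softmax(scores)
    dim = len(query)
    doc_vec = [
        sum(w * vec[d] for w, vec in zip(weights, sentence_vecs))
        for d in range(dim)
    ]
    return doc_vec, weights


def salient_sentences(
    sentences: List[str], weights: List[float], top_k: int = 1
) -> List[str]:
    """Return the `top_k` sentences with the largest attention weights."""
    ranked = sorted(zip(sentences, weights), key=lambda p: p[1], reverse=True)
    return [s for s, _ in ranked[:top_k]]


# Toy inputs (illustrative only; real vectors would come from a sentence encoder).
sentences = [
    "The match ended two to one.",
    "Tickets were sold at the gate.",
    "The striker scored twice.",
]
sentence_vecs = [[0.9, 0.1], [0.1, 0.2], [1.0, 0.0]]
query = [2.0, 0.0]  # hypothetical learned attention query

doc_vec, weights = attention_pool(sentence_vecs, query)
top = salient_sentences(sentences, weights, top_k=2)
```

In a setup like this, `doc_vec` would feed a classifier head, and the sentences in `top` could be shown to an annotator as a rough explanation of the prediction.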
Related papers
- Probing Representations for Document-level Event Extraction [30.523959637364484]
This work is the first to apply the probing paradigm to representations learned for document-level information extraction.
We designed eight embedding probes to analyze surface, semantic, and event-understanding capabilities relevant to document-level event extraction.
We found that trained encoders from these models yield embeddings that modestly improve argument detection and labeling but only slightly enhance event-level tasks.
arXiv Detail & Related papers (2023-10-23T19:33:04Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- Automated Few-shot Classification with Instruction-Finetuned Language Models [76.69064714392165]
We show that AuT-Few outperforms state-of-the-art few-shot learning methods.
We also show that AuT-Few is the best ranking method across datasets on the RAFT few-shot benchmark.
arXiv Detail & Related papers (2023-05-21T21:50:27Z)
- Context-Aware Classification of Legal Document Pages [7.306025535482021]
We present a simple but effective approach that overcomes the constraint on input length.
Specifically, we enhance the input with extra tokens carrying sequential information about previous pages.
Our experiments on two legal datasets, one in English and one in Portuguese, show that the proposed approach can significantly improve the performance of document page classification.
arXiv Detail & Related papers (2023-04-05T23:14:58Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
- Comparative Study of Long Document Classification [0.0]
We revisit long document classification using standard machine learning approaches.
We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.
arXiv Detail & Related papers (2021-11-01T04:51:51Z)
- DocNLI: A Large-scale Dataset for Document-level Natural Language Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z)
- LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)