A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data
- URL: http://arxiv.org/abs/2106.06738v1
- Date: Sat, 12 Jun 2021 10:45:24 GMT
- Title: A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data
- Authors: Jinghui Lu, Maeve Henchion, Ivan Bacher, Brian Mac Namee
- Abstract summary: This work introduces a long-text-specific model -- the Hierarchical BERT Model (HBM) -- that learns sentence-level features of the text and works well in scenarios with limited data.
Various evaluation experiments have demonstrated that HBM can achieve higher performance in document classification than the previous state-of-the-art methods with only 50 to 200 labelled instances.
- Score: 5.123298347655086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Training deep learning models with limited labelled data is an attractive scenario for many NLP tasks, including document classification. While the recent emergence of BERT allows deep learning language models to achieve reasonably good performance in document classification with few labelled instances, there is little evidence on the utility of applying BERT-like models to long document classification. This work introduces a long-text-specific model -- the Hierarchical BERT Model (HBM) -- that learns sentence-level features of the text and works well in scenarios with limited labelled data. Evaluation experiments demonstrate that HBM achieves higher document classification performance than previous state-of-the-art methods with only 50 to 200 labelled instances, especially when documents are long. As an additional benefit, a user study shows that the salient sentences identified by a trained HBM are useful as explanations for document labelling.
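To make the hierarchical idea concrete, here is a minimal sketch, assuming PyTorch and Hugging Face transformers: each sentence is encoded with BERT, a sentence-level self-attention layer mixes the per-sentence vectors, and a linear head classifies the document. The class `HierarchicalBertSketch` and all hyperparameters are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a sentence-level hierarchical classifier; the
# actual HBM layers and pooling may differ from what is shown here.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HierarchicalBertSketch(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased", num_classes=2):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Sentence-level layer: self-attention over per-sentence vectors.
        self.sentence_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=8, batch_first=True
        )
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, sentences):
        # Token level: one [CLS] vector per sentence.
        tokens = self.tokenizer(
            sentences, padding=True, truncation=True, return_tensors="pt"
        )
        sent_vecs = self.encoder(**tokens).last_hidden_state[:, 0]
        # Document level: attend over the sequence of sentence vectors.
        doc_states = self.sentence_layer(sent_vecs.unsqueeze(0))
        return self.classifier(doc_states.mean(dim=1))  # (1, num_classes)

# Usage: logits for one document, given as a list of its sentences.
model = HierarchicalBertSketch()
logits = model(["The opening act drags.", "The final act is superb."])
```

The sentence-level attention also relates to the salience claim above: the weights over sentence vectors indicate which sentences drive the prediction, which is what the user study evaluated.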
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves retrieval performance competitive with state-of-the-art models.
Our largest 3B model, when paired with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
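To illustrate "subgraph retrieval as conditional generation" from the entry above, the sketch below uses t5-small as a stand-in small seq2seq model; the prompt format and the choice of t5-small are assumptions (the paper's own 220M model and output format are not reproduced), and a model already fine-tuned on question-to-relation-path pairs is presumed.

```python
# Illustrative only: a small seq2seq model decodes a relation path
# conditioned on the question; t5-small is a stand-in, assumed to have
# been fine-tuned on (question -> relation path) pairs beforehand.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

question = "retrieve path: who directed the film that won best picture in 1998?"
inputs = tokenizer(question, return_tensors="pt")
# Beam search keeps several candidate paths for the downstream reader.
outputs = model.generate(**inputs, num_beams=4, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```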
- A Small Claims Court for the NLP: Judging Legal Text Classification Strategies With Small Datasets [0.0]
This paper investigates the best strategies for optimizing the use of a small labeled dataset and large amounts of unlabeled data.
We use records of demands made to a Brazilian Public Prosecutor's Office, aiming to assign each description to one of the subjects.
The best result was obtained with Unsupervised Data Augmentation (UDA), which combines BERT with data augmentation and semi-supervised learning strategies.
arXiv Detail & Related papers (2024-09-09T18:10:05Z)
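The UDA method named in the entry above combines a supervised loss on the few labelled instances with a consistency loss on unlabelled text; a minimal sketch of that objective, assuming a `model` that returns logits and an `augment` function such as back-translation (both hypothetical here):

```python
# Sketch of the UDA-style objective: supervised cross-entropy plus a KL
# consistency term between predictions on an unlabelled batch and its
# augmented version; `model` and `augment` are assumed to exist.
import torch
import torch.nn.functional as F

def uda_loss(model, labelled_x, labelled_y, unlabelled_x, augment, lam=1.0):
    # Standard supervised term on the small labelled set.
    sup = F.cross_entropy(model(labelled_x), labelled_y)
    # Consistency term: the prediction on the original text is the target.
    with torch.no_grad():
        target = F.softmax(model(unlabelled_x), dim=-1)
    log_pred = F.log_softmax(model(augment(unlabelled_x)), dim=-1)
    unsup = F.kl_div(log_pred, target, reduction="batchmean")
    return sup + lam * unsup
```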
- Probing Representations for Document-level Event Extraction [30.523959637364484]
This work is the first to apply the probing paradigm to representations learned for document-level information extraction.
We designed eight embedding probes to analyze surface, semantic, and event-understanding capabilities relevant to document-level event extraction.
We found that trained encoders from these models yield embeddings that can modestly improve argument detection and labeling but only slightly enhance event-level tasks.
arXiv Detail & Related papers (2023-10-23T19:33:04Z)
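A probe in the sense of the entry above is a small classifier trained on frozen embeddings; the sketch below shows a generic linear probe with scikit-learn (the paper's eight specific probes are not reproduced, and `probe_accuracy` is a hypothetical helper):

```python
# Generic linear probe: logistic regression over frozen embeddings. High
# accuracy means the property is linearly recoverable from the embedding,
# not that the encoder uses it downstream.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    probe = LogisticRegression(max_iter=1000)
    return cross_val_score(probe, embeddings, labels, cv=5).mean()
```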
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- Automated Few-shot Classification with Instruction-Finetuned Language Models [76.69064714392165]
We show that AuT-Few outperforms state-of-the-art few-shot learning methods.
We also show that AuT-Few is the best ranking method across datasets on the RAFT few-shot benchmark.
arXiv Detail & Related papers (2023-05-21T21:50:27Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Comparative Study of Long Document Classification [0.0]
We revisit long document classification using standard machine learning approaches.
We benchmark approaches ranging from simple Naive Bayes to complex BERT on six standard text classification datasets.
arXiv Detail & Related papers (2021-11-01T04:51:51Z)
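The simple end of the spectrum benchmarked in the entry above fits in a few lines; here is a sketch of a standard TF-IDF plus Naive Bayes baseline with scikit-learn, using placeholder data rather than the paper's six datasets:

```python
# Standard scikit-learn text-classification baseline: TF-IDF features fed
# into multinomial Naive Bayes; data below is placeholder, not the paper's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["long document text ...", "another training document ..."]
train_labels = [0, 1]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
baseline.fit(train_docs, train_labels)
print(baseline.predict(["an unseen long document ..."]))
```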
- DocNLI: A Large-scale Dataset for Document-level Natural Language Inference [55.868482696821815]
Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems.
This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI.
arXiv Detail & Related papers (2021-06-17T13:02:26Z)
- LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embeddings of scientific documents by pretraining a Transformer language model on citation information.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
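SPECTER's citation-informed signal can be sketched as a triplet margin loss in which a paper's embedding is pulled toward a paper it cites and pushed away from an uncited one; `encode` below is a hypothetical function (e.g. a Transformer run over title plus abstract, returning a 1-D embedding):

```python
# Sketch of a citation-informed triplet objective; `encode` is assumed to
# map a paper to a 1-D embedding tensor (e.g. the [CLS] vector of a
# Transformer run over the paper's title and abstract).
import torch.nn.functional as F

def citation_triplet_loss(encode, paper, cited_paper, uncited_paper, margin=1.0):
    anchor = encode(paper)            # embedding of the query paper
    positive = encode(cited_paper)    # a paper the query paper cites
    negative = encode(uncited_paper)  # a paper it does not cite
    return F.triplet_margin_loss(
        anchor.unsqueeze(0), positive.unsqueeze(0), negative.unsqueeze(0),
        margin=margin,
    )
```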
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.