Beyond Document Page Classification: Design, Datasets, and Challenges
- URL: http://arxiv.org/abs/2308.12896v3
- Date: Tue, 31 Oct 2023 10:35:39 GMT
- Title: Beyond Document Page Classification: Design, Datasets, and Challenges
- Authors: Jordy Van Landeghem, Sanket Biswas, Matthew B. Blaschko,
Marie-Francine Moens
- Abstract summary: This paper highlights the need to bring document classification benchmarking closer to real-world applications.
We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations.
- Score: 32.94494070330065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper highlights the need to bring document classification benchmarking
closer to real-world applications, both in the nature of data tested ($X$:
multi-channel, multi-paged, multi-industry; $Y$: class distributions and label
set variety) and in classification tasks considered ($f$: multi-page document,
page stream, and document bundle classification, ...). We identify the lack of
public multi-page document classification datasets, formalize different
classification tasks arising in application scenarios, and motivate the value
of targeting efficient multi-page document representations. An experimental
study on proposed multi-page document classification datasets demonstrates that
current benchmarks have become irrelevant and need to be updated to evaluate
complete documents, as they naturally occur in practice. This reality check
also calls for more mature evaluation methodologies, covering calibration
evaluation, inference complexity (time-memory), and a range of realistic
distribution shifts (e.g., born-digital vs. scanning noise, shifting page
order). Our study ends on a hopeful note by recommending concrete avenues for
future improvements.
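The three task variants the abstract formalizes ($f$: multi-page document, page stream, and document bundle classification) can be sketched as function signatures. Everything below, the class names and the trivial placeholder scorers, is an illustrative assumption rather than the paper's notation:

```python
from dataclasses import dataclass
from typing import List, Tuple

# Illustrative sketch of the task variants named in the abstract; all names
# and the placeholder "classifiers" are assumptions, not the paper's API.

@dataclass
class Page:
    image: bytes  # rendered page pixels (possibly multi-channel)
    text: str     # OCR output or born-digital text layer

# (1) Multi-page document classification: one label for the whole document.
def classify_document(pages: List[Page], labels: List[str]) -> str:
    # Placeholder: hash the concatenated page text into a label index.
    return labels[hash(" ".join(p.text for p in pages)) % len(labels)]

# (2) Page stream classification: one label per page in an ordered stream,
#     where neighboring-page context may matter.
def classify_page_stream(pages: List[Page], labels: List[str]) -> List[str]:
    return [labels[hash(p.text) % len(labels)] for p in pages]

# (3) Document bundle classification: segment the stream into documents and
#     label each segment (split indices plus per-document labels).
def classify_bundle(pages: List[Page], labels: List[str]) -> Tuple[List[int], List[str]]:
    splits = [0]  # placeholder: treat the whole bundle as a single document
    return splits, [classify_document(pages, labels)]
```

The point of the distinction is the shape of the output: a single label, a label per page, or split points plus per-segment labels; single-page benchmarks evaluate none of these directly.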
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
- Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$2$)
GR$2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z)
- SEAM: A Stochastic Benchmark for Multi-Document Tasks [30.153949809172605]
There is currently no benchmark that measures the abilities of large language models (LLMs) on multi-document tasks.
We present SEAM (a Stochastic Evaluation Approach for Multi-document tasks), a conglomerate benchmark over a diverse set of multi-document datasets.
We find that multi-document tasks pose a significant challenge for LLMs, even for state-of-the-art models with 70B parameters.
arXiv Detail & Related papers (2024-06-23T11:57:53Z)
- Knowledge-Centric Templatic Views of Documents [2.654058995940072]
Authors often share their ideas in various document formats, such as slide decks, newsletters, reports, and posters.
We introduce a novel unified evaluation framework that can be adapted to measuring the quality of document generators.
We conduct a human evaluation, which shows that people prefer documents generated with our method in 82% of cases.
arXiv Detail & Related papers (2024-01-13T01:22:15Z)
- Context-Aware Classification of Legal Document Pages [7.306025535482021]
We present a simple but effective approach that overcomes the constraint on input length.
Specifically, we enhance the input with extra tokens carrying sequential information about previous pages.
Our experiments conducted on two legal datasets in English and Portuguese respectively show that the proposed approach can significantly improve the performance of document page classification.
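The augmentation idea described above, extra tokens carrying sequential information about previous pages, might be sketched as follows; the marker tokens (`[PREV]`, `[PAGE]`) and the snippet length are hypothetical choices, not the paper's specification:

```python
def build_page_input(pages, index, context_words=16):
    """Return the model input for page `index`, prefixed with short snippets
    of earlier pages as lightweight sequential context (illustrative only)."""
    context = " ".join(
        f"[PREV] {' '.join(p.split()[:context_words])}" for p in pages[:index]
    )
    current = pages[index]
    return f"{context} [PAGE] {current}".strip()
```

The enhanced sequence can then be fed to any fixed-length page classifier, so the approach sidesteps the input-length constraint without changing the model itself.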
arXiv Detail & Related papers (2023-04-05T23:14:58Z)
- SciRepEval: A Multi-Format Benchmark for Scientific Document Representations [52.01865318382197]
We introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations.
We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats.
A new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance.
arXiv Detail & Related papers (2022-11-23T21:25:39Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- Efficient Classification of Long Documents Using Transformers [13.927622630633344]
We evaluate the relative efficacy measured against various baselines and diverse datasets.
Results show that more complex models often fail to outperform simple baselines and yield inconsistent performance across datasets.
arXiv Detail & Related papers (2022-03-21T18:36:18Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Modeling Endorsement for Multi-Document Abstractive Summarization [10.166639983949887]
A crucial difference between single- and multi-document summarization is how salient content manifests itself in the document(s).
In this paper, we model the cross-document endorsement effect and its utilization in multiple document summarization.
Our method generates a synopsis from each document, which serves as an endorser to identify salient content from other documents.
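The cross-document endorsement effect described above can be illustrated with a crude word-overlap proxy; the scoring function and its details are assumptions for illustration, not the paper's actual method:

```python
def endorsement_score(sentence, synopses, own_index):
    """Score how strongly other documents' synopses 'endorse' a sentence,
    using raw word overlap as a deliberately simple stand-in (illustrative
    only; the actual method is more sophisticated)."""
    words = set(sentence.lower().split())
    score = 0
    for j, synopsis in enumerate(synopses):
        if j == own_index:
            continue  # a document does not endorse its own content
        score += len(words & set(synopsis.lower().split()))
    return score
```

Sentences that many other documents' synopses endorse are treated as salient, which is precisely the signal that is absent in single-document summarization.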
arXiv Detail & Related papers (2021-10-15T03:55:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.