Leveraging BERT Language Model for Arabic Long Document Classification
- URL: http://arxiv.org/abs/2305.03519v1
- Date: Thu, 4 May 2023 13:56:32 GMT
- Title: Leveraging BERT Language Model for Arabic Long Document Classification
- Authors: Muhammad AL-Qurishi
- Abstract summary: We propose two models to classify long Arabic documents.
Both of our models outperform Longformer and RoBERT on this task across two different datasets.
- Score: 0.47138177023764655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the number of Arabic speakers worldwide and the notably large amount of
content on the web today in fields such as law, medicine, and news, documents of
considerable length are produced regularly. Classifying such documents with traditional
learning models is often impractical, since the extended length of the documents raises
computational requirements to an unsustainable level. It is therefore necessary to
customize these models specifically for long textual documents. In this paper, we propose
two simple but effective models to classify long Arabic documents. We also fine-tune two
existing models, namely Longformer and RoBERT, on the same task and compare their results
to ours. Both of our models outperform Longformer and RoBERT on this task across two
different datasets.
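The abstract does not describe the two proposed models, so the sketch below is only a generic illustration of the standard workaround for fixed-length encoders: split a long document into 512-token chunks, encode each chunk with a BERT-style model, and pool the chunk vectors before a linear classifier. The AraBERT checkpoint name, the mean-pooling choice, and the label count are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical chunk-and-pool classifier for long Arabic documents.
# Checkpoint name, pooling strategy, and num_labels are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ChunkPoolClassifier(nn.Module):
    def __init__(self, encoder_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_chunks, 512) -- one long document split into chunks
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        chunk_vecs = out.last_hidden_state[:, 0]  # [CLS] vector per chunk
        doc_vec = chunk_vecs.mean(dim=0)          # mean-pool chunks into one vector
        return self.classifier(doc_vec)           # (num_labels,) logits

name = "aubmindlab/bert-base-arabertv2"  # an Arabic BERT; any BERT-style model works
tokenizer = AutoTokenizer.from_pretrained(name)
model = ChunkPoolClassifier(name, num_labels=5)

text = "..."  # a long Arabic document
enc = tokenizer(text, truncation=True, max_length=512, padding="max_length",
                return_overflowing_tokens=True, return_tensors="pt")
logits = model(enc["input_ids"], enc["attention_mask"])
```

Mean pooling is the simplest possible aggregation; running a recurrent layer or a small transformer over the chunk vectors instead is the usual next step for this kind of hierarchical setup.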
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- Language Resources for Dutch Large Language Modelling [0.0]
We introduce two fine-tuned variants of the Llama 2 13B model.
We provide a leaderboard to keep track of the performance of (Dutch) models on a number of generation tasks.
arXiv Detail & Related papers (2023-12-20T09:06:06Z)
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models [69.96148259273065]
"Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources.
It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
arXiv Detail & Related papers (2023-08-21T14:40:48Z)
- HeRo: RoBERTa and Longformer Hebrew Language Models [0.0]
We provide HeRo, a state-of-the-art pre-trained language model for standard-length inputs, and LongHeRo, an efficient transformer for long input sequences.
The HeRo model was evaluated on sentiment analysis, named entity recognition, and question answering tasks.
The LongHeRo model was evaluated on the document classification task with a dataset composed of long documents.
arXiv Detail & Related papers (2023-04-18T05:56:32Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves up to a 2.12x speedup with minimal degradation in generation quality.
Our framework is fully plug-and-play and can be applied without any modifications to the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
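The entry above names the technique but not its mechanics, so here is a rough sketch of the general draft-and-verify idea behind big/little decoding (not BiLD's exact fallback and rollback policy): a small model proposes k tokens greedily, and a larger model scores them in a single forward pass, keeping the agreed prefix plus its own token at the first disagreement. The GPT-2 checkpoints and k=4 are arbitrary stand-ins.

```python
# Hypothetical draft-and-verify decoding loop; models and k are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
small = AutoModelForCausalLM.from_pretrained("gpt2")         # cheap draft model
large = AutoModelForCausalLM.from_pretrained("gpt2-medium")  # accurate verifier

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 40, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        # 1) Draft k tokens greedily with the small model.
        draft = small.generate(ids, max_new_tokens=k, do_sample=False,
                               pad_token_id=tok.eos_token_id)
        # 2) Score the whole drafted sequence with the large model in one pass.
        logits = large(draft).logits
        # The large model's greedy choice at each drafted position.
        verify = logits[0, ids.shape[1] - 1:-1].argmax(-1)
        proposed = draft[0, ids.shape[1]:]
        # 3) Accept the longest prefix on which both models agree...
        n_ok = int((verify == proposed).long().cumprod(0).sum())
        # ...then append the large model's token at the first mismatch (if any).
        keep = torch.cat([proposed[:n_ok], verify[n_ok:n_ok + 1]])
        ids = torch.cat([ids, keep.unsqueeze(0)], dim=1)
    return tok.decode(ids[0, start:])

print(speculative_generate("Long documents in Arabic"))
```

Because the verifier scores all k drafted tokens in one forward pass, the large model runs far fewer times than in ordinary token-by-token decoding, which is where the reported latency gains come from.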
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Longtonotes: OntoNotes with Longer Coreference Chains [111.73115731999793]
We build a corpus of coreference-annotated documents of significantly longer length than what is currently available.
The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths.
We evaluate state-of-the-art neural coreference systems on this new corpus.
arXiv Detail & Related papers (2022-10-07T15:58:41Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
- LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z)
- Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets [19.855120632909124]
We introduce different semantic models for Amharic.
Models are built using word2vec embeddings, a distributional thesaurus (DT), contextual embeddings, and DT embeddings.
We find that newly trained models perform better than pre-trained multilingual models.
arXiv Detail & Related papers (2020-11-02T17:48:25Z)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching [28.190001111358438]
We propose the Siamese Multi-depth Transformer-based Hierarchical encoder (SMITH) for long-form document matching.
Our model contains several innovations to adapt self-attention models to longer text input.
We will open-source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
arXiv Detail & Related papers (2020-04-26T07:04:08Z)
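The SMITH entry above does not detail its architecture, so what follows is only a minimal two-level sketch of hierarchical encoding in that spirit: a block-level transformer attends within fixed-size token blocks, and a lightweight document-level transformer attends across the resulting block vectors. All sizes, depths, and pooling choices are invented for illustration.

```python
# Hypothetical two-level hierarchical encoder in the spirit of SMITH.
import torch
import torch.nn as nn

class HierarchicalDocEncoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Level 1: self-attention *within* each fixed-size token block.
        self.block_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Level 2: self-attention *across* the block vectors.
        self.doc_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, token_ids):           # token_ids: (num_blocks, block_len)
        x = self.embed(token_ids)           # (num_blocks, block_len, d_model)
        x = self.block_encoder(x)           # each block is encoded independently
        block_vecs = x.mean(dim=1)          # one vector per block
        doc = self.doc_encoder(block_vecs.unsqueeze(0))  # contextualize blocks
        return doc.mean(dim=1).squeeze(0)   # a single document embedding

enc = HierarchicalDocEncoder()
doc_a = enc(torch.randint(0, 30000, (10, 32)))  # document: 10 blocks of 32 tokens
doc_b = enc(torch.randint(0, 30000, (14, 32)))
score = torch.cosine_similarity(doc_a, doc_b, dim=0)  # Siamese-style matching score
```

Since self-attention is quadratic in sequence length, attending within blocks and then across block vectors costs far less than full attention over the concatenated document, which is what makes inputs beyond 512 tokens tractable.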
This list is automatically generated from the titles and abstracts of the papers on this site.