Leveraging BERT Language Model for Arabic Long Document Classification
- URL: http://arxiv.org/abs/2305.03519v1
- Date: Thu, 4 May 2023 13:56:32 GMT
- Title: Leveraging BERT Language Model for Arabic Long Document Classification
- Authors: Muhammad AL-Qurishi
- Abstract summary: We propose two models to classify long Arabic documents.
Both of our models outperform Longformer and RoBERT on this task across two different datasets.
- Score: 0.47138177023764655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the number of Arabic speakers worldwide and the notably large amount of
content on the web today in fields such as law, medicine, and news, documents of
considerable length are produced regularly. Classifying such documents with traditional
learning models is often impractical, since the extended length of the documents raises
computational requirements to an unsustainable level. It is therefore necessary to
customize these models specifically for long textual documents. In this paper, we propose
two simple but effective models to classify long Arabic documents. We also fine-tune two
existing models, namely Longformer and RoBERT, on the same task and compare their results
to ours. Both of our models outperform Longformer and RoBERT on this task across two
different datasets.
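The abstract does not describe the two proposed models, so the sketch below is only a generic illustration of the standard workaround for fixed-length encoders: split a long document into 512-token chunks, encode each chunk with a BERT-style model, and pool the chunk vectors before a linear classifier. The AraBERT checkpoint name, the mean-pooling choice, and the label count are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical chunk-and-pool classifier for long Arabic documents.
# Checkpoint name, pooling strategy, and num_labels are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ChunkPoolClassifier(nn.Module):
    def __init__(self, encoder_name: str, num_labels: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_chunks, 512) -- one long document split into chunks
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        chunk_vecs = out.last_hidden_state[:, 0]  # [CLS] vector per chunk
        doc_vec = chunk_vecs.mean(dim=0)          # mean-pool chunks into one vector
        return self.classifier(doc_vec)           # (num_labels,) logits

name = "aubmindlab/bert-base-arabertv2"  # an Arabic BERT; any BERT-style model works
tokenizer = AutoTokenizer.from_pretrained(name)
model = ChunkPoolClassifier(name, num_labels=5)

text = "..."  # a long Arabic document
enc = tokenizer(text, truncation=True, max_length=512, padding="max_length",
                return_overflowing_tokens=True, return_tensors="pt")
logits = model(enc["input_ids"], enc["attention_mask"])
```

Mean pooling is the simplest possible aggregation; running a recurrent layer or a small transformer over the chunk vectors instead is the usual next step for this kind of hierarchical setup.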
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z)
- Language Resources for Dutch Large Language Modelling [0.0]
We introduce two fine-tuned variants of the Llama 2 13B model.
We provide a leaderboard to keep track of the performance of (Dutch) models on a number of generation tasks.
arXiv Detail & Related papers (2023-12-20T09:06:06Z)
- WanJuan: A Comprehensive Multimodal Dataset for Advancing English and Chinese Large Models [69.96148259273065]
"Wan Juan" is a large-scale multimodal dataset composed of both Chinese and English data, collected from a wide range of web sources.
It was utilized in the training of InternLM, a model that demonstrated significant advantages in multi-dimensional evaluations when compared to models of a similar scale.
arXiv Detail & Related papers (2023-08-21T14:40:48Z)
- HeRo: RoBERTa and Longformer Hebrew Language Models [0.0]
We provide HeRo, a state-of-the-art pre-trained language model for standard-length inputs, and LongHeRo, an efficient transformer for long input sequences.
The HeRo model was evaluated on sentiment analysis, named entity recognition, and question answering tasks.
The LongHeRo model was evaluated on the document classification task with a dataset composed of long documents.
arXiv Detail & Related papers (2023-04-18T05:56:32Z)
- Speculative Decoding with Big Little Decoder [108.95187338417541]
Big Little Decoder (BiLD) is a framework that can improve inference efficiency and latency for a wide range of text generation applications.
On an NVIDIA T4 GPU, our framework achieves up to a 2.12x speedup with minimal degradation in generation quality.
Our framework is fully plug-and-play and can be applied without any modifications to the training process or model architecture.
arXiv Detail & Related papers (2023-02-15T18:55:29Z)
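The entry above names the technique but not its mechanics, so here is a rough sketch of the general draft-and-verify idea behind big/little decoding (not BiLD's exact fallback and rollback policy): a small model proposes k tokens greedily, and a larger model scores them in a single forward pass, keeping the agreed prefix plus its own token at the first disagreement. The GPT-2 checkpoints and k=4 are arbitrary stand-ins.

```python
# Hypothetical draft-and-verify decoding loop; models and k are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
small = AutoModelForCausalLM.from_pretrained("gpt2")         # cheap draft model
large = AutoModelForCausalLM.from_pretrained("gpt2-medium")  # accurate verifier

@torch.no_grad()
def speculative_generate(prompt: str, max_new_tokens: int = 40, k: int = 4) -> str:
    ids = tok(prompt, return_tensors="pt").input_ids
    start = ids.shape[1]
    while ids.shape[1] - start < max_new_tokens:
        # 1) Draft k tokens greedily with the small model.
        draft = small.generate(ids, max_new_tokens=k, do_sample=False,
                               pad_token_id=tok.eos_token_id)
        # 2) Score the whole drafted sequence with the large model in one pass.
        logits = large(draft).logits
        # The large model's greedy choice at each drafted position.
        verify = logits[0, ids.shape[1] - 1:-1].argmax(-1)
        proposed = draft[0, ids.shape[1]:]
        # 3) Accept the longest prefix on which both models agree...
        n_ok = int((verify == proposed).long().cumprod(0).sum())
        # ...then append the large model's token at the first mismatch (if any).
        keep = torch.cat([proposed[:n_ok], verify[n_ok:n_ok + 1]])
        ids = torch.cat([ids, keep.unsqueeze(0)], dim=1)
    return tok.decode(ids[0, start:])

print(speculative_generate("Long documents in Arabic"))
```

Because the verifier scores all k drafted tokens in one forward pass, the large model runs far fewer times than in ordinary token-by-token decoding, which is where the reported latency gains come from.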
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Longtonotes: OntoNotes with Longer Coreference Chains [111.73115731999793]
We build a corpus of coreference-annotated documents of significantly longer length than what is currently available.
The resulting corpus, which we call LongtoNotes, contains documents in multiple genres of the English language with varying lengths.
We evaluate state-of-the-art neural coreference systems on this new corpus.
arXiv Detail & Related papers (2022-10-07T15:58:41Z)
- LOT: A Benchmark for Evaluating Chinese Long Text Understanding and Generation [49.57366550980932]
Long text modeling requires many capabilities such as modeling long-range commonsense and discourse relations.
We propose LOT, a benchmark including two understanding and two generation tasks for Chinese long text modeling evaluation.
We release an encoder-decoder Chinese long text pretraining model named LongLM with up to 1 billion parameters.
arXiv Detail & Related papers (2021-08-30T02:38:32Z)
- LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z)
- Introducing various Semantic Models for Amharic: Experimentation and Evaluation with multiple Tasks and Datasets [19.855120632909124]
We introduce different semantic models for Amharic.
Models are built using word2vec embeddings, a distributional thesaurus (DT), contextual embeddings, and DT embeddings.
We find that newly trained models perform better than pre-trained multilingual models.
arXiv Detail & Related papers (2020-11-02T17:48:25Z)
- Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching [28.190001111358438]
We propose the Siamese Multi-depth Transformer-based Hierarchical encoder (SMITH) for long-form document matching.
Our model contains several innovations to adapt self-attention models to longer text input.
We will open-source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
arXiv Detail & Related papers (2020-04-26T07:04:08Z)
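The SMITH entry above does not detail its architecture, so what follows is only a minimal two-level sketch of hierarchical encoding in that spirit: a block-level transformer attends within fixed-size token blocks, and a lightweight document-level transformer attends across the resulting block vectors. All sizes, depths, and pooling choices are invented for illustration.

```python
# Hypothetical two-level hierarchical encoder in the spirit of SMITH.
import torch
import torch.nn as nn

class HierarchicalDocEncoder(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Level 1: self-attention *within* each fixed-size token block.
        self.block_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2)
        # Level 2: self-attention *across* the block vectors.
        self.doc_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, token_ids):           # token_ids: (num_blocks, block_len)
        x = self.embed(token_ids)           # (num_blocks, block_len, d_model)
        x = self.block_encoder(x)           # each block is encoded independently
        block_vecs = x.mean(dim=1)          # one vector per block
        doc = self.doc_encoder(block_vecs.unsqueeze(0))  # contextualize blocks
        return doc.mean(dim=1).squeeze(0)   # a single document embedding

enc = HierarchicalDocEncoder()
doc_a = enc(torch.randint(0, 30000, (10, 32)))  # document: 10 blocks of 32 tokens
doc_b = enc(torch.randint(0, 30000, (14, 32)))
score = torch.cosine_similarity(doc_a, doc_b, dim=0)  # Siamese-style matching score
```

Since self-attention is quadratic in sequence length, attending within blocks and then across block vectors costs far less than full attention over the concatenated document, which is what makes inputs beyond 512 tokens tractable.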
This list is automatically generated from the titles and abstracts of the papers on this site.