NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents
- URL: http://arxiv.org/abs/2402.17682v2
- Date: Thu, 13 Jun 2024 10:21:03 GMT
- Title: NextLevelBERT: Masked Language Modeling with Higher-Level Representations for Long Documents
- Authors: Tamara Czinczoll, Christoph Hönes, Maximilian Schall, Gerard de Melo
- Abstract summary: NextLevelBERT is a Masked Language Model operating not on tokens, but on higher-level semantic representations in the form of text embeddings.
We find that next-level Masked Language Modeling is an effective technique to tackle long-document use cases and can outperform much larger embedding models as long as the required level of detail of semantic information is not too fine.
- Score: 17.94934249657174
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While (large) language models have significantly improved in recent years, they still struggle to sensibly process long sequences found, e.g., in books, due to the quadratic scaling of the underlying attention mechanism. To address this, we propose NextLevelBERT, a Masked Language Model operating not on tokens, but on higher-level semantic representations in the form of text embeddings. We pretrain NextLevelBERT to predict the vector representation of entire masked text chunks and evaluate the effectiveness of the resulting document vectors on three types of tasks: 1) Semantic Textual Similarity via zero-shot document embeddings, 2) Long document classification, 3) Multiple-choice question answering. We find that next-level Masked Language Modeling is an effective technique to tackle long-document use cases and can outperform much larger embedding models as long as the required level of detail of semantic information is not too fine. Our models and code are publicly available online.
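As an illustration of the next-level masked language modeling idea described in the abstract, here is a minimal sketch: a document is split into chunks, each chunk is embedded by a frozen text encoder, some chunk vectors are replaced by a learned mask vector, and a small transformer is trained to regress the original vectors. The chunk size, encoder choice ("all-MiniLM-L6-v2"), masking scheme, and MSE objective are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of masked language modeling over chunk embeddings ("next-level" MLM).
# Encoder, masking rate, and loss are assumptions, not the NextLevelBERT setup.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

chunk_encoder = SentenceTransformer("all-MiniLM-L6-v2")     # frozen chunk-level encoder (assumption)
dim = chunk_encoder.get_sentence_embedding_dimension()      # e.g. 384

class NextLevelMLM(nn.Module):
    def __init__(self, dim, n_layers=4, n_heads=8):
        super().__init__()
        self.mask_vec = nn.Parameter(torch.zeros(dim))       # learned [MASK] chunk embedding
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, dim)                       # predicts the original chunk vector

    def forward(self, chunk_embs, mask):
        x = chunk_embs.clone()
        x[mask] = self.mask_vec                               # replace masked chunk vectors
        return self.head(self.backbone(x))

# Toy usage: embed document chunks, mask one chunk, regress its original vector.
document_chunks = ["First chunk of a long book ...", "Second chunk ...", "Third chunk ..."]
embs = torch.tensor(chunk_encoder.encode(document_chunks)).unsqueeze(0)   # (1, n_chunks, dim)
mask = torch.zeros(embs.shape[:2], dtype=torch.bool)
mask[0, 1] = True                                             # mask the second chunk
model = NextLevelMLM(dim)
pred = model(embs, mask)
loss = nn.functional.mse_loss(pred[mask], embs[mask])         # vector-regression objective (assumption)
```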
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
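A minimal sketch of the "retrieval as conditional generation" framing with a small seq2seq model follows; the task prefix, relation-path output format, and use of t5-small are illustrative assumptions, not the paper's training recipe.

```python
# Sketch of framing subgraph retrieval as conditional generation with a small LM.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")         # small (~60M parameter) seq2seq model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

question = "Where was the director of Inception born?"
prompt = f"generate relation path: {question}"                 # hypothetical task prefix
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32, num_beams=4)
relation_path = tokenizer.decode(out[0], skip_special_tokens=True)
# A fine-tuned retriever would emit something like
# "film.film.directed_by -> people.person.place_of_birth",
# which is then expanded into a KG subgraph for the LLM reader.
print(relation_path)
```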
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Segment Any 3D Object with Language [58.471327490684295]
We introduce Segment any 3D Object with LanguagE (SOLE), a semantic- and geometric-aware visual-language learning framework with strong generalizability.
Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder.
Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks.
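As a generic illustration of injecting language semantics into a 3D backbone via multimodal fusion, here is a small cross-attention sketch; the module, dimensions, and residual design are assumptions and not SOLE's actual architecture.

```python
# Generic sketch of cross-attention fusion between point-cloud and text features.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, point_feats, text_feats):
        # point_feats: (B, N_points, dim), text_feats: (B, N_tokens, dim)
        fused, _ = self.attn(query=point_feats, key=text_feats, value=text_feats)
        return self.norm(point_feats + fused)     # residual fusion of language semantics

fusion = CrossModalFusion()
points = torch.randn(2, 1024, 256)                # toy point-cloud features
text = torch.randn(2, 12, 256)                    # toy text-query features
out = fusion(points, text)                        # language-aware point features, same shape
```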
arXiv Detail & Related papers (2024-04-02T17:59:10Z) - Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
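A minimal sketch of this prompt-based classification setup follows; the template, label set, and single demonstration are illustrative, and a small GPT-2 stands in for the instruction-tuned models the paper evaluates.

```python
# Sketch of zero-/few-shot text classification via a natural-language prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")   # stand-in for an instruction-tuned LM

labels = ["sports", "politics", "technology"]
examples = [("The team won the championship last night.", "sports")]
query = "Parliament passed the new budget bill today."

prompt = "Classify each text into one of: " + ", ".join(labels) + ".\n"
for text, label in examples:                             # few-shot demonstrations
    prompt += f"Text: {text}\nLabel: {label}\n"
prompt += f"Text: {query}\nLabel:"

out = generator(prompt, max_new_tokens=3, do_sample=False)
print(out[0]["generated_text"][len(prompt):].strip())    # model's predicted label
```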
arXiv Detail & Related papers (2024-03-26T12:47:39Z) - Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings? [18.968571816913208]
We provide a systematic comparison of methods to produce document-level representations from sentences based on LASER, LaBSE, and Sentence BERT pre-trained multilingual models.
We show that a clever combination of sentence embeddings is usually better than encoding the full document as a single unit.
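The comparison can be sketched in a few lines: build one document vector by encoding the full text as a single unit, and another by combining sentence embeddings. Mean pooling is only one of the combinations the paper studies, and the LaBSE checkpoint here is just one of the models it considers.

```python
# Sketch: document embedding from combined sentence embeddings vs. single-unit encoding.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

document = "First sentence of the document. Second sentence. A third one."
sentences = [s.strip() + "." for s in document.split(".") if s.strip()]   # naive splitter

doc_as_unit = model.encode(document)                  # single-unit encoding (may truncate long docs)
sent_embs = model.encode(sentences)                   # one vector per sentence
doc_from_sents = sent_embs.mean(axis=0)               # simple combination: mean pooling

cos = np.dot(doc_as_unit, doc_from_sents) / (
    np.linalg.norm(doc_as_unit) * np.linalg.norm(doc_from_sents))
print(f"cosine between the two document vectors: {cos:.3f}")
```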
arXiv Detail & Related papers (2023-04-28T12:11:21Z) - Understanding the Effectiveness of Very Large Language Models on Dialog Evaluation [20.18656308749408]
Large language models (LLMs) have been used for generation and can now output human-like text.
This paper investigates how the number of examples in the prompt and the type of example selection used affect the model's performance.
arXiv Detail & Related papers (2023-01-27T22:02:27Z) - Weakly Supervised Text Classification using Supervision Signals from a Language Model [33.5830441120473]
We design a prompt that combines the document itself with "this article is talking about [MASK]".
A masked language model can generate words for the [MASK] token.
The generated words which summarize the content of a document can be utilized as supervision signals.
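This prompt-and-fill step can be sketched directly with a fill-mask pipeline; the example document and the choice of bert-base-uncased are assumptions for illustration.

```python
# Sketch of the described prompt: append "this article is talking about [MASK]"
# and let a masked language model propose label words as weak supervision signals.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

document = "The central bank raised interest rates to curb inflation."
prompt = document + " This article is talking about [MASK]."

for pred in fill_mask(prompt, top_k=5):
    # the top words (e.g. "economics", "money") serve as pseudo-label signals
    print(pred["token_str"], round(pred["score"], 3))
```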
arXiv Detail & Related papers (2022-05-13T12:57:15Z) - Attend, Memorize and Generate: Towards Faithful Table-to-Text Generation in Few Shots [58.404516361586325]
Few-shot table-to-text generation is a task of composing fluent and faithful sentences to convey table content using limited data.
This paper proposes a novel approach, Memorize and Generate (called AMG), inspired by the text generation process of humans.
arXiv Detail & Related papers (2022-03-01T20:37:20Z) - LAWDR: Language-Agnostic Weighted Document Representations from Pre-trained Models [8.745407715423992]
Cross-lingual document representations enable language understanding in multilingual contexts.
Large pre-trained language models such as BERT, XLM and XLM-RoBERTa have achieved great success when fine-tuned on sentence-level downstream tasks.
arXiv Detail & Related papers (2021-06-07T07:14:00Z) - Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
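A small sketch of the lattice construction described above follows: every character plus every lexicon-matched word span becomes a text unit with its own (start, end) positions. The toy lexicon is an assumption, and the real model additionally uses lattice-aware position embeddings and attention masks.

```python
# Sketch of building a character+word lattice for a Chinese sentence.
def build_lattice(sentence, lexicon):
    units = [(ch, i, i) for i, ch in enumerate(sentence)]       # character units
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence)):
            word = sentence[i:j + 1]
            if word in lexicon:                                  # word units from the lexicon
                units.append((word, i, j))
    return units

lexicon = {"研究", "研究生", "生命", "起源"}                       # toy word list (assumption)
for unit, start, end in build_lattice("研究生命起源", lexicon):
    print(unit, start, end)
# All units (characters and overlapping words) are fed jointly into the transformer,
# with span positions encoded instead of a single linear order.
```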
arXiv Detail & Related papers (2021-04-15T02:36:49Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
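One data-efficient adaptation strategy of this kind can be sketched as follows: add target-script tokens to the vocabulary, resize the embedding matrix, and train only the embeddings while the transformer body stays frozen. This is a generic illustration of the idea, not the paper's exact recipe, and the example characters are placeholders.

```python
# Sketch: adapt a multilingual MLM to an unseen script by training new embeddings only.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

new_tokens = ["ൺ", "ൻ", "ർ"]                      # placeholder characters from a low-coverage script
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))      # new rows are randomly initialized

for param in model.parameters():                   # freeze everything ...
    param.requires_grad = False
for param in model.get_input_embeddings().parameters():
    param.requires_grad = True                     # ... except the (new) input embeddings

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")        # MLM training on target-language text follows
```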
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching [28.190001111358438]
We propose the Siamese Multi-depth Transformer-based Hierarchical Encoder (SMITH) for long-form document matching.
Our model contains several innovations to adapt self-attention models for longer text input.
We will open source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
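A siamese, hierarchical "encode sentence blocks, then combine" setup can be sketched as below; the block encoder, pooling, and transformer sizes are illustrative stand-ins rather than SMITH's actual configuration.

```python
# Sketch of hierarchical, siamese long-document matching: encode blocks, combine, compare.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

block_encoder = SentenceTransformer("all-MiniLM-L6-v2")        # sentence-block encoder (stand-in)
dim = block_encoder.get_sentence_embedding_dimension()

doc_layer = nn.TransformerEncoderLayer(dim, nhead=8, dim_feedforward=4 * dim, batch_first=True)
doc_encoder = nn.TransformerEncoder(doc_layer, num_layers=2)   # document-level transformer

def embed_document(blocks):
    block_embs = torch.tensor(block_encoder.encode(blocks)).unsqueeze(0)   # (1, n_blocks, dim)
    doc_states = doc_encoder(block_embs)
    return doc_states.mean(dim=1).squeeze(0)                   # pooled document vector

doc_a = ["First block of document A ...", "Second block of document A ..."]
doc_b = ["First block of document B ...", "Second block of document B ..."]
sim = torch.cosine_similarity(embed_document(doc_a), embed_document(doc_b), dim=0)
print(f"document similarity: {sim.item():.3f}")                # both sides share the same encoders
```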
arXiv Detail & Related papers (2020-04-26T07:04:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.