MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
        - URL: http://arxiv.org/abs/2503.09600v2
 - Date: Mon, 26 May 2025 12:24:56 GMT
 - Title: MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
 - Authors: Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
 - Abstract summary: This paper introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness. We highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances. We devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism.
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
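The abstract's key mechanism is concrete enough to sketch: the chunker emits a structured list of chunking regular expressions, which are then applied to the source text to extract chunks. Below is a minimal Python sketch of that extraction step only; the `patterns` list is hand-written here, whereas in MoC it would be generated by the LLM-based chunkers.

```python
import re

def extract_chunks(text: str, patterns: list[str]) -> list[str]:
    """Apply a list of chunking regexes in order; each match becomes one chunk.

    `patterns` stands in for the structured regex list that MoC's
    chunkers would generate; here it is hand-written for illustration.
    """
    chunks, remainder = [], text
    for pattern in patterns:
        match = re.search(pattern, remainder, flags=re.DOTALL)
        if match is None:
            continue
        chunks.append(match.group(0).strip())
        # Drop the matched prefix so later patterns only see the tail.
        remainder = remainder[match.end():]
    if remainder.strip():
        chunks.append(remainder.strip())  # keep any unmatched tail
    return chunks

text = (
    "1. Introduction. RAG pipelines retrieve text chunks. "
    "2. Method. We score chunk boundaries. "
    "3. Results. Retrieval quality improves."
)
# Hypothetical patterns: one numbered section per chunk.
patterns = [r"1\..*?(?=2\.)", r"2\..*?(?=3\.)", r"3\..*"]
print(extract_chunks(text, patterns))
```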
 
       
      
        Related papers
- AI4Contracts: LLM & RAG-Powered Encoding of Financial Derivative Contracts [1.3060230641655135]
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) are reshaping how AI systems extract and organize information from unstructured text. We introduce CDMizer, a template-driven, LLM- and RAG-based framework for structured text transformation.
arXiv  Detail & Related papers  (2025-06-01T16:05:00Z)
- Document Valuation in LLM Summaries: A Cluster Shapley Approach [0.0]
Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources. We propose using Shapley values, a game-theoretic method that allocates credit based on each document's marginal contribution. Because exact Shapley computation scales exponentially with the number of documents, we propose Cluster Shapley, an efficient approximation algorithm that leverages semantic similarity between documents.
arXiv  Detail & Related papers  (2025-05-28T15:14:21Z)
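As background for the entry above, here is a minimal sketch of exact Shapley credit over a small document set; `toy_utility` is a hypothetical stand-in for a real summary-quality measure, and the factorial cost it illustrates is what motivates approximations such as Cluster Shapley.

```python
from itertools import permutations

def shapley_values(docs, utility):
    """Exact Shapley credit for each document's marginal contribution.

    `utility` maps a frozenset of documents to a summary-quality score.
    Exact enumeration is O(n!), which is why cluster-based
    approximations are needed at realistic scale.
    """
    values = {d: 0.0 for d in docs}
    perms = list(permutations(docs))
    for order in perms:
        coalition = frozenset()
        for d in order:
            values[d] += utility(coalition | {d}) - utility(coalition)
            coalition = coalition | {d}
    return {d: v / len(perms) for d, v in values.items()}

# Toy utility: doc A carries most of the signal, B and C overlap.
def toy_utility(coalition):
    score = 0.0
    if "A" in coalition:
        score += 0.7
    if "B" in coalition or "C" in coalition:
        score += 0.3  # B and C are redundant with each other
    return score

print(shapley_values(["A", "B", "C"], toy_utility))
# {'A': 0.7, 'B': 0.15, 'C': 0.15} -- redundant docs split their credit.
```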
- RALLRec+: Retrieval Augmented Large Language Model Recommendation with Reasoning [22.495874056980824]
We propose Representation learning and Reasoning empowered retrieval-Augmented Large Language model Recommendation (RALLRec+).
arXiv  Detail & Related papers  (2025-03-26T11:03:34Z)
- Efficient Knowledge Feeding to Language Models: A Novel Integrated Encoder-Decoder Architecture [0.0]
ICV recasts in-context learning by using latent embeddings of language models. ICV directly integrates information into the model, enabling it to process this information more effectively.
arXiv  Detail & Related papers  (2025-02-07T04:24:07Z)
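The summary above describes injecting latent embeddings directly into the model rather than prepending demonstrations as text. A schematic sketch of that idea, with random toy vectors standing in for a transformer's hidden states (the paper's actual construction may differ):

```python
import numpy as np

# Schematic: derive a steering vector from paired demonstration states
# and inject it at inference time, instead of spending context tokens on
# demonstrations. Hidden states here are toy vectors; in a real model
# they would be read from a transformer layer via forward hooks.
rng = np.random.default_rng(0)
dim = 16

# Hidden states for (input, desired-output) demonstration pairs.
demo_inputs = rng.normal(size=(4, dim))
demo_targets = demo_inputs + 0.5  # pretend the task shifts states by +0.5

# The in-context vector: mean difference between target and input states.
icv = (demo_targets - demo_inputs).mean(axis=0)

def apply_icv(hidden_state, strength=1.0):
    """Add the scaled in-context vector to a layer's hidden state."""
    return hidden_state + strength * icv

query_state = rng.normal(size=dim)
steered = apply_icv(query_state)
print(np.round(steered - query_state, 3))  # the 0.5 shift per dimension
```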
- Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs). Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
arXiv  Detail & Related papers  (2024-12-22T21:56:15Z)
- Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens core context while diminishing redundancy during the learning process. Our method is able to replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv  Detail & Related papers  (2024-12-17T01:54:08Z)
- Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv  Detail & Related papers  (2024-11-07T10:31:31Z)
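The sliding-window strategy mentioned above is easy to sketch: rerank a small window at a time, moving from the tail of the candidate list toward the head, so strong candidates bubble upward. `score_window` below is a hypothetical stand-in for the LLM's listwise ranking call.

```python
def rerank_sliding_window(candidates, score_window, window=4, stride=2):
    """Listwise reranking over a sliding window.

    A full listwise pass over hundreds of candidates will not fit in one
    prompt, hence the window. Back-to-front passes bubble strong
    candidates toward the front of the list.
    """
    docs = list(candidates)
    start = max(len(docs) - window, 0)
    while True:
        docs[start:start + window] = score_window(docs[start:start + window])
        if start == 0:
            break
        start = max(start - stride, 0)
    return docs

# Toy scorer: pretend relevance is a stored number (higher is better);
# in practice this would be one LLM call per window.
score_window = lambda batch: sorted(batch, key=lambda d: -d["rel"])

candidates = [{"id": i, "rel": r} for i, r in enumerate([2, 9, 4, 7, 1, 8, 3])]
print([d["id"] for d in rerank_sliding_window(candidates, score_window)])
```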
- Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception [10.614437503578856]
This paper proposes the Meta-Chunking framework, which specifically enhances chunking quality. We design two adaptive chunking techniques based on uncertainty, namely Perplexity Chunking and Margin Sampling Chunking. We establish a global information compensation mechanism, encompassing a two-stage hierarchical summary generation process and a three-stage text chunk rewriting procedure.
arXiv  Detail & Related papers  (2024-10-16T17:59:32Z)
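Perplexity Chunking, as named above, can be illustrated schematically: cut the text where the language model's uncertainty jumps. The sketch below uses hand-picked per-sentence negative log-likelihoods in place of a real LM, and the boundary rule is a simplified guess at the general idea rather than Meta-Chunking's exact criterion.

```python
import math

def perplexity_chunk(sentences, nll_per_sentence, threshold=1.2):
    """Split where the language model's uncertainty spikes.

    `nll_per_sentence` stands in for an LM scoring each sentence's
    negative log-likelihood given the text so far; a spike relative to
    the previous sentence suggests a topic boundary, so we cut there.
    """
    chunks, current = [], [sentences[0]]
    prev_ppl = math.exp(nll_per_sentence[0])
    for sent, nll in zip(sentences[1:], nll_per_sentence[1:]):
        ppl = math.exp(nll)
        if ppl > threshold * prev_ppl:   # uncertainty spike -> boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_ppl = ppl
    chunks.append(" ".join(current))
    return chunks

sentences = ["RAG retrieves chunks.", "Chunks feed the generator.",
             "Meanwhile, giraffes are tall.", "They browse on acacia trees."]
# Toy NLLs: the off-topic sentence is much more surprising to the LM.
print(perplexity_chunk(sentences, [1.0, 1.1, 2.4, 1.2]))
```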
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv  Detail & Related papers  (2024-10-07T23:38:58Z)
- Bridging LLMs and KGs without Fine-Tuning: Intermediate Probing Meets Subgraph-Aware Entity Descriptions [49.36683223327633]
Large Language Models (LLMs) encapsulate extensive world knowledge and exhibit powerful context modeling capabilities. We propose a novel framework that synergizes the strengths of LLMs with robust knowledge representation to enable effective and efficient KGC. We achieve a 47% relative improvement over previous methods based on non-fine-tuned LLMs and, to our knowledge, are the first to achieve classification performance comparable to fine-tuned LLMs.
arXiv  Detail & Related papers  (2024-08-13T10:15:55Z)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various machine learning tasks. However, deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv  Detail & Related papers  (2024-06-16T09:51:55Z)
- CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks [14.603394022550864]
CheckEmbed (CE) is a simple, scalable, and accurate verification method for large language models (LLMs). CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks.
arXiv  Detail & Related papers  (2024-06-04T17:42:21Z)
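The whole-answer comparison that CheckEmbed performs reduces, at its core, to embedding each complete answer once and measuring similarity. A toy sketch with a bag-of-words embedder standing in for a real embedding model:

```python
import numpy as np

def answers_agree(emb_a, emb_b, threshold=0.9):
    """Whole-answer verification via embedding similarity.

    Instead of comparing extracted facts, compare one embedding per
    complete answer; low similarity across independently sampled
    answers flags a likely hallucination.
    """
    cos = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return cos >= threshold, float(cos)

def toy_embed(text, vocab):
    # Bag-of-words stand-in for a real sentence-embedding model.
    return np.array([text.lower().count(w) for w in vocab], dtype=float)

vocab = ["paris", "france", "capital", "berlin"]
a1 = toy_embed("The capital of France is Paris.", vocab)
a2 = toy_embed("Paris is France's capital.", vocab)
a3 = toy_embed("The capital of France is Berlin.", vocab)
print(answers_agree(a1, a2))  # high similarity: consistent answers
print(answers_agree(a1, a3))  # lower similarity: possible hallucination
```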
- Enhancing Retrieval-Augmented LMs with a Two-stage Consistency Learning Compressor [4.35807211471107]
This work proposes a novel two-stage consistency learning approach for retrieved information compression in retrieval-augmented language models.
The proposed method is empirically validated across multiple datasets, demonstrating notable enhancements in precision and efficiency for question-answering tasks.
arXiv  Detail & Related papers  (2024-06-04T12:43:23Z)
- A Rationale-centric Counterfactual Data Augmentation Method for Cross-Document Event Coreference Resolution [29.34028569245905]
We formalize the decision-making process of the baseline ECR system using a Structural Causal Model (SCM).
We develop a rationale-centric counterfactual data augmentation method with LLM-in-the-loop.
Our approach achieves state-of-the-art performance on three popular cross-document ECR benchmarks and demonstrates robustness in out-of-domain scenarios.
arXiv  Detail & Related papers  (2024-04-02T13:15:07Z)
- Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy [46.81745860690336]
Large Language Models (LLMs) have made significant advancements across various tasks, such as question answering, translation, text summarization, and dialogue systems.
This paper presents a generic framework for accelerating the inference process, resulting in a substantial increase in speed and cost reduction.
We conduct extensive experiments to demonstrate the significant improvements achieved by applying our inference acceleration framework.
arXiv  Detail & Related papers  (2023-12-20T02:55:15Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv  Detail & Related papers  (2023-05-31T16:47:20Z)
- A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale [64.10124092250126]
Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus.
In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting.
We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.
arXiv  Detail & Related papers  (2023-04-19T18:09:27Z)
- CoCoMoT: Conformance Checking of Multi-Perspective Processes via SMT (Extended Version) [62.96267257163426]
We introduce the CoCoMoT (Computing Conformance Modulo Theories) framework.
First, we show how SAT-based encodings studied in the pure control-flow setting can be lifted to our data-aware case.
Second, we introduce a novel preprocessing technique based on a notion of property-preserving clustering.
arXiv  Detail & Related papers  (2021-03-18T20:22:50Z) 
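To make the "conformance as satisfiability" idea above concrete, here is a toy SMT check, far simpler than the actual CoCoMoT encoding: given a trace with one unlogged data value, ask whether any value satisfies the process model's guards. Requires the z3-solver package; the banking scenario is invented for illustration.

```python
from z3 import Int, Solver, sat  # pip install z3-solver

# Toy data-aware conformance check: one event's amount was not logged,
# so conformance becomes a satisfiability question over that unknown.
deposit, withdraw = Int("deposit"), Int("withdraw")

s = Solver()
s.add(deposit == 100)        # observed event: deposit of 100
s.add(withdraw >= 0)         # withdraw amount was not logged
s.add(withdraw <= deposit)   # model guard: no overdraw allowed

if s.check() == sat:
    print("conformant, e.g. withdraw =", s.model()[withdraw])
else:
    print("deviation: no data value satisfies the guards")
```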
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     
           This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.