MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
        - URL: http://arxiv.org/abs/2503.09600v2
 - Date: Mon, 26 May 2025 12:24:56 GMT
 - Title: MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System
 - Authors: Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li
 - Abstract summary: This paper introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness. We highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances. We devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism.
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
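The abstract's key mechanism is concrete enough to sketch: the chunker emits a structured list of chunking regular expressions, which are then applied to the source text to extract chunks. Below is a minimal Python sketch of that extraction step only; the `patterns` list is hand-written here, whereas in MoC it would be generated by the LLM-based chunkers.

```python
import re

def extract_chunks(text: str, patterns: list[str]) -> list[str]:
    """Apply a list of chunking regexes in order; each match becomes one chunk.

    `patterns` stands in for the structured regex list that MoC's
    chunkers would generate; here it is hand-written for illustration.
    """
    chunks, remainder = [], text
    for pattern in patterns:
        match = re.search(pattern, remainder, flags=re.DOTALL)
        if match is None:
            continue
        chunks.append(match.group(0).strip())
        # Drop the matched prefix so later patterns only see the tail.
        remainder = remainder[match.end():]
    if remainder.strip():
        chunks.append(remainder.strip())  # keep any unmatched tail
    return chunks

text = (
    "1. Introduction. RAG pipelines retrieve text chunks. "
    "2. Method. We score chunk boundaries. "
    "3. Results. Retrieval quality improves."
)
# Hypothetical patterns: one numbered section per chunk.
patterns = [r"1\..*?(?=2\.)", r"2\..*?(?=3\.)", r"3\..*"]
print(extract_chunks(text, patterns))
```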
 
       
      
        Related papers
- AI4Contracts: LLM & RAG-Powered Encoding of Financial Derivative Contracts [1.3060230641655135]
Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) are reshaping how AI systems extract and organize information from unstructured text. We introduce CDMizer, a template-driven, LLM- and RAG-based framework for structured text transformation.
arXiv  Detail & Related papers  (2025-06-01T16:05:00Z)
- Document Valuation in LLM Summaries: A Cluster Shapley Approach [0.0]
Large Language Models (LLMs) are increasingly used in systems that retrieve and summarize content from multiple sources. We propose using Shapley values, a game-theoretic method that allocates credit based on each document's marginal contribution. Because exact Shapley computation scales exponentially with the number of documents, we propose Cluster Shapley, an efficient approximation algorithm that leverages semantic similarity between documents.
arXiv  Detail & Related papers  (2025-05-28T15:14:21Z)
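As background for the entry above, here is a minimal sketch of exact Shapley credit over a small document set; `toy_utility` is a hypothetical stand-in for a real summary-quality measure, and the factorial cost it illustrates is what motivates approximations such as Cluster Shapley.

```python
from itertools import permutations

def shapley_values(docs, utility):
    """Exact Shapley credit for each document's marginal contribution.

    `utility` maps a frozenset of documents to a summary-quality score.
    Exact enumeration is O(n!), which is why cluster-based
    approximations are needed at realistic scale.
    """
    values = {d: 0.0 for d in docs}
    perms = list(permutations(docs))
    for order in perms:
        coalition = frozenset()
        for d in order:
            values[d] += utility(coalition | {d}) - utility(coalition)
            coalition = coalition | {d}
    return {d: v / len(perms) for d, v in values.items()}

# Toy utility: doc A carries most of the signal, B and C overlap.
def toy_utility(coalition):
    score = 0.0
    if "A" in coalition:
        score += 0.7
    if "B" in coalition or "C" in coalition:
        score += 0.3  # B and C are redundant with each other
    return score

print(shapley_values(["A", "B", "C"], toy_utility))
# {'A': 0.7, 'B': 0.15, 'C': 0.15} -- redundant docs split their credit.
```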
- RALLRec+: Retrieval Augmented Large Language Model Recommendation with Reasoning [22.495874056980824]
We propose Representation learning and Reasoning empowered retrieval-Augmented Large Language model Recommendation (RALLRec+).
arXiv  Detail & Related papers  (2025-03-26T11:03:34Z)
- Efficient Knowledge Feeding to Language Models: A Novel Integrated Encoder-Decoder Architecture [0.0]
ICV recasts in-context learning by using latent embeddings of language models. ICV directly integrates information into the model, enabling it to process this information more effectively.
arXiv  Detail & Related papers  (2025-02-07T04:24:07Z)
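The summary above describes injecting latent embeddings directly into the model rather than prepending demonstrations as text. A schematic sketch of that idea, with random toy vectors standing in for a transformer's hidden states (the paper's actual construction may differ):

```python
import numpy as np

# Schematic: derive a steering vector from paired demonstration states
# and inject it at inference time, instead of spending context tokens on
# demonstrations. Hidden states here are toy vectors; in a real model
# they would be read from a transformer layer via forward hooks.
rng = np.random.default_rng(0)
dim = 16

# Hidden states for (input, desired-output) demonstration pairs.
demo_inputs = rng.normal(size=(4, dim))
demo_targets = demo_inputs + 0.5  # pretend the task shifts states by +0.5

# The in-context vector: mean difference between target and input states.
icv = (demo_targets - demo_inputs).mean(axis=0)

def apply_icv(hidden_state, strength=1.0):
    """Add the scaled in-context vector to a layer's hidden state."""
    return hidden_state + strength * icv

query_state = rng.normal(size=dim)
steered = apply_icv(query_state)
print(np.round(steered - query_state, 3))  # the 0.5 shift per dimension
```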
- Enhancing Item Tokenization for Generative Recommendation through Self-Improvement [67.94240423434944]
Generative recommendation systems are driven by large language models (LLMs). Current item tokenization methods include using text descriptions, numerical strings, or sequences of discrete tokens. We propose a self-improving item tokenization method that allows the LLM to refine its own item tokenizations during the training process.
arXiv  Detail & Related papers  (2024-12-22T21:56:15Z)
- Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens core context while diminishing redundancy during the learning process. Our method is able to replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv  Detail & Related papers  (2024-12-17T01:54:08Z)
- Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv  Detail & Related papers  (2024-11-07T10:31:31Z)
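The sliding-window strategy mentioned above is easy to sketch: rerank a small window at a time, moving from the tail of the candidate list toward the head, so strong candidates bubble upward. `score_window` below is a hypothetical stand-in for the LLM's listwise ranking call.

```python
def rerank_sliding_window(candidates, score_window, window=4, stride=2):
    """Listwise reranking over a sliding window.

    A full listwise pass over hundreds of candidates will not fit in one
    prompt, hence the window. Back-to-front passes bubble strong
    candidates toward the front of the list.
    """
    docs = list(candidates)
    start = max(len(docs) - window, 0)
    while True:
        docs[start:start + window] = score_window(docs[start:start + window])
        if start == 0:
            break
        start = max(start - stride, 0)
    return docs

# Toy scorer: pretend relevance is a stored number (higher is better);
# in practice this would be one LLM call per window.
score_window = lambda batch: sorted(batch, key=lambda d: -d["rel"])

candidates = [{"id": i, "rel": r} for i, r in enumerate([2, 9, 4, 7, 1, 8, 3])]
print([d["id"] for d in rerank_sliding_window(candidates, score_window)])
```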
- Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception [10.614437503578856]
This paper proposes the Meta-Chunking framework, which specifically enhances chunking quality. We design two adaptive chunking techniques based on uncertainty, namely Perplexity Chunking and Margin Sampling Chunking. We establish a global information compensation mechanism, encompassing a two-stage hierarchical summary generation process and a three-stage text chunk rewriting procedure.
arXiv  Detail & Related papers  (2024-10-16T17:59:32Z)
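Perplexity Chunking, as named above, can be illustrated schematically: cut the text where the language model's uncertainty jumps. The sketch below uses hand-picked per-sentence negative log-likelihoods in place of a real LM, and the boundary rule is a simplified guess at the general idea rather than Meta-Chunking's exact criterion.

```python
import math

def perplexity_chunk(sentences, nll_per_sentence, threshold=1.2):
    """Split where the language model's uncertainty spikes.

    `nll_per_sentence` stands in for an LM scoring each sentence's
    negative log-likelihood given the text so far; a spike relative to
    the previous sentence suggests a topic boundary, so we cut there.
    """
    chunks, current = [], [sentences[0]]
    prev_ppl = math.exp(nll_per_sentence[0])
    for sent, nll in zip(sentences[1:], nll_per_sentence[1:]):
        ppl = math.exp(nll)
        if ppl > threshold * prev_ppl:   # uncertainty spike -> boundary
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
        prev_ppl = ppl
    chunks.append(" ".join(current))
    return chunks

sentences = ["RAG retrieves chunks.", "Chunks feed the generator.",
             "Meanwhile, giraffes are tall.", "They browse on acacia trees."]
# Toy NLLs: the off-topic sentence is much more surprising to the LM.
print(perplexity_chunk(sentences, [1.0, 1.1, 2.4, 1.2]))
```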
- Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control.
We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv  Detail & Related papers  (2024-10-07T23:38:58Z)
- Bridging LLMs and KGs without Fine-Tuning: Intermediate Probing Meets Subgraph-Aware Entity Descriptions [49.36683223327633]
Large Language Models (LLMs) encapsulate extensive world knowledge and exhibit powerful context modeling capabilities. We propose a novel framework that synergizes the strengths of LLMs with robust knowledge representation to enable effective and efficient KGC. We achieve a 47% relative improvement over previous methods based on non-fine-tuned LLMs and, to our knowledge, are the first to achieve classification performance comparable to fine-tuned LLMs.
arXiv  Detail & Related papers  (2024-08-13T10:15:55Z)
- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization [0.6445087473595953]
Large language models (LLMs) demonstrate outstanding performance in various machine learning tasks. However, deploying LLM inference poses challenges due to the high compute and memory requirements.
We present Tender, an algorithm-hardware co-design solution that enables efficient deployment of LLM inference at low precision.
arXiv  Detail & Related papers  (2024-06-16T09:51:55Z)
- CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks [14.603394022550864]
CheckEmbed (CE) is a simple, scalable, and accurate verification method for large language models (LLMs). CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks.
arXiv  Detail & Related papers  (2024-06-04T17:42:21Z)
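The whole-answer comparison that CheckEmbed performs reduces, at its core, to embedding each complete answer once and measuring similarity. A toy sketch with a bag-of-words embedder standing in for a real embedding model:

```python
import numpy as np

def answers_agree(emb_a, emb_b, threshold=0.9):
    """Whole-answer verification via embedding similarity.

    Instead of comparing extracted facts, compare one embedding per
    complete answer; low similarity across independently sampled
    answers flags a likely hallucination.
    """
    cos = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return cos >= threshold, float(cos)

def toy_embed(text, vocab):
    # Bag-of-words stand-in for a real sentence-embedding model.
    return np.array([text.lower().count(w) for w in vocab], dtype=float)

vocab = ["paris", "france", "capital", "berlin"]
a1 = toy_embed("The capital of France is Paris.", vocab)
a2 = toy_embed("Paris is France's capital.", vocab)
a3 = toy_embed("The capital of France is Berlin.", vocab)
print(answers_agree(a1, a2))  # high similarity: consistent answers
print(answers_agree(a1, a3))  # lower similarity: possible hallucination
```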
- Enhancing Retrieval-Augmented LMs with a Two-stage Consistency Learning Compressor [4.35807211471107]
This work proposes a novel two-stage consistency learning approach for retrieved information compression in retrieval-augmented language models.
The proposed method is empirically validated across multiple datasets, demonstrating notable enhancements in precision and efficiency for question-answering tasks.
arXiv  Detail & Related papers  (2024-06-04T12:43:23Z)
- A Rationale-centric Counterfactual Data Augmentation Method for Cross-Document Event Coreference Resolution [29.34028569245905]
We formalize the decision-making process of the baseline ECR system using a Structural Causal Model (SCM).
We develop a rationale-centric counterfactual data augmentation method with LLM-in-the-loop.
Our approach achieves state-of-the-art performance on three popular cross-document ECR benchmarks and demonstrates robustness in out-of-domain scenarios.
arXiv  Detail & Related papers  (2024-04-02T13:15:07Z)
- Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy [46.81745860690336]
Large Language Models (LLMs) have made significant advancements across various tasks, such as question answering, translation, text summarization, and dialogue systems.
This paper presents a generic framework for accelerating the inference process, resulting in a substantial increase in speed and cost reduction.
We conduct extensive experiments to demonstrate the significant improvements achieved by applying our inference acceleration framework.
arXiv  Detail & Related papers  (2023-12-20T02:55:15Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv  Detail & Related papers  (2023-05-31T16:47:20Z)
- A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale [64.10124092250126]
Unpaired text and audio injection have emerged as dominant methods for improving ASR performance in the absence of a large labeled corpus.
In this work, we compare three state-of-the-art semi-supervised methods encompassing both unpaired text and audio as well as several of their combinations in a controlled setting.
We find that in our setting these methods offer many improvements beyond raw WER, including substantial gains in tail-word WER, decoder computation during inference, and lattice density.
arXiv  Detail & Related papers  (2023-04-19T18:09:27Z)
- CoCoMoT: Conformance Checking of Multi-Perspective Processes via SMT (Extended Version) [62.96267257163426]
We introduce the CoCoMoT (Computing Conformance Modulo Theories) framework.
First, we show how SAT-based encodings studied in the pure control-flow setting can be lifted to our data-aware case.
Second, we introduce a novel preprocessing technique based on a notion of property-preserving clustering.
arXiv  Detail & Related papers  (2021-03-18T20:22:50Z) 
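To make the "conformance as satisfiability" idea above concrete, here is a toy SMT check, far simpler than the actual CoCoMoT encoding: given a trace with one unlogged data value, ask whether any value satisfies the process model's guards. Requires the z3-solver package; the banking scenario is invented for illustration.

```python
from z3 import Int, Solver, sat  # pip install z3-solver

# Toy data-aware conformance check: one event's amount was not logged,
# so conformance becomes a satisfiability question over that unknown.
deposit, withdraw = Int("deposit"), Int("withdraw")

s = Solver()
s.add(deposit == 100)        # observed event: deposit of 100
s.add(withdraw >= 0)         # withdraw amount was not logged
s.add(withdraw <= deposit)   # model guard: no overdraw allowed

if s.check() == sat:
    print("conformant, e.g. withdraw =", s.model()[withdraw])
else:
    print("deviation: no data value satisfies the guards")
```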
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     
           This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.