Hybrid-RACA: Hybrid Retrieval-Augmented Composition Assistance for Real-time Text Prediction
- URL: http://arxiv.org/abs/2308.04215v3
- Date: Sat, 12 Oct 2024 12:50:33 GMT
- Title: Hybrid-RACA: Hybrid Retrieval-Augmented Composition Assistance for Real-time Text Prediction
- Authors: Menglin Xia, Xuchao Zhang, Camille Couturier, Guoqing Zheng, Saravan Rajmohan, Victor Ruhle
- Abstract summary: We propose Hybrid Retrieval-Augmented Composition Assistance (Hybrid-RACA) for real-time text prediction.
It efficiently combines a cloud-based large language model with a smaller client-side model through retrieval augmented memory.
Our experiments on five datasets demonstrate that Hybrid-RACA offers strong performance while maintaining low latency.
- Score: 17.94189417448127
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) enhanced with retrieval augmentation have shown great performance in many applications. However, the computational demands of these models pose a challenge when applying them to real-time tasks, such as composition assistance. To address this, we propose Hybrid Retrieval-Augmented Composition Assistance (Hybrid-RACA), a novel system for real-time text prediction that efficiently combines a cloud-based LLM with a smaller client-side model through retrieval-augmented memory. This integration enables the client model to generate better responses, benefiting from the LLM's capabilities and cloud-based data. Meanwhile, via a novel asynchronous memory update mechanism, the client model can deliver real-time completions to user inputs without waiting for responses from the cloud. Our experiments on five datasets demonstrate that Hybrid-RACA offers strong performance while maintaining low latency.
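The asynchronous memory update described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `RetrievalMemory`, `client_complete`, `cloud_refresh`, and `compose` are hypothetical names, the "models" are string stubs, and the delays are made-up stand-ins for network and inference latency. It only shows the control flow in which the client answers from whatever memory it currently holds while a cloud request runs in the background.

```python
import asyncio


class RetrievalMemory:
    """Hypothetical client-side memory that the cloud refreshes in the background."""

    def __init__(self):
        self.entries = []

    def update(self, new_entries):
        self.entries = list(new_entries)


def client_complete(prefix, memory):
    # Stand-in for the small client model: it reads only the memory available right now.
    hint = memory.entries[0] if memory.entries else "<no memory yet>"
    return f"{prefix} ... [hint: {hint}]"


async def cloud_refresh(context, memory, delay=0.05):
    # Stand-in for the cloud LLM and retrieval; the sleep models an assumed round-trip latency.
    await asyncio.sleep(delay)
    memory.update([f"cloud notes for '{context}'"])


async def compose(keystrokes, typing_gap=0.03):
    memory = RetrievalMemory()
    completions = []
    pending = None
    for text in keystrokes:
        # Start a cloud request only if none is in flight, and never await it here.
        if pending is None or pending.done():
            pending = asyncio.create_task(cloud_refresh(text, memory))
        # The client completes immediately from its current (possibly stale) memory.
        completions.append(client_complete(text, memory))
        await asyncio.sleep(typing_gap)  # simulated pause between keystrokes
    if pending is not None:
        await pending  # drain the last background update before returning
    return completions, memory


if __name__ == "__main__":
    results, _ = asyncio.run(compose(["The", "The quick", "The quick brown"]))
    for line in results:
        print(line)
```

In this sketch the first completion is always produced before any cloud response exists, which is the property the abstract emphasizes: latency is bounded by the client model, and cloud results improve later completions once they arrive.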
Related papers
- Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks.
By utilizing data subsets during the evaluation process, we addressed the bottleneck in the reward feedback phase, thereby accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z)
- Advanced Relay-Based Collaborative Framework for Optimizing Synchronization in Split Federated Learning over Wireless Networks [4.235050593084289]
Split Federated Learning (SFL) offers a promising approach for distributed model training in edge computing.
We propose a collaborative SFL framework (CSFL) to optimize synchronization efficiency among users.
We show that our proposed CSFL framework reduces synchronization delays and improves overall system throughput.
arXiv Detail & Related papers (2025-03-18T22:11:54Z)
- Cross-Format Retrieval-Augmented Generation in XR with LLMs for Context-Aware Maintenance Assistance [6.16808916207942]
This paper presents a detailed evaluation of a Retrieval-Augmented Generation system that integrates large language models (LLMs)
We assess the performance of eight LLMs, emphasizing key metrics such as response speed and accuracy, which were quantified using BLEU and METEOR scores.
The results validate the system's ability to deliver timely and accurate responses, highlighting the potential of RAG frameworks to optimize maintenance operations.
arXiv Detail & Related papers (2025-02-21T17:19:39Z)
- A Hybrid Swarm Intelligence Approach for Optimizing Multimodal Large Language Models Deployment in Edge-Cloud-based Federated Learning Environments [10.72166883797356]
Federated Learning (FL), Multimodal Large Language Models (MLLMs), and edge-cloud computing enable distributed and real-time data processing.
We propose a novel hybrid framework wherein MLLMs are deployed on edge devices equipped with sufficient resources and battery life, while the majority of training occurs in the cloud.
Our experimental results show that the proposed method significantly improves system performance, achieving an accuracy of 92%, reducing communication cost by 30%, and enhancing client participation.
arXiv Detail & Related papers (2025-02-04T03:03:24Z)
- RL-based Query Rewriting with Distilled LLM for online E-Commerce Systems [19.674493253615235]
We propose a novel hybrid pipeline for QR that balances efficiency and effectiveness.
Our approach combines offline knowledge distillation to create a lightweight but efficient student model with online reinforcement learning (RL) to refine query rewriting dynamically using real-time feedback.
Experimental results on Amazon ESCI dataset demonstrate significant improvements in query relevance, diversity, and adaptability.
arXiv Detail & Related papers (2025-01-29T23:41:12Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Hybrid Training Approaches for LLMs: Leveraging Real and Synthetic Data to Enhance Model Performance in Domain-Specific Applications [0.0]
This research explores a hybrid approach to fine-tuning large language models (LLMs)
By leveraging a dataset combining transcribed real interactions with high-quality synthetic sessions, we aimed to overcome the limitations of domain-specific real data.
The study evaluated three models: a base foundational model, a model fine-tuned with real data, and a hybrid fine-tuned model.
arXiv Detail & Related papers (2024-10-11T18:16:03Z)
- Self-Boosting Large Language Models with Synthetic Preference Data [97.94185115047999]
We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment.
After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities.
SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
arXiv Detail & Related papers (2024-10-09T14:57:31Z)
- Beyond the Turn-Based Game: Enabling Real-Time Conversations with Duplex Models [66.24055500785657]
Traditional turn-based chat systems prevent users from verbally interacting with the system while it is generating responses.
To overcome these limitations, we adapt existing LLMs to listen to users while generating output and provide users with instant feedback.
We build a dataset consisting of alternating time slices of queries and responses as well as covering typical feedback types in instantaneous interactions.
arXiv Detail & Related papers (2024-06-22T03:20:10Z)
- Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models [0.8399688944263842]
Large Language Models (LLMs) have the capability to understand and generate human-like text from input queries.
This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines.
We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding.
arXiv Detail & Related papers (2024-06-17T04:35:17Z)
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
- Towards Client Driven Federated Learning [7.528642177161784]
We introduce Client-Driven Federated Learning (CDFL), a novel FL framework that puts clients in the driving role.
In CDFL, each client independently and asynchronously updates its model by uploading the locally trained model to the server and receiving a customized model tailored to its local task.
arXiv Detail & Related papers (2024-05-24T10:17:49Z)
- CELA: Cost-Efficient Language Model Alignment for CTR Prediction [71.85120354973073]
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems.
Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs)
We propose Cost-Efficient Language Model Alignment (CELA) for CTR prediction.
arXiv Detail & Related papers (2024-05-17T07:43:25Z)
- Efficient Cloud-edge Collaborative Inference for Object Re-identification [27.952445808987036]
We pioneer a cloud-edge collaborative inference framework for ReID systems.
We propose a distribution-aware correlation modeling network (DaCM) to make the desired image return to the cloud server.
DaCM embeds the spatial-temporal correlations implicitly included in the timestamps into a graph structure, and it can be applied in the cloud to regulate the size of the upload window.
arXiv Detail & Related papers (2024-01-04T02:56:50Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation Framework for Efficient Device Model Generalization [66.27399823422665]
Device Model Generalization (DMG) is a practical yet under-investigated research topic for on-device machine learning applications.
We propose an efficient Device-cloUd collaborative parametErs generaTion framework DUET.
arXiv Detail & Related papers (2022-09-12T13:26:26Z)
- Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in Public Cloud [9.149566952446058]
We propose Cocktail, a cost-effective ensembling-based model serving framework.
A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45x.
arXiv Detail & Related papers (2021-06-09T19:23:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.