Decoding in Latent Spaces for Efficient Inference in LLM-based Recommendation
- URL: http://arxiv.org/abs/2509.11524v1
- Date: Mon, 15 Sep 2025 02:30:35 GMT
- Title: Decoding in Latent Spaces for Efficient Inference in LLM-based Recommendation
- Authors: Chengbing Wang, Yang Zhang, Zhicheng Wang, Tianhao Shi, Keqin Bao, Fuli Feng, Tat-Seng Chua
- Abstract summary: Light Latent-space Decoding (L2D) is an effective and efficient latent-space decoding method. L2D is more than 10x faster than language-space decoding while maintaining or enhancing performance.
- Score: 75.72196852363116
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Fine-tuning large language models (LLMs) for recommendation in a generative manner has delivered promising results, but encounters significant inference overhead due to autoregressive decoding in the language space. This work explores bypassing language-space decoding by directly matching candidate items with the LLM's internal thought representations in the latent space, eliminating the time-consuming autoregressive process to reduce computational costs. Towards this, we introduce Light Latent-space Decoding (L2D), an effective and efficient latent-space decoding method. L2D represents user-preferred items by using the hidden states of test sequences reflecting the LLM's internal thought, and obtains candidate item representations from the hidden states of training sequences labeled with the corresponding candidate items. It then matches the two types of representations to decode items, achieving latent-space decoding. In this way, it enables efficient decoding without altering the LLM's generative tuning paradigm, thereby preserving performance. Extensive empirical results demonstrate that L2D is more than 10x faster than language-space decoding while maintaining or enhancing performance.
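The matching step lends itself to a short sketch. Below is a minimal illustration in Python, assuming cosine similarity and mean-pooled candidate representations (the paper may use a different similarity or pooling); `encode_last_hidden`, `build_item_bank`, and the toy data are illustrative stand-ins, with a stub replacing the fine-tuned LLM:

```python
# Hedged sketch of L2D-style latent-space decoding (arXiv:2509.11524).
# All names here are illustrative, not the authors' code.
import zlib

import torch
import torch.nn.functional as F

HIDDEN = 64  # hidden size of the stubbed LLM


def encode_last_hidden(seq: str) -> torch.Tensor:
    """Stand-in for the LLM's last-layer hidden state at the final token.

    In practice: run the generatively tuned LLM with
    output_hidden_states=True and take hidden_states[-1][0, -1].
    """
    gen = torch.Generator().manual_seed(zlib.crc32(seq.encode()))
    return torch.randn(HIDDEN, generator=gen)


def build_item_bank(train_pairs):
    """Average hidden states of training sequences labeled with each item."""
    pools = {}
    for seq, item in train_pairs:
        pools.setdefault(item, []).append(encode_last_hidden(seq))
    return {item: torch.stack(vs).mean(0) for item, vs in pools.items()}


def decode_in_latent_space(test_seq, item_bank, k=2):
    """Rank candidates by similarity to the test sequence's hidden state,
    with no autoregressive generation in the language space."""
    q = encode_last_hidden(test_seq)
    scores = {it: F.cosine_similarity(q, v, dim=0).item()
              for it, v in item_bank.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


train = [("history: Alien, Blade Runner", "Dune"),
         ("history: Heat, Ronin", "Drive"),
         ("history: Alien, Predator", "Dune")]
print(decode_in_latent_space("history: Alien, Arrival", build_item_bank(train)))
```

Because ranking reduces to a single encoding pass plus a similarity lookup over the item bank, the autoregressive loop disappears entirely, which is where the reported speedup comes from.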
Related papers
- LLM Based Long Code Translation using Identifier Replacement [3.833075202213095]
We propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method improves the efficiency and cost-effectiveness of long code translation.
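A hedged sketch of the idea, assuming a simple length threshold and sequential `ID0`, `ID1`, ... placeholders (the paper's exact scheme may differ); the LLM translation call is stubbed:

```python
# Hedged sketch of identifier replacement for LLM code translation.
# Threshold, placeholder scheme, and the translate stub are assumptions.
import re

LONG = 12  # assumed length threshold for a "long" identifier

def replace_identifiers(code: str):
    """Swap long identifiers for short placeholders; return mapping to undo."""
    mapping, counter = {}, 0
    def sub(match):
        nonlocal counter
        name = match.group(0)
        if len(name) < LONG or name in mapping:
            return mapping.get(name, name)
        mapping[name] = f"ID{counter}"
        counter += 1
        return mapping[name]
    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", sub, code), mapping

def restore_identifiers(code: str, mapping):
    # Sketch-level restore; a robust version would replace longest
    # placeholders first to avoid collisions (e.g. ID1 inside ID10).
    for name, placeholder in mapping.items():
        code = code.replace(placeholder, name)
    return code

src = "def compute_weighted_moving_average(price_history_window): ..."
compact, mapping = replace_identifiers(src)
print(compact)                               # def ID0(ID1): ...
translated = compact.replace("def", "fn")    # stand-in for the LLM call
print(restore_identifiers(translated, mapping))
```

The placeholders cost fewer tokens during translation, and the mapping restores the original names afterwards.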
arXiv Detail & Related papers (2025-10-10T06:28:15Z) - The Hidden Cost of Readability: How Code Formatting Silently Consumes Your LLM Budget [13.419222464653425]
We evaluate the impact of code formatting on large language model (LLM) performance and efficiency. Key findings indicate that LLMs can maintain performance across formatted and unformatted code while achieving an average input token reduction of 24.5%. We develop a bidirectional code transformation tool for format processing, which can be seamlessly integrated into existing inference pipelines.
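A hedged sketch of the "unformat" direction for a whitespace-insensitive language; the paper's bidirectional tool is more careful (string literals, comments, re-formatting the output), and the 24.5% figure is measured in model tokens, not characters:

```python
# Hedged sketch: stripping formatting shrinks the prompt without changing
# the program a Java/C-style compiler (or an LLM) sees.
import re

JAVA_SRC = """
public static int sum(int[] xs) {
    int total = 0;
    for (int x : xs) {
        total += x;
    }
    return total;
}
"""

def unformat(code: str) -> str:
    """Collapse all whitespace runs to single spaces.

    Deliberately naive: a real tool must protect string literals and
    comments, and keep enough information to re-format the model's
    output (the 'bidirectional' part).
    """
    return re.sub(r"\s+", " ", code).strip()

flat = unformat(JAVA_SRC)
print(flat)
print(f"{len(JAVA_SRC)} chars -> {len(flat)} chars")
# Real savings are measured with the target model's tokenizer.
```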
arXiv Detail & Related papers (2025-08-19T09:13:48Z) - Constrained Decoding of Diffusion LLMs with Context-Free Grammars [1.0923877073891446]
Large language models (LLMs) have shown promising performance across diverse domains. This paper presents the first constrained decoding method for diffusion LLMs. We show that our method achieves near-perfect syntactic correctness while consistently preserving or improving functional correctness.
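The core primitive behind grammar-constrained decoding can be sketched with a toy vocabulary and a prefix-viability check; the paper's actual contribution, handling the non-left-to-right denoising order of diffusion LLMs with full context-free grammars, is not modeled here:

```python
# Hedged sketch: mask out vocabulary items that make the current prefix
# impossible to complete under a (toy) grammar constraint.
VOCAB = ["(", ")", "x", "+"]

def viable_prefix(s: str) -> bool:
    """Toy viability check: tracks only bracket depth, so it under-
    approximates a full CFG parser (which would also reject e.g. '(x+)')."""
    depth = 0
    for c in s:
        depth += c == "("
        depth -= c == ")"
        if depth < 0:
            return False
    return True

def allowed_tokens(prefix: str):
    """Filter the vocabulary before sampling the next token."""
    return [t for t in VOCAB if viable_prefix(prefix + t)]

print(allowed_tokens(""))     # ['(', 'x', '+'] -- ')' is masked
print(allowed_tokens("(x"))   # all four remain viable at this depth
```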
arXiv Detail & Related papers (2025-08-13T18:09:09Z) - ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding [60.37988508851391]
Language models (LMs) have become a staple of the code-writing toolbox. Research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling syntax from semantics, has been noticeably sparse. In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond surface-form syntax and enhance their pre-training sample efficiency.
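A hedged sketch of how an obfuscated view of code might be produced for such grounding; the paper's obfuscation pipeline and pre-training setup are not reproduced here, and a real tool would skip builtins:

```python
# Hedged sketch: rename identifiers to opaque symbols so a Code-LM
# trained on (obfuscated, original) pairs must rely on structure, not names.
import ast

class Obfuscate(ast.NodeTransformer):
    def __init__(self):
        self.names = {}

    def fresh(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        # Note: this also renames builtins like sum(); a real pipeline
        # would exempt them to preserve semantics.
        node.id = self.fresh(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.fresh(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self.fresh(node.name)
        self.generic_visit(node)
        return node

src = ("def moving_average(values, window):\n"
       "    return sum(values[-window:]) / window")
tree = Obfuscate().visit(ast.parse(src))
print(ast.unparse(tree))  # def v0(v1, v2): return v3(v1[-v2:]) / v2
```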
arXiv Detail & Related papers (2025-03-27T23:08:53Z) - Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features [5.774786149181392]
Malicious users can exploit large language models (LLMs) to produce paraphrased versions of proprietary code that closely resemble the original. We develop LPcodedec, a detection method that identifies paraphrase relationships between human-written and LLM-generated code. LPcodedec outperforms the best baselines in two tasks, improving F1 scores by 2.64% and 15.17% while achieving speedups of 1,343x and 213x, respectively.
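A hedged sketch of style-feature extraction in this spirit; the paper's actual feature set and classifier differ, and the features below (naming-convention ratios, comment density, line length) are illustrative only:

```python
# Hedged sketch: cheap coding-style features, comparable across a
# human-written snippet and a suspected LLM paraphrase.
import re

def style_features(code: str):
    lines = code.splitlines() or [""]
    idents = re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", code)
    n = max(len(idents), 1)
    return {
        "snake_ratio": sum("_" in i for i in idents) / n,
        "camel_ratio": sum(bool(re.fullmatch(r"[a-z]+(?:[A-Z][a-z0-9]*)+", i))
                           for i in idents) / n,
        "comment_density": sum(l.lstrip().startswith("#") for l in lines) / len(lines),
        "avg_line_len": sum(map(len, lines)) / len(lines),
    }

human = ("def read_rows(path):\n"
         "    # skip header\n"
         "    return open(path).readlines()[1:]")
paraphrase = ("def readRows(filePath):\n"
              "    return open(filePath).readlines()[1:]")
print(style_features(human))
print(style_features(paraphrase))
# A classifier over such feature vectors can flag paraphrase pairs and
# help attribute them to the responsible LLM family.
```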
arXiv Detail & Related papers (2025-02-25T00:58:06Z) - Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding (LUD), a novel decoding methodology implemented in a data-driven manner.
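One plausible reading of the mechanism, sketched with a stubbed parallel proposal head; the paper's data-driven training of lexical-unit boundaries is not modeled:

```python
# Hedged sketch: propose several next tokens in one forward pass and
# accept the longest prefix whose per-token confidence stays high,
# falling back to a single token otherwise.
import random
random.seed(0)

def propose(prefix, k=4):
    """Stand-in for a parallel decoder head: k (token, confidence) pairs."""
    words = ["New", "York", "City", "is", "large"]
    return [(words[(len(prefix) + i) % len(words)], random.uniform(0.5, 1.0))
            for i in range(k)]

def lud_step(prefix, tau=0.7):
    accepted = []
    for tok, conf in propose(prefix):
        if conf < tau and accepted:   # the unit ends where confidence drops
            break
        accepted.append(tok)
        if conf < tau:                # but always accept at least one token
            break
    return prefix + accepted

print(lud_step(["Tokyo", "and"]))
```

Accepting several tokens per forward pass is what breaks the one-token-per-step bottleneck of sequential decoding.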
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding [15.723047976314751]
Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following.
We propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding.
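The title suggests a big-encode/small-decode split. The sketch below illustrates that pattern under stated assumptions (toy modules and dimensions; the `project` layer bridging the two representation spaces is a guess at the glue), and should not be read as the paper's architecture:

```python
# Hedged sketch: a large model encodes the prompt once; a small model
# then runs every cheap autoregressive step conditioned on that encoding.
import torch
import torch.nn as nn

big, small, vocab = 1024, 256, 100
encode_big = nn.Sequential(nn.Embedding(vocab, big), nn.Linear(big, big))
project = nn.Linear(big, small)          # LLM space -> SLM space (assumed)
decode_small = nn.GRU(small, small, batch_first=True)
lm_head = nn.Linear(small, vocab)

prompt = torch.randint(0, vocab, (1, 8))
with torch.no_grad():
    ctx = project(encode_big(prompt))    # one expensive pass over the prompt
    h = ctx.mean(1, keepdim=True).transpose(0, 1).contiguous()  # init state
    tok = torch.zeros(1, 1, small)
    for _ in range(5):                   # cheap steps dominated by the SLM
        out, h = decode_small(tok, h)
        print(int(lm_head(out).argmax(-1)))
        tok = out                        # toy feedback of hidden as input
```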
arXiv Detail & Related papers (2024-02-26T18:59:28Z) - StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on executed code, masking unexecuted segments to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks.
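The FGO idea of masking unexecuted code out of the loss can be sketched with Python's tracing hooks; the RL machinery (rewards from compiler feedback, the CCCS curriculum) is out of scope here, and the per-line masking below is a simplification of per-token masking:

```python
# Hedged sketch: run the sampled program under a test, record which lines
# executed, and keep only those lines in the optimization loss.
import sys

code = """def f(x):
    if x > 0:
        return x
    else:
        return -x
"""

executed = set()
def tracer(frame, event, arg):
    if event == "line":
        executed.add(frame.f_lineno)
    return tracer

ns = {}
sys.settrace(tracer)
exec(compile(code, "<sampled>", "exec"), ns)
ns["f"](3)            # the unit test touches only the positive branch
sys.settrace(None)

for lineno, line in enumerate(code.splitlines(), 1):
    print(f"{'keep' if lineno in executed else 'mask'} | {line}")
```

Masking the untouched `else` branch keeps the gradient signal focused on code the tests actually exercised.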
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
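A hedged sketch of the compression shape this suggests: learnable slot queries appended to a long input, with only the slots' outputs kept as memory. The modules and the crude pooled readout are toy stand-ins for the paper's LLM-based encoder and decoder:

```python
# Hedged sketch: compress an L-token context into k memory slots and
# train so the input can be reconstructed from them (autoencoding).
import torch
import torch.nn as nn

d, k, vocab, L = 64, 4, 200, 32          # hidden, slots, vocab, context len
emb = nn.Embedding(vocab, d)
enc = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
slots = nn.Parameter(torch.randn(1, k, d))
dec_head = nn.Linear(d, vocab)

ids = torch.randint(0, vocab, (1, L))
x = torch.cat([emb(ids), slots], dim=1)  # [1, L+k, d]
memory = enc(x)[:, -k:, :]               # keep only the k slot outputs
# Crude pooled readout; ICAE instead has the LLM decoder attend to slots.
logits = dec_head(memory.mean(1))        # [1, vocab]
loss = nn.functional.cross_entropy(logits.expand(L, vocab), ids[0])
print(f"compressed {L} tokens into {k} slots; recon loss {loss.item():.2f}")
```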
arXiv Detail & Related papers (2023-07-13T17:59:21Z) - Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references.
It is motivated by the observation that there are abundant identical text spans between an LLM's decoding result and the reference that is available in many real-world scenarios.
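A hedged sketch of the copy trigger; words stand in for token IDs, and the single-pass verification step is described only in comments:

```python
# Hedged sketch: when the last n generated tokens match a span in the
# reference, copy the following tokens as a draft for the LLM to verify.
def find_copy_span(output, reference, n=2, copy_len=4):
    """Return reference tokens following a match of output's last n tokens."""
    tail = output[-n:]
    for i in range(len(reference) - n):
        if reference[i:i + n] == tail:
            return reference[i + n:i + n + copy_len]
    return []

ref = "the cat sat on the mat and then the cat slept".split()
out = "we saw that the cat".split()
print(find_copy_span(out, ref))  # ['sat', 'on', 'the', 'mat']
# The LLM scores the whole draft in one forward pass and keeps the
# agreeing prefix, so accepted spans cost one pass instead of one per
# token -- losslessly, since disagreements fall back to normal decoding.
```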
arXiv Detail & Related papers (2023-04-10T09:55:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.