Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
- URL: http://arxiv.org/abs/2503.20533v3
- Date: Wed, 23 Apr 2025 07:58:07 GMT
- Title: Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
- Authors: Yijiong Yu
- Abstract summary: We leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Experimental results show that our method achieves over 100% speedup in decoding time while maintaining answer quality.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advances in reasoning models have demonstrated significant improvements in accuracy, particularly for complex tasks such as mathematical reasoning, by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning branches exist, we decode multiple tokens per step using a specialized attention mask and process them within a single sequence, avoiding additional memory usage. Experimental results show that our method achieves over 100% speedup in decoding time while maintaining answer quality.
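To make the mechanism concrete, here is a minimal NumPy sketch of the kind of attention mask the abstract describes: a shared prefix followed by parallel reasoning branches packed into a single sequence, where each branch attends to the prefix and to its own earlier tokens but never to a sibling branch. The function name and layout are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def parallel_branch_mask(prefix_len, branch_lens):
    """Boolean attention mask (True = may attend) for one sequence holding
    a shared prefix followed by several parallel reasoning branches.

    Each branch token attends causally to the shared prefix and to earlier
    tokens of its own branch, but never to tokens of sibling branches, so
    one forward pass can grow every branch by one token."""
    total = prefix_len + sum(branch_lens)
    mask = np.zeros((total, total), dtype=bool)

    # Shared prefix: ordinary causal attention.
    for i in range(prefix_len):
        mask[i, : i + 1] = True

    # Each branch: causal over the prefix plus its own tokens only.
    start = prefix_len
    for blen in branch_lens:
        for i in range(start, start + blen):
            mask[i, :prefix_len] = True       # see the shared prefix
            mask[i, start : i + 1] = True     # see earlier in-branch tokens
        start += blen
    return mask

# Example: a 4-token prefix with two parallel 3-token branches.
print(parallel_branch_mask(4, [3, 3]).astype(int))
```

Under such a mask, every decoding step extends all branches at once within the one sequence, which is the source of the reported decoding-time speedup.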
Related papers
- Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length? [72.70486097967124]
We formalize a framework using deterministic finite automata (DFAs).
We show that there exists an optimal number of reasoning tokens at which the probability of producing a correct solution is maximized.
We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.
arXiv Detail & Related papers (2025-04-02T17:45:58Z) - Dynamic Parallel Tree Search for Efficient LLM Reasoning [102.16694475391665]
Tree of Thoughts (ToT) enhances Large Language Model (LLM) reasoning by structuring problem-solving as a spanning tree. We propose Dynamic Parallel Tree Search (DPTS), a novel parallelism framework that aims to dynamically optimize the reasoning path in inference. Experiments on Qwen-2.5 and Llama-3 with Math500 and GSM8K datasets show that DPTS significantly improves efficiency by 2-4x on average.
arXiv Detail & Related papers (2025-02-22T14:13:37Z) - Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques [0.0]
We propose a graph-computing view of attention in which tokens are the nodes of a graph and the attention mask determines its edges.
Using this view, we develop graph processing algorithms to implement the attention mechanism.
Our algorithms are able to achieve extremely long sequence lengths of as high as 160 million on a single NVIDIA A100 GPU.
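As a toy sketch of this view, assuming a boolean mask whose rows each contain at least a self-edge (as causal masks do): the mask is read as an adjacency structure and attention is computed edge by edge, touching only the mask's nonzeros. Helper names are illustrative; the paper's GPU algorithms are far more sophisticated.

```python
import numpy as np

def mask_to_edges(mask):
    """Read a boolean attention mask as a graph: token i has an edge to
    token j whenever mask[i, j] is True. Returns the neighbor list per query."""
    return [np.flatnonzero(row) for row in mask]

def sparse_attention(Q, K, V, mask):
    """Compute attention by iterating only over the mask's edges instead of
    materializing the dense (n x n) score matrix. Assumes every query has at
    least one edge (e.g., a self-edge)."""
    out = np.zeros_like(Q)
    d = Q.shape[-1]
    for i, nbrs in enumerate(mask_to_edges(mask)):
        scores = Q[i] @ K[nbrs].T / np.sqrt(d)   # scores only for real edges
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ V[nbrs]
    return out

# Usage with an ordinary causal mask:
n, d = 6, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
print(sparse_attention(Q, K, V, np.tril(np.ones((n, n), dtype=bool))).shape)
```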
arXiv Detail & Related papers (2025-01-31T22:05:00Z) - Cerberus: Efficient Inference with Adaptive Parallel Decoding and Sequential Knowledge Enhancement [12.40683763019276]
Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding.
We have identified two key issues with existing parallel decoding frameworks.
We propose Cerberus, an adaptive parallel decoding framework.
arXiv Detail & Related papers (2024-10-17T08:55:18Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
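The selection step can be pictured roughly as below; a random vector stands in for the scoring network's output, and a hard top-k replaces SPARSEK's differentiable top-k operator, so this is an inference-flavored sketch rather than the trainable mechanism itself.

```python
import numpy as np

def topk_kv_attention(q, K, V, scores, k):
    """Attend from one query to only the k highest-scoring KV pairs, so the
    cost per query is O(k) rather than O(n). A hard argpartition stands in
    for SPARSEK's differentiable top-k mask, which keeps selection trainable."""
    idx = np.argpartition(scores, -k)[-k:]       # indices of the top-k pairs
    att = q @ K[idx].T / np.sqrt(q.shape[-1])
    w = np.exp(att - att.max())
    return (w / w.sum()) @ V[idx]

# Toy usage: random scores stand in for the scoring network's output.
n, d, k = 16, 8, 4
rng = np.random.default_rng(1)
q = rng.normal(size=d)
K, V = rng.normal(size=(2, n, d))
print(topk_kv_attention(q, K, V, rng.normal(size=n), k).shape)   # (8,)
```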
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding [12.449023969197684]
ProPD is an efficient parallel decoding framework based on dynamic token tree pruning and generation.
We demonstrate ProPD consistently outperforms existing decoding algorithms by 1.1-3.2x.
arXiv Detail & Related papers (2024-02-21T02:51:07Z) - SPEED: Speculative Pipelined Execution for Efficient Decoding [35.45955948053644]
We propose SPEED, which improves inference efficiency by speculatively executing multiple future tokens in parallel with the current token.
For Transformer decoders that employ parameter sharing, the memory operations for the tokens executing in parallel can be amortized.
We demonstrate the efficiency of our method in terms of latency reduction relative to model accuracy and demonstrate how speculation allows for training deeper decoders with parameter sharing with minimal runtime overhead.
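For orientation, here is a minimal draft-then-verify loop for generic speculative decoding, the family SPEED belongs to. SPEED itself pipelines speculation through parameter-shared decoder layers rather than using a separate draft model, which this sketch does not capture.

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """One draft-then-verify step of generic speculative decoding. A cheap
    draft model proposes k tokens; the expensive target model keeps the
    longest agreeing run. (Real implementations verify all k positions in a
    single batched forward pass; the loop below is unrolled for clarity.)"""
    draft = list(prefix)
    for _ in range(k):
        draft.append(draft_model(draft))         # cheap proposals
    accepted = list(prefix)
    for token in draft[len(prefix):]:
        if target_model(accepted) == token:
            accepted.append(token)               # agreement: token is free
        else:
            accepted.append(target_model(accepted))
            break                                # first mismatch: fix and stop
    return accepted

# Toy stand-ins: both "models" follow the same deterministic rule,
# so all k drafted tokens are accepted.
fast = lambda seq: (seq[-1] + 1) % 10
slow = lambda seq: (seq[-1] + 1) % 10
print(speculative_step(fast, slow, [0]))         # [0, 1, 2, 3, 4]
```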
arXiv Detail & Related papers (2023-10-18T16:07:01Z) - Tractable Bounding of Counterfactual Queries by Knowledge Compilation [51.47174989680976]
We discuss the problem of bounding partially identifiable queries, such as counterfactuals, in Pearlian structural causal models.
A recently proposed iterated EM scheme yields an inner approximation of those bounds by sampling the initialisation parameters.
We show how a single symbolic knowledge compilation yields the circuit structure once, with symbolic parameters that can subsequently be replaced by their actual values.
arXiv Detail & Related papers (2023-10-05T07:10:40Z) - Parallel Algorithms Align with Neural Execution [7.535219325248997]
Parallel algorithms, however, can exploit the full computational power of neural execution and therefore require fewer layers to be executed.
This drastically reduces training times, as we observe when comparing parallel implementations of searching, sorting and finding strongly connected components to their sequential counterparts on the CLRS framework.
arXiv Detail & Related papers (2023-07-08T21:28:20Z) - Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models [74.95486528482327]
We explore code prompting, a neural symbolic prompting method with both zero-shot and few-shot versions, which triggers code as intermediate reasoning steps.
We conduct experiments on 7 widely-used benchmarks involving symbolic reasoning and arithmetic reasoning.
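An illustrative sketch of the idea (the prompt template is an assumption, not the paper's exact format): the model is prompted to emit code as its intermediate reasoning, and executing that code produces the answer.

```python
# An illustrative zero-shot code prompt (the template is an assumption,
# not the paper's exact format): the model is asked to write code as its
# intermediate reasoning, and executing that code yields the final answer.
question = "Olivia has $23. She buys 5 bagels at $3 each. How much is left?"
prompt = (
    f"# Question: {question}\n"
    "# Write Python code that computes the answer, then print it.\n"
)
completion = "money = 23 - 5 * 3\nprint(money)"  # a plausible model output
exec(completion)                                 # prints 8
```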
arXiv Detail & Related papers (2023-05-29T15:14:09Z) - NAPG: Non-Autoregressive Program Generation for Hybrid Tabular-Textual Question Answering [52.10214317661547]
Current numerical reasoning methods autoregressively decode program sequences.
The accuracy of program generation drops sharply as the decoding steps unfold due to error propagation.
In this paper, we propose a non-autoregressive program generation framework.
arXiv Detail & Related papers (2022-11-07T11:25:21Z)