Related papers: Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective

Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective

URL: http://arxiv.org/abs/2505.07859v2
Date: Wed, 11 Jun 2025 15:19:33 GMT
Title: Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective
Authors: Daniel Franzen, Jan Disselhoff, David Hartmann,
Abstract summary: We leverage task-specific data augmentations throughout the training, generation, and scoring phases.<n>We employ a depth-first search algorithm to generate diverse, high-probability candidate solutions.<n>Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set.
Score: 3.2771631221674333
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2ct per task on readily available hardware (we assume a price of 36ct/hour for a Nvidia 4090 GPU).

Related papers

Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling [19.673388630963807]
Test-Time Scaling (TTS) improves the performance of Large Language Models (LLMs)<n>How to allocate a fixed rollout budget most effectively during search remains underexplored, often resulting in inefficient use of compute at test time.<n>We propose Direction-Oriented Resource Allocation (DORA), a provably optimal method that mitigates this bias.
arXiv Detail & Related papers (2025-05-30T09:05:25Z)
EfficientLLaVA:Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning.<n>Our approach only leverages a small number of samples to search for the desired pruning policy.<n>We conduct extensive experiments on the ScienceQA, Vizwiz, MM-vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z)
Bag of Tricks for Inference-time Computation of LLM Reasoning [10.366475014241407]
We investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity.<n>Our ablation studies reveal that previously overlooked strategies can significantly enhance performance.<n>We establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks.
arXiv Detail & Related papers (2025-02-11T02:31:11Z)
Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation [55.21013307734612]
AoPS-Instruct is a dataset of more than 600,000 high-quality QA pairs.<n>LiveAoPSBench is an evolving evaluation set with timestamps, derived from the latest forum data.<n>Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning.
arXiv Detail & Related papers (2025-01-24T06:39:38Z)
Provenance: A Light-weight Fact-checker for Retrieval Augmented LLM Generation Output [49.893971654861424]
We present a light-weight approach for detecting nonfactual outputs from retrieval-augmented generation (RAG) We compute a factuality score that can be thresholded to yield a binary decision. Our experiments show high area under the ROC curve (AUC) across a wide range of relevant open source datasets.
arXiv Detail & Related papers (2024-11-01T20:44:59Z)
EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv Detail & Related papers (2024-10-08T17:54:03Z)
H-ARC: A Robust Estimate of Human Performance on the Abstraction and Reasoning Corpus Benchmark [7.840781070208872]
Since 2019, limited progress has been observed on the challenge using existing artificial intelligence methods. Previous work explored how well humans can solve tasks from the ARC benchmark. We obtain a more robust estimate of human performance by evaluating 1729 humans on the full set of 400 training and 400 evaluation tasks.
arXiv Detail & Related papers (2024-09-02T17:11:32Z)
Efficient Budget Allocation for Large-Scale LLM-Enabled Virtual Screening [0.9558392439655016]
We consider an LLM-as-human-evaluator approach for conducting screening virtually, thereby reducing the cost burden.<n>We propose using a top-$m$ greedy evaluation mechanism, and design the explore-first top-$m$ greedy (EFG-$m$) algorithm.<n>Surprisingly, we uncover a bonus ranking effect, where the algorithm naturally induces an indifference-based ranking within the selected subset.
arXiv Detail & Related papers (2024-08-18T16:44:41Z)
FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications. FactorLLM achieves comparable performance to the source model securing up to 85% model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
Thought of Search: Planning with Language Models Through The Lens of Efficiency [22.47015814897628]
We argue that recent trends abandon both soundness and completeness for the sake of inefficiency. We show that by using LLMs to produce the code for the search components we can solve the entire datasets with 100% accuracy.
arXiv Detail & Related papers (2024-04-18T01:27:29Z)
Characterization of Large Language Model Development in the Datacenter [55.9909258342639]
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to efficiently utilize large-scale cluster resources to develop LLMs. We present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme.
arXiv Detail & Related papers (2024-03-12T13:31:14Z)
OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs. Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
Searching Large Neighborhoods for Integer Linear Programs with Contrastive Learning [39.40838358438744]
Linear Programs (ILPs) are powerful tools for modeling and solving a large number of optimization problems. Large Neighborhood Search (LNS), as a algorithm, can find high quality solutions to ILPs faster than Branch and Bound. We propose a novel approach, CL-LNS, that delivers state-of-the-art anytime performance on several ILP benchmarks measured by metrics.
arXiv Detail & Related papers (2023-02-03T07:15:37Z)
Planning for Sample Efficient Imitation Learning [52.44953015011569]
Current imitation algorithms struggle to achieve high performance and high in-environment sample efficiency simultaneously. We propose EfficientImitate, a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously. Experimental results show that EI achieves state-of-the-art results in performance and sample efficiency.
arXiv Detail & Related papers (2022-10-18T05:19:26Z)
Efficient Person Search: An Anchor-Free Approach [86.45858994806471]
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images. To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN. In this work, we present an anchor-free approach to efficiently tackling this challenging task, by introducing the following dedicated designs.
arXiv Detail & Related papers (2021-09-01T07:01:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.