Related papers: Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level

URL: http://arxiv.org/abs/2411.03562v1
Date: Tue, 05 Nov 2024 23:55:23 GMT
Title: Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level
Authors: Antoine Grosnit, Alexandre Maraval, James Doran, Giuseppe Paolo, Albert Thomas, Refinath Shahul Hameed Nabeezath Beevi, Jonas Gonzalez, Khyati Khandelwal, Ignacio Iacobacci, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balazs Kegl, Haitham Bou-Ammar, Jun Wang,
Abstract summary: We introduce Agent K v1.0, an end-to-end autonomous data science agent. It manages the entire data science life cycle by learning from experience. It optimises long- and short-term memory by selectively storing and retrieving key information.
Score: 73.14232472724758
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce Agent K v1.0, an end-to-end autonomous data science agent designed to automate, optimise, and generalise across diverse data science tasks. Fully automated, Agent K v1.0 manages the entire data science life cycle by learning from experience. It leverages a highly flexible structured reasoning framework to enable it to dynamically process memory in a nested structure, effectively learning from accumulated experience stored to handle complex reasoning tasks. It optimises long- and short-term memory by selectively storing and retrieving key information, guiding future decisions based on environmental rewards. This iterative approach allows it to refine decisions without fine-tuning or backpropagation, achieving continuous improvement through experiential learning. We evaluate our agent's apabilities using Kaggle competitions as a case study. Following a fully automated protocol, Agent K v1.0 systematically addresses complex and multimodal data science tasks, employing Bayesian optimisation for hyperparameter tuning and feature engineering. Our new evaluation framework rigorously assesses Agent K v1.0's end-to-end capabilities to generate and send submissions starting from a Kaggle competition URL. Results demonstrate that Agent K v1.0 achieves a 92.5\% success rate across tasks, spanning tabular, computer vision, NLP, and multimodal domains. When benchmarking against 5,856 human Kaggle competitors by calculating Elo-MMR scores for each, Agent K v1.0 ranks in the top 38\%, demonstrating an overall skill level comparable to Expert-level users. Notably, its Elo-MMR score falls between the first and third quartiles of scores achieved by human Grandmasters. Furthermore, our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold, 3 silver, and 7 bronze medals, as defined by Kaggle's progression system.

Related papers

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning [83.98129545309277]
We propose SkillRL, a framework that bridges the gap between raw experience and policy improvement.<n>Our approach introduces an experience-based distillation mechanism to build a hierarchical skill library SkillBank.<n> Experimental results on ALF, WebShop and seven search-augmented tasks demonstrate that SkillRL achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-09T03:17:17Z)
eSapiens: A Platform for Secure and Auditable Retrieval-Augmented Generation [10.667949307405983]
eSapiens is an AI-as-a-Service (AI) platform engineered around a business-oriented trifecta: proprietary data, operational, and any major Large Language Model (LLM)<n>eSapiens gives businesses full control over their AI assets, keeping everything in-house for AI knowledge retention and data security.
arXiv Detail & Related papers (2025-07-13T11:41:44Z)
SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam? [51.112225746095746]
We introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers.<n>X-Masters sets a new state-of-the-art record on Humanity's Last Exam with a score of 32.1%.
arXiv Detail & Related papers (2025-07-07T17:50:52Z)
Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact [27.722167796617114]
This paper offers a cross-disciplinary synthesis of artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems.<n>We analyze the architectural and cognitive foundations of general intelligence, highlighting the role of modular reasoning, persistent memory, and multi-agent coordination.<n>We identify key scientific, technical, and ethical challenges on the path to Artificial General Intelligence.
arXiv Detail & Related papers (2025-07-01T16:52:25Z)
Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning [29.580108004844856]
Multi-agent systems (MAS) built on large language models (LLMs) offer a promising path toward solving complex, real-world tasks. Recent advancements in test-time scaling (TTS) have significantly improved single-agent performance on challenging reasoning tasks. We introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination.
arXiv Detail & Related papers (2025-04-14T00:27:45Z)
Generalising from Self-Produced Data: Model Training Beyond Human Constraints [0.0]
This paper introduces a novel framework in which AI models autonomously generate and validate new knowledge.<n>Central to this approach is an unbounded, ungamable numeric reward that guides learning without requiring human benchmarks.
arXiv Detail & Related papers (2025-04-07T03:48:02Z)
Boosting Virtual Agent Learning and Reasoning: A Step-wise, Multi-dimensional, and Generalist Reward Model with Benchmark [72.46357004059661]
We propose Similar, a step-wise Multi-dimensional Generalist Reward Model. It offers fine-grained signals for agent training and can choose better action for inference-time scaling. We introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation.
arXiv Detail & Related papers (2025-03-24T13:30:47Z)
MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents [59.825725526176655]
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents. Existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. We introduce MultiAgentBench, a benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios.
arXiv Detail & Related papers (2025-03-03T05:18:50Z)
AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions [45.0447118979891]
AutoKaggle implements an iterative development process that combines code execution and unit testing to ensure code correctness and logic consistency. Our universal data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution. AutoKaggle achieves a validation rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines.
arXiv Detail & Related papers (2024-10-27T12:44:25Z)
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering [35.237253622981264]
MLE-bench is a benchmark for measuring how well AI agents perform at machine learning engineering. We curate 75 ML engineering-related competitions from Kaggle. We establish human baselines for each competition using Kaggle's publicly available leaderboards.
arXiv Detail & Related papers (2024-10-09T17:34:27Z)
Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process. We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals. The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
Pangu-Agent: A Fine-Tunable Generalist Agent with Structured Reasoning [50.47568731994238]
Key method for creating Artificial Intelligence (AI) agents is Reinforcement Learning (RL) This paper presents a general framework model for integrating and learning structured reasoning into AI agents' policies.
arXiv Detail & Related papers (2023-12-22T17:57:57Z)
Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
Incremental procedural and sensorimotor learning in cognitive humanoid robots [52.77024349608834]
This work presents a cognitive agent that can learn procedures incrementally. We show the cognitive functions required in each substage and how adding new functions helps address tasks previously unsolved by the agent. Results show that this approach is capable of solving complex tasks incrementally.
arXiv Detail & Related papers (2023-04-30T22:51:31Z)
Self-Supervised Representation Learning from Temporal Ordering of Automated Driving Sequences [49.91741677556553]
We propose TempO, a temporal ordering pretext task for pre-training region-level feature representations for perception tasks. We embed each frame by an unordered set of proposal feature vectors, a representation that is natural for object detection or tracking systems. Extensive evaluations on the BDD100K, nuImages, and MOT17 datasets show that our TempO pre-training approach outperforms single-frame self-supervised learning methods.
arXiv Detail & Related papers (2023-02-17T18:18:27Z)
Ensemble Value Functions for Efficient Exploration in Multi-Agent Reinforcement Learning [18.762198598488066]
Multi-agent reinforcement learning (MARL) requires agents to explore within a vast joint action space. EMAX is a framework to seamlessly extend value-based MARL algorithms.
arXiv Detail & Related papers (2023-02-07T12:51:20Z)
Improving Multimodal Interactive Agents with Reinforcement Learning from Human Feedback [16.268581985382433]
An important goal in artificial intelligence is to create agents that can both interact naturally with humans and learn from their feedback. Here we demonstrate how to use reinforcement learning from human feedback to improve upon simulated, embodied agents.
arXiv Detail & Related papers (2022-11-21T16:00:31Z)
DAAS: Differentiable Architecture and Augmentation Policy Search [107.53318939844422]
This work considers the possible coupling between neural architectures and data augmentation and proposes an effective algorithm jointly searching for them. Our approach achieves 97.91% accuracy on CIFAR-10 and 76.6% Top-1 accuracy on ImageNet dataset, showing the outstanding performance of our search algorithm.
arXiv Detail & Related papers (2021-09-30T17:15:17Z)
COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI) Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets. We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
Cognitive architecture aided by working-memory for self-supervised multi-modal humans recognition [54.749127627191655]
The ability to recognize human partners is an important social skill to build personalized and long-term human-robot interactions. Deep learning networks have achieved state-of-the-art results and demonstrated to be suitable tools to address such a task. One solution is to make robots learn from their first-hand sensory data with self-supervision.
arXiv Detail & Related papers (2021-03-16T13:50:24Z)
How much progress have we made in neural network training? A New Evaluation Protocol for Benchmarking Optimizers [86.36020260204302]
We propose a new benchmarking protocol to evaluate both end-to-end efficiency and data-addition training efficiency. A human study is conducted to show that our evaluation protocol matches human tuning behavior better than the random search. We then apply the proposed benchmarking framework to 7s and various tasks, including computer vision, natural language processing, reinforcement learning, and graph mining.
arXiv Detail & Related papers (2020-10-19T21:46:39Z)
AIBench Training: Balanced Industry-Standard AI Training Benchmarking [26.820244556465333]
Earlier-stage evaluations of a new AI architecture/system need affordable benchmarks. We use real-world benchmarks to cover the factors space that impacts the learning dynamics. We contribute by far the most comprehensive AI training benchmark suite.
arXiv Detail & Related papers (2020-04-30T11:08:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.