TreeMind: Automatically Reproducing Android Bug Reports via LLM-empowered Monte Carlo Tree Search
- URL: http://arxiv.org/abs/2509.22431v1
- Date: Fri, 26 Sep 2025 14:50:13 GMT
- Title: TreeMind: Automatically Reproducing Android Bug Reports via LLM-empowered Monte Carlo Tree Search
- Authors: Zhengyu Chen, Zhaoyi Meng, Wenxiang Zhao, Wansen Wang, Haoyang Zhao, Jiahao Zhan, Jie Cui, Hong Zhong
- Abstract summary: We present TreeMind, a novel technique that integrates large language models with a customized Monte Carlo Tree Search algorithm to achieve strategic UI exploration in bug reproduction. To the best of our knowledge, this is the first work to combine external decision-making with semantic reasoning for reliable bug reproduction. We evaluate TreeMind on a dataset of 93 real-world Android bug reports from three widely-used benchmarks. Experimental results show that it significantly outperforms four state-of-the-art baselines in reproduction success rate.
- Score: 24.23102808875548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically reproducing Android app crashes from textual bug reports is challenging, particularly when the reports are incomplete and the modern UI exhibits high combinatorial complexity. Existing approaches based on reinforcement learning or large language models (LLMs) exhibit limitations in such scenarios. They struggle to infer unobserved steps and reconstruct the underlying user action sequences to navigate the vast UI interaction space, primarily due to limited goal-directed reasoning and planning. We present TreeMind, a novel technique that integrates LLMs with a customized Monte Carlo Tree Search (MCTS) algorithm to achieve strategic UI exploration in bug reproduction. To the best of our knowledge, this is the first work to combine external decision-making with LLM semantic reasoning for reliable bug reproduction. We formulate the reproduction task as a target-driven search problem, leveraging MCTS as the core planning mechanism to iteratively refine action sequences. To enhance MCTS with semantic reasoning, we introduce two LLM-guided agents with distinct roles: Expander generates top-k promising actions based on the current UI state and exploration history, while Simulator estimates the likelihood that each action leads toward successful reproduction. By incorporating multi-modal UI inputs and advanced prompting techniques, TreeMind conducts feedback-aware navigation that identifies missing but essential user actions and incrementally reconstructs the reproduction paths. We evaluate TreeMind on a dataset of 93 real-world Android bug reports from three widely-used benchmarks. Experimental results show that it significantly outperforms four state-of-the-art baselines in reproduction success rate. A real-world case study indicates that integrating LLM reasoning with MCTS-based planning is a compelling direction for automated bug reproduction.
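The search loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy app graph `APP`, the bug report `REPORT`, the keyword-overlap `expander`, and the hand-written scores in `simulator` are all hypothetical stand-ins for the real UI and for the two LLM agents. The sketch only shows how an Expander that proposes top-k actions and a Simulator that scores states could plug into the selection, expansion, evaluation, and backpropagation phases of MCTS.

```python
import math
from dataclasses import dataclass, field

# Hypothetical toy app: screen -> {action: next screen}. "CRASH" is the target.
APP = {
    "home": {"open settings": "settings", "open about": "about", "tap search": "search"},
    "settings": {"tap theme": "theme", "tap export": "export", "back": "home"},
    "export": {"confirm export": "CRASH", "back": "settings"},
    "about": {"back": "home"},
    "search": {"back": "home"},
    "theme": {"back": "settings"},
}
# Hypothetical incomplete report: it omits the steps that lead to the export screen.
REPORT = "app crashes when I confirm export"


@dataclass
class Node:
    screen: str
    parent: "Node | None" = None
    action: "str | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0
    expanded: bool = False


def expander(screen, history, k=2):
    """Stand-in for the LLM Expander: top-k actions ranked by word overlap
    with the report. A real agent would also use `history`; this stub ignores it."""
    def score(action):
        return len(set(action.split()) & set(REPORT.split()))
    return sorted(APP.get(screen, {}), key=lambda a: (-score(a), a))[:k]


def simulator(screen):
    """Stand-in for the LLM Simulator: a made-up likelihood in [0, 1] that
    this screen leads toward reproducing the crash."""
    prior = {"CRASH": 1.0, "export": 0.8, "settings": 0.5, "home": 0.2}
    return prior.get(screen, 0.1)


def ucb(child, parent_visits, c=1.4):
    """UCB1 score; unvisited children are tried first."""
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    return exploit + c * math.sqrt(math.log(parent_visits) / child.visits)


def path_to(node):
    """Action sequence from the root to `node`."""
    actions = []
    while node.parent is not None:
        actions.append(node.action)
        node = node.parent
    return actions[::-1]


def reproduce(root_screen="home", iterations=50, k=2):
    root = Node(root_screen)
    for _ in range(iterations):
        node = root
        # Selection: descend through already-expanded nodes by UCB1.
        while node.expanded and node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb(ch, parent.visits))
        # Expansion: ask the Expander for top-k actions, then step into the best one.
        if not node.expanded:
            for action in expander(node.screen, path_to(node), k):
                node.children.append(Node(APP[node.screen][action], node, action))
            node.expanded = True
            if node.children:
                node = node.children[0]
        # Evaluation: the Simulator replaces a random rollout.
        reward = simulator(node.screen)
        if node.screen == "CRASH":
            return path_to(node)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return None


if __name__ == "__main__":
    print(reproduce())
```

Running the script prints a path that passes through the settings and export screens and ends with "confirm export", so the search recovers the navigation steps that the toy report leaves out. Replacing the random rollout with a scoring call is a simplification made here for brevity; the abstract does not specify how TreeMind's Simulator is invoked.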
Related papers
- Human-Centric Open-Future Task Discovery: Formulation, Benchmark, and Scalable Tree-Based Search [55.96277616578607]
We formalize the problem of Human-centric Open-future Task Discovery (HOTD), focusing on identifying tasks that reduce human effort across multiple plausible futures. To facilitate this study, we propose an HOTD-Bench, which features over 2K real-world videos, a semi-automated annotation pipeline, and a simulation-based protocol tailored for open-set future evaluation. We also propose the Collaborative Multi-Agent Search Tree (CMAST), which decomposes the complex reasoning through a multi-agent system and structures the reasoning process through a scalable search tree module.
arXiv Detail & Related papers (2025-11-24T09:33:59Z)
- AutoMLGen: Navigating Fine-Grained Optimization for Coding Agents [27.864519204078004]
Large language models (LLMs) have shown impressive performance in general programming tasks. We introduce AutoMLGen, an LLM-based coding agent that integrates a domain knowledge base for high-quality prior guidance. We show that AutoMLGen achieves state-of-the-art performance in numerous dimensions, such as the average medal rate and the valid submission rate.
arXiv Detail & Related papers (2025-10-09T17:45:05Z)
- KompeteAI: Accelerated Autonomous Multi-Agent System for End-to-End Pipeline Generation for Machine Learning Problems [36.17807193758863]
We introduce KompeteAI, a novel AutoML framework with dynamic solution space exploration. We propose Kompete-bench to address limitations in MLE-Bench, where KompeteAI also achieves state-of-the-art results.
arXiv Detail & Related papers (2025-08-13T20:29:56Z)
- SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition [5.5688696788198975]
We introduce SELT (Self-Evaluation LLM Tree Search), a novel framework to enhance LLM reasoning without relying on external reward models. We validate our approach on challenging benchmarks, including the knowledge-based MMLU and the Tool Learning dataset Seal-Tools.
arXiv Detail & Related papers (2025-06-09T08:52:27Z)
- I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search [10.718560472954644]
Introspective Monte Carlo Tree Search (I-MCTS) is a novel approach that iteratively expands tree nodes through an introspective process. We integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node's solution. Our approach demonstrates a 6% absolute improvement in performance compared to the strong open-source AutoML agents.
arXiv Detail & Related papers (2025-02-20T16:19:09Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search [74.46681227410038]
We propose Collective Monte Carlo Tree Search (CoMCTS) for effective and efficient reasoning-path searching and learning. We construct Mulberry-260k, a multimodal dataset with a tree of rich, explicit and well-defined reasoning nodes for each question. We perform collective SFT to train our model, Mulberry, a series of MLLMs with o1-like step-by-step Reasoning and Reflection capabilities.
arXiv Detail & Related papers (2024-12-24T10:07:51Z)
- Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling [63.98194996746229]
Large language models (LLMs) are prone to hallucination and producing factually incorrect information. We propose a novel framework, called Think&Cite, and formulate attributed text generation as a multi-step reasoning problem integrated with search.
arXiv Detail & Related papers (2024-12-19T13:55:48Z)
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
- Recurrent Alignment with Hard Attention for Hierarchical Text Rating [6.858867989434858]
We propose a novel framework for hierarchical text rating utilizing large language models (LLMs). Our framework incorporates Recurrent Alignment with Hard Attention (RAHA). Experimental results demonstrate that RAHA outperforms existing state-of-the-art methods on three hierarchical text rating datasets.
arXiv Detail & Related papers (2024-02-14T00:40:51Z)
- Tree-Planner: Efficient Close-loop Task Planning with Large Language Models [63.06270302774049]
Tree-Planner reframes task planning with Large Language Models into three distinct phases. Tree-Planner achieves state-of-the-art performance while maintaining high efficiency.
arXiv Detail & Related papers (2023-10-12T17:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.