Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success
- URL: http://arxiv.org/abs/2505.24622v2
- Date: Mon, 15 Sep 2025 18:17:48 GMT
- Title: Random Rule Forest (RRF): Interpretable Ensembles of LLM-Generated Questions for Predicting Startup Success
- Authors: Ben Griffin, Diego Vidaurre, Ugur Koyluoglu, Joseph Ternasky, Fuat Alican, Yigit Ihlamur
- Abstract summary: We introduce Random Rule Forest (RRF), a lightweight ensemble method that uses a large language model (LLM) to generate simple YES/NO questions in natural language. RRF achieves a 6.9x improvement over a random baseline on held-out data; adding expert-crafted questions lifts this to 8x. By combining the creativity of LLMs with the rigor of ensemble learning, RRF delivers interpretable, high-precision predictions suitable for decision-making in high-stakes domains.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting rare outcomes such as startup success is central to venture capital, demanding models that are both accurate and interpretable. We introduce Random Rule Forest (RRF), a lightweight ensemble method that uses a large language model (LLM) to generate simple YES/NO questions in natural language. Each question functions as a weak learner, and their responses are combined using a threshold-based voting rule to form a strong, interpretable predictor. Applied to a dataset of 9,892 founders, RRF achieves a 6.9x improvement over a random baseline on held-out data; adding expert-crafted questions lifts this to 8x and highlights the value of human-LLM collaboration. Compared with zero- and few-shot baselines across three LLM architectures, RRF attains an F0.5 of 0.121, versus 0.086 for the best baseline (+0.035 absolute, +41% relative). By combining the creativity of LLMs with the rigor of ensemble learning, RRF delivers interpretable, high-precision predictions suitable for decision-making in high-stakes domains.
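The abstract describes the core mechanism: each LLM-generated YES/NO question acts as a weak learner, and a founder is flagged as a likely success when the number of YES answers reaches a voting threshold. The sketch below is a minimal illustration of that threshold-voting rule and of the F0.5 metric reported above; the question wording, answers, and threshold value are hypothetical and not taken from the paper.

```python
# Minimal sketch of RRF-style threshold voting (illustrative only).
# Question wording, answers, and the threshold below are hypothetical;
# the paper's actual question set and tuned threshold are not reproduced here.

def rrf_predict(yes_no_answers: list[bool], threshold: int) -> bool:
    """Flag a founder as a predicted success if at least `threshold` questions are YES."""
    return sum(yes_no_answers) >= threshold

def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta = 0.5 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# One founder, three hypothetical YES/NO answers from an LLM
# (e.g. "Has the founder previously built and shipped a product?").
answers = [True, False, True]
print(rrf_predict(answers, threshold=2))   # True -> predicted success

# F0.5 from hypothetical precision/recall values (not the paper's numbers).
print(round(f_beta(precision=0.15, recall=0.08), 3))
```

Raising the threshold trades recall for precision, which is why a precision-weighted metric such as F0.5 is the natural headline number for this kind of voting rule.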
Related papers
- Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process [58.265053900416895]
LLM-PeerReview is built on a novel, peer-review-inspired framework. It operates in three stages: for scoring, we use the emerging LLM-as-a-Judge technique; for reasoning, we can apply a graphical model-based truth inference algorithm; finally, the highest-scoring response is selected as the best ensemble output.
arXiv Detail & Related papers (2025-12-29T05:25:49Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - LLM-AR: LLM-powered Automated Reasoning Framework [0.0]
Large language models (LLMs) can already identify patterns and reason effectively, yet their variable accuracy hampers adoption in high-stakes decision-making applications. We introduce LLM-AR, a pipeline inspired by neural-symbolic systems that distils LLM-generated outputs into probabilistic rules executed by the ProbLog automated-reasoning engine. On unseen folds, LLM-AR achieves 59.5% precision and 8.7% recall, 5.9x the random baseline precision, while exposing every decision path for human inspection.
arXiv Detail & Related papers (2025-10-24T21:36:18Z) - From Limited Data to Rare-event Prediction: LLM-powered Feature Engineering and Multi-model Learning in Venture Capital [0.0]
This paper presents a framework for predicting rare, high-impact outcomes by integrating large language models (LLMs) with a multi-model machine learning (ML) architecture. We use LLM-powered feature engineering to extract and synthesize complex signals from unstructured data. We apply this framework to the domain of Venture Capital (VC), where investors must evaluate startups with limited and noisy early-stage data.
arXiv Detail & Related papers (2025-09-09T20:46:54Z) - RLPR: Extrapolating RLVR to General Domains without Verifiers [103.14103272635893]
We propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. We find that addressing the high variance of this noisy probability reward is crucial to make it work. RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models.
arXiv Detail & Related papers (2025-06-23T02:56:36Z) - Policy Induction: Predicting Startup Success via Explainable Memory-Augmented In-Context Learning [0.0]
We propose a transparent and data-efficient investment decision framework powered by memory-augmented large language models. We introduce a lightweight training process that combines few-shot learning with an in-context learning loop. Our system predicts startup success far more accurately than existing benchmarks.
arXiv Detail & Related papers (2025-05-27T16:57:07Z) - Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models [83.8639566087953]
We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components. DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted, progressively improving RAG components. Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning.
arXiv Detail & Related papers (2025-05-05T23:54:53Z) - Reasoning-Based AI for Startup Evaluation (R.A.I.S.E.): A Memory-Augmented, Multi-Step Decision Framework [0.0]
We present a novel framework that bridges the gap between the interpretability of decision trees and the advanced reasoning capabilities of large language models (LLMs) to predict startup success. Our approach leverages chain-of-thought prompting to generate detailed reasoning logs, which are subsequently distilled into structured, human-understandable logical rules. Our method not only augments traditional decision-making processes but also facilitates expert intervention and continuous policy refinement.
arXiv Detail & Related papers (2025-04-16T13:53:42Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models. We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - CER: Confidence Enhanced Reasoning in LLMs [2.4392539322920763]
We introduce an uncertainty-aware framework designed to enhance the accuracy of Large Language Model responses. We quantify the confidence of intermediate answers, such as numerical results in mathematical reasoning and proper nouns in open-domain generation. Results consistently validate the effectiveness of our novel confidence aggregation method.
arXiv Detail & Related papers (2025-02-20T15:16:42Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - SSFF: Investigating LLM Predictive Capabilities for Startup Success through a Multi-Agent Framework with Enhanced Explainability and Performance [0.16385815610837165]
The Startup Success Forecasting Framework is an autonomous system that emulates the reasoning of venture capital analysts. By leveraging founder segmentation, startups led by L5 founders are 3.79 times more likely to succeed than those led by L1 founders. Our framework significantly enhances prediction accuracy, yielding a 108.3 percent relative improvement over GPT 4o mini and a 30.8 percent relative improvement over GPT 4o.
arXiv Detail & Related papers (2024-05-29T19:07:42Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - LoRA-Ensemble: Efficient Uncertainty Modelling for Self-Attention Networks [52.46420522934253]
We introduce LoRA-Ensemble, a parameter-efficient ensembling method for self-attention networks. The method not only outperforms state-of-the-art implicit techniques like BatchEnsemble, but even matches or exceeds the accuracy of an Explicit Ensemble.
arXiv Detail & Related papers (2024-05-23T11:10:32Z) - Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection [90.71323430635593]
We propose a novel self-detection paradigm that considers the comprehensive answer space beyond LLM-generated answers.
Building upon this paradigm, we introduce a two-step framework, which first instructs the LLM to reflect and provide justifications for each candidate answer.
This framework can be seamlessly integrated with existing approaches for superior self-detection.
arXiv Detail & Related papers (2024-03-15T02:38:26Z) - Pushing The Limit of LLM Capacity for Text Classification [27.684335455517417]
We propose RGPT, an adaptive boosting framework tailored to produce a specialized text classification LLM.
We show that RGPT significantly outperforms 8 SOTA PLMs and 7 SOTA LLMs on four benchmarks by 1.36% on average.
arXiv Detail & Related papers (2024-02-12T08:14:03Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.