Related papers: Solving the Data Sparsity Problem in Predicting the Success of the Startups with Machine Learning Methods

URL: http://arxiv.org/abs/2112.07985v1
Date: Wed, 15 Dec 2021 09:21:32 GMT
Title: Solving the Data Sparsity Problem in Predicting the Success of the Startups with Machine Learning Methods
Authors: Dafei Yin, Jing Li, Gaosheng Wu
Abstract summary: We investigate several machine learning algorithms with a large dataset from Crunchbase. The results suggest that LightGBM and XGBoost perform best and achieve 53.03% and 52.96% F1 scores. These findings have substantial implications on how machine learning methods can help startup companies and investors.
Score: 2.939434965353219
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Predicting the success of startup companies is of great importance for both startup companies and investors. It is difficult due to the lack of available data and appropriate general methods. With data platforms like Crunchbase aggregating the information of startup companies, it is possible to predict with machine learning algorithms. Existing research suffers from the data sparsity problem as most early-stage startup companies do not have much data available to the public. We try to leverage the recent algorithms to solve this problem. We investigate several machine learning algorithms with a large dataset from Crunchbase. The results suggest that LightGBM and XGBoost perform best and achieve 53.03% and 52.96% F1 scores. We interpret the predictions from the perspective of feature contribution. We construct portfolios based on the models and achieve high success rates. These findings have substantial implications on how machine learning methods can help startup companies and investors.
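For readers unfamiliar with the metric reported above, the F1 score is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation, not the authors' evaluation code; the function name `f1_score` and the confusion-matrix counts in the usage line are hypothetical and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score from true-positive, false-positive and false-negative counts."""
    # Precision: share of predicted successes that were actually successful.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual successes that the model identified.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
example = f1_score(tp=50, fp=50, fn=50)
```

With these illustrative counts, precision and recall are both 0.5, so the F1 score is also 0.5. An F1 near 53% therefore indicates that precision and recall are balanced around the halfway mark, or that one is noticeably higher than the other.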

Related papers

Decision Making under Imperfect Recall: Algorithms and Benchmarks [77.12503122836422]
We introduce the first benchmark suite for imperfect-recall decision problems. Our benchmarks capture a variety of problem types, including ones concerning privacy in AI systems. We evaluate the performance of different algorithms for finding first-order optimal strategies in such problems.
arXiv Detail & Related papers (2026-02-16T23:19:01Z)
Predicting Startup Success Using Large Language Models: A Novel In-Context Learning Approach [32.510120225056944]
In this paper, we propose an in-context learning framework for startup success prediction using large language models (LLMs). Specifically, we propose a novel k-nearest-neighbor-based in-context learning framework, called kNN-ICL, which selects the most relevant past startups as examples based on similarity. Using real-world profiles from Crunchbase, we find that the kNN-ICL approach achieves higher prediction accuracy than supervised machine learning baselines and vanilla in-context learning.
arXiv Detail & Related papers (2026-01-23T09:08:52Z)
Exposing the Copycat Problem of Imitation-based Planner: A Novel Closed-Loop Simulator, Causal Benchmark and Joint IL-RL Baseline [49.51385135697656]
Within machine learning-based planning, imitation learning (IL) is a common algorithm. It primarily learns driving policies directly from supervised trajectory data. It remains challenging to determine if the learned policy truly understands fundamental driving principles. This work proposes a novel closed-loop simulator supporting both imitation and reinforcement learning.
arXiv Detail & Related papers (2025-04-20T18:51:26Z)
A Fused Large Language Model for Predicting Startup Success [21.75303916815358]
We develop a machine learning approach with the aim of locating successful startups on venture capital platforms. Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success. Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success.
arXiv Detail & Related papers (2024-09-05T16:22:31Z)
MUSE: Machine Unlearning Six-Way Evaluation for Language Models [109.76505405962783]
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. We propose MUSE, a comprehensive machine unlearning evaluation benchmark. We benchmark how effectively eight popular unlearning algorithms can unlearn Harry Potter books and news articles.
arXiv Detail & Related papers (2024-07-08T23:47:29Z)
Learning-Augmented Algorithms with Explicit Predictors [67.02156211760415]
Recent advances in algorithmic design show how to utilize predictions obtained by machine learning models from past and present data. Prior research in this context focused on a paradigm where the predictor is pre-trained on past data and then used as a black box. In this work, we unpack the predictor and integrate the learning problem it gives rise to within the algorithmic challenge.
arXiv Detail & Related papers (2024-03-12T08:40:21Z)
On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies. Machine and deep learning algorithms depend heavily on the data used during their development. We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
Problem-Solving Guide: Predicting the Algorithm Tags and Difficulty for Competitive Programming Problems [7.955313479061445]
Most tech companies, including Google, Meta, and Amazon, require the ability to solve algorithm problems. Our study addresses the task of predicting the algorithm tag as a useful tool for engineers and developers. We also consider predicting the difficulty levels of algorithm problems, which can serve as guidance for estimating the time required to solve a problem.
arXiv Detail & Related papers (2023-10-09T15:26:07Z)
Startup success prediction and VC portfolio simulation using CrunchBase data [1.7897779505837144]
This paper focuses on startups at their Series B and Series C investment stages, aiming to predict key success milestones. We introduce a novel deep learning model for predicting startup success, integrating a variety of factors such as funding metrics, founder features, and industry category. Our work demonstrates the considerable promise of deep learning models and alternative unstructured data in predicting startup success.
arXiv Detail & Related papers (2023-09-27T10:22:37Z)
Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset. The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
Can We Do Better Than Random Start? The Power of Data Outsourcing [9.677679780556103]
Many organizations have access to abundant data but lack the computational power to process the data. We propose simulation-based algorithms that can utilize a small amount of outsourced data to find good initial points.
arXiv Detail & Related papers (2022-05-17T05:34:36Z)
MatRec: Matrix Factorization for Highly Skewed Dataset [4.658166900129066]
We propose a new algorithm that solves the problem in the framework of matrix factorization. We prove that our method generates results comparable to those of popular recommender system algorithms.
arXiv Detail & Related papers (2020-11-09T12:55:38Z)
Faster Secure Data Mining via Distributed Homomorphic Encryption [108.77460689459247]
Homomorphic Encryption (HE) is receiving more and more attention recently for its capability to do computations over the encrypted field. We propose a novel general distributed HE-based data mining framework towards one step of solving the scaling problem. We verify the efficiency and effectiveness of our new framework by testing over various data mining algorithms and benchmark data-sets.
arXiv Detail & Related papers (2020-06-17T18:14:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.