Solving the Data Sparsity Problem in Predicting the Success of the
Startups with Machine Learning Methods
- URL: http://arxiv.org/abs/2112.07985v1
- Date: Wed, 15 Dec 2021 09:21:32 GMT
- Title: Solving the Data Sparsity Problem in Predicting the Success of the
Startups with Machine Learning Methods
- Authors: Dafei Yin, Jing Li, Gaosheng Wu
- Abstract summary: We investigate several machine learning algorithms with a large dataset from Crunchbase.
The results suggest that LightGBM and XGBoost perform best and achieve 53.03% and 52.96% F1 scores.
These findings have substantial implications for how machine learning methods can help startup companies and investors.
- Score: 2.939434965353219
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting the success of startup companies is of great importance for both
startup companies and investors. It is difficult due to the lack of available
data and appropriate general methods. With data platforms like Crunchbase
aggregating the information of startup companies, it is possible to predict
with machine learning algorithms. Existing research suffers from the data
sparsity problem as most early-stage startup companies do not have much data
available to the public. We try to leverage recent algorithms to solve this
problem. We investigate several machine learning algorithms with a large
dataset from Crunchbase. The results suggest that LightGBM and XGBoost perform
best and achieve 53.03% and 52.96% F1 scores. We interpret the predictions from
the perspective of feature contribution. We construct portfolios based on the
models and achieve high success rates. These findings have substantial
implications for how machine learning methods can help startup companies and
investors.
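The abstract reports F1 scores and portfolio success rates without defining either here. The following is a minimal sketch of how such metrics are commonly computed, not the authors' evaluation code. The function names (`f1_score`, `top_k_portfolio`), the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def top_k_portfolio(
    scores: Sequence[float], outcomes: Sequence[int], k: int
) -> Tuple[List[int], float]:
    """Pick the k highest-scored companies and report their success rate."""
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of companies")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    picks = ranked[:k]
    success_rate = sum(outcomes[i] for i in picks) / k
    return picks, success_rate


# Toy data (hypothetical): 1 = startup succeeded, 0 = it did not.
outcomes = [1, 0, 0, 1, 0, 0, 1, 0]
# Hypothetical model scores, e.g. predicted success probabilities.
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4, 0.3, 0.8]
# Assumed decision threshold of 0.5 to turn scores into class predictions.
predictions = [1 if s >= 0.5 else 0 for s in scores]

toy_f1 = f1_score(outcomes, predictions)
toy_picks, toy_rate = top_k_portfolio(scores, outcomes, k=3)
```

Because F1 ignores true negatives, it is a natural metric when successful startups are a small minority of the dataset. A portfolio success rate instead measures only the quality of the highest-ranked predictions.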
Related papers
- A Fused Large Language Model for Predicting Startup Success [21.75303916815358]
We develop a machine learning approach with the aim of locating successful startups on venture capital platforms.
Specifically, we develop, train, and evaluate a tailored, fused large language model to predict startup success.
Using 20,172 online profiles from Crunchbase, we find that our fused large language model can predict startup success.
arXiv Detail & Related papers (2024-09-05T16:22:31Z)
- MUSE: Machine Unlearning Six-Way Evaluation for Language Models [109.76505405962783]
Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content.
We propose MUSE, a comprehensive machine unlearning evaluation benchmark.
We benchmark how effectively eight popular unlearning algorithms can unlearn Harry Potter books and news articles.
arXiv Detail & Related papers (2024-07-08T23:47:29Z)
- Learning-Augmented Algorithms with Explicit Predictors [67.02156211760415]
Recent advances in algorithmic design show how to utilize predictions obtained by machine learning models from past and present data.
Prior research in this context was focused on a paradigm where the predictor is pre-trained on past data and then used as a black box.
In this work, we unpack the predictor and integrate the learning problem it gives rise to within the algorithmic challenge.
arXiv Detail & Related papers (2024-03-12T08:40:21Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Problem-Solving Guide: Predicting the Algorithm Tags and Difficulty for Competitive Programming Problems [7.955313479061445]
Most tech companies, including Google, Meta, and Amazon, require the ability to solve algorithm problems.
Our study addresses the task of predicting the algorithm tag as a useful tool for engineers and developers.
We also consider predicting the difficulty levels of algorithm problems, which can serve as guidance for estimating the time required to solve a problem.
arXiv Detail & Related papers (2023-10-09T15:26:07Z)
- Startup success prediction and VC portfolio simulation using CrunchBase data [1.7897779505837144]
This paper focuses on startups at their Series B and Series C investment stages, aiming to predict key success milestones.
We introduce a novel deep learning model for predicting startup success, integrating a variety of factors such as funding metrics, founder features, and industry category.
Our work demonstrates the considerable promise of deep learning models and alternative unstructured data in predicting startup success.
arXiv Detail & Related papers (2023-09-27T10:22:37Z)
- Bag of Tricks for Training Data Extraction from Language Models [98.40637430115204]
We investigate and benchmark tricks for improving training data extraction using a publicly available dataset.
The experimental results show that several previously overlooked tricks can be crucial to the success of training data extraction.
arXiv Detail & Related papers (2023-02-09T06:46:42Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Can We Do Better Than Random Start? The Power of Data Outsourcing [9.677679780556103]
Many organizations have access to abundant data but lack the computational power to process the data.
We propose simulation-based algorithms that can utilize a small amount of outsourced data to find good initial points.
arXiv Detail & Related papers (2022-05-17T05:34:36Z)
- MatRec: Matrix Factorization for Highly Skewed Dataset [4.658166900129066]
We propose a new algorithm solving the problem in the framework of matrix factorization.
We prove our method generates results comparable to those of popular recommender system algorithms.
arXiv Detail & Related papers (2020-11-09T12:55:38Z)
- Faster Secure Data Mining via Distributed Homomorphic Encryption [108.77460689459247]
Homomorphic Encryption (HE) has recently received increasing attention for its capability to do computations over the encrypted field.
We propose a novel general distributed HE-based data mining framework towards one step of solving the scaling problem.
We verify the efficiency and effectiveness of our new framework by testing over various data mining algorithms and benchmark data-sets.
arXiv Detail & Related papers (2020-06-17T18:14:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.