StackOverflow vs Kaggle: A Study of Developer Discussions About Data
Science
- URL: http://arxiv.org/abs/2006.08334v1
- Date: Sat, 6 Jun 2020 06:51:11 GMT
- Title: StackOverflow vs Kaggle: A Study of Developer Discussions About Data
Science
- Authors: David Hin
- Abstract summary: This paper conducts experiments to study the characteristics of 197,836 posts from StackOverflow and Kaggle.
The main findings include that TensorFlow-related topics were most prevalent on StackOverflow.
Across both communities, DS discussion is increasing at a dramatic rate.
Ensemble algorithms are the most mentioned ML/DL algorithms on Kaggle but are rarely discussed on StackOverflow.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software developers are increasingly required to understand fundamental Data
Science (DS) concepts. Recently, the presence of machine learning (ML) and deep
learning (DL) has dramatically increased in the development of user
applications, whether they are leveraged through frameworks or implemented from
scratch. These topics attract much discussion on online platforms. This paper
conducts large-scale qualitative and quantitative experiments to study the
characteristics of 197,836 posts from StackOverflow and Kaggle. Latent Dirichlet
Allocation topic modelling is used to extract twenty-four DS discussion topics.
The main findings include that TensorFlow-related topics were most prevalent on
StackOverflow, while meta discussion topics were most prevalent on Kaggle.
StackOverflow tends to include lower-level troubleshooting, while Kaggle
focuses on practicality and optimising leaderboard performance. In addition,
across both communities, DS discussion is increasing at a dramatic rate. While
TensorFlow discussion on StackOverflow is slowing, interest in Keras is rising.
Finally, ensemble algorithms are the most mentioned ML/DL algorithms in Kaggle
but are rarely discussed on StackOverflow. These findings can help educators
and researchers to more effectively tailor and prioritise efforts in
researching and communicating DS concepts towards different developer
communities.
Related papers
- Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z) - A Tale of Two Communities: Exploring Academic References on Stack Overflow [1.2914230269240388]
We find that Stack Overflow communities with different domains of interest engage with academic literature at varying frequencies and speeds.
The contradicting patterns suggest that some disciplines may have diverged in their interests and development trajectories from the corresponding practitioner community.
arXiv Detail & Related papers (2024-03-14T20:33:55Z) - ChatGPT vs LLaMA: Impact, Reliability, and Challenges in Stack Overflow
Discussions [13.7001994656622]
ChatGPT has shaken up Stack Overflow, the premier platform for developers' queries on programming and software development.
Two months after ChatGPT's release, Meta released its answer with its own Large Language Model (LLM) called LLaMA: the race was on.
arXiv Detail & Related papers (2024-02-13T21:15:33Z) - RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal
Sentiment Classification [70.9087014537896]
Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars.
To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets.
arXiv Detail & Related papers (2023-10-14T14:52:37Z) - Semantic Parsing for Conversational Question Answering over Knowledge
Graphs [63.939700311269156]
We develop a dataset where user questions are annotated with SPARQL parses and system answers correspond to execution results thereof.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
arXiv Detail & Related papers (2023-01-28T14:45:11Z) - Answer ranking in Community Question Answering: a deep learning approach [0.0]
This work tries to advance the state of the art on answer ranking for community Question Answering by applying a deep learning approach.
We created a large data set of questions and answers posted to the Stack Overflow website.
We leveraged the natural language processing capabilities of dense embeddings and LSTM networks to produce a prediction for the accepted answer attribute.
arXiv Detail & Related papers (2022-10-16T18:47:41Z) - Attention-based model for predicting question relatedness on Stack
Overflow [0.0]
We propose an Attention-based Sentence pair Interaction Model (ASIM) to predict the relatedness between questions on Stack Overflow automatically.
ASIM has made significant improvement over the baseline approaches in Precision, Recall, and Micro-F1 evaluation metrics.
Our model also performs well in the duplicate question detection task of Ask Ubuntu.
arXiv Detail & Related papers (2021-03-19T12:18:03Z) - The Influence of Domain-Based Preprocessing on Subject-Specific
Clustering [55.41644538483948]
The sudden shift of moving the majority of teaching online at universities has caused an increased workload for academics.
One way to deal with this problem is to cluster these questions depending on their topic.
In this paper, we explore the realms of tagging data sets, focusing on identifying code excerpts and providing empirical results.
arXiv Detail & Related papers (2020-11-16T17:47:19Z) - Unification of HDP and LDA Models for Optimal Topic Clustering of
Subject Specific Question Banks [55.41644538483948]
An increase in the popularity of online courses would result in an increase in the number of course-related queries for academics.
In order to reduce the time spent on answering each individual question, clustering them is an ideal choice.
We use the Hierarchical Dirichlet Process to determine an optimal topic number input for our LDA model runs.
arXiv Detail & Related papers (2020-10-04T18:21:20Z) - Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
arXiv Detail & Related papers (2020-10-03T18:57:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.