StackOverflow vs Kaggle: A Study of Developer Discussions About Data
  Science
- URL: http://arxiv.org/abs/2006.08334v1
- Date: Sat, 6 Jun 2020 06:51:11 GMT
- Title: StackOverflow vs Kaggle: A Study of Developer Discussions About Data
  Science
- Authors: David Hin
- Abstract summary: This paper conducts experiments to study the characteristics of 197836 posts from StackOverflow and Kaggle.
The main findings include that TensorFlow-related topics were most prevalent on StackOverflow.
Across both communities, DS discussion is increasing at a dramatic rate.
Ensemble algorithms are the most mentioned ML/DL algorithms on Kaggle but are rarely discussed on StackOverflow.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software developers are increasingly required to understand fundamental data
science (DS) concepts. Recently, the presence of machine learning (ML) and deep
learning (DL) has dramatically increased in the development of user
applications, whether they are leveraged through frameworks or implemented from
scratch. These topics attract much discussion on online platforms. This paper
conducts large-scale qualitative and quantitative experiments to study the
characteristics of 197836 posts from StackOverflow and Kaggle. Latent Dirichlet
Allocation topic modelling is used to extract twenty-four DS discussion topics.
The main findings include that TensorFlow-related topics were most prevalent in
StackOverflow, while meta discussion topics were the prevalent ones on Kaggle.
StackOverflow tends to include lower-level troubleshooting, while Kaggle
focuses on practicality and optimising leaderboard performance. In addition,
across both communities, DS discussion is increasing at a dramatic rate. While
TensorFlow discussion on StackOverflow is slowing, interest in Keras is rising.
Finally, ensemble algorithms are the most mentioned ML/DL algorithms in Kaggle
but are rarely discussed on StackOverflow. These findings can help educators
and researchers to more effectively tailor and prioritise efforts in
researching and communicating DS concepts towards different developer
communities.
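
The topic-extraction step named in the abstract can be illustrated with a minimal sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling. This is not the paper's implementation: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy posts are all assumptions made for illustration, and the paper extracts twenty-four topics from 197836 posts rather than two topics from six.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenised documents with collapsed Gibbs sampling (illustrative)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                   # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]     # tokens per (topic, word)
    topic_total = [0] * num_topics                                 # tokens per topic
    assignments = []                                               # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # Three highest-count words for each topic.
    top_words = [
        [vocab[i] for i in sorted(range(vocab_size), key=lambda i: -topic_word[k][i])[:3]]
        for k in range(num_topics)
    ]
    return theta, top_words

# Hypothetical toy posts, already tokenised; not data from the paper.
POSTS = [
    "tensorflow session error tensor shape".split(),
    "keras model layer tensorflow error".split(),
    "tensor shape mismatch keras layer".split(),
    "leaderboard score ensemble xgboost".split(),
    "ensemble stacking leaderboard submission".split(),
    "xgboost submission score kernel".split(),
]

theta, top_words = lda_gibbs(POSTS, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
```

Each row of `theta` gives one post's estimated topic proportions, which is the kind of quantity a prevalence comparison between platforms could be built on. On real data, an established library implementation would normally be preferable to a hand-written sampler.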
 
      
Related papers
- GBM Returns the Best Prediction Performance among Regression Approaches: A Case Study of Stack Overflow Code Quality [2.5515299924109858]
 We examined the variables that predict Stack Overflow (Java) code quality, and the regression approach that provides the best predictive power.
Longer Stack Overflow code tended to have more code violations, questions that were scored higher also attracted more views, and the more answers that were added to a question on Stack Overflow, the more errors were typically observed in the code provided.
 arXiv  Detail & Related papers  (2025-05-15T07:04:17Z)
- Exploring Challenges in Test Mocking: Developer Questions and Insights from StackOverflow [0.0]
 We have analyzed 25,302 questions related to mocking techniques on StackOverflow.
We have used Latent Dirichlet Allocation for topic modeling.
We have analyzed the annual and relative probabilities of each category to understand the evolution of mocking-related discussions.
 arXiv  Detail & Related papers  (2025-05-13T07:23:49Z)
- Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
 We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supervised data to boost LLMs' inference capabilities.
 arXiv  Detail & Related papers  (2024-05-07T07:39:15Z)
- A Tale of Two Communities: Exploring Academic References on Stack Overflow [1.2914230269240388]
 We find that Stack Overflow communities with different domains of interest engage with academic literature at varying frequencies and speeds.
The contradicting patterns suggest that some disciplines may have diverged in their interests and development trajectories from the corresponding practitioner community.
 arXiv  Detail & Related papers  (2024-03-14T20:33:55Z)
- ChatGPT vs LLaMA: Impact, Reliability, and Challenges in Stack Overflow
  Discussions [13.7001994656622]
 ChatGPT has shaken up Stack Overflow, the premier platform for developers' queries on programming and software development.
Two months after ChatGPT's release, Meta released its answer with its own Large Language Model (LLM) called LLaMA: the race was on.
 arXiv  Detail & Related papers  (2024-02-13T21:15:33Z)
- RethinkingTMSC: An Empirical Study for Target-Oriented Multimodal
  Sentiment Classification [70.9087014537896]
 Target-oriented Multimodal Sentiment Classification (TMSC) has gained significant attention among scholars.
To investigate the causes of this problem, we perform extensive empirical evaluation and in-depth analysis of the datasets.
 arXiv  Detail & Related papers  (2023-10-14T14:52:37Z)
- Semantic Parsing for Conversational Question Answering over Knowledge
  Graphs [63.939700311269156]
 We develop a dataset where user questions are annotated with Sparql parses and system answers correspond to execution results thereof.
We present two different semantic parsing approaches and highlight the challenges of the task.
Our dataset and models are released at https://github.com/Edinburgh/SPICE.
 arXiv  Detail & Related papers  (2023-01-28T14:45:11Z)
- Answer ranking in Community Question Answering: a deep learning approach [0.0]
 This work tries to advance the state of the art on answer ranking for community Question Answering by proceeding with a deep learning approach.
We created a large data set of questions and answers posted to the Stack Overflow website.
We leveraged the natural language processing capabilities of dense embeddings and LSTM networks to produce a prediction for the accepted answer attribute.
 arXiv  Detail & Related papers  (2022-10-16T18:47:41Z)
- Attention-based model for predicting question relatedness on Stack
  Overflow [0.0]
 We propose an Attention-based Sentence pair Interaction Model (ASIM) to predict the relatedness between questions on Stack Overflow automatically.
ASIM has made significant improvement over the baseline approaches in Precision, Recall, and Micro-F1 evaluation metrics.
Our model also performs well in the duplicate question detection task of Ask Ubuntu.
 arXiv  Detail & Related papers  (2021-03-19T12:18:03Z)
- The Influence of Domain-Based Preprocessing on Subject-Specific
  Clustering [55.41644538483948]
 The sudden change of moving the majority of teaching online at Universities has caused an increased amount of workload for academics.
One way to deal with this problem is to cluster these questions depending on their topic.
In this paper, we explore the realms of tagging data sets, focusing on identifying code excerpts and providing empirical results.
 arXiv  Detail & Related papers  (2020-11-16T17:47:19Z)
- Unification of HDP and LDA Models for Optimal Topic Clustering of
  Subject Specific Question Banks [55.41644538483948]
 An increase in the popularity of online courses would result in an increase in the number of course-related queries for academics.
In order to reduce the time spent on answering each individual question, clustering them is an ideal choice.
We use the Hierarchical Dirichlet Process to determine an optimal topic number input for our LDA model runs.
 arXiv  Detail & Related papers  (2020-10-04T18:21:20Z)
- Code to Comment "Translation": Data, Metrics, Baselining & Evaluation [49.35567240750619]
 We analyze several recent code-comment datasets for this task.
We compare them with WMT19, a standard dataset frequently used to train state of the art natural language translators.
We find some interesting differences between the code-comment data and the WMT19 natural language data.
 arXiv  Detail & Related papers  (2020-10-03T18:57:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     