CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers
for Analyzing Data Analysis
- URL: http://arxiv.org/abs/2008.12828v1
- Date: Fri, 28 Aug 2020 19:57:49 GMT
- Title: CORAL: COde RepresentAtion Learning with Weakly-Supervised Transformers
for Analyzing Data Analysis
- Authors: Ge Zhang, Mike A. Merrill, Yang Liu, Jeffrey Heer, Tim Althoff
- Abstract summary: Large scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process.
We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments.
We show that our model, leveraging only easily-available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines.
- Score: 33.190021245507445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large scale analysis of source code, and in particular scientific source
code, holds the promise of better understanding the data science process,
identifying analytical best practices, and providing insights to the builders
of scientific toolkits. However, large corpora have remained unanalyzed in
depth, as descriptive labels are absent and require expert domain knowledge to
generate. We propose a novel weakly supervised transformer-based architecture
for computing joint representations of code from both abstract syntax trees and
surrounding natural language comments. We then evaluate the model on a new
classification task for labeling computational notebook cells as stages in the
data analysis process from data import to wrangling, exploration, modeling, and
evaluation. We show that our model, leveraging only easily-available weak
supervision, achieves a 38% increase in accuracy over expert-supplied
heuristics and outperforms a suite of baselines. Our model enables us to
examine a set of 118,000 Jupyter Notebooks to uncover common data analysis
patterns. Focusing on notebooks with relationships to academic articles, we
conduct the largest ever study of scientific code and find that notebook
composition correlates with the citation count of corresponding papers.
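As a rough illustration of the approach described above, the sketch below shows how a notebook cell might be turned into the joint AST-plus-comment token sequence that a CORAL-style transformer would classify into a data analysis stage. This is a minimal sketch using Python's standard ast and tokenize modules; the function cell_to_tokens, the [SEP] marker, and the example variable names are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch (not the authors' code): merge a notebook cell's AST
# node types with the words of its comments into one token sequence,
# the kind of joint input a CORAL-style transformer could classify.
import ast
import io
import tokenize

# Stage labels taken from the paper's task description.
STAGES = ["import", "wrangle", "explore", "model", "evaluate"]

def cell_to_tokens(source: str) -> list:
    """Extract AST node-type tokens and comment words from one cell."""
    ast_tokens = [type(node).__name__ for node in ast.walk(ast.parse(source))]
    comment_words = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_words.extend(tok.string.lstrip("#").split())
    # Joint representation: code structure first, then natural language.
    return ast_tokens + ["[SEP]"] + comment_words

cell = "# fit a baseline classifier\nmodel = LogisticRegression().fit(X, y)\n"
print(cell_to_tokens(cell))
# -> ['Module', 'Assign', ..., '[SEP]', 'fit', 'a', 'baseline', 'classifier']
```

In the actual model, such sequences would be embedded and passed through transformer layers trained with weak supervision (e.g., heuristic stage labels) rather than expert annotations.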
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - DACO: Towards Application-Driven and Comprehensive Data Analysis via Code Generation [83.30006900263744]
Data analysis is a crucial analytical process to generate in-depth studies and conclusive insights.
We propose to automatically generate high-quality answer annotations leveraging the code-generation capabilities of LLMs.
Human annotators judge our DACO-RL algorithm to produce more helpful answers than the SFT model in 57.72% of cases.
arXiv Detail & Related papers (2024-03-04T22:47:58Z) - Towards Controlled Table-to-Text Generation with Scientific Reasoning [46.87189607486007]
We present a new task for generating fluent and logical descriptions that match user preferences over scientific data, aiming to automate scientific document analysis.
We construct a new challenging dataset, SciTab, consisting of table-description pairs extracted from the scientific literature, with highlighted cells and a corresponding domain-specific knowledge base.
The results showed that large models struggle to produce accurate content that aligns with user preferences. As the first of its kind, our work should motivate further research in scientific domains.
arXiv Detail & Related papers (2023-12-08T22:57:35Z) - Leveraging Contextual Information for Effective Entity Salience Detection [21.30389576465761]
We show that fine-tuning medium-sized language models with a cross-encoder style architecture yields substantial performance gains over feature engineering approaches.
We also show that zero-shot prompting of instruction-tuned language models yields inferior results, indicating the task's uniqueness and complexity.
arXiv Detail & Related papers (2023-09-14T19:04:40Z) - PharmKE: Knowledge Extraction Platform for Pharmaceutical Texts using
Transfer Learning [0.0]
PharmKE is a text analysis platform that applies deep learning through several stages for thorough semantic analysis of pharmaceutical articles.
The methodology is used to create accurately labeled training and test datasets, which are then used to train models for custom entity labeling tasks.
The results are compared against fine-tuned BERT and BioBERT models trained on the same dataset.
arXiv Detail & Related papers (2021-02-25T19:36:35Z) - What's New? Summarizing Contributions in Scientific Literature [85.95906677964815]
We introduce a new task of disentangled paper summarization, which seeks to generate separate summaries for the paper contributions and the context of the work.
We extend the S2ORC corpus of academic articles by adding disentangled "contribution" and "context" reference labels.
We propose a comprehensive automatic evaluation protocol which reports the relevance, novelty, and disentanglement of generated outputs.
arXiv Detail & Related papers (2020-11-06T02:23:01Z) - Method and Dataset Entity Mining in Scientific Literature: A CNN +
Bi-LSTM Model with Self-attention [21.93889297841459]
We propose a novel entity recognition model, called MDER, which is able to effectively extract the method and dataset entities from scientific papers.
We evaluate the proposed model on datasets constructed from the published papers of four research areas in computer science, i.e., NLP, CV, Data Mining and AI.
arXiv Detail & Related papers (2020-10-26T13:38:43Z) - A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.