CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks
- URL: http://arxiv.org/abs/2507.11742v1
- Date: Tue, 15 Jul 2025 21:14:08 GMT
- Title: CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks
- Authors: Meng Li, Timothy M. McPhillips, Dingmin Wang, Shin-Rong Tsai, Bertram Ludäscher,
- Abstract summary: Recognizing the information flows and operations in data science and machine learning Python notebooks is critical for evaluating, reusing, and adapting them for new tasks. Large Language Models fail to understand some realistic notebooks due to hallucinations and long-context challenges. We propose a notebook understanding task yielding an information flow graph and corresponding cell execution dependency graph for a notebook. We evaluate and demonstrate the effectiveness of our approach using an annotated dataset of 50 representative, highly up-voted Kaggle notebooks.
- Score: 6.16712273021723
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing the information flows and operations comprising data science and machine learning Python notebooks is critical for evaluating, reusing, and adapting notebooks for new tasks. Investigating a notebook via re-execution often is impractical due to the challenges of resolving data and software dependencies. While Large Language Models (LLMs) pre-trained on large codebases have demonstrated effectiveness in understanding code without running it, we observe that they fail to understand some realistic notebooks due to hallucinations and long-context challenges. To address these issues, we propose a notebook understanding task yielding an information flow graph and corresponding cell execution dependency graph for a notebook, and demonstrate the effectiveness of a pincer strategy that uses limited syntactic analysis to assist full comprehension of the notebook using an LLM. Our Capture and Resolve Assisted Bounding Strategy (CRABS) employs shallow syntactic parsing and analysis of the abstract syntax tree (AST) to capture the correct interpretation of a notebook between lower and upper estimates of the inter-cell I/O sets, then uses an LLM to resolve remaining ambiguities via cell-by-cell zero-shot learning, thereby identifying the true data inputs and outputs of each cell. We evaluate and demonstrate the effectiveness of our approach using an annotated dataset of 50 representative, highly up-voted Kaggle notebooks that together represent 3454 actual cell inputs and outputs. The LLM correctly resolves 1397 of 1425 (98%) ambiguities left by analyzing the syntactic structure of these notebooks. Across 50 notebooks, CRABS achieves average F1 scores of 98% identifying cell-to-cell information flows and 99% identifying transitive cell execution dependencies.
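The abstract outlines a pincer strategy: shallow AST analysis brackets each cell's data inputs and outputs between lower and upper estimates, and an LLM then resolves the remaining ambiguities cell by cell. The sketch below illustrates only the syntactic "capture" side under simplifying assumptions; the function names (`cell_io_bounds`, `information_flow_edges`) are illustrative and not taken from the CRABS code, and the heuristics shown are a plausible reading of the abstract rather than the paper's actual analysis.

```python
# Minimal sketch (not the paper's implementation): per-cell AST analysis yields
# candidate outputs and candidate inputs; unresolved cases (e.g., method calls
# that may mutate their receiver) are the ambiguities left for the LLM.
import ast
import builtins

_BUILTINS = set(dir(builtins))

def cell_io_bounds(cell_source: str):
    """Return (candidate_outputs, candidate_inputs) for one notebook cell."""
    tree = ast.parse(cell_source)
    stores, loads = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            (stores if isinstance(node.ctx, (ast.Store, ast.Del)) else loads).add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            stores.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                stores.add(alias.asname or alias.name.split(".")[0])
    # Loaded names the cell itself never binds (and that are not builtins) are
    # upper-bound candidates for inputs flowing in from earlier cells. A name
    # both read and rebound in one cell (e.g. df = df.dropna()) is exactly the
    # kind of case the full approach brackets between lower and upper estimates.
    candidate_inputs = {n for n in loads if n not in stores and n not in _BUILTINS}
    return stores, candidate_inputs

def information_flow_edges(cells):
    """Link each candidate input of cell j to the latest earlier cell that may
    have produced that name; mutation and aliasing cases remain ambiguous."""
    edges, last_writer = [], {}
    for j, source in enumerate(cells):
        outputs, inputs = cell_io_bounds(source)
        for name in inputs:
            if name in last_writer:
                edges.append((last_writer[name], j, name))
        for name in outputs:
            last_writer[name] = j
    return edges
```

Passing the per-cell sets together with the unresolved cases to an LLM, one cell at a time, corresponds to the "resolve" half of the pincer described in the abstract; the transitive closure of the resulting information flow edges gives the cell execution dependency graph.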
Related papers
- Language Bottleneck Models: A Framework for Interpretable Knowledge Tracing and Beyond [55.984684518346924]
We recast Knowledge Tracing as an inverse problem: learning the minimum natural-language summary that makes past answers explainable and future answers predictable. Our Language Bottleneck Model (LBM) consists of an encoder LLM that writes an interpretable knowledge summary and a frozen decoder LLM that must reconstruct and predict student responses using only that summary text. Experiments on synthetic arithmetic benchmarks and the large-scale Eedi dataset show that LBMs rival the accuracy of state-of-the-art KT and direct LLM methods while requiring orders-of-magnitude fewer student trajectories.
arXiv Detail & Related papers (2025-06-20T13:21:14Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on only 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics [1.3707925738322797]
We propose a new metric called Leniency, which quantifies evaluation strictness relative to expert assessment. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings.
arXiv Detail & Related papers (2025-03-31T11:59:43Z) - Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding [71.01099784480597]
Large language models (LLMs) excel at a range of tasks through in-context learning (ICL). We introduce In-Context Contrastive Decoding (ICCD), a novel method that emphasizes input-label mapping.
arXiv Detail & Related papers (2025-02-19T14:04:46Z) - Learning Task Representations from In-Context Learning [73.72066284711462]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning. We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads. We show that our method's effectiveness stems from aligning the distribution of the last hidden state with that of an optimally performing in-context-learned model.
arXiv Detail & Related papers (2025-02-08T00:16:44Z) - Instructive Code Retriever: Learn from Large Language Model's Feedback for Code Intelligence Tasks [10.867880635762395]
We introduce a novel approach named Instructive Code Retriever (ICR).
ICR is designed to retrieve examples that enhance model inference across various code intelligence tasks and datasets.
We evaluate our model's effectiveness on various tasks, i.e., code summarization, program synthesis, and bug fixing.
arXiv Detail & Related papers (2024-10-15T05:44:00Z) - Interactive Analysis of LLMs using Meaningful Counterfactuals [22.755345889167934]
Counterfactual examples are useful for exploring the decision boundaries of machine learning models.
How can we apply counterfactual-based methods to analyze and explain LLMs?
We propose a novel algorithm for generating batches of complete and meaningful textual counterfactuals.
In our experiments, 97.2% of the counterfactuals are grammatically correct.
arXiv Detail & Related papers (2024-04-23T19:57:03Z) - Analyzing LLM Usage in an Advanced Computing Class in India [4.580708389528142]
This study examines the use of large language models (LLMs) by undergraduate and graduate students for programming assignments in advanced computing classes.
We conducted a comprehensive analysis involving 411 students from a Distributed Systems class at an Indian university.
arXiv Detail & Related papers (2024-04-06T12:06:56Z) - Efficient End-to-end Language Model Fine-tuning on Graphs [21.23522552579571]
Learning from Text-Attributed Graphs (TAGs) has attracted significant attention due to its wide range of real-world applications.
We introduce LEADING, a novel and efficient approach for end-to-end fine-tuning of language models on TAGs.
Our proposed approach demonstrates superior performance, achieving state-of-the-art (SOTA) results on the ogbn-arxiv leaderboard.
arXiv Detail & Related papers (2023-12-07T22:35:16Z) - Attention Satisfies: A Constraint-Satisfaction Lens on Factual Errors of Language Models [38.79074982172423]
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text.
We propose modeling factual queries as constraint satisfaction problems.
We find a strong positive relationship between the LLM's attention to constraint tokens and the factual accuracy of generations.
arXiv Detail & Related papers (2023-09-26T17:48:55Z) - Active Learning for Abstractive Text Summarization [50.79416783266641]
We propose the first effective query strategy for Active Learning in abstractive text summarization.
We show that using our strategy in AL annotation helps to improve the model performance in terms of ROUGE and consistency scores.
arXiv Detail & Related papers (2023-01-09T10:33:14Z) - Natural Language to Code Generation in Interactive Data Science Notebooks [35.621936471322385]
We build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks.
We develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs.
arXiv Detail & Related papers (2022-12-19T05:06:00Z) - MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics [55.85042753772513]
We introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations (MOCHA).
Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations.
When we evaluate on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
arXiv Detail & Related papers (2020-10-07T20:22:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.