Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts
- URL: http://arxiv.org/abs/2402.07625v4
- Date: Tue, 18 Feb 2025 01:02:54 GMT
- Title: Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts
- Authors: Yifan Zhang, Yifan Luo, Yang Yuan, Andrew Chi-Chih Yao,
- Abstract summary: We present Autonomous Data Selection (AutoDS), a method that automatically curates high-quality mathematical texts.
Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits.
We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation.
- Score: 13.789739307267952
- License:
- Abstract: We present Autonomous Data Selection (AutoDS), a method that leverages base language models themselves as zero-shot "generative classifiers" to automatically curate high-quality mathematical texts. Unlike prior approaches that require human annotations or training a dedicated data filter, AutoDS relies solely on a model's logits to determine whether a given passage is mathematically informative and educational. By integrating AutoDS into a continual pretraining pipeline, we substantially boost downstream performance on challenging math benchmarks (MATH, GSM8K, and BBH) while using far fewer tokens than previous methods. Empirically, our approach achieves roughly a twofold improvement in pretraining token efficiency over strong baselines, underscoring the potential of self-directed data selection in enhancing mathematical reasoning. We release our curated AutoMathText dataset to facilitate future research in automated domain-specific data curation. The AutoMathText dataset is available at https://huggingface.co/datasets/math-ai/AutoMathText. The code is available at https://github.com/yifanzhang-pro/AutoMathText.
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - Automatic Short Math Answer Grading via In-context Meta-learning [2.0263791972068628]
We study the problem of automatic short answer grading for students' responses to math questions.
We use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model.
Second, we use an in-context learning approach that provides scoring examples as input to the language model.
arXiv Detail & Related papers (2022-05-30T16:26:02Z) - Syntax-Aware Network for Handwritten Mathematical Expression Recognition [53.130826547287626]
Handwritten mathematical expression recognition (HMER) is a challenging task that has many potential applications.
Recent methods for HMER have achieved outstanding performance with an encoder-decoder architecture.
We propose a simple and efficient method for HMER, which is the first to incorporate syntax information into an encoder-decoder network.
arXiv Detail & Related papers (2022-03-03T09:57:19Z) - Neuro-Symbolic Language Modeling with Automaton-augmented Retrieval [129.25914272977542]
RetoMaton is a weighted finite automaton built on top of the datastore.
Traversing this automaton at inference time, in parallel to the LM inference, reduces its perplexity.
arXiv Detail & Related papers (2022-01-28T21:38:56Z) - Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z) - Detecting Handwritten Mathematical Terms with Sensor Based Data [71.84852429039881]
We propose a solution to the UbiComp 2021 Challenge by Stabilo in which handwritten mathematical terms are supposed to be automatically classified.
The input data set contains data of different writers, with label strings constructed from a total of 15 different possible characters.
arXiv Detail & Related papers (2021-09-12T19:33:34Z) - Privileged Zero-Shot AutoML [16.386335031156]
This work improves the quality of automated machine learning (AutoML) systems by using dataset and function descriptions.
We show that zero-shot AutoML reduces running and prediction times from minutes to milliseconds, consistently across datasets.
arXiv Detail & Related papers (2021-06-25T16:31:05Z) - LaTeX-Numeric: Language-agnostic Text attribute eXtraction for
E-commerce Numeric Attributes [0.25782420501870296]
We present high-precision fully-automated scalable framework for extracting E-commerce numeric attributes from product text.
We propose a multi-task architecture to deal with missing labels in attribute data, leading to F1 improvement of 9.2% for numeric attributes over single-task architecture.
We propose an automated algorithm for alias creation using attribute values, leading to a 20.2% F1 improvement.
arXiv Detail & Related papers (2021-04-19T19:14:32Z) - Data Augmentation in Natural Language Processing: A Novel Text
Generation Approach for Long and Short Text Classifiers [8.19984844136462]
We present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts.
In a simulated low data regime additive accuracy gains of up to 15.53% are achieved.
We discuss implications and patterns for the successful application of our approach on different types of datasets.
arXiv Detail & Related papers (2021-03-26T13:16:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.