LF-checker: Machine Learning Acceleration of Bounded Model Checking for
Concurrency Verification (Competition Contribution)
- URL: http://arxiv.org/abs/2301.09142v1
- Date: Sun, 22 Jan 2023 15:46:57 GMT
- Title: LF-checker: Machine Learning Acceleration of Bounded Model Checking for
Concurrency Verification (Competition Contribution)
- Authors: Tong Wu and Edoardo Manino and Fatimah Aljaafari and Pavlos Petoumenos
and Lucas C. Cordeiro
- Abstract summary: We describe and evaluate LF-checker, a metaverifier tool based on machine learning.
It extracts multiple features of the program under test and predicts the optimal configuration of a bounded model checker with a decision tree.
- Score: 15.600922269515756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe and evaluate LF-checker, a metaverifier tool based on machine
learning. It extracts multiple features of the program under test and predicts
the optimal configuration (flags) of a bounded model checker with a decision
tree. Our current work is specialised in concurrency verification and employs
ESBMC as a back-end verification engine. In the paper, we demonstrate that
LF-checker achieves better results than the default configuration of the
underlying verification engine.
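To make the workflow concrete, here is a minimal sketch of the idea the abstract describes: train a decision tree that maps features of the program under test to a bounded-model-checker configuration, then query it for new programs. The feature set, candidate ESBMC flag strings, and training data below are illustrative placeholders and are not taken from the LF-checker implementation; only the overall pattern (extract features, predict flags, run the verifier with them) follows the abstract.

```python
# Minimal sketch of the LF-checker idea: learn a mapping from program
# features to a bounded-model-checker configuration with a decision tree.
# Feature names, flag strings, and training data are illustrative
# placeholders, not the actual ones used by LF-checker.
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features of each training program:
# [number of threads, number of loops, number of atomic blocks]
training_features = [
    [2, 1, 0],
    [8, 4, 2],
    [1, 0, 0],
]

# Hypothetical ESBMC flag strings that performed best on those programs.
training_labels = [
    "--unwind 2 --context-bound 2",
    "--unwind 8 --incremental-bmc",
    "--unwind 1",
]

# Fit a shallow decision tree over the labelled examples.
classifier = DecisionTreeClassifier(max_depth=3, random_state=0)
classifier.fit(training_features, training_labels)

def predict_config(program_features):
    """Predict a flag string for a new program under test."""
    return classifier.predict([program_features])[0]

if __name__ == "__main__":
    # Example: a program with 4 threads, 2 loops, and 1 atomic block.
    flags = predict_config([4, 2, 1])
    print(f"Run: esbmc program.c {flags}")
```

In the tool itself, the predicted configuration would be passed to ESBMC, the back-end verification engine, which then attempts to verify the concurrent program under the chosen bounds.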
Related papers
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.
It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.
We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models [55.39064621869925]
OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning.
A key component of their training is the incorporation of verifiable rewards within reinforcement learning.
Existing reward benchmarks do not evaluate reference-based reward systems.
arXiv Detail & Related papers (2025-05-21T17:54:43Z)
- The Geometry of Self-Verification in a Task-Specific Reasoning Model [45.669264589017665]
We train a model using DeepSeek R1's recipe on the CountDown task.
We do a top-down and bottom-up analysis to reverse-engineer how the model verifies its outputs.
arXiv Detail & Related papers (2025-04-19T18:40:51Z)
- Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers [36.1723136776532]
Multi-Agent Verification (MAV) is a test-time compute paradigm that combines multiple verifiers to improve performance.
We introduce BoN-MAV, a simple multi-agent verification algorithm that combines best-of-n sampling with multiple verifiers.
Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test-time.
arXiv Detail & Related papers (2025-02-27T18:53:30Z)
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
- BanditCAT and AutoIRT: Machine Learning Approaches to Computerized Adaptive Testing and Item Calibration [7.261063083251448]
We present a complete framework for calibrating and administering a robust large-scale computerized adaptive test (CAT) with a small number of responses.
We use AutoIRT, a new method that uses automated machine learning (AutoML) in combination with item response theory (IRT).
We propose the BanditCAT framework, a methodology motivated by casting the problem in the contextual bandit framework and utilizing item response theory (IRT).
arXiv Detail & Related papers (2024-10-28T13:54:10Z)
- Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning [53.241569810013836]
We propose a novel framework that utilizes large language models (LLMs) to identify effective feature generation rules.
We use decision trees to convey this reasoning information, as they can be easily represented in natural language.
OCTree consistently enhances the performance of various prediction models across diverse benchmarks.
arXiv Detail & Related papers (2024-06-12T08:31:34Z)
- Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models [5.463333911506443]
We aim to enhance the self-checking capabilities of large language models (LLMs) by constructing training data for checking tasks.
We propose a specialized checking format called "Step CoT Check".
Experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs.
arXiv Detail & Related papers (2024-02-20T14:23:23Z)
- Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers [121.53749383203792]
We present a holistic end-to-end solution for annotating the factuality of large language models (LLMs)-generated responses.
We construct an open-domain document-level factuality benchmark with three levels of granularity: claim, sentence, and document.
Preliminary experiments show that FacTool, FactScore, and Perplexity struggle to identify false claims.
arXiv Detail & Related papers (2023-11-15T14:41:57Z)
- Sparse Conditional Hidden Markov Model for Weakly Supervised Named Entity Recognition [68.68300358332156]
We propose the sparse conditional hidden Markov model (Sparse-CHMM) to evaluate noisy labeling functions.
Sparse-CHMM is optimized through unsupervised learning with a three-stage training pipeline.
It achieves a 3.01 average F1 score improvement on five comprehensive datasets.
arXiv Detail & Related papers (2022-05-27T20:47:30Z)
- On the Limits of Evaluating Embodied Agent Model Generalization Using Validation Sets [101.28658250723804]
This paper experiments with augmenting a transformer model with modules that effectively utilize a wider field of view and learn to choose whether the next step requires a navigation or manipulation action.
We observe that the proposed modules resulted in improved, and in fact state-of-the-art, performance on an unseen validation set of a popular benchmark dataset, ALFRED.
We highlight this result because we believe it may reflect a wider phenomenon in machine learning tasks that is primarily noticeable only in benchmarks that limit evaluations on test splits.
arXiv Detail & Related papers (2022-05-18T23:52:21Z)
- Scaling up Memory-Efficient Formal Verification Tools for Tree Ensembles [2.588973722689844]
We formalise and extend the VoTE algorithm presented earlier as a tool description.
We show how the separation of property checking from the core verification engine enables verification of versatile requirements.
We demonstrate the application of the tool in two case studies, namely digit recognition and aircraft collision avoidance.
arXiv Detail & Related papers (2021-05-06T11:50:22Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, include a set-based global loss.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.