A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification
- URL: http://arxiv.org/abs/2601.13288v1
- Date: Mon, 19 Jan 2026 18:40:29 GMT
- Title: A BERTology View of LLM Orchestrations: Token- and Layer-Selective Probes for Efficient Single-Pass Classification
- Authors: Gonzalo Ariel Meyoyan, Luciano Del Corro
- Abstract summary: Production LLM systems often rely on separate models for safety and other classification-heavy steps. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation.
- Score: 2.0069888187253615
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Production LLM systems often rely on separate models for safety and other classification-heavy steps, increasing latency, VRAM footprint, and operational complexity. We instead reuse computation already paid for by the serving LLM: we train lightweight probes on its hidden states and predict labels in the same forward pass used for generation. We frame classification as representation selection over the full token-layer hidden-state tensor, rather than committing to a fixed token or fixed layer (e.g., first-token logits or final-layer pooling). To implement this, we introduce a two-stage aggregator that (i) summarizes tokens within each layer and (ii) aggregates across layer summaries to form a single representation for classification. We instantiate this template with direct pooling, a 100K-parameter scoring-attention gate, and a downcast multi-head self-attention (MHA) probe with up to 35M trainable parameters. Across safety and sentiment benchmarks our probes improve over logit-only reuse (e.g., MULI) and are competitive with substantially larger task-specific baselines, while preserving near-serving latency and avoiding the VRAM and latency costs of a separate guard-model pipeline.
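The two-stage aggregator described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: it assumes mean pooling for stage (i) and a small scoring gate over layer summaries for stage (ii); all class names, dimensions, and the specific pooling choices are hypothetical.

```python
import torch
import torch.nn as nn

class TwoStageProbe(nn.Module):
    """Hedged sketch of a token-then-layer aggregator over the full
    token-layer hidden-state tensor, followed by a linear classifier.
    Stage (i): summarize tokens within each layer (mean pooling here).
    Stage (ii): aggregate across layer summaries (learned scoring gate).
    """

    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.layer_scorer = nn.Linear(hidden_dim, 1)   # scores each layer summary
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, num_tokens, hidden_dim) for one sequence,
        # as captured during the LLM's generation forward pass.
        layer_summaries = hidden_states.mean(dim=1)            # (L, H) — stage (i)
        weights = self.layer_scorer(layer_summaries).softmax(dim=0)  # (L, 1)
        pooled = (weights * layer_summaries).sum(dim=0)        # (H,)  — stage (ii)
        return self.classifier(pooled)                         # (num_classes,)

# Toy usage: 8 layers, 16 tokens, hidden size 64, binary safety label.
probe = TwoStageProbe(hidden_dim=64, num_classes=2)
logits = probe(torch.randn(8, 16, 64))
```

Because the probe consumes hidden states already produced for generation, classification adds only the probe's forward cost on top of serving, with no second model in VRAM.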
Related papers
- Towards Improved Sentence Representations using Token Graphs [41.412173502714225]
We introduce GLOT, a structure-aware pooling module that reframes pooling as relational learning followed by aggregation. On a diagnostic stress test where 90% of tokens are random distractors, GLOT maintains over 97% accuracy while baseline methods collapse. It is competitive with state-of-the-art techniques on benchmarks like GLUE and MTEB with 20x fewer trainable parameters, and speeds up training by over 100x compared with parameter-efficient fine-tuning methods.
arXiv Detail & Related papers (2026-03-03T09:00:01Z) - LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging [11.135582038431368]
We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule. Layer-wise analysis and corruption tests indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features.
arXiv Detail & Related papers (2026-02-10T05:10:31Z) - SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs [59.415473779171315]
We propose a novel visual token pruning strategy called Saliency-Coverage Oriented token Pruning for Efficient MLLMs.
arXiv Detail & Related papers (2025-10-28T09:29:37Z) - ConformalSAM: Unlocking the Potential of Foundational Segmentation Models in Semi-Supervised Semantic Segmentation with Conformal Prediction [57.930531826380836]
This work explores whether a foundational segmentation model can address label scarcity in pixel-level vision tasks as an annotator for unlabeled images. We propose ConformalSAM, a novel SSSS framework which first calibrates the foundation model using the target domain's labeled data and then filters out unreliable pixel labels of unlabeled data.
arXiv Detail & Related papers (2025-07-21T17:02:57Z) - Grokking in LLM Pretraining? Monitor Memorization-to-Generalization without Test [22.499052329934603]
We investigate when an LLM memorizes the training data, when its generalization on downstream tasks starts to improve, and what happens if there is a lag between the two. Our study, for the first time, shows that grokking still emerges in pretraining mixture-of-experts (MoE) models. Our primary discovery is that the pathways evolve from random, non-smooth across layers, and instance-specific to more structured and transferable across samples, despite the converged pretraining loss.
arXiv Detail & Related papers (2025-06-26T17:59:58Z) - Few-Shot Learning for Industrial Time Series: A Comparative Analysis Using the Example of Screw-Fastening Process Monitoring [0.0]
Few-shot learning has shown promise in vision but remains largely unexplored for industrial time-series data. We present a systematic FSL study on screw-fastening process monitoring, using a 2,300-sample multivariate torque dataset. We introduce a label-aware episodic sampler that collapses multi-label sequences into multiple single-label tasks.
arXiv Detail & Related papers (2025-06-16T18:38:34Z) - DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding [7.204881999658682]
We introduce DEL, a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL achieves overall speedups of 2.16x to 2.62x over vanilla auto-regressive decoding.
arXiv Detail & Related papers (2025-04-08T01:12:59Z) - Beyond Next Token Probabilities: Learnable, Fast Detection of Hallucinations and Data Contamination on LLM Output Distributions [60.43398881149664]
We introduce LOS-Net, a lightweight attention-based architecture trained on an efficient encoding of the LLM Output Signature. It achieves superior performance across diverse benchmarks and LLMs, while maintaining extremely low detection latency.
arXiv Detail & Related papers (2025-03-18T09:04:37Z) - SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement [40.37217744643069]
We propose a universal and efficient approach by adapting SAM to the mask refinement task. Specifically, we introduce a multi-prompt excavation strategy to mine diverse input prompts for SAM. We extend our method to SAMRefiner++ by introducing an additional IoU adaption step to further boost the performance of the generic SAMRefiner on the target dataset.
arXiv Detail & Related papers (2025-02-10T18:33:15Z) - Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to few-shot semantic segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z) - Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z) - Semi-Supervised Temporal Action Detection with Proposal-Free Masking [134.26292288193298]
We propose a novel Semi-supervised Temporal action detection model based on PropOsal-free Temporal mask (SPOT)
SPOT outperforms state-of-the-art alternatives, often by a large margin.
arXiv Detail & Related papers (2022-07-14T16:58:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.