GPT-Neo for commonsense reasoning -- a theoretical and practical lens
- URL: http://arxiv.org/abs/2211.15593v2
- Date: Wed, 27 Sep 2023 08:01:39 GMT
- Title: GPT-Neo for commonsense reasoning -- a theoretical and practical lens
- Authors: Rohan Kashyap, Vivek Kashyap, Narendra C.P.
- Abstract summary: We evaluate the performance of the GPT-Neo model on 6 commonsense reasoning benchmark tasks.
We aim to examine the performance of the smaller GPT-Neo models against several larger model baselines.
- Score: 0.46040036610482665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work has demonstrated substantial gains from pre-training large language
models (LLMs) followed by supervised fine-tuning on the downstream task. In
this paper, we evaluate the performance of the GPT-Neo model on 6
commonsense reasoning benchmark tasks. We aim to examine the performance of
the smaller GPT-Neo models against several larger model baselines
such as GPT-3, Llama-2, MPT, and Falcon. With an appropriate set of
fine-tuning hyperparameters, our model achieves competitive accuracy on
several tasks. We also investigate and substantiate our results using
attention-head visualization to better understand model performance.
Finally, we conduct robustness tests with several methods to gauge
model performance under a range of settings.
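As a rough illustration of attention-head inspection for a GPT-Neo model (a minimal sketch, not the authors' exact procedure), the snippet below loads a small checkpoint with the Hugging Face transformers library and plots one head's attention matrix; the checkpoint name, example sentence, and the layer/head indices are illustrative assumptions.

```python
# Minimal sketch: visualize one attention head of a small GPT-Neo model.
# Assumptions: transformers + matplotlib installed; checkpoint, layer, and
# head indices are illustrative, not the paper's configuration.
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "EleutherAI/gpt-neo-125M"  # smallest GPT-Neo checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

text = "The trophy doesn't fit in the suitcase because it is too big."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

layer, head = 5, 3                       # arbitrary choices for illustration
attn = out.attentions[layer][0, head]    # (seq_len, seq_len) attention weights
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title(f"GPT-Neo attention, layer {layer}, head {head}")
plt.tight_layout()
plt.show()
```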
Related papers
- GPT vs RETRO: Exploring the Intersection of Retrieval and Parameter-Efficient Fine-Tuning [48.71952325015267]
We apply PEFT methods to a modified Retrieval-Enhanced Transformer (RETRO) and a baseline GPT model across several sizes.
We show that RETRO models outperform GPT models in zero-shot settings due to their unique pre-training process.
This work presents the first comprehensive comparison of various PEFT methods integrated with RAG, applied to both GPT and RETRO models.
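To make the PEFT setting concrete, here is a minimal sketch of attaching LoRA adapters to a GPT-style model with the peft library; the base model, rank, scaling, and target modules are assumptions for illustration, not the configuration used in the paper.

```python
# Minimal sketch: parameter-efficient fine-tuning (LoRA) of a GPT-style model.
# Assumptions: transformers + peft installed; rank, alpha, and target modules
# are illustrative choices.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # low-rank dimension
    lora_alpha=16,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2 fused attention projection
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```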
arXiv Detail & Related papers (2024-07-05T14:16:47Z) - Efficiency at Scale: Investigating the Performance of Diminutive
Language Models in Clinical Tasks [2.834743715323873]
We present an investigation into the suitability of different PEFT methods to clinical decision-making tasks.
Our analysis shows that the performance of most PEFT approaches varies significantly from one task to another.
The effectiveness of PEFT methods in the clinical domain is evident, particularly for specialised models which can operate on low-cost, in-house computing infrastructure.
arXiv Detail & Related papers (2024-02-16T11:30:11Z) - Astraios: Parameter-Efficient Instruction Tuning Code Large Language
Models [21.17021844323919]
We introduce Astraios, a suite of 28 instruction-tuned OctoCoder models using 7 tuning methods and 4 model sizes up to 16 billion parameters.
We find that FFT leads to the best downstream performance across all scales, and PEFT methods differ significantly in their efficacy based on the model scale.
arXiv Detail & Related papers (2024-01-01T15:30:19Z) - PanGu-$\pi$: Enhancing Language Model Architectures via Nonlinearity
Compensation [97.78045712375047]
We present a new efficient model architecture for large language models (LLMs).
We show that PanGu-$\pi$-7B can achieve performance comparable to that of benchmark models with about a 10% inference speed-up.
In addition, we have deployed PanGu-$pi$-7B in the high-value domains of finance and law, developing an LLM named YunShan for practical application.
arXiv Detail & Related papers (2023-12-27T11:49:24Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - Revisiting Implicit Models: Sparsity Trade-offs Capability in
Weight-tied Model for Vision Tasks [4.872984658007499]
Implicit models such as Deep Equilibrium Models (DEQs) have garnered significant attention in the community for their ability to train infinite-layer models.
We revisit the line of implicit models and trace them back to the original weight-tied models.
Surprisingly, we observe that weight-tied models are more effective, stable, and efficient on vision tasks than the DEQ variants.
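As a rough sketch of the weight-tied idea (one shared block applied repeatedly, rather than a stack of distinct layers or a DEQ solved to a fixed point), the toy module below is an assumption-laden illustration, not the paper's architecture.

```python
# Minimal sketch of a weight-tied block: the same layer is applied for a fixed
# number of iterations with input injection at every step.
# Sizes and the iteration count are illustrative assumptions.
import torch
import torch.nn as nn

class WeightTiedBlock(nn.Module):
    def __init__(self, dim: int, n_iters: int = 8):
        super().__init__()
        self.layer = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.n_iters = n_iters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.zeros_like(x)
        for _ in range(self.n_iters):   # shared weights across iterations
            z = self.layer(z + x)       # inject the input at every step
        return z

block = WeightTiedBlock(dim=64)
print(block(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```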
arXiv Detail & Related papers (2023-07-16T11:45:35Z) - Model soups: averaging weights of multiple fine-tuned models improves
accuracy without increasing inference time [69.7693300927423]
We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
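A hedged sketch of the uniform-soup idea follows: average the parameters of several fine-tuned checkpoints that share one architecture and load the result back into the model. The checkpoint paths and helper name are hypothetical.

```python
# Minimal sketch of a uniform "model soup": average the state dicts of several
# fine-tuned checkpoints of the same architecture. Paths are illustrative.
import torch

def uniform_soup(checkpoint_paths):
    """Return a state dict whose tensors are the mean over all checkpoints."""
    soup = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(checkpoint_paths) for k, v in soup.items()}

# Usage with hypothetical checkpoint files, loading into the shared model:
# model.load_state_dict(uniform_soup(["run1.pt", "run2.pt", "run3.pt"]))
```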
arXiv Detail & Related papers (2022-03-10T17:03:49Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an extremely large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies k top-1 routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
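To give rough shape to top-1 routing over expert prototypes (a toy sketch under assumed sizes, not the paper's implementation), the snippet below splits experts into groups, routes each token to the single best expert within every group, and sums the group outputs.

```python
# Toy sketch of expert prototyping: experts are split into k groups
# ("prototypes"), each group routes every token to its single best expert
# (top-1), and the group outputs are summed. All sizes are illustrative.
import torch
import torch.nn as nn

class PrototypedMoE(nn.Module):
    def __init__(self, dim: int, experts_per_group: int = 4, k_groups: int = 2):
        super().__init__()
        self.groups = nn.ModuleList()
        self.gates = nn.ModuleList()
        for _ in range(k_groups):
            self.groups.append(nn.ModuleList(
                nn.Linear(dim, dim) for _ in range(experts_per_group)))
            self.gates.append(nn.Linear(dim, experts_per_group))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        out = torch.zeros_like(x)
        for experts, gate in zip(self.groups, self.gates):
            scores = gate(x).softmax(dim=-1)   # (tokens, experts_per_group)
            weight, idx = scores.max(dim=-1)   # top-1 expert per token
            for e, expert in enumerate(experts):
                mask = idx == e
                if mask.any():
                    out[mask] += weight[mask, None] * expert(x[mask])
        return out

moe = PrototypedMoE(dim=32)
print(moe(torch.randn(5, 32)).shape)  # torch.Size([5, 32])
```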
arXiv Detail & Related papers (2021-05-31T16:12:44Z)