PHUDGE: Phi-3 as Scalable Judge
- URL: http://arxiv.org/abs/2405.08029v2
- Date: Wed, 15 May 2024 05:46:01 GMT
- Title: PHUDGE: Phi-3 as Scalable Judge
- Authors: Mahesh Deshwal, Apoorva Chawla,
- Abstract summary: We present Phi3 model that achieved SOTA results in 4 tasks as Feedback Test, Feedback OOD, MT Human, Preference Test.
It shows very strong correlation not only with GPT4 but with Human annotators too in unseen data as well as in both absolute and relative grading tasks.
We show that by following systematic ML experimentation, thoughtful data augmentation and reposing the problem itself, we can even beat 10x bigger models even with lesser training data.
- Score: 1.7495213911983414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper cum technical report, we present PHUDGE A fine tuned Phi3 model that achieved SOTA results in 4 tasks as Feedback Test, Feedback OOD, MT Human, Preference Test surpassing each and every existing model in latency and throughput. It shows very strong correlation not only with GPT4 but with Human annotators too in unseen data as well as in both absolute and relative grading tasks. We have not only addressed the usage of small LMs for cost effective production grade systems but have also shown that Causal modelling is not only slow in nature but sometimes it can hinder models learning capabilities and should be replaced by simpler tasks whenever we can to make the overall system faster and better. We show that by following systematic ML experimentation, thoughtful data augmentation and re purposing the problem itself, we can even beat 10x bigger models even with lesser training data. To the best of our knowledge, we are re the first one to experiment and showcase the usage of generalised version of Earth Movers Distance AKA Wasserstein distance by using Minkowski Distance with a penalty to control loss smoothing and can be used as a loss function instead of Cross Entropy to get stable training and better results for grading tasks.
Related papers
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build an reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z) - PUMA: margin-based data pruning [51.12154122266251]
We focus on data pruning, where some training samples are removed based on the distance to the model classification boundary (i.e., margin)
We propose PUMA, a new data pruning strategy that computes the margin using DeepFool.
We show that PUMA can be used on top of the current state-of-the-art methodology in robustness, and it is able to significantly improve the model performance unlike the existing data pruning strategies.
arXiv Detail & Related papers (2024-05-10T08:02:20Z) - An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question if the textitextremely simple lightweight ViTs' fine-tuning performance can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design ($5.7M$/$6.5M$) can achieve $79.4%$/$78.9%$ top-1 accuracy on ImageNet-1
arXiv Detail & Related papers (2024-04-18T14:14:44Z) - FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with
Pre-trained Vision-Language Models [62.663113296987085]
Few-shot class-incremental learning aims to mitigate the catastrophic forgetting issue when a model is incrementally trained on limited data.
We introduce two novel components: the Redundant Feature Eliminator (RFE) and the Spatial Noise Compensator (SNC)
Considering the imbalance in existing 3D datasets, we also propose new evaluation metrics that offer a more nuanced assessment of a 3D FSCIL model.
arXiv Detail & Related papers (2023-12-28T14:52:07Z) - Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models [115.501751261878]
Fine-tuning language models(LMs) on human-generated data remains a prevalent practice.
We investigate whether we can go beyond human data on tasks where we have access to scalar feedback.
We find that ReST$EM$ scales favorably with model size and significantly surpasses fine-tuning only on human data.
arXiv Detail & Related papers (2023-12-11T18:17:43Z) - More is Better in Modern Machine Learning: when Infinite Overparameterization is Optimal and Overfitting is Obligatory [12.689249854199982]
We show that the test risk of RF regression decreases monotonically with both the number of features and the number of samples.
We then demonstrate that, for a large class of tasks characterized by powerlaw eigenstructure, training to near-zero training loss is obligatory.
arXiv Detail & Related papers (2023-11-24T18:27:41Z) - Are Sample-Efficient NLP Models More Robust? [90.54786862811183]
We investigate the relationship between sample efficiency (amount of data needed to reach a given ID accuracy) and robustness (how models fare on OOD evaluation)
We find that higher sample efficiency is only correlated with better average OOD robustness on some modeling interventions and tasks, but not others.
These results suggest that general-purpose methods for improving sample efficiency are unlikely to yield universal OOD robustness improvements, since such improvements are highly dataset- and task-dependent.
arXiv Detail & Related papers (2022-10-12T17:54:59Z) - Self-Supervised Monocular Depth Estimation: Solving the Edge-Fattening
Problem [39.82550656611876]
Triplet loss, popular for metric learning, has made a great success in many computer vision tasks.
We show two drawbacks of the raw triplet loss in MDE and demonstrate our problem-driven redesigns.
arXiv Detail & Related papers (2022-10-02T03:08:59Z) - LiteDepth: Digging into Fast and Accurate Depth Estimation on Mobile
Devices [45.84356762066717]
We develop an end-to-end learning-based model with a tiny weight size (1.4MB) and a short inference time (27FPS on Raspberry Pi 4.
We propose a simple yet effective data augmentation strategy, called R2 crop, to boost the model performance.
Notably, our solution named LiteDepth ranks 2nd in the MAI&AIM2022 Monocular Depth Estimation Challenge, with a si-RMSE of 0.311, an RMSE of 3.79, and the inference time is 37$ms$ tested on the Raspberry Pi 4.
arXiv Detail & Related papers (2022-09-02T11:38:28Z) - An Efficient Method of Training Small Models for Regression Problems
with Knowledge Distillation [1.433758865948252]
We propose a new formalism of knowledge distillation for regression problems.
First, we propose a new loss function, teacher outlier loss rejection, which rejects outliers in training samples using teacher model predictions.
By considering the multi-task network, training of the feature extraction of student models becomes more effective.
arXiv Detail & Related papers (2020-02-28T08:46:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.