Comparative Study of Language Models on Cross-Domain Data with Model
Agnostic Explainability
- URL: http://arxiv.org/abs/2009.04095v1
- Date: Wed, 9 Sep 2020 04:31:44 GMT
- Title: Comparative Study of Language Models on Cross-Domain Data with Model
Agnostic Explainability
- Authors: Mayank Chhipa, Hrushikesh Mahesh Vazurkar, Abhijeet Kumar, Mridul
Mishra
- Abstract summary: The study compares the state-of-the-art language models BERT and ELECTRA, along with the BERT derivatives RoBERTa, ALBERT and DistilBERT.
The experimental results establish a new state of the art for the Yelp 2013 rating classification task and the Financial Phrasebank sentiment detection task, with 69% and 88.2% accuracy respectively.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the recent influx of bidirectional contextualized transformer language
models in NLP, a systematic comparative study of these models on a variety of
datasets has become a necessity. Moreover, the performance of these language
models has not been explored on non-GLUE datasets. The study presented in this
paper compares the state-of-the-art language models BERT and ELECTRA, along
with the BERT derivatives RoBERTa, ALBERT and DistilBERT. We conducted
experiments by fine-tuning these models on cross-domain and disparate data and
provide an in-depth analysis of the models' performance. Moreover, we present
an explainability analysis, coherent with pretraining, that verifies the
context-capturing capabilities of these models through a model-agnostic
approach. The experimental results establish a new state of the art for the
Yelp 2013 rating classification task and the Financial Phrasebank sentiment
detection task, with 69% and 88.2% accuracy respectively. Finally, the study
presented here can greatly assist industry researchers in choosing a language
model effectively, in terms of either performance or compute efficiency.
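The abstract describes fine-tuning each model on cross-domain data and then verifying context capture with a model-agnostic explainability method, but the exact pipeline is not spelled out here. The sketch below is therefore only an illustration of that kind of setup, assuming DistilBERT fine-tuned on the Financial Phrasebank sentiment task with Hugging Face transformers, followed by a LIME explanation of a single prediction; the model choice, data split, hyperparameters, and the use of LIME are assumptions, not the authors' reported configuration.

```python
# Illustrative sketch only: fine-tune DistilBERT on Financial Phrasebank and explain a
# prediction with LIME. Model, split, hyperparameters, and LIME itself are assumptions,
# not the configuration reported in the paper.
import torch
from datasets import load_dataset
from lime.lime_text import LimeTextExplainer
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "distilbert-base-uncased"  # any of the compared models could be swapped in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Financial Phrasebank: 3-way sentiment (0 = negative, 1 = neutral, 2 = positive).
data = load_dataset("financial_phrasebank", "sentences_allagree")["train"]
data = data.train_test_split(test_size=0.2, seed=42)  # assumed split

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True,
                     padding="max_length", max_length=128)

encoded = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="fpb-distilbert",
    num_train_epochs=3,                 # assumed hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"], eval_dataset=encoded["test"])
trainer.train()
model.eval()

def predict_proba(texts):
    """LIME passes perturbed strings; return class probabilities for each."""
    inputs = tokenizer(list(texts), truncation=True, padding=True,
                       max_length=128, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).cpu().numpy()

explainer = LimeTextExplainer(class_names=["negative", "neutral", "positive"])
sample = data["test"][0]["sentence"]
explanation = explainer.explain_instance(sample, predict_proba,
                                         num_features=6, top_labels=1)
top = explanation.available_labels()[0]
print(f"Predicted class {top}; most influential tokens: {explanation.as_list(label=top)}")
```

The token weights returned by LIME are the kind of evidence one would inspect to check whether a model relies on sentiment-bearing words rather than spurious cues.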
Related papers
- Linguistically Grounded Analysis of Language Models using Shapley Head Values [2.914115079173979]
We investigate the processing of morphosyntactic phenomena by leveraging a recently proposed method for probing language models via Shapley Head Values (SHVs).
Using the English language BLiMP dataset, we test our approach on two widely used models, BERT and RoBERTa, and compare how linguistic constructions are handled.
Our results show that SHV-based attributions reveal distinct patterns across both models, providing insights into how language models organize and process linguistic information.
arXiv Detail & Related papers (2024-10-17T09:48:08Z)
- Multilingual Models for Check-Worthy Social Media Posts Detection [0.552480439325792]
The study includes a comprehensive analysis of different models, with a special focus on multilingual models.
The novelty of this work lies in the development of multi-label multilingual classification models that can simultaneously detect harmful posts and posts that contain verifiable factual claims in an efficient way.
arXiv Detail & Related papers (2024-08-13T08:55:28Z)
- PRobELM: Plausibility Ranking Evaluation for Language Models [12.057770969325453]
PRobELM is a benchmark designed to assess language models' ability to discern more plausible scenarios through their parametric knowledge.
Our benchmark is constructed from a dataset curated from Wikidata edit histories, tailored to align the temporal bounds of the training data for the evaluated models.
arXiv Detail & Related papers (2024-04-04T21:57:11Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that using phonetic and graphic information reasonably is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects the shortcomings of models.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z)
- Sensitivity and Robustness of Large Language Models to Prompt Template in Japanese Text Classification Tasks [0.0]
A critical issue has been identified within this domain: the inadequate sensitivity and robustness of large language models towards Prompt templates.
This paper explores this issue through a comprehensive evaluation of several representative Large Language Models (LLMs) and a widely utilized pre-trained model (PLM).
Our experimental results reveal startling discrepancies. A simple modification in the sentence structure of the Prompt template led to a drastic drop in the accuracy of GPT-4 from 49.21 to 25.44.
arXiv Detail & Related papers (2023-05-15T15:19:08Z)
- mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z)
- An Empirical Investigation of Commonsense Self-Supervision with Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict based models is more apparent than real, and surely not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- Towards Trustworthy Deception Detection: Benchmarking Model Robustness across Domains, Modalities, and Languages [10.131671217810581]
We evaluate model robustness to out-of-domain data, modality-specific features, and languages other than English.
We find that with additional image content as input, ELMo embeddings yield significantly fewer errors compared to BERT or GloVe.
arXiv Detail & Related papers (2021-04-23T18:05:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.