Beyond the Imitation Game: Quantifying and extrapolating the
capabilities of language models
- URL: http://arxiv.org/abs/2206.04615v3
- Date: Mon, 12 Jun 2023 17:51:15 GMT
- Title: Beyond the Imitation Game: Quantifying and extrapolating the
capabilities of language models
- Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb,
Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià
Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea
Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv,
Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda
Dsouza, Ambrose Slone, Ameet Rahane, Anantharaman S. Iyer, Anders Andreassen,
Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew Dai, Andrew
La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong,
Animesh Gupta, Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash
Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun Kirubarajan, Asher
Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla
Karakaş, B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej
Bojanowski, Batuhan Özyurt, Behnam Hedayatnia, Behnam Neyshabur, Benjamin
Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan
Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta,
César Ferri Ramírez, Chandan Singh, Charles Rathkopf, Chenlin Meng,
Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian Voigt,
Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera,
Clemencia Siro, Colin Raffel, Courtney Ashcraft, Cristina Garbacea, Damien
Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth, Daniel Freeman,
Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk,
Danny Hernandez, Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David
Drakard, David Jurgens, Debajyoti Datta, Deep Ganguli, Denis Emelin, Denis
Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra,
Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader,
Ekaterina Shutova, Ekin Dogus Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth
Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola, Emma Lam, Eric
Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan
Jerzak, Ethan Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia,
Fatemeh Siar, Fernando Martínez-Plumed, Francesca Happé, Francois
Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo,
Germán Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang,
Gonzalo Jaimovitch-López, Gregor Betz, Guy Gur-Ari, Hana Galijasevic,
Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar,
Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee
Wong, Ian Ng, Isaac Noble, Jaap Jumelet, Jack Geissinger, Jackson Kernion,
Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon, James
Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle
Wingfield, Jared Kaplan, Jarema Radom, Jascha Sohl-Dickstein, Jason Phang,
Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher, Jennifer
Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu,
Jiaming Song, Jillian Tang, Joan Waweru, John Burden, John Miller, John U.
Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos Rozen, Jose
Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B.
Tenenbaum, Joshua S. Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl
Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva, Katja Markert, Kaustubh
D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo,
Ksenia Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria
Reynolds, Leo Gao, Li Zhang, Liam Dugan, Lianhui Qin, Lidia
Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy
Noble, Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi
Kerem Şenel, Maarten Bosma, Maarten Sap, Maartje ter Hoeve, Maheen
Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli, Marco
Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli,
Martha Lewis, Martin Potthast, Matthew L. Leavitt, Matthias Hagen, Mátyás
Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin McElrath, Michael
A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt,
Michael Strube, Michał Swędrowski, Michele Bevilacqua, Michihiro
Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac Suzgun, Mitch Walker, Mo
Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma
T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas
Cameron, Nicholas Roberts, Nick Doiron, Nicole Martinez, Nikita Nangia,
Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar, Niveditha S. Iyer,
Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar
Elbaghdadi, Omer Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth
Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol, Pegah Alipoormolabashi,
Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu
Hwang, Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli,
Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin Banjade, Rachel Etta Rudolph,
Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm
Garg, Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert
Frank, Rohan Sikand, Roman Novak, Roman Sitelew, Ronan LeBras, Rosanne Liu,
Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan Lee, Ryan
Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant
Anand, Sam Dillavou, Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R.
Bowman, Samuel S. Schoenholz, Sanghyun Han, Sanjeev Kwatra, Sarah A. Rous,
Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian
Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou,
Shashank Srivastava, Sherry Shi, Shikhar Singh, Shima Asaadi, Shixiang Shane
Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay, Shyamolima (Shammie)
Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha
Priscilla Makini, Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas
Dehaene, Stefan Divic, Stefano Ermon, Stella Biderman, Stephanie Lin, Stephen
Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana
Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq
Ali, Tatsu Hashimoto, Te-Lin Wu, Théo Desbordes, Theodore Rothschild,
Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei Kornev,
Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar
Khot, Tyler Shultz, Uri Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai,
Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu, Vishakh Padmakumar, Vivek
Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang
Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh,
Yair Lakretz, Yangqiu Song, Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding
Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao Bai, Zachary
Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, Ziyi Wu
- Abstract summary: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale.
BIG-bench consists of 204 tasks, contributed by 450 authors across 132 institutions.
We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench.
- Score: 648.3665819567409
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models demonstrate both quantitative improvement and new qualitative
capabilities with increasing scale. Despite their potentially transformative
impact, these new capabilities are as yet poorly characterized. In order to
inform future research, prepare for disruptive new model capabilities, and
ameliorate socially harmful effects, it is vital that we understand the present
and near-future capabilities and limitations of language models. To address
this challenge, we introduce the Beyond the Imitation Game benchmark
(BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450
authors across 132 institutions. Task topics are diverse, drawing problems from
linguistics, childhood development, math, common-sense reasoning, biology,
physics, social bias, software development, and beyond. BIG-bench focuses on
tasks that are believed to be beyond the capabilities of current language
models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense
transformer architectures, and Switch-style sparse transformers on BIG-bench,
across model sizes spanning millions to hundreds of billions of parameters. In
addition, a team of human expert raters performed all tasks in order to provide
a strong baseline. Findings include: model performance and calibration both
improve with scale, but are poor in absolute terms (and when compared with
rater performance); performance is remarkably similar across model classes,
though with benefits from sparsity; tasks that improve gradually and
predictably commonly involve a large knowledge or memorization component,
whereas tasks that exhibit "breakthrough" behavior at a critical scale often
involve multiple steps or components, or brittle metrics; social bias typically
increases with scale in settings with ambiguous context, but this can be
improved with prompting.
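For readers who want a concrete sense of what evaluating a model on a BIG-bench task involves, the short Python sketch below scores a text-generation function against a single JSON task file using exact-match accuracy. It is an illustration only: the file layout (an "examples" list of records with "input" and "target" fields) mirrors the common BIG-bench JSON task format but is assumed here rather than taken from the paper, and the `generate` callable is a stand-in for any model API, not the benchmark's official evaluation harness.

    # Minimal sketch of a BIG-bench-style evaluation loop (illustrative only).
    # Assumes a JSON task file with an "examples" list of {"input": ..., "target": ...}
    # records; the model is any callable that maps a prompt string to an answer string.
    import json
    from typing import Callable

    def evaluate_json_task(task_path: str, generate: Callable[[str], str]) -> float:
        """Return exact-match accuracy of `generate` on a BIG-bench-style JSON task."""
        with open(task_path) as f:
            task = json.load(f)

        examples = task["examples"]
        correct = 0
        for ex in examples:
            prediction = generate(ex["input"]).strip().lower()
            # "target" may be a single string or a list of acceptable answers.
            targets = ex["target"] if isinstance(ex["target"], list) else [ex["target"]]
            if prediction in (t.strip().lower() for t in targets):
                correct += 1
        return correct / len(examples)

    if __name__ == "__main__":
        # Hypothetical usage: a trivial "model" that always answers "yes",
        # run on a local task file named task.json (the path is an assumption).
        accuracy = evaluate_json_task("task.json", lambda prompt: "yes")
        print(f"exact-match accuracy: {accuracy:.3f}")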
Related papers
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models [39.46610170563634]
INSTRUCTEVAL is a more comprehensive evaluation suite designed specifically for instruction-tuned large language models.
We take a holistic approach to analyze various factors affecting model performance, including the pretraining foundation, instruction-tuning data, and training methods.
Our findings reveal that the quality of instruction data is the most crucial factor in scaling model performance.
arXiv Detail & Related papers (2023-06-07T20:12:29Z)
- A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation over the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To distinguish models by parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z)
- Language Model Behavior: A Comprehensive Survey [5.663056267168211]
We discuss over 250 recent studies of English language model behavior before task-specific fine-tuning.
Despite dramatic increases in generated text quality as models scale to hundreds of billions of parameters, the models are still prone to non-factual responses, commonsense errors, memorized text, and social biases.
arXiv Detail & Related papers (2023-03-20T23:54:26Z)
- What Language Model to Train if You Have One Million GPU Hours? [54.32062236748831]
We study different modeling practices and their impact on zero-shot generalization.
We also study the performance of a multilingual model and how it compares to the English-only one.
All our models and code are open-sourced at https://huggingface.co/bigscience.
arXiv Detail & Related papers (2022-10-27T13:43:27Z)
- Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? [0.06299766708197882]
We create a new task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup.
We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably.
This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
arXiv Detail & Related papers (2022-10-21T16:07:00Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four bias-related tasks: diagnosis, identification, extraction, and rephrasing.
Our analyses indicate that language models can perform these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.