Paper overview and license

# (Reference translation) Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks [full translation available]

Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks (http://arxiv.org/abs/2205.05718v1)

License: CC BY 4.0
Katherine M. Collins, Catherine Wong, Jiahai Feng, Megan Wei, and Joshua B. Tenenbaum

Human language offers a powerful window into our thoughts -- we tell stories, give explanations, and express our beliefs and goals through words. Abundant evidence also suggests that language plays a developmental role in structuring our learning. Here, we ask: how much of human-like thinking can be captured by learning statistical patterns in language alone? We first contribute a new challenge benchmark for comparing humans and distributional large language models (LLMs). Our benchmark contains two problem-solving domains (planning and explanation generation) and is designed to require generalization to new, out-of-distribution problems expressed in language. We find that humans are far more robust than LLMs on this benchmark. Next, we propose a hybrid Parse-and-Solve model, which augments distributional LLMs with a structured symbolic reasoning module. We find that this model shows more robust adaptation to out-of-distribution planning problems, demonstrating the promise of hybrid AI models for more human-like reasoning.
Published: Wed, 11 May 2022 18:14:33 GMT

Note: the PDF is the original paper. The translated text is licensed under CC BY-SA 4.0; see the site's top page for details.

## Full text
Katherine M. Collins¹,²*†, Catherine Wong²*, Jiahai Feng², Megan Wei², and Joshua B. Tenenbaum²
¹University of Cambridge, ²MIT. †kmc61@cam.ac.uk. *Contributed equally. ‡Data and code for the project can be found at: https://github.com/collinskatie/structuredflexibleandrobust

Keywords: language; problem-solving; programs; language of thought; neuro-symbolic models

### Introduction

Language expresses the rich internal landscape of our thinking in a form that can be shared externally with others. We tell stories about real (what did I do today?) and hypothetical (what would I do if I won the lottery?) situations; give instructions for achieving goals ranging from the mundane (how do I put away the dishes?) to the complex (how do I fix a carburetor?); and propose explanations for both everyday events (why isn't the lightbulb turning on?) and novel observations (what's that strange beeping sound?).
Learning language and learning from language also play crucial roles in the development of children's thinking (Gopnik & Meltzoff, 1997; Carey, 2009; Harris et al., 2018).
But what, in computational terms, is the relationship between language and thought, and between learning language and learning to think?

Classical theories draw a stark division between thinking as the manipulation of structured representations in an internal symbol system or language of thought (LOT) (Fodor, 1975), and language as a system of mappings between those representations and outwardly expressed forms (e.g., sounds, text).
Under this view, learning language plays at best a supporting role in learning to think. Recently, however, a new generation of statistical language learning systems in AI has put forth a serious challenge to this view. So-called large language models (LLMs) (Brown et al., 2020; Rae et al., 2021) have demonstrated such striking success in realistic language production that they often appear to be "thinking", and yet they are driven solely by neural networks trained to predict the distribution of next words in long text sequences from very large corpora of human language. Other work has proposed using LLMs as a universal foundation for emulating many human reasoning abilities, including capacities as diverse as physical reasoning (Bisk et al., 2019), task-level planning (Sharma et al., 2021; Huang et al., 2022), and even mathematical reasoning (Cobbe et al., 2021), simply by re-framing them as linguistic prediction. Under this view, "all you need is language": learning to think requires little more than learning (the statistics of) language, or learning only the latent structure sufficient to produce the most probable next word in any linguistic context.

In this paper, our goal is to critically assess how close modern LLMs come to actually learning to think, and to sketch out an alternative hybrid view of the language-thought interface that integrates elements of the classical LOT and recent LLM paradigms. In Part I, we describe a new, generic approach for constructing linguistic reasoning prompts that measure flexible, creative thinking abilities in novel situations, as opposed to the ability to retrieve familiar patterns of thought for familiar situations. We use an iterative constraint generation paradigm that extends initial linguistic prompts using linguistic constraints that restrict production of the most common human responses, forcing responses that require novel language production and, we argue, a greater degree of thinking. We compare LLMs to humans using this benchmark on two domains (plan and explanation generation) and find that humans both significantly outperform LLMs in general, and are comparatively more robust to prompts that extend beyond the standard distribution of human language. In Part II, we propose an alternative computational approach that leverages an LLM to map natural language into a space
of structured programs, such that reasoning problems can be solved by powerful, scalable symbolic algorithms, rather than the purely neural form of end-to-end LLMs alone. We implement and demonstrate this model in a simplified synthetic language setting designed to emulate the planning domain in Part I. Our results suggest that such hybrid approaches are a promising way forwards, albeit still rich with potential for future improvement.

### Part I: Linguistic reasoning benchmark for humans and language models

The first core motivation of this work is to evaluate the extent to which modeling the predictive distribution of language actually captures the underlying reasoning latent in human language. Towards this end, we propose a benchmark task (Fig. 1) based on two core reasoning abilities (goal-based planning and causal explanation), using an iterative design to challenge models which simply learn predictable responses from prior language.

#### Methods

We benchmark human and language model performance using a two-stage experimental design. In the first stage, an iterative human language production experiment (Fig. 1B), we collect human responses on two domains (planning and explanations) under three progressively more challenging conditions: a baseline initial prompt condition using a collection of linguistic reasoning prompts; and two constrained conditions which restrict the use of common answers to each prompt, in order to encourage participants to generate novel linguistic solutions. In the second stage, we evaluate a large language model (LLM) on the same prompts, and collect responses by sampling from its predictive distribution. We describe each stage in more detail below.

Human language production experiment. Participants: 240 participants recruited from Prolific (2 domains x 3 conditions x 40 participants) completed the task. Base pay was $15/hr, with a $1 quality bonus.

Condition 1: initial reasoning prompts. To measure baseline performance, our first reasoning condition elicits human responses to initial prompts (Fig. 1B, Condition 1) on each grounding domain. We construct 28 goal prompts for the planning domain (Fig. 1A, top), designed to elicit a concrete linguistic plan and to vary in their base typicality (e.g. ranging from clean the dirty dishes to get a sofa on the roof).

We also construct 28 causal event prompts of varying typicality for the explanations domain (Fig. 1A, bottom), inspired by the "unusual event" prompts in (Korman & Khemlani, 2020): each event begins with an inciting cause and its usual consequence, then poses a counterfactual. Participants in this condition responded to a random batch (n=7) of prompts from a single domain, resulting in 10 unique responses per prompt. After responding to all prompts, we also ask participants to score base typicality for each prompt of the goal (on planning) or inciting event (on explanations) using a 7-point Likert scale.

Condition 2 and 3: constrained reasoning prompts. In the subsequent conditions (Fig. 1B, Condition 2, 3), we evaluate the human ability to flexibly generate more novel plans and explanations for the same initial prompts, by restricting their responses to prevent subjects from falling back on the most common solutions. Specifically, we use subject responses from Condition 1 to determine common (and likely highly predictable) components of plans and explanations for each prompt. We construct linguistic constraints by extracting concrete nouns from all responses to a given prompt (using an expert human tagger, who also lemmatizes and standardizes the form of each noun). A rough programmatic sketch of this step is shown below.
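
The paper uses an expert human tagger for the constraint-construction step; as a programmatic approximation of the same idea (a swapped-in tool, not the authors' pipeline), the sketch below uses spaCy to count lemmatized nouns across the Condition 1 responses to one prompt, yielding the single most common noun (Condition 2) and the full noun set (Condition 3).

```python
# Approximate the expert-tagger step: extract, lemmatize, and tally nouns.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_sm")

def noun_constraints(responses: list[str]) -> Counter:
    """Count lemmatized nouns across all Condition 1 responses to one prompt."""
    counts: Counter = Counter()
    for doc in nlp.pipe(responses):
        counts.update(tok.lemma_.lower() for tok in doc if tok.pos_ == "NOUN")
    return counts

counts = noun_constraints([
    "Use lots of soap and water.",
    "Maybe a dishwasher.",
    "Scrub with soap and a sponge.",
])
most_common_noun = counts.most_common(1)[0][0]  # seeds Condition 2 (here "soap")
all_initial_nouns = sorted(counts)              # seeds Condition 3
```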

We then extend each initial prompt in two more challenging conditions: in the most common noun constrained condition, we restrict responses which use the single most common noun; in the all initial nouns constrained condition, we restrict all nouns which appear in the initial responses. A new set of participants responded to a random batch (n=7) of prompts in a single domain and condition, again resulting in 10 unique responses per prompt and condition that reflect these linguistic constraints.

Language model matched production experiment. Our human experiment yields a series of linguistic prompts, in which individual goal and explanation prompts are extended across two more challenging conditions through linguistic constraints that restrict the usage of the most common responses to each. We use these same prompts to construct a benchmark language production task for our artificial language model. We evaluate our prompts on the state-of-the-art model GPT-3 (Brown et al., 2020), using the few-shot prompting technique introduced in (Brown et al., 2020) for generating predictive language for particular tasks. Specifically, we seed the model with a small number of examples (n=12 goals, and n=15 explanations: the maximum number of examples the model allowed, based on token limits) pairing held-out prompts and human-generated text, then elicit generated responses for each prompt across all conditions. To eliminate purely degenerate text, we also prescreen the samples by asking human evaluators (N=370; recruited from Prolific) to score responses for surface language errors alone, and remove the lowest scoring responses. A sketch of the few-shot prompt assembly is shown below.
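
As a concrete illustration of the few-shot setup, here is a minimal sketch of the prompt assembly; the seed pairs, separator format, and any decoding settings are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal few-shot prompt construction: seed (prompt, human response) pairs,
# then append the test prompt so the LLM's continuation is sampled as its answer.
SEED_PAIRS = [
    ("Goal: Clean the dirty dishes.",
     "Scrub them with soap, warm water, and a sponge."),
    ("Goal: Keep the plants in your garden alive.",
     "Water them regularly and make sure they get enough sunlight."),
    # ... up to n=12 goal (or n=15 explanation) held-out pairs, per the paper
]

def build_prompt(seed_pairs: list[tuple[str, str]], test_prompt: str) -> str:
    """Concatenate seed pairs, then the test prompt, as one completion prompt."""
    shots = "\n\n".join(f"{goal}\n{response}" for goal, response in seed_pairs)
    return f"{shots}\n\n{test_prompt}\n"

prompt = build_prompt(
    SEED_PAIRS, "Goal: Clean the dirty dishes, without using soap."
)
# 'prompt' is then sent to GPT-3; 20 responses per prompt survive prescreening.
```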

Figure 1: Iterative reasoning task overview. A) Sample goals and scenarios for the planning and explanation domains, respectively, illustrating the range of base typicality of our stimuli (e.g., planning goals from "Clean the dirty dishes." to "Remove plaque from the teeth of a lion."; explanation prompts such as "If rocks are thrown at a window, the window breaks. But suppose rocks are thrown at a window, and then the window does not break. Why?"); B) Formation of constraints from human-generated language, where constraints are selected based on frequency, with sample human generations (blue text); C) LLM generations (gray text) in response to the same prompts.

After screening, we collect a total of 20 LLM-generated responses for each prompt in each condition.

Blind comparative human evaluation. Having collected human and LLM responses to the same linguistic prompts across all conditions, we now benchmark their relative performance using blind human evaluators (N=393; recruited from Prolific) asked to evaluate responses in a single domain and condition on a 7-point Likert scale (1: worst; 7: best).
Subjects rated responses for a random batch of prompts, scoring a (randomly shuffled) set of human (n=10) and LLM (n=10) responses for each.

#### Results

Representative human responses and language model responses across both domains and conditions are depicted in Fig. 2. To investigate comparative performance, we fit linear mixed effects regression (LMER) models predicting the human-evaluated score and use a corresponding likelihood ratio test (LRT) against an ablated model to determine the significance of the fixed effects. Fig. 3 shows results of the blind human evaluation, and depicts statistical significance within and across conditions.

People outperform the LLM within each reasoning condition. We first fit a LMER predicting the human evaluated score from the source language generator (human or LLM), with random effects for the individual raters and prompts (syntax: score ~ source + (1|raterid) + (1|prompt)); a code sketch of this analysis follows.
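
For readers who want to reproduce this style of analysis, the sketch below fits the crossed-random-effects model in Python with statsmodels and runs the likelihood ratio test against the ablated (intercept-only) model; the lme4-style formula in the text maps onto statsmodels' variance-components interface. Column names and the input file are hypothetical.

```python
# Hedged sketch: LMER score ~ source + (1|raterid) + (1|prompt), plus an LRT
# against the ablated model without the fixed effect of source.
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.read_csv("ratings.csv")  # hypothetical columns: score, source, raterid, prompt
df["grp"] = 1  # a single group makes both random effects fully crossed

vc = {"raterid": "0 + C(raterid)", "prompt": "0 + C(prompt)"}

# re_formula="0" drops the redundant per-group random intercept; reml=False
# because likelihood ratio tests on fixed effects require ML (not REML) fits.
full = smf.mixedlm("score ~ source", df, groups="grp",
                   vc_formula=vc, re_formula="0").fit(reml=False)
ablated = smf.mixedlm("score ~ 1", df, groups="grp",
                      vc_formula=vc, re_formula="0").fit(reml=False)

lr_stat = 2 * (full.llf - ablated.llf)
p_value = stats.chi2.sf(lr_stat, df=1)  # one fixed-effect term dropped
print(full.summary())
print(f"LRT p-value for source: {p_value:.4g}")
```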

Our LRT finds that there is a significant effect (p<0.001) of the language source (humans vs. LLM) in both domains and in each condition (Fig. 3, black indicators); humans outperform the LLM in every condition, across both domains.

People are more robust to out-of-distribution prompts with constraints. We next consider our more central question: how well do language models perform specifically on our more constrained conditions, designed explicitly to force both humans and models to generate novel solutions to our underlying reasoning task?
We expect humans to not only outperform language models in a direct comparison across individual prompts, but also to be comparatively more robust to prompts which restrict highly predictable answers, and require responses beyond the distribution of standard human language. An initial LMER with a fixed effect for the condition (unconstrained, most common constraint, or many constraints) suggests that both humans and LLMs are sensitive to the added constraints, though we find a strongly significant effect of condition on performance for LLMs (p<0.001); and a weakly significant effect (p=0.03) for humans in the planning domain but strongly significant for explanations (p<0.001).
However, a subsequent LMER with an interaction term for the language source (humans or LLMs) and condition (fit pairwise across each successive set of conditions) indicates that humans and LLMs are not equally sensitive to constraints: we find strongly significant interaction terms (Fig. 3, red) indicating that humans are more robust to added constraints across each condition. This supports our central hypothesis: language models are increasingly poor at solving the underlying task once the prompts are constrained to restrict predictable responses.

Figure 2: Representative plans (A) and explanations (B), per constraint condition, generated by humans and an end-to-end LLM. Average goodness rating, over the human evaluators for each generation, is shown in orange.

Figure 3: Mean overall goodness rating over plans (left) and explanations (right), shown across all three constraint conditions. Humans (blue boxes) significantly outperform the LLM (gray boxes) in every condition (black, lower bars) and in successive pairwise conditions (red, upper bars).

People are more robust to goal typicality. We also investigate whether another measure of linguistic predictability (the atypicality of our base prompts) also impacts LLM performance relative to humans. We fit a final LMER model with an interaction term for source and human typicality scores elicited in our initial experiment.
Interestingly, we find a significant interaction effect of typicality (p<0.001) for the planning domain, but not for explanations. As assessing typicality for these prompts is more complex, further work (such as linguistic measures of prompt typicality) is necessary to better assess the explanations domain. This finding further supports our broader hypothesis: that LLMs are less robust to responding to out-of-distribution scenarios which pose novel, but solvable, planning problems.

Qualitative analysis of commonsense failures in LLM reasoning. Do large language models suffer from distinctively different patterns of errors?
An initial, qualitative examination suggests that large language models are particularly prone to errors indicating a more fundamental lack of "commonsense" understanding: of the underlying task, or the world knowledge required to solve it. A preliminary examination suggests that language models struggle particularly in generating coherent, realistic solutions for problems that require novel but concrete physical reasoning: as in the sofa-on-a-roof goals in Fig. 2; or failures to understand color (the carpet was white, so the blue dye did not show up); water (the grass is not made of water and so it does not absorb the water); or gravity and material (e.g. someone failing to scrape their knees after falling in pants that were made of paper).
Taken together, our reasoning experiment suggests that despite the surface plausibility of their generated text, large language models generally struggle to emulate the latent reasoning that backs human responses: once problems expressed in language require solutions beyond the standard, and most predictable, distribution of prior language, the apparent "reasoning" abilities of these models deteriorate sharply.

### Part II: Integrating language with structured reasoning models

Our results in Part I suggest that even very large language models may not capture the characteristic flexibility of human reasoning: they struggle to produce language reflecting novel computation over an underlying task. Here, we propose an alternate computational approach for reasoning about problems posed in language. Rather than hoping to simulate latent computations (like planning) by directly predicting output language, we propose a simple (but demonstrative) parse-and-symbolic-planner (P+S) model which grounds language in an explicit "language-of-thought" (Fodor, 1975): a formal program expressing the meaning of the linguistic prompt, which interfaces with a symbolic computational solver (Fig. 4B).

Simulated planning experiment. We introduce a simulated planning domain to benchmark our parse-and-symbolic-planner model against a standard LLM (here, GPT-Neo (Black, Gao, Wang, Leahy, & Biderman, 2021)), using a restricted set of prompts designed to emulate the core properties of the broader planning domain in Part I. We focus on planning here for a straightforward metric of comparative performance: accuracy of our restricted plans can be evaluated directly on an explicit world model.

Initial and constrained synthetic planning prompts. As with Part I, our simulated experiment benchmarks model performance under three progressively more challenging conditions: responses to an initial set of linguistic goal prompts (Fig. 4A, Condition 1); and two constrained conditions which introduce new linguistic constraints over the initial goal (Fig. 4B, Condition 2, 3).
As is obvious from Fig. 4B, our conditions differ from Part I in one important respect: we extend our initial goals with positive constraints, rather than the negative constraints in Part I. This format permits a more direct, albeit simplified, evaluation of the core task (fully simulating restrictions on initial resources would require modeling, and communicating, all possible alternative ways to achieve a goal in a simulated environment) while still requiring models to reason about complex, out-of-distribution language. We generate initial and constrained goal prompts, along with a linguistic initial condition completely specifying the starting planning state for each prompt, from a synthetic grammar over a simple object-stacking domain (Gupta & Nau, 1992), in which each goal is a target stack of objects on a table (Fig. 4).
Initial prompts involve goals with a single common household object; these are extended with both a single constraint and many constraints (n=4) that introduce additional, unusual objects into the initial goal. In total, we sample n=100 initial goals and then sample constraints for both extended conditions. A toy sketch of such a grammar is shown below.
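
The snippet below is a hedged, toy sketch of what such a synthetic goal grammar could look like; the object lists, goal template, and constraint phrasing are invented for illustration and are not the authors' released grammar.

```python
# Toy synthetic goal grammar: common household objects for Condition 1,
# unusual objects injected as positive constraints for Conditions 2-3.
import random

COMMON = ["mug", "book", "plate", "bowl"]
UNUSUAL = ["trombone", "cactus", "anvil", "snow globe"]

def initial_goal(rng: random.Random) -> str:
    """Condition 1: a goal over a single common household object."""
    obj = rng.choice(COMMON)
    return f"Goal: stack the {obj} on the table."

def constrained_goal(goal: str, n_constraints: int, rng: random.Random) -> str:
    """Conditions 2-3: positive constraints introducing unusual objects."""
    extras = rng.sample(UNUSUAL, n_constraints)
    clause = " and ".join(f"the {obj}" for obj in extras)
    return goal.rstrip(".") + f", with {clause} on top."

rng = random.Random(0)
goal = initial_goal(rng)
print(goal)                            # Condition 1
print(constrained_goal(goal, 1, rng))  # single constraint
print(constrained_goal(goal, 4, rng))  # many constraints (n=4)
```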

Parse-and-solve model. Fig. 4B depicts a schematic of our parse-and-solve model, designed to disentangle language from the underlying computation required to solve planning tasks expressed in language. Our model integrates two distinct components. First, it parses language into a formal program representing the initial problem state and goal (using the PDDL planning language (McDermott et al., 1998)). For more direct comparison with a benchmark LLM, we also use a large language model as our surface parser: we use the Codex (Chen et al., 2021) model (a GPT-3 model fine-tuned on a joint distribution of language and symbolic programs), which can "parse" language into programs using an analogous few-shot prompting technique (seeded with coupled examples of text and code).
Unlike our comparison model, however, we employ distributional prediction only for a more constrained task: emulating the joint variation between a natural and formal language. The parsed programs are passed to our model's second core component: a symbolic solver, modeled with a search-based planner (Alkhazraji et al., 2020) which attempts to generate a symbolic plan over a restricted set of actions (moving objects from one location to the next) to solve the parsed goal. A sketch of this symbolic half is shown below.
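
To make the solver side concrete, here is a hedged sketch of the kind of PDDL program the parser targets and how a Pyperplan-style search-based planner (Alkhazraji et al., 2020) could be invoked on it; the domain encoding, file handling, and goal are illustrative assumptions, not the authors' exact release.

```python
# Hand a parsed PDDL domain/problem pair to the pyperplan CLI and read back
# the plan it finds.
import pathlib
import subprocess
import tempfile

DOMAIN = """(define (domain stacking)
  (:requirements :strips)
  (:predicates (on ?x ?y) (ontable ?x) (clear ?x) (handempty) (holding ?x))
  (:action pick-up
    :parameters (?x)
    :precondition (and (clear ?x) (ontable ?x) (handempty))
    :effect (and (not (ontable ?x)) (not (clear ?x)) (not (handempty))
                 (holding ?x)))
  (:action stack
    :parameters (?x ?y)
    :precondition (and (holding ?x) (clear ?y))
    :effect (and (not (holding ?x)) (not (clear ?y)) (clear ?x) (handempty)
                 (on ?x ?y))))"""

# What the parser might emit for "Goal: stack the mug on the book."
PROBLEM = """(define (problem stack-mug)
  (:domain stacking)
  (:objects mug book)
  (:init (ontable mug) (ontable book) (clear mug) (clear book) (handempty))
  (:goal (on mug book)))"""

def solve(domain_src: str, problem_src: str) -> str:
    """Write the parsed PDDL to disk and call the pyperplan CLI on it."""
    tmp = pathlib.Path(tempfile.mkdtemp())
    (tmp / "domain.pddl").write_text(domain_src)
    (tmp / "problem.pddl").write_text(problem_src)
    subprocess.run(
        ["pyperplan", str(tmp / "domain.pddl"), str(tmp / "problem.pddl")],
        check=True,
    )
    # pyperplan conventionally writes the plan beside the problem file.
    return (tmp / "problem.pddl.soln").read_text()

print(solve(DOMAIN, PROBLEM))  # expected: (pick-up mug) (stack mug book)
```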

Figure 4: Simulated iterative planning task overview. A) Example progressively-constrained goal stimuli; B) Evaluation compares plans generated directly from an LLM (left) with plans generated from P+S (right); C) Success rate of P+S model (purple) vs. LLM (gray); P+S statistically significantly outperforms the LLM under each condition (black bars).

Plan simulation environment. Unlike in Part I, plans using the restricted space of actions in this domain can be simulated directly to assess accuracy. The P+S model outputs executable PDDL actions; the LLM-as-planner baseline outputs language, which we parse by inverting the synthetic grammar into PDDL actions.
For both models, we mark unparseable or invalid plans as unsuccessful.

#### Results

Analogous analyses to those in Part I (Fig. 4C), measuring the comparative performance of our model with an LLM, as well as its robustness to constraints, suggest that our hybrid model, which uses predictive modeling only to transform language into a structured interface to an underlying symbolic planner, vastly improves its ability to adapt to complexly constrained goals.

Parse-and-solve model outperforms LLM. An LMER comparing our two models (P+S and LLM) finds a strongly significant difference in overall performance (p<0.001; Fig. 4C): indeed, the LLM solves none of the problems in our most constrained condition.

Comparative robustness to constraints. Interestingly, a pairwise LMER testing for an interaction between source and condition does not find a significant interaction effect, suggesting that both models decline similarly in relative performance between conditions. One likely possibility is that this is an artifact of our restricted experiment size: the LLM simply can perform no worse in the final condition. However, these results could also suggest that the parsing approach we use here, which employs distributional models to map language into programs, may itself struggle to generalize; a hybrid parser, which itself draws on more structured representations (like classical linguistic grammars), might be better suited to parsing our most challenging compositional goals.

### Discussion

Human language provides a richly structured window into how we think about the world. Our results, however, suggest that modeling the distribution of language alone may not be sufficient to capture the computations underlying planning, explanations, and other forms of reasoning which ground the language we produce. Instead, we propose an alternative approach: hybrid models which use distributional prediction to map language into structured formal representations of meaning that interface directly with structured symbolic algorithms (Ellis et al., 2020; Wong et al., 2021; Nye et al., 2021).

Our contributions here leave much open for future work: to more systematically characterize regimes under which simply producing probable language closely approximates, and deviates, from human reasoning, and go beyond the simple demonstration model we have provided towards broader-coverage models for more realistic reasoning domains. An important next step will be building on the qualitative analyses in Part I to disentangle the many factors (e.g., accuracy, semantic coherence, and concision) that may separate human performance from purely predictive responses. In tandem, the hybrid model we propose here offers a promising, albeit highly restricted, step towards emulating human-like reasoning over language. How do we learn the structured world models, or even sophisticated planning algorithms, that our simple model builds upon?
Our core modeling approach suggests a path towards these more fundamental learning problems: using language to construct, or guide discovery of, programs which represent novel environments, actions, and even algorithms for operating over such worlds.

### Acknowledgments

We thank Laura Schulz, Junyi Chu, Alex Lew, Joao Loula Guimaraes de Campos, Max Nye, and the rest of the GPS Community for many thrilling, inspiring conversations, as well as practical advice with our project. We are also deeply grateful to Pratyusha Sharma, Jacob Andreas, and Noa Korneev for their thoughtful suggestions and support. We thank Yoni Friedman for his fantastic help with human annotator recruitment. Additionally, we thank the OpenAI team for increasing our quota to enable us to run more GPT-3 rollouts for Part I, and our Anonymous Reviewers for useful comments. KMC is supported by a Marshall Scholarship and conducted work on the project under a Goldwater Scholarship. CW and JBT are supported by AFOSR #FA9550-19-1-0269, the MIT Quest for Intelligence, the MIT-IBM Watson AI Lab, ONR Science of AI, and DARPA Machine Common Sense.

### References

Alkhazraji, Y., Frorath, M., Grützner, M., Helmert, M., Liebetraut, T., Mattmüller, R., ... Wülfing, J. (2020). Pyperplan. Zenodo. https://doi.org/10.5281/zenodo.3700819

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., & Choi, Y. (2019). PIQA: Reasoning about physical commonsense in natural language. CoRR, abs/1911.11641. http://arxiv.org/abs/1911.11641

Black, S., Gao, L., Wang, P., Leahy, C., & Biderman, S. (2021, March). GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow. Zenodo. https://doi.org/10.5281/zenodo.5297715

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language models are few-shot learners. CoRR, abs/2005.14165. https://arxiv.org/abs/2005.14165

Carey, S. (2009). Where our number concepts come from. The Journal of Philosophy, 106(4), 220.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., ... Zaremba, W. (2021). Evaluating large language models trained on code. CoRR, abs/2107.03374. https://arxiv.org/abs/2107.03374

Cobbe, K., Kosaraju, V., Bavarian, M., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

Ellis, K., Wong, C., Nye, M. I., Sablé-Meyer, M., Cary, L., Morales, L., ... Tenenbaum, J. B. (2020). DreamCoder: Growing generalizable, interpretable knowledge with wake-sleep Bayesian program learning. CoRR, abs/2006.08381. https://arxiv.org/abs/2006.08381

Fodor, J. A. (1975). The language of thought. Harvard University Press.

Gopnik, A., & Meltzoff, A. N. (1997). Words, thoughts, and theories. MIT Press.

Gupta, N., & Nau, D. S. (1992). On the complexity of blocks-world planning. Artificial Intelligence, 56(2-3), 223-254.

Harris, P. L., Koenig, M. A., Corriveau, K. H., & Jaswal, V. K. (2018). Cognitive foundations of learning from testimony. Annual Review of Psychology, 69, 251-273.
Cognitivefoundations oflearningfromtestim ony.AnnualReviewofPs ychology,69,251–273.Huang,W. Annual Reviewof Psychoology,69,251–273.Huang,W 0.12
,Abbeel,P. ,Abbeel,P。 0.80
,Pathak,D. ,Pathak,D。 0.39
,&Mordatch,I. とMordatch,I。 0.68
(2022). Languagemodelsaszero -shotplanners:Extrac tingactionableknowle dgeforembodiedagents .Korman,J. (2022). 言語モデルとショットプランナー 0.39
,&Khemlani,S. とkhemlaniは言う。 0.39
(2020). Explanatorycom-plete ness.ActaPsychologic a,209,103139.Retriev edfromhttps://www.sc iencedirect.com/scie nce/article/pii/S000 1691819303531doi:htt ps://doi.org/10.1016 /j.actpsy.2020.10313 9McDermott,D. (2020). ActaPsychologicala,2 09,103139.Retrieved fromhttps://www.scie ncedirect.com/scienc e/article/pii/S00016 918 19303531doi:https:// doi.org/10.1016/j.ac tpsy.2020.103139McDe rmott,D. 0.27
,Ghallab,M. 、Ghallab,M。 0.39
,Howe,A. ,Knoblock,C. とA。 ,Knoblock,C。 0.37
,Ram,A. ,Veloso,M. ,Ram,A。 ,Veloso,M。 0.60
,...Wilkins,D. (1998). ウィルキンス、d。 (1998). 0.38
Pddl-theplanningdoma indefinitionlanguage(Tech. Rep.No.TR-98-003). Pddl-theplanning domaindefinition Language (Tech.Rep.No.TR-98-0 03) 0.17
YaleCenterforComputa tionalVisionandContr ol,. yalecenter forcomputationalvisi onandcontrolの略。 0.16
Nye,M.I.,Tessler,M.H .,Tenenbaum,J.B.,&am p;Lake,B.M.(2021). Nye,M.I.,Tessler,M.H .,Tenenbaum,J.B.,&am p;Lake,B.M.(2021) 0.46
Improvingcoherencean dconsistencyinneural sequencemodelswithdu al-system,neuro-symb olicreasoning.CoRR,a bs/2107.02794.Retrie vedfromhttps://arxiv .org/abs/2107.02794R ae,J.W.,Borgeaud,S. Coherence and Consistencyinneurals equencemodelswithdua l-system,neuro-symbo licreasoning.CoRR,ab s/2107.02794.Retriev ed fromhttps://arxiv.or g/abs/2107.02794Rae, J.W.,Borgeaud,S. 0.13
,Cai,T. ,Millican,K. 、Cai,T。 、Millican,K。 0.58
,Hoffmann,J. ,Song,F. ホフマン、J。 とSong,F。 0.64
,...others(2021). 他者(2021年)。 0.24
Scalinglanguagemod-e ls:Methods,analysis& amp;insightsfromtraining gopher.arXivpreprint arXiv:2112.11446.Sha rma,P. スケーリング言語mod-els:Methods,anal ysis&insights fromtraininggopher.a rXivpreprintarXiv:21 12.11446.Sharma,P. 0.50
,Torralba,A. ,Torralba,A。 0.81
,&Andreas,J. 通称、andreas,j。 0.47
(2021). Skillinductionandpla nningwithlatentlangu age.Wong,C. (2021). Skillinduction andplanningwithlaten t Language.Wong,C. 0.37
,Ellis,K. ,Tenenbaum,J.B.,& ;Andreas,J. エリス、k。 テネンバウム、J.B.、アンドレアス、J。 0.55
(2021). Leveraginglanguageto learnprogramabstrac- tionsandsearchheuris tics.CoRR,abs/2106.1 1053.Re-trievedfromh ttps://arxiv.org/abs /2106.11053 (2021). Leveraging Languagetolearn programsabstrac-tion sandsearchheuristics .CoRR,abs/2106.11053 .Re-trieved fromhttps://arxiv.or g/abs/2106.11053 0.27
Supplemental: Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks

Here, we provide additional details on the human experiments, models, stimuli, and analyses described in Parts I and II of our main text. Further details can be found in our repository.
A pre-registration logged for Part I of our study can be found here.

S1. Part I Supplemental Details
We first expand on the experimental design and results discussed in Part I: Linguistic reasoning benchmark for humans and language models. Code for these experiments can be found in the "Part I" directory of our repository.

Creation of constraints
Constraints used in Conditions 2 and 3 per domain were constructed by an expert human tagger from the plans and explanations produced by humans in Condition 1. The human tagger extracted all concrete nouns mentioned in the human generations. We allow the expert tagger to collapse multiple semantically identical phrases into a single noun, based on prior world knowledge; e.g., collapsing "put a life net", "fly an outstretched net", and "allow them to land in the net" into the single phrase "the net." In the explanation domain, constraints were constructed with respect to the full set of N=10 unconstrained examples. In the planning domain, plans were generated in stages. In the first stage, approximately 4-7 humans generated plans per goal, and in the second stage, additional humans were recruited to ensure at least 10 plans for each goal. Constraints for the planning domain were constructed only with respect to this first batch.

Human experiments
We now provide additional details on the three flavors of human experiments run in this work: 1) generation of plans and explanations, 2) rating humanness of LLM-generated language, and 3) rating the overall goodness of mixed human- and LLM-generated language. In all experiments, participants were recruited from Prolific via Cognition. Participants were based in the United States and had to be at least 18 years old and speak English. Participants were not allowed to partake in more than one flavor of experiment.

Human language production and typicality rating
Language production tasks were designed to last 30 minutes. Participants were asked to provide plans or explanations for seven goals or scenarios within a single condition. Participants in Condition 1 (the initial prompt without constraints) also were asked to provide typicality ratings for all goals or scenarios that they had seen. All such ratings were conducted after all seven language production tasks were completed. For the planning domain, participants were asked to mark on a 1-7 Likert scale "How frequently you think people try to achieve each goal," where 1 = "Most people do this on a daily basis." and 7 = "I don't see how this is even possible to do or try to do." Participants in the explanation domain rated prompts along two dimensions of typicality: 1) the base rate of the incident, X (e.g., "How frequently do you think someone observes this event in the actual world?"), and 2) the rate of the effect, Y, given the cause-event X (e.g., "Assuming that this initial event happens... how frequently do you think this results?").
Rating humanness
To give the LLM the fairest comparison, we employ human annotators to pre-screen, or filter, LLM generations based on their "humanness." Annotators were asked to rate the likelihood that the language could have been generated by a person on a 1-7 Likert scale ("How plausible is it that a human would have generated the language of the explanation/plan?"), with 1 being "completely implausible" and 7 being "completely plausible".
Each rater saw a random subset of between 42 and 45 generations from within the same domain and associated constraint condition. The subset was randomly selected over the entire space of goals or scenarios, with the exception that a pilot check of the experiment was run for the planning domain only, where participants saw plans randomized over only three goals. We retained only explanations and plans which surpassed a score of 2, to remove degenerate responses.
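For concreteness, this prescreen reduces to a simple aggregation over ratings. A minimal sketch in Python, assuming a hypothetical ratings table with columns generation_id, rater_id, and humanness (the 1-7 Likert score), and assuming the threshold applies to the mean rating per generation:

```python
import pandas as pd

# Hypothetical table: one row per (generation, rater) pair.
ratings = pd.read_csv("humanness_ratings.csv")

# Average the 1-7 humanness ratings per LLM generation, and retain only
# generations whose mean score surpasses 2, removing degenerate responses.
mean_scores = ratings.groupby("generation_id")["humanness"].mean()
retained_ids = mean_scores[mean_scores > 2].index
retained = ratings[ratings["generation_id"].isin(retained_ids)]
print(f"Retained {len(retained_ids)} of {mean_scores.size} generations")
```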
Blind comparative human evaluation
Each participant rated language generated within the same domain, and within the same condition in said domain. A subset of 10 of the 20 LLM-generated plans or explanations was shuffled with all 10 human plans. We divided the 20 LLM generations into two sets of 10 in advance. Each participant hence saw 20 total (10 human, 10 LLM) generated plans or explanations, depending on the domain assigned, for 3 separate goals or scenarios, within the same condition. Participants had a 15 second break after rating all language for a single goal or scenario, before continuing to the next. In the planning domain, annotators were asked: "How good is this plan overall? Assign it a single score that summarizes how good it is for this goal," and in the explanation domain: "How good is this explanation overall? Assign it a single score that summarizes how good it is for this scenario." Ratings were marked on a 7-point Likert scale as noted in Part I.

Prompting GPT-3
We next expand on the prompting regime used to glean plans and explanations from GPT-3 (Brown et al., 2020).
As an example, to generate plans for the goal "Cool down in a record-breaking heat wave, without using an air conditioner," we sample N=15 of the remaining 27 goals within the same condition, and for each of these N goals, we randomly sample one of the 10 human-generated plans written to achieve that goal. We concatenate these goal:plan pairs in the following format:

Goal: Help your local town mayor win re-election, without using billboards.
Plan: "Organize a team of people to campaign door-to-door. We can also print flyers to pass out on the streets and put on cars. A Facebook ad would be useful, as well as a radio interview if we can set one up. Finally, booking a debate with his opponent would help."
Goal: Build a float to dazzle the crowd at the Macy's Day Parade, without using a trailer.
Plan: "Research different types of floats that are seen in the Macy's Day Parade to see if any are used without a trailer."
Goal: Order food in a restaurant, where you don't speak the native language, without using a translator app.
Plan: "The plan would involve using gesture and pointing at items on the menu to describe what you would like to order. You could also find someone who speaks your language and persuade them to order/translate for you."
[...]
Goal: Jump over a six foot tall man, without using a trampoline.
Plan: "Fashion a catapult using a resistance band and a tree trunk, and launch yourself over the man."
Goal: Build a bookshelf, without using wood.
Plan: "You need to go and purchase cinder blocks and sheets of plastic. Stack the cinder blocks as a base. You should do 3 rows for stability. You can then use the plastic sheets for shelves and the backing."
Goal: Stop your canoe from falling down the waterfall, without using a paddle.
Plan: "1. Survey area for other things to grab onto. 2. Use hands in the water as makeshift paddle to maneuver your way to the possible objects (rocks, tree branches, etc). 3. Reach out to and grab onto said objects."
[...]
Goal: Fix a flat tire, without using a spare tire.
Plan: "Call roadside assistance or someone you know and have them bring a tire that fits your car. Second option is to call a towing company and have your car towed to a tire shop."
Goal: Escape quicksand, without using a branch.
Plan: "I would

Note, human-generated plans are couched in quotations and always end with a punctuation mark; periods were added to any human-generated plan that ended with a letter instead. Newlines demarcated these goal:plan pairs. We therefore queried GPT-3 by starting on an open parenthesis and used “."\n” as our stop token. A comparable process was employed for the explanation domain, with the modification that the number of seed prompts was set to N=12 due to context window constraints (explanation stimuli tended to be longer than the goals used in the planning domain). Moreover, GPT-3 was instead seeded with pairs of the form: "Scenario:" and "Explanation: This could have happened because".
GPT-3 was prompted using a temperature of 0.5 and queried for a maximum of 300 tokens per rollout. We define "rollout" as a single forward generation from the LLM. We set the maximum number of tokens to be greater than the most tokens used by any human participant, to ensure that GPT-3 was given a fair chance to generate a sufficiently long plan or explanation. We collected 30 plans and explanations per linguistic prompt. We manually removed rollouts which included derogatory language and regenerated as needed to ensure we had 30 generations per prompt. Generations which did not end on a period, but reached the maximum number of tokens, were spliced to end on the last period. If no period was generated in the rollout, the generation was discarded and re-generated. As there is the possibility that such parsing may have impacted the semantics, in addition to broadly degenerate, un-humanlike behavior potentially impacting ratings, we reasoned that our prescreen stage would naturally ensure that only the most human-like language was maintained, in order to give GPT-3 the fairest chance possible in our later evaluations.
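To make this sampling loop concrete, here is a minimal sketch using the legacy openai completions client. The engine name ("davinci") and the goal_to_plans dictionary are our assumptions; the temperature (0.5), token cap (300), and stop token follow the description above.

```python
import random
import openai  # legacy (<1.0) client; assumes OPENAI_API_KEY is set in the environment

def build_prompt(target_goal, goal_to_plans, n_seed=15):
    """Concatenate goal:plan seed pairs, then end with the target goal."""
    seeds = random.sample([g for g in goal_to_plans if g != target_goal], n_seed)
    pairs = [f'Goal: {g}\nPlan: "{random.choice(goal_to_plans[g])}"' for g in seeds]
    # End on an open quotation so GPT-3 completes the plan itself.
    return "\n".join(pairs) + f'\nGoal: {target_goal}\nPlan: "'

def sample_rollout(target_goal, goal_to_plans):
    response = openai.Completion.create(
        engine="davinci",      # engine name is an assumption
        prompt=build_prompt(target_goal, goal_to_plans),
        temperature=0.5,
        max_tokens=300,        # above the longest human response
        stop='."\n',           # generation ends when the quoted plan closes
    )
    return response.choices[0].text
```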
Statistical analyses
We include the full syntax for all linear regression models employed to run the statistical tests conducted in Part I:
• Within-group (within humans, and within LLM) sensitivity to constraints: score ~ condition + (1|raterid) + (1|prompt)
• Between-group (humans vs. LLM) sensitivity to constraints: score ~ (source * condition) + source + condition + (1|raterid) + (1|prompt)
• Robustness to prompt typicality: score ~ (source * typicality_score) + source + typicality_score + (1|raterid) + (1|prompt)
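The formulas above use lme4-style notation with crossed random intercepts for rater and prompt. A minimal sketch of fitting the between-group model in Python with statsmodels, where the crossed intercepts enter as variance components over a single trivial group (the column names, and this particular encoding of the crossed structure, are our assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed columns: score, source, condition, raterid, prompt.
df = pd.read_csv("goodness_ratings.csv")

# Crossed random intercepts (1|raterid) + (1|prompt) are expressed as
# variance components within one all-encompassing group.
df["group"] = 1
vc = {"raterid": "0 + C(raterid)", "prompt": "0 + C(prompt)"}

# source * condition expands to both main effects plus their interaction,
# matching the (source*condition) + source + condition specification.
model = smf.mixedlm("score ~ source * condition", data=df,
                    groups="group", vc_formula=vc, re_formula="0")
print(model.fit().summary())
```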
S2. Part II Supplemental Details
We next clarify the stimuli used, models compared, and evaluation set-up employed in Part II: Integrating language with structured reasoning models. Associated code can be found under the "Part II" directory of our repository.

Set-up and stimuli creation
We design a problem setting to mimic the planning domain of Part I; however, here, we need to design tasks such that we can exactly verify whether a plan successfully solved a goal. The open-ended nature of the planning problems of Part I prohibits this degree of control, which is a necessary step to design a solid testbed to compare models. To that end, we construct a synthetic grammar, over both goals and actions, that can be directly mapped into formal predicates for goals, and PDDL actions. This enables us to execute generated plans and evaluate success, as discussed in the section entitled "Plan simulation environment".

Possible actions in our grammar include: stack, unstack, and stackfromtable. Example initial configurations and goals took the form of those shown in Fig. 4 of the main text. Configurations entail stacking problems, where each stacking problem includes a random set of N=4 items selected from a pre-defined vocabulary of everyday household items (e.g., "plate", "keyboard").
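A minimal sketch of how such stacking configurations could be sampled and rendered into "Initially:"/"Goal:" language, assuming a small illustrative vocabulary (the actual grammar and vocabulary live in the repository):

```python
import random

VOCAB = ["plate", "keyboard", "notebook", "tissue box", "tablet", "writing pad"]

def sample_configuration(n_items=4):
    """Sample N items and partition them into random stacks on the table."""
    items = random.sample(VOCAB, n_items)
    stacks, i = [], 0
    while i < len(items):
        k = random.randint(1, len(items) - i)  # height of the next stack
        stacks.append(items[i:i + k])
        i += k
    return stacks

def describe(stacks):
    """Render a configuration as sentences, bottom of each stack to top."""
    lines = []
    for stack in stacks:
        lines.append(f"The {stack[0]} rests on the table.")
        for below, above in zip(stack, stack[1:]):
            lines.append(f"The {above} is on the {below}.")
        lines.append(f"There is nothing on the {stack[-1]}.")
    return " ".join(lines)

random.seed(0)
config = sample_configuration()
print("Initially:", describe(config))
# In the real grammar, goals are sampled as predicates; one simple example:
print(f"Goal: There is nothing on the {config[0][0]}.")
```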
We generate 100 test configurations using this grammar, and for each configuration, build three increasingly constrained settings, as discussed in Part II. Constraints are formed by sampling one of the constraints that fully specifies the goal condition and adding this to the goal. For every object mentioned in additional constraints (we exclude the first constraint), we swap the object with an out-of-distribution object (where out-of-distribution is defined with respect to what may not usually be found on a household table; e.g., "meteorite", "corduroy pants") to inject a dimension of atypicality to mirror stimuli used in Part I. A listing of all stimuli used can be found in our repository.

Prompting
We next discuss how we prompt the LLMs used in this section. We employ an LLM-as-Planner (GPT-Neo (Black, Gao, Wang, Leahy, & Biderman, 2021)) which generates plans in natural language (NL) to directly mirror the LLM set-up used in Part I. Additionally, P+S relies on an LLM (Codex (Chen et al., 2021)) to parse the initial configurations and goal specification into formal predicates.

LLM-as-Planner
For the LLM-as-Planner model (a vanilla LLM prompted to produce an entire plan, consisting of a sequence of actions, given a linguistic goal), we construct a few-shot prompt analogous to those in Part I, consisting of a header with a set of "training" example (goal, plan) pairs separated by the same delimiter, and then ending in the desired goal for which the LLM should produce a plan. For all goals, we use a header consisting of the same sequence of n=3 (goal, plan) examples (which are disjoint from the goals we evaluate). A sample prompt containing these training examples can be found in the README within the "Part II" directory. Solutions are structured to follow "Actions:", and "Initially:" is used as the stop token, as it would indicate the start of a new planning problem. Generations were run with a temperature of 0.05; a minimal code sketch of this querying follows after the P+S prompting details below.

P+S prompting
The LLM used in the P+S model is only used as a "parser": it transduces linguistic goals into a symbolic program containing the formal environment predicates for this goal. Therefore, the LLM in this example is prompted with a header with a set of "training" example (goal, parsed goal program predicate) pairs separated by the same delimiter, and then ending in the desired goal for which the LLM should produce a parse. For all goals, we use a header consisting of the same sequence of n=3 (goal, parsed goal program predicate) examples, which are drawn from the same set of training examples as in the LLM-as-Planner baseline prompt (but these are shown only with goal parses, not complete plans).
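A minimal sketch of the LLM-as-Planner querying, assuming Hugging Face transformers and the 1.3B GPT-Neo checkpoint (the checkpoint size and generation length are our assumptions; the "Actions:" structure, "Initially:" stop string, and temperature of 0.05 follow the text above):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")

def plan_with_llm(header_examples, initially, goal):
    """header_examples: n=3 pre-formatted (goal, plan) training blocks."""
    prompt = "\n".join(header_examples)
    prompt += f"\nInitially: {initially}\nGoal: {goal}\nActions:"
    out = generator(prompt, do_sample=True, temperature=0.05,
                    max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    # "Initially:" signals the start of a new planning problem, so truncate there.
    return out.split("Initially:")[0].strip()
```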
An open parenthesis cues Codex to start generating PDDL, and ";" is used as a stop token. A sample prompt can be found in the same README.
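A corresponding sketch of the P+S goal parser, again with the legacy openai completions client. The engine name ("code-davinci-002"), the "Parse:" label, and the temperature are our assumptions; the open-parenthesis cue and ";" stop token follow the text above.

```python
import openai  # legacy (<1.0) client

def parse_goal(header_examples, goal):
    """header_examples: n=3 pre-formatted (goal, parsed predicate) blocks."""
    prompt = "\n".join(header_examples)
    prompt += f"\nGoal: {goal}\nParse: ("  # open parenthesis cues a PDDL-style predicate
    response = openai.Completion.create(
        engine="code-davinci-002",  # engine name is an assumption
        prompt=prompt,
        temperature=0.0,
        max_tokens=64,
        stop=";",                   # ";" terminates the parsed predicate
    )
    return "(" + response.choices[0].text
```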
Plan simulation environment
Our set-up consists of an LLM-as-Planner generating plans in natural language, and our P+S model generating plans as programs (e.g., PDDL). We therefore needed to design a scheme to compare such natural language against plans, namely, our "Plan simulation environment".
Ourset-upautomatical lyparsesLLM-generate dlanguageintoaprogra musingoursyntheticgr ammar.Bothprogramsar ethenevaluatedastowh ethertheyachievedthe goal.Planningsuccess iscodedinabinary(1=success,0=fail)fashion.Ifthego alstateisnotreached, thentheplanisdeemedt ohavefailed.Statisti calanalysesThestatis ticalanalysesruninPa rtIIusedthefollowing syntax:•Between-group(LLM-as -Plannervs.P+S)perfor-mance:succe ed∼method+(1|id)•Between-group(LLM-as -Plannervs.P+S)sen-sitivitytocons traints:succeed∼(method*constraints) +method+constraints+(1|id)Weconductedaninit ialanalysis(succeed∼constraints+(1|id))toseeifbothmodel swereimpactedbyconst raints,akintoPartI;wefindthattheLLM-as-Plan nerandP+Sareindeedbothimpact edbyconstraints(p< ;0.001). Ourset-upautomatical lyparsesLLM-generate dlanguageintoaprogra musingoursyntheticgr ammar.Bothprogramsar ethenevaluatedastowh ethertheyachievedthe goal.Planningsuccess iscodedinabinary(1=success,0=fail)fashion.Ifthego alstateisnotreached, thentheplanisdeemedt ohavefailed.Statisti calanalysesThestatis ticalanalysesruninPa rtIIusedthefollowing syntax:•Between-group(LLM-as -Plannervs.P+S)perfor-mance:succe ed∼method+(1|id)•Between-group(LLM-as -Plannervs.P+S)sen-sitivitytocons traints:succeed∼(method*constraints) +method+constraints+(1|id)Weconductedaninit ialanalysis(succeed∼constraints+(1|id))toseeifbothmodel swereimpactedbyconst raints,akintoPartI;wefindthattheLLM-as-Plan nerandP+Sareindeedbothimpact edbyconstraints(p< ;0.001). 0.11
Qualitative analysis of failure modes
We found that the LLM-as-Planner was able to generate valid actions, but not consistently valid plans. For instance, when prompted with the following initial configuration and goal:

Initially: The writing pad rests on the table. The notebook is on the writing pad. The tissue box is on the notebook. There is nothing on the tissue box. The tablet rests on the table. There is nothing on the tablet.
Goal: There is nothing on the notebook.

the LLM generated the plan "Move the tablet onto the notebook.", a direct violation of the goal constraint; therefore, an invalid plan.
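To illustrate how the plan simulation environment codes this failure, here is a minimal sketch of a stacking-world executor and goal check, hard-coding the example above. The real environment executes parsed PDDL; treating a move onto a non-clear object as legal is our simplifying assumption.

```python
# World state: maps each object to its support ("table" or another object).
state = {
    "writing pad": "table",
    "notebook": "writing pad",
    "tissue box": "notebook",
    "tablet": "table",
}

def is_clear(state, obj):
    """True if nothing rests on obj."""
    return all(support != obj for support in state.values())

def move(state, obj, dest):
    """A stack/unstack/stackfromtable-style action: obj must be clear to move."""
    if not is_clear(state, obj):
        raise ValueError(f"cannot move {obj}: something rests on it")
    state[obj] = dest

# Goal: "There is nothing on the notebook."
move(state, "tablet", "notebook")                      # the generated action executes...
print("Goal satisfied:", is_clear(state, "notebook"))  # ...but prints False: plan coded 0
```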
References
Black, S., Gao, L., Wang, P., Leahy, C., & Biderman, S. (2021, March). GPT-Neo: Large scale autoregressive language modeling with Mesh-Tensorflow. Zenodo. Retrieved from https://doi.org/10.5281/zenodo.5297715 doi: 10.5281/zenodo.5297715
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., ... Amodei, D. (2020). Language models are few-shot learners. CoRR, abs/2005.14165. Retrieved from https://arxiv.org/abs/2005.14165
Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., ... Zaremba, W. (2021). Evaluating large language models trained on code. CoRR, abs/2107.03374. Retrieved from https://arxiv.org/abs/2107.03374