Fugu-MT 論文翻訳(概要): Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

論文の概要: Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

arxiv url: http://arxiv.org/abs/2312.04474v1
Date: Thu, 7 Dec 2023 17:51:43 GMT
ステータス: 翻訳完了
システム内更新日: 2023-12-08 14:00:05.859475
Title: Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
Title（参考訳）: コードの連鎖:言語モデル拡張コードエミュレータによる推論
Authors: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
Abstract要約: 言語モデル(LM)はコード記述を活用して思考の連鎖推論を改善する。我々は、LMコード駆動推論を改善するシンプルな、そして驚くほど効果的な拡張であるChain of Code (CoT)を提案する。
参考スコア（独自算出の注目度）: 119.0018170558366
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter -- we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks, but also for linguistic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for "detect_sarcasm(string)" that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they are used not only to write the code, but also to selectively "emulate" the interpreter by generating the expected output of "detect_sarcasm(string)" and other lines of code (e.g., that the interpreter could not compile). In this work, we propose Chain of Code (CoT), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format linguistic sub-tasks in a program as flexible pseudocode that the compiler can explicitly catch undefined behaviors and hand off to simulate with an LM (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. CoT scales well with large and small models alike, and broadens the scope of reasoning questions that LMs can correctly answer by "thinking in code". Project webpage: https://chain-of-code.github.io/.
Abstract（参考訳）: Codeは、複雑なプログラムを構築し、コードインタプリタとペアになったときに正確な計算を行うための一般的な構文構造を提供します。 LMはインタプリタで実行できる"detect_sarcasm(string)"の実装を書くのに苦労するかもしれません(エッジケースの処理は不要でしょう)。しかし、LMはコードの記述だけでなく、"detect_sarcasm(string)"や他のコード行(例えば、インタプリタがコンパイルできない)の出力を生成することで、インタプリタを選択的に"エミュレート"するためにも有効なソリューションを生成することができる。本研究では,LMコード駆動推論を改善するシンプルな,驚くほど効果的な拡張であるChain of Code (CoT)を提案する。キーとなる考え方は、LMがプログラム内の言語サブタスクをフレキシブルな擬似コードとしてフォーマットすることを奨励し、コンパイラが明示的に定義されていない振る舞いをキャッチし、LMでシミュレートする("LMulator")ことである。さまざまなベンチマークにおいて、Chain of CodeがChain of Thoughtやその他のベースラインよりも優れており、BIG-Bench Hardでは、Chain of Codeが84%、Chain of Thoughtよりも12%向上している。 CoTは、大小のモデルと同様の規模でスケールし、LMが「コードを考える」ことで正しく答えられるような推論可能な質問の範囲を広げます。プロジェクトWebページ: https://chain-of-code.github.io/.com

関連論文リスト

CodeSimpleQA: Scaling Factuality in Code Large Language Models [55.705748501461294]
本稿では,コード関連質問への回答において,LLMの実際の精度を評価するための総合的なベンチマークであるCodeSimpleQAを提案する。また,66万サンプルの大規模インストラクションコーパスであるCodeSimpleQA-Instructを作成し,教師付き微調整と強化学習を組み合わせたポストトレーニングフレームワークを開発した。
論文参考訳（メタデータ） (2025-12-22T14:27:17Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
本稿では,Code LLMの命令追従能力を改善するために,前方および後方制約生成を提案する。 IFEvalCodeは、7つのプログラミング言語の1.6Kテストサンプルからなる多言語ベンチマークである。
論文参考訳（メタデータ） (2025-07-30T08:08:48Z)
Isolating Language-Coding from Problem-Solving: Benchmarking LLMs with PseudoEval [7.33924106492889]
既存のコード生成ベンチマークは、大規模言語モデルのエンドツーエンドのパフォーマンスを研究するために設計されている。我々は擬似コードで書かれたソリューションを入力として提供する多言語コード生成ベンチマークであるPseudoEvalを構築した。本研究は,プログラミング言語間で問題解決能力が伝達される可能性を示し,言語符号化には言語固有の取り組みが必要であることを示唆する。
論文参考訳（メタデータ） (2025-02-26T14:08:17Z)
CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
CRUXEVAL-Xコード推論ベンチマークには19のプログラミング言語が含まれている。各言語に対して少なくとも600人の被験者で構成され、合計19Kのコンテンツ一貫性テストがある。 Pythonでのみトレーニングされたモデルでさえ、他の言語で34.4%のPass@1を達成することができる。
論文参考訳（メタデータ） (2024-08-23T11:43:00Z)
What can Large Language Models Capture about Code Functional Equivalence? [24.178831487657945]
SeqCoBenchは、コード-LLMがコード関数同値をキャプチャする方法を評価するベンチマークである。我々は,SeqCoBenchにおける意味論的に等価なプログラムと異なるプログラムのペアを識別できるかどうかを,最先端(Code-)LLMで評価する。
論文参考訳（メタデータ） (2024-08-20T11:19:06Z)
Case2Code: Learning Inductive Reasoning with Synthetic Data [105.89741089673575]
プログラムの表現性と正確性を利用したtextbfCase2Code タスクを提案する。まず、合成したCase2Codeタスクにおける代表LLMを評価し、LLMにおいてケース・ツー・コード誘導が困難であることを実証する。実験結果から,このような帰納的学習は,Case2Codeの性能だけでなく,学習用LLMの各種符号化能力の向上にも寄与することがわかった。
論文参考訳（メタデータ） (2024-07-17T11:35:00Z)
Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
本稿では, MANGO (comMents As Natural loGic pivOts) を提案する。その結果、MANGOは強いベースラインに基づいてコードパス率を大幅に改善することがわかった。論理的なコメントの復号化戦略の堅牢性は、考えの連鎖よりも顕著に高い。
論文参考訳（メタデータ） (2024-04-11T08:30:46Z)
Language Models as Compilers: Simulating Pseudocode Execution Improves Algorithmic Reasoning in Language Models [17.76252625790628]
本稿では,言語モデルの推論過程を2段階に分解するフレームワークであるThink-and-Executeについて述べる。 7つのアルゴリズム的推論タスクについて広範な実験を行い、思考と実行の有効性を実証する。
論文参考訳（メタデータ） (2024-04-03T08:49:11Z)
CodeMind: A Framework to Challenge Large Language Models for Code Reasoning [1.4027589547318842]
大規模言語モデル(LLM)のコード推論能力を評価するために設計されたフレームワークであるCodeMindを紹介する。 CodeMindは、Independent Execution Reasoning (IER)、Dependent Execution Reasoning (DER)、Specification Reasoning (SR)の3つのコード推論タスクをサポートしている。
論文参考訳（メタデータ） (2024-02-15T02:24:46Z)
Code Prompting Elicits Conditional Reasoning Abilities in Text+Code LLMs [65.2379940117181]
自然言語の問題をコードに変換する一連のプロンプトであるコードプロンプトを導入します。コードプロンプトは複数のLLMに対して高速に向上することがわかった。 GPT 3.5を解析した結果,入力問題のコードフォーマッティングが性能向上に不可欠であることが判明した。
論文参考訳（メタデータ） (2024-01-18T15:32:24Z)
Code Prompting: a Neural Symbolic Method for Complex Reasoning in Large Language Models [74.95486528482327]
コードプロンプト(code prompting)は、ゼロショットバージョンと少数ショットバージョンの両方を持ち、中間ステップとしてコードをトリガーするニューラルシンボルプロンプトである。我々は,記号的推論と算術的推論を含む7つの広く使用されているベンチマーク実験を行った。
論文参考訳（メタデータ） (2023-05-29T15:14:09Z)
PAL: Program-aided Language Models [112.94785609781503]
自然言語問題を理解するために,プログラム支援言語モデル(PaL)を提案する。 PaLはソリューションステップをPythonインタプリタのようなプログラムランタイムにオフロードする。私たちは12のベンチマークで新しい最先端の結果を設定しました。
論文参考訳（メタデータ） (2022-11-18T18:56:13Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。