Fugu-MT Paper Translation (Abstract): $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

Paper overview: $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles

arxiv url: http://arxiv.org/abs/2511.01340v1
Date: Mon, 03 Nov 2025 08:42:59 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:27.179209
Title: $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles
Title (reference translation): $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$: A Large and Diverse Multimodal Benchmark for evaluating the ability of Vision-Language Models to understand Rebus Puzzles
Authors: Trishanu Das, Abhilash Nandy, Khush Bajaj, Deepiha S,
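Abstract summary: Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills, such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, and image-based wordplay. $RebusDescProgICE$ is a model-agnostic framework that uses a combination of an unstructured description and code-based, structured reasoning, along with better, reasoning-based in-context example selection.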
Reference score (independently computed attention metric): 2.1040348692366426
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, image-based wordplay, etc., making this a challenging task for even current Vision-Language Models. In this paper, we present $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of $1,333$ English Rebus Puzzles containing different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, entertainment, etc. We also propose $RebusDescProgICE$, a model-agnostic framework which uses a combination of an unstructured description and code-based, structured reasoning, along with better, reasoning-based in-context example selection, improving the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by $2.1-4.1\%$ and $20-30\%$ using closed-source and open-source models respectively compared to Chain-of-Thought Reasoning.
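Abstract (reference translation): Understanding Rebus Puzzles (Rebus Puzzles use pictures, symbols, and letters to represent words or phrases creatively) requires a variety of skills, such as image recognition, cognitive skills, commonsense reasoning, multi-step reasoning, and image-based wordplay. In this paper, we introduce $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$, a large and diverse benchmark of $1,333$ English Rebus Puzzles covering different artistic styles and levels of difficulty, spread across 18 categories such as food, idioms, sports, finance, and entertainment. $RebusDescProgICE$ is a model-agnostic framework that combines an unstructured description with code-based, structured reasoning and better, reasoning-based in-context example selection; compared to Chain-of-Thought Reasoning, it improves the performance of Vision-Language Models on $\left|\,\circlearrowright\,\boxed{\text{BUS}}\,\right|$ by $2.1-4.1\%$ with closed-source models and by $20-30\%$ with open-source models.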

Related papers