Fugu-MT Paper Translation (Abstract): OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs

Paper summary: OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs

arxiv url: http://arxiv.org/abs/2506.20960v2
Date: Sun, 29 Jun 2025 15:16:22 GMT
Status: Translation complete
Last updated in system: 2025-07-01 13:01:42.731489
Title: OmniEval: A Benchmark for Evaluating Omni-modal Models with Visual, Auditory, and Textual Inputs
Authors: Yiman Zhang, Ziheng Luo, Qiangyu Yan, Wei He, Borui Jiang, Xinghao Chen, Kai Han
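Abstract summary: OmniEval is a benchmark for evaluating omni-modality models. Its evaluation tasks are designed to highlight the strong coupling between audio and video. Experiments on OmniEval are conducted with several omni-modality models.
Reference score (independently computed attention score): 19.214764707089884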
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce OmniEval, a benchmark for evaluating omni-modality models like MiniCPM-O 2.6, which encompasses visual, auditory, and textual inputs. Compared with existing benchmarks, OmniEval has several distinctive features: (i) Full-modal collaboration: we design evaluation tasks that highlight the strong coupling between audio and video, requiring models to effectively leverage the collaborative perception of all modalities; (ii) Diversity of videos: OmniEval includes 810 audio-visual synchronized videos, of which 285 are Chinese and 525 are English; (iii) Diversity and granularity of tasks: OmniEval contains 2617 question-answer pairs, comprising 1412 open-ended questions and 1205 multiple-choice questions. These questions are divided into 3 major task types and 12 sub-task types to achieve comprehensive evaluation. Among them, we introduce a more granular video localization task named Grounding. We then conduct experiments on OmniEval with several omni-modality models. We hope that OmniEval can provide a platform for evaluating the ability to construct and understand coherence from the context of all modalities. Code and data can be found at https://omnieval-benchmark.github.io/.


Related papers