Fugu-MT Paper Translation (Abstract): VisionArena: 230K Real World User-VLM Conversations with Preference Labels

Paper summary: VisionArena: 230K Real World User-VLM Conversations with Preference Labels

arxiv url: http://arxiv.org/abs/2412.08687v2
Date: Fri, 13 Dec 2024 23:12:23 GMT
Status: Translation complete
Last updated in system: 2024-12-17 13:40:10.424653
Title: VisionArena: 230K Real World User-VLM Conversations with Preference Labels
Title (reference translation): VisionArena: 230K real-world user-VLM conversations with preference labels
Authors: Christopher Chou, Lisa Dunlap, Koki Mashita, Krishna Mandal, Trevor Darrell, Ion Stoica, Joseph E. Gonzalez, Wei-Lin Chiang
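Abstract summary: VisionArena is a dataset of 230K real-world conversations between users and vision-language models (VLMs). The dataset spans 73K unique users, 45 VLMs, and 138 languages. Open-ended tasks such as captioning and humor are highly style-dependent, and current VLMs struggle with spatial reasoning and planning tasks.
Reference score (attention score computed independently by this site): 68.11192349083832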
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the growing adoption and capabilities of vision-language models (VLMs) comes the need for benchmarks that capture authentic user-VLM interactions. In response, we create VisionArena, a dataset of 230K real-world conversations between users and VLMs. Collected from Chatbot Arena - an open-source platform where users interact with VLMs and submit preference votes - VisionArena spans 73K unique users, 45 VLMs, and 138 languages. Our dataset contains three subsets: VisionArena-Chat, 200k single and multi-turn conversations between a user and a VLM; VisionArena-Battle, 30K conversations comparing two anonymous VLMs with user preference votes; and VisionArena-Bench, an automatic benchmark of 500 diverse user prompts that efficiently approximate the live Chatbot Arena model rankings. Additionally, we highlight the types of question asked by users, the influence of response style on preference, and areas where models often fail. We find open-ended tasks like captioning and humor are highly style-dependent, and current VLMs struggle with spatial reasoning and planning tasks. Lastly, we show finetuning the same base model on VisionArena-Chat outperforms Llava-Instruct-158K, with a 17-point gain on MMMU and a 46-point gain on the WildVision benchmark. Dataset at https://huggingface.co/lmarena-ai
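Abstract (reference translation): As vision-language models (VLMs) grow in adoption and capability, there is an increasing need for benchmarks that accurately capture user-VLM interactions. In response, we built VisionArena, a dataset of 230K real-world conversations between users and VLMs. Collected from Chatbot Arena, an open-source platform where users interact with VLMs and submit preference votes, VisionArena spans 73K unique users, 45 VLMs, and 138 languages. The dataset contains three subsets: VisionArena-Chat, 200K single-turn and multi-turn conversations between a user and a VLM; VisionArena-Battle, 30K conversations comparing two anonymous VLMs with user preference votes; and VisionArena-Bench, an automatic benchmark of 500 diverse user prompts. We also highlight the types of questions users ask, the influence of response style on preference, and the areas where models often fail. Open-ended tasks such as captioning and humor are highly style-dependent, and current VLMs struggle with spatial reasoning and planning tasks. Finally, we show that fine-tuning the same base model on VisionArena-Chat outperforms Llava-Instruct-158K, with a 17-point gain on MMMU and a 46-point gain on the WildVision benchmark. Dataset at https://huggingface.co/lmarena-ai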
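The abstract describes pairwise battles with preference votes but does not say how votes are aggregated into model rankings. As a rough illustration only, the sketch below ranks models by simple win rate, counting a tie as half a win for each side. The record layout `(model_a, model_b, winner)`, the model names, and the helper names `win_rates` and `ranking` are all hypothetical and are not taken from the dataset or from Chatbot Arena.

```python
from collections import defaultdict

# Hypothetical battle records: (model_a, model_b, winner),
# where winner is "a", "b", or "tie". Not the dataset's real schema.
battles = [
    ("vlm-x", "vlm-y", "a"),
    ("vlm-x", "vlm-z", "a"),
    ("vlm-y", "vlm-z", "b"),
    ("vlm-y", "vlm-x", "tie"),
]


def win_rates(records):
    """Return {model: score / battles played}, counting a tie as half a win."""
    score = defaultdict(float)
    played = defaultdict(int)
    for model_a, model_b, winner in records:
        played[model_a] += 1
        played[model_b] += 1
        if winner == "a":
            score[model_a] += 1.0
        elif winner == "b":
            score[model_b] += 1.0
        else:  # tie: split the point
            score[model_a] += 0.5
            score[model_b] += 0.5
    return {model: score[model] / played[model] for model in played}


def ranking(records):
    """Return model names sorted from highest to lowest win rate."""
    rates = win_rates(records)
    return sorted(rates, key=lambda model: rates[model], reverse=True)


print(ranking(battles))
```

Win rate ignores opponent strength, so a leaderboard built on real votes would more plausibly use a rating model such as Elo or Bradley-Terry; the simple version is used here only to keep the example short.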

Related papers

MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities [146.4724093405187]
MM-Vet v2 adds a new capability called "image-text sequence understanding". Benchmarking large multimodal models with MM-Vet v2 shows that Claude 3.5 Sonnet is the best model with a score of 71.8, slightly ahead of GPT-4o at 71.0.
Paper reference translation (metadata) (2024-08-01T17:59:54Z)
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
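Vision-language models (VLMs) perceive the world by combining a visual encoder with a large language model (LLM). Recent studies show that VLMs are vulnerable to hallucination. We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
Paper reference translation (metadata) (2024-07-18T12:11:12Z)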
WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences [122.87483437694706]
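WildVision-Arena (WV-Arena) is an online platform that collects human preferences to evaluate vision-language models (VLMs). WV-Bench compares each VLM against Claude-3-Sonnet and achieves a Spearman correlation of 0.94 with WV-Arena Elo. A comprehensive analysis of 20K real-world interactions yields important insights into the failure cases of top-performing VLMs.
Paper reference translation (metadata) (2024-06-16T20:53:25Z)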
An Introduction to Vision-Language Modeling [128.6223984157515]
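Applications of vision-language models (VLMs) will significantly affect our relationship with technology. We introduce what VLMs are, how they work, and how to train them. This work focuses mainly on mapping images to language, but also discusses extending VLMs to video.
Paper reference translation (metadata) (2024-05-27T15:01:23Z)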
Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation [31.062433484245684]
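Prometheus-Vision is an open-source VLM evaluator model that can understand user-defined score criteria during evaluation. Among open-source models, Prometheus-Vision shows the highest Pearson correlation with human evaluators and GPT-4V.
Paper reference translation (metadata) (2024-01-12T14:19:23Z)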
LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [75.9621305227523]
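We introduce LMSYS-Chat-1M. The dataset was collected from 210K IP addresses on the Vicuna demo and the Arena website. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions.
Paper reference translation (metadata) (2023-09-21T12:13:55Z)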
TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World [97.58623810402563]
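We introduce TikTalk, a video-based multi-modal dialogue dataset. We collected 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them. Users hold spontaneous conversations based on their multi-modal experience of the videos, which recreates a real-world chitchat context.
Paper reference translation (metadata) (2023-01-14T10:18:22Z)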

The related papers list is generated automatically from the titles and abstracts of papers on this site.
