Fugu-MT paper translation (abstract): HotelMatch-LLM: Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval

Paper overview: HotelMatch-LLM: Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval

arxiv url: http://arxiv.org/abs/2506.07296v1
Date: Sun, 08 Jun 2025 21:39:08 GMT
Status: Translation complete
Last updated in system: 2025-06-10 16:33:10.753348
Title: HotelMatch-LLM: Joint Multi-Task Training of Small and Large Language Models for Efficient Multimodal Hotel Retrieval
Title (reference translation): HotelMatch-LLM: Combined Multi-Task Learning of Small and Large Language Models for Multimodal Hotel Retrieval
Authors: Arian Askari, Emmanouil Stergiadis, Ilya Gusev, Moran Beladev,
Abstract summary: HotelMatch-LLM is a multimodal dense retrieval model for the travel domain that enables natural language property search. HotelMatch-LLM features three key innovations: (1) domain-specific multi-task optimization with three novel retrieval, visual, and language modeling objectives; (2) an asymmetrical dense retrieval architecture with a small language model (SLM) for efficient online query processing and a large language model (LLM) for embedding hotel data; and (3) extensive image processing to handle all property image galleries.
Reference score (independently computed attention score): 0.8608609778974488
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present HotelMatch-LLM, a multimodal dense retrieval model for the travel domain that enables natural language property search, addressing the limitations of traditional travel search engines, which require users to start with a destination and edit search parameters. HotelMatch-LLM features three key innovations: (1) Domain-specific multi-task optimization with three novel retrieval, visual, and language modeling objectives; (2) Asymmetrical dense retrieval architecture combining a small language model (SLM) for efficient online query processing and a large language model (LLM) for embedding hotel data; and (3) Extensive image processing to handle all property image galleries. Experiments on four diverse test sets show HotelMatch-LLM significantly outperforms state-of-the-art models, including VISTA and MARVEL. Specifically, on the test set -- main query type -- we achieve 0.681 for HotelMatch-LLM compared to 0.603 for the most effective baseline, MARVEL. Our analysis highlights the impact of our multi-task optimization, the generalizability of HotelMatch-LLM across LLM architectures, and its scalability for processing large image galleries.
Abstract (reference translation): This paper proposes HotelMatch-LLM, a multimodal dense retrieval model for the travel domain. HotelMatch-LLM features three key innovations: (1) domain-specific multi-task optimization with three novel retrieval, visual, and language modeling objectives; (2) an asymmetrical dense retrieval architecture with a small language model (SLM) for efficient online query processing and a large language model (LLM) for embedding hotel data; and (3) extensive image processing to handle all property image galleries. Experiments on four different test sets show that HotelMatch-LLM significantly outperforms state-of-the-art models such as VISTA and MARVEL. In particular, on the test set for the main query type, HotelMatch-LLM achieves 0.681, compared to 0.603 for MARVEL, the most effective baseline. The paper discusses the impact of multi-task optimization, the generalizability of HotelMatch-LLM across LLM architectures, and its scalability for processing large image galleries.
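
The asymmetrical dense retrieval idea in innovation (2) can be illustrated with a minimal sketch. It is not the authors' implementation: the two encoders below are toy hashed bag-of-words functions standing in for the SLM and the LLM, and the names `encode_query`, `encode_document`, `build_index`, `search`, and the dimension `DIM` are all hypothetical. The sketch only shows the general pattern, assuming a cheap encoder runs per query online, a costlier encoder embeds hotel descriptions once offline, and both map into one shared vector space compared by cosine similarity.

```python
import math
import zlib

DIM = 64  # shared embedding size (an assumption for this sketch)


def _hashed_bow(tokens, dim=DIM):
    """Map tokens to a fixed-size, L2-normalised count vector."""
    vec = [0.0] * dim
    for tok in tokens:
        # crc32 is deterministic across runs, unlike the built-in hash().
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def encode_query(text):
    """Cheap encoder: toy stand-in for the small model used online."""
    return _hashed_bow(text.lower().split())


def encode_document(text):
    """Costlier encoder: toy stand-in for the large model used offline."""
    words = text.lower().split()
    bigrams = [a + "_" + b for a, b in zip(words, words[1:])]
    return _hashed_bow(words + bigrams)


def build_index(hotels):
    """Embed every hotel description once, ahead of query time."""
    return {name: encode_document(desc) for name, desc in hotels.items()}


def search(query, index, top_k=2):
    """Rank hotels by cosine similarity (vectors are already unit length)."""
    q = encode_query(query)
    scored = [
        (sum(a * b for a, b in zip(q, d)), name) for name, d in index.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


if __name__ == "__main__":
    hotels = {
        "Seaside Inn": "quiet beach hotel with outdoor pool",
        "City Loft": "modern apartment near the train station",
    }
    print(search("hotel with pool near the beach", build_index(hotels)))
```

The design point is that only `encode_query` sits on the latency-critical path, so the document encoder can be arbitrarily expensive as long as both produce vectors of the same size.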

Related papers

CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
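Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. This paper proposes CoLLM, a one-stop framework that generates triplets on the fly from image-caption pairs. We leverage Large Language Models (LLMs) to generate embeddings of reference images and modification texts.
Paper reference translation (metadata) (2025-03-25T17:59:50Z)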
Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval [44.008094698200026]
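Cross-modal retrieval is gaining effectiveness and interest from the research community. This paper designs an approach that supports multimodal queries composed of both an image and text. Our model, ReT, uses multi-level representations extracted from different layers of both the visual and textual backbones.
Paper reference translation (metadata) (2025-03-03T19:01:17Z)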
Towards Text-Image Interleaved Retrieval [49.96332254241075]
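We introduce the Text-Image Interleaved Retrieval (TIIR) task, in which the query and the document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials and design a specific pipeline to generate interleaved queries. We propose a novel MME that compresses the number of visual tokens at different granularities.
Paper reference translation (metadata) (2025-02-18T12:00:47Z)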
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models [70.2997884478129]
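We introduce LLaVA-NeXT-Interleave, which simultaneously handles multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios in LMMs. We also curate the LLaVA-Interleave Bench to comprehensively evaluate the multi-image performance of LMMs.
Paper reference translation (metadata) (2024-07-10T17:59:43Z)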
Multi-Modal Generative Embedding Model [34.34876575183736]
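This paper proposes the Multi-Modal Generative Embedding Model (MM-GEM). For example, MM-GEM instantiated from ViT-Large and TinyLlama shows competitive performance on benchmarks for multimodal embedding models. The advanced text model of MM-GEM improves Recall@1 for long-text and image retrieval by more than 5%.
Paper reference translation (metadata) (2024-05-29T17:59:10Z)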
Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models. We find that COCO-style benchmarks need only 9 visual tokens to reach accuracy similar to that obtained with 576 tokens.
Paper reference translation (metadata) (2024-05-27T17:59:56Z)
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions [64.89284104414865]
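We introduce MagicLens, a self-supervised image retrieval model that supports open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web page contain a wide range of implicit relations. MagicLens achieves results comparable to the previous best on eight benchmarks of various image retrieval tasks.
Paper reference translation (metadata) (2024-03-28T17:59:20Z)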
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora [3.166549403591528]
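This paper proposes a two-stage Coarse-to-Fine Index-shared Retrieval (CFIR) framework for fast and efficient image retrieval. CFIR surpasses existing MLLMs by up to 11.06% in Recall@1000 while reducing training and retrieval times by 68.75% and 99.79%, respectively.
Paper reference translation (metadata) (2024-02-23T11:47:16Z)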
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models [97.40590590880144]
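We develop a series of Multimodality Large Language Models (MLLMs). We assemble a comprehensive dataset covering available resources in language, vision, and vision-language tasks. We obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities.
Paper reference translation (metadata) (2024-02-08T18:59:48Z)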
One-Shot Neural Ensemble Architecture Search by Diversity-Guided Search Space Shrinking [97.60915598958968]
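This paper proposes a one-shot neural ensemble architecture search (NEAS) solution that addresses these two challenges. For the first challenge, we introduce a novel diversity-based metric to guide search space shrinking. For the second challenge, we enable a new search dimension that learns layer sharing among different models to improve efficiency.
Paper reference translation (metadata) (2021-04-01T16:29:49Z)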

The related papers list is generated automatically from the titles and abstracts of papers on this site.
