Fugu-MT Paper Translation (Abstract): Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back

Paper overview: Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back

arxiv url: http://arxiv.org/abs/2507.18661v2
Date: Mon, 28 Jul 2025 04:30:19 GMT
Status: Translation complete
Last updated in system: 2025-07-29 12:09:50.633331
Title: Eyes Will Shut: A Vision-Based Next GPS Location Prediction Model by Reinforcement Learning from Visual Map Feed Back
Authors: Ruixing Zhang, Yang Zhang, Tongyu Zhu, Leilei Sun, Weifeng Lv
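Abstract summary: Next location prediction is a fundamental task in the study of human mobility. Recent developments in Vision-Language Models (VLMs) have demonstrated strong capabilities in visual perception and visual reasoning. The proposed approach, VLMLocPredictor, has two stages. In the first stage, two Supervised Fine-Tuning (SFT) tasks are designed to help the model understand road network and trajectory structures. In the second stage, Reinforcement Learning from Visual Map Feedback is introduced so that the model can self-improve its next-location prediction ability.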
Reference score (independently computed attention score): 25.50467870648379
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Next Location Prediction is a fundamental task in the study of human mobility, with wide-ranging applications in transportation planning, urban governance, and epidemic forecasting. In practice, when humans attempt to predict the next location in a trajectory, they often visualize the trajectory on a map and reason based on road connectivity and movement trends. However, the vast majority of existing next-location prediction models do not reason over maps in the way that humans do. Fortunately, the recent development of Vision-Language Models (VLMs) has demonstrated strong capabilities in visual perception and even visual reasoning. This opens up a new possibility: by rendering both the road network and trajectory onto an image and leveraging the reasoning abilities of VLMs, we can enable models to perform trajectory inference in a human-like manner. To explore this idea, we first propose a method called Vision-Guided Location Search (VGLS), which evaluates whether a general-purpose VLM is capable of trajectory-based reasoning without modifying any of its internal parameters. Based on insights from the VGLS results, we further propose our main approach, VLMLocPredictor, which is composed of two stages. In the first stage, we design two Supervised Fine-Tuning (SFT) tasks that help the VLM understand road network and trajectory structures and acquire basic reasoning ability on such visual inputs. In the second stage, we introduce Reinforcement Learning from Visual Map Feedback, enabling the model to self-improve its next-location prediction ability through interaction with the environment. Experiments conducted on datasets from four different cities show that our method achieves state-of-the-art (SOTA) performance and exhibits superior cross-city generalization compared to other LLM-based approaches.
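Illustrative sketch: the abstract describes rendering both the road network and the trajectory onto an image before handing it to a VLM. The Python below is a minimal sketch of that rendering step only, under stated assumptions: it uses a character grid instead of a real image, and the function names (`to_pixel`, `draw_segment`, `render_map`), the bounding-box convention, and the marker characters are all hypothetical choices made for illustration. It is not the paper's implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]                  # (longitude, latitude); assumed convention
Bounds = Tuple[float, float, float, float]   # (min_lon, min_lat, max_lon, max_lat)


def to_pixel(p: Point, bounds: Bounds, width: int, height: int) -> Tuple[int, int]:
    """Map a (lon, lat) point to a grid cell, with north at the top row."""
    min_lon, min_lat, max_lon, max_lat = bounds
    x = round((p[0] - min_lon) / (max_lon - min_lon) * (width - 1))
    y = round((max_lat - p[1]) / (max_lat - min_lat) * (height - 1))
    # Clamp so points slightly outside the bounds still land on the grid.
    return min(max(x, 0), width - 1), min(max(y, 0), height - 1)


def draw_segment(grid: List[List[str]], a: Tuple[int, int], b: Tuple[int, int], mark: str) -> None:
    """Mark every cell along the straight line from a to b by linear sampling."""
    (x0, y0), (x1, y1) = a, b
    steps = max(abs(x1 - x0), abs(y1 - y0), 1)
    for i in range(steps + 1):
        x = round(x0 + (x1 - x0) * i / steps)
        y = round(y0 + (y1 - y0) * i / steps)
        grid[y][x] = mark


def render_map(roads: List[Tuple[Point, Point]], trajectory: List[Point],
               bounds: Bounds, width: int, height: int) -> List[str]:
    """Draw roads as '#', earlier trajectory points as 'o', the latest point as '*'."""
    grid = [["."] * width for _ in range(height)]
    for start, end in roads:
        draw_segment(grid, to_pixel(start, bounds, width, height),
                     to_pixel(end, bounds, width, height), "#")
    for i, p in enumerate(trajectory):
        x, y = to_pixel(p, bounds, width, height)
        grid[y][x] = "*" if i == len(trajectory) - 1 else "o"
    return ["".join(row) for row in grid]


if __name__ == "__main__":
    # One east-west road along the southern edge and a two-point trajectory on it.
    rows = render_map([((0.0, 0.0), (4.0, 0.0))],
                      [(0.0, 0.0), (2.0, 0.0)],
                      (0.0, 0.0, 4.0, 4.0), 5, 5)
    print("\n".join(rows))
```

In a real pipeline the grid would presumably be replaced by an actual raster image (for example, drawn with an imaging library) so that a VLM can consume it. The character grid is used here only to keep the sketch dependency-free.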

Related papers