Fugu-MT Paper Translation (Abstract): Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

Paper overview: Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics

arxiv url: http://arxiv.org/abs/2512.15340v1
Date: Wed, 17 Dec 2025 11:37:35 GMT
Status: Translation complete
Last updated in system: 2025-12-18 17:06:26.964857
Title: Towards Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
Title (reference translation): Toward Seamless Interaction: Causal Turn-Level Modeling of Interactive 3D Conversational Head Dynamics
Authors: Junjie Chen, Fei Wang, Zhihao Huang, Qing Zhou, Kun Li, Dan Guo, Linfeng Zhang, Xun Yang
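Abstract summary: This paper presents TIMAR (Turn-level Interleaved Masked AutoRegression). It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set.
Reference score (independently computed attention score): 40.86039227407712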
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human conversation involves continuous exchanges of speech and nonverbal cues such as head nods, gaze shifts, and facial expressions that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes or rely on non-causal full-sequence modeling, hindering temporal coherence across turns. We present TIMAR (Turn-level Interleaved Masked AutoRegression), a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set, and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.
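Abstract (reference translation): Human conversation involves a continuous exchange of speech and nonverbal cues, such as head nods, gaze shifts, and facial expressions, that convey attention and emotion. Modeling these bidirectional dynamics in 3D is essential for building expressive avatars and interactive robots. However, existing frameworks often treat talking and listening as independent processes, or rely on non-causal full-sequence modeling that hinders temporal coherence across turns. TIMAR (Turn-level Interleaved Masked AutoRegression) is a causal framework for 3D conversational head generation that models dialogue as interleaved audio-visual contexts. It fuses multimodal information within each turn and applies turn-level causal attention to accumulate conversational history, while a lightweight diffusion head predicts continuous 3D head dynamics that capture both coordination and expressive variability. Experiments on the DualTalk benchmark show that TIMAR reduces Fréchet Distance and MSE by 15-30% on the test set and achieves similar gains on out-of-distribution data. The source code will be released in the GitHub repository https://github.com/CoderChen01/towards-seamleass-interaction.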
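
Illustration (not from the paper): the abstract gives no implementation details, so the following is only a minimal sketch of what "turn-level causal attention" could look like as an attention mask. It assumes each token carries an integer turn index; tokens may attend to every token in their own turn (fusion within a turn) and to all tokens of earlier turns (accumulated history), but never to later turns. The function name `turn_level_causal_mask` and the `turn_ids` layout are hypothetical and are not taken from the TIMAR code.

```python
def turn_level_causal_mask(turn_ids):
    """Return mask[i][j] == True when token i may attend to token j.

    Assumption (not from the paper): attention is bidirectional inside
    a turn and causal across turns, i.e. token i sees token j only if
    j belongs to the same turn or to an earlier one.
    """
    return [[tj <= ti for tj in turn_ids] for ti in turn_ids]


# Hypothetical layout: three turns with two tokens each.
turn_ids = [0, 0, 1, 1, 2, 2]
mask = turn_level_causal_mask(turn_ids)

# Print the mask as rows of 1 (visible) and 0 (hidden).
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

The printed mask is block lower-triangular: each turn forms a fully visible square block on the diagonal, and everything to the right of that block is hidden. A standard token-level causal mask would instead hide later tokens inside the same turn as well.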


Related papers