Fugu-MT Paper Translation (Abstract): Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Paper abstract: Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

arxiv url: http://arxiv.org/abs/2602.07106v1
Date: Fri, 06 Feb 2026 18:03:30 GMT
Status: Translation complete
Last updated in system: 2026-02-10 20:26:24.452397
Title: Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models
Title (reference translation): Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models
Authors: Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu
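Abstract summary: We propose Expressive Omni (Ex-Omni), an open-source framework that augments omni-modal large language models (OLLMs) with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation. InstructEx is a dataset intended to facilitate augmenting OLLMs with speech-accompanied 3D facial animation.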
Reference score (independently computed attention score): 31.79073190007222
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet incorporating speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. A key challenge arises from the representation mismatch between discrete, token-level semantic reasoning in LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion, which makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, leveraging speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We further introduce InstructEx, a dataset that aims to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments demonstrate that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.
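Abstract (reference translation): Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet combining speech with 3D facial animation remains largely unexplored despite its importance for natural interaction. The representation mismatch between the discrete, token-level semantic reasoning of LLMs and the dense, fine-grained temporal dynamics required for 3D facial motion makes direct modeling difficult to optimize under limited data. We propose Expressive Omni (Ex-Omni), an open-source omni-modal framework that augments OLLMs with speech-accompanied 3D facial animation. Ex-Omni reduces learning difficulty by decoupling semantic reasoning from temporal generation, using speech units as temporal scaffolding and a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection. We also introduce InstructEx, a dataset intended to facilitate augmenting OLLMs with speech-accompanied 3D facial animation. Extensive experiments show that Ex-Omni performs competitively against existing open-source OLLMs while enabling stable, aligned speech and facial animation generation.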
