Fugu-MT Paper Translation (Abstract): MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading

Paper overview: MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading

arxiv url: http://arxiv.org/abs/2601.20881v1
Date: Tue, 27 Jan 2026 09:19:18 GMT
Status: translation complete
System update date: 2026-01-30 16:22:49.345014
Title: MA-LipNet: Multi-Dimensional Attention Networks for Robust Lipreading
Authors: Matteo Rossi
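Abstract summary: Lipreading technology has significant application value in fields such as public security. Existing lipreading methods often suffer from limited feature discriminability and poor generalization. The paper proposes a new method named Multi-Attention Lipreading Network (MA-LipNet).
Reference score (independently computed attention score): 0.7276200658540084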
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Lipreading, the technology of decoding spoken content from silent videos of lip movements, holds significant application value in fields such as public security. However, because articulatory gestures are subtle, existing lipreading methods often suffer from limited feature discriminability and poor generalization. To address these challenges, this paper investigates the purification of visual features along the temporal, spatial, and channel dimensions. We propose a novel method named Multi-Attention Lipreading Network (MA-LipNet). The core of MA-LipNet is the sequential application of three dedicated attention modules. First, a Channel Attention (CA) module adaptively recalibrates channel-wise features, thereby mitigating interference from less informative channels. Subsequently, two spatio-temporal attention modules with distinct granularities, Joint Spatial-Temporal Attention (JSTA) and Separate Spatial-Temporal Attention (SSTA), suppress the influence of irrelevant pixels and video frames. The JSTA module performs coarse-grained filtering by computing a unified weight map across the spatio-temporal dimensions, while the SSTA module conducts a more fine-grained refinement by modeling temporal and spatial attention separately. Extensive experiments on the CMLR and GRID datasets demonstrate that MA-LipNet significantly reduces the Character Error Rate (CER) and Word Error Rate (WER), validating its effectiveness and superiority over several state-of-the-art methods. Our work highlights the importance of multi-dimensional feature refinement for robust visual speech recognition.
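
The abstract names three attention stages applied in sequence (CA, then JSTA, then SSTA) but does not give their internals. The following is a minimal NumPy sketch of that ordering only, not the authors' implementation. Everything specific in it is an assumption made for illustration: the squeeze-and-excitation style gate for CA, the sigmoid gating, the linear scoring vectors, the feature layout `(C, T, H, W)`, and all function and parameter names.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(x, w1, w2):
    """CA (assumed SE-style form): one gate per channel.

    x has shape (C, T, H, W); w1 is (C_hidden, C); w2 is (C, C_hidden).
    """
    squeezed = x.mean(axis=(1, 2, 3))         # (C,) global average per channel
    hidden = np.maximum(w1 @ squeezed, 0.0)   # (C_hidden,) ReLU bottleneck
    gate = sigmoid(w2 @ hidden)               # (C,) channel weights in (0, 1)
    return x * gate[:, None, None, None]


def joint_st_attention(x, v):
    """JSTA (assumed form): one unified weight map over (T, H, W)."""
    scores = np.einsum("c,cthw->thw", v, x)   # score each space-time position
    return x * sigmoid(scores)[None]          # broadcast over channels


def separate_st_attention(x, v_t, v_s):
    """SSTA (assumed form): temporal and spatial weights computed separately."""
    frame_desc = x.mean(axis=(2, 3))          # (C, T) one descriptor per frame
    t_weights = sigmoid(v_t @ frame_desc)     # (T,) one weight per frame
    pixel_desc = x.mean(axis=1)               # (C, H, W) one descriptor per pixel
    s_weights = sigmoid(np.einsum("c,chw->hw", v_s, pixel_desc))  # (H, W)
    return x * t_weights[None, :, None, None] * s_weights[None, None, :, :]


def ma_refine(x, params):
    """Apply the three stages in the order given in the abstract."""
    x = channel_attention(x, params["w1"], params["w2"])
    x = joint_st_attention(x, params["v_joint"])
    return separate_st_attention(x, params["v_t"], params["v_s"])


# Random stand-ins for learned parameters and visual features (hypothetical sizes).
rng = np.random.default_rng(0)
C, T, H, W = 8, 5, 4, 6
params = {
    "w1": rng.standard_normal((C // 2, C)) * 0.1,
    "w2": rng.standard_normal((C, C // 2)) * 0.1,
    "v_joint": rng.standard_normal(C),
    "v_t": rng.standard_normal(C),
    "v_s": rng.standard_normal(C),
}
features = rng.standard_normal((C, T, H, W))
refined = ma_refine(features, params)
```

Because every gate in this sketch lies between 0 and 1, each stage can only attenuate features, which matches the abstract's framing of suppressing less informative channels, pixels, and frames. The shape of the features is unchanged, so the refined tensor could be passed to whatever sequence decoder follows.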
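
The abstract reports results as Character Error Rate (CER) and Word Error Rate (WER) without defining them. A minimal sketch follows, assuming the usual edit-distance definition (substitutions, deletions, and insertions divided by the reference length) and assuming that spaces count as characters for CER; the paper's exact scoring convention is not stated here. The example sentences are made up for illustration and are not taken from CMLR or GRID.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word Error Rate over whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character Error Rate; spaces are counted as characters (assumption)."""
    return error_rate(list(reference), list(hypothesis))


# Made-up reference and hypothesis that differ in a single word.
reference = "bin blue at f two now"
hypothesis = "bin blue at s two now"
word_error = wer(reference, hypothesis)   # 1 substitution over 6 words
char_error = cer(reference, hypothesis)   # 1 substitution over 21 characters
```

Both metrics can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.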

Related papers