Fugu-MT paper translation (abstract): Transfer-learning for video classification: Video Swin Transformer on multiple domains

Paper abstract: Transfer-learning for video classification: Video Swin Transformer on multiple domains

arxiv url: http://arxiv.org/abs/2210.09969v1
Date: Tue, 18 Oct 2022 16:24:55 GMT
Status: translation complete
Last updated in system: 2022-10-19 13:03:46.962972
Title: Transfer-learning for video classification: Video Swin Transformer on multiple domains
Authors: Daniel Oliveira, David Martins de Matos
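Abstract summary: Video Swin Transformer (VST) is a pure-transformer model developed for video classification. The paper studies the performance of VST on two large-scale datasets.
Reference score (independently computed attention score): 0.609170287691728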
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The computer vision community has seen a shift from convolution-based to pure transformer architectures for both image and video tasks. Training a transformer from scratch for these tasks usually requires large amounts of data and computational resources. Video Swin Transformer (VST) is a pure-transformer model developed for video classification that achieves state-of-the-art accuracy and efficiency on several datasets. In this paper, we aim to understand whether VST generalizes well enough to be used in an out-of-domain setting. We study the performance of VST on two large-scale datasets, namely FCVID and Something-Something, using a transfer-learning approach from Kinetics-400, which requires around 4x less memory than training from scratch. We then break down the results to understand where VST fails the most and in which scenarios the transfer-learning approach is viable. Our experiments show an 85% top-1 accuracy on FCVID without retraining the whole model, which is equal to the state of the art for the dataset, and a 21% accuracy on Something-Something. The experiments also suggest that the performance of VST decreases on average as video duration increases, which seems to be a consequence of a design choice of the model. From the results, we conclude that VST generalizes well enough to classify out-of-domain videos without retraining when the target classes are of the same type as the classes used to train the model. We observed this effect when we performed transfer learning from Kinetics-400 to FCVID, where both datasets target mostly objects. On the other hand, if the classes are not of the same type, the accuracy after transfer learning is expected to be poor. We observed this effect when we performed transfer learning from Kinetics-400, where the classes represent mostly objects, to Something-Something, where the classes represent mostly actions.
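
The abstract reports top-1 accuracy and a breakdown of results by video duration. The following is a minimal sketch of how such a breakdown could be computed, not the authors' evaluation code. The function names, the bin edges, and the toy records are all hypothetical and chosen only for illustration; none of the values come from the paper.

```python
from collections import defaultdict


def top1_accuracy(predictions, labels):
    """Fraction of videos whose top-scoring predicted class equals the true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_by_duration(records, bin_edges):
    """Top-1 accuracy per duration bin.

    records:   iterable of (duration_seconds, predicted_class, true_class)
    bin_edges: ascending thresholds in seconds; bin i holds durations that are
               >= exactly i of the edges, so there are len(bin_edges) + 1 bins.
    """
    buckets = defaultdict(lambda: [0, 0])  # bin index -> [correct, total]
    for duration, predicted, true in records:
        idx = sum(duration >= edge for edge in bin_edges)
        buckets[idx][0] += int(predicted == true)
        buckets[idx][1] += 1
    return {idx: c / n for idx, (c, n) in sorted(buckets.items())}


# Toy records (made up for illustration, not results from the paper).
records = [
    (5, "dog", "dog"),
    (8, "cat", "cat"),
    (40, "dog", "cat"),
    (45, "car", "car"),
    (120, "cat", "dog"),
    (150, "car", "dog"),
]

overall = top1_accuracy([p for _, p, _ in records], [t for _, _, t in records])
per_bin = accuracy_by_duration(records, bin_edges=[30, 90])
print(overall)
print(per_bin)
```

With the assumed edges of 30 and 90 seconds, bin 0 holds videos shorter than 30 s, bin 1 holds 30 to 90 s, and bin 2 holds everything longer, which makes a downward trend across bins easy to spot.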


Related papers