Paper overview and license

# (Reference translation) Scene Consistency Representation Learning for Video Scene Segmentation [full translation available]

Scene Consistency Representation Learning for Video Scene Segmentation ( http://arxiv.org/abs/2205.05487v1 )

License: CC BY 4.0
Haoqian Wu, Keyu Chen, Yanan Luo, Ruizhi Qiao, Bo Ren, Haozhe Liu, Weicheng Xie, Linlin Shen

A long-term video, such as a movie or TV show, is composed of various scenes, each of which represents a series of shots sharing the same semantic story. Spotting the correct scene boundary from the long-term video is a challenging task, since a model must understand the storyline of the video to figure out where a scene starts and ends. To this end, we propose an effective Self-Supervised Learning (SSL) framework to learn better shot representations from unlabeled long-term videos. More specifically, we present an SSL scheme to achieve scene consistency, while exploring considerable data augmentation and shuffling methods to boost the model generalizability. Instead of explicitly learning the scene boundary features as in the previous methods, we introduce a vanilla temporal model with less inductive bias to verify the quality of the shot features. Our method achieves state-of-the-art performance on the task of Video Scene Segmentation. Additionally, we suggest a fairer and more reasonable benchmark to evaluate the performance of Video Scene Segmentation methods. The code is made available at https://github.com/TencentYoutuResearch/SceneSegmentation-SCRL.
Published: Wed, 11 May 2022 13:31:15 GMT

Note: The PDF is the original paper. The translation results are licensed under CC BY-SA 4.0; see the top page for details.

Full Text (extracted from the paper)

Scene Consistency Representation Learning for Video Scene Segmentation

Haoqian Wu 1,2,3,4∗, Keyu Chen 2∗, Yanan Luo 2, Ruizhi Qiao 2, Bo Ren 2, Haozhe Liu 1,3,4,5, Weicheng Xie 1,3,4†, Linlin Shen 1,3,4

1 Computer Vision Institute, Shenzhen University; 2 Tencent YouTu Lab; 3 Shenzhen Institute of Artificial Intelligence and Robotics for Society; 4 Guangdong Key Laboratory of Intelligent Information Processing; 5 KAUST

wuhaoqian2019@email.szu.edu.cn, {yolochen, ruizhiqiao, timren}@tencent.com, luoyanan93@gmail.com, haozhe.liu@kaust.edu.sa, {wcxie, llshen}@szu.edu.cn

∗ Equal Contribution; † Corresponding Author
[Figure 1: (a) Shot Perspective: a clip with ground-truth scene boundaries and Scene Consistency selection; (b) Scene Perspective: Scenes A, B, C, E within a clip; (c) Scene Consistency vs. Shot Consistency, with NN selection in the feature representation space.]
Figure 1. An illustration of representation learning methods from the shot-to-scene perspective. Several continuous shots are shown in Fig. 1(a), where existing SSL approaches obtain positive pairs from the adjacent shots (e.g., by performing Nearest Neighbor (NN) Selection [1]), while we propose to look further for scenes that are often crossed over, such as Scene A/C and Scene B/E shown in Fig. 1(b), where positive samples are explored in a broader region and the shots are clustered to the same scene in the feature representation space, i.e., Fig. 1(c). Best viewed in color.

1. Introduction

In the process of video creation, to make the story more compelling, the editor will use various editing techniques, such as montage, one shot to the end, etc. Quickly switching between stories and scenes makes the movie plot tighter, e.g., inserting outdoor battle scenes into indoor dialogue scenes, as shown in Fig. 1(b), making the scene transition more intriguing and unpredictable; thus the task of Video Scene Segmentation turns out to be rather challenging.
Hence, it is essential to understand the high-level semantic information of each scene in the long-term video.

There have been extensive studies dealing with video understanding tasks on datasets where the individual video clip is typically short, while it requires a lot of labor to segment uncurated videos into short videos by category. Although some studies focus on splitting the long video into smaller segments, e.g., the methods of Action Spotting [2–5] aim to locate the positions of the beginning and ending of the action, they are category-aware approaches. By contrast, Video Scene Segmentation is a category-agnostic task where only the scene boundary label is available, and it is very confusing to classify a scene fragment taxonomically.

Since a long-term video is inherently structured in a specific way, a sequence of frames can be divided into shots or scenes in terms of the granularity of semantics [6][7].
More specifically, a shot contains only continuous frames taken by the camera without interruption, and a scene is composed of successive shots and describes the same short story. For detecting shot boundaries, [8][7] split a video into many separate shots using lower-level visual context. Based on this, many mainstream approaches of Video Scene Segmentation [9][10][6][1] determine scene boundaries by exploring semantic correlations among the adjacent shots.

While computer vision tasks suffer from the high cost of manual annotation, Self-Supervised Learning (SSL) based methods [11–18] are proposed to train a general feature extractor using unlabeled data. By leveraging a small amount of annotated data for training, these SSL methods can achieve feature representations appealing enough to even rival some supervised learning methods. For Video Scene Segmentation, [1] proposes to narrow the feature representation distance of the most similar shot pair in a local region; it significantly surpasses the supervised learning method [6] by employing a mere MLP classifier. However, in current SSL methods for the task of Video Scene Segmentation, the strategy of positive sample selection, the pretraining protocol, the evaluation metric and the downstream model are not well discussed or addressed. To this end, we propose a self-supervised learning scheme to learn better representations, as well as an evaluation scheme for the task of Video Scene Segmentation. The contributions of this paper are summarized as follows:

• A representation learning scheme based on Scene Consistency is proposed to obtain better shot representations on the unlabeled long-term video.
• A simple yet effective temporal model with less inductive bias is proposed to assess the quality of the shot representation for the downstream Video Scene Segmentation task.
• A fairer and more reasonable benchmark is introduced for both pretraining and evaluation. More importantly, the proposed method outperforms the state-of-the-art methods under all the protocols, and can significantly improve the performance of existing supervised methods without bells and whistles.

2. Related Work

Self-Supervised Learning in Images and Videos. To address the problems of insufficient and expensive manual annotation, many approaches explore the inherent knowledge in unlabeled data by designing a lot of pretext tasks, including predicting the transformations of images, e.g., image rotation [20], inpainting [21], colorizing [22], jigsaw [23], etc. In short, these Self-Supervised Learning (SSL) methods use the information explored from the data themselves for the supervision. Recently, [11–18] introduce contrastive similarity metrics to learn invariant feature representations of various views augmented from the original image, where strong data augmentations [13] are frequently used in image-level SSL methods to improve the robustness of the learned representations. From another aspect, by finetuning the model with a small amount of labeled data, SSL methods can achieve competitive performance compared with supervised learning methods; furthermore, the pretrained model can be used in specific downstream tasks. For video-oriented SSL methods, [24–30] show appealing performance and potential on the task of video classification, while their positive pairs are selected from the adjacent clips within a same video. Meanwhile, most of these studies are based on short videos, and the quality of learned features is assessed based on video classification. Hence, it is meaningful to explore a suitable SSL scheme for tasks with long-term videos.

Video Shot Boundary Detection and Scene Segmentation. For Video Scene Segmentation, shot boundary detection is often conducted in advance, which is specified as a task of locating the transition positions in videos based on the similarity of the frames. 3D convolutional networks and color histogram differencing [8] are used to identify the transition boundaries. Based on the shot boundaries, [6] learns the local and global shot representations and utilizes them to split the continuous shots into scenes according to the transition of the story. More specifically, identification of each shot's segmentation point is treated as a binary classification, which is free to the location of the shot.
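To make that downstream formulation concrete, the following is a minimal sketch of treating each shot's segmentation point as a binary classification over a window of shot features. The window size, hidden width, and class/function names are illustrative assumptions, not the exact configuration of [6]:

```python
import torch
import torch.nn as nn

class WindowBoundaryClassifier(nn.Module):
    """Binary classifier: given a window of 2w shot features centered on a
    candidate boundary, predict whether a scene ends between shots w-1 and w."""
    def __init__(self, feat_dim=2048, window=2, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                           # (B, 2w, D) -> (B, 2w*D)
            nn.Linear(2 * window * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                   # logit for "boundary here"
        )

    def forward(self, window_feats):                # (B, 2w, feat_dim)
        return self.net(window_feats).squeeze(-1)   # (B,) boundary logits

if __name__ == "__main__":
    clf = WindowBoundaryClassifier(feat_dim=2048, window=2)
    print(clf(torch.randn(3, 4, 2048)).shape)       # torch.Size([3])
```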
[1] leverages unlabeled video data to obtain shot representations, which outperforms many supervised learning methods on the downstream task of Video Scene Segmentation. However, this method is pretrained on the entire video data of MovieNet [31], which includes the testing videos, i.e., the training protocol is inconsistent with that of conventional Self-Supervised Learning methods [24][25].
[Figure 2: (a) Representation Learning Stage: input shot clips pass through query and key encoders; positive pairs are selected (re-indexed key samples) and a contrastive loss is minimized. (b) Video Scene Segmentation Stage: a frozen query encoder feeds an MLP / Bi-LSTM that predicts scene boundaries.]
Figure 2. The pipeline of the proposed method. (a) Unsupervised Representation Learning Stage for learning shot representations, where MAP(i) is the mapping function for selecting positive samples. (b) Supervised Video Scene Segmentation Stage, where the quality of the shot representations is evaluated under the protocols of the non-temporal (MLP) and temporal (Bi-LSTM [19]) models.
For evaluating Video Scene Segmentation approaches, the datasets of OVSD [32], BBC Planet Earth [10], MovieNet [31] and AdCuepoints [1] are frequently employed. In this work, we propose an unsupervised representation learning method based on scene consistency and a reasonable evaluation scheme for the Video Scene Segmentation task.

3. Methodology

As shown in Fig. 2, we aim to obtain scene consistency representations on unlabeled long-term videos and design a more reasonable benchmark to verify the quality of the extracted features on the task of Video Scene Segmentation. To this end, we (i) propose a Self-Supervised Learning scheme based on a novel non-temporal selection strategy to achieve scene consistency from various shots, and (ii) introduce a vanilla temporal model with less inductive bias as well as the corresponding benchmark for this segmentation task.

3.1. Consistency based Representation Learning

Approaches of Self-Supervised Learning (SSL) aim to model representation consistency to enhance network robustness against various variations, e.g., spatial or temporal transformations. In this work, we use an SSL framework of Siamese network to achieve the representation consistency. More precisely, for a given query shot, the objective is to (i) maximize the similarity between the representations of the query shot and positive samples, i.e., key shots; and (ii) minimize the similarity of the negative sample pairs if they exist.

As shown in Fig. 2(a), the input samples X are first augmented, i.e., Q = Aug_Q(X), K = Aug_K(X), and the i-th positive pair {q, k+} is formulated as follows:

{q, k+} = {f(Q[i] | θ_Q), f(K[MAP(i)] | θ_K)}   (1)

where [·] stands for the indexing operation, f(·|θ_Q) and f(·|θ_K) are the encoders with parameters θ_Q and θ_K, respectively, and MAP(i) is the mapping function for selecting positive samples.
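To make Eq. (1) concrete, here is a minimal PyTorch sketch of the positive-pair construction. The toy linear encoders, the noise augmentation, and the function names are illustrative assumptions (the paper uses ResNet50 backbones and image-level augmentations):

```python
import torch
import torch.nn as nn

def build_positive_pairs(x, encoder_q, encoder_k, aug_q, aug_k, map_fn):
    """x: (B, D_in) batch of shots; returns the pair (q, k+) per Eq. (1)."""
    q_views = aug_q(x)                      # Q = Aug_Q(X)
    k_views = aug_k(x)                      # K = Aug_K(X)
    q = encoder_q(q_views)                  # f(Q[i] | theta_Q) for all i
    k = encoder_k(k_views)                  # f(K[j] | theta_K) for all j
    idx = torch.tensor([map_fn(i) for i in range(x.size(0))])
    return q, k[idx]                        # k+ = f(K[MAP(i)] | theta_K)

if __name__ == "__main__":
    B, D = 8, 32
    enc_q, enc_k = nn.Linear(D, 16), nn.Linear(D, 16)
    noise = lambda t: t + 0.1 * torch.randn_like(t)   # stand-in augmentation
    # identity mapping, i.e., the Self-Augmented case of Eq. (2) below
    q, k_pos = build_positive_pairs(torch.randn(B, D), enc_q, enc_k,
                                    noise, noise, map_fn=lambda i: i)
    print(q.shape, k_pos.shape)             # torch.Size([8, 16]) twice
```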
For the selection of positive samples in SSL methods based on video data, three selection strategies are frequently employed, i.e., Self-Augmented [14], Random [27] and Nearest Neighbor (NN) [1] selections. For clarity, the three conventional selection strategies for positive samples are represented in Fig. 3(a)-(c).

[Figure 3: four selection strategies for positive pairs: (a) Self-Augmented, (b) Random, (c) NN, (d) Scene Consistency (SC, cluster center) and Soft-SC within a cluster.]
Figure 3. The illustration of four different selection strategies for positive pairs. Best viewed in color.

3.1.1 Conventional Positive Sample Selections

Self-Augmented Selection. As in image-level SSL approaches, the augmented view of one shot is frequently used as its positive sample, as shown in Fig. 3(a); the mapping function, i.e., the identity mapping, is employed as follows:

MAP_SA(i) = i   (2)

Random Selection. As in some SSL methods [27][24] for video classification, we select two adjacent shots of the
same video as the positive pair, as shown in Fig. 3(b), and the mapping function can be formulated as follows:

MAP_RS(i) = max(i + j, 0)   (3)

where j ∈ {−n, −n+1, ..., n−1, n} and n denotes the size of the search region around the i-th shot.

Nearest Neighbor (NN) Selection. As shown in Fig. 3(c), [1] proposed to select the positive shot with the closest representation distance to the query shot within a fixed range, and the mapping function is as follows:

MAP_NN(i) = argmax_{j ∈ I_M} f(Q[i] | θ_Q) · f(Q[max(j, 0)] | θ_Q)   (4)

where I_M = {i−m, ..., i−1, i+1, ..., i+m} stands for the indices of candidate samples for NN selection, m is the search region size of a given shot, and 2m+1 is the length of the sliding window.

3.1.2 Scene Consistency Selection

In this work, we propose the Scene Consistency Selection, while exploring considerable data augmentation and shuffling methods for the task of Video Scene Segmentation.

Positive Sample Selection with Scene Consistency. As shown in Fig. 1, for a video with non-linear narrative, previous selection methods may not work when the most matching shots are far away. Therefore, we propose to select the positive shot pair based on scene consistency; the main advantage over Random/NN Selection is that our method is non-temporal, i.e., free of the shot order. We argue that scene consistency is critical for the training on the unlabeled long-term videos due to three reasons: (i) similar shots in the same scene may be far away; (ii) a greater feature spacing between scenes is beneficial to the downstream task of Video Scene Segmentation, and it can be achieved by maximizing inter-scene distance and minimizing intra-scene distance; (iii) while the NN selection may result in a trivial objective, due to the maximization of the similarity of sample pairs that may already be the closest, the scene consistency enables the selection to achieve a more non-trivial objective.

For the proposed scene consistency-based selection, we perform online clustering of samples in a batch, and use the cluster center sample as the positive sample with respect to the query shot, as shown in Fig. 3(d). The specified mapping function is formulated as follows:

MAP_SC(i) = argmin_{j ∈ I_C} ‖f(Q[i] | θ_Q) − f(Q[j] | θ_Q)‖₂   (5)

where I_C = {i_c1, i_c2, ..., i_c#class} stands for the indices of cluster centers, and #class is the number of cluster centers.

While the center sample reflects the cluster-specified common information, we additionally use the query-specific individual information to generate the positive sample. Unlike the conventional multiple-instance learning [29], which treats center and query samples as multiple positive samples, we propose to construct a soft positive sample, namely the Soft-Scene Consistency (SC) sample, as follows:

k_Soft-SC = γ · k_SA + (1 − γ) · k_SC   (6)

where γ is a trade-off parameter, and k_SA and k_SC are the key (positive) samples selected by Self-Augmented Selection and Scene Consistency Selection, respectively.

Scene Consistency Data Augmentation. Since the early stage of training is not stable, too much color augmentation, e.g., grayscale transformations, color jitter, etc., misguides the selection of positive samples, namely Selection Shift. In this case, the model focuses more on non-semantic information. To solve this problem, some studies [33] directly omit color augmentations for better performance. By contrast, we propose Asymmetric Augmentation to alleviate the influence of Selection Shift and use color augmentation to further improve the performance. More specifically, augmentations without the color transformation are used in Aug_Q to get more accurate and scene-consistent positive samples, while the color data augmentation operations are performed in Aug_K.

Scene Agnostic Clip-Shuffling. To fully leverage the limited video data, we propose to construct more pseudo scene cues. In this work, the data augmentation is based on the basic unit of a clip, i.e., ρ continuous shots; the generated clips are then randomly spliced disorderly for the training. The process of Scene Agnostic Clip Generation and Shuffling is shown in Fig. 4.

[Figure 4: clips drawn from Videos A, B, C are spliced disorderly into batches i−1, i, i+1 along the time axis.]
Figure 4. The illustration of Scene Agnostic Clip-Shuffling. Clips are spliced disorderly for training and each clip contains ρ continuous shots.

3.1.3 Negative Sample Selections

The way to choose negative samples varies according to the specific SSL framework. For SimCLR [12], the set of all non-positive samples within a batch is used as the negative samples, and MoCo [14] leverages a negative sample queue, which is a memory bank of previous samples output from the key encoder. However, BYOL [17] and SimSiam [16] do not use negative samples and instead resort to exploring more non-trivial solutions of SSL.
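As a concrete reference for Eqs. (3)–(6) and the clip shuffling above, here is a minimal PyTorch sketch. The naive k-means loop, the function names, and the batch-level treatment are assumptions for illustration; the paper reports #class = 24 clusters and ρ = 16 in its implementation details (Section 4.1):

```python
import random
import torch

def map_rs(i, n=1):
    """Eq. (3): Random Selection -- a random neighbor within +/- n shots."""
    return max(i + random.randint(-n, n), 0)

def map_nn(i, q_feats, m=8):
    """Eq. (4): NN Selection -- the most similar shot within a 2m+1 window,
    following the max(j, 0) clamp of the source equation."""
    cand = [max(j, 0) for j in range(i - m, i + m + 1) if j != i]
    sims = q_feats[cand] @ q_feats[i]
    return cand[int(sims.argmax())]

def kmeans_center_indices(feats, n_clusters=24, iters=10):
    """Naive k-means over the batch; returns indices of the batch samples
    nearest to each centroid (the candidate 'center samples' for SC)."""
    centroids = feats[torch.randperm(feats.size(0))[:n_clusters]].clone()
    for _ in range(iters):
        assign = torch.cdist(feats, centroids).argmin(dim=1)
        for c in range(n_clusters):
            members = feats[assign == c]
            if len(members):
                centroids[c] = members.mean(dim=0)
    return torch.cdist(centroids, feats).argmin(dim=1)

def map_sc(i, feats, center_idx):
    """Eq. (5): SC Selection -- the cluster-center sample closest to query i."""
    d = (feats[center_idx] - feats[i]).norm(dim=1)
    return int(center_idx[d.argmin()])

def soft_sc_key(k_sa, k_sc, gamma=0.5):
    """Eq. (6): soft positive mixing the Self-Augmented and SC keys."""
    return gamma * k_sa + (1 - gamma) * k_sc

def clip_shuffle(shots, rho=16):
    """Scene Agnostic Clip-Shuffling: split the shot sequence into clips of
    rho continuous shots and splice the clips in random order."""
    clips = list(shots.split(rho, dim=0))
    random.shuffle(clips)
    return torch.cat(clips, dim=0)
```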
3.1.4 Objective Function

With Negative Samples. Defining sim(·,·) as the cosine similarity, the contrastive loss function, i.e., InfoNCE [11], is employed and formulated as follows:

L_con = −log [ Σ_{k ∈ {k+}} exp(sim(q, k)/τ) / Σ_{k ∈ {k+, k−}} exp(sim(q, k)/τ) ]   (7)

where k+ and k− stand for the positive and negative samples for the query q, and τ is the temperature term [34].

Without Negative Samples. By maximizing the similarity between the query and positive samples, the contrastive loss without negative samples is formulated as follows:

L_con = −2 Σ_{k ∈ {k+}} ( sim(P_θ(q), k_SG) + sim(P_θ(k), q_SG) )   (8)

where P_θ is the predictor P with parameters θ [16,17], and k_SG and q_SG are the samples with stop-gradient (SG) [16,17].
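A minimal PyTorch sketch of the two objectives, assuming a single positive key per query and a MoCo-style queue of negatives for Eq. (7); names, shapes, and the batch mean in place of the sum of Eq. (8) are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce(q, k_pos, k_neg, tau=0.07):
    """Eq. (7) with one positive per query: reduces to cross-entropy with the
    positive logit at index 0. q, k_pos: (B, D); k_neg: (N, D) negative queue."""
    q, k_pos, k_neg = (F.normalize(t, dim=-1) for t in (q, k_pos, k_neg))
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)        # (B, 1) sim(q, k+)
    l_neg = q @ k_neg.t()                                # (B, N) sim(q, k-)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, labels)

def negative_free_loss(p_q, p_k, q, k):
    """Eq. (8): symmetric similarity loss without negatives (BYOL/SimSiam
    style); p_* are predictor outputs, q/k are stop-gradient targets."""
    def sim(a, b):
        return F.cosine_similarity(a, b.detach(), dim=-1)  # detach = stop-grad
    return -2.0 * (sim(p_q, k) + sim(p_k, q)).mean()       # batch mean
```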
3.2. Video Scene Segmentation

After the unsupervised pretraining, two downstream models are used to evaluate the quality of the extracted features with the frozen query encoder.

Problem Definition. For Video Scene Segmentation, [6][1] convert the task into a binary classification task of shot semantics by modeling the temporal relationship of adjacent shot features. In this way, we can determine whether the end of a shot is the end of a scene story.

[Figure 5: (a) boundary based model: MLP / Bi-LSTM over sliding windows of shot features, introducing inductive bias; (b) boundary free model: Bi-LSTM over the full sequence of shot features, emitting 0/1 per shot.]
Figure 5. The illustration of the boundary based model (a) and the boundary free model (b) for Video Scene Segmentation.

Boundary free model. While the previous downstream task of Video Scene Segmentation has been formulated as a shot boundary modeling based approach, as shown in Fig. 5(a), we introduce a vanilla boundary-free model. As shown in Fig. 5(b), the proposed model covers the long-term dependence of shot representations based on sequence-to-sequence learning. Compared with the boundary based model in Fig. 5(a), which introduces inductive bias for the shot boundary modeling with sliding windows, the suggested model (b) takes the shot features as the basic temporal input unit, enabling the model to explore both local and global semantic relations.

4. Experiments

4.1. Experimental Setup

Dataset. MovieNet [31] consists of 1,100 movies with a large amount of multi-modal data and annotations, and the total duration of all movies is about 3,000 hours; it is the largest dataset for movie understanding analysis by far. Besides, MovieNet [31] is split into a training set with 660 movies, a validation set with 220 movies and a testing set with 220 movies. Currently, for the task of Video Scene Segmentation, 190, 64 and 64 videos are labeled with scene boundaries for the training, validation and test sets, respectively. More importantly, the movies in MovieScene [6] are all included in MovieNet [31].
It is worth noting that there are two versions of annotations for the Video Scene Segmentation task associated with MovieNet: one with only 150 annotations (called MovieScenes in [1,6], used in earlier methods [6] but no longer available), and one with a total of 318 annotations (abbreviated as MovieScenes-318 in this work).

Given the small scale of the BBC [10] and OVSD [32] datasets and the unavailability of the AdCuepoints [1] dataset, we instead adopt the MovieNet [31] dataset to evaluate the related approaches; more details are in the Supplementary Materials.

Representation Learning Stage. For the visual modality, each shot consists of 3 keyframes and ResNet50 [38] is chosen as the default backbone to learn the shot representations. The audio backbone used in [6] is applied for the audio modality; more details about the backbone encoders can be found in the Supplementary Materials. For pretraining data, (i) the training set (660 movies) in MovieNet [31] is used to learn the shot representations, while we also conduct experiments with (ii) all data (1,100 movies) [1] for a fair comparison. In particular, although the test data without the scene boundary labels are used for representation learning in setting (ii), it is not recommended to use all the data for pretraining because we usually have no prior access to test data in real scenarios. Moreover, for most Self-Supervised benchmarks [11–18], representation learning is performed only on the training set, rather than on all of the data.

Video Scene Segmentation Stage. For existing Self-Supervised methods on images and videos, a simple downstream model is frequently used to evaluate the representation quality of the frozen encoders. For instance, a linear
fully-connected layer is widely used for evaluation. However, for the Video Scene Segmentation task, we cannot determine whether the ending position of a single shot is the scene boundary or not. Consequently, a boundary-based non-temporal model (MLP-based protocol, following [1]) and a boundary-free temporal model (Bi-LSTM [19] based protocol, proposed by us) are employed to evaluate the capability of the encoder for local-to-global modeling.

Metrics. We use the mean of Average Precision (AP) [6][1] specified to the ground-truth scene boundaries of each movie, as well as the F1-score, for the evaluation.

Implementation Details. During the Self-Supervised representation learning stage, the batch size is set to 1,024 (shots), the initial learning rate is set to 0.03 and the number of training epochs is 100. The parameters of the visual and audio encoders are randomly initialized. Besides, we perform the naive K-Means algorithm [39,40] for online clustering and the cluster number #class is set to 24, while the clip length, i.e., ρ of Scene Agnostic Clip-Shuffling, is set to 16. MoCo v2 [14] with a queue size of 65,536, momentum value of 0.999, temperature of 0.07 and cosine learning rate decay is used as our SSL framework setting. For the Video Scene Segmentation task, num-of-shot [1] is set to 4 and 40 for the MLP [1] and Bi-LSTM protocols, respectively. Each pretraining trial is conducted on a server with 8 NVIDIA V100 GPUs for approximately 24 hours in the visual modality and 10 hours in the audio modality. The dimensions of the visual and audio features used for both pretraining and evaluation are 2,048 and 512, respectively. More details, e.g., the choice of hyperparameters, are presented in the Supplementary Materials.

4.2. Comparison with Existing Methods

Tables 1 and 2 present the overall performance of methods w/ or w/o SSL for the Video Scene Segmentation task, where M.S. stands for the MovieScenes dataset with 150 annotated movies, and Eval. means the dataset used for the supervised evaluation stage after the pretraining. Besides, Train., Test., and Val. represent the training, testing and validation sets of MovieNet [31].

Table 1. Results of supervised methods w/o SSL for the task of Video Scene Segmentation on MovieNet.

| Methods | Dataset | AP | F1 |
|---|---|---|---|
| SCSA [9] | M.S. | 14.7 | - |
| StoryGraph [35] | M.S. | 25.1 | - |
| Siamese [10] | M.S. | 28.1 | - |
| ImageNet [36] | M.S. | 41.26 | - |
| Places [37] | M.S. | 43.23 | - |
| LGSS [6] | M.S. | 47.1 | - |
| LGSS w/o DP [6] | M.S. | 44.9 | - |
| LGSS w/o DP [6]* | M.S-318 | 44.9 | 38.52 |

\* Our implementation based on the official public codebase², while DP (Dynamic Programming) isn't publicly available.

² https://github.com/AnyiRao/SceneSeg

Table 2. Results of methods w/ SSL for the task of Video Scene Segmentation on MovieNet.

| Methods | Pretrain Data | Eval. | Protocol | AP | F1 |
|---|---|---|---|---|---|
| ShotCoL [1] | Train.+Test.+Val. | M.S. | MLP [1] | 52.83 | - |
| ShotCoL [1] | Train.+Test.+Val. | M.S-318 | MLP [1] | 53.37 | - |
| ShotCoL [1]* | Train.+Test.+Val. | M.S-318 | MLP [1] | 52.89 | 49.17 |
| SCRL (ours) | Train.+Test.+Val. | M.S-318 | MLP [1] | 54.82 | 51.43 |
| ShotCoL [1]* | Train. only | M.S-318 | MLP [1] | 46.77 | 45.78 |
| SCRL (ours) | Train. only | M.S-318 | MLP [1] | 53.74 | 50.40 |
| ShotCoL [1]* | Train. only | M.S-318 | Bi-LSTM | 48.21 | 46.52 |
| SCRL (ours) | Train. only | M.S-318 | Bi-LSTM | 54.55 | 51.39 |

\* Our implementations.

We have reproduced the performance of ShotCoL [1] on the entire dataset (1,100 movies) for comparison, although it is suggested to conduct the pretraining stage only on the training set. Compared with ShotCoL [1], which has a decline of 6.12 in terms of AP, our method can achieve competitive performance with less training data, with only a decline of 1.08 in terms of AP. The proposed method outperforms the supervised state-of-the-art method, i.e., LGSS [6], by margins of 9.65 in terms of AP and 12.87 in terms of F1.

4.3. Ablation Study

We perform all the ablation experiments using only the training data of MovieNet in the SSL stage, and evaluate the performance on the downstream task based on the MLP protocol for fairness.

Positive Sample Selection. We first conduct ablation experiments on the four different selection methods of positive pairs, i.e., Self-Augmented Selection, Random Selection, Nearest Neighbor (NN) Selection and Scene Consistency (SC) Selection. Tab. 3 shows that the Scene Consistency Selection method achieves better performance than the other selection methods, outperforming the state-of-the-art algorithm [1] by a margin of 2.95 in terms of AP. Meanwhile, the loss evolution curves of the above methods are shown in Fig. 6. We can find that Self-Augmented Selection reaches the lowest loss value, while obtaining the worst performance on the task of Video Scene Segmentation. Due to the trivial objective introduced by NN Selection, discussed in Section 3.1.2, it achieves the fastest convergence rate during the early training, while stagnating at a mediocre performance. By contrast, SC Selection has a relatively moderate convergence rate, and achieves the best performance among all the selection strategies.
Table 3. Ablation results of Positive Sample Selection.

| Methods | Selection Strategy | AP |
|---|---|---|
| MoCo [14] | Self-Augmented | 42.51 |
| - | Random (n=1) | 43.24 |
| ShotCoL [1] | NN (m=8) | 46.77 |
| SC | Scene Consistency | 49.71 |

Table 4. Ablation results of SSL methods w/ and w/o Scene Agnostic Clip-Shuffling.

| Methods | w/o | w/ | AP |
|---|---|---|---|
| NN | ✓ | ✗ | 46.77 |
| NN | ✗ | ✓ | 48.63 |
| SC | ✓ | ✗ | 49.71 |
| SC | ✗ | ✓ | 52.17 |

Table 5. Ablation results of Multiple Positive Samples (MPS).

| Methods | Positive Sample(s) | AP |
|---|---|---|
| SC | Center | 52.17 |
| MPS-SC | Self and Center | 51.20 |
| Soft-SC | Eq. 6 (γ=0.5) | 53.74 |

Figure 6. Loss evolution curves and AP results of the training with different selection strategies.

Scene Agnostic Clip-Shuffling. The results of the ablation study specific to the Scene Agnostic Clip-Shuffling are presented in Tab. 4, which shows that the proposed Clip-Shuffling achieves improvements of 1.85 and 2.46 in terms of AP for the NN and SC methods, respectively. These results verify the advantage of the proposed positive sample selection discussed in Section 3.1.2, i.e., that SC is free of the shot order in a video.

Multiple Positive Samples (MPS).
Moreover, we study the performance of multiple positive samples in Tab. 5: Soft-SC achieves the best performance of 53.74 in terms of AP. Although a single positive sample is employed in SC, it still achieves better performance than MPS-SC, which employs multiple positive samples.

4.4. Analysis of the Proposed Method

Generalizability to the large-scale supervised approach. To study the generalizability of the proposed method, we equip our trained models with LGSS [6], where LGSS is a large-scale supervised method that utilizes various pretrained models with multi-modality. As shown in Tab. 6, our trained model, trained only on the unlabeled training set and based on the same backbone, i.e., ResNet-50 [38], achieves an improvement of 4.0 in terms of AP over the approach without our trained model.

Table 6. Generalizability to the large-scale supervised approach.

| Methods | Modalities | AP |
|---|---|---|
| LGSS [6] w/o SSL | Visual (Place, ResNet50) + Action + Actor + Audio | 44.9 |
| LGSS [6] w/ SSL | Visual (SSL, ResNet50) + Action + Actor + Audio | 48.9 |

Performance on different Self-Supervised Learning (SSL) frameworks. Four popular SSL frameworks are used for evaluating our method, i.e., SimCLR [12], MoCo [14], BYOL [17] and SimSiam [16].
Tab. 7 shows that the SSL framework with momentum updates and negative samples achieves the best performance for the Video Scene Segmentation task. Due to the momentum update mechanism, the proposed method embedded in the framework of BYOL [17] achieves an improvement of 10.53 over that in SimSiam [16]; a similar conclusion is reached in [27].

Table 7. Results of the proposed method based on various Self-Supervised frameworks.

| Methods | SSL Framework | w/ negative samples | w/ momentum update | AP |
|---|---|---|---|---|
| SCRL | SimSiam [16] | ✗ | ✗ | 39.82 |
| SCRL | SimCLR [12] | ✓ | ✗ | 45.32 |
| SCRL | BYOL [17] | ✗ | ✓ | 50.35 |
| ShotCoL [1] | MoCo [14] | ✓ | ✓ | 46.77 |
| SCRL | MoCo [14] | ✓ | ✓ | 53.74 |

Boundary free model for evaluation. To study the performance of the introduced boundary free model, the proposed method under the MLP and Bi-LSTM protocols for the scene segmentation task is evaluated in Tab. 8. Since the Bi-LSTM protocol has less inductive bias than the sliding window based MLP protocol, it is able to model representations of longer shot sequences, and hence achieves better performance on the task of Video Scene Segmentation. More specifically, the performance of the Bi-LSTM protocol increases as the length of the shot sequence increases, while the performance of the MLP protocol decreases instead. More details can be found in the Supplementary Materials.
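A minimal sketch of such a boundary-free protocol head, assuming frozen 2,048-d shot features as input; the hidden size and the single-layer Bi-LSTM are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class BoundaryFreeBiLSTM(nn.Module):
    """Boundary-free temporal protocol: a Bi-LSTM reads the whole sequence of
    frozen shot features and emits a per-shot logit for 'this shot ends a
    scene', i.e., sequence-to-sequence scene boundary labeling."""
    def __init__(self, feat_dim=2048, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, shot_feats):               # (B, T, feat_dim)
        h, _ = self.lstm(shot_feats)             # (B, T, 2*hidden)
        return self.head(h).squeeze(-1)          # (B, T) boundary logits

if __name__ == "__main__":
    model = BoundaryFreeBiLSTM()
    logits = model(torch.randn(2, 40, 2048))     # 40-shot windows, as in Tab. 8
    print(logits.shape)                          # torch.Size([2, 40])
```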
Table 8. Results of Video Scene Segmentation using the proposed method under the MLP and Bi-LSTM protocols.

| Protocols | Shot-Len | AP | F1 | #Param |
|---|---|---|---|---|
| MLP [1] | 4 | 53.74 | 50.40 | 37.75M |
| MLP [1] | 10 | 49.61↓ | 44.04↓ | 88.09M↑ |
| Bi-LSTM | 10 | 43.94 | 42.12 | 15.22M |
| Bi-LSTM | 40 | 54.55↑ | 51.39↑ | 15.22M |

Visualization of Shot Retrieval. To get more intuition about the proposed selection, we conduct retrieval experiments using four selection methods, i.e., SC, NN, Self-Augmented and ImageNet selections, and present the results in Fig. 7. More specifically, for a given shot, we calculate the similarities between it and the other shots in the entire movie, then visualize the TOP-5 most similar shots in Fig. 7.

[Figure 7: TOP-5 retrieval results (with shot IDs, ranked by similarity to the query) for a query shot under the SC, NN, ImageNet and Self selections.]
Figure 7. The visualization results of shot retrieval. Overall, NN tends to select adjacent shots, Self shows less relevance to the query, and ImageNet retrieves many kinds of boats. Compared with the other methods, the results of SC are more consistent in semantic information, i.e., there is a man staying in the boat, and SC achieves a larger span (i.e., from 641 to 850) than NN according to shot IDs. Meanwhile, SC shows better robustness against the interference of the pink smoke in the 994-th shot, as the results are more pure.

4.5. Limitations

Multi-modal Pretraining. In order to test the performance of the proposed algorithm generalizing to multi-modal data, we also conduct experiments with audio and visual modalities in the SSL stage; the joint multi-modal learning scheme follows [41].
4.5. Limitations

Multi-modal Pretraining. In order to test how well the proposed algorithm generalizes to multi-modal data, we also conduct experiments with the audio and visual modalities in the SSL stage; the joint multi-modal learning scheme follows [41]. However, we did not achieve any improvement and were confronted with the same concern as mentioned in [1], as shown in Tab. 9. Possible reasons are that
(i) the publicly available audio data of each shot are incomplete,
(ii) the raw audio data are not yet available due to copyright restrictions [31], and
(iii) LGSS [6] utilizes various models pretrained on other datasets, while the methods in this comparison are trained from scratch. Therefore, it is meaningful to shed light on how to pretrain better multi-modal representations on MovieNet [31].
Table 9. AP results of the multi-modal experiment on MovieNet. The backbones of the following methods are the same for each modality.

| Methods | Visual | Audio | Visual+Audio |
|---|---|---|---|
| LGSS [6] | 39.0 | 17.5 | 43.4 |
| ShotCoL [1] | 46.77 | 27.92 | 44.32 |
| SCRL | 53.74 | 29.39 | 50.80 |

5. Conclusion

We present a Self-Supervised Learning (SSL) scheme based on Scene Consistency to obtain better shot representations for unlabeled long-term videos. The proposed method achieves state-of-the-art performance on the task of Video Scene Segmentation under various protocols, and significantly better generalization performance when it is equipped with a large-scale supervised approach. Besides, we introduce a fair pretraining protocol and a more comprehensive evaluation metric for the task of Video Scene Segmentation, to make the assessment of SSL more meaningful in practice.

Acknowledgment

The work was supported by the National Natural Science Foundation of China under grants no. 61602315 and 91959108, the Science and Technology Project of Guangdong Province under grant no. 2020A1515010707, and the Science and Technology Innovation Commission of Shenzhen under grant no. JCYJ20190808165203670.
References

[1] Shixing Chen, Xiaohan Nie, David Fan, Dongqing Zhang, Vimal Bhat, and Raffay Hamid. Shot contrastive self-supervised learning for scene boundary detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9796–9805, 2021.
[2] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. BSN: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[3] Jing Tan, Jiaqi Tang, Limin Wang, and Gangshan Wu. Relaxed transformer decoders for direct action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13526–13535, 2021.
[4] Anthony Cioppa, Adrien Deliege, Silvio Giancola, Bernard Ghanem, Marc Van Droogenbroeck, Rikke Gade, and Thomas B. Moeslund. A context-aware loss function for action spotting in soccer videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13126–13136, 2020.
[5] Silvio Giancola, Mohieddine Amine, Tarek Dghaily, and Bernard Ghanem. SoccerNet: A scalable dataset for action spotting in soccer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1711–1721, 2018.
[6] Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, and Dahua Lin. A local-to-global approach to multi-modal movie scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10146–10155, 2020.
[7] Panagiotis Sidiropoulos, Vasileios Mezaris, Ioannis Kompatsiaris, Hugo Meinedo, Miguel Bugalho, and Isabel Trancoso. Temporal video segmentation to scenes using high-level audiovisual features. IEEE Transactions on Circuits and Systems for Video Technology, 21(8):1163–1177, 2011.
[8] Jakub Lokoč, Gregor Kovalčik, Tomáš Souček, Jaroslav Moravec, and Přemysl Čech. A framework for effective known-item search in video. In Proceedings of the 27th ACM International Conference on Multimedia, pages 1777–1785, 2019.
[9] Vasileios T. Chasanis, Aristidis C. Likas, and Nikolaos P. Galatsanos. Scene detection in videos using shot clustering and sequence alignment. IEEE Transactions on Multimedia, 11(1):89–100, 2008.
[10] Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. A deep siamese network for scene detection in broadcast videos. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1199–1202, 2015.
[11] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
[12] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
[13] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243–22255, 2020.
[14] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[15] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
[16] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
[17] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020.
[18] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In Thirty-fourth Conference on Neural Information Processing Systems (NeurIPS), 2020.
[19] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
[20] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[21] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2536–2544, 2016.
[22] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In European Conference on Computer Vision, pages 649–666. Springer, 2016.
[23] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pages 69–84. Springer, 2016.
[24] Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, and Yin Cui. Spatiotemporal contrastive video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6964–6974, 2021.
[25] Ali Diba, Vivek Sharma, Reza Safdari, Dariush Lotfi, Saquib Sarfraz, Rainer Stiefelhagen, and Luc Van Gool. Vi2CLR: Video and image for visual contrastive learning of representation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1502–1512, 2021.
[26] Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Soren Schwertfeger, Cyrill Stachniss, and Mu Li. Video contrastive learning with global context. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3195–3204, 2021.
[27] Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, and Kaiming He. A large-scale study on unsupervised spatiotemporal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3299–3309, 2021.
[28] Jiahao Xie, Xiaohang Zhan, Ziwei Liu, Yew Ong, and Chen Change Loy. Unsupervised object-level representation learning from scene images. Advances in Neural Information Processing Systems, 34, 2021.
[29] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9879–9889, 2020.
[30] Adria Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Pătrăucean, Florent Altché, Michal Valko, et al. Broaden your views for self-supervised video learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1255–1265, 2021.
[31] Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. MovieNet: A holistic dataset for movie understanding. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV, pages 709–727. Springer, 2020.
[32] Daniel Rotman, Dror Porat, and Gal Ashour. Optimal sequential grouping for robust video scene detection using multiple modalities. International Journal of Semantic Computing, 11(02):193–208, 2017.
[33] Jiarui Xu and Xiaolong Wang. Rethinking self-supervised correspondence learning: A video frame-level similarity perspective. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10075–10085, 2021.
[34] Zhirong Wu, Yuanjun Xiong, Stella X. Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
[35] Makarand Tapaswi, Martin Bauml, and Rainer Stiefelhagen. StoryGraphs: Visualizing character interactions as a timeline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 827–834, 2014.
[36] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[37] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.
[38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[39] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[40] Haozhe Liu, Haoqian Wu, Weicheng Xie, Feng Liu, and Linlin Shen. Group-wise inhibition based feature regularization for robust classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 478–486, 2021.
[41] Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelovic, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, and Andrew Zisserman. Self-supervised multimodal versatile networks. NeurIPS, 2(6):7, 2020.
Scene Consistency Representation Learning for Video Scene Segmentation: Supplementary Materials

Haoqian Wu¹,²,³,⁴,∗, Keyu Chen²,∗, Yanan Luo², Ruizhi Qiao², Bo Ren², Haozhe Liu¹,³,⁴,⁵, Weicheng Xie¹,³,⁴,†, Linlin Shen¹,³,⁴
¹Computer Vision Institute, Shenzhen University; ²Tencent YouTu Lab; ³Shenzhen Institute of Artificial Intelligence and Robotics for Society; ⁴Guangdong Key Laboratory of Intelligent Information Processing; ⁵KAUST
∗Equal Contribution; †Corresponding Author
wuhaoqian2019@email.szu.edu.cn, {yolochen, ruizhiqiao, timren}@tencent.com, luoyanan93@gmail.com, haozhe.liu@kaust.edu.sa, {wcxie, llshen}@szu.edu.cn

1. Additional Illustrations

1.1. Frame, Shot, Clip and Scene

Fig. 1 shows the connection between frame, shot, scene and video. More specifically, a shot contains only continuous frames taken by the camera without interruption, and a scene is composed of successive shots that describe the same short story. Typically, a consecutive shot sequence of arbitrary length can be treated as a clip.

[Figure 1: The illustration of the connection between frame, shot, scene and video. The green and orange lines represent Shot and Scene Boundaries, respectively.]

1.2. Shot Sampling Strategy

Video data is highly redundant, because there are many repetitive frames in chronological order; we therefore follow the sampling strategy of [1, 2] for long-term videos. More concretely, the video sequence is sliced according to the shot boundaries, which are determined by the transitions of the visual modality. Based on the beginning and ending positions of the shots, a fixed number of N = 3 frames are selected and treated as the original feature of one shot, where the starting, middle and ending frames of the shot are sampled.
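As a concrete illustration of this sampling strategy, the sketch below picks the starting, middle and ending frame indices of each shot; the function name and the inclusive (start, end) boundary format are our assumptions for illustration.

```python
def sample_shot_frames(shot_boundaries):
    """For each shot given as an inclusive (start, end) frame-index pair,
    pick the starting, middle and ending frames (N = 3, as in Sec. 1.2)."""
    sampled = []
    for start, end in shot_boundaries:
        middle = (start + end) // 2
        sampled.append((start, middle, end))
    return sampled

# Example: three shots described by their frame ranges.
print(sample_shot_frames([(0, 40), (41, 99), (100, 131)]))
# -> [(0, 20, 40), (41, 70, 99), (100, 115, 131)]
```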
2. Algorithm Details

2.1. Representation Learning Stage

In the Representation Learning Stage, the proposed SSL algorithm can be summarized as Algorithm 1.

Algorithm 1: SSL for Video Scene Segmentation.
Input: Training samples X
Require: Initialized encoders f(·|θQ) and f(·|θK); initialized memory bank Queue; augmentation operations AugQ(·) and AugK(·); number of iterations n_iter
1: for i = 1 to n_iter do
2:   Obtain augmented training samples Q, K by Q = AugQ(X), K = AugK(X);
3:   Obtain the mapping function MAP(·);
4:   Obtain positive pairs {q, k+} by Eq. (1);
5:   Detach the samples k+ from the computation graph;
6:   Obtain negative samples k− from Queue;
7:   Calculate the contrastive loss Lcon by Eq. (7) or (8);
8:   Perform backpropagation for Lcon;
9:   Update the encoder f(·|θQ) by gradient descent;
10:  Update the encoder f(·|θK) by momentum update;
11:  Enqueue the positive samples k+ to Queue;
12: end for
Output: Query encoder f(·|θQ).
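To make the control flow of Algorithm 1 concrete, here is a hedged PyTorch sketch of one training iteration in the MoCo-style setting [14] that the algorithm follows (query/key encoders, a feature queue, momentum update). The InfoNCE form of the loss, the momentum coefficient and all variable names follow standard MoCo conventions rather than the paper's exact Eq. (1)/(7)/(8), which are defined in the main body.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # Step 10: theta_K <- m * theta_K + (1 - m) * theta_Q
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

def train_step(encoder_q, encoder_k, queue, x_q, x_k, optimizer, tau=0.07):
    """One iteration of Algorithm 1. x_q, x_k are the two augmented views
    (steps 2-4); queue is a [dim, K] tensor of past keys (the memory bank)."""
    q = F.normalize(encoder_q(x_q), dim=1)              # query features
    with torch.no_grad():                               # step 5: no gradient to keys
        k = F.normalize(encoder_k(x_k), dim=1)
    l_pos = (q * k).sum(dim=1, keepdim=True)            # positive logits [B, 1]
    l_neg = q @ queue                                   # step 6: negatives [B, K]
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long,
                         device=logits.device)          # positives at index 0
    loss = F.cross_entropy(logits, labels)              # step 7: InfoNCE loss
    optimizer.zero_grad()
    loss.backward()                                     # step 8
    optimizer.step()                                    # step 9
    momentum_update(encoder_q, encoder_k)               # step 10
    # Step 11: prepend the new keys and drop the oldest to keep the size fixed.
    queue = torch.cat([k.T.detach(), queue], dim=1)[:, :queue.size(1)]
    return loss.item(), queue
```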
2.2. Video Scene Segmentation Stage

The Boundary based model (i.e., the MLP protocol [2]) and the Boundary free model (i.e., the Bi-LSTM protocol introduced by us) are presented in Fig. 2.
For the MLP protocol [2], we use SGD as the optimizer, with a weight decay of 1e-4 and an SGD momentum of 0.9. In the training stage, we use a mini-batch size of 128 and a dropout rate of 0.4 for the FC layers. Besides, we train for 200 epochs, with the learning rate multiplied by 0.1 at epochs 50, 100 and 150.

For the Bi-LSTM protocol, we use SGD as the optimizer, with a weight decay of 1e-4 and an SGD momentum of 0.9. In the training stage, we use a mini-batch size of 32 and a dropout rate of 0.7 for the FC layers (except for the last layer). The Bi-LSTM is implemented using the LSTM module in PyTorch [3], which includes two layers with 512 hidden units together with a dropout layer with dropout probability 0.6 between them. Besides, we train for 200 epochs, with the learning rate multiplied by 0.1 at epochs 160 and 180.

In the inference stage, in order to let each shot aggregate as much information as possible from the adjacent shots, in each inference batch we use the middle portion of the model output sequence as the scene boundary prediction. More specifically, for the input shot feature sequence [S_0, S_1, ..., S_{Shot-Len−1}], we adopt the subsequence [Y_{⌈Shot-Len/4⌉}, ..., Y_{⌈3·Shot-Len/4⌉−1}] of the corresponding output sequence [Y_0, Y_1, ..., Y_{Shot-Len−1}] as the prediction result. Additionally, the first and last shot features are used to pad the beginning and the ending of the shot feature sequence, respectively.
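A hedged sketch of this centered-window inference scheme follows; the stride choice (half a window, so the kept middle portions tile the sequence) and the function names are our reading of the description above, not code from the paper.

```python
import math
import torch

def infer_scene_boundaries(model, shots: torch.Tensor, shot_len: int = 40):
    """Slide a window of `shot_len` shots over the movie and keep only the
    middle half of each window's predictions, as described in Sec. 2.2.
    shots: [num_shots, dim]; model maps [1, shot_len, dim] -> [1, shot_len, 2].
    """
    lo, hi = math.ceil(shot_len / 4), math.ceil(3 * shot_len / 4)
    pad_front = shots[:1].repeat(lo, 1)            # pad with the first shot feature
    pad_back = shots[-1:].repeat(shot_len, 1)      # pad with the last shot feature
    padded = torch.cat([pad_front, shots, pad_back], dim=0)
    scores = []
    step = hi - lo                                 # windows tile without gaps
    for start in range(0, shots.size(0), step):
        window = padded[start:start + shot_len].unsqueeze(0)
        out = model(window)                        # [1, shot_len, 2]
        scores.append(out[0, lo:hi])               # keep only the middle portion
    return torch.cat(scores, dim=0)[: shots.size(0)]   # per-shot boundary scores
```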
[Figure 2: The illustration of the (a) Boundary based and (b) Boundary free models, where B, N and Shot-Len represent the batch size, the dimension of the feature and the length of shots processed within a batch. (a) The Boundary based model concatenates the [B, Shot-Len, N] shot features into [B, Shot-Len × N] and passes them through FC layers with ReLU and Dropout ([B, 4096] → [B, 2048] → [B, 2]) and a Softmax to predict a single boundary label (0/1). (b) The Boundary free model applies FC1 ([B, Shot-Len, 1024]), a Bi-LSTM ([B, Shot-Len, 2 × 512]), and FC layers with ReLU and Dropout ([B, Shot-Len, 512] → [B, Shot-Len, 2]) with a Softmax to predict a boundary label (0/1) for every shot.]
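Based on the layer shapes listed for Fig. 2(b) and the hyperparameters in Sec. 2.2, a minimal sketch of the Boundary free head might look as follows; the class name, the input feature dimension, and the exact placement of the ReLU/Dropout layers are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BoundaryFreeModel(nn.Module):
    """Bi-LSTM protocol head sketched from Fig. 2(b): per-shot features pass
    through an FC layer, a 2-layer bidirectional LSTM (512 hidden units,
    dropout 0.6 between the layers), and FC layers predicting a 0/1 boundary
    label for every shot in the window."""
    def __init__(self, feat_dim=2048, dropout=0.7):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, 1024)
        self.bilstm = nn.LSTM(1024, 512, num_layers=2, batch_first=True,
                              bidirectional=True, dropout=0.6)
        self.head = nn.Sequential(
            nn.Linear(2 * 512, 512), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(512, 2))                     # per-shot boundary logits

    def forward(self, x):                          # x: [B, Shot-Len, feat_dim]
        h, _ = self.bilstm(self.fc1(x))            # h: [B, Shot-Len, 2 * 512]
        return self.head(h)                        # [B, Shot-Len, 2]

out = BoundaryFreeModel()(torch.randn(4, 40, 2048))
print(out.shape)  # torch.Size([4, 40, 2])
```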
3. Additional Implementation Details

3.1. Datasets

The details of the BBC [4], OVSD [5], MovieNet [6] and MovieScene [1] datasets are shown in Tab. 1. The scale of BBC [4] and OVSD [5] is much smaller than that of MovieNet [6] and MovieScene [1]. BBC [4] has 5 different annotations, and the number of scenes is averaged over them, as shown in Tab. 1. The ground-truth Shot Detection result of OVSD [5] is unavailable from its official website and from [1]; thus, we provide the results of our own Shot Detection implementation in Tab. 1. It is worth noting that the movies in MovieScene [1] are all included in MovieNet [6].
Table 1. Details of the BBC / OVSD / MovieNet / MovieScene datasets.

| Datasets | #Video | Time (h) | #Shot | #Scene |
|---|---|---|---|---|
| BBC [4] | 11 | 9 | 4,900 | 547 |
| OVSD [5] | 21 | 10 | 9,377 | 607 |
| MovieScene-150 [1] | 150 | 297 | 270,450 | 21,428 |
| MovieScene-318 [1] | 318 | 601 | 503,522 | 41,963 |
| MovieNet [6] | 1,100 | 3,000 | – | – |

To show the differences among the 5 annotation results, we present the average (mean), minimum (min), maximum (max) and standard deviation (std.) of the number of scenes for each video in BBC [4] across the annotations in Tab. 2.

Table 2. Number of scenes in each video of the BBC dataset.

| Video Names | mean | min | max | std. |
|---|---|---|---|---|
| From Pole to Pole | 47.8 | 23 | 65 | 14.4 |
| Mountains | 47.6 | 36 | 62 | 9.8 |
| Ice Worlds | 50.6 | 33 | 69 | 12.1 |
| Great Plains | 52.0 | 30 | 74 | 14.1 |
| Jungles | 47.4 | 25 | 59 | 11.8 |
| Seasonal Forests | 53.4 | 33 | 71 | 12.1 |
| Fresh Water | 55.2 | 37 | 70 | 10.5 |
| Ocean Deep | 47.8 | 29 | 67 | 12.7 |
| Shallow Seas | 49.2 | 33 | 66 | 12.0 |
| Caves | 45.4 | 22 | 63 | 14.1 |
| Deserts | 50.8 | 26 | 64 | 13.7 |
| All Videos (Avg.) | 547 (total) | – | – | – |

3.2. Backbones

For the visual modality, ResNet50 [7] is used as the encoder, with the same modification as [2]: the input channel number of the first convolution layer is changed from 3 to 9. As for the audio modality, we adopt the same backbone used in [1].
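The backbone modification in Sec. 3.2 (a first convolution taking 9 input channels, i.e., the 3 RGB frames sampled from a shot stacked channel-wise) can be sketched as below; inflating the 3-channel kernels by repetition is our assumption about initialization, not something the paper states.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

def build_shot_encoder():
    """ResNet50 whose first conv takes 9 channels: the N = 3 sampled frames
    of a shot (3 RGB channels each) stacked along the channel axis."""
    model = resnet50()
    old = model.conv1
    model.conv1 = nn.Conv2d(9, old.out_channels, kernel_size=7,
                            stride=2, padding=3, bias=False)
    with torch.no_grad():  # assumed init: tile the 3-channel kernels 3 times
        model.conv1.weight.copy_(old.weight.repeat(1, 3, 1, 1) / 3.0)
    return model

encoder = build_shot_encoder()
feat = encoder(torch.randn(2, 9, 224, 224))   # one shot = 3 stacked frames
print(feat.shape)                             # torch.Size([2, 1000])
```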
3.3. Choice of Hyperparameters

Two hyperparameters are introduced in Sec. 3.1.2 of the main body of this work, i.e., the number of clusters (#class) for the Scene Consistency Selection strategy and the length of continuous shots (ρ) for Scene Agnostic Clip-Shuffling. We study the sensitivity of the proposed algorithm against these two hyperparameters in Tab. 3. The MLP [2] protocol on the MovieScene-318 dataset is used in this experiment.

Table 3. AP results for different settings of the hyperparameters. The bold and italic values stand for the optimal and suboptimal performances, respectively.

| ρ \ #class | 16 | 24 | 32 |
|---|---|---|---|
| 16 | 51.80 | **53.74** | 53.22 |
| 24 | 52.64 | *53.62* | 53.13 |
| 32 | 52.68 | 52.91 | 53.05 |

3.4. Data Augmentation Details

We follow the data augmentation operations used in [8], i.e., random cropping, flipping, color distortion and Gaussian blurring. A PyTorch-like pseudocode for the data augmentations, i.e., the Asymmetric Augmentation mentioned in Sec. 3.1.2 of the body of this work, is presented in Listing 1.

```python
# Listing 1. A PyTorch-like pseudocode for the data augmentation.
import random
from PIL import ImageFilter
import torchvision.transforms as transforms

class GaussianBlur:
    """PIL-based Gaussian blur with a random sigma. This helper is not part
    of the original listing; a MoCo v2-style implementation is assumed here
    so that the pseudocode is self-contained and runnable."""
    def __init__(self, sigma):
        self.sigma = sigma
    def __call__(self, img):
        sigma = random.uniform(self.sigma[0], self.sigma[1])
        return img.filter(ImageFilter.GaussianBlur(radius=sigma))

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Augmentation for the key encoder: the full set of operations.
augmentation_key_encoder = [
    transforms.ToPILImage(),
    transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
    transforms.RandomApply([
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.5),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([GaussianBlur([.1, 2.])], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
]

# Augmentation for the query encoder: a weaker, asymmetric set
# without color distortion and grayscale conversion.
augmentation_query_encoder = [
    transforms.ToPILImage(),
    transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
    transforms.RandomApply([GaussianBlur([.1, 2.])], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
]
```

4. Additional Results

4.1. Results on the BBC / OVSD Datasets

Since the training/validation/testing splits of BBC [4] and OVSD [5] are not available, and the scale of these two datasets is very small compared to the MovieNet [6] dataset, we apply the model trained on MovieNet [6] to BBC [4] and OVSD [5] to study the generalization abilities of the algorithms without fine-tuning; the results are shown in Tab. 4 and Tab. 5. Tab. 4 shows that the proposed method outperforms ShotCoL [2] by a large margin of 13.27 in terms of AP on OVSD [5].
Table 4. AP results on the OVSD dataset.

| Methods | AP |
|---|---|
| ShotCoL [2] | 25.53 |
| SCRL | 38.80 |

We conduct experiments with the 5 different annotations on BBC [4] and show the average performances in Tab. 5, where the proposed method outperforms the compared method by a margin of 2.20 in terms of AP.

Table 5. AP results on the BBC dataset. A.i stands for the i-th annotation and Avg. represents the average of the results over the 5 different annotators.

| Methods | A.1 | A.2 | A.3 | A.4 | A.5 | Avg. |
|---|---|---|---|---|---|---|
| ShotCoL [2] | 29.90 | 30.81 | 31.45 | 26.45 | 21.27 | 27.98 |
| SCRL | 32.45 | 32.54 | 33.27 | 28.36 | 24.27 | 30.18 |
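AP is the headline metric in all of these tables. For reference, a hedged sketch of how per-shot boundary predictions can be scored against ground-truth boundary labels with scikit-learn follows; treating every shot transition as an independent binary decision is our simplification of the evaluation protocol, not the paper's exact definition.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# One 0/1 label per shot transition (1 = scene boundary) and the model's
# predicted boundary probabilities for the same transitions.
y_true = np.array([0, 0, 1, 0, 0, 0, 1, 0])
y_score = np.array([0.1, 0.5, 0.8, 0.3, 0.1, 0.6, 0.4, 0.2])

print(f"AP = {average_precision_score(y_true, y_score):.4f}")  # -> AP = 0.7500
```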
[Figure 3: AP results on the MovieScene-318 dataset as a function of Shot-Len, together with the model sizes of the Boundary based and Boundary free models (B. based: MLP, 228M; B. based: LGSS, 38M; B. free: Bi-LSTM, 15M), where B. stands for Boundary and Shot-Len represents the length of shots processed within a batch.]
[Figure 4: The visualization results of shot retrieval (the TOP-5 shots most similar to the query, with their shot IDs, for the SC, NN, ImageNet and Self selections). Compared with the other methods, the results of SC appear more consistent in terms of the semantic information, i.e., scenes with explosions.]

4.2. Results under the MLP / Bi-LSTM Protocols

To study the superiority of the introduced evaluation protocol for the task of Video Scene Segmentation, AP results, together with the model sizes of the Boundary based (i.e., LGSS [1] and the MLP [2] protocol) and Boundary free (i.e., the introduced Bi-LSTM protocol) models, are shown in Fig. 3. As shown in Fig. 3, although the performances of all models are associated with the length of the shot sequence, i.e., Shot-Len, the Boundary based model achieves its best performance only when Shot-Len is relatively small (and the optimal Shot-Len is less than the average number of shots per scene, i.e., 12). As Shot-Len becomes larger, its performance decreases and its model size increases. By contrast, the Boundary free model introduces less inductive bias and takes the shot feature as the basic temporal input unit; hence, it is able to model representations of longer shot sequences, while achieving better performance when Shot-Len takes a value in the appropriate range, with a model whose size stays the same.

4.3. Visualization

4.3.1. Shot Retrieval

An additional result of Shot Retrieval, which is introduced in Sec. 4.4 of the main body of this work, is given in Fig. 4.

4.3.2. Scene Boundary

To study the practical performance of our approach on the Video Scene Segmentation task, we visualize the ground-truth (GT) and predicted scene boundaries in Fig. 5. For the simple scenes in Fig. 5 (a1)/(a2), the proposed method easily identifies the scenes and gives correct predictions of the scene boundary. As shown in Fig. 5 (b1)/(b2)/(c1)/(c2), there are also bad cases where the proposed method fails to distinguish between the segmentation points of a shot and of a scene; for these cases, it may be confusing to identify whether the shots belong to the same scene or not from the visual modality alone.
[Figure 5: The ground truth (GT) and prediction scene boundaries are presented in this figure, where the middle frame of each shot is visualized. Panels (a1)/(a2), (b1)/(b2) and (c1)/(c2) show the prediction cases of true positive, false negative and false positive, respectively. The visualized clips are Shot 261-268 (Movie: tt0082089), Shot 524-531 (Movie: tt0082089), Shot 364-371 (Movie: tt0096320), Shot 83-90 (Movie: tt0103776), Shot 400-407 (Movie: tt0123755) and Shot 1182-1189 (Movie: tt0120382).]

References

[1] Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, and Dahua Lin. A local-to-global approach to multi-modal movie scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10146–10155, 2020.
[2] Shixing Chen, Xiaohan Nie, David Fan, Dongqing Zhang, Vimal Bhat, and Raffay Hamid. Shot contrastive self-supervised learning for scene boundary detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9796–9805, 2021.
[3] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.
[4] Lorenzo Baraldi, Costantino Grana, and Rita Cucchiara. A deep siamese network for scene detection in broadcast videos. In Proceedings of the 23rd ACM International Conference on Multimedia, pages 1199–1202, 2015.
[5] Daniel Rotman, Dror Porat, and Gal Ashour. Optimal sequential grouping for robust video scene detection using multiple modalities. International Journal of Semantic Computing, 11(02):193–208, 2017.
[6] Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. MovieNet: A holistic dataset for movie understanding. In Computer Vision – ECCV 2020, pages 709–727. Springer, 2020.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[8] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.