M&M Mix: A Multimodal Multiview Transformer Ensemble
- URL: http://arxiv.org/abs/2206.09852v1
- Date: Mon, 20 Jun 2022 15:31:13 GMT
- Title: M&M Mix: A Multimodal Multiview Transformer Ensemble
- Authors: Xuehan Xiong, Anurag Arnab, Arsha Nagrani, Cordelia Schmid
- Abstract summary: This report describes the approach behind our winning solution to the 2022 Epic-Kitchens Action Recognition Challenge.
Our approach builds upon our recent work, Multiview Transformer for Video Recognition (MTV), and adapts it to multimodal inputs.
Our approach achieved 52.8% Top-1 accuracy on the test set in action classes, which is 4.1% higher than last year's winning entry.
- Score: 77.16389667210427
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report describes the approach behind our winning solution to the 2022
Epic-Kitchens Action Recognition Challenge. Our approach builds upon our recent
work, Multiview Transformer for Video Recognition (MTV), and adapts it to
multimodal inputs. Our final submission consists of an ensemble of Multimodal
MTV (M&M) models with varying backbone sizes and input modalities. Our approach
achieved 52.8% Top-1 accuracy on action classes on the test set, which is 4.1%
higher than last year's winning entry.
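As a rough illustration of the ensembling step (not the authors' exact recipe), the sketch below averages class probabilities from a few stand-in M&M variants; the label-space size, member count, and uniform weighting are all assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
NUM_ACTIONS = 97 * 300  # verb x noun pairs in EPIC-KITCHENS-100; a stand-in label space

# Stand-in logits for three hypothetical M&M variants (e.g. different
# backbone sizes and modality mixes). Real logits would come from the
# trained models.
member_logits = [rng.normal(size=NUM_ACTIONS) for _ in range(3)]

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Uniformly average post-softmax scores across members, then take the
# arg-max action. Uniform weighting is an assumption; the report does
# not state the exact combination rule.
probs = np.mean([softmax(l) for l in member_logits], axis=0)
predicted_action = int(np.argmax(probs))
print(predicted_action)
```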
Related papers
- Multi-Object Tracking Retrieval with LLaVA-Video: A Training-Free Solution to MOT25-StAG Challenge [42.013930541762484]
The aim of this challenge is to accurately localize and track multiple objects that match specific and free-form language queries.
We model the underlying task as a video retrieval problem and present a two-stage, zero-shot approach.
On the MOT25-StAG test set, our method achieves m-HIoU and HOTA scores of 20.68 and 10.73, respectively, winning second place in the challenge.
arXiv Detail & Related papers (2025-11-05T10:01:31Z)
- MAGMaR Shared Task System Description: Video Retrieval with OmniEmbed [55.526939500742]
We use OmniEmbed, a powerful multimodal embedding model from the Tevatron 2.0 toolkit, to generate unified embeddings for text, images, audio, and video.
Our submission achieved the highest score on the MAGMaR shared task leaderboard among public submissions as of May 20th, 2025.
arXiv Detail & Related papers (2025-06-11T05:40:26Z)
- The 1st Solution for 4th PVUW MeViS Challenge: Unleashing the Potential of Large Multimodal Models for Referring Video Segmentation [31.44879457190659]
We propose a simple and effective inference optimization method to fully unleash the potential of LMMs in referring video segmentation.
Our solution achieved 61.98% J&F on the MeViS test set and ranked 1st place in the 4th PVUW Challenge MeViS Track at CVPR 2025.
arXiv Detail & Related papers (2025-04-07T15:24:54Z)
- MMHMER: Multi-viewer and Multi-task for Handwritten Mathematical Expression Recognition [0.6694605027794318]
We propose a new multi-view, multi-task framework that can effectively integrate the strengths of CNN and Transformer.
Our model can better handle the complexity of handwritten mathematical expressions.
arXiv Detail & Related papers (2025-02-08T13:03:52Z)
- MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding [150.28164854480912]
We introduce MuirBench, a benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs.
MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations.
We show that even the best-performing models, such as GPT-4o and Gemini Pro, find it challenging to solve MuirBench, achieving 68.0% and 49.3% accuracy, respectively.
arXiv Detail & Related papers (2024-06-13T17:59:52Z)
- Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [118.08008540513596]
Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in video analysis.
We extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models.
Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
arXiv Detail & Related papers (2024-05-31T17:59:47Z)
- M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval [34.343617836027725]
We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and the content of each modality in videos.
Our framework provides two kinds of training strategies, including an ensemble manner and an end-to-end manner.
arXiv Detail & Related papers (2022-08-16T10:51:37Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Skating-Mixer: Multimodal MLP for Scoring Figure Skating [31.346611498891964]
We introduce a multimodal architecture named Skating-Mixer.
It effectively learns long-term representations through our designed memory recurrent unit (MRU).
Experiments show the proposed method outperforms state-of-the-art methods on all major metrics on the public Fis-V and our FS1000 datasets.
arXiv Detail & Related papers (2022-03-08T10:36:55Z)
- Multiview Transformers for Video Recognition [69.50552269271526]
We present Multiview Transformers for Video Recognition (MTV), which processes different views of the input video at different resolutions.
MTV consistently performs better than single-view counterparts in terms of accuracy and computational cost.
We achieve state-of-the-art results on five standard datasets, and improve even further with large-scale pretraining.
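The one-line summary above compresses MTV's design; roughly, the same clip is tokenized at several granularities and each "view" gets its own encoder. A hedged toy version follows, where all dimensions and the concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.
```python
import torch
import torch.nn as nn

# A toy two-view encoder in the spirit of MTV (hypothetical sizes throughout):
# view 1 keeps one token per frame, view 2 pools 4-frame chunks, each view has
# its own transformer encoder, and the per-view summaries are concatenated.
# Concatenation stands in for the paper's cross-view fusion.
class ToyMultiviewEncoder(nn.Module):
    def __init__(self, dim=64, feat=128, num_classes=10):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.proj_fine = nn.Linear(feat, dim)    # view 1: one token per frame
        self.proj_coarse = nn.Linear(feat, dim)  # view 2: one token per 4-frame chunk
        self.pool = nn.AvgPool1d(kernel_size=4)
        self.enc_fine = nn.TransformerEncoder(layer(), num_layers=2)
        self.enc_coarse = nn.TransformerEncoder(layer(), num_layers=2)
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, x):                        # x: (batch, frames, feat)
        fine = self.enc_fine(self.proj_fine(x)).mean(dim=1)
        coarse = self.pool(x.transpose(1, 2)).transpose(1, 2)
        coarse = self.enc_coarse(self.proj_coarse(coarse)).mean(dim=1)
        return self.head(torch.cat([fine, coarse], dim=-1))

logits = ToyMultiviewEncoder()(torch.randn(2, 16, 128))  # -> shape (2, 10)
```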
arXiv Detail & Related papers (2022-01-12T03:33:57Z)
- Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval [36.50847375135979]
Multi-modal learning from video data has seen increased attention recently as it allows training semantically meaningful embeddings without human annotation.
We present a multi-modal, modality-agnostic fusion transformer that learns to exchange information between multiple modalities, such as video, audio, and text, and integrates them into a joint multi-modal representation.
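As a rough sketch of the fusion idea described above, the snippet below concatenates token sequences from three modalities and encodes them with one shared transformer; the sizes and the mean-pooling are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn as nn

# One shared encoder over the concatenation of modality token sequences;
# a simplification of the fusion transformer described above.
dim = 64
fusion = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)

video_tokens = torch.randn(1, 8, dim)  # stand-in per-modality features
audio_tokens = torch.randn(1, 6, dim)
text_tokens = torch.randn(1, 4, dim)

# Self-attention exchanges information across modalities; mean-pooling
# yields a single joint embedding.
tokens = torch.cat([video_tokens, audio_tokens, text_tokens], dim=1)
joint_embedding = fusion(tokens).mean(dim=1)  # shape (1, 64)
```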
arXiv Detail & Related papers (2021-12-08T18:14:57Z)
- Top1 Solution of QQ Browser 2021 Ai Algorithm Competition Track 1: Multimodal Video Similarity [0.6445605125467573]
We describe the solution to the QQ Browser 2021 Ai Algorithm Competition (AIAC) Track 1.
In the pretrain phase, we train the model on three tasks: (1) Video Tag Classification (VTC), (2) Mask Language Modeling (MLM), and (3) Mask Frame Modeling (MFM).
In the finetune phase, we train the model on video similarity using rank-normalized human labels.
Our full pipeline, after ensembling several models, scores 0.852 on the leaderboard, which earned 1st place in the competition.
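A schematic of how the three pretraining objectives named above could be combined into one loss; the loss forms, shapes, and uniform weighting are assumptions, since the summary only lists the task names.
```python
import torch
import torch.nn.functional as F

# Hypothetical combination of the three pretraining objectives: multi-label
# tag classification (VTC), masked-token prediction (MLM), and regression of
# masked frame features (MFM). Uniform weights are an assumption.
def pretrain_loss(vtc_logits, tag_targets, mlm_logits, token_targets,
                  mfm_pred, frame_targets):
    vtc = F.binary_cross_entropy_with_logits(vtc_logits, tag_targets)
    mlm = F.cross_entropy(mlm_logits, token_targets)
    mfm = F.mse_loss(mfm_pred, frame_targets)
    return vtc + mlm + mfm

loss = pretrain_loss(
    torch.randn(2, 100), torch.randint(0, 2, (2, 100)).float(),  # video-tag head
    torch.randn(6, 30000), torch.randint(0, 30000, (6,)),        # masked-token head
    torch.randn(2, 8, 768), torch.randn(2, 8, 768))              # masked-frame head
```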
arXiv Detail & Related papers (2021-10-30T15:38:04Z)
- M5Product: A Multi-modal Pretraining Benchmark for E-commercial Product Downstream Tasks [94.80043324367858]
We contribute a large-scale dataset, named M5Product, which consists of over 6 million multimodal pairs.
M5Product contains rich information across multiple modalities, including image, text, table, video, and audio.
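To make the five modalities concrete, here is one hypothetical record layout for an M5Product-style sample; the field names and types are assumptions, not the dataset's actual schema.
```python
from dataclasses import dataclass, field
from typing import Optional

# A hypothetical record for one multimodal product sample; not every product
# need carry all five modalities, so video/audio are optional here.
@dataclass
class ProductSample:
    image_path: str
    caption: str                                     # text modality
    spec_table: dict = field(default_factory=dict)   # attribute table, e.g. {"brand": "..."}
    video_path: Optional[str] = None
    audio_path: Optional[str] = None

sample = ProductSample("img/0001.jpg", "wireless earbuds",
                       {"color": "black", "battery": "24h"})
```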
arXiv Detail & Related papers (2021-09-09T13:50:22Z)
- The Multi-Modal Video Reasoning and Analyzing Competition [40.13636409397136]
We introduce the Multi-Modal Video Reasoning and Analyzing Competition (MMVRAC) workshop in conjunction with ICCV 2021.
This competition is composed of four different tracks, namely, video question answering, skeleton-based action recognition, fisheye video-based action recognition, and person re-identification.
We summarize the top-performing methods submitted by the participants and the results they achieved in the competition.
arXiv Detail & Related papers (2021-08-18T18:40:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.