CuriosAI Submission to the EgoExo4D Proficiency Estimation Challenge 2025
- URL: http://arxiv.org/abs/2507.08022v1
- Date: Tue, 08 Jul 2025 12:33:02 GMT
- Title: CuriosAI Submission to the EgoExo4D Proficiency Estimation Challenge 2025
- Authors: Hayato Tanoue, Hiroki Nishihara, Yuma Suzuki, Takayuki Hori, Hiroki Takushima, Aiswariya Manojkumar, Yuki Shibata, Mitsuru Takeda, Fumika Beppu, Zhao Hengwei, Yuto Kanda, Daichi Yamaga,
- Abstract summary: This report presents the CuriosAI team's submission to the EgoExo4D Proficiency Estimation Challenge at CVPR 2025. We propose two methods for multi-view skill assessment: (1) a multi-task learning framework using Sapiens-2B that jointly predicts proficiency and scenario labels (43.6% accuracy), and (2) a two-stage pipeline combining zero-shot scenario recognition with view-specific VideoMAE classifiers (47.8% accuracy). The superior performance of the two-stage approach demonstrates the effectiveness of scenario-conditioned modeling for proficiency estimation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This report presents the CuriosAI team's submission to the EgoExo4D Proficiency Estimation Challenge at CVPR 2025. We propose two methods for multi-view skill assessment: (1) a multi-task learning framework using Sapiens-2B that jointly predicts proficiency and scenario labels (43.6% accuracy), and (2) a two-stage pipeline combining zero-shot scenario recognition with view-specific VideoMAE classifiers (47.8% accuracy). The superior performance of the two-stage approach demonstrates the effectiveness of scenario-conditioned modeling for proficiency estimation.
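For readers who want the shape of the two-stage method, here is a minimal PyTorch sketch: stage one picks a scenario zero-shot (stood in for here by a CLIP-style video-text matcher), and stage two routes each clip to a proficiency classifier specific to that scenario (standing in for the view-specific VideoMAE heads). All module names, dimensions, and scenario labels are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the two-stage pipeline from the abstract: zero-shot
# scenario recognition followed by a scenario-specific proficiency classifier.
# All module names, dimensions, and scenario labels are illustrative.
import torch
import torch.nn as nn

SCENARIOS = ["cooking", "music", "basketball", "bouldering"]  # assumed labels
PROFICIENCY_CLASSES = 4  # e.g., novice / early / intermediate / late expert

class ZeroShotScenarioRecognizer(nn.Module):
    """Stand-in for a CLIP-style video-text matcher (stage 1)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.video_encoder = nn.Linear(1024, embed_dim)   # placeholder encoder
        self.text_embeds = nn.Parameter(torch.randn(len(SCENARIOS), embed_dim))

    def forward(self, video_feats):                       # (B, 1024)
        v = nn.functional.normalize(self.video_encoder(video_feats), dim=-1)
        t = nn.functional.normalize(self.text_embeds, dim=-1)
        return (v @ t.T).argmax(dim=-1)                   # scenario index per clip

class ProficiencyClassifier(nn.Module):
    """Stand-in for a view-specific VideoMAE classification head (stage 2)."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(1024, PROFICIENCY_CLASSES)

    def forward(self, video_feats):
        return self.head(video_feats)

def two_stage_predict(video_feats, recognizer, per_scenario_heads):
    """Route each clip to the classifier matching its predicted scenario."""
    scenario_idx = recognizer(video_feats)
    logits = torch.stack([
        per_scenario_heads[SCENARIOS[s]](f)
        for f, s in zip(video_feats, scenario_idx)
    ])
    return scenario_idx, logits.argmax(dim=-1)

recognizer = ZeroShotScenarioRecognizer()
heads = {name: ProficiencyClassifier() for name in SCENARIOS}
feats = torch.randn(2, 1024)                              # dummy clip features
print(two_stage_predict(feats, recognizer, heads))
```

The routing step is the point of the design: each proficiency head only ever sees clips from one scenario, which is the scenario-conditioned modeling the abstract credits for the 47.8% result.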
Related papers
- Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge [0.0]
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge. The BEHAVIOR Challenge is a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
arXiv Detail & Related papers (2025-12-07T18:08:45Z)
- Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis [7.392659193819963]
Traffic safety analysis requires complex video understanding to capture behavioral patterns and generate descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization.
arXiv Detail & Related papers (2025-10-13T20:18:23Z)
- VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results [106.15762208088985]
The VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models was hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment.
arXiv Detail & Related papers (2025-09-11T07:00:50Z)
- KAT-V1: Kwai-AutoThink Technical Report [50.84483585850113]
We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks. KAT dynamically switches between reasoning and non-reasoning modes based on task complexity. We also propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework.
arXiv Detail & Related papers (2025-07-11T04:07:10Z)
- ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning [57.67273340380651]
Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks. These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
arXiv Detail & Related papers (2025-07-03T14:29:43Z)
- Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment [10.701522670464463]
Multimodal large language models (MLLMs) can proficiently evaluate visual quality through interpretable assessments. We propose a unified two-stage training framework comprising a cold-start stage and a reinforcement learning-based fine-tuning stage. We designate the models derived from these two stages as Q-Ponder-CI and Q-Ponder.
arXiv Detail & Related papers (2025-06-03T10:11:51Z)
- SkillVerse: Assessing and Enhancing LLMs with Tree Evaluation [70.27631454256024]
SkillVerse is an unsupervised tree-structured diagnosis framework for understanding model proficiency in specific abilities. Given proficiency at arbitrary levels of granularity, SkillVerse can flexibly produce insights into the behavior of modern large models.
arXiv Detail & Related papers (2025-05-31T00:08:59Z)
- CLaC at SemEval-2025 Task 6: A Multi-Architecture Approach for Corporate Environmental Promise Verification [0.20482269513546458]
This paper presents our approach to SemEval-2025 Task 6 (PromiseEval), which focuses on verifying promises in corporate ESG (Environmental, Social, and Governance) reports. We explore three model architectures to address the four subtasks of promise identification, supporting evidence assessment, clarity evaluation, and verification timing. Our work highlights the effectiveness of linguistic feature extraction, attention pooling, and multi-objective learning in promise verification tasks, despite challenges posed by class imbalance and limited training data (an attention-pooling sketch follows below).
arXiv Detail & Related papers (2025-05-29T15:19:00Z)
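Attention pooling, one of the components this CLaC entry credits, collapses a token sequence into a single vector via learned per-token weights. The module below is a generic sketch under assumed dimensions, not the authors' implementation.

```python
# Generic attention-pooling layer: collapses a sequence of token embeddings
# into one vector using learned attention weights. Illustrative only.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)  # one relevance score per token

    def forward(self, token_embeds, mask=None):  # (B, T, H), optional (B, T)
        scores = self.scorer(token_embeds).squeeze(-1)             # (B, T)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))  # drop padding
        weights = torch.softmax(scores, dim=-1)                    # (B, T)
        return (weights.unsqueeze(-1) * token_embeds).sum(dim=1)   # (B, H)

pool = AttentionPooling(hidden_dim=768)
x = torch.randn(2, 16, 768)          # e.g., encoder outputs for 16 tokens
print(pool(x).shape)                 # torch.Size([2, 768])
```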
- ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models [37.54872845368151]
We conduct a case study using a synthetic dataset solvable only through visual reasoning. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions. Although humans achieve 93% accuracy, the best-performing model, Gemini-2.5-Pro, attains only 63.0%, and the leading open-source LVLM, Qwen2.5-VL-72B-Instruct, achieves only 38.5%.
arXiv Detail & Related papers (2025-05-19T17:59:27Z)
- VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning [59.68917139718813]
We show that a strong off-the-shelf frozen pretrained visual encoder can achieve state-of-the-art (SoTA) performance in forecasting and procedural planning.
By conditioning on frozen clip-level embeddings from observed steps to predict the actions of unseen steps, our prediction model is able to learn robust representations for forecasting (see the sketch below).
arXiv Detail & Related papers (2024-10-04T14:52:09Z)
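A minimal reading of VEDIT's conditioning scheme: frozen clip-level embeddings of the observed steps feed a trainable predictor that regresses the latent of the next, unseen step. Dimensions and module names below are illustrative assumptions.

```python
# Sketch: predict the latent of an unseen step from frozen embeddings of
# observed steps. The transformer predictor is trainable; the encoder is not.
import torch
import torch.nn as nn

class LatentStepPredictor(nn.Module):
    def __init__(self, dim=768, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=layers)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # "next step" slot

    def forward(self, observed):                 # (B, T, dim) frozen embeddings
        q = self.query.expand(observed.size(0), -1, -1)
        h = self.backbone(torch.cat([observed, q], dim=1))
        return h[:, -1]                          # predicted latent of unseen step

predictor = LatentStepPredictor()
observed = torch.randn(4, 5, 768).detach()       # stands in for frozen clip embeddings
pred = predictor(observed)                       # (4, 768)
target = torch.randn(4, 768)                     # embedding of the actual next step
loss = nn.functional.mse_loss(pred, target)      # simple regression objective
print(loss.item())
```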
- Solution for OOD-CV Workshop SSB Challenge 2024 (Open-Set Recognition Track) [6.998958192483059]
The challenge required identifying whether a test sample belonged to the semantic classes of a classifier's training set.
We proposed a hybrid approach, experimenting with the fusion of various post-hoc OOD detection techniques and different Test-Time Augmentation strategies.
Our best-performing method combined Test-Time Augmentation with the post-hoc OOD techniques, achieving a strong balance between AUROC and FPR95 scores (a minimal sketch follows below).
arXiv Detail & Related papers (2024-09-30T13:28:14Z)
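One concrete way to realize that fusion is to average a post-hoc OOD score over several augmented views of the same input. The snippet below uses maximum softmax probability (one common post-hoc score) and is a generic sketch, not the authors' exact recipe.

```python
# Sketch: fuse Test-Time Augmentation with a post-hoc OOD score by averaging
# the maximum-softmax-probability (MSP) score over augmented views.
import torch
import torch.nn as nn

def msp_score(logits):
    """Post-hoc OOD score: higher means more in-distribution."""
    return torch.softmax(logits, dim=-1).max(dim=-1).values

@torch.no_grad()
def tta_ood_score(model, image, augmentations):
    """Average the MSP score over several augmented views of one image."""
    scores = [msp_score(model(aug(image))) for aug in augmentations]
    return torch.stack(scores).mean(dim=0)

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy classifier
augs = [
    lambda x: x,                                  # identity view
    lambda x: torch.flip(x, dims=[-1]),           # horizontal flip
    lambda x: x + 0.01 * torch.randn_like(x),     # tiny noise jitter
]
image = torch.rand(1, 3, 32, 32)
print(tta_ood_score(model, image, augs))          # ID-ness score in (0, 1]
```

Thresholding this averaged score then separates in-distribution from out-of-distribution samples, which is what AUROC and FPR95 measure.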
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) into pre-trained diffusion models to extract proper features for perception (see the sketch below).
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
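One way to picture the meta prompts: a small set of learnable query vectors cross-attends to frozen backbone features (e.g., diffusion U-Net activations) to pull out task-relevant representations. The sketch below is an interpretation; shapes and names are assumptions.

```python
# Sketch: learnable "meta prompt" queries cross-attend to frozen backbone
# features (e.g., diffusion U-Net activations) to extract perception features.
import torch
import torch.nn as nn

class MetaPromptExtractor(nn.Module):
    def __init__(self, num_prompts=16, dim=256):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim))  # learnable
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats):                     # (B, H*W, dim) frozen features
        q = self.prompts.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, feats, feats)  # prompts query the features
        return out                                 # (B, num_prompts, dim)

extractor = MetaPromptExtractor()
frozen_feats = torch.randn(2, 64 * 64, 256).detach()  # stand-in for U-Net features
print(extractor(frozen_feats).shape)                  # torch.Size([2, 16, 256])
```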
- X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning [109.9413329636322]
This paper introduces an efficient framework that integrates multiple modalities (images, 3D, audio, and video) into a frozen Large Language Model (LLM).
Our approach explores two distinct projection mechanisms: Q-Formers and Linear Projections (LPs); a minimal linear-projection sketch follows below.
arXiv Detail & Related papers (2023-11-30T18:43:51Z)
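Of the two mechanisms, the linear projection is the simpler: one learned matrix maps frozen modality-encoder outputs into the LLM's token-embedding space so they can be prepended as soft tokens. Dimensions below are assumptions.

```python
# Sketch: align a frozen modality encoder's outputs to an LLM's embedding
# space with a single linear projection ("LP"), yielding soft prompt tokens.
import torch
import torch.nn as nn

MODALITY_DIM = 1024   # assumed output dim of a frozen audio/3D/video encoder
LLM_DIM = 4096        # assumed hidden size of the frozen LLM

project = nn.Linear(MODALITY_DIM, LLM_DIM)        # the only trained piece here

modality_tokens = torch.randn(1, 32, MODALITY_DIM).detach()  # frozen features
soft_tokens = project(modality_tokens)            # (1, 32, LLM_DIM)

text_embeds = torch.randn(1, 12, LLM_DIM)         # embeddings of the text prompt
llm_inputs = torch.cat([soft_tokens, text_embeds], dim=1)  # prepend modality
print(llm_inputs.shape)                           # torch.Size([1, 44, 4096])
```

A Q-Former replaces the single matrix with a small transformer that compresses the modality features into a fixed number of query tokens; the alignment target (the LLM's embedding space) is the same.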
- MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal Emotion Recognition [118.73025093045652]
We propose a pre-training model, MEmoBERT, for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked-text-prediction problem (see the sketch below).
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z)
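The prompt-based reformulation can be illustrated with a generic masked language model: append a template such as "I am [MASK]." and score candidate emotion words at the mask position. The template, label words, and backbone below are illustrative assumptions, not MEmoBERT's actual prompt or model.

```python
# Sketch: recast emotion classification as masked-token prediction by scoring
# candidate emotion words at a [MASK] slot. Template and word list are assumed.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

EMOTION_WORDS = ["happy", "sad", "angry", "neutral"]   # assumed label words

def predict_emotion(utterance):
    prompt = f"{utterance} I am {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]   # scores over vocabulary
    word_ids = tokenizer.convert_tokens_to_ids(EMOTION_WORDS)
    return EMOTION_WORDS[logits[word_ids].argmax().item()]

print(predict_emotion("What a wonderful surprise!"))
```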
- KaLM at SemEval-2020 Task 4: Knowledge-aware Language Models for Comprehension And Generation [4.94950858749529]
We propose a novel way to search for evidence and choose different large-scale pre-trained models as backbones for the three subtasks.
The results show that our evidence-searching approach improves model performance on the commonsense explanation task.
arXiv Detail & Related papers (2020-05-24T15:09:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.