Listening to the Unspoken: Exploring "365" Aspects of Multimodal Interview Performance Assessment
- URL: http://arxiv.org/abs/2507.22676v3
- Date: Tue, 05 Aug 2025 07:29:09 GMT
- Title: Listening to the Unspoken: Exploring "365" Aspects of Multimodal Interview Performance Assessment
- Authors: Jia Li, Yang Wang, Wenhao Qian, Jialong Hu, Zhenzhen Hu, Richang Hong, Meng Wang
- Abstract summary: We propose a novel and comprehensive framework that explores "365" aspects of interview performance. The framework employs modality-specific feature extractors to encode heterogeneous data streams. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data.
- Score: 45.92718704785823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interview performance assessment is essential for determining candidates' suitability for professional positions. To ensure holistic and fair evaluations, we propose a novel and comprehensive framework that explores "365" aspects of interview performance by integrating three modalities (video, audio, and text), six responses per candidate, and five key evaluation dimensions. The framework employs modality-specific feature extractors to encode heterogeneous data streams, which are subsequently fused via a Shared Compression Multilayer Perceptron. This module compresses multimodal embeddings into a unified latent space, facilitating efficient feature interaction. To enhance prediction robustness, we incorporate a two-level ensemble learning strategy: (1) independent regression heads predict scores for each response, and (2) predictions are aggregated across responses using a mean-pooling mechanism to produce final scores for the five target dimensions. By listening to the unspoken, our approach captures both explicit and implicit cues from multimodal data, enabling comprehensive and unbiased assessments. Achieving a multi-dimensional average MSE of 0.1824, our framework secured first place in the AVI Challenge 2025, demonstrating its effectiveness and robustness in advancing automated and multimodal interview performance assessment. The full implementation is available at https://github.com/MSA-LMC/365Aspects.
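As a rough illustration of the pipeline the abstract describes, the sketch below wires pre-extracted per-modality features through a shared compression MLP, per-response regression heads, and mean pooling over the six responses. Layer sizes, module names, and the exact fusion layout are assumptions rather than the authors' released code; see the linked repository for the actual implementation.

```python
# Minimal PyTorch sketch of the "365" pipeline described in the abstract.
# Feature dimensions and module names are assumptions, not taken from the repo.
import torch
import torch.nn as nn

N_RESPONSES, N_DIMENSIONS = 6, 5  # six responses, five evaluation dimensions

class SharedCompressionMLP(nn.Module):
    """Compresses concatenated multimodal embeddings into a unified latent space."""
    def __init__(self, video_dim=768, audio_dim=512, text_dim=768, latent_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(video_dim + audio_dim + text_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, video, audio, text):
        return self.mlp(torch.cat([video, audio, text], dim=-1))

class InterviewScorer(nn.Module):
    """One regression head per response; predictions are mean-pooled across responses."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.fusion = SharedCompressionMLP(latent_dim=latent_dim)
        self.heads = nn.ModuleList(
            nn.Linear(latent_dim, N_DIMENSIONS) for _ in range(N_RESPONSES)
        )

    def forward(self, video, audio, text):
        # Each input: (batch, N_RESPONSES, feature_dim) of pre-extracted features.
        latent = self.fusion(video, audio, text)            # (batch, 6, latent_dim)
        per_response = torch.stack(
            [head(latent[:, i]) for i, head in enumerate(self.heads)], dim=1
        )                                                    # (batch, 6, 5)
        return per_response.mean(dim=1)                      # (batch, 5) final scores
```

In a setup like this, the model would be trained with an MSE loss against the five dimension labels, matching the multi-dimensional average MSE reported above.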
Related papers
- HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding [79.06209664703258]
Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks involving both images and videos. Existing human-centric benchmarks predominantly emphasize video generation quality and action recognition, while overlooking essential perceptual and cognitive abilities required in human-centered scenarios. We propose a rigorously curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric video understanding.
arXiv Detail & Related papers (2025-07-07T11:52:24Z) - SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions [21.149270997910403]
The SoMi-ToM benchmark is designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1,225 expert-annotated multiple-choice questions. Results show that LVLMs perform significantly worse than humans on SoMi-ToM.
arXiv Detail & Related papers (2025-06-29T00:54:13Z) - AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs [24.403284945948272]
AutoJudger is an agent-driven framework for efficient and adaptive benchmarking of multimodal large language models. AutoJudger employs Item Response Theory (IRT) to estimate question difficulty and an autonomous evaluation agent to dynamically select the most informative test questions.
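For readers unfamiliar with IRT-based adaptive selection, the following generic sketch (not AutoJudger's code) shows the core idea under a Rasch (1PL) model: the next question is the unanswered one with maximal Fisher information at the current ability estimate.

```python
# Illustrative IRT-driven question selection; function names are hypothetical.
import math

def rasch_prob(ability: float, difficulty: float) -> float:
    """P(correct) under the 1-parameter logistic (Rasch) IRT model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def fisher_information(ability: float, difficulty: float) -> float:
    p = rasch_prob(ability, difficulty)
    return p * (1.0 - p)  # maximal when difficulty is close to ability

def select_next_question(ability: float, difficulties: dict[str, float],
                         answered: set[str]) -> str:
    """Pick the unanswered question with maximal Fisher information."""
    candidates = {q: d for q, d in difficulties.items() if q not in answered}
    return max(candidates, key=lambda q: fisher_information(ability, candidates[q]))

# Example: with ability estimate 0.3, the question nearest difficulty 0.3 is chosen.
next_q = select_next_question(0.3, {"q1": -1.2, "q2": 0.4, "q3": 2.0}, answered={"q1"})
```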
arXiv Detail & Related papers (2025-05-27T16:17:15Z) - From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment [51.3011761744484]
Multi-modal large language models can only process a finite number of frames in a single inference. We propose generating multiple predictions through visual context sampling, followed by a scoring mechanism to select the final prediction. Experiments show that this approach covers the correct answer for a high percentage of long video questions.
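The "sample, predict, then select" idea can be sketched generically as below; the answer generator and scoring function are placeholders standing in for the paper's model and alignment mechanism, not its actual components.

```python
# Hedged sketch: sample several frame subsets within the model's budget, get one
# candidate answer per subset, then keep the best-scored candidate.
import random
from typing import Callable, Sequence

def sample_frame_subsets(n_frames: int, budget: int, n_trials: int,
                         seed: int = 0) -> list[list[int]]:
    """Draw several random frame subsets that fit the model's frame budget."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_frames), budget)) for _ in range(n_trials)]

def answer_by_trials(frames_total: int, budget: int, n_trials: int,
                     answer_fn: Callable[[Sequence[int]], str],
                     score_fn: Callable[[str], float]) -> str:
    """Generate one candidate answer per sampled visual context and keep the best."""
    candidates = [answer_fn(subset)
                  for subset in sample_frame_subsets(frames_total, budget, n_trials)]
    return max(candidates, key=score_fn)
```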
arXiv Detail & Related papers (2025-03-26T11:53:03Z) - Mind the Gap! Static and Interactive Evaluations of Large Audio Models [55.87220295533817]
Large Audio Models (LAMs) are designed to power voice-native experiences.<n>This study introduces an interactive approach to evaluate LAMs and collect 7,500 LAM interactions from 484 participants.
arXiv Detail & Related papers (2025-02-21T20:29:02Z) - DAT: Dialogue-Aware Transformer with Modality-Group Fusion for Human Engagement Estimation [42.87704953679693]
Engagement estimation plays a crucial role in understanding human social behaviors.
We propose a Dialogue-Aware Transformer framework that relies solely on audio-visual input and is language-independent.
Our approach achieves a CCC score of 0.76 on the NoXi Base test set and an average CCC of 0.64 across the NoXi Base, NoXi-Add, and MPIIGI test sets.
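For reference, the CCC reported here is the standard concordance correlation coefficient between predictions and labels; a minimal NumPy implementation of that formula (not code from the paper) is:

```python
# Concordance correlation coefficient: 2*cov / (var_p + var_g + (mean_p - mean_g)^2).
import numpy as np

def concordance_cc(pred: np.ndarray, gold: np.ndarray) -> float:
    pred, gold = np.asarray(pred, dtype=float), np.asarray(gold, dtype=float)
    mean_p, mean_g = pred.mean(), gold.mean()
    var_p, var_g = pred.var(), gold.var()
    cov = ((pred - mean_p) * (gold - mean_g)).mean()
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)
```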
arXiv Detail & Related papers (2024-10-11T02:43:45Z) - Peer Review as A Multi-Turn and Long-Context Dialogue with Role-Based Interactions [62.0123588983514]
Large Language Models (LLMs) have demonstrated wide-ranging applications across various fields.
We reformulate the peer-review process as a multi-turn, long-context dialogue, incorporating distinct roles for authors, reviewers, and decision makers.
We construct a comprehensive dataset containing over 26,841 papers with 92,017 reviews collected from multiple sources.
arXiv Detail & Related papers (2024-06-09T08:24:17Z) - UATVR: Uncertainty-Adaptive Text-Video Retrieval [90.8952122146241]
A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities.
We propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure.
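One generic way to realize "retrieval as distribution matching" (purely illustrative, not UATVR's architecture) is to encode each query and video as a Gaussian and average cosine similarity over reparameterized samples:

```python
# Illustrative sketch of distribution-based matching; shapes and sampling scheme
# are assumptions for exposition only.
import torch

def sample_embeddings(mean: torch.Tensor, log_var: torch.Tensor, k: int) -> torch.Tensor:
    """Draw k reparameterized samples per item: (batch, k, dim)."""
    std = (0.5 * log_var).exp()
    eps = torch.randn(mean.size(0), k, mean.size(-1))
    return mean.unsqueeze(1) + eps * std.unsqueeze(1)

def expected_similarity(text_mean, text_logvar, video_mean, video_logvar, k=7):
    """Cosine similarity averaged over sampled text and video embeddings."""
    t = torch.nn.functional.normalize(sample_embeddings(text_mean, text_logvar, k), dim=-1)
    v = torch.nn.functional.normalize(sample_embeddings(video_mean, video_logvar, k), dim=-1)
    sim = torch.einsum("tkd,vjd->tvkj", t, v)  # pairwise sample similarities
    return sim.mean(dim=(-1, -2))              # (n_texts, n_videos) score matrix
```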
arXiv Detail & Related papers (2023-01-16T08:43:17Z) - Assessing Data Efficiency in Task-Oriented Semantic Parsing [54.87705549021248]
We introduce a four-stage protocol which gives an approximate measure of how much in-domain "target" data a parser requires to achieve a certain quality bar.
We apply our protocol in two real-world case studies illustrating its flexibility and applicability to practitioners in task-oriented semantic parsing.
arXiv Detail & Related papers (2021-07-10T02:43:16Z) - The Multimodal Sentiment Analysis in Car Reviews (MuSe-CaR) Dataset: Collection, Insights and Improvements [14.707930573950787]
We present MuSe-CaR, a first-of-its-kind multimodal dataset.
The data is publicly available as it recently served as the testing bed for the 1st Multimodal Sentiment Analysis Challenge.
arXiv Detail & Related papers (2021-01-15T10:40:37Z)