Beyond Factual QA: Mentorship-Oriented Question Answering over Long-Form Multilingual Content
- URL: http://arxiv.org/abs/2601.17173v1
- Date: Fri, 23 Jan 2026 21:08:02 GMT
- Title: Beyond Factual QA: Mentorship-Oriented Question Answering over Long-Form Multilingual Content
- Authors: Parth Bhalerao, Diola Dsouza, Ruiwen Guan, Oana Ignat
- Abstract summary: Question answering systems are evaluated on factual correctness, yet many real-world applications, such as education and career guidance, require mentorship. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering from long-form videos. We define mentorship-focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value.
- Score: 5.831342304669597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Question answering systems are typically evaluated on factual correctness, yet many real-world applications, such as education and career guidance, require mentorship: responses that provide reflection and guidance. Existing QA benchmarks rarely capture this distinction, particularly in multilingual and long-form settings. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering from long-form videos, comprising nearly 9,000 QA pairs from 180 hours of content across four languages. We define mentorship-focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value. Using MentorQA, we compare Single-Agent, Dual-Agent, RAG, and Multi-Agent QA architectures under controlled conditions. Multi-Agent pipelines consistently produce higher-quality mentorship responses, with especially strong gains for complex topics and lower-resource languages. We further analyze the reliability of automated LLM-based evaluation, observing substantial variation in alignment with human judgments. Overall, this work establishes mentorship-focused QA as a distinct research problem and provides a multilingual benchmark for studying agentic architectures and evaluation design in educational AI. The dataset and evaluation framework are released at https://github.com/AIM-SCU/MentorQA.
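The abstract describes a comparison of agentic QA architectures and rubric-style LLM judging, but no implementation details appear on this page. As a rough illustration only, a minimal Python sketch of a two-agent mentorship pipeline with an LLM judge over the paper's three dimensions (clarity, alignment, learning value) might look like this; every prompt, agent role, and the `LLM` client type below are our assumptions, not the authors' code:

```python
from typing import Callable

LLM = Callable[[str], str]  # any text-in/text-out chat-model client

def mentor_answer(question: str, transcript: str, llm: LLM) -> str:
    # Agent 1 plans, Agent 2 answers in a mentorship register
    # (guidance plus reflection), mirroring a Dual-Agent setup.
    plan = llm("List the sub-issues a mentor should address.\n"
               f"Question: {question}\nTranscript: {transcript}")
    return llm("Answer as a mentor: give actionable guidance and one "
               f"reflective prompt.\nQuestion: {question}\nPlan: {plan}")

def judge(question: str, answer: str, llm: LLM) -> str:
    # LLM judge over the dimensions named in the abstract; the 1-5
    # rubric wording here is ours, not the paper's.
    return llm("Rate the answer 1-5 on clarity, alignment, and "
               "learning value. Return three integers.\n"
               f"Question: {question}\nAnswer: {answer}")

if __name__ == "__main__":
    stub: LLM = lambda p: f"(model reply to: {p[:40]}...)"
    q = "How do I move from QA testing into ML engineering?"
    print(judge(q, mentor_answer(q, "...transcript excerpt...", stub), stub))
```

Swapping `stub` for a real chat-completion client is enough to run the loop end to end.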
Related papers
- EduAgentQG: A Multi-Agent Workflow Framework for Personalized Question Generation [56.43882334582494]
We propose EduAgentQG, a multi-agent collaborative framework for generating high-quality and diverse personalized questions.
The framework consists of five specialized agents and operates through an iterative feedback loop.
EduAgentQG outperforms existing single-agent and multi-agent methods in terms of question diversity, goal consistency, and overall quality.
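The loop described here (specialized agents plus iterative feedback) is easy to picture in code. Below is a hedged two-agent reduction in Python; the agent count, prompts, and stopping rule are our guesses for illustration, not EduAgentQG's five-agent design:

```python
from typing import Callable

LLM = Callable[[str], str]

def generate_question(goal: str, feedback: str, llm: LLM) -> str:
    # Generator agent: drafts a question aimed at a learning goal,
    # revising against the latest feedback (empty on the first pass).
    return llm(f"Write one exam question for the goal: {goal}\n"
               f"Reviewer feedback to address: {feedback or 'none'}")

def review_question(goal: str, question: str, llm: LLM) -> str:
    # Reviewer agent: replies 'OK' or lists concrete revision requests.
    return llm(f"Goal: {goal}\nQuestion: {question}\n"
               "Reply 'OK' if the question matches the goal and is "
               "well-posed; otherwise list fixes.")

def feedback_loop(goal: str, llm: LLM, max_rounds: int = 3) -> str:
    question, feedback = "", ""
    for _ in range(max_rounds):
        question = generate_question(goal, feedback, llm)
        feedback = review_question(goal, question, llm)
        if feedback.strip().upper().startswith("OK"):
            break
    return question
```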
arXiv Detail & Related papers (2025-11-08T12:25:31Z)
- AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment [69.06977852423564]
Image quality assessment (IQA) reflects both the quantification and interpretation of perceptual quality rooted in the human visual system.
AgenticIQA decomposes IQA into four subtasks: distortion detection, distortion analysis, tool selection, and tool execution.
To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents.
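A skeletal version of that four-subtask decomposition could be wired up as below; the tool registry, prompts, and fallback are invented placeholders, since the actual AgenticIQA tools are not described on this page:

```python
from typing import Callable, Dict

LLM = Callable[[str], str]

# Hypothetical IQA tool registry; stand-in scorers, not real metrics.
TOOLS: Dict[str, Callable[[str], float]] = {
    "blur_metric": lambda image_path: 0.42,
    "noise_metric": lambda image_path: 0.17,
}

def assess(image_path: str, llm: LLM) -> float:
    detected = llm(f"Which distortions are visible in {image_path}? "
                   "Answer with short labels.")                 # 1. detection
    analysis = llm(f"How severe is each distortion: {detected}?")  # 2. analysis
    tool = llm(f"Given: {analysis}. Pick one tool from {list(TOOLS)}. "
               "Answer with the tool name only.").strip()       # 3. selection
    scorer = TOOLS.get(tool, TOOLS["blur_metric"])              # safe fallback
    return scorer(image_path)                                   # 4. execution
```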
arXiv Detail & Related papers (2025-09-30T09:37:01Z)
- Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment [76.77693558769934]
We introduce a new image quality assessment (IQA) task paradigm, **grounding-IQA**.
This paradigm integrates multimodal referring and grounding with IQA to realize more fine-grained quality perception.
We develop a well-designed benchmark, GIQA-Bench, which evaluates grounding-IQA performance from three perspectives: description quality, VQA accuracy, and grounding precision.
arXiv Detail & Related papers (2024-11-26T09:03:16Z)
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
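Our reading of this setup: generate a long text, then grade it by how many pre-annotated proxy questions an evaluator can answer from that text alone. A minimal sketch under that reading follows; the answer-matching rule here is a naive stand-in:

```python
from typing import Callable, List, Tuple

LLM = Callable[[str], str]

def proxyqa_score(generated_text: str,
                  proxy_qas: List[Tuple[str, str]],
                  evaluator: LLM) -> float:
    """Fraction of (proxy question, gold answer) pairs the evaluator
    answers correctly using only the generated text."""
    correct = 0
    for question, gold in proxy_qas:
        pred = evaluator(
            "Answer using ONLY the passage below; say 'unknown' if the "
            "passage does not contain the answer.\n"
            f"Passage: {generated_text}\nQuestion: {question}")
        correct += int(gold.lower() in pred.lower())  # naive string match
    return correct / max(len(proxy_qas), 1)
```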
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Improving Automatic VQA Evaluation Using Large Language Models [6.468405905503242]
We propose to leverage the in-context learning capabilities of instruction-tuned large language models to build a better VQA metric.
We demonstrate the proposed metric better correlates with human judgment compared to existing metrics across several VQA models and benchmarks.
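In spirit, the metric asks an instruction-tuned LLM, primed with a few scored examples, to grade a candidate answer against a reference. A toy version might look like this; the few-shot prompt and 0-100 scale are illustrative, not the paper's actual prompt:

```python
from typing import Callable

LLM = Callable[[str], str]

# Illustrative in-context examples, not the paper's prompt.
FEW_SHOT = """Rate how well the candidate answers the question, 0-100.
Q: What color is the bus? Gold: red. Candidate: it is red. Score: 100
Q: How many dogs? Gold: two. Candidate: three. Score: 0
"""

def vqa_llm_metric(question: str, gold: str, candidate: str, llm: LLM) -> int:
    reply = llm(FEW_SHOT +
                f"Q: {question} Gold: {gold}. Candidate: {candidate}. Score:")
    digits = "".join(ch for ch in reply if ch.isdigit())
    return min(int(digits or 0), 100)  # clamp malformed replies
```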
arXiv Detail & Related papers (2023-10-04T03:59:57Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
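One plausible reading of a multi-reference metric of this kind: reward similarity to the positive references and penalize similarity to the negative ones. The sketch below uses token overlap as a stand-in similarity; SQuArE's actual scoring function may differ:

```python
from typing import List

def _sim(a: str, b: str) -> float:
    # Jaccard token overlap: a stand-in for whatever similarity
    # SQuArE actually uses.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def square_like_score(candidate: str,
                      positives: List[str],
                      negatives: List[str]) -> float:
    # Direction taken from the title (positive vs. negative
    # references); the combination rule is our assumption.
    pos = max((_sim(candidate, p) for p in positives), default=0.0)
    neg = max((_sim(candidate, n) for n in negatives), default=0.0)
    return pos - neg
```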
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Learning to Answer Multilingual and Code-Mixed Questions [4.290420179006601]
Question-answering (QA) that comes naturally to humans is a critical component in seamless human-computer interaction.
Despite being one of the oldest research areas, current QA systems still face the critical challenge of handling multilingual queries.
This dissertation focuses on advancing QA techniques for handling end-user queries in multilingual environments.
arXiv Detail & Related papers (2022-11-14T16:49:58Z)
- ProQA: Structural Prompt-based Pre-training for Unified Question Answering [84.59636806421204]
ProQA is a unified QA paradigm that solves various tasks through a single model.
It concurrently models the knowledge generalization for all QA tasks while keeping the knowledge customization for every specific QA task.
ProQA consistently boosts performance across full-data fine-tuning, few-shot learning, and zero-shot testing scenarios.
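A structural prompt in this spirit keeps one fixed slot layout across QA tasks, so shared slots carry generalizable knowledge while a task tag carries the per-task customization. The slot names below are illustrative, not ProQA's published format:

```python
def structural_prompt(task: str, question: str, context: str = "",
                      options: str = "") -> str:
    # One template across QA tasks: shared slots for cross-task
    # generalization, a task tag for task-specific customization.
    return (f"[Task] {task}\n"
            f"[Question] {question}\n"
            f"[Context] {context or 'none'}\n"
            f"[Options] {options or 'none'}\n"
            "[Answer]")

print(structural_prompt("extractive_qa", "Who wrote Hamlet?",
                        context="Hamlet is a tragedy by William Shakespeare."))
```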
arXiv Detail & Related papers (2022-05-09T04:59:26Z)
- Towards Automatic Generation of Questions from Long Answers [11.198653485869935]
We propose a novel evaluation benchmark to assess the performance of existing automatic question generation (AQG) systems for long-text answers.
We empirically demonstrate that the performance of existing AQG methods significantly degrades as the length of the answer increases.
Transformer-based methods outperform other existing AQG methods on long answers in terms of automatic as well as human evaluation.
arXiv Detail & Related papers (2020-04-10T16:45:08Z)