A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation
- URL: http://arxiv.org/abs/2509.02864v1
- Date: Tue, 02 Sep 2025 22:21:55 GMT
- Title: A-SEA3L-QA: A Fully Automated Self-Evolving, Adversarial Workflow for Arabic Long-Context Question-Answer Generation
- Authors: Kesen Wang, Daulet Toibazar, Pedro J. Moreno
- Abstract summary: We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. Our system iteratively refines its own performance without any human intervention. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges.
- Score: 4.208390540058878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an end-to-end, self-evolving adversarial workflow for long-context Question-Answer (QA) Generation in Arabic. By orchestrating multiple specialized LVLMs: a question generator, an evaluator, and a swarm of answer generators, our system iteratively refines its own performance without any human intervention. Starting from raw, multi-page Arabic documents across diverse domains, the question generator produces fine-grained, context-aware queries to be tackled by the answer generator swarm, and the evaluator assesses and feeds back quality metrics. This closed-loop cycle enables continuous learning: low-confidence outputs trigger automated re-generation and model updates, progressively enhancing question difficulty and relevance. Moreover, we set the quality metric as a tunable hyperparameter, enabling question generation at controllable and customizable difficulty levels. We release AraLongBench, a large-scale Arabic benchmark of single- and multi-page challenges spanning hundreds of pages, and demonstrate that our self-evolving workflow substantially outperforms static pipelines, markedly boosting the long-context comprehension capabilities of leading Arabic Large Vision Language Models (LVLMs). Lastly, we also meticulously architect a fully automated agentic workflow for long-context Arabic document collection.
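The closed-loop cycle described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the three stub functions stand in for the specialized LVLMs, and all names, scores, and the difficulty-update rule are assumptions made for demonstration.

```python
import random

def generate_question(document: str, difficulty: float) -> str:
    # Stand-in for the question-generator LVLM (illustrative only).
    return f"Q(difficulty={difficulty:.2f}) about: {document[:30]}"

def answer_swarm(question: str, n_agents: int = 3) -> list[str]:
    # Stand-in for the swarm of answer-generator LVLMs.
    return [f"agent-{i} answer to: {question}" for i in range(n_agents)]

def evaluate(question: str, answers: list[str]) -> float:
    # Stand-in for the evaluator LVLM; returns a quality score in [0, 1].
    return random.random()

def self_evolving_qa(document: str,
                     quality_threshold: float = 0.6,
                     max_rounds: int = 10) -> list[tuple[str, list[str], float]]:
    """Closed-loop QA generation: the quality threshold is the tunable
    hyperparameter; low-scoring questions are discarded and regenerated,
    while accepted rounds ratchet up the requested difficulty."""
    difficulty = 0.1
    accepted = []
    for _ in range(max_rounds):
        question = generate_question(document, difficulty)
        answers = answer_swarm(question)
        score = evaluate(question, answers)
        if score >= quality_threshold:
            accepted.append((question, answers, score))
            difficulty = min(1.0, difficulty + 0.1)  # harder next round
        # otherwise: low confidence, regenerate on the next iteration
    return accepted

pairs = self_evolving_qa("A multi-page Arabic document ...")
```

Raising `quality_threshold` makes the loop stricter, which is how the paper's tunable quality metric translates into controllable question difficulty.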
Related papers
- Beyond Factual QA: Mentorship-Oriented Question Answering over Long-Form Multilingual Content [5.831342304669597]
Question answering systems are evaluated on factual correctness, yet many real-world applications, such as education and career guidance, require mentorship. We introduce MentorQA, the first multilingual dataset and evaluation framework for mentorship-focused question answering from long-form videos. We define mentorship-focused evaluation dimensions that go beyond factual accuracy, capturing clarity, alignment, and learning value.
arXiv Detail & Related papers (2026-01-23T21:08:02Z)
- SPELL: Self-Play Reinforcement Learning for evolving Long-Context Language Models [79.01078135582127]
SPELL enables scalable, label-free optimization for long-context reasoning. We introduce an automated curriculum that gradually increases document length and a reward function that adapts question difficulty to the model's evolving capabilities.
arXiv Detail & Related papers (2025-09-28T13:08:10Z)
- Multi-Agent Interactive Question Generation Framework for Long Document Understanding [5.059854277690664]
We propose a fully automated, multi-agent interactive framework to generate long-context questions efficiently. Our approach efficiently generates high-quality single- and multi-page questions for extensive English and Arabic documents.
arXiv Detail & Related papers (2025-07-27T06:44:53Z)
- Language Models can Self-Lengthen to Generate Long Texts [74.96074422345806]
This paper introduces an innovative iterative training framework called Self-Lengthen.
It leverages only the intrinsic knowledge and skills of Large Language Models without the need for auxiliary data or proprietary models.
Experiments on benchmarks and human evaluations show that Self-Lengthen outperforms existing methods in long-text generation.
arXiv Detail & Related papers (2024-10-31T13:47:10Z)
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
- Automatic Question-Answer Generation for Long-Tail Knowledge [65.11554185687258]
We propose an automatic approach to generate specialized QA datasets for tail entities.
We conduct extensive experiments by employing pretrained LLMs on our newly generated long-tail QA datasets.
arXiv Detail & Related papers (2024-03-03T03:06:31Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z)
- Towards Automatic Generation of Questions from Long Answers [11.198653485869935]
We propose a novel evaluation benchmark to assess the performance of existing AQG systems for long-text answers.
We empirically demonstrate that the performance of existing AQG methods significantly degrades as the length of the answer increases.
Transformer-based methods outperform other existing AQG methods on long answers in terms of automatic as well as human evaluation.
arXiv Detail & Related papers (2020-04-10T16:45:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.