PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models
- URL: http://arxiv.org/abs/2511.10002v2
- Date: Fri, 14 Nov 2025 07:47:21 GMT
- Title: PustakAI: Curriculum-Aligned and Interactive Textbooks Using Large Language Models
- Authors: Shivam Sharma, Riya Naik, Tejas Gawas, Heramb Patil, Kunal Korgaonkar
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. We present the framework "PustakAI" (Pustak means 'book' in many Indian languages). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating human-like content. This has revolutionized various sectors such as healthcare, software development, and education. In education, LLMs offer potential for personalized and interactive learning experiences, especially in regions with limited teaching resources. However, adapting these models effectively to curriculum-specific content, such as the National Council of Educational Research and Training (NCERT) syllabus in India, presents unique challenges in terms of accuracy, alignment, and pedagogical relevance. In this paper, we present the framework "PustakAI"\footnote{Pustak means `book' in many Indian languages.} for the design and evaluation of a novel question-answering dataset "NCERT-QA" aligned with the NCERT curriculum for English and Science subjects of grades 6 to 8. We classify the curated QA pairs as Factoid, Inferential, and Others (evaluative and reasoning). We evaluate the dataset with various prompting techniques, such as meta-prompt, few-shot, and CoT-style prompting, using diverse evaluation metrics to understand which approach aligns more efficiently with the structure and demands of the curriculum. Along with the usability of the dataset, we analyze the strengths and limitations of current open-source LLMs (Gemma3:1b, Llama3.2:3b, and Nemotron-mini:4b) and high-end LLMs (Llama-4-Scout-17B and Deepseek-r1-70B) as AI-based learning tools in formal education systems.
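The abstract names three prompting styles used to evaluate NCERT-QA: meta-prompt, few-shot, and CoT-style prompting. A minimal sketch of how such templates are typically constructed is below; the prompt wording, the example question, and the syllabus phrasing are illustrative assumptions, not the paper's actual templates.

```python
# Hypothetical templates for the three prompting styles named in the abstract.
# All wording here is illustrative, not taken from the paper.

def meta_prompt(question: str) -> str:
    """Meta-prompt: prefix the question with a role/task description."""
    return (
        "You are a tutor answering strictly from the NCERT Grade 6-8 syllabus.\n"
        f"Question: {question}\nAnswer:"
    )

def few_shot_prompt(question: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend a handful of worked QA pairs before the target question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {question}\nA:"

def cot_prompt(question: str) -> str:
    """CoT-style: ask the model to reason step by step before answering."""
    return f"Q: {question}\nLet's think step by step."

examples = [("What is photosynthesis?",
             "The process by which plants make food using sunlight.")]
print(meta_prompt("Why do we see phases of the Moon?"))
print(few_shot_prompt("Why do we see phases of the Moon?", examples))
print(cot_prompt("Why do we see phases of the Moon?"))
```

In a study like this, each template would be filled with the same curated QA pairs and the model outputs scored with the chosen evaluation metrics, so differences between styles are attributable to the prompt alone.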
Related papers
- EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education [11.130206904690745]
We introduce EduEval, a comprehensive hierarchical benchmark for evaluating large language models (LLMs) in Chinese K-12 education. EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation.
arXiv Detail & Related papers (2025-11-29T03:09:50Z)
- MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams [50.293164501645975]
Multimodal large language models (MLLMs) integrate language and visual cues for problem-solving. Current benchmarks for measuring the intelligence of MLLMs suffer from limited scale, narrow coverage, and unstructured knowledge. We introduce MDK12-Bench, a large-scale multidisciplinary benchmark built from real-world K-12 exams spanning six disciplines.
arXiv Detail & Related papers (2025-08-09T06:21:10Z)
- ELMES: An Automated Framework for Evaluating Large Language Models in Educational Scenarios [23.549720214649476]
Large Language Models (LLMs) present transformative opportunities for education, generating numerous novel application scenarios. Current benchmarks predominantly measure general intelligence rather than pedagogical capabilities. We introduce ELMES, an open-source automated evaluation framework specifically designed for assessing LLMs in educational settings.
arXiv Detail & Related papers (2025-07-27T15:20:19Z)
- MateInfoUB: A Real-World Benchmark for Testing LLMs in Competitive, Multilingual, and Multimodal Educational Tasks [0.0]
This study presents a novel bilingual (English-Romanian) multimodal (text and image) dataset of multiple-choice questions. A particularity of our dataset is that the problems are conceived such that some of them are easier solved using reasoning on paper, while for others writing code is more efficient.
arXiv Detail & Related papers (2025-07-03T20:43:28Z)
- Benchmarking the Pedagogical Knowledge of Large Language Models [4.417539128489408]
This paper introduces The Pedagogy Benchmark, a novel dataset designed to evaluate large language models on their pedagogical knowledge. These benchmarks are built on a carefully curated set of questions sourced from professional development exams for teachers. We report results for 97 models, with accuracies spanning a range from 28% to 89% on the pedagogical knowledge questions.
arXiv Detail & Related papers (2025-06-23T14:49:01Z)
- MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset [0.0]
Language models (LMs) have excelled in various broad domains. They must demonstrate proficiency in specific, granular areas of knowledge. MALAMUTE is the first education-based cloze-style dataset.
arXiv Detail & Related papers (2024-12-13T12:46:33Z)
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z)
- DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z)
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
- A Survey of Knowledge Enhanced Pre-trained Language Models [78.56931125512295]
We present a comprehensive review of Knowledge Enhanced Pre-trained Language Models (KE-PLMs).
For NLU, we divide the types of knowledge into four categories: linguistic knowledge, text knowledge, knowledge graph (KG) and rule knowledge.
The KE-PLMs for NLG are categorized into KG-based and retrieval-based methods.
arXiv Detail & Related papers (2022-11-11T04:29:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.