JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory
- URL: http://arxiv.org/abs/2509.22888v1
- Date: Fri, 26 Sep 2025 20:11:47 GMT
- Title: JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory
- Authors: Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang,
- Abstract summary: We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space.<n>For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings.
- Score: 4.680675835708409
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.
Related papers
- Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving [50.05273675575345]
PlaneThought Problem Solving (PGPS) aims to solve a plane geometric problem based on a geometric diagram and problem textual descriptions.<n>Large Language Models (LLMs) possess strong reasoning skills, their direct application to PGPS is hindered by their inability to process visual diagrams.<n>We train a MLLM Interpreter to generate geometric descriptions for the visual diagram, and an off-the-shelf LLM is utilized to perform reasoning.
arXiv Detail & Related papers (2026-01-29T02:03:33Z) - TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game.<n>We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications.<n>We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z) - LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation [110.610512800947]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge.<n>Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage.
arXiv Detail & Related papers (2025-10-13T12:57:45Z) - Do Large Language Models Truly Understand Geometric Structures? [15.915781154075615]
We introduce the GeomRel dataset to evaluate large language models' understanding of geometric structures.<n>We propose the Geometry Chain-of-Thought (GeoCoT) method, which enhances LLMs' ability to identify geometric relationships.
arXiv Detail & Related papers (2025-01-23T15:52:34Z) - From Distributional to Overton Pluralism: Investigating Large Language Model Alignment [82.99849359892112]
We re-examine previously reported reductions in response diversity post-alignment.<n>Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation.<n>Findings indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior.
arXiv Detail & Related papers (2024-06-25T16:32:33Z) - ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models [46.07900122810749]
Large language models (LLMs) have achieved unprecedented performances in various applications, yet evaluating them is still challenging.
We contend that utilizing existing relational databases is a promising approach for constructing benchmarks.
We propose ERBench, which uses these integrity constraints to convert any database into an LLM benchmark.
arXiv Detail & Related papers (2024-03-08T12:42:36Z) - Beyond Lines and Circles: Unveiling the Geometric Reasoning Gap in Large
Language Models [28.819559978685806]
Large Language Models (LLMs) demonstrate ever-increasing abilities in mathematical and algorithmic tasks, yet their geometric reasoning skills are underexplored.
We investigate LLMs' abilities in constructive geometric problem-solving one of the most fundamental steps in the development of human mathematical reasoning.
Our work reveals notable challenges that the state-of-the-art LLMs face in this domain despite many successes in similar areas.
arXiv Detail & Related papers (2024-02-06T10:37:21Z) - Characterizing Large Language Model Geometry Helps Solve Toxicity Detection and Generation [15.77263269398368]
Large Language Models (LLMs) drive current AI breakthroughs.
We shed the light on LLMs inner mechanisms through the lens of geometry.
We derive interpretable geometrical features that can be extracted from any (pre-trained) LLM.
arXiv Detail & Related papers (2023-12-04T06:01:32Z) - The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context
Learning [61.68787689234622]
A recent study, LIMA, shows that using merely 1K examples for alignment tuning can achieve significant alignment performance as well.
This raises questions about how exactly the alignment tuning transforms a base LLM.
We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting.
arXiv Detail & Related papers (2023-12-04T00:46:11Z) - Search-in-the-Chain: Interactively Enhancing Large Language Models with
Search for Knowledge-intensive Tasks [121.74957524305283]
This paper proposes a novel framework named textbfSearch-in-the-Chain (SearChain) for the interaction between Information Retrieval (IR) and Large Language Model (LLM)
Experiments show that SearChain outperforms state-of-the-art baselines on complex knowledge-intensive tasks.
arXiv Detail & Related papers (2023-04-28T10:15:25Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.