LoXR: Performance Evaluation of Locally Executing LLMs on XR Devices
- URL: http://arxiv.org/abs/2502.15761v1
- Date: Thu, 13 Feb 2025 20:55:48 GMT
- Title: LoXR: Performance Evaluation of Locally Executing LLMs on XR Devices
- Authors: Dawar Khan, Xinyu Liu, Omar Mena, Donggang Jia, Alexandre Kouyoumdjian, Ivan Viola
- Abstract summary: We deploy 17 large language models (LLMs) across four XR devices. We evaluate performance on four key metrics: performance consistency, processing speed, memory usage, and battery consumption.
- Score: 55.33807002543901
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The deployment of large language models (LLMs) on extended reality (XR) devices has great potential to advance the field of human-AI interaction. For direct, on-device model inference, selecting the appropriate model and device for specific tasks remains challenging. In this paper, we deploy 17 LLMs across four XR devices (Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro) and conduct a comprehensive evaluation. We devise an experimental setup and evaluate performance on four key metrics: performance consistency, processing speed, memory usage, and battery consumption. For each of the 68 model-device pairs, we assess performance under varying string lengths, batch sizes, and thread counts, analyzing the trade-offs for real-time XR applications. Finally, we propose a unified evaluation method based on Pareto Optimality theory to select optimal device-model pairs with respect to quality and speed objectives. We believe our findings offer valuable insights to guide future optimization efforts for LLM deployment on XR devices. Our evaluation method can serve as standard groundwork for further research and development in this emerging field. All supplemental materials are available at www.nanovis.org/Loxr.html.
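As a concrete illustration of the selection step, here is a minimal Python sketch of a Pareto front over two objectives (a quality score and a decoding speed): a device-model pair is kept only if no other pair matches or beats it on both objectives and strictly beats it on at least one. The pair names and numbers below are hypothetical placeholders, not measurements from the paper.

```python
# Minimal sketch of Pareto-optimal device-model pair selection over two
# objectives (quality and speed). Illustrative only; the candidate pairs
# and their scores are invented, not taken from the paper's results.

from typing import NamedTuple

class Pair(NamedTuple):
    device: str
    model: str
    quality: float          # higher is better (e.g., a benchmark score)
    tokens_per_sec: float   # higher is better (processing speed)

def pareto_front(pairs: list[Pair]) -> list[Pair]:
    """Keep pairs not dominated by any other pair on both objectives."""
    front = []
    for p in pairs:
        # q dominates p if q is at least as good on both objectives
        # and strictly better on at least one (p never dominates itself).
        dominated = any(
            q.quality >= p.quality and q.tokens_per_sec >= p.tokens_per_sec
            and (q.quality > p.quality or q.tokens_per_sec > p.tokens_per_sec)
            for q in pairs
        )
        if not dominated:
            front.append(p)
    return front

if __name__ == "__main__":
    # Hypothetical candidates for illustration only.
    candidates = [
        Pair("Apple Vision Pro", "Llama-3-8B-Q4", quality=0.82, tokens_per_sec=11.0),
        Pair("Meta Quest 3", "Phi-2-Q4", quality=0.64, tokens_per_sec=14.5),
        Pair("Magic Leap 2", "TinyLlama-1.1B", quality=0.51, tokens_per_sec=9.0),
    ]
    for p in pareto_front(candidates):
        print(p.device, p.model, p.quality, p.tokens_per_sec)
```

Any scalarization (e.g., a weighted sum) would pick a single pair from this front; keeping the whole front preserves the quality-speed trade-off for the target application to resolve.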
Related papers
- Fake Runs, Real Fixes -- Analyzing xPU Performance Through Simulation [4.573673188291683]
We present xPU-Shark, a fine-grained methodology for analyzing ML models at the machine-code level.
xPU-Shark captures traces from production deployments running on accelerators and replays them in a modified microarchitecture simulator.
We optimize a common communication collective by up to 15% and reduce token generation latency by up to 4.1%.
arXiv Detail & Related papers (2025-03-18T23:15:02Z)
- MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos [62.01402470874109]
We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks.
It incorporates long videos averaging over 500 seconds, making it the first benchmark specialized for long-video moment retrieval.
It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios.
We further fine-tune an MLLM-based LVMR retriever on synthetic data; it demonstrates strong performance on our benchmark.
arXiv Detail & Related papers (2025-02-18T05:50:23Z)
- InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model [80.93387166769679]
We present IXC-2.5-Reward, a simple yet effective multi-modal reward model that aligns Large Vision Language Models with human preferences.
IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks.
arXiv Detail & Related papers (2025-01-21T18:47:32Z)
- DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding [35.522774800394664]
We introduce DINO-X, a unified object-centric vision model developed by IDEA Research.
DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding.
We develop a universal object prompt to support prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt.
arXiv Detail & Related papers (2024-11-21T17:42:20Z)
- How to Evaluate Reward Models for RLHF [51.31240621943791]
We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF (Reinforcement Learning from Human Feedback).
We build a predictive model of downstream LLM performance by evaluating the reward model on proxy tasks.
We run an end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to obtain real reward-model downstream performance as ground truth.
arXiv Detail & Related papers (2024-10-18T21:38:21Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation [19.312330150540912]
An emerging application is using Large Language Models (LLMs) to enhance retrieval-augmented generation (RAG) capabilities.
We propose FRAMES, a high-quality evaluation dataset designed to test LLMs' ability to provide factual responses.
We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, achieving 0.40 accuracy with no retrieval.
arXiv Detail & Related papers (2024-09-19T17:52:07Z)
- Vidur: A Large-Scale Simulation Framework For LLM Inference [9.854130239429487]
Vidur is a large-scale, high-fidelity simulation framework for LLM inference performance.
We present VidurSearch, a configuration search tool that helps optimize LLM deployment.
arXiv Detail & Related papers (2024-05-08T23:42:13Z)
- Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation [51.99752147380505]
This paper presents a benchmark self-evolving framework to dynamically evaluate Large Language Models (LLMs).
We utilize a multi-agent system to manipulate the context or question of original instances, reframing them into new evolving instances with high confidence.
Our framework widens performance discrepancies both between different models and within the same model across various tasks.
arXiv Detail & Related papers (2024-02-18T03:40:06Z)
- MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing.
As their applications expand into multi-agent environments, a comprehensive evaluation framework is needed.
This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
- SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension [27.53415400454066]
We introduce a benchmark named SEED-Bench to assess generative models.
SEED-Bench consists of 19K multiple-choice questions with accurate human annotations.
We evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding.
arXiv Detail & Related papers (2023-07-30T04:25:16Z)