Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
- URL: http://arxiv.org/abs/2408.14438v4
- Date: Fri, 03 Jan 2025 03:03:32 GMT
- Title: Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study
- Authors: Liuchang Xu, Shuo Zhao, Qingming Lin, Luyao Chen, Qianqian Luo, Sensen Wu, Xinyue Ye, Hailin Feng, Zhenhong Du
- Abstract summary: This study introduces a new multi-task spatial evaluation dataset designed to explore and compare the performance of several advanced models on spatial tasks.
The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers.
- Score: 4.80612909282198
- Abstract: The emergence of large language models such as ChatGPT, Gemini, and others highlights the importance of evaluating their diverse capabilities, ranging from natural language understanding to code generation. However, their performance on spatial tasks has not been thoroughly assessed. This study addresses this gap by introducing a new multi-task spatial evaluation dataset designed to systematically explore and compare the performance of several advanced models on spatial tasks. The dataset includes twelve distinct task types, such as spatial understanding and simple route planning, each with verified and accurate answers. We evaluated multiple models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach. First, we conducted zero-shot testing. Then, we categorized the dataset by difficulty and performed prompt-tuning tests. Results show that gpt-4o achieved the highest overall accuracy in the first phase, with an average of 71.3%. Although moonshot-v1-8k slightly underperformed overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on model performance in specific tasks. For instance, the Chain-of-Thought (CoT) strategy increased gpt-4o's accuracy in simple route planning from 12.4% to 87.5%, while a one-shot strategy improved moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
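The abstract does not include the evaluation harness itself, but the two-phase setup it describes (zero-shot testing first, then prompt strategies such as Chain-of-Thought) is straightforward to reproduce in outline. The sketch below assumes the OpenAI Python SDK, a hypothetical spatial_tasks.json file with question/answer fields, and exact-match scoring; the prompt wording and the scoring rule are illustrative, not the authors' actual protocol.

```python
# Minimal sketch: compare zero-shot vs. Chain-of-Thought accuracy on a
# spatial-task dataset. The dataset format, prompts, and exact-match
# scoring are assumptions, not the paper's published harness.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ZERO_SHOT = "Answer with the final answer only.\n\n{question}"
COT = ("Think step by step about the spatial relationships involved, "
       "then give the final answer on the last line.\n\n{question}")

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def accuracy(model: str, template: str, tasks: list[dict]) -> float:
    # Exact match on the last line of the reply; a real harness would
    # need task-specific answer extraction and normalization.
    hits = 0
    for task in tasks:
        reply = ask(model, template.format(question=task["question"]))
        hits += reply.splitlines()[-1].strip() == task["answer"]
    return hits / len(tasks)

if __name__ == "__main__":
    tasks = json.load(open("spatial_tasks.json"))  # hypothetical file
    for name, template in [("zero-shot", ZERO_SHOT), ("CoT", COT)]:
        print(name, accuracy("gpt-4o", template, tasks))
```

Swapping the model string or the prompt template gives the kind of per-strategy comparison reported above (e.g., gpt-4o's jump from 12.4% to 87.5% on simple route planning under CoT).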
Related papers
- S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation.
S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z)
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
Multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks.
Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales.
We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z)
- LLM4DS: Evaluating Large Language Models for Data Science Code Generation [0.0]
This paper empirically assesses the performance of four leading AI assistants: Microsoft Copilot (GPT-4 Turbo), ChatGPT (o1-preview), Claude (3.5 Sonnet), and Perplexity Labs (Llama-3.1-70b-instruct).
All models exceeded a 50% success rate, confirming their capability beyond random chance (a sketch of such a significance test appears after this list).
ChatGPT demonstrated consistent performance across varying difficulty levels, while Claude's success rate fluctuated with task complexity.
arXiv Detail & Related papers (2024-11-16T18:43:26Z)
- Evaluating the Performance of Large Language Models for SDG Mapping (Technical Report) [6.789534723913505]
Large language models (LLMs) enable users to protect data privacy by eliminating the need to provide data to third parties.
We compare the performance of various language models on the Sustainable Development Goal mapping task.
According to the results of this study, LLaMA 2 and Gemma still have significant room for improvement.
arXiv Detail & Related papers (2024-08-05T03:05:02Z)
- An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question if the extremely simple lightweight ViTs' fine-tuning performance can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) can achieve 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
- Applying Large Language Models and Chain-of-Thought for Automatic Scoring [23.076596289069506]
This study investigates the application of large language models (LLMs) in the automatic scoring of student-written responses to science assessments.
We focused on overcoming the challenges of accessibility, technical complexity, and lack of explainability that have previously limited the use of artificial intelligence-based automatic scoring tools.
arXiv Detail & Related papers (2023-11-30T21:22:43Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address these limitations by empowering reward models (RMs) with access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) in average performance by large margins in both few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- Lifelong Learning Without a Task Oracle [13.331659934508764]
Supervised deep neural networks are known to undergo a sharp decline in the accuracy of older tasks when new tasks are learned.
We propose and compare several candidate task-assigning mappers which require very little memory overhead.
The best-performing variants impose an average cost of only a 1.7% increase in parameter memory.
arXiv Detail & Related papers (2020-11-09T21:30:31Z)
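An aside on the LLM4DS entry above: a claim that models perform "beyond random chance" on binary-outcome tasks is typically backed by a one-sided binomial test against a 50% baseline. A minimal sketch, with made-up counts rather than the paper's data:

```python
# One-sided binomial test: is the observed success rate significantly
# above the 50% chance baseline? Counts are illustrative only.
from scipy.stats import binomtest

successes, trials = 62, 100  # hypothetical: 62 of 100 tasks solved
result = binomtest(successes, trials, p=0.5, alternative="greater")
print(f"success rate = {successes / trials:.2f}, p-value = {result.pvalue:.4f}")
```

A p-value below the usual 0.05 threshold would support the "beyond chance" reading.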
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.