BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
- URL: http://arxiv.org/abs/2504.19314v2
- Date: Thu, 01 May 2025 05:02:57 GMT
- Title: BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
- Authors: Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua
- Abstract summary: BrowseComp-ZH is a benchmark for evaluating large language models (LLMs) on the Chinese web. It consists of 289 multi-hop questions spanning 11 diverse domains. Despite strong conversational and retrieval capabilities, most models struggle severely.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
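To make the evaluation protocol concrete, below is a minimal sketch of how a benchmark built from short, exactly verifiable answers is typically scored: load question/answer pairs, query a browsing agent, and compute exact-match accuracy. The file name, JSONL schema, and `ask_agent` hook are illustrative assumptions, not the released format; see the repository above for the official data and guidelines.

```python
import json

def ask_agent(question: str) -> str:
    """Hypothetical hook for a browsing-capable LLM agent.

    Assumed to return the agent's final short answer (a date,
    number, or proper noun) as a string.
    """
    raise NotImplementedError

def normalize(text: str) -> str:
    # Short, objective answers make exact matching feasible after
    # light normalization (whitespace and case).
    return " ".join(text.split()).lower()

def evaluate(path: str) -> float:
    # Assumed schema: one JSON object per line with "question" and "answer".
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            prediction = ask_agent(item["question"])
            correct += normalize(prediction) == normalize(item["answer"])
            total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(f"Accuracy: {evaluate('browsecomp_zh.jsonl'):.1%}")
```

Under such a protocol, the reported figures (most systems below 10% accuracy, DeepResearch at 42.9%) correspond directly to the exact-match accuracy computed above.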
Related papers
- Bactrainus: Optimizing Large Language Models for Multi-hop Complex Question Answering Tasks [5.439505575097552]
We evaluate the ability of large language models to perform domain-specific tasks using the HotpotQA dataset. This task serves as a challenging benchmark for assessing the language comprehension capabilities of these models. The results of the study show that integrating large language models with these techniques can yield up to a 4% improvement in F1 score for finding answers.
arXiv Detail & Related papers (2025-01-10T18:44:06Z)
- Level-Navi Agent: A Framework and benchmark for Chinese Web Search Agents [9.003325286793288]
Large language models (LLMs), adopted to understand human language, drive the development of artificial intelligence (AI) web search agents. We propose Level-Navi Agent, a general-purpose and training-free web search agent based on level-aware navigation, accompanied by a well-annotated dataset (Web24) and a suitable evaluation metric.
arXiv Detail & Related papers (2024-12-20T08:03:12Z)
- Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024 [1.3923460621808879]
We show that the reasoning power of large language models (LLMs) and the retrieval power of modern search engines can be combined to automate the fact-checking process.
We integrate LLMs and search under a multi-hop evidence pursuit strategy.
Our submitted system achieves a 0.510 AVeriTeC score on the dev set and a 0.477 AVeriTeC score on the test set.
arXiv Detail & Related papers (2024-11-08T18:25:06Z)
- INQUIRE: A Natural World Text-to-Image Retrieval Benchmark [51.823709631153946]
We introduce INQUIRE, a text-to-image retrieval benchmark designed to challenge multimodal vision-language models on expert-level queries.
INQUIRE includes iNaturalist 2024 (iNat24), a new dataset of five million natural world images, along with 250 expert-level retrieval queries.
Our benchmark evaluates two core retrieval tasks: (1) INQUIRE-Fullrank, a full dataset ranking task, and (2) INQUIRE-Rerank, a reranking task for refining top-100 retrievals.
arXiv Detail & Related papers (2024-11-04T19:16:53Z)
- MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z)
- What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.
Existing methods typically use the Self-Instruct framework to generate instruction-tuning data that improves long-context capability.
We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.
Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data.
arXiv Detail & Related papers (2024-09-03T13:30:00Z)
- BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval [54.54576644403115]
We introduce BRIGHT, the first text retrieval benchmark that requires intensive reasoning to retrieve relevant documents.
Our dataset consists of 1,384 real-world queries spanning diverse domains, such as economics, psychology, mathematics, and coding.
We show that incorporating explicit reasoning about the query improves retrieval performance by up to 12.2 points.
arXiv Detail & Related papers (2024-07-16T17:58:27Z)
- Tree Search for Language Model Agents [69.43007235771383]
We propose an inference-time search algorithm for LM agents to perform exploration and multi-step planning in interactive web environments.
Our approach is a form of best-first tree search that operates within the actual environment space.
It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks.
arXiv Detail & Related papers (2024-07-01T17:07:55Z)
- Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning [67.26776442697184]
We introduce Husky, a holistic, open-source language agent that learns to reason over a unified action space.
Husky iterates between two stages: 1) generating the next action to take towards solving a given task and 2) executing the action using expert models (see the sketch after this list).
Our experiments show that Husky outperforms prior language agents across 14 evaluation datasets.
arXiv Detail & Related papers (2024-06-10T17:07:25Z)
- LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages [8.754506364968394]
LingOly is a novel benchmark for advanced reasoning abilities in large language models.
We evaluate capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages.
We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation.
arXiv Detail & Related papers (2024-06-10T11:50:29Z)
- LMentry: A Language Model Benchmark of Elementary Language Tasks [39.71352171304755]
LMentry is a benchmark that focuses on a compact set of tasks that are trivial to humans.
It provides insights into the capabilities and robustness of large language models.
arXiv Detail & Related papers (2022-11-03T18:01:12Z)
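The Husky entry above describes an explicit two-stage loop, so a minimal sketch of that generate-then-execute pattern may help. The expert-tool names, routing, and `generate_action` signature are assumptions for illustration, not the paper's released code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical expert models keyed by action type; a real agent would wrap
# a search API, a code interpreter, a calculator, and so on.
EXPERTS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"<results for {query!r}>",  # stubbed tool call
}

def generate_action(task: str, history: List[Tuple[str, str, str]]) -> Tuple[str, str]:
    """Stage 1: an action-generator LLM proposes the next (action, input) pair.

    Stubbed here; assumed to emit ("finish", final_answer) once the task
    is solved.
    """
    raise NotImplementedError

def solve(task: str, max_steps: int = 10) -> str:
    history: List[Tuple[str, str, str]] = []
    for _ in range(max_steps):
        action, arg = generate_action(task, history)  # stage 1: plan next action
        if action == "finish":
            return arg
        observation = EXPERTS[action](arg)            # stage 2: expert executes
        history.append((action, arg, observation))    # feed back for the next step
    return "no answer within step budget"
```

The key design choice this sketch reflects is the separation of planning (one LLM) from execution (specialized expert models), with each observation appended to the history that conditions the next planning step.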