Related papers: BRIMA: low-overhead BRowser-only IMage Annotation tool (Preprint)

Related papers

Nested Browser-Use Learning for Agentic Information Seeking [60.775556172513014]
Information-seeking (IS) agents have achieved strong performance across a range of wide and deep search tasks, yet their tool use remains largely restricted to API-level snippet retrieval and URL-based page fetching. We propose Nested Browser-Use Learning (NestBrowse), which introduces a minimal and complete browser-action framework that decouples interaction control from page exploration through a nested structure.
arXiv Detail & Related papers (2025-12-29T17:59:14Z)
MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning [55.221850286246]
We introduce MindWatcher, a tool-integrated reasoning agent with interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition.
arXiv Detail & Related papers (2025-12-29T12:16:12Z)
AUTO-Explorer: Automated Data Collection for GUI Agent [58.58097564914626]
We propose an automated data collection method with minimal annotation costs, named Auto-Explorer. It incorporates a simple yet effective exploration mechanism that autonomously parses and explores GUI environments. Using the data gathered, we fine-tune a multimodal large language model (MLLM) and establish a GUI element grounding testing set.
arXiv Detail & Related papers (2025-11-09T15:13:45Z)
TheMCPCompany: Creating General-purpose Agents with Task-specific Tools [12.249551019598442]
We introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We also provide manually annotated ground-truth tools for each task. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments.
arXiv Detail & Related papers (2025-10-22T06:42:01Z)
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent [68.3311163530321]
Web agents such as Deep Research have demonstrated cognitive abilities, capable of solving highly challenging information-seeking problems. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, and knowledge. We introduce WebWatcher, a multi-modal Agent for Deep Research equipped with enhanced visual-language reasoning capabilities.
arXiv Detail & Related papers (2025-08-07T18:03:50Z)
Visual Agentic Reinforcement Fine-Tuning [73.37007472426299]
This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities for Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search.
arXiv Detail & Related papers (2025-05-20T11:59:25Z)
Programming with Pixels: Computer-Use Meets Software Engineering [24.00640679767529]
General-purpose computer-use agents can approach or even surpass specialized tool-based agents on a variety of SWE tasks without the need for hand-engineered tools. Our results establish PwP as a scalable testbed for building and evaluating the next wave of software engineering agents.
arXiv Detail & Related papers (2025-02-24T18:41:33Z)
EdgeMark: An Automation and Benchmarking System for Embedded Artificial Intelligence Tools [0.0]
The integration of artificial intelligence (AI) into embedded devices is transforming industries by enabling intelligent data processing at the edge. This paper provides a review of existing eAI tools, highlighting their features, trade-offs, and limitations. We also introduce EdgeMark, an open-source automation system designed to streamline the benchmarking workflow for deploying and benchmarking machine learning (ML) models on embedded platforms.
arXiv Detail & Related papers (2025-02-03T08:28:01Z)
asanAI: In-Browser, No-Code, Offline-First Machine Learning Toolkit [0.0]
asanAI is an offline-first, open-source, no-code machine learning toolkit designed for users of all skill levels. It allows individuals to design, debug, train, and test ML models directly in a web browser. The toolkit runs on any device with a modern web browser, including smartphones, and ensures user privacy through local computations.
arXiv Detail & Related papers (2025-01-07T12:47:52Z)
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining [67.87810796668981]
With Information-Sensitive Cropping (ISC) and Self-Refining Dual Learning (SRDL), Iris achieves state-of-the-art performance across multiple benchmarks with only 850K GUI annotations. These improvements translate to significant gains in both web and OS agent downstream tasks.
arXiv Detail & Related papers (2024-12-13T18:40:10Z)
Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models. Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions. We propose a novel model-agnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z)
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z)
EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [56.02100384015907]
EasyTool is a framework transforming diverse and lengthy tool documentation into a unified and concise tool instruction. It can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios.
arXiv Detail & Related papers (2024-01-11T15:45:11Z)
Brainchop: Next Generation Web-Based Neuroimaging Application [0.0]
Brainchop is a groundbreaking in-browser tool that enables volumetric analysis of structural MRI using pre-trained full-brain deep learning models. This paper outlines the processing pipeline of Brainchop and evaluates the performance of models across various hardware configurations.
arXiv Detail & Related papers (2023-10-24T20:17:06Z)
Surgical tool classification and localization: results and methods from the MICCAI 2022 SurgToolLoc challenge [69.91670788430162]
We present the results of the SurgToolLoc 2022 challenge. The goal was to leverage tool presence data as weak labels for machine learning models trained to detect tools. We conclude by discussing these results in the broader context of machine learning and surgical data science.
arXiv Detail & Related papers (2023-05-11T21:44:39Z)
Interactive Segmentation and Visualization for Tiny Objects in Multi-megapixel Images [5.09193568605539]
We introduce an interactive image segmentation and visualization framework for identifying, inspecting, and editing tiny objects in large multi-megapixel high-range images. We developed an interactive toolkit that unifies inference model, HDR image visualization, segmentation mask inspection and editing into a single graphical user interface. Our interface features mouse-controlled, synchronized, dual-window visualization of the image and the segmentation mask, a critical feature for locating tiny objects in multi-megapixel images.
arXiv Detail & Related papers (2022-04-21T18:26:48Z)
MONAI Label: A framework for AI-assisted Interactive Labeling of 3D Medical Images [49.664220687980006]
The lack of annotated datasets is a major bottleneck for training new task-specific supervised machine learning models. We present MONAI Label, a free and open-source framework that facilitates the development of applications based on artificial intelligence (AI) models.
arXiv Detail & Related papers (2022-03-23T12:33:11Z)
Flashlight: Enabling Innovation in Tools for Machine Learning [50.63188263773778]
We introduce Flashlight, an open-source library built to spur innovation in machine learning tools and systems. We see Flashlight as a tool enabling research that can benefit widely used libraries downstream and bring machine learning and systems researchers closer together.
arXiv Detail & Related papers (2022-01-29T01:03:29Z)
Label Assistant: A Workflow for Assisted Data Annotation in Image Segmentation Tasks [0.8135412538980286]
We propose a generic workflow to assist the annotation process and discuss methods on an abstract level. Thereby, we review the possibilities of focusing on promising samples, image pre-processing, pre-labeling, label inspection, or post-processing of annotations. In addition, we present an implementation of the proposal by means of a flexible and extendable software prototype nested in a hybrid touchscreen/laptop device.
arXiv Detail & Related papers (2021-11-27T19:08:25Z)
Shuffler: A Large Scale Data Management Tool for ML in Computer Vision [0.0]
We present Shuffler, an open source tool that makes it easy to manage large computer vision datasets. Shuffler defines over 40 data handling operations with annotations that are commonly useful in supervised learning applied to computer vision.
arXiv Detail & Related papers (2021-04-11T22:27:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.