MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications
- URL: http://arxiv.org/abs/2506.19502v2
- Date: Tue, 15 Jul 2025 06:04:25 GMT
- Title: MATE: LLM-Powered Multi-Agent Translation Environment for Accessibility Applications
- Authors: Aleksandr Algazinov, Matt Laing, Paul Laban
- Abstract summary: MATE, a multimodal accessibility MAS, performs modality conversions based on the user's needs. MATE can be applied to a wide range of domains, industries, and areas, such as healthcare. ModCon-Task-Identifier is a model that is capable of extracting the precise modality conversion task from the user input.
- Score: 44.99833362998488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accessibility remains a critical concern in today's society, as many technologies are not developed to support the full range of user needs. Existing multi-agent systems (MAS) often cannot provide comprehensive assistance for users in need due to the lack of customization stemming from closed-source designs. Consequently, individuals with disabilities frequently encounter significant barriers when attempting to interact with digital environments. We introduce MATE, a multimodal accessibility MAS, which performs modality conversions based on the user's needs. The system is useful for assisting people with disabilities by ensuring that data will be converted to an understandable format. For instance, if the user cannot see well and receives an image, the system converts this image to its audio description. MATE can be applied to a wide range of domains, industries, and areas, such as healthcare, and can become a useful assistant for various groups of users. The system supports multiple types of models, ranging from LLM API calling to using custom machine learning (ML) classifiers. This flexibility ensures that the system can be adapted to various needs and is compatible with a wide variety of hardware. Since the system is expected to run locally, it ensures the privacy and security of sensitive information. In addition, the framework can be effectively integrated with institutional technologies (e.g., digital healthcare services) for real-time user assistance. Furthermore, we introduce ModCon-Task-Identifier, a model that is capable of extracting the precise modality conversion task from the user input. Numerous experiments show that ModCon-Task-Identifier consistently outperforms other LLMs and statistical models on our custom data. Our code and data are publicly available at https://github.com/AlgazinovAleksandr/Multi-Agent-MATE.
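To make the described workflow concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of how a MATE-style dispatcher might route a user request to a modality-conversion agent. The names used here (`Request`, `identify_task`, and the agent functions) are illustrative assumptions; in the actual system, task identification would be handled by an LLM API call or by ModCon-Task-Identifier rather than by keyword rules, and each agent would invoke real captioning, speech-to-text, or text-to-speech models.

```python
# Hypothetical sketch of a MATE-style modality-conversion dispatcher.
# All names and rules below are illustrative assumptions, not the paper's code.

from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Request:
    text: str          # the user's natural-language instruction
    payload_path: str  # path to the file to convert (image, audio, ...)


def identify_task(text: str) -> str:
    """Stand-in for a task classifier such as ModCon-Task-Identifier.

    A real system would use an LLM call or a trained classifier; simple
    keyword rules are used here purely for illustration.
    """
    lowered = text.lower()
    if "describe" in lowered or "image" in lowered:
        return "image_to_audio"
    if "transcribe" in lowered or "audio" in lowered:
        return "audio_to_text"
    return "text_to_speech"


def image_to_audio(req: Request) -> str:
    # Placeholder: caption the image, then synthesize speech from the caption.
    return f"[audio description of {req.payload_path}]"


def audio_to_text(req: Request) -> str:
    # Placeholder: run speech-to-text on the audio file.
    return f"[transcript of {req.payload_path}]"


def text_to_speech(req: Request) -> str:
    # Placeholder: synthesize speech from the request text.
    return f"[speech rendering of '{req.text}']"


AGENTS: Dict[str, Callable[[Request], str]] = {
    "image_to_audio": image_to_audio,
    "audio_to_text": audio_to_text,
    "text_to_speech": text_to_speech,
}


def handle(req: Request) -> str:
    # Identify the conversion task, then delegate to the matching agent.
    task = identify_task(req.text)
    return AGENTS[task](req)


if __name__ == "__main__":
    print(handle(Request("Please describe this image for me", "photo.jpg")))
```

Consistent with the abstract, the `identify_task` step could be backed either by an LLM API call or by a locally run custom ML classifier, and each agent would chain the appropriate local models (e.g., image captioning followed by speech synthesis) so that sensitive data never leaves the user's device.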
Related papers
- Learnware of Language Models: Specialized Small Language Models Can Do Big [50.285859986475394]
This paper presents a preliminary attempt to apply the learnware paradigm to language models. We simulated a learnware system comprising approximately 100 learnwares of specialized SLMs with 8B parameters. By selecting one suitable learnware for each task-specific inference, the system outperforms the base SLMs on all benchmarks.
arXiv Detail & Related papers (2025-05-19T17:54:35Z) - Can Multimodal Large Language Models be Guided to Improve Industrial Anomaly Detection? [5.979778557940213]
Traditional industrial anomaly detection models often struggle with flexibility and adaptability. Recent advancements in Multimodal Large Language Models (MLLMs) hold promise for overcoming these limitations. We propose Echo, a novel multi-expert framework designed to enhance MLLM performance for industrial anomaly detection (IAD).
arXiv Detail & Related papers (2025-01-27T05:41:10Z) - MMFactory: A Universal Solution Search Engine for Vision-Language Tasks [35.262080125288115]
We introduce MMFactory, a universal framework that acts like a solution search engine across various available models. Based on a task description and a few sample input-output pairs, MMFactory can suggest a diverse pool of programmatic solutions. MMFactory also proposes metrics and benchmarks performance/resource characteristics, allowing users to pick a solution that meets their unique design constraints.
arXiv Detail & Related papers (2024-12-24T00:59:16Z) - ROMAS: A Role-Based Multi-Agent System for Database monitoring and Planning [11.589862354606476]
We propose ROMAS, a Role-Based Multi-Agent System designed to adapt to various scenarios while enabling low-code development and one-click deployment. ROMAS has been effectively deployed in DB-GPT [Xue et al., 2023a, 2024b], a well-known project utilizing LLM-powered database analytics.
arXiv Detail & Related papers (2024-12-18T05:45:39Z) - A General-Purpose Device for Interaction with LLMs [3.052172365469752]
This paper investigates integrating large language models (LLMs) with advanced hardware.
We focus on developing a general-purpose device designed for enhanced interaction with LLMs.
arXiv Detail & Related papers (2024-08-02T23:43:29Z) - An Interactive Multi-modal Query Answering System with Retrieval-Augmented Large Language Models [21.892975397847316]
We present an interactive Multi-modal Query Answering (MQA) system, empowered by our newly developed multi-modal retrieval framework and navigation graph index.
One notable aspect of MQA is its utilization of contrastive learning to assess the significance of different modalities.
The system achieves efficient retrieval through our advanced navigation graph index, refined using computational pruning techniques.
arXiv Detail & Related papers (2024-07-05T02:01:49Z) - LEGENT: Open Platform for Embodied Agents [60.71847900126832]
We introduce LEGENT, an open, scalable platform for developing embodied agents using Large Language Models (LLMs) and Large Multimodal Models (LMMs).
LEGENT offers a rich, interactive 3D environment with communicable and actionable agents, paired with a user-friendly interface.
In experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks.
arXiv Detail & Related papers (2024-04-28T16:50:12Z) - NExT-GPT: Any-to-Any Multimodal LLM [75.5656492989924]
We present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT.
We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio.
We introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation.
arXiv Detail & Related papers (2023-09-11T15:02:25Z) - Families In Wild Multimedia: A Multimodal Database for Recognizing Kinship [63.27052967981546]
We introduce the first publicly available multi-task MM kinship dataset.
To build FIW MM, we developed machinery to automatically collect, annotate, and prepare the data.
Results highlight edge cases to inspire future research with different areas of improvement.
arXiv Detail & Related papers (2020-07-28T22:36:57Z) - Unsupervised Model Personalization while Preserving Privacy and Scalability: An Open Problem [55.21502268698577]
This work investigates the task of unsupervised model personalization, adapted to continually evolving, unlabeled local user images.
We provide a novel Dual User-Adaptation framework (DUA) to explore the problem.
This framework flexibly disentangles user-adaptation into model personalization on the server and local data regularization on the user device.
arXiv Detail & Related papers (2020-03-30T09:35:12Z)