Related papers: An EcoSage Assistant: Towards Building A Multimodal Plant Care Dialogue Assistant

An EcoSage Assistant: Towards Building A Multimodal Plant Care Dialogue Assistant

URL: http://arxiv.org/abs/2401.06807v1
Date: Wed, 10 Jan 2024 19:06:35 GMT
Title: An EcoSage Assistant: Towards Building A Multimodal Plant Care Dialogue Assistant
Authors: Mohit Tomar, Abhisek Tiwari, Tulika Saha, Prince Jha, Sriparna Saha
Abstract summary: We make the first attempt at building a plant care assistant, which aims to assist people with plant(-ing) concerns through conversations. We propose a plant care conversational dataset named Plantational, which contains around 1K dialogues between users and plant care experts. We first benchmark the dataset with the help of various large language models (LLMs) and visual language model (VLM)
Score: 19.15902264945402
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: In recent times, there has been an increasing awareness about imminent environmental challenges, resulting in people showing a stronger dedication to taking care of the environment and nurturing green life. The current $19.6 billion indoor gardening industry, reflective of this growing sentiment, not only signifies a monetary value but also speaks of a profound human desire to reconnect with the natural world. However, several recent surveys cast a revealing light on the fate of plants within our care, with more than half succumbing primarily due to the silent menace of improper care. Thus, the need for accessible expertise capable of assisting and guiding individuals through the intricacies of plant care has become paramount more than ever. In this work, we make the very first attempt at building a plant care assistant, which aims to assist people with plant(-ing) concerns through conversations. We propose a plant care conversational dataset named Plantational, which contains around 1K dialogues between users and plant care experts. Our end-to-end proposed approach is two-fold : (i) We first benchmark the dataset with the help of various large language models (LLMs) and visual language model (VLM) by studying the impact of instruction tuning (zero-shot and few-shot prompting) and fine-tuning techniques on this task; (ii) finally, we build EcoSage, a multi-modal plant care assisting dialogue generation framework, incorporating an adapter-based modality infusion using a gated mechanism. We performed an extensive examination (both automated and manual evaluation) of the performance exhibited by various LLMs and VLM in the generation of the domain-specific dialogue responses to underscore the respective strengths and weaknesses of these diverse models.

Related papers

What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks.<n>We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding.<n>Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory machanism can not effectively help agents acquire a grounded model of the environment.
arXiv Detail & Related papers (2026-01-14T14:09:11Z)
A Multimodal Conversational Assistant for the Characterization of Agricultural Plots from Geospatial Open Data [0.0]
This study presents an open-source conversational assistant that integrates multimodal retrieval and large language models (LLMs)<n>The proposed architecture combines orthophotos, Sentinel-2 vegetation indices, and user-provided documents through retrieval-augmented generation (RAG)<n>Preliminary results show that the system is capable of generating clear, relevant, and context-aware responses to agricultural queries.
arXiv Detail & Related papers (2025-09-22T09:02:53Z)
Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands.<n>Under this setting, we accomplish InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data.<n>We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z)
MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations [9.649908672930815]
MIRAGE captures the full complexity of expert consultations by combining natural user queries, expert-authored responses, and image-based context.<n>Grounded in over 35,000 real user-expert interactions, MIRAGE spans diverse crop health, pest diagnosis, and crop management scenarios.<n>The benchmark includes more than 7,000 unique biological entities, covering plant species, pests, and diseases.
arXiv Detail & Related papers (2025-06-25T03:07:54Z)
EarthMind: Towards Multi-Granular and Multi-Sensor Earth Observation with Large Multimodal Models [96.18182289276649]
We present EarthMind, a novel vision-language framework for multi-granular and multi-sensor Earth Observation (EO) data understanding.<n>EarthMind features two core components: (1) Spatial Attention Prompting (SAP), which reallocates attention within the LLM to enhance pixel-level understanding; and (2) Cross-modal Fusion, which aligns heterogeneous modalities into a shared space.<n>To facilitate multi-sensor fusion evaluation, we propose EarthMind-Bench, a comprehensive benchmark with over 2,000 human-annotated multi-sensor image-question pairs.
arXiv Detail & Related papers (2025-06-02T13:36:05Z)
Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z)
TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews [54.35097932763878]
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data. Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews. We demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness.
arXiv Detail & Related papers (2025-03-26T15:58:16Z)
Coherence-Driven Multimodal Safety Dialogue with Active Learning for Embodied Agents [23.960719833886984]
M-CoDAL is a multimodal-dialogue system specifically designed for embodied agents to better understand and communicate in safety-critical situations. Our approach is evaluated using a newly created multimodal dataset comprising 1K safety violations extracted from 2K Reddit images. Results with this dataset demonstrate that our approach improves resolution of safety situations, user sentiment, as well as safety of the conversation.
arXiv Detail & Related papers (2024-10-18T03:26:06Z)
Enhancing IoT based Plant Health Monitoring through Advanced Human Plant Interaction using Large Language Models and Mobile Applications [0.0]
This paper presents the development of a novel plant communication application that allows plants to "talk" to humans. The app offers a seamless user experience with real-time interaction capabilities. By fostering human-plant connectivity, this system enhances plant care practices, promotes sustainability, and introduces innovative applications for AI and IoT technologies.
arXiv Detail & Related papers (2024-09-24T09:26:47Z)
Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion. In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation. We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a multimodal transcript''
arXiv Detail & Related papers (2024-09-13T18:28:12Z)
PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis [74.41260927676747]
This paper bridges the gaps by introducing a multimodal conversational Sentiment Analysis (ABSA) To benchmark the tasks, we construct PanoSent, a dataset annotated both manually and automatically, featuring high quality, large scale, multimodality, multilingualism, multi-scenarios, and covering both implicit and explicit sentiment elements. To effectively address the tasks, we devise a novel Chain-of-Sentiment reasoning framework, together with a novel multimodal large language model (namely Sentica) and a paraphrase-based verification mechanism.
arXiv Detail & Related papers (2024-08-18T13:51:01Z)
Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions [68.92637077909693]
This paper investigates the faithfulness of multimodal large language model (MLLM) agents in a graphical user interface (GUI) environment.<n>A general scenario is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated content.<n> Experimental results reveal that even the most powerful models, whether generalist agents or specialist GUI agents, are susceptible to distractions.
arXiv Detail & Related papers (2024-08-05T15:16:22Z)
Challenges in automatic and selective plant-clearing [0.32985979395737786]
We tackle the problem of automatic and selective plant-clearing in a sustainable forestry context. Such an autonomous system requires a high level of robustness to weather conditions, plant variability, terrain and weeds while remaining cheap and easy to maintain. We notably discuss the lack of robustness of spectral imagery, investigate the impact of the reference database's size and discuss issues specific to AI systems operating in uncontrolled environments.
arXiv Detail & Related papers (2024-04-22T09:01:14Z)
Building Emotional Support Chatbots in the Era of LLMs [64.06811786616471]
We introduce an innovative methodology that synthesizes human insights with the computational prowess of Large Language Models (LLMs) By utilizing the in-context learning potential of ChatGPT, we generate an ExTensible Emotional Support dialogue dataset, named ExTES. Following this, we deploy advanced tuning techniques on the LLaMA model, examining the impact of diverse training strategies, ultimately yielding an LLM meticulously optimized for emotional support interactions.
arXiv Detail & Related papers (2023-08-17T10:49:18Z)
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs. We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
MPCHAT: Towards Multimodal Persona-Grounded Conversation [54.800425322314105]
We extend persona-based dialogue to the multimodal domain and make two main contributions. First, we present the first multimodal persona-based dialogue dataset named MPCHAT. Second, we empirically show that incorporating multimodal persona, as measured by three proposed multimodal persona-grounded dialogue tasks, leads to statistically significant performance improvements.
arXiv Detail & Related papers (2023-05-27T06:46:42Z)
Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision [84.31474052176343]
Recent AI-assistant agents, such as ChatGPT, rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback to align the output with human intentions. This dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision. We propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision.
arXiv Detail & Related papers (2023-05-04T17:59:28Z)
Domain Generalization for Crop Segmentation with Standardized Ensemble Knowledge Distillation [42.39035033967183]
Service robots need a real-time perception system that understands their surroundings and identifies their targets in the wild. Existing methods, however, often fall short in generalizing to new crops and environmental conditions. We propose a novel approach to enhance domain generalization using knowledge distillation.
arXiv Detail & Related papers (2023-04-03T14:28:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.