MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf
- URL: http://arxiv.org/abs/2502.04376v1
- Date: Wed, 05 Feb 2025 16:25:43 GMT
- Title: MEETING DELEGATE: Benchmarking LLMs on Attending Meetings on Our Behalf
- Authors: Lingxiang Hu, Shurun Yuan, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang
- Abstract summary: Large Language Models (LLMs) have demonstrated their strong capabilities in natural language generation and reasoning.
We develop a prototype LLM-powered meeting delegate system and create a benchmark using real meeting transcripts.
GPT-4/4o maintain balanced performance between active and cautious engagement strategies.
Gemini 1.5 Pro tends to be more cautious, while Gemini 1.5 Flash and Llama3-8B/70B display more active tendencies.
- Abstract: In contemporary workplaces, meetings are essential for exchanging ideas and ensuring team alignment, but they often face challenges such as time consumption, scheduling conflicts, and inefficient participation. Recent advancements in Large Language Models (LLMs) have demonstrated strong capabilities in natural language generation and reasoning, prompting the question: can LLMs effectively attend meetings on participants' behalf? To explore this, we develop a prototype LLM-powered meeting delegate system and create a comprehensive benchmark using real meeting transcripts. Our evaluation reveals that GPT-4/4o maintain balanced performance between active and cautious engagement strategies. In contrast, Gemini 1.5 Pro tends to be more cautious, while Gemini 1.5 Flash and Llama3-8B/70B display more active tendencies. Overall, about 60% of responses address at least one key point from the ground truth. However, improvements are needed to reduce irrelevant or repetitive content and to enhance tolerance for the transcription errors commonly found in real-world settings. Additionally, we implement the system in practical settings and collect real-world feedback from demos. Our findings underscore the potential and challenges of using LLMs as meeting delegates, offering valuable insights into their practical application for alleviating the burden of meetings.
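As a rough illustration of the coverage figure quoted above ("about 60% of responses address at least one key point from the ground truth"), the sketch below shows one way such a metric could be computed. It is an assumption, not the paper's actual scoring procedure: the `covers` matcher, the function names, and the data layout are all hypothetical placeholders.

```python
from typing import List

def covers(response: str, key_point: str) -> bool:
    """Naive substring check standing in for a real matcher.
    The paper's actual matching method (e.g. LLM- or embedding-based
    judgment) is not described here; this is a placeholder assumption."""
    return key_point.lower() in response.lower()

def key_point_coverage(responses: List[str],
                       key_points_per_turn: List[List[str]]) -> float:
    """Fraction of delegate responses that address at least one
    ground-truth key point for the corresponding meeting turn."""
    if not responses:
        return 0.0
    hits = 0
    for response, key_points in zip(responses, key_points_per_turn):
        if any(covers(response, kp) for kp in key_points):
            hits += 1
    return hits / len(responses)

# Hypothetical usage: a value near 0.6 would correspond to the roughly
# 60% coverage reported in the abstract.
```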
Related papers
- MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation [52.35744453954844]
This paper introduces MMRC, a benchmark for evaluating six core open-ended abilities of MLLMs.
Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions.
We propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses.
arXiv Detail & Related papers (2025-02-17T15:24:49Z) - Do Large Language Models with Reasoning and Acting Meet the Needs of Task-Oriented Dialogue? [10.464799846640625]
We apply the ReAct strategy to guide large language models (LLMs) in performing task-oriented dialogue (TOD).
While ReAct-LLMs appear to underperform state-of-the-art approaches in simulation, human evaluation indicates a higher user satisfaction rate compared to handcrafted systems.
arXiv Detail & Related papers (2024-12-02T08:30:22Z) - Evaluating Cultural and Social Awareness of LLM Web Agents [113.49968423990616]
We introduce CASA, a benchmark designed to assess large language models' sensitivity to cultural and social norms.
Our approach evaluates LLM agents' ability to detect and appropriately respond to norm-violating user queries and observations.
Experiments show that current LLMs perform significantly better in non-agent environments.
arXiv Detail & Related papers (2024-10-30T17:35:44Z) - What's Wrong? Refining Meeting Summaries with LLM Feedback [6.532478490187084]
We introduce a multi-LLM correction approach for meeting summarization using a two-phase process that mimics the human review process.
We release QMSum Mistake, a dataset of 200 automatically generated meeting summaries annotated by humans on nine error types.
We transform identified mistakes into actionable feedback to improve the quality of a given summary measured by relevance, informativeness, conciseness, and coherence.
arXiv Detail & Related papers (2024-07-16T17:10:16Z) - Large Language Model Agents for Improving Engagement with Behavior Change Interventions: Application to Digital Mindfulness [17.055863270116333]
Large Language Models show promise in providing human-like dialogues that could emulate social support.
We conducted two randomized experiments to assess the impact of LLM agents on user engagement with mindfulness exercises.
arXiv Detail & Related papers (2024-07-03T15:43:16Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - Summaries, Highlights, and Action items: Design, implementation and evaluation of an LLM-powered meeting recap system [30.35387091657807]
Large language models (LLMs) for dialog summarization have the potential to improve the experience of meetings.
Despite this potential, they face technological limitations due to long transcripts and an inability to capture diverse recap needs based on users' context.
We develop a system to operationalize such recap representations, with dialogue summarization as its building block.
arXiv Detail & Related papers (2023-07-28T20:25:11Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms [41.82685006832153]
We propose a new challenge, RICA: Robust Inference capability based on Commonsense Axioms.
We generate data for this challenge using commonsense knowledge bases and probe PTLMs across two different evaluation settings.
Experiments show that PTLMs perform no better than random guessing in the zero-shot setting, are heavily impacted by statistical biases, and are not robust to perturbation attacks.
arXiv Detail & Related papers (2020-05-02T10:36:55Z)