包阅 Reading Digest
1. Keywords:
– RUBICON
– Human-AI conversation evaluation
– Domain-specific
– Evaluation system
– Software development
2. Summary:
This article introduces RUBICON, presented at the AIware 2024 conference, a technique for evaluating conversations between humans and AI systems. Grounded in communication principles, it is an automated evaluation technique that generates a large set of domain-specific rubrics from a small amount of data, improving evaluation accuracy. It has been successfully applied in the Visual Studio IDE, and its applicability is expected to broaden in the future.
3. Main points:
– Background
– Generative AI makes it more challenging to evaluate its impact on user interactions.
– Introducing RUBICON
– A rubric-based technique for evaluating domain-specific human-AI conversations.
– Adapts communication principles to domain-specific settings.
– RUBICON's method and evaluation
– Builds on SPUR, using a language model to create summaries that assess conversation quality.
– A new algorithm selects high-quality rubrics without human intervention.
– 18% more accurate than SPUR; near-perfect predictions on 84% of unlabeled data.
– RUBICON-generated rubrics
– Serve as a framework for understanding user needs and norms; applied in the Visual Studio IDE.
– Surface flaws in conversations and in system design.
– Implications and outlook
– A valuable evaluation system whose applicability will be extended.
Source: microsoft.com
Author: Brenda Potts
Published: 2024/7/15 16:00
Language: English
Word count: 1,116
Estimated reading time: 5 minutes
Rating: 90/100
Tags: AI evaluation, conversational AI, human-computer interaction, RUBICON framework, domain-specific AI
The original article follows.
This paper has been accepted at the 1st ACM International Conference on AI-powered Software (AIware 2024), co-located with FSE 2024. AIware is the premier international forum on AI-powered software.

Generative AI has redefined the landscape of AI assistants in software development, with innovations like GitHub Copilot providing real-time, chat-based programming support. As these tools increase in sophistication and domain specialization, assessing their impact on user interactions becomes more challenging. Developers frequently question whether modifications to their AI assistants genuinely improve the user experience, as indicated in a recent paper.
Traditional feedback mechanisms, such as simple thumbs-up or thumbs-down ratings, fall short in capturing the complexities of interactions within specialized settings, where nuanced data is often sparse. To address this issue, we introduce "RUBICON: Rubric-based Evaluation of Domain-Specific Human-AI Conversations," presented at AIware 2024. RUBICON is an automated assessment technique that transforms a minimal dataset into an extensive array of domain-specific rubrics, helping ensure that updates not only modify but meaningfully improve user interactions.
Foundational communication principles
Effective conversation, whether human-to-human or human-to-AI, adheres to four maxims outlined by philosopher Paul Grice: quantity, quality, relation, and manner, ensuring that communication is concise, truthful, pertinent, and clear. In AI applications, these maxims help create interactions that feel natural and engaging, fostering trust and empathy. Within domain-specific settings, RUBICON adapts these principles to ensure they are context-aware, improving the utility and clarity of interactions. For example, in Visual Studio, the AI helps the developer debug a program by providing detailed explanations and relevant code examples, shown in Figure 1. In Figure 2, its responses reflect that it's guided by context.


In task-oriented environments, it’s important to assess how well a conversation aligns with user expectations and assists in achieving their goals. Conversations are only useful if they advance the user’s interests, and challenges can arise when users have misaligned expectations of the AI’s capabilities or when the AI directs the conversation too forcefully, prioritizing its methods over the user’s preferences. RUBICON balances the interaction dynamics between the AI and developer, promoting constructive exchanges without overwhelming or under-engaging. It calibrates the extent to which the AI should hypothesize and resolve issues versus how much it should leave to the developer.
RUBICON’s rubric-based method and evaluation
RUBICON is built on the foundational work of SPUR—the recently introduced Supervised Prompting for User Satisfaction Rubrics framework—extending its scope by crafting a broad spectrum of candidate rubrics from each batch of data. It uses a language model to create concise summaries that assess the quality of conversations, emphasizing communication principles, task orientation, and domain specificity. It identifies signals of user satisfaction and outlines the shared responsibilities of the user and the AI in achieving task objectives. These summaries are then refined into rubrics.
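As a concrete illustration, here is a minimal sketch of what this summary-to-rubric step could look like. It is not the paper's actual prompt or code; `llm_complete` and the prompt wording are placeholders for whichever language model and phrasing a team uses.

```python
# A minimal sketch of the summarization step, assuming a generic
# chat-completion API. `llm_complete` and SUMMARY_PROMPT are placeholders,
# not RUBICON's actual implementation.

def llm_complete(prompt: str) -> str:
    """Stand-in for a call to a language model provider."""
    raise NotImplementedError("wire up your LLM provider here")

SUMMARY_PROMPT = """You are reviewing a human-AI debugging conversation.
Write one bullet per observation, covering:
- conciseness, truthfulness, relevance, and clarity of the AI's replies
  (Grice's maxims of quantity, quality, relation, and manner);
- whether the conversation advanced the user's task;
- domain-specific signals of satisfaction or dissatisfaction.

Conversation:
{conversation}
"""

def generate_rubric_candidates(conversations: list[str]) -> list[str]:
    """Turn each conversation into candidate rubric statements, one per bullet."""
    candidates: list[str] = []
    for convo in conversations:
        summary = llm_complete(SUMMARY_PROMPT.format(conversation=convo))
        candidates += [line.lstrip("- ").strip()
                       for line in summary.splitlines() if line.strip()]
    return candidates
```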
RUBICON’s novel selection algorithm sifts through numerous candidates to identify a select group of high-quality rubrics, enhancing their predictive accuracy in practical applications, as illustrated in Figure 3. The technique doesn’t require human intervention and can be trained directly on anonymized conversational data, helping to ensure customer data privacy while still extracting the important features for analysis.
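This post does not spell out the selection algorithm, but one plausible shape, sketched below under that assumption, is a greedy search that keeps a candidate rubric only if it improves label-prediction accuracy on the small labeled set. `score_conversation` again stands in for an LLM call that rates how strongly a rubric applies to a conversation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rubric:
    text: str
    polarity: int  # +1 for a satisfaction rubric, -1 for a dissatisfaction rubric

def score_conversation(rubric: Rubric, conversation: str) -> float:
    """Stand-in: an LLM rates how strongly `rubric` applies (e.g., 0-10)."""
    raise NotImplementedError

def net_score(rubrics: list[Rubric], conversation: str) -> float:
    """Satisfaction rubrics add to the score; dissatisfaction rubrics subtract."""
    return sum(r.polarity * score_conversation(r, conversation) for r in rubrics)

def greedy_select(candidates: list[Rubric],
                  labeled: list[tuple[str, bool]],  # (conversation, is_positive)
                  k: int = 10) -> list[Rubric]:
    """Grow a rubric set that best separates positive from negative examples."""
    def accuracy(rubrics: list[Rubric]) -> float:
        hits = sum((net_score(rubrics, c) > 0) == pos for c, pos in labeled)
        return hits / len(labeled)

    selected: list[Rubric] = []
    while len(selected) < k:
        remaining = [r for r in candidates if r not in selected]
        best = max(remaining, key=lambda r: accuracy(selected + [r]), default=None)
        if best is None or accuracy(selected + [best]) <= accuracy(selected):
            break  # no remaining candidate improves accuracy on the labeled set
        selected.append(best)
    return selected
```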

The effectiveness of RUBICON’s method is evidenced by its rubrics, which show an 18% increase in accuracy over SPUR in classifying conversations as positive or negative, as shown in Figure 4. Additionally, RUBICON achieves near-perfect precision in predicting conversation labels in 84% of cases involving unlabeled data.
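One way to read the 84% figure: a rubric-based classifier can abstain when the net score is too close to zero, trading coverage for precision. The sketch below shows how that tradeoff could be computed; the margin value is illustrative, not from the paper.

```python
def precision_at_margin(scored: list[tuple[float, bool]],
                        margin: float = 2.0) -> tuple[float, float]:
    """`scored` pairs each conversation's net rubric score with its true label.
    Returns (precision on confidently labeled cases, fraction of cases covered).
    The margin of 2.0 is an illustrative assumption."""
    covered = [(s, y) for s, y in scored if abs(s) >= margin]
    if not covered:
        return 0.0, 0.0
    correct = sum((s > 0) == y for s, y in covered)
    return correct / len(covered), len(covered) / len(scored)
```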

RUBICON-generated rubrics
RUBICON-generated rubrics serve as a framework for understanding user needs, expectations, and conversational norms. These rubrics have been successfully implemented in Visual Studio IDE, where they have guided analysis of over 12,000 debugging conversations, offering valuable insights into the effectiveness of modifications made to the assistant and facilitating rapid iteration and improvement. For example, the rubrics "The AI gave a solution too quickly, rather than asking the user for more information and trying to find the root cause of the issue" and "The AI gave a mostly surface-level solution to the problem" have indicated issues where the assistant prematurely offered solutions without gathering sufficient information. These findings led to adjustments in the AI's behavior, making it more investigative and collaborative.
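At this scale, per-rubric scores can be aggregated to surface the most common failure modes. A hedged sketch of one way to do that, reusing the `Rubric` and `score_conversation` placeholders from the earlier sketch and an illustrative threshold:

```python
from collections import Counter

def frequent_issues(rubrics: list[Rubric],
                    conversations: list[str],
                    threshold: float = 7.0) -> Counter:
    """Count, per dissatisfaction rubric, how many conversations it strongly
    matches; the threshold is illustrative. Sorting the counter then ranks
    recurring issues such as 'The AI gave a solution too quickly.'"""
    counts: Counter = Counter()
    for convo in conversations:
        for r in rubrics:
            if r.polarity < 0 and score_conversation(r, convo) >= threshold:
                counts[r.text] += 1
    return counts
```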
Beyond conversational dynamics, the rubrics also identify systemic design flaws not directly tied to the conversational assistant. These include user interface issues that impede the integration of new code and gaps in user education regarding the assistant's capabilities. To use RUBICON, developers need a small set of labeled conversations from their AI assistant and specifically designed prompts that reflect the criteria for task progression and completion. The methodology and examples of these rubrics are detailed in the paper.
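The inputs described above might look something like the following; the field names and prompt wording are assumptions for illustration, not the paper's actual format.

```python
# Illustrative input shapes; field names and wording are assumptions.
labeled_conversations = [
    {"conversation": "User: Why am I getting a NullReferenceException here?\n"
                     "AI: Let's check whether the variable is assigned before use...",
     "label": "positive"},
    {"conversation": "User: My build fails.\nAI: Reinstall Visual Studio.",
     "label": "negative"},
]

# Prompts reflecting the criteria for task progression and completion.
task_prompts = {
    "progression": ("A debugging conversation progresses when the assistant "
                    "narrows the root cause or proposes a concrete next step."),
    "completion": ("A debugging conversation is complete when the user confirms "
                   "the error is resolved or accepts a working fix."),
}
```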
Implications and looking ahead
Developers of AI assistants value clear insights into the performance of their interfaces. RUBICON represents a valuable step toward a refined evaluation system that is sensitive to domain-specific tasks, adaptable to changing usage patterns, efficient, easy to implement, and privacy-conscious. A robust evaluation system like RUBICON can help improve the quality of these tools without compromising user privacy or data security. Looking ahead, our goal is to broaden the applicability of RUBICON beyond debugging in AI assistants like GitHub Copilot. We aim to support additional tasks such as migration and scaffolding within IDEs, extending its utility to other chat-based Copilot experiences across various products.