EvaluationAgent 🧐

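The goal of the EvaluationAgent is to evaluate whether a Session or Round has been completed successfully. It assesses how well the HostAgent and AppAgent fulfilled the request. You can enable or disable the EvaluationAgent in the config_dev.yaml file; see the configuration documentation for details.

Note

The EvaluationAgent is fully LLM-driven and bases its evaluation on the action trajectories and screenshots. Because LLMs can make mistakes, its verdict is not guaranteed to be 100% accurate.

The figure below illustrates the evaluation process.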

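Configuration

To enable the EvaluationAgent, set the following parameters in the config_dev.yaml file. They control the level at which task completion is evaluated:

Configuration Option	Description	Type	Default Value
`EVA_SESSION`	Whether to include the session in the evaluation.	Boolean	True
`EVA_ROUND`	Whether to include the round in the evaluation.	Boolean	False
`EVA_ALL_SCREENSHOTS`	Whether to include all the screenshots in the evaluation.	Boolean	True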
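
As a rough illustration of how these flags might be consumed, the sketch below reads them from a plain dictionary that stands in for the parsed config_dev.yaml. The helper `evaluation_scopes` and the `DEFAULTS` dictionary are hypothetical and not part of UFO; only the option names and default values come from the table above.

```python
from typing import Dict, List

# Defaults copied from the configuration table above.
DEFAULTS: Dict[str, bool] = {
    "EVA_SESSION": True,
    "EVA_ROUND": False,
    "EVA_ALL_SCREENSHOTS": True,
}


def evaluation_scopes(config: Dict[str, bool]) -> List[str]:
    """Return the levels ("session", "round") that should be evaluated.

    Hypothetical helper: `config` stands in for the parsed config_dev.yaml.
    Options missing from `config` fall back to the documented defaults.
    """
    merged = {**DEFAULTS, **config}
    scopes: List[str] = []
    if merged["EVA_SESSION"]:
        scopes.append("session")
    if merged["EVA_ROUND"]:
        scopes.append("round")
    return scopes


print(evaluation_scopes({}))                   # ['session']
print(evaluation_scopes({"EVA_ROUND": True}))  # ['session', 'round']
```

With the defaults, only the session as a whole is evaluated; enabling `EVA_ROUND` adds a per-round evaluation.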

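Evaluation Inputs

The EvaluationAgent takes the following inputs for evaluation:

Input	Description	Type
User Request	The user request to be evaluated.	String
API Descriptions	The descriptions of the APIs used in the execution.	List of Strings
Action Trajectories	The action trajectories executed by the `HostAgent` and `AppAgent`.	List of Strings
Screenshots	The screenshots captured during the execution.	List of Images

For more details on how the inputs are constructed, see the EvaluationAgentPrompter class in ufo/prompter/eva_prompter.py.

Tip

You can use the EVA_ALL_SCREENSHOTS option in the config_dev.yaml file to choose whether the evaluation uses all screenshots or only the first and last ones.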
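
The effect of this option can be pictured with a small sketch. The function `select_screenshots` below is a hypothetical stand-in, not the code used by EvaluationAgentPrompter; it only shows the "all screenshots" versus "first and last screenshot" choice described in the tip.

```python
from typing import List, Sequence


def select_screenshots(
    paths: Sequence[str], eva_all_screenshots: bool = True
) -> List[str]:
    """Pick the screenshots to include in the evaluation.

    Hypothetical helper: `paths` is assumed to be ordered by capture time.
    """
    # With two or fewer screenshots, "first and last" already covers all.
    if eva_all_screenshots or len(paths) <= 2:
        return list(paths)
    return [paths[0], paths[-1]]


shots = ["step1.png", "step2.png", "step3.png", "step4.png"]
print(select_screenshots(shots, eva_all_screenshots=True))
# ['step1.png', 'step2.png', 'step3.png', 'step4.png']
print(select_screenshots(shots, eva_all_screenshots=False))
# ['step1.png', 'step4.png']
```

Using only the first and last screenshots reduces the number of images sent to the LLM, at the price of hiding the intermediate states from the evaluator.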

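Evaluation Outputs

The EvaluationAgent generates the following outputs after evaluation:

Output	Description	Type
Reason	The detailed reasoning behind the judgment, based on the observed differences between screenshots.	String
Sub-scores	The sub-scores obtained by decomposing the evaluation into multiple sub-goals.	Dictionary
Complete	The completion status of the evaluation, which can be `yes`, `no`, or `unsure`.	String

Below is an example of the evaluation output: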

{
    "reason": "The agent successfully completed the task of sending 'hello' to Zac on Microsoft Teams. The initial screenshot shows the Microsoft Teams application with the chat window of Chaoyun Zhang open. The agent then focused on the chat window, input the message 'hello', and clicked the Send button. The final screenshot confirms that the message 'hello' was sent to Zac.",
    "sub_scores": {
        "correct application focus": "yes",
        "correct message input": "yes",
        "message sent successfully": "yes"
    },
    "complete": "yes"
}

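Info

The evaluation results are saved to the logs/{task_name}/evaluation.log file.

The EvaluationAgent uses a chain-of-thought (CoT) mechanism: it first decomposes the evaluation into multiple sub-goals and evaluates each sub-goal separately. The sub-scores are then aggregated to determine the overall completion status.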
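
In the EvaluationAgent the aggregation is carried out by the LLM as part of its response, so there is no fixed formula. Purely to make the idea concrete, the sketch below applies one conservative rule to a sub-score dictionary; the function `aggregate_sub_scores` and the rule itself are assumptions for illustration, not the agent's actual logic.

```python
from typing import Dict


def aggregate_sub_scores(sub_scores: Dict[str, str]) -> str:
    """Combine per-sub-goal verdicts into "yes", "no", or "unsure".

    Assumed rule (illustrative only): any "no" fails the task, all "yes"
    passes it, and anything else (including no sub-goals) is "unsure".
    """
    values = [value.strip().lower() for value in sub_scores.values()]
    if not values:
        return "unsure"
    if "no" in values:
        return "no"
    if all(value == "yes" for value in values):
        return "yes"
    return "unsure"


print(aggregate_sub_scores({
    "correct application focus": "yes",
    "correct message input": "yes",
    "message sent successfully": "yes",
}))  # yes
```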

Reference

Bases: BasicAgent

The agent for evaluation.

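Initialize the EvaluationAgent.

Parameters	`name` (`str`) – The name of the agent. `app_root_name` (`str`) – The root name of the application. `is_visual` (`bool`) – The flag indicating whether the agent is visual or not. `main_prompt` (`str`) – The main prompt template. `example_prompt` (`str`) – The example prompt template. `api_prompt` (`str`) – The API prompt template.

Source code in agents/agent/evaluation_agent.py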

def __init__(
    self,
    name: str,
    app_root_name: str,
    is_visual: bool,
    main_prompt: str,
    example_prompt: str,
    api_prompt: str,
):
    """
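    Initialize the EvaluationAgent.
    :param name: The name of the agent.
    :param app_root_name: The root name of the application.
    :param is_visual: The flag indicating whether the agent is visual or not.
    :param main_prompt: The main prompt template.
    :param example_prompt: The example prompt template.
    :param api_prompt: The API prompt template.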
    """

    super().__init__(name=name)

    self._app_root_name = app_root_name
    self.prompter = self.get_prompter(
        is_visual,
        main_prompt,
        example_prompt,
        api_prompt,
        app_root_name,
    )

`status_manager` `property`

Get the status manager.

`evaluate(request, log_path, eva_all_screenshots=True)`

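Evaluate the task completion.

Parameters	`request` (`str`) – The request. `log_path` (`str`) – The path to the log file. `eva_all_screenshots` (`bool`, default: `True`) – The flag indicating whether to evaluate all screenshots.

Returns	`Tuple[Dict[str, str], float]` – The evaluation result and the cost of LLM.

Source code in agents/agent/evaluation_agent.py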

def evaluate(
    self, request: str, log_path: str, eva_all_screenshots: bool = True
) -> Tuple[Dict[str, str], float]:
    """
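    Evaluate the task completion.
    :param request: The request.
    :param log_path: The path to the log file.
    :param eva_all_screenshots: The flag indicating whether to evaluate all screenshots.
    :return: The evaluation result and the cost of LLM.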
    """

    message = self.message_constructor(
        log_path=log_path, request=request, eva_all_screenshots=eva_all_screenshots
    )
    result, cost = self.get_response(
        message=message, namescope="eva", use_backup_engine=True
    )

    result = json_parser(result)

    return result, cost
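
The method above passes the raw LLM response through `json_parser` before returning it. The behavior of `json_parser` is not shown on this page, so the sketch below is a standalone stand-in built only on the standard library: it parses the response, tolerates a Markdown code fence around it, and checks for the three keys from the Evaluation Outputs table. The name `parse_evaluation` and the fence handling are assumptions for illustration.

```python
import json
from typing import Any, Dict

REQUIRED_KEYS = ("reason", "sub_scores", "complete")
FENCE = "`" * 3  # Markdown code fence marker


def parse_evaluation(raw: str) -> Dict[str, Any]:
    """Parse a raw LLM response into an evaluation dictionary.

    Hypothetical stand-in for the parsing step; raises ValueError when the
    response is not a JSON object or lacks an expected key.
    """
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line and, if present, the closing one.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)

    result = json.loads(text)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(result, dict):
        raise ValueError("evaluation response must be a JSON object")

    missing = [key for key in REQUIRED_KEYS if key not in result]
    if missing:
        raise ValueError(f"evaluation response is missing keys: {missing}")
    return result


raw = '{"reason": "Message sent.", "sub_scores": {"sent": "yes"}, "complete": "yes"}'
print(parse_evaluation(raw)["complete"])  # yes
```

Validating the keys up front means a malformed response fails at the parsing step rather than later, when the result is printed or logged.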

`get_prompter(is_visual, prompt_template, example_prompt_template, api_prompt_template, root_name=None)`

Get the prompter for the agent.

Source code in agents/agent/evaluation_agent.py

def get_prompter(
    self,
    is_visual,
    prompt_template: str,
    example_prompt_template: str,
    api_prompt_template: str,
    root_name: Optional[str] = None,
) -> EvaluationAgentPrompter:
    """
    Get the prompter for the agent.
    """

    return EvaluationAgentPrompter(
        is_visual=is_visual,
        prompt_template=prompt_template,
        example_prompt_template=example_prompt_template,
        api_prompt_template=api_prompt_template,
        root_name=root_name,
    )

`message_constructor(log_path, request, eva_all_screenshots=True)`

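Construct the message.

Parameters	`log_path` (`str`) – The path to the log file. `request` (`str`) – The request. `eva_all_screenshots` (`bool`, default: `True`) – The flag indicating whether to evaluate all screenshots.

Returns	`Dict[str, Any]` – The message.

Source code in agents/agent/evaluation_agent.py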

def message_constructor(
    self, log_path: str, request: str, eva_all_screenshots: bool = True
) -> Dict[str, Any]:
    """
    Construct the message.
    :param log_path: The path to the log file.
    :param request: The request.
    :param eva_all_screenshots: The flag indicating whether to evaluate all screenshots.
    :return: The message.
    """

    evaagent_prompt_system_message = self.prompter.system_prompt_construction()

    evaagent_prompt_user_message = self.prompter.user_content_construction(
        log_path=log_path, request=request, eva_all_screenshots=eva_all_screenshots
    )

    evaagent_prompt_message = self.prompter.prompt_construction(
        evaagent_prompt_system_message, evaagent_prompt_user_message
    )

    return evaagent_prompt_message

`print_response(response_dict)`

Print the response of the evaluation.

Parameters	`response_dict` (`Dict[str, Any]`) – The response dictionary.

Source code in agents/agent/evaluation_agent.py

def print_response(self, response_dict: Dict[str, Any]) -> None:
    """
    Print the response of the evaluation.
    :param response_dict: The response dictionary.
    """

    emoji_map = {
        "yes": "✅",
        "no": "❌",
        "maybe": "❓",
    }

    complete = emoji_map.get(
        response_dict.get("complete"), response_dict.get("complete")
    )

    sub_scores = response_dict.get("sub_scores", {})
    reason = response_dict.get("reason", "")

    print_with_color("Evaluation result🧐:", "magenta")
    print_with_color("[Sub-scores📊:]", "green")

    for score, evaluation in sub_scores.items():
        print_with_color(
            f"{score}: {emoji_map.get(evaluation, evaluation)}", "green"
        )

    print_with_color(f"[Task is complete💯:] {complete}", "cyan")

    # Use a plain f-string: chaining .format() on an already-formatted string
    # raises an error if the reason text itself contains curly braces.
    print_with_color(f"[Reason🤔:] {reason}", "blue")

`process_comfirmation()`

Confirmation; currently does nothing.

Source code in agents/agent/evaluation_agent.py

def process_comfirmation(self) -> None:
    """
    Confirmation; currently does nothing.
    """
    pass
