autogen_ext.agents.web_surfer

class MultimodalWebSurfer(name: str, model_client: ChatCompletionClient, downloads_folder: str | None = None, description: str = DEFAULT_DESCRIPTION, debug_dir: str | None = None, headless: bool = True, start_page: str | None = DEFAULT_START_PAGE, animate_actions: bool = False, to_save_screenshots: bool = False, use_ocr: bool = False, browser_channel: str | None = None, browser_data_dir: str | None = None, to_resize_viewport: bool = True, playwright: Playwright | None = None, context: BrowserContext | None = None)

Bases: BaseChatAgent, Component[MultimodalWebSurferConfig]

MultimodalWebSurfer is a multimodal agent that acts as a web surfer: it can search the web and visit web pages.

Installation

pip install "autogen-ext[web-surfer]"

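It launches a Chromium browser and lets Playwright interact with it to perform a variety of actions. The browser is launched on the first call to the agent and is reused for subsequent calls.

It must be used with a multimodal model client that supports function/tool calling, ideally GPT-4o at present.

When on_messages() or on_messages_stream() is called, the following occurs:

1. If this is the first call, the browser is initialized and the page is loaded. This is done in _lazy_init(). The browser is only closed when close() is called.
2. The method _generate_reply() is called, which then creates the final response as described below.
3. The agent takes a screenshot of the page, extracts the interactive elements, and prepares a set-of-mark (SOM) screenshot with bounding boxes around the interactive elements.
4. The agent calls the model_client with the SOM screenshot, the message history, and the list of available tools.
   - If the model returns a string, the agent returns that string as the final response.
   - If the model returns a list of tool calls, the agent executes them with _execute_tool() using _playwright_controller.
   - The agent returns a final response that includes a screenshot of the page, the page metadata, a description of the action taken, and the inner text of the web page.
5. If the agent encounters an error at any point, it returns the error message as the final response.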
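
The branching in the reply flow above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: ToolCall, generate_reply, ask_model, run_tool, and the response strings are hypothetical stand-ins, and the screenshot and metadata steps are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class ToolCall:
    """Hypothetical stand-in for a tool call requested by the model."""

    name: str
    argument: str


# A model reply is either plain text or a list of tool calls.
ModelReply = Union[str, List[ToolCall]]


def generate_reply(
    ask_model: Callable[[], ModelReply],
    run_tool: Callable[[ToolCall], str],
) -> str:
    """Return the final response text for one agent call."""
    try:
        reply = ask_model()
        if isinstance(reply, str):
            # The model answered directly: its text is the final response.
            return reply
        # Otherwise execute each requested tool call and describe the result.
        actions = [run_tool(call) for call in reply]
        return "Actions taken: " + "; ".join(actions)
    except Exception as exc:
        # Any error is reported back as the final response.
        return f"Web surfing error: {exc}"


def fake_tool(call: ToolCall) -> str:
    """Pretend to run a tool and describe what was done."""
    return f"{call.name}({call.argument})"


print(generate_reply(lambda: "Done.", fake_tool))
print(generate_reply(lambda: [ToolCall("click", "12")], fake_tool))
```

The point of the sketch is that every path, including the error path, ends in a single final response.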

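Note

Using MultimodalWebSurfer involves interacting with a digital world designed for humans, which carries inherent risks. Agents may occasionally attempt risky actions, such as recruiting humans for help or accepting cookie agreements without human involvement. Always ensure agents are monitored and operate within a controlled environment to prevent unintended consequences. MultimodalWebSurfer may also be susceptible to prompt injection attacks from web pages.

Note

On Windows, the event loop policy must be set to WindowsProactorEventLoopPolicy to avoid issues with subprocesses.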

import sys
import asyncio

if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

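Parameters:

name (str) – The name of the agent.
model_client (ChatCompletionClient) – The model client used by the agent. Must be multimodal and support function calling.
downloads_folder (str, optional) – The folder where downloads are saved. Defaults to None, in which case downloads are not saved.
description (str, optional) – The description of the agent. Defaults to MultimodalWebSurfer.DEFAULT_DESCRIPTION.
debug_dir (str, optional) – The directory where debug information is saved. Defaults to None.
headless (bool, optional) – Whether the browser should run headless. Defaults to True.
start_page (str, optional) – The start page for the browser. Defaults to MultimodalWebSurfer.DEFAULT_START_PAGE.
animate_actions (bool, optional) – Whether to animate actions. Defaults to False.
to_save_screenshots (bool, optional) – Whether to save screenshots. Defaults to False.
use_ocr (bool, optional) – Whether to use OCR. Defaults to False.
browser_channel (str, optional) – The browser channel. Defaults to None.
browser_data_dir (str, optional) – The browser data directory. Defaults to None.
to_resize_viewport (bool, optional) – Whether to resize the viewport. Defaults to True.
playwright (Playwright, optional) – The Playwright instance. Defaults to None.
context (BrowserContext, optional) – The browser context. Defaults to None.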
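
As an illustration of how these parameters combine, the sketch below builds keyword arguments for a debugging session versus a normal run. The helper names surfer_options and make_surfer, the agent name, and the ./web_surfer_debug path are hypothetical; only the parameter names come from the list above. Setting debug_dir whenever screenshots are saved is a cautious assumption, since the screenshots need somewhere to go.

```python
from typing import Any, Dict


def surfer_options(debug: bool) -> Dict[str, Any]:
    """Build optional keyword arguments for MultimodalWebSurfer."""
    options: Dict[str, Any] = {
        "headless": not debug,         # show the browser window while debugging
        "animate_actions": debug,      # animate actions only while debugging
        "to_save_screenshots": debug,  # keep screenshots only while debugging
        "start_page": "https://www.bing.com/",
    }
    if debug:
        # Hypothetical path; assumed to be needed when screenshots are saved.
        options["debug_dir"] = "./web_surfer_debug"
    return options


def make_surfer(model_client: Any, debug: bool = False) -> Any:
    """Create the agent; name and model_client are the required arguments."""
    # Imported lazily so the option helper can be used without the package.
    from autogen_ext.agents.web_surfer import MultimodalWebSurfer

    return MultimodalWebSurfer(
        name="web_surfer",
        model_client=model_client,
        **surfer_options(debug),
    )
```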

Example usage

The following example demonstrates how to create a web surfing agent with a model client and run it for multiple turns.

import asyncio
from autogen_agentchat.ui import Console
from autogen_agentchat.teams import RoundRobinGroupChat
from autogen_ext.models.openai import OpenAIChatCompletionClient
from autogen_ext.agents.web_surfer import MultimodalWebSurfer


async def main() -> None:
    # Define an agent
    web_surfer_agent = MultimodalWebSurfer(
        name="MultimodalWebSurfer",
        model_client=OpenAIChatCompletionClient(model="gpt-4o-2024-08-06"),
    )

    # Define a team
    agent_team = RoundRobinGroupChat([web_surfer_agent], max_turns=3)

    # Run the team and stream messages to the console
    stream = agent_team.run_stream(task="Navigate to the AutoGen readme on GitHub.")
    await Console(stream)
    # Close the browser controlled by the agent
    await web_surfer_agent.close()


asyncio.run(main())

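DEFAULT_DESCRIPTION = '\n A helpful assistant with access to a web browser.\n Ask them to perform web searches, open pages, and interact with content (e.g., clicking links, scrolling the viewport, filling in form fields, etc.).\n It can also summarize the entire page, or answer questions based on the content of the page.\n It can also be asked to sleep and wait for pages to load, in cases where the page seems not yet fully loaded.\n '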

DEFAULT_START_PAGE = 'https://www.bing.com/'

MLM_HEIGHT = 765

MLM_WIDTH = 1224

SCREENSHOT_TOKENS = 1105

VIEWPORT_HEIGHT = 900

VIEWPORT_WIDTH = 1440
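
The size constants are related by simple arithmetic: 1224 × 765 is 1440 × 900 scaled by 0.85, so both share a 1.6 aspect ratio. The sketch below assumes, based only on the names, that MLM_WIDTH and MLM_HEIGHT describe a scaled-down screenshot and that coordinates in that image map linearly back to the viewport. The reference does not state this, and mlm_to_viewport is a hypothetical helper.

```python
import math

# Values copied from the class constants listed above.
VIEWPORT_WIDTH, VIEWPORT_HEIGHT = 1440, 900
MLM_WIDTH, MLM_HEIGHT = 1224, 765


def mlm_to_viewport(x: float, y: float) -> tuple:
    """Map a point in the scaled image to viewport pixels (assumed use)."""
    return (
        round(x * VIEWPORT_WIDTH / MLM_WIDTH),
        round(y * VIEWPORT_HEIGHT / MLM_HEIGHT),
    )


# Both sizes share the same aspect ratio, so scaling does not distort.
same_aspect = math.isclose(
    VIEWPORT_WIDTH / VIEWPORT_HEIGHT, MLM_WIDTH / MLM_HEIGHT
)

# The center of the scaled image maps to the center of the viewport.
center = mlm_to_viewport(MLM_WIDTH / 2, MLM_HEIGHT / 2)
```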

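classmethod _from_config(config: MultimodalWebSurferConfig) → Self

Creates a new instance of the component from a configuration object.

Parameters: config (T) – The configuration object.
Returns: Self – The new instance of the component.

_to_config() → MultimodalWebSurferConfig

Dumps the configuration required to create a new instance of the component that matches the configuration of this instance.

Returns: T – The configuration of the component.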

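async close() → None: Closes the browser and the page. Should be called when the agent is no longer needed.

component_config_schema: alias of MultimodalWebSurferConfig

component_provider_override: ClassVar[str | None] = 'autogen_ext.agents.web_surfer.MultimodalWebSurfer': Overrides the provider string for the component. This should be used to prevent internal module names from becoming part of the module name.

component_type: ClassVar[ComponentType] = 'agent': The logical type of the component.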

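async on_messages(messages: Sequence[BaseChatMessage], cancellation_token: CancellationToken) → Response: Handles incoming messages and returns a response.

Note

Agents are stateful, and the messages passed to this method should be the new messages received since the last call to this method. The agent should maintain its state between calls to this method. For example, if the agent needs to remember previous messages in order to respond to the current message, it should store the previous messages in the agent state.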
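
The calling convention in the note above can be illustrated with a toy agent. EchoAgent is hypothetical and uses plain strings instead of BaseChatMessage objects; it only shows that the caller passes new messages while the agent keeps the history.

```python
from typing import List, Sequence


class EchoAgent:
    """Toy stateful agent: it keeps its own history between calls."""

    def __init__(self) -> None:
        self._history: List[str] = []

    def on_messages(self, messages: Sequence[str]) -> str:
        # Only the new messages are passed in; earlier ones are already stored.
        self._history.extend(messages)
        return f"seen {len(self._history)} message(s), latest: {self._history[-1]}"


agent = EchoAgent()
first = agent.on_messages(["Open bing.com"])
# The second call passes only the new message, not "Open bing.com" again.
second = agent.on_messages(["Search for AutoGen"])
```

Resending earlier messages would duplicate them in the agent's stored history.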

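async on_messages_stream(messages: Sequence[BaseChatMessage], cancellation_token: CancellationToken) → AsyncGenerator[BaseAgentEvent | BaseChatMessage | Response, None]: Handles incoming messages and returns a stream of messages whose final item is the response. The base implementation in BaseChatAgent simply calls on_messages() and yields the messages in the response.

Note

Agents are stateful, and the messages passed to this method should be the new messages received since the last call to this method. The agent should maintain its state between calls to this method. For example, if the agent needs to remember previous messages in order to respond to the current message, it should store the previous messages in the agent state.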
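
A consumer of such a stream typically separates intermediate items from the final response. The sketch below shows that pattern with hypothetical stand-in types (FakeEvent, FakeResponse) and a fake generator; it does not use the real agent or message classes.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncGenerator, List, Optional, Union


@dataclass
class FakeEvent:
    """Hypothetical stand-in for an intermediate event or message."""

    text: str


@dataclass
class FakeResponse:
    """Hypothetical stand-in for the final response."""

    text: str


async def fake_stream() -> AsyncGenerator[Union[FakeEvent, FakeResponse], None]:
    yield FakeEvent("taking screenshot")
    yield FakeEvent("clicking link")
    yield FakeResponse("page loaded")  # the final item is the response


async def collect() -> tuple:
    events: List[FakeEvent] = []
    response: Optional[FakeResponse] = None
    async for item in fake_stream():
        if isinstance(item, FakeResponse):
            response = item  # keep the response separate from the events
        else:
            events.append(item)
    return events, response


events, response = asyncio.run(collect())
```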

async on_reset(cancellation_token: CancellationToken) → None: Resets the agent to its initialization state.

property produced_message_types: Sequence[type[BaseChatMessage]]: The types of messages that the agent produces in the Response.chat_message field. They must be BaseChatMessage types.

class PlaywrightController(downloads_folder: str | None = None, animate_actions: bool = False, viewport_width: int = 1440, viewport_height: int = 900, _download_handler: Callable[[Download], None] | None = None, to_resize_viewport: bool = True)

Bases: object

A helper class that allows Playwright to interact with web pages to perform actions such as clicking, filling, and scrolling.

Parameters:

downloads_folder (str | None) – The folder to save downloads to. If None, downloads are not saved.
animate_actions (bool) – Whether to animate the actions (creates a fake cursor for clicks).
viewport_width (int) – The width of the viewport.
viewport_height (int) – The height of the viewport.
_download_handler (Optional[Callable[[Download], None]]) – A function to handle downloads.
to_resize_viewport (bool) – Whether to resize the viewport.

async add_cursor_box(page: Page, identifier: str) → None

Adds a red cursor box around the element with the given identifier.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.

async back(page: Page) → None

Navigates back to the previous page.

Parameters: page (Page) – The Playwright page object.

async click_id(page: Page, identifier: str) → Page | None

Clicks the element with the given identifier.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.

Returns:

Page | None – The new page if a new page was opened, otherwise None.
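
Because click_id returns a page only when the click opened a new one, a caller has to decide which page to keep working on. The sketch below shows one way to handle that return value. FakePage, FakeController, click_and_follow, and the example URLs are hypothetical stand-ins that only mirror the documented return contract.

```python
import asyncio
from typing import Optional


class FakePage:
    """Hypothetical stand-in for a Playwright Page."""

    def __init__(self, url: str) -> None:
        self.url = url


class FakeController:
    """Hypothetical stand-in mirroring the click_id return contract."""

    async def click_id(self, page: FakePage, identifier: str) -> Optional[FakePage]:
        # Pretend that element "7" opens a link in a new tab.
        if identifier == "7":
            return FakePage("https://example.com/new-tab")
        return None


async def click_and_follow(
    controller: FakeController, page: FakePage, identifier: str
) -> FakePage:
    new_page = await controller.click_id(page, identifier)
    # Keep working on the new page if one was opened, otherwise stay put.
    return new_page if new_page is not None else page


start = FakePage("https://example.com/")
followed = asyncio.run(click_and_follow(FakeController(), start, "7"))
stayed = asyncio.run(click_and_follow(FakeController(), start, "3"))
```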

async fill_id(page: Page, identifier: str, value: str, press_enter: bool = True) → None

Fills the element with the given identifier with the specified value.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.
value (str) – The value to fill.

async get_focused_rect_id(page: Page) → str | None

Retrieves the ID of the currently focused element.

Parameters: page (Page) – The Playwright page object.
Returns: str | None – The ID of the focused element, or None if no control has focus.

async get_interactive_rects(page: Page) → Dict[str, InteractiveRegion]

Retrieves the interactive regions from the web page.

Parameters: page (Page) – The Playwright page object.
Returns: Dict[str, InteractiveRegion] – A dictionary of interactive regions.

async get_page_markdown(page: Page) → str

Retrieves the markdown content of the web page. Currently not implemented.

Parameters: page (Page) – The Playwright page object.
Returns: str – The markdown content of the page.

async get_page_metadata(page: Page) → Dict[str, Any]

Retrieves metadata from the web page.

Parameters: page (Page) – The Playwright page object.
Returns: Dict[str, Any] – A dictionary of page metadata.

async get_visible_text(page: Page) → str

Retrieves the text content of the browser viewport (approximately).

Parameters: page (Page) – The Playwright page object.
Returns: str – The text content of the page.

async get_visual_viewport(page: Page) → VisualViewport

Retrieves the visual viewport of the web page.

Parameters: page (Page) – The Playwright page object.
Returns: VisualViewport – The visual viewport of the page.

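async get_webpage_text(page: Page, n_lines: int = 50) → str

Retrieves the text content of the web page.

Parameters:

page (Page) – The Playwright page object.
n_lines (int) – The number of lines to return from the page's inner text.

Returns:

str – The text content of the page.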
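
The effect of n_lines can be pictured as keeping only the leading lines of the page's inner text. The helper below is a hypothetical plain-Python sketch of that truncation. It assumes the first n_lines lines are kept and makes no claim about how the library handles blank lines or whitespace.

```python
def first_lines(inner_text: str, n_lines: int = 50) -> str:
    """Keep at most the first n_lines lines of a block of text."""
    return "\n".join(inner_text.split("\n")[:n_lines])


sample = "Title\nFirst paragraph\nSecond paragraph\nFooter"
preview = first_lines(sample, n_lines=2)
```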

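async gradual_cursor_animation(page: Page, start_x: float, start_y: float, end_x: float, end_y: float) → None

Animates the cursor movement gradually from the start coordinates to the end coordinates.

Parameters:

page (Page) – The Playwright page object.
start_x (float) – The starting x-coordinate.
start_y (float) – The starting y-coordinate.
end_x (float) – The ending x-coordinate.
end_y (float) – The ending y-coordinate.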

async hover_id(page: Page, identifier: str) → None

Hovers the mouse over the element with the given identifier.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.

async on_new_page(page: Page) → None

Handles actions to perform on a new page.

Parameters: page (Page) – The Playwright page object.

async page_down(page: Page) → None

Scrolls the page down by one viewport height minus 50 pixels.

Parameters: page (Page) – The Playwright page object.

async page_up(page: Page) → None

Scrolls the page up by one viewport height minus 50 pixels.

Parameters: page (Page) – The Playwright page object.
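
With the default viewport height of 900, the scroll distance described for page_down and page_up works out to 850 pixels. The hypothetical helper below only reproduces that arithmetic; it is not the controller's code. The 50-pixel overlap presumably keeps a little of the previous view on screen after each scroll.

```python
def page_scroll_delta(viewport_height: int, direction: str) -> int:
    """Vertical scroll offset in pixels: positive is down, negative is up."""
    step = viewport_height - 50  # one viewport height minus 50 pixels
    if direction == "down":
        return step
    if direction == "up":
        return -step
    raise ValueError(f"unknown direction: {direction!r}")


down = page_scroll_delta(900, "down")
up = page_scroll_delta(900, "up")
```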

async remove_cursor_box(page: Page, identifier: str) → None

Removes the red cursor box around the element with the given identifier.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.

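async scroll_id(page: Page, identifier: str, direction: str) → None

Scrolls the element with the given identifier in the specified direction.

Parameters:

page (Page) – The Playwright page object.
identifier (str) – The element identifier.
direction (str) – The direction to scroll ("up" or "down").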

async sleep(page: Page, duration: int | float) → None

Pauses execution for the specified duration.

Parameters:

page (Page) – The Playwright page object.
duration (Union[int, float]) – The duration to sleep, in milliseconds.

async visit_page(page: Page, url: str) → Tuple[bool, bool]

Visits the specified URL.

Parameters:

page (Page) – The Playwright page object.
url (str) – The URL to visit.

Returns:

Tuple[bool, bool] – A tuple indicating whether to reset the prior metadata hash and the last download.
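
A caller might apply the two flags returned by visit_page as in the sketch below. SurferState, its fields, and apply_visit_result are hypothetical names chosen for illustration; only the meaning of the two booleans comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SurferState:
    """Hypothetical caller-side state that the two flags refer to."""

    prior_metadata_hash: Optional[str] = None
    last_download: Optional[str] = None


def apply_visit_result(state: SurferState, result: Tuple[bool, bool]) -> SurferState:
    """Clear cached state according to the flags returned by visit_page."""
    reset_prior_metadata_hash, reset_last_download = result
    if reset_prior_metadata_hash:
        state.prior_metadata_hash = None
    if reset_last_download:
        state.last_download = None
    return state


state = SurferState(prior_metadata_hash="abc123", last_download="report.pdf")
state = apply_visit_result(state, (True, False))
```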