Web Automator

We also support using the Web Automator to fetch the content of a web page. The Web Automator is implemented in the ufo/automator/app_apis/web module.

Configuration

Before using the API Automator, several options must be set in the config_dev.yaml file. The following configurations are related to the API Automator:

Configuration Option	Description	Type	Default Value
`USE_APIS`	Whether to allow the use of application APIs.	Boolean	True
`APP_API_PROMPT_ADDRESS`	The prompt address for the application API.	Dict	{"WINWORD.EXE": "ufo/prompts/apps/word/api.yaml", "EXCEL.EXE": "ufo/prompts/apps/excel/api.yaml", "msedge.exe": "ufo/prompts/apps/web/api.yaml", "chrome.exe": "ufo/prompts/apps/web/api.yaml"}

Note

The Web Automator currently supports only msedge.exe and chrome.exe.
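
The two options above can be read together as a lookup: when `USE_APIS` is enabled, the process name selects a prompt file from `APP_API_PROMPT_ADDRESS`. The following is a minimal sketch of that idea, not UFO's actual configuration loader. The `CONFIG` dictionary is an in-memory stand-in for config_dev.yaml, and `resolve_api_prompt` is a hypothetical helper introduced only for illustration.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the relevant part of config_dev.yaml.
CONFIG = {
    "USE_APIS": True,
    "APP_API_PROMPT_ADDRESS": {
        "WINWORD.EXE": "ufo/prompts/apps/word/api.yaml",
        "EXCEL.EXE": "ufo/prompts/apps/excel/api.yaml",
        "msedge.exe": "ufo/prompts/apps/web/api.yaml",
        "chrome.exe": "ufo/prompts/apps/web/api.yaml",
    },
}


def resolve_api_prompt(process_name: str, config: dict = CONFIG) -> Optional[str]:
    """Return the API prompt path for a process, or None if unavailable.

    Hypothetical helper: uses an exact-match lookup on the process name.
    """
    if not config.get("USE_APIS", False):
        # Application APIs are disabled, so no prompt file applies.
        return None
    return config.get("APP_API_PROMPT_ADDRESS", {}).get(process_name)
```

With the default values listed in the table, both `msedge.exe` and `chrome.exe` resolve to the same web prompt file, while a browser that is not listed resolves to `None`.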

Receiver

The Web Automator receiver is the WebReceiver class defined in the ufo/automator/app_apis/web/webclient.py module.

Bases: ReceiverBasic

The base class for the Web COM client using crawl4ai.

Initialize the Web COM client.

Source code in automator/app_apis/web/webclient.py

def __init__(self) -> None:
    """
    Initialize the Web COM client.
    """
    self._headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
    }

`web_crawler(url, ignore_link)`

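Run the crawler with various options.

Parameters	`url` (`str`) – The URL of the webpage. `ignore_link` (`bool`) – Whether to ignore the links.

Returns	`str` – The resulting Markdown content.

Source code in automator/app_apis/web/webclient.py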

def web_crawler(self, url: str, ignore_link: bool) -> str:
    """
    Run the crawler with various options.
    :param url: The URL of the webpage.
    :param ignore_link: Whether to ignore the links.
    :return: The result markdown content.
    """

    try:
        # Get the HTML content of the webpage
        response = requests.get(url, headers=self._headers)
        response.raise_for_status()

        html_content = response.text

        # Convert the HTML content to markdown
        h = html2text.HTML2Text()
        h.ignore_links = ignore_link
        markdown_content = h.handle(html_content)

        return markdown_content

    except requests.RequestException as e:
        print(f"Error fetching the URL: {e}")

        return f"Error fetching the URL: {e}"
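
As the listing shows, `web_crawler` does not raise on a failed request. It returns a string that starts with "Error fetching the URL:" instead, so a caller has to inspect the return value to tell a failure from page content. The sketch below shows one way to do that. `ERROR_PREFIX` and `is_crawl_error` are hypothetical names that are not part of UFO, and the check is only a heuristic, since a page whose Markdown happens to begin with the same text would be misclassified.

```python
# Prefix copied from the error string returned by web_crawler in the listing above.
ERROR_PREFIX = "Error fetching the URL:"


def is_crawl_error(result: str) -> bool:
    """Hypothetical helper: report whether a web_crawler result is an error message."""
    return result.startswith(ERROR_PREFIX)
```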

Command

The Web Automator currently supports only one command, which fetches the content of a web page as Markdown. More commands will be added to the Web Automator in the future.

@WebReceiver.register
class WebCrawlerCommand(WebCommand):
    """
    The command to run the crawler with various options.
    """

    def execute(self):
        """
        Execute the command to run the crawler.
        :return: The result content.
        """
        return self.receiver.web_crawler(
            url=self.params.get("url"),
            ignore_link=self.params.get("ignore_link", False),
        )

    @classmethod
    def name(cls) -> str:
        """
        The name of the command.
        """
        return "web_crawler"

The following commands are currently supported by UFO in the Web Automator:

Command Name	Function Name	Description
`WebCrawlerCommand`	`web_crawler`	Fetch the content of a web page as Markdown.
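
To illustrate how the command forwards its parameters to the receiver, here is a simplified, self-contained sketch. It does not use UFO's real base classes: `StubReceiver` and `WebCrawlerCommandSketch` are hypothetical stand-ins, and the stub performs no network request. Only the `execute` body mirrors the `WebCrawlerCommand` listing, including the default of `False` for `ignore_link`.

```python
from typing import Any, Dict


class StubReceiver:
    """Hypothetical stand-in for WebReceiver that performs no network I/O."""

    def web_crawler(self, url: str, ignore_link: bool) -> str:
        # Echo the arguments so the forwarding behavior is visible.
        return f"crawled {url} (ignore_link={ignore_link})"


class WebCrawlerCommandSketch:
    """Simplified mirror of WebCrawlerCommand; not the UFO base class."""

    def __init__(self, receiver: StubReceiver, params: Dict[str, Any]) -> None:
        self.receiver = receiver
        self.params = params

    def execute(self) -> str:
        # Same parameter handling as the original execute method.
        return self.receiver.web_crawler(
            url=self.params.get("url"),
            ignore_link=self.params.get("ignore_link", False),
        )


cmd = WebCrawlerCommandSketch(StubReceiver(), {"url": "https://example.com"})
print(cmd.execute())  # prints: crawled https://example.com (ignore_link=False)
```

Because `ignore_link` is read with a default, a caller that supplies only `url` gets Markdown with links preserved.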

Tip

For prompt details of the WebCrawlerCommand command, see the ufo/prompts/apps/web/api.yaml file.