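Visual Control Detection (OmniParser)

We also support visual control detection with OmniParser-v2. This method helps detect custom controls in an application that standard UIA methods may not recognize. Visual control detection uses computer vision techniques to identify and interact with UI elements based on their visual appearance.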

Deployment

On your remote GPU server, clone the OmniParser repository:

git clone https://github.com/microsoft/OmniParser.git

Start the omniparserserver service:

cd OmniParser/omnitool/omniparserserver
python gradio_demo.py

This gives you a short public URL:

* Running on local URL:  http://0.0.0.0:7861
* Running on public URL: https://xxxxxxxxxxxxxxxxxx.gradio.live

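Note: If you have any questions about deploying OmniParser, consult the README in the OmniParser repository.

Configuration

After deploying the OmniParser model, configure the OmniParser settings in the config.yaml file: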

OMNIPARSER: {
  ENDPOINT: "<YOUR_END_POINT>", # The endpoint for the omniparser deployment
  BOX_THRESHOLD: 0.05, # The box confidence threshold for the omniparser, default is 0.05
  IOU_THRESHOLD: 0.1, # The iou threshold for the omniparser, default is 0.1
  USE_PADDLEOCR: True, # Whether to use the paddleocr for the omniparser
  IMGSZ: 640 # The image size for the omniparser
}
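A quick sanity check on these values before starting an agent can catch a placeholder endpoint or an out-of-range threshold. The sketch below is illustrative only: `validate_omniparser_config` is a hypothetical helper that is not part of the project, and the accepted ranges (thresholds between 0 and 1, positive image size) are assumptions based on what the settings mean.

```python
from typing import Any, Dict, List


def validate_omniparser_config(cfg: Dict[str, Any]) -> List[str]:
    """Hypothetical helper: return a list of problems found in the OMNIPARSER settings.

    An empty list means the settings look plausible.
    """
    problems = []

    # The placeholder "<YOUR_END_POINT>" must be replaced by a real URL.
    endpoint = cfg.get("ENDPOINT", "")
    if not isinstance(endpoint, str) or not endpoint.startswith(("http://", "https://")):
        problems.append("ENDPOINT should be an http(s) URL")

    # Assumption: confidence and IoU thresholds are fractions in [0, 1].
    for key in ("BOX_THRESHOLD", "IOU_THRESHOLD"):
        value = cfg.get(key)
        if isinstance(value, bool) or not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            problems.append(f"{key} should be a number between 0 and 1")

    if not isinstance(cfg.get("USE_PADDLEOCR"), bool):
        problems.append("USE_PADDLEOCR should be a boolean")

    imgsz = cfg.get("IMGSZ")
    if isinstance(imgsz, bool) or not isinstance(imgsz, int) or imgsz <= 0:
        problems.append("IMGSZ should be a positive integer")

    return problems


if __name__ == "__main__":
    settings = {
        "ENDPOINT": "<YOUR_END_POINT>",
        "BOX_THRESHOLD": 0.05,
        "IOU_THRESHOLD": 0.1,
        "USE_PADDLEOCR": True,
        "IMGSZ": 640,
    }
    for problem in validate_omniparser_config(settings):
        print(problem)
```

Run against the unedited template, this reports only the endpoint placeholder, since the remaining values are the documented defaults.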

To activate icon control filtering, set CONTROL_BACKEND to ["omniparser"] in the config_dev.yaml file.

CONTROL_BACKEND: ["omniparser"]

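Reference

The following class is used for visual control detection with OmniParser.

Bases: BasicGrounding

The OmniparserGrounding class is a subclass of BasicGrounding that represents the Omniparser grounding model.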

`parse_results(results, application_window=None)`

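Parse the grounding results into a list of control element information dictionaries.

Parameters

`results` (`List[Dict[str, Any]]`) – The list of grounding result dictionaries from the grounding model.
`application_window` (`UIAWrapper`, default: `None`) – The application window used to compute absolute coordinates.

Returns

`List[Dict[str, Any]]` – The list of control element information dictionaries. Each dictionary should contain the following keys: { "control_type": the control type of the element, "name": the name of the element, "x0": the absolute left coordinate of the bounding box (integer), "y0": the absolute top coordinate of the bounding box (integer), "x1": the absolute right coordinate of the bounding box (integer), "y1": the absolute bottom coordinate of the bounding box (integer) }

Source code in automator/ui_control/grounding/omniparser.py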

def parse_results(
    self, results: List[Dict[str, Any]], application_window: UIAWrapper = None
) -> List[Dict[str, Any]]:
    """
    Parse the grounding results string into a list of control elements information dictionaries.
    :param results: The list of grounding results dictionaries from the grounding model.
    :param application_window: The application window to get the absolute coordinates.
    :return: The list of control elements information dictionaries, the dictionary should contain the following keys:
    {
        "control_type": The control type of the element,
        "name": The name of the element,
        "x0": The absolute left coordinate of the bounding box in integer,
        "y0": The absolute top coordinate of the bounding box in integer,
        "x1": The absolute right coordinate of the bounding box in integer,
        "y1": The absolute bottom coordinate of the bounding box in integer,
    }
    """

    control_elements_info = []

    if application_window is None:
        application_rect = RECT(0, 0, 0, 0)
    else:
        try:
            application_rect = application_window.rectangle()
        except Exception:
            application_rect = RECT(0, 0, 0, 0)

    for control_info in results:

        if not self._filter_interactivity and control_info.get(
            "interactivity", True
        ):
            continue

        application_left, application_top = (
            application_rect.left,
            application_rect.top,
        )

        control_box = control_info.get("bbox", [0, 0, 0, 0])

        control_left = int(
            application_left + control_box[0] * application_rect.width()
        )
        control_top = int(
            application_top + control_box[1] * application_rect.height()
        )
        control_right = int(
            application_left + control_box[2] * application_rect.width()
        )
        control_bottom = int(
            application_top + control_box[3] * application_rect.height()
        )

        control_elements_info.append(
            {
                "control_type": control_info.get("type", "Button"),
                "name": control_info.get("content", ""),
                "x0": control_left,
                "y0": control_top,
                "x1": control_right,
                "y1": control_bottom,
            }
        )

    return control_elements_info
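
The coordinate arithmetic in `parse_results` treats each `bbox` as fractions of the window size and offsets them by the window's top-left corner. The following is a minimal standalone sketch of that step, not the project's code: `Rect` is a hypothetical stand-in for the RECT object used in the source, and `to_absolute` is an illustrative name.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rect:
    """Hypothetical stand-in for the window rectangle, in screen pixels."""

    left: int
    top: int
    right: int
    bottom: int

    def width(self) -> int:
        return self.right - self.left

    def height(self) -> int:
        return self.bottom - self.top


def to_absolute(bbox: List[float], rect: Rect) -> Tuple[int, int, int, int]:
    """Map a normalized [x0, y0, x1, y1] box to absolute integer screen coordinates."""
    x0 = int(rect.left + bbox[0] * rect.width())
    y0 = int(rect.top + bbox[1] * rect.height())
    x1 = int(rect.left + bbox[2] * rect.width())
    y1 = int(rect.top + bbox[3] * rect.height())
    return x0, y0, x1, y1


if __name__ == "__main__":
    # A window that is 800 pixels wide and 600 pixels tall, placed at (100, 200).
    window = Rect(left=100, top=200, right=900, bottom=800)
    print(to_absolute([0.25, 0.5, 0.5, 0.75], window))  # (300, 500, 500, 650)
```

One consequence of the same arithmetic: when no window is supplied and the fallback rectangle is `RECT(0, 0, 0, 0)`, the width and height are zero, so every box maps to (0, 0, 0, 0).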

`predict(image_path, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640, api_name='/process')`

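Predict the grounding for the given image.

Parameters

image_path (str) – The path to the image.
box_threshold (float, default: 0.05) – The confidence threshold for bounding boxes.
iou_threshold (float, default: 0.1) – The intersection-over-union (IoU) threshold.
use_paddleocr (bool, default: True) – Whether to use paddleocr.
imgsz (int, default: 640) – The image size.
api_name (str, default: '/process') – The name of the API.

Returns

`List[Dict[str, Any]]` – The predicted grounding results, one dictionary per detected element. The list is empty if the image does not exist or the request fails.

Source code in automator/ui_control/grounding/omniparser.py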

def predict(
    self,
    image_path: str,
    box_threshold: float = 0.05,
    iou_threshold: float = 0.1,
    use_paddleocr: bool = True,
    imgsz: int = 640,
    api_name: str = "/process",
) -> List[Dict[str, Any]]:
    """
    Predict the grounding for the given image.
    :param image_path: The path to the image.
    :param box_threshold: The threshold for the bounding box.
    :param iou_threshold: The threshold for the intersection over union.
    :param use_paddleocr: Whether to use paddleocr.
    :param imgsz: The image size.
    :param api_name: The name of the API.
    :return: The predicted grounding results string.
    """

    list_of_grounding_results = []

    if not os.path.exists(image_path):
        print_with_color(
            f"Warning: The image path {image_path} does not exist.", "yellow"
        )
        return list_of_grounding_results

    try:
        results = self.service.chat_completion(
            image_path, box_threshold, iou_threshold, use_paddleocr, imgsz, api_name
        )
        grounding_results = results[1].splitlines()

    except Exception as e:
        print_with_color(
            f"Warning: Failed to get grounding results for Omniparser. Error: {e}",
            "yellow",
        )

        return list_of_grounding_results

    for item in grounding_results:
        try:
            item = json.loads(item)
            list_of_grounding_results.append(item)
        except json.JSONDecodeError:
            try:
                # the item string is a string converted from python's dict
                item = ast.literal_eval(item[item.index("{"):item.rindex("}") + 1])
                list_of_grounding_results.append(item)
            except Exception:
                pass

    return list_of_grounding_results
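
The loop at the end of `predict` tries strict JSON first and falls back to `ast.literal_eval` for lines that look like a printed Python dict. Below is a small self-contained sketch of that fallback. The function name `parse_grounding_lines` and the sample lines are hypothetical, chosen only to exercise each branch; they are not captured server output.

```python
import ast
import json
from typing import Any, Dict, List


def parse_grounding_lines(lines: List[str]) -> List[Dict[str, Any]]:
    """Parse each line as JSON, else as a Python dict literal; skip anything else."""
    parsed = []
    for line in lines:
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            try:
                # Keep only the "{...}" portion so a prefix such as "icon 1: " is ignored.
                literal = line[line.index("{"): line.rindex("}") + 1]
                parsed.append(ast.literal_eval(literal))
            except (ValueError, SyntaxError):
                # No braces found, or the braces do not enclose a valid literal.
                continue
    return parsed


if __name__ == "__main__":
    sample = [
        '{"type": "icon", "bbox": [0.1, 0.2, 0.3, 0.4], "interactivity": true, "content": "Save"}',
        "icon 1: {'type': 'text', 'bbox': [0.5, 0.5, 0.6, 0.6], 'interactivity': False, 'content': 'File'}",
        "not a record",
    ]
    for element in parse_grounding_lines(sample):
        print(element["type"], element["content"])
```

The second sample line is not valid JSON because of its prefix, single quotes, and the capitalized `False`, which is why the `ast.literal_eval` branch is needed. `ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text returned by a remote service.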