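Hybrid Detection

We also support hybrid control detection using both UIA and OmniParser-v2. This method detects standard controls through the UI Automation (UIA) framework, as well as custom controls that standard UIA methods may not recognize. The visually detected controls are merged with the UIA controls, and duplicates are removed based on IoU (intersection over union). Hybrid control detection is illustrated in the figure below.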
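
The IoU-based merge described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's implementation: the function names `iou` and `merge_controls`, the default threshold of 0.1, and the choice to keep the UIA control when two boxes overlap are all hypothetical.

```python
from typing import Dict, List

# A box uses the same keys as the control dictionaries in this page:
# x0, y0, x1, y1 in absolute pixel coordinates.
Box = Dict[str, int]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(a["x1"], b["x1"]) - max(a["x0"], b["x0"]))
    inter_h = max(0, min(a["y1"], b["y1"]) - max(a["y0"], b["y0"]))
    inter = inter_w * inter_h
    area_a = (a["x1"] - a["x0"]) * (a["y1"] - a["y0"])
    area_b = (b["x1"] - b["x0"]) * (b["y1"] - b["y0"])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_controls(
    uia: List[Box], visual: List[Box], threshold: float = 0.1
) -> List[Box]:
    """Keep every UIA control (assumption) and add only those visual
    controls whose IoU with every UIA control is at or below threshold."""
    merged = list(uia)
    for candidate in visual:
        if all(iou(candidate, kept) <= threshold for kept in uia):
            merged.append(candidate)
    return merged


uia_controls = [{"x0": 0, "y0": 0, "x1": 10, "y1": 10}]
visual_controls = [
    {"x0": 1, "y0": 1, "x1": 10, "y1": 10},    # overlaps the UIA control
    {"x0": 20, "y0": 20, "x1": 30, "y1": 30},  # a control UIA did not report
]
merged_controls = merge_controls(uia_controls, visual_controls)
print(len(merged_controls))  # 2
```

In this sketch the first visual box is dropped as a duplicate of the UIA control, and the second is kept because it overlaps nothing.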

Configuration

Before using hybrid control detection, you need to deploy and configure the OmniParser model. Refer to OmniParser Deployment for more details.

To activate hybrid control detection, set CONTROL_BACKEND to ["uia", "omniparser"] in the config_dev.yaml file.

CONTROL_BACKEND: ["uia", "omniparser"]

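Reference

The following class is used for visual control detection with OmniParser:

Bases: BasicGrounding

The OmniparserGrounding class is a subclass of BasicGrounding and represents the Omniparser grounding model.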

`parse_results(results, application_window=None)`

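Parse the grounding results into a list of control element information dictionaries.

Parameters:
`results` (`List[Dict[str, Any]]`) – The list of grounding result dictionaries from the grounding model.
`application_window` (`UIAWrapper`, default: `None`) – The application window used to compute absolute coordinates.

Returns:
`List[Dict[str, Any]]` – The list of control element information dictionaries. Each dictionary contains the following keys:
{
    "control_type": the control type of the element,
    "name": the name of the element,
    "x0": the absolute left coordinate of the bounding box (integer),
    "y0": the absolute top coordinate of the bounding box (integer),
    "x1": the absolute right coordinate of the bounding box (integer),
    "y1": the absolute bottom coordinate of the bounding box (integer),
}

Source code in automator/ui_control/grounding/omniparser.py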

def parse_results(
    self, results: List[Dict[str, Any]], application_window: UIAWrapper = None
) -> List[Dict[str, Any]]:
    """
    Parse the grounding results string into a list of control elements information dictionaries.
    :param results: The list of grounding results dictionaries from the grounding model.
    :param application_window: The application window to get the absolute coordinates.
    :return: The list of control elements information dictionaries, the dictionary should contain the following keys:
    {
        "control_type": The control type of the element,
        "name": The name of the element,
        "x0": The absolute left coordinate of the bounding box in integer,
        "y0": The absolute top coordinate of the bounding box in integer,
        "x1": The absolute right coordinate of the bounding box in integer,
        "y1": The absolute bottom coordinate of the bounding box in integer,
    }
    """

    control_elements_info = []

    if application_window is None:
        application_rect = RECT(0, 0, 0, 0)
    else:
        try:
            application_rect = application_window.rectangle()
        except Exception:
            application_rect = RECT(0, 0, 0, 0)

    for control_info in results:

        if not self._filter_interactivity and control_info.get(
            "interactivity", True
        ):
            continue

        application_left, application_top = (
            application_rect.left,
            application_rect.top,
        )

        control_box = control_info.get("bbox", [0, 0, 0, 0])

        control_left = int(
            application_left + control_box[0] * application_rect.width()
        )
        control_top = int(
            application_top + control_box[1] * application_rect.height()
        )
        control_right = int(
            application_left + control_box[2] * application_rect.width()
        )
        control_bottom = int(
            application_top + control_box[3] * application_rect.height()
        )

        control_elements_info.append(
            {
                "control_type": control_info.get("type", "Button"),
                "name": control_info.get("content", ""),
                "x0": control_left,
                "y0": control_top,
                "x1": control_right,
                "y1": control_bottom,
            }
        )

    return control_elements_info
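
The coordinate arithmetic in `parse_results` treats each `bbox` as four fractions of the window size and offsets them by the window's top-left corner. Below is a minimal worked example of that conversion. The `Rect` dataclass and the `to_absolute` helper are hypothetical stand-ins introduced here so the snippet runs without a real `UIAWrapper` window.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rect:
    """Hypothetical stand-in for a window rectangle in screen pixels."""

    left: int
    top: int
    right: int
    bottom: int

    def width(self) -> int:
        return self.right - self.left

    def height(self) -> int:
        return self.bottom - self.top


def to_absolute(bbox: List[float], window: Rect) -> Tuple[int, int, int, int]:
    """Map a normalized [x0, y0, x1, y1] box to absolute screen pixels."""
    x0 = int(window.left + bbox[0] * window.width())
    y0 = int(window.top + bbox[1] * window.height())
    x1 = int(window.left + bbox[2] * window.width())
    y1 = int(window.top + bbox[3] * window.height())
    return x0, y0, x1, y1


# An 800x600 window whose top-left corner sits at (100, 200) on screen.
window = Rect(left=100, top=200, right=900, bottom=800)
print(to_absolute([0.25, 0.5, 0.5, 0.75], window))  # (300, 500, 500, 650)
```

With an all-zero rectangle, which is what `parse_results` falls back to when no application window is available, the same arithmetic maps every box to (0, 0, 0, 0).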

`predict(image_path, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640, api_name='/process')`

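Predict the grounding for the given image.

Parameters:
`image_path` (`str`) – The path to the image.
`box_threshold` (`float`, default: `0.05`) – The threshold for the bounding box.
`iou_threshold` (`float`, default: `0.1`) – The threshold for the intersection over union.
`use_paddleocr` (`bool`, default: `True`) – Whether to use paddleocr.
`imgsz` (`int`, default: `640`) – The image size.
`api_name` (`str`, default: `'/process'`) – The name of the API.

Returns:
`List[Dict[str, Any]]` – The list of predicted grounding results, one dictionary per line of model output that could be parsed.

Source code in automator/ui_control/grounding/omniparser.py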

def predict(
    self,
    image_path: str,
    box_threshold: float = 0.05,
    iou_threshold: float = 0.1,
    use_paddleocr: bool = True,
    imgsz: int = 640,
    api_name: str = "/process",
) -> List[Dict[str, Any]]:
    """
    Predict the grounding for the given image.
    :param image_path: The path to the image.
    :param box_threshold: The threshold for the bounding box.
    :param iou_threshold: The threshold for the intersection over union.
    :param use_paddleocr: Whether to use paddleocr.
    :param imgsz: The image size.
    :param api_name: The name of the API.
    :return: The predicted grounding results string.
    """

    list_of_grounding_results = []

    if not os.path.exists(image_path):
        print_with_color(
            f"Warning: The image path {image_path} does not exist.", "yellow"
        )
        return list_of_grounding_results

    try:
        results = self.service.chat_completion(
            image_path, box_threshold, iou_threshold, use_paddleocr, imgsz, api_name
        )
        grounding_results = results[1].splitlines()

    except Exception as e:
        print_with_color(
            f"Warning: Failed to get grounding results for Omniparser. Error: {e}",
            "yellow",
        )

        return list_of_grounding_results

    for item in grounding_results:
        try:
            item = json.loads(item)
            list_of_grounding_results.append(item)
        except json.JSONDecodeError:
            try:
                # the item string is a string converted from python's dict
                item = ast.literal_eval(item[item.index("{"):item.rindex("}") + 1])
                list_of_grounding_results.append(item)
            except Exception:
                pass

    return list_of_grounding_results
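
The final loop of `predict` parses each output line as JSON first and falls back to `ast.literal_eval` on the `{...}` part when the line is a Python dict literal. The standalone sketch below isolates that fallback. The helper name `parse_lines` and the sample lines are hypothetical, and the real service output may be formatted differently.

```python
import ast
import json
from typing import Any, Dict, List


def parse_lines(lines: List[str]) -> List[Dict[str, Any]]:
    """Parse each line as JSON, falling back to a Python dict literal."""
    parsed = []
    for line in lines:
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            try:
                # Evaluate only the span from the first "{" to the last "}".
                item = ast.literal_eval(line[line.index("{"): line.rindex("}") + 1])
            except (ValueError, SyntaxError):
                continue  # skip lines that contain no parsable dictionary
        parsed.append(item)
    return parsed


sample_lines = [
    # Valid JSON: double quotes, lowercase true.
    '{"type": "icon", "bbox": [0.1, 0.2, 0.3, 0.4], "interactivity": true, "content": "Save"}',
    # Python repr with a prefix: single quotes, capitalized False.
    "icon 1: {'type': 'text', 'bbox': [0.5, 0.5, 0.6, 0.6], 'interactivity': False, 'content': 'File'}",
    # No dictionary at all: skipped.
    "not a result",
]
parsed_items = parse_lines(sample_lines)
print(len(parsed_items))  # 2
```

`ast.literal_eval` evaluates literals only, so it is a safer choice than `eval` for text returned by a remote service.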