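Visual Control Detection (OmniParser)

We also support visual control detection using OmniParser-v2. This method helps detect custom controls in an application that the standard UIA method may not recognize. Visual control detection uses computer vision techniques to identify and interact with UI elements based on their visual appearance.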

Deployment

On your remote GPU server, clone the OmniParser repository:

git clone https://github.com/microsoft/OmniParser.git

Start the omniparserserver service:

cd OmniParser/omnitool/omniparserserver
python gradio_demo.py

This will give you a short URL:

* Running on local URL:  http://0.0.0.0:7861
* Running on public URL: https://xxxxxxxxxxxxxxxxxx.gradio.live

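Note: If you have any questions about deploying OmniParser, please refer to the README in the OmniParser repository.

Configuration

After deploying the OmniParser model, you need to configure the OmniParser settings in the config.yaml file: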

OMNIPARSER: {
  ENDPOINT: "<YOUR_END_POINT>", # The endpoint for the omniparser deployment
  BOX_THRESHOLD: 0.05, # The box confidence threshold for the omniparser, default is 0.05
  IOU_THRESHOLD: 0.1, # The iou threshold for the omniparser, default is 0.1
  USE_PADDLEOCR: True, # Whether to use the paddleocr for the omniparser
  IMGSZ: 640 # The image size for the omniparser
}
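
Before starting a run, it can help to sanity-check these values. The following is a minimal sketch, not part of the documented API: `check_omniparser_config` is a hypothetical helper, and the [0, 1] range for the two thresholds is an assumption based on them being a confidence score and an IoU ratio. It takes the OMNIPARSER block as an already loaded dictionary.

```python
from typing import Any, Dict


def check_omniparser_config(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a normalized copy of the OMNIPARSER
    settings, or raise ValueError when a value looks wrong."""
    endpoint = str(cfg.get("ENDPOINT", "")).strip()
    # The placeholder from the template must be replaced with a real URL.
    if not endpoint or endpoint == "<YOUR_END_POINT>":
        raise ValueError("ENDPOINT must be set to the deployed OmniParser URL")

    # Defaults mirror the comments in the config block above.
    box = float(cfg.get("BOX_THRESHOLD", 0.05))
    iou = float(cfg.get("IOU_THRESHOLD", 0.1))
    for name, value in (("BOX_THRESHOLD", box), ("IOU_THRESHOLD", iou)):
        # Assumption: both thresholds are ratios in [0, 1].
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be within [0, 1], got {value}")

    imgsz = int(cfg.get("IMGSZ", 640))
    if imgsz <= 0:
        raise ValueError(f"IMGSZ must be positive, got {imgsz}")

    return {
        "ENDPOINT": endpoint,
        "BOX_THRESHOLD": box,
        "IOU_THRESHOLD": iou,
        "USE_PADDLEOCR": bool(cfg.get("USE_PADDLEOCR", True)),
        "IMGSZ": imgsz,
    }
```

Calling the helper with the untouched template value for ENDPOINT raises a ValueError, which surfaces a forgotten placeholder before any request is sent.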

To activate icon control filtering, you need to set CONTROL_BACKEND to ["omniparser"] in the config_dev.yaml file:

CONTROL_BACKEND: ["omniparser"]

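Reference

The following class is used for visual control detection with OmniParser.

Bases: BasicGrounding

The OmniparserGrounding class is a subclass of BasicGrounding and represents the Omniparser grounding model.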

parse_results(results, application_window=None)

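Parse the grounding results string into a list of control element information dictionaries.

Parameters
  • results (List[Dict[str, Any]]) –

    The list of grounding results dictionaries from the grounding model.

  • application_window (UIAWrapper, default: None ) –

    The application window used to compute the absolute coordinates.

Returns
  • List[Dict[str, Any]]

    The list of control element information dictionaries. Each dictionary should contain the following keys: { "control_type": the control type of the element, "name": the name of the element, "x0": the absolute left coordinate of the bounding box (integer), "y0": the absolute top coordinate of the bounding box (integer), "x1": the absolute right coordinate of the bounding box (integer), "y1": the absolute bottom coordinate of the bounding box (integer) }

Source code in automator/ui_control/grounding/omniparser.py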
def parse_results(
    self, results: List[Dict[str, Any]], application_window: UIAWrapper = None
) -> List[Dict[str, Any]]:
    """
    Parse the grounding results string into a list of control elements information dictionaries.
    :param results: The list of grounding results dictionaries from the grounding model.
    :param application_window: The application window to get the absolute coordinates.
    :return: The list of control elements information dictionaries, the dictionary should contain the following keys:
    {
        "control_type": The control type of the element,
        "name": The name of the element,
        "x0": The absolute left coordinate of the bounding box in integer,
        "y0": The absolute top coordinate of the bounding box in integer,
        "x1": The absolute right coordinate of the bounding box in integer,
        "y1": The absolute bottom coordinate of the bounding box in integer,
    }
    """

    control_elements_info = []

    if application_window is None:
        application_rect = RECT(0, 0, 0, 0)
    else:
        try:
            application_rect = application_window.rectangle()
        except Exception:
            application_rect = RECT(0, 0, 0, 0)

    for control_info in results:

        if not self._filter_interactivity and control_info.get(
            "interactivity", True
        ):
            continue

        application_left, application_top = (
            application_rect.left,
            application_rect.top,
        )

        control_box = control_info.get("bbox", [0, 0, 0, 0])

        control_left = int(
            application_left + control_box[0] * application_rect.width()
        )
        control_top = int(
            application_top + control_box[1] * application_rect.height()
        )
        control_right = int(
            application_left + control_box[2] * application_rect.width()
        )
        control_bottom = int(
            application_top + control_box[3] * application_rect.height()
        )

        control_elements_info.append(
            {
                "control_type": control_info.get("type", "Button"),
                "name": control_info.get("content", ""),
                "x0": control_left,
                "y0": control_top,
                "x1": control_right,
                "y1": control_bottom,
            }
        )

    return control_elements_info
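
The coordinate conversion in parse_results can be worked through on its own. The sketch below assumes that each bbox holds coordinates normalized to the window size, as the multiplication by the window width and height in the listing suggests. `WindowRect` and `to_absolute` are hypothetical names used only for this illustration; `WindowRect` stands in for the rectangle returned by `application_window.rectangle()`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class WindowRect:
    """Hypothetical stand-in for the application window rectangle."""
    left: int
    top: int
    right: int
    bottom: int

    def width(self) -> int:
        return self.right - self.left

    def height(self) -> int:
        return self.bottom - self.top


def to_absolute(bbox: List[float], rect: WindowRect) -> Tuple[int, int, int, int]:
    """Map a normalized [x0, y0, x1, y1] box to absolute screen coordinates."""
    x0 = int(rect.left + bbox[0] * rect.width())
    y0 = int(rect.top + bbox[1] * rect.height())
    x1 = int(rect.left + bbox[2] * rect.width())
    y1 = int(rect.top + bbox[3] * rect.height())
    return x0, y0, x1, y1


# A window at (100, 200) that is 800 pixels wide and 600 pixels high.
window = WindowRect(left=100, top=200, right=900, bottom=800)
print(to_absolute([0.25, 0.5, 0.5, 0.75], window))  # (300, 500, 500, 650)
```

With a zero-sized rectangle such as the (0, 0, 0, 0) fallback in the listing, every box collapses to (0, 0, 0, 0) under this arithmetic, so absolute coordinates are only meaningful when a valid application window is passed.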

predict(image_path, box_threshold=0.05, iou_threshold=0.1, use_paddleocr=True, imgsz=640, api_name='/process')

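Predict the grounding for the given image.

Parameters
  • image_path (str) –

    The path to the image.

  • box_threshold (float, default: 0.05 ) –

    The confidence threshold for the bounding boxes.

  • iou_threshold (float, default: 0.1 ) –

    The threshold for the intersection over union (IoU).

  • use_paddleocr (bool, default: True ) –

    Whether to use paddleocr.

  • imgsz (int, default: 640 ) –

    The image size.

  • api_name (str, default: '/process' ) –

    The name of the API.

Returns
  • List[Dict[str, Any]]

    The predicted grounding results, parsed into a list of dictionaries. The list is empty if the image does not exist or the request fails.

Source code in automator/ui_control/grounding/omniparser.py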
def predict(
    self,
    image_path: str,
    box_threshold: float = 0.05,
    iou_threshold: float = 0.1,
    use_paddleocr: bool = True,
    imgsz: int = 640,
    api_name: str = "/process",
) -> List[Dict[str, Any]]:
    """
    Predict the grounding for the given image.
    :param image_path: The path to the image.
    :param box_threshold: The threshold for the bounding box.
    :param iou_threshold: The threshold for the intersection over union.
    :param use_paddleocr: Whether to use paddleocr.
    :param imgsz: The image size.
    :param api_name: The name of the API.
    :return: The predicted grounding results string.
    """

    list_of_grounding_results = []

    if not os.path.exists(image_path):
        print_with_color(
            f"Warning: The image path {image_path} does not exist.", "yellow"
        )
        return list_of_grounding_results

    try:
        results = self.service.chat_completion(
            image_path, box_threshold, iou_threshold, use_paddleocr, imgsz, api_name
        )
        grounding_results = results[1].splitlines()

    except Exception as e:
        print_with_color(
            f"Warning: Failed to get grounding results for Omniparser. Error: {e}",
            "yellow",
        )

        return list_of_grounding_results

    for item in grounding_results:
        try:
            item = json.loads(item)
            list_of_grounding_results.append(item)
        except json.JSONDecodeError:
            try:
                # the item string is a string converted from python's dict
                item = ast.literal_eval(item[item.index("{"):item.rindex("}") + 1])
                list_of_grounding_results.append(item)
            except Exception:
                pass

    return list_of_grounding_results
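
The line-by-line parsing at the end of predict can be tried in isolation. This is a minimal sketch under stated assumptions: `parse_grounding_lines` is a hypothetical standalone function, and the sample lines (a label followed by a Python dict literal, or a plain JSON object) are assumed formats for illustration, not confirmed server output. Unlike the listing, the sketch also discards values that are not dictionaries.

```python
import ast
import json
from typing import Any, Dict, List


def parse_grounding_lines(raw: str) -> List[Dict[str, Any]]:
    """Parse one result per line: try JSON first, then a Python dict literal."""
    parsed: List[Dict[str, Any]] = []
    for line in raw.splitlines():
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            try:
                # Keep only the outermost {...} span, dropping any prefix text.
                item = ast.literal_eval(line[line.index("{"): line.rindex("}") + 1])
            except (ValueError, SyntaxError):
                # No braces, or not a valid literal: skip the line.
                continue
        if isinstance(item, dict):
            parsed.append(item)
    return parsed


sample = (
    "icon 0: {'type': 'text', 'bbox': [0.1, 0.2, 0.3, 0.4], "
    "'interactivity': False, 'content': 'File'}\n"
    '{"type": "icon", "bbox": [0.5, 0.5, 0.6, 0.6], '
    '"interactivity": true, "content": "Save"}\n'
    "not a result line"
)
results = parse_grounding_lines(sample)
print(len(results))  # 2
```

Using `ast.literal_eval` rather than `eval` for the fallback keeps the parser limited to literals, so a malformed or hostile line cannot execute code.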