Chat with PDF in Azure#


This is a simple flow that allows you to ask questions about the content of a PDF file and get answers. You can run the flow with the URL of a PDF file and a question as arguments. Once started, it downloads the PDF and builds an index of its content. Then, when you ask a question, it looks up the index to retrieve the relevant content and posts the question along with that content to an OpenAI chat model (gpt-3.5-turbo or gpt4) to get an answer.
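The retrieval step described above can be sketched as a minimal loop. This is illustrative only: `embed` here is a bag-of-words stand-in for the real embedding model the flow uses, and the chunks are made up.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, top_k=1):
    # Rank indexed chunks by similarity to the question, return the best ones.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

chunks = [
    "BERT is pre-trained on masked language modeling.",
    "The PDF is downloaded and split into chunks.",
]
context = retrieve("What is BERT pre-trained on?", chunks)
print(context[0])
```

The retrieved chunk(s) would then be inserted into the chat prompt alongside the question.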

0. Install dependencies#

%pip install -r requirements.txt

1. Connect to Azure Machine Learning workspace#

from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception:
    # Fall back to InteractiveBrowserCredential in case DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

1.1 Get familiar with the primary interface - PFClient#

import promptflow.azure as azure

# Get a handle to workspace
pf = azure.PFClient.from_config(credential=credential)

1.2 Create necessary connections#

Connections in prompt flow are used to manage settings of your application behaviors, including how to talk to different services (such as Azure OpenAI).

Follow this instruction to prepare your Azure OpenAI resource, and get your api_key if you don't have one.

Please go to the workspace portal, click Prompt flow -> Connections -> Create, then follow the instructions to create your own connections. Learn more about connections.

conn_name = "open_ai_connection"

# TODO integrate with azure.ai sdk
# currently we only support creating connections in the Azure ML Studio UI
# raise Exception(f"Please create {conn_name} connection in Azure ML Studio.")

2. Run the flow with settings (context size 2K)#

flow_path = "."
data_path = "./data/bert-paper-qna-3-line.jsonl"

config_2k_context = {
    "EMBEDDING_MODEL_DEPLOYMENT_NAME": "text-embedding-ada-002",
    "CHAT_MODEL_DEPLOYMENT_NAME": "gpt-35-turbo",
    "PROMPT_TOKEN_LIMIT": 2000,
    "MAX_COMPLETION_TOKENS": 256,
    "VERBOSE": True,
    "CHUNK_SIZE": 1024,
    "CHUNK_OVERLAP": 32,
}
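`CHUNK_SIZE` and `CHUNK_OVERLAP` control how the downloaded PDF text is split before indexing. A minimal sketch of overlapping chunking follows; this helper is illustrative, not the flow's actual implementation:

```python
def split_text(text, chunk_size=1024, chunk_overlap=32):
    # Slide a window of chunk_size characters, stepping forward by
    # (chunk_size - chunk_overlap) so neighboring chunks share context
    # across the boundary.
    step = chunk_size - chunk_overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("a" * 2100, chunk_size=1024, chunk_overlap=32)
print(len(chunks), len(chunks[0]))
```

A larger `CHUNK_SIZE` gives the model more context per retrieved chunk, while the overlap avoids cutting a relevant sentence in half at a chunk boundary.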

column_mapping = {
    "question": "${data.question}",
    "pdf_url": "${data.pdf_url}",
    "chat_history": "${data.chat_history}",
    "config": config_2k_context,
}

run_2k_context = pf.run(
    flow=flow_path,
    data=data_path,
    column_mapping=column_mapping,
    display_name="chat_with_pdf_2k_context",
    tags={"chat_with_pdf": "", "1st_round": ""},
)
pf.stream(run_2k_context)
print(run_2k_context)
detail = pf.get_details(run_2k_context)

detail

3. Evaluate "groundedness"#

The eval-groundedness flow uses a ChatGPT/GPT4 model to grade the answers generated by the chat-with-pdf flow.
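A grading flow of this kind typically prompts the model to emit a numeric score, which then has to be parsed out of free-form text. The sketch below shows one way to do that; the 1-5 scale and the reply formats are assumptions, not the actual eval-groundedness prompt:

```python
import re

def parse_score(model_output, low=1, high=5):
    # Pull the first integer out of the model's reply and clamp it to the scale.
    match = re.search(r"\d+", model_output)
    if not match:
        return None
    return max(low, min(high, int(match.group())))

print(parse_score("The answer is fully grounded in the context. Score: 5"))
```

Clamping guards against the model occasionally replying with an out-of-range number.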

eval_groundedness_flow_path = "../../evaluation/eval-groundedness/"
eval_groundedness_2k_context = pf.run(
    flow=eval_groundedness_flow_path,
    run=run_2k_context,
    column_mapping={
        "question": "${run.inputs.question}",
        "answer": "${run.outputs.answer}",
        "context": "${run.outputs.context}",
    },
    display_name="eval_groundedness_2k_context",
)
pf.stream(eval_groundedness_2k_context)

print(eval_groundedness_2k_context)

4. Try a different configuration and evaluate again - experiment#

flow_path = "."
data_path = "./data/bert-paper-qna-3-line.jsonl"

config_3k_context = {
    "EMBEDDING_MODEL_DEPLOYMENT_NAME": "text-embedding-ada-002",
    "CHAT_MODEL_DEPLOYMENT_NAME": "gpt-35-turbo",
    "PROMPT_TOKEN_LIMIT": 3000,  # different from 2k context
    "MAX_COMPLETION_TOKENS": 256,
    "VERBOSE": True,
    "CHUNK_SIZE": 1024,
    "CHUNK_OVERLAP": 32,
}

column_mapping = {
    "question": "${data.question}",
    "pdf_url": "${data.pdf_url}",
    "chat_history": "${data.chat_history}",
    "config": config_3k_context,
}
run_3k_context = pf.run(
    flow=flow_path,
    data=data_path,
    column_mapping=column_mapping,
    display_name="chat_with_pdf_3k_context",
    tags={"chat_with_pdf": "", "2nd_round": ""},
)
pf.stream(run_3k_context)
print(run_3k_context)
detail = pf.get_details(run_3k_context)

detail
eval_groundedness_3k_context = pf.run(
    flow=eval_groundedness_flow_path,
    run=run_3k_context,
    column_mapping={
        "question": "${run.inputs.question}",
        "answer": "${run.outputs.answer}",
        "context": "${run.outputs.context}",
    },
    display_name="eval_groundedness_3k_context",
)
pf.stream(eval_groundedness_3k_context)

print(eval_groundedness_3k_context)
pf.get_details(eval_groundedness_3k_context)
pf.visualize([eval_groundedness_2k_context, eval_groundedness_3k_context])
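Beyond the side-by-side visualization, you can compare the two variants numerically from their details. The sketch below assumes the eval flow emits a per-line `outputs.groundedness` score (the column name is an assumption) and uses list-of-dict rows as a stand-in for the DataFrame returned by `pf.get_details`:

```python
from statistics import mean

def mean_score(rows, key="outputs.groundedness"):
    # Average the per-line scores from a run's details.
    return mean(float(r[key]) for r in rows if r.get(key) is not None)

# Dummy stand-ins for pf.get_details(...) results
detail_2k = [{"outputs.groundedness": 4}, {"outputs.groundedness": 5}, {"outputs.groundedness": 3}]
detail_3k = [{"outputs.groundedness": 5}, {"outputs.groundedness": 5}, {"outputs.groundedness": 4}]
print(mean_score(detail_2k), mean_score(detail_3k))
```

Comparing the averages gives a quick signal on whether the larger 3K prompt context actually improves groundedness on this dataset.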