如何使用 OpenAI 和 ChromaDB 构建 PDF QA 聊天机器人

　　RAG、矢量数据库和 OCR 简介

　　在我们深入研究代码之前，让我们揭穿我们将要实现️的内容首先，OCR(光学字符识别)是计算机视觉领域的一种技术，可以识别文档中存在的字符并将其转换为文本 – 这在文档中的表格和图表的情况下特别有用在本教程中，我们将使用 Azure 认知服务提供的 OCR。

　　使用 OCR 提取文本块后，它们将使用 Word2Vec、FastText 或 BERT 等嵌入模型转换为高维向量(也称为矢量化)。这些向量封装了文本的语义含义，然后在向量数据库中建立索引。我们将使用 ChromaDB 作为内存中向量数据库

　　现在，让我们看看当用户询问他们的 PDF 时会发生什么。首先，首先使用用于矢量化提取的 PDF 文本块的相同嵌入模型对用户查询进行矢量化。然后，通过搜索向量数据库来获取语义上最相似的前 K 个文本块，该数据库记住，其中包含来自 PDF 的文本块。然后，将检索到的文本块作为上下文提供给 ChatGPT，以便根据其 PDF 中的信息生成答案。这是检索、增强、生成 (RAG) 的过程。

　　项目设置

　　首先，我将指导您如何设置项目文件夹和需要安装的任何依赖项。

　　通过运行以下命令创建项目文件夹和 python 虚拟环境：

　　mkdir chat-with-pdf

　　cd chat-with-pdf

　　python3 -m venv venv

　　source venv/bin/activate

　　您的终端现在应该启动如下内容：

　　(venv)

　　安装依赖项

　　运行以下命令以安装 OpenAI API、ChromaDB 和 Azure：

　　pip install openai chromadb azure-ai-formrecognizer streamlit tabulate

　　让我们简要回顾一下每个包的作用：

　　streamlit – 设置聊天 UI，其中包括一个 PDF 上传器(感谢上帝)

　　azure-ai-formrecognizer – 使用 OCR 从 PDF 中提取文本内容

　　chromadb – 是一个内存中的矢量数据库，用于存储提取的 PDF 内容

　　OpenAI – 我们都知道这是做什么的(从 Chromadb 接收相关数据，并根据您的聊天机器人输入返回响应)

　　接下来，创建一个新的 main.py 文件 – 应用程序的入口点

　　touch main.py

　　获取 API 密钥

　　最后，准备好 OpenAI 和 Azure API 密钥(如果还没有，请单击超链接获取它们)

　　注意：在 Azure 认知服务上注册帐户非常麻烦。你需要一张卡(虽然他们不会自动向你收费)和电话号码，但如果你想做一些严肃的事情，一定要试一试!

　　使用 Streamlit 构建聊天机器人 UI

　　Streamlit 是一种使用 python 构建前端应用程序的简单方法。让我们导入 streamlit 以及设置我们需要的所有其他内容：

　　import streamlit as st

　　from azure.ai.formrecognizer import DocumentAnalysisClient

　　from azure.core.credentials import AzureKeyCredential

　　from tabulate import tabulate

　　from chromadb.utils import embedding_functions

　　import chromadb

　　import openai

　　# You’ll need this client later to store PDF data

　　client = chromadb.Client()

　　client.heartbeat()

　　为我们的聊天 UI 指定一个标题并创建一个文件上传器：

　　…

　　st.write(“#Chat with PDF”)

　　uploaded_file = st.file_uploader(“Choose a PDF file”, type=”pdf”)

　　…

　　侦听“uploaded_file”中的更改事件。当您上传文件时，将触发此操作：

　　…

　　if uploaded_file is not None:

　　# Create a temporary file to write the bytes to

　　with open(“temp_pdf_file.pdf”, “wb”) as temp_file:

　　temp_file.write(uploaded_file.read())

　　…

　　通过运行“main.py”查看 streamlit 应用(我们稍后将实现聊天输入 UI)：

　　streamlit run main.py

　　这是完成的简单部分!接下来是不那么容易的部分……

　　从 PDF 中提取文本

　　从前面的代码片段开始，我们将向 Azure 认知服务发送“temp_file”以进行 OCR：

　　…

　　# you can set this up in the azure cognitive services portal

　　AZURE_COGNITIVE_ENDPOINT = “your-custom-azure-api-endpoint”

　　AZURE_API_KEY = “your-azure-api-key”

　　credential = AzureKeyCredential(AZURE_API_KEY)

　　AZURE_DOCUMENT_ANALYSIS_CLIENT = DocumentAnalysisClient(AZURE_COGNITIVE_ENDPOINT, credential)

　　# Open the temporary file in binary read mode and pass it to Azure

　　with open(“temp_pdf_file.pdf”, “rb”) as f:

　　poller = AZURE_DOCUMENT_ANALYSIS_CLIENT.begin_analyze_document(“prebuilt-document”, document=f)

　　doc_info = poller.result().to_dict()

　　…

　　在这里，“dict_info”是一个字典，其中包含有关提取的文本块的信息。这是一本非常复杂的词典，所以我建议把它打印出来，亲眼看看它是什么样子的。

　　粘贴以下内容以完成对从 Azure 接收的数据的处理：

　　…

　　res = []

　　CONTENT = “content”

　　PAGE_NUMBER = “page_number”

　　TYPE = “type”

　　RAW_CONTENT = “raw_content”

　　TABLE_CONTENT = “table_content”

　　for p in doc_info[‘pages’]:

　　dict = {}

　　page_content = ” “.join([line[“content”] for line in p[“lines”]])

　　dict[CONTENT] = str(page_content)

　　dict[PAGE_NUMBER] = str(p[“page_number”])

　　dict[TYPE] = RAW_CONTENT

　　res.append(dict)

　　for table in doc_info[“tables”]:

　　dict = {}

　　dict[PAGE_NUMBER] = str(table[“bounding_regions”][0][“page_number”])

　　col_headers = []

　　cells = table[“cells”]

　　for cell in cells:

　　if cell[“kind”] == “columnHeader” and cell[“column_span”] == 1:

　　for _ in range(cell[“column_span”]):

　　col_headers.append(cell[“content”])

　　data_rows = [[] for _ in range(table[“row_count”])]

　　for cell in cells:

　　if cell[“kind”] == “content”:

　　for _ in range(cell[“column_span”]):

　　data_rows[cell[“row_index”]].append(cell[“content”])

　　data_rows = [row for row in data_rows if len(row) > 0]

　　markdown_table = tabulate(data_rows, headers=col_headers, tablefmt=”pipe”)

　　dict[CONTENT] = markdown_table

　　dict[TYPE] = TABLE_CONTENT

　　res.append(dict)

　　…

　　在这里，我们访问了 Azure 返回的字典的各种属性，以获取页面上的文本和存储在表中的数据。由于所有嵌套结构，逻辑非常复杂，但从个人经验来看，Azure OCR 即使对于复杂的 PDF 结构也能很好地工作，因此我强烈建议您尝试一下:)

　　在 ChromaDB 中存储 PDF 内容

　　还在我身边吗? 太好了，我们快到了，所以坚持下去!

　　粘贴下面的代码，将从 ‘res’ 中提取的文本块存储在 ChromaDB 中。

　　…

　　try:

　　client.delete_collection(name=”my_collection”)

　　st.session_state.messages = []

　　except:

　　print(“Hopefully you’ll never see this error.”)

　　openai_ef = embedding_functions.OpenAIEmbeddingFunction(api_key=”your-openai-api-key”, model_name=”text-embedding-ada-002″)

　　collection = client.create_collection(name=”my_collection”, embedding_function=openai_ef)

　　data = []

　　id = 1

　　for dict in res:

　　content = dict.get(CONTENT, ”)

　　page_number = dict.get(PAGE_NUMBER, ”)

　　type_of_content = dict.get(TYPE, ”)

　　content_metadata = {

　　PAGE_NUMBER: page_number,

　　TYPE: type_of_content

　　}

　　collection.add(

　　documents=[content],

　　metadatas=[content_metadata],

　　ids=[str(id)]

　　)

　　id += 1

　　…

　　第一个尝试块确保我们可以继续上传 PDF，而无需刷新页面。

　　您可能已经注意到，我们将数据添加到集合中，而不是直接添加到数据库中。ChromaDB 中的集合是一个向量空间。当用户输入查询时，它会在此集合内执行搜索，而不是在整个数据库中执行搜索。在 Chroma 中，这个集合由一个唯一的名称标识，通过一行简单的代码，你可以通过 ‘collection.add(…) 将所有提取的文本块添加到这个集合中`

　　使用 OpenAI 生成响应

　　我经常被问到如何在不依赖 langchain 和 lLamaIndex 等框架的情况下构建 RAG 聊天机器人。以下是你如何做到的 – 你根据从向量数据库中检索到的结果动态地构造一个提示列表。

　　粘贴以下代码以总结内容：

　　…

　　if “messages” not in st.session_state:

　　st.session_state.messages = []

　　# Display chat messages from history on app rerun

　　for message in st.session_state.messages:

　　with st.chat_message(message[“role”]):

　　st.markdown(message[“content”])

　　if prompt := st.chat_input(“What do you want to say to your PDF?”):

　　# Display your message

　　with st.chat_message(“user”):

　　st.markdown(prompt)

　　# Add your message to chat history

　　st.session_state.messages.append({“role”: “user”, “content”: prompt})

　　# query ChromaDB based on your prompt, taking the top 5 most relevant result. These results are ordered by similarity.

　　q = collection.query(

　　query_texts=[prompt],

　　n_results=5,

　　)

　　results = q[“documents”][0]

　　prompts = []

　　for r in results:

　　# construct prompts based on the retrieved text chunks in results

　　prompt = “Please extract the following: ” + prompt + ” solely based on the text below. Use an unbiased and journalistic tone. If you’re unsure of the answer, say you cannot find the answer. \n\n” + r

　　prompts.append(prompt)

　　prompts.reverse()

　　openai_res = openai.ChatCompletion.create(

　　model=”gpt-4″,

　　messages=[{“role”: “assistant”, “content”: prompt} for prompt in prompts],

　　temperature=0,

　　)

　　response = openai_res[“choices”][0][“message”][“content”]

　　with st.chat_message(“assistant”):

　　st.markdown(response)

　　# append the response to chat history

　　st.session_state.messages.append({“role”: “assistant”, “content”: response})

　　请注意，在根据从 ChromaDB 检索到的文本块列表构造提示列表后，我们如何反转“提示”。这是因为从 ChromaDB 返回的结果按降序排序，这意味着最相关的文本块将始终是结果列表中的第一个。然而，ChatGPT 的工作方式是它更多地考虑提示列表中的最后一个提示，因此我们必须反转它。

　　运行 streamlit 应用程序并亲自尝试一下：

　　streamlit run main.py

　　恭喜你，你做到了最后!

　　更进一步

　　如您所知，LLM 应用程序是一个黑匣子，因此对于生产用例，您需要保护 PDF 聊天机器人的性能，以保持用户满意。

　　结论

　　在本文中，你了解了：

　　什么是向量数据库如何使用 ChromaDB

　　如何使用原始 OpenAI API 在不依赖第三方框架的情况下构建基于 RAG 的聊天机器人

　　什么是 OCR 以及如何使用 Azure 的 OCR 服务

　　如何使用 streamlit 快速设置漂亮的聊天机器人 UI，其中包括一个文件上传器。

　　本教程演示了一个示例，说明如何仅使用 Azure OCR、OpenAI 和 ChromaDB 构建“使用 PDF 聊天”应用程序。利用你所学到的知识，你可以构建强大的应用程序，帮助提高员工的工作效率(至少这是我遇到的最突出的用例)。