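Generative Documentation Q&A with Weaviate and Haystack

A guide to building a retrieval-augmented generation pipeline for searching reference documentation.

September 2, 2023

You can use this Colab as a working example of the application described in this article.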

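Retrieval-augmented generation (RAG) is the star of recent LLM applications. The idea behind it is simple: LLMs don't know about the whole world, let alone your particular world. But with retrieval techniques, we can hand an LLM the most useful pieces of information so that it has the context to answer queries it was never trained on and otherwise couldn't answer.

This technique now powers many search systems. In this article, we'll show how to build one using the open-source LLM framework Haystack and the vector database Weaviate. Our final pipeline will answer queries about Haystack and provide references to the documentation pages that contain the answer.

A few weeks ago, a colleague and I built a custom component for Haystack: ReadmeDocsFetcher. Haystack is designed around small units called components. The idea behind the framework is to provide simple building blocks that let you create your own custom components, beyond those provided by the Haystack project itself. The Haystack documentation is hosted on ReadMe, so we designed this node to fetch the requested documentation pages from ReadMe and process them in a way that can be used in a full LLM pipeline.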

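Indexing Pipeline

Now we can start building our pipelines. First, we create an indexing pipeline that writes all the requested documentation pages from https://docs.haystack.com.cn into our Weaviate database. The nice thing about indexing pipelines is that they can be reused. If there are any new pages, we can push them through the indexing pipeline to keep the database behind our RAG pipeline up to date.

For this indexing pipeline, we use the custom-built ReadmeDocsFetcher. Eventually we'll do embedding retrieval to fetch the most relevant documents from the database, so for this demo we use a sentence-transformers model to create vector representations of the documents.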
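To make embedding retrieval a little more concrete, here is a minimal sketch in plain Python. It is not how Haystack or Weaviate implement retrieval: the three-dimensional vectors are hand-written stand-ins for real embeddings, the document ids are hypothetical, and dot product is assumed as the similarity measure.

```python
from typing import Dict, List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query_vec: List[float],
    doc_vecs: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, dot(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy "embeddings" (hypothetical values, not produced by any model).
doc_vecs = {
    "installation": [0.9, 0.1, 0.0],
    "retrievers": [0.1, 0.8, 0.3],
    "pipelines": [0.2, 0.3, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of an installation question

results = retrieve(query_vec, doc_vecs, top_k=2)
print(results)
```

A real embedding model produces vectors with hundreds of dimensions, and a vector database avoids scoring every document one by one, but the ranking idea is the same.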

Weaviate has a handy feature called Weaviate Embedded, which we can use here. It lets us run a Weaviate database inside Colab.

import weaviate  
from weaviate.embedded import EmbeddedOptions  
from haystack.document_stores import WeaviateDocumentStore  
  
# Start an embedded Weaviate instance from this notebook  
client = weaviate.Client(  
  embedded_options=EmbeddedOptions()  
)  
  
# Point Haystack's document store at that instance (port 6666 in this demo)  
document_store = WeaviateDocumentStore(port=6666)

With that in place, we initialize all the components needed for the final indexing pipeline.

from readmedocs_fetcher_haystack import ReadmeDocsFetcher  
from haystack.nodes import EmbeddingRetriever, MarkdownConverter, PreProcessor  
  
# readme_api_key is assumed to hold your ReadMe API key, defined earlier in the notebook  
# Keep code snippets when converting the fetched Markdown pages  
converter = MarkdownConverter(remove_code_snippets=False)  
readme_fetcher = ReadmeDocsFetcher(api_key=readme_api_key,   
                                   markdown_converter=converter,   
                                   base_url="https://docs.haystack.com.cn")  
# The same retriever creates document embeddings here and is reused for querying later  
embedder = EmbeddingRetriever(document_store=document_store,   
                              embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1")  
preprocessor = PreProcessor()

Then we simply build and run the pipeline. It preprocesses and creates embeddings for all the documentation pages under https://docs.haystack.com.cn.

from haystack import Pipeline  
  
indexing_pipeline = Pipeline()  
indexing_pipeline.add_node(component=readme_fetcher, name="ReadmeFetcher", inputs=["File"])  
indexing_pipeline.add_node(component=preprocessor, name="Preprocessor", inputs=["ReadmeFetcher"])  
indexing_pipeline.add_node(component=embedder, name="Embedder", inputs=["Preprocessor"])  
indexing_pipeline.add_node(component=document_store, name="DocumentStore", inputs=["Embedder"])  
indexing_pipeline.run()
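The preprocessing step is what turns long documentation pages into smaller pieces before they are embedded. As a rough illustration only, the sketch below splits text into word-based chunks with an optional overlap. It is a simplified stand-in, not Haystack's PreProcessor implementation; the function `split_words` and the chunk sizes are hypothetical.

```python
from typing import List


def split_words(text: str, split_length: int = 5, split_overlap: int = 0) -> List[str]:
    """Split text into chunks of at most split_length words.

    Consecutive chunks share split_overlap words.
    """
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current chunk has reached the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


text = "Haystack pipelines are built from small components that pass documents along"
chunks = split_words(text, split_length=5, split_overlap=1)
for chunk in chunks:
    print(chunk)
```

Smaller chunks tend to give the retriever more focused passages to match against, at the cost of more entries in the database.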

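Retrieval-Augmented Generation (RAG) Pipeline

Before we dive into the RAG pipeline itself, I'd like to discuss two of its key building blocks separately: the prompt and the choice of LLM.

As mentioned above, my goal here is to build a pipeline that can reference the documentation pages an answer comes from. In particular, I want to get a URL that I can click to learn more. Whether a RAG pipeline can do this depends largely on the instructions given to the LLM. It also depends on whether the LLM itself is designed to understand such instructions.

Here's where I splurged, so to speak. While you can use open-source LLMs with Haystack (from Hugging Face, hosted on SageMaker, deployed locally, the choice is really yours), I went with GPT-4. One of the main reasons is that, in my experience, GPT-4 performs best with the kind of prompt (instruction) I intended to use for this application. That said, if you think otherwise, please let me know 🙏

The Prompt

Here is the prompt we use in this demo. It asks that each retrieved document be followed by the URL it came from. Each document's URL lives in the metadata of the documents we wrote to the WeaviateDocumentStore 👇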

You will be provided some content from technical documentation,   
where each paragraph is followed by the URL that it appears in.   
Answer the query based on the provided Documentation Content. Your answer   
should reference the URLs that it was generated from.   
Documentation Content: {join(documents,   
                             delimiter=new_line,   
                             pattern='---'+new_line+'$content'+new_line+'URL: $url',   
                             str_replace={new_line: ' ', '[': '(', ']': ')'})}  
Query: {query}  
Answer:

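Notice how we construct the prompt so that the documents (which the retriever will supply once we add this to a pipeline) are separated from each other, and the content is always followed by the URL it came from. We can do this because every document we wrote to the database has a url in its metadata.

We create a PromptTemplate called **answer_with_references_prompt** using the prompt above.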

from haystack.nodes import PromptTemplate, AnswerParser  
  
answer_with_references_prompt = PromptTemplate(prompt = """You will be provided some content from technical documentation, where each paragraph is followed by the URL that it appears in. Answer the query based on the provided Documentation Content. Your answer should reference the URLs that it was generated from. Documentation Content: {join(documents, delimiter=new_line, pattern='---'+new_line+'$content'+new_line+'URL: $url', str_replace={new_line: ' ', '[': '(', ']': ')'})}\nQuery: {query}\nAnswer:""", output_parser=AnswerParser())
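To see what the `join(...)` expression in the template roughly produces, here is a small plain-Python approximation. It is not Haystack's implementation: `render_documents` is a hypothetical helper, the example.com URLs are placeholders, and it assumes the `str_replace` substitutions are applied to each document's content only.

```python
from string import Template
from typing import Dict, List

NEW_LINE = "\n"


def render_documents(
    documents: List[Dict[str, str]],
    pattern: str,
    str_replace: Dict[str, str],
    delimiter: str = NEW_LINE,
) -> str:
    """Fill the pattern once per document and join the results."""
    template = Template(pattern)
    rendered = []
    for doc in documents:
        content = doc["content"]
        # Clean the content first (assumption: only content is rewritten).
        for old, new in str_replace.items():
            content = content.replace(old, new)
        rendered.append(template.substitute(content=content, url=doc["url"]))
    return delimiter.join(rendered)


documents = [
    {
        "content": "Install with\npip install farm-haystack[all]",
        "url": "https://example.com/docs/installation",
    },
    {
        "content": "Extras are optional.",
        "url": "https://example.com/docs/extras",
    },
]

block = render_documents(
    documents,
    pattern="---" + NEW_LINE + "$content" + NEW_LINE + "URL: $url",
    str_replace={NEW_LINE: " ", "[": "(", "]": ")"},
)
print(block)
```

Under these assumptions, each document collapses to a single line followed by its URL, and square brackets in the content turn into parentheses before the text ever reaches the LLM.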

You can explore other example prompts we use, including a similar one for referencing, on PromptHub.

Pipeline

Now it's just a matter of putting it all together. First, we define a PromptNode, which is the interface for interacting with the LLM.

from haystack.nodes import PromptNode  
prompt_node = PromptNode(model_name_or_path="gpt-4",   
                        api_key='YOUR_OPENAI_KEY',   
                        default_prompt_template=answer_with_references_prompt,   
                        max_length=500)

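Finally, we create our pipeline. It has two components. First, we reuse the same EmbeddingRetriever as before to retrieve relevant documents from the database. Second, we use the PromptNode to generate an answer based on those retrieved documents.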

pipeline = Pipeline()  
pipeline.add_node(component = embedder, name = "Retriever", inputs = ["Query"])  
pipeline.add_node(component = prompt_node, name = "GPT-4", inputs=["Retriever"])

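When we run it, this final pipeline will retrieve some relevant documents, add them to the prompt we created, and then send the resulting full prompt to GPT-4 for an answer.

For example: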

pipeline.run("What are the optional installations of Haystack?", params = {"Retriever": {"top_k": 5}})

This results in the following full prompt (excerpt):

You will be provided some content from technical documentation, where each paragraph is followed   
by the URL that it appears in. Answer the query based on the provided Documentation Content.   
Your answer should reference the URLs that it was generated from.   
Documentation Content:   
---  
", "1-0": "all-gpu", "1-1": "Install Haystack and all optional dependencies, including developer   
tools with GPU support.", "2-0": "aws", "2-1": "Install Haystack and AWS SageMaker PromptNode support",   
"3-0": "colab", "3-1": "Install Haystack and all dependencies needed to run Haystack in Google Colab.",   
"4-0": "crawler", "4-1": "Install Haystack and all web crawling tools.",  
URL: https://docs.haystack.com.cn/docs/installation  
---  
[...]  
URL: https://docs.haystack.com.cn/docs/extras  
---  
[...]  
Query: What are the optional installations of Haystack?  
Answer:

When we run this, the reply we get is the following:

Haystack offers several optional installations to support different use cases. These include:  
1. "all-gpu" to install Haystack with all optional dependencies, including developer tools with GPU support.  
2. "aws" for AWS SageMaker PromptNode support.  
3. "colab" to install all dependencies needed to run Haystack in Google Colab.  
4. "crawler" for web crawling tools.  
5. "dev" for all development tools needed by contributors.  
6. "ocr" for all OCR tools.  
7. "docstores" for all DocumentStores.  
8. "docstores-gpu" for all DocumentStores with GPU support.  
9. "faiss" for FAISS support for the FAISSDocumentStore.  
10. "faiss-gpu" for FAISS support for the FAISSDocumentStore with GPU.  
11. "inference" for all dependencies for model inference on local machine, including torch.  
12. "opensearch" for OpenSearch support for the OpenSearchDocumentStore.  
13. "elasticsearch", "elasticsearch7", and "elasticsearch8" for Elasticsearch support for the ElasticsearchDocumentStore.  
14. "pdf" for PyMuPDF for PDFToTextConverter.  
15. "pinecone" for Pinecone support for the PineconeDocumentStore.  
16. "preprocessing" for the basic preprocessing tools, such as langdetect for language identification and nltk for precise document splitting.  
17. "file-conversion" for all the dependencies for file conversion and parsing, like python-docx, tika, markdown.  
18. "ray" for Ray support for the RayPipeline.  
19. "weaviate" for Weaviate support for the WeaviateDocumentStore.  
  
Additionally, there are other components like "AnswerToSpeech" and "DocumentToSpeech" which are not installed with Haystack core but can be installed separately.  
  
For the latest release of Haystack and all its dependencies, you can use the command 'pip install farm-haystack(all)' or 'pip install farm-haystack(all-gpu)' for GPU-enabled dependencies.  
  
URLs:   
- https://docs.haystack.com.cn/docs/installation  
- https://docs.haystack.com.cn/docs/extras
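Since the whole point is to get clickable references, it can be useful to pull the URLs out of the reply and check that each one really belongs to a retrieved document. The sketch below is one possible way to do that in plain Python; `extract_urls` and `unsupported_urls` are hypothetical helpers and are not part of Haystack.

```python
import re
from typing import List, Set

# Match http(s) URLs up to whitespace or a closing bracket or quote.
URL_PATTERN = re.compile(r'https?://[^\s)\]>"]+')


def extract_urls(answer: str) -> List[str]:
    """Return the distinct URLs in the answer, in order of appearance."""
    seen: Set[str] = set()
    urls: List[str] = []
    for match in URL_PATTERN.findall(answer):
        url = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def unsupported_urls(answer: str, retrieved_urls: Set[str]) -> List[str]:
    """URLs cited in the answer that no retrieved document carries."""
    return [url for url in extract_urls(answer) if url not in retrieved_urls]


answer = (
    "Haystack offers several optional installations.\n\n"
    "URLs:\n"
    "- https://docs.haystack.com.cn/docs/installation\n"
    "- https://docs.haystack.com.cn/docs/extras"
)
retrieved = {
    "https://docs.haystack.com.cn/docs/installation",
    "https://docs.haystack.com.cn/docs/extras",
}

print(extract_urls(answer))
print(unsupported_urls(answer, retrieved))
```

An empty list from `unsupported_urls` means every cited URL came from the retrieved documents; anything else is a reference the model may have made up.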

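Further Improvements

So far, we've only used a single retrieval technique. This could be improved significantly with a hybrid retrieval approach, which you can also achieve with Weaviate and Haystack. I think it would give us a more well-rounded system designed specifically for documentation search. While in this setup I can ask full, sentence-style questions, I might also want to let users simply search for "EmbeddingRetrievers".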
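As a rough idea of what hybrid retrieval involves, the sketch below merges a keyword ranking and an embedding ranking with reciprocal rank fusion. This is one common fusion method and is swapped in purely for illustration; it is not claimed to be how Weaviate or Haystack combine results. The document ids and both rankings are made up.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores the sum of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the keyword-style query "EmbeddingRetrievers".
keyword_ranking = ["embedding-retriever", "retrievers", "installation"]
embedding_ranking = ["retrievers", "pipelines", "embedding-retriever"]

fused = reciprocal_rank_fusion([keyword_ranking, embedding_ranking])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the behaviour we would want when a user types a bare component name instead of a full question.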

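In this article, we've seen how to build a simple RAG setup that uses a clever prompt to get replies with references to documentation. To learn more about the available pipelines and components that can help you build custom LLM applications, check out the Haystack documentation.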