Chroma Indexing and RAG Example


Last updated: July 8, 2025

Install dependencies

# Install the Chroma integration, Haystack will come as a dependency
!pip install -U chroma-haystack "huggingface_hub>=0.22.0"

Indexing pipeline: preprocess, split, and index documents

In this section, we index documents into a Chroma DB collection by building a Haystack indexing pipeline. Specifically, we index documents from the VIM User Manual into Haystack's ChromaDocumentStore.

The example folder of the ChromaDocumentStore integration contains these pages as .txt files, so we use the TextFileToDocument and DocumentWriter components to build the indexing pipeline.

# Fetch data files from the GitHub repo
!curl -sL https://github.com/deepset-ai/haystack-core-integrations/tarball/main -o main.tar
!mkdir main
!tar xf main.tar -C main --strip-components 1
!mv main/integrations/chroma/example/data .

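Note: the two messages below appear only when this cell is re-run after main/ and data/ already exist. In that case the data folder from the earlier run stays in place.

mkdir: main: File exists
mv: rename main/integrations/chroma/example/data to ./data: Directory not empty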
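
The indexing code in this section builds its file list with os.listdir("data"), which returns every entry in the folder. As a minimal sketch, assuming you only want the .txt files, the list could be built with pathlib instead. The helper name list_text_files is hypothetical and not part of Haystack.

```python
from pathlib import Path


def list_text_files(data_dir):
    """Return the sorted paths of the .txt files directly inside data_dir.

    Hypothetical helper for illustration: subfolders and files with other
    extensions are skipped.
    """
    return sorted(p for p in Path(data_dir).glob("*.txt") if p.is_file())
```

The result of list_text_files("data") could then be passed as the "sources" input of the converter in place of file_paths.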

import os
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from haystack_integrations.document_stores.chroma import ChromaDocumentStore

file_paths = ["data" / Path(name) for name in os.listdir("data")]

# Chroma is used in-memory so we use the same instances in the two pipelines below
document_store = ChromaDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})

{'writer': {'documents_written': 36}}

Query pipeline: build a retrieval-augmented generation (RAG) pipeline

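Once the documents are in the ChromaDocumentStore, we can use the accompanying Chroma retrievers to build a query pipeline. The query pipeline below is a simple retrieval-augmented generation (RAG) pipeline that uses Chroma's query API.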

You can change the indexing pipeline and the query pipeline here to do embedding search by using one of the Haystack Embedders together with the ChromaEmbeddingRetriever.

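In this example we use:

OpenAIChatGenerator with gpt-4o-mini. (You need an OpenAI API key to use this model.) You can replace it with any other Generator.
ChatPromptBuilder, which holds the prompt template. You can adjust this to a prompt of your choice.
ChromaQueryTextRetriever, which takes a text query (a single string in the code below) and retrieves the top_k most relevant documents from your Chroma collection.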
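
Before looking at the real components, it may help to see the data flow of a RAG query in plain Python. The sketch below is a toy stand-in, not how Chroma or Haystack work internally: word-overlap scoring replaces Chroma's query API, the functions retrieve and build_prompt are hypothetical, and the sample documents are made up. The generation step (sending the prompt to an LLM) is left out.

```python
def retrieve(query, documents, top_k=3):
    """Toy retriever: rank documents by how many lowercase words they share with the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    # Highest score first; Python's sort is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep at most top_k documents and drop those with no overlap at all
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query, context_docs):
    """Fill a prompt with the retrieved context and the query, similar to the template used below."""
    context = "\n".join(f"  {doc}" for doc in context_docs)
    return (
        "Answer the query based on the provided context.\n"
        "If the context does not contain the answer, say 'Answer not found'.\n"
        f"Context:\n{context}\n"
        f"query: {query}\n"
        "Answer:"
    )


# Made-up documents for illustration only
docs = [
    "Write documentation for your plugin.",
    "Use :help to read the manual.",
    "Bananas are yellow.",
]
query = "Should I write documentation for my plugin?"
print(build_prompt(query, retrieve(query, docs, top_k=2)))
```

In the actual pipeline, ChromaQueryTextRetriever plays the role of retrieve, ChatPromptBuilder plays the role of build_prompt, and OpenAIChatGenerator produces the answer from the resulting prompt.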

import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

Enter OpenAI API key: ········

from haystack_integrations.components.retrievers.chroma import ChromaQueryTextRetriever
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.components.builders import ChatPromptBuilder
from haystack.dataclasses.chat_message import ChatMessage

prompt = """
Answer the query based on the provided context.
If the context does not contain the answer, say 'Answer not found'.
Context:
{% for doc in documents %}
  {{ doc.content }}
{% endfor %}
query: {{query}}
Answer:
"""

template = [ChatMessage.from_user(prompt)]
prompt_builder = ChatPromptBuilder(template=template)

llm = OpenAIChatGenerator()
retriever = ChromaQueryTextRetriever(document_store)

querying = Pipeline()
querying.add_component("retriever", retriever)
querying.add_component("prompt_builder", prompt_builder)
querying.add_component("llm", llm)

querying.connect("retriever.documents", "prompt_builder.documents")
querying.connect("prompt_builder", "llm")

ChatPromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.

<haystack.core.pipeline.pipeline.Pipeline object at 0x308f29880>
🚅 Components
  - retriever: ChromaQueryTextRetriever
  - prompt_builder: ChatPromptBuilder
  - llm: OpenAIChatGenerator
🛤️ Connections
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.messages (List[ChatMessage])

query = "Should I write documentation for my plugin?"
results = querying.run({"retriever": {"query": query, "top_k": 3}, "prompt_builder": {"query": query}})

print(results["llm"]["replies"][0].text)

Yes, it is a good idea to write documentation for your plugin. This helps users understand how to use it, especially when its behavior can be changed by the user.