Tutorial: Classifying Documents and Queries by Language

Last Updated: August 25, 2025

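Level: Beginner
Time to complete: 15 minutes
Components Used: InMemoryDocumentStore, DocumentLanguageClassifier, MetadataRouter, DocumentWriter, TextLanguageRouter, DocumentJoiner, InMemoryBM25Retriever, ChatPromptBuilder, OpenAIChatGenerator
Goal: After completing this tutorial, you will know how to build a Haystack pipeline that classifies documents by the (human) language they are written in.
Optionally, at the end you will also integrate language classification and query routing into a RAG pipeline, so that you can query documents in the language of the question.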

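Overview

With more than 7,000 human languages in use around the world today, handling multilingual input is a common use case for NLP applications.

Good news: Haystack ships with a DocumentLanguageClassifier. This component detects the language a document is written in. That lets you create "branches" in your Haystack pipelines, so you can add different processing steps for each language. For example, you could answer German queries with an LLM that performs better in German. Or you could fetch only French restaurant reviews for your French users.

In this tutorial, you will work with text samples taken from hotel reviews written in different languages. The samples are turned into Haystack Documents and classified by language. Each document is then written to a language-specific DocumentStore. To verify that language detection works correctly, you will filter the document stores and display their contents.

In the last section, you will build a multilingual RAG pipeline. The system detects the language of the question and uses only documents in that language to generate the answer. TextLanguageRouter comes in handy for this part.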

Preparing the Colab Environment

Installing Haystack

%%bash

pip install haystack-ai
pip install langdetect

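Write Documents Into `InMemoryDocumentStore`

The following indexing pipeline writes English, French, and Spanish documents into their own InMemoryDocumentStores based on their language.

Import the modules you need. Then instantiate a list of Haystack Documents that contain snippets of hotel reviews in various languages.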

from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter


documents = [
    Document(
        content="Super appartement. Juste au dessus de plusieurs bars qui ferment très tard. A savoir à l'avance. (Bouchons d'oreilles fournis !)"
    ),
    Document(
        content="El apartamento estaba genial y muy céntrico, todo a mano. Al lado de la librería Lello y De la Torre de los clérigos. Está situado en una zona de marcha, así que si vais en fin de semana , habrá ruido, aunque a nosotros no nos molestaba para dormir"
    ),
    Document(
        content="The keypad with a code is convenient and the location is convenient. Basically everything else, very noisy, wi-fi didn't work, check-in person didn't explain anything about facilities, shower head was broken, there's no cleaning and everything else one may need is charged."
    ),
    Document(
        content="It is very central and appartement has a nice appearance (even though a lot IKEA stuff), *W A R N I N G** the appartement presents itself as a elegant and as a place to relax, very wrong place to relax - you cannot sleep in this appartement, even the beds are vibrating from the bass of the clubs in the same building - you get ear plugs from the hotel -> now I understand why -> I missed a trip as it was so loud and I could not hear the alarm next day due to the ear plugs.- there is a green light indicating 'emergency exit' just above the bed, which shines very bright at night - during the arrival process, you felt the urge of the agent to leave as soon as possible. - try to go to 'RVA clerigos appartements' -> same price, super quiet, beautiful, city center and very nice staff (not an agency)- you are basically sleeping next to the fridge, which makes a lot of noise, when the compressor is running -> had to switch it off - but then had no cool food and drinks. - the bed was somehow broken down - the wooden part behind the bed was almost falling appart and some hooks were broken before- when the neighbour room is cooking you hear the fan very loud. I initially thought that I somehow activated the kitchen fan"
    ),
    Document(content="Un peu salé surtout le sol. Manque de service et de souplesse"),
    Document(
        content="Nous avons passé un séjour formidable. Merci aux personnes , le bonjours à Ricardo notre taxi man, très sympathique. Je pense refaire un séjour parmi vous, après le confinement, tout était parfait, surtout leur gentillesse, aucune chaude négative. Je n'ai rien à redire de négative, Ils étaient a notre écoute, un gentil message tout les matins, pour nous demander si nous avions besoins de renseignement et savoir si tout allait bien pendant notre séjour."
    ),
    Document(
        content="Céntrico. Muy cómodo para moverse y ver Oporto. Edificio con terraza propia en la última planta. Todo reformado y nuevo. Te traen un estupendo desayuno todas las mañanas al apartamento. Solo que se puede escuchar algo de ruido de la calle a primeras horas de la noche. Es un zona de ocio nocturno. Pero respetan los horarios."
    ),
]

Each language gets its own DocumentStore.

en_document_store = InMemoryDocumentStore()
fr_document_store = InMemoryDocumentStore()
es_document_store = InMemoryDocumentStore()

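DocumentLanguageClassifier takes a list of languages. MetadataRouter needs a dictionary of rules. Based on a document's metadata, these rules specify which node the document is routed to (in this case, which language-specific DocumentWriter).

The keys of the dictionary are the names of the output connections, and the values are dictionaries that follow the filter expression format used in Haystack.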

language_classifier = DocumentLanguageClassifier(languages=["en", "fr", "es"])
router_rules = {
    "en": {"field": "meta.language", "operator": "==", "value": "en"},
    "fr": {"field": "meta.language", "operator": "==", "value": "fr"},
    "es": {"field": "meta.language", "operator": "==", "value": "es"},
}
router = MetadataRouter(rules=router_rules)

en_writer = DocumentWriter(document_store=en_document_store)
fr_writer = DocumentWriter(document_store=fr_document_store)
es_writer = DocumentWriter(document_store=es_document_store)
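
To make the rule format more concrete, here is a minimal pure-Python sketch of how a rule such as {"field": "meta.language", "operator": "==", "value": "en"} could be matched against a document. It is not MetadataRouter's implementation: the helpers get_field, matches, and route are hypothetical names, documents are modeled as plain dictionaries, and only the "==" operator used in this tutorial is handled. The "unmatched" bucket mirrors the label the classifier assigns when no listed language fits.

```python
from collections import defaultdict


def get_field(doc: dict, dotted_path: str):
    """Follow a dotted path such as "meta.language" through nested dicts."""
    value = doc
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches(rule: dict, doc: dict) -> bool:
    """Evaluate one rule; only the "==" operator is covered in this sketch."""
    if rule["operator"] != "==":
        raise ValueError(f"unsupported operator: {rule['operator']}")
    return get_field(doc, rule["field"]) == rule["value"]


def route(docs: list, rules: dict) -> dict:
    """Group documents by the name of every rule they satisfy."""
    buckets = defaultdict(list)
    for doc in docs:
        hits = [name for name, rule in rules.items() if matches(rule, doc)]
        if not hits:
            hits = ["unmatched"]  # no rule applied to this document
        for name in hits:
            buckets[name].append(doc)
    return dict(buckets)


rules = {
    "en": {"field": "meta.language", "operator": "==", "value": "en"},
    "fr": {"field": "meta.language", "operator": "==", "value": "fr"},
    "es": {"field": "meta.language", "operator": "==", "value": "es"},
}

# Hand-written stand-ins for documents that already carry a language label.
docs = [
    {"content": "Very central.", "meta": {"language": "en"}},
    {"content": "Manque de service.", "meta": {"language": "fr"}},
    {"content": "Sehr laut.", "meta": {"language": "unmatched"}},
]

routed = route(docs, rules)
for name, group in routed.items():
    print(name, [d["content"] for d in group])
```

Each key of the result corresponds to one output connection name, which is why the rule keys "en", "fr", and "es" reappear as router.en, router.fr, and router.es when the pipeline is wired together.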

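Now that all the components have been created, instantiate the Pipeline. Add the components to the pipeline. Connect the output of each component to the input of the next.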

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=language_classifier, name="language_classifier")
indexing_pipeline.add_component(instance=router, name="router")
indexing_pipeline.add_component(instance=en_writer, name="en_writer")
indexing_pipeline.add_component(instance=fr_writer, name="fr_writer")
indexing_pipeline.add_component(instance=es_writer, name="es_writer")


indexing_pipeline.connect("language_classifier", "router")
indexing_pipeline.connect("router.en", "en_writer")
indexing_pipeline.connect("router.fr", "fr_writer")
indexing_pipeline.connect("router.es", "es_writer")

Draw a diagram of the pipeline to see what the graph looks like.

# indexing_pipeline.draw("indexing_pipeline.png")

Run the pipeline. It reports how many documents were written for each language. Voilà!

indexing_pipeline.run(data={"language_classifier": {"documents": documents}})
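
If you capture the return value of the run, the per-writer counts can be pulled out with a small helper. The sketch below assumes each DocumentWriter reports its count under the "documents_written" key. The helper name written_counts is hypothetical, and sample_result is hand-written to illustrate the expected shape (it assumes all seven reviews were classified correctly); it is not captured output.

```python
def written_counts(result: dict) -> dict:
    """Collect the number of documents each writer component reports."""
    return {
        name: output["documents_written"]
        for name, output in result.items()
        if isinstance(output, dict) and "documents_written" in output
    }


# Hand-written example of the result shape, not real pipeline output.
sample_result = {
    "router": {"unmatched": []},
    "en_writer": {"documents_written": 2},
    "fr_writer": {"documents_written": 3},
    "es_writer": {"documents_written": 2},
}

counts = written_counts(sample_result)
print(counts)
print("total:", sum(counts.values()))
```

With the real pipeline, you would pass the dictionary returned by indexing_pipeline.run(...) instead of sample_result.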

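Check the Contents of Your Document Stores

You can check the contents of your document stores. Each store should contain only documents in the correct language.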

print("English documents: ", en_document_store.filter_documents())
print("French documents: ", fr_document_store.filter_documents())
print("Spanish documents: ", es_document_store.filter_documents())

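(Optional) Create a Multi-Lingual RAG Pipeline

To build a multilingual RAG pipeline, you can use TextLanguageRouter to detect the language of the query. Then, retrieve documents in that language from the matching DocumentStore.

For this, you need an OpenAI API key, although the same approach works with any other generator that Haystack supports.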

import os
from getpass import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")

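Let's assume that all the reviews we put into the document stores earlier are for the same accommodation. The RAG pipeline lets you query for information about that apartment in the language of your choice.

Import the components you need for the RAG pipeline. Write a prompt that will be passed to the LLM together with the relevant documents.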

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.joiners import DocumentJoiner
from haystack.components.builders import ChatPromptBuilder
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage
from haystack.components.routers import TextLanguageRouter

prompt_template = [
    ChatMessage.from_user(
        """
You will be provided with reviews for an accommodation.
Answer the question concisely based solely on the given reviews.
Reviews:
  {% for doc in documents %}
    {{ doc.content }}
  {% endfor %}
Question: {{ query}}
Answer:
"""
    )
]

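Build the Pipeline

Create a new Pipeline. Add the following components:

TextLanguageRouter
InMemoryBM25Retriever. You need one retriever per language, since each language has its own DocumentStore.
DocumentJoiner
ChatPromptBuilder
OpenAIChatGenerator

Note: The BM25 retriever essentially does keyword matching, which is less accurate than other search methods. To make the LLM's answers more precise, you could refactor the pipeline to use an embedding-based retriever (for example, InMemoryEmbeddingRetriever), which performs vector search over the documents.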
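
To illustrate why BM25 amounts to keyword matching, here is a minimal Okapi BM25 scorer in plain Python. It is a sketch of the general formula, not the code InMemoryBM25Retriever runs: the whitespace tokenizer, the function names, and the parameter values k1=1.5 and b=0.75 are assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list:
    """Naive lowercase whitespace tokenizer (an assumption of this sketch)."""
    return text.lower().split()


def bm25_scores(query: str, docs: list, k1: float = 1.5, b: float = 0.75) -> list:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = (sum(len(tokens) for tokens in tokenized) / n_docs) or 1
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # terms absent from the document add nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


reviews = [
    "the location is convenient",
    "very noisy at night",
    "breakfast was great",
]

# Shares the words "convenient" and "location" with the first review.
scores = bm25_scores("convenient location", reviews)
# Similar meaning, but no word in common with any review.
synonym_scores = bm25_scores("central spot", reviews)

print(scores)
print(synonym_scores)
```

Only the review that literally contains the query words receives a positive score, while the paraphrased query scores zero everywhere. That gap is what vector search is meant to close.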

rag_pipeline = Pipeline()
rag_pipeline.add_component(instance=TextLanguageRouter(["en", "fr", "es"]), name="router")
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=en_document_store), name="en_retriever")
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=fr_document_store), name="fr_retriever")
rag_pipeline.add_component(instance=InMemoryBM25Retriever(document_store=es_document_store), name="es_retriever")
rag_pipeline.add_component(instance=DocumentJoiner(), name="joiner")
rag_pipeline.add_component(instance=ChatPromptBuilder(template=prompt_template), name="prompt_builder")
rag_pipeline.add_component(instance=OpenAIChatGenerator(), name="llm")


rag_pipeline.connect("router.en", "en_retriever.query")
rag_pipeline.connect("router.fr", "fr_retriever.query")
rag_pipeline.connect("router.es", "es_retriever.query")
rag_pipeline.connect("en_retriever", "joiner")
rag_pipeline.connect("fr_retriever", "joiner")
rag_pipeline.connect("es_retriever", "joiner")
rag_pipeline.connect("joiner.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "llm.messages")

You can draw this pipeline and compare its architecture with the indexing_pipeline diagram we created earlier.

# rag_pipeline.draw("rag_pipeline.png")

Try it out by asking a question.

en_question = "Is this apartment conveniently located?"

result = rag_pipeline.run({"router": {"text": en_question}, "prompt_builder": {"query": en_question}})

print(result["llm"]["replies"][0].text)

How does the pipeline perform in Spanish?

es_question = "¿El desayuno es genial?"

result = rag_pipeline.run({"router": {"text": es_question}, "prompt_builder": {"query": es_question}})

print(result["llm"]["replies"][0].text)
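
Both runs above pass the same question twice: once to the router, which detects the language, and once to the prompt builder, which inserts the question into the prompt. A small helper keeps the two in sync. The name build_rag_inputs and the French sample question are hypothetical; the component names "router" and "prompt_builder" are the ones used when the pipeline was assembled.

```python
def build_rag_inputs(question: str) -> dict:
    """Build the run() input so the router and the prompt builder see the same text."""
    return {
        "router": {"text": question},
        "prompt_builder": {"query": question},
    }


fr_question = "L'appartement est-il bruyant ?"
inputs = build_rag_inputs(fr_question)
print(inputs)
```

With the pipeline from this section, the call would then be rag_pipeline.run(build_rag_inputs(fr_question)).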

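What's next

If you've been following along, you now know how to incorporate language detection into both indexing and query pipelines in Haystack. Go forth and build the international application of your dreams. 🗺️

If you liked this tutorial, there's more to learn about Haystack:

Generating Structured Output with Loop-Based Auto-Correction
Evaluating RAG Pipelines

To stay up to date on the latest Haystack developments, you can sign up for our newsletter.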