Extracting Metadata with an LLM
Last updated: July 8, 2025
Notebook by: David S. Batista
This notebook shows how to use the LLMMetadataExtractor: we will use a large language model to extract metadata from documents.
Setup
!uv pip install haystack-ai
!uv pip install "sentence-transformers>=3.0.0"
Initializing the LLMMetadataExtractor
Let's define the kind of metadata we want to extract from our documents. We will do this through an LLM prompt, which is then used by the LLMMetadataExtractor component. In this case, we want to extract named entities from the documents.
NER_PROMPT = """
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}
2. Return output in a single list with all the entities identified in step 1.
-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
"""
Let's initialize an instance of LLMMetadataExtractor using OpenAI as the LLM provider and the prompt defined above for the metadata extraction.
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
We also need to set the OPENAI_API_KEY.
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Enter OpenAI API key: ········
We will instantiate the LLMMetadataExtractor using OpenAI as the LLM provider. Note that the prompt parameter is set to the prompt we defined above, and we also need to specify the keys that should be present in the JSON output, in this case "entities".
Another important aspect is raise_on_failure=False: if the LLM fails for a given document (e.g., a network error, or it doesn't return a valid JSON object), we continue processing all the documents in the input.
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)
metadata_extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)
Let's define the documents from which the component will extract metadata, i.e., named entities.
from haystack import Document
docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library"),
    Document(content="Google was founded in 1998 by Larry Page and Sergey Brin"),
    Document(content="Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot"),
    Document(content="Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded in 1847 by Werner von Siemens"),
]
Now let's extract! :)
result = metadata_extractor.run(documents=docs)
result
{'documents': [Document(id=05fe6674dd4faf3dcaa991f9e6d520c9185d5644c4ac2b8b52276e6b70a831f2, content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework', meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Haystack', 'entity_type': 'product'}]}),
Document(id=37364c858185cf02abc43b43db613d236baa4dd501453d6942681842863c313a, content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers librar...', meta: {'entities': [{'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'}, {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers library', 'entity_type': 'product'}]}),
Document(id=eb4e2410115dfb7edc47b84853d0cdc845699120509346383896ed7d47354e2d, content: 'Google was founded in 1998 by Larry Page and Sergey Brin', meta: {'entities': [{'entity': 'Google', 'entity_type': 'company'}, {'entity': 'Larry Page', 'entity_type': 'person'}, {'entity': 'Sergey Brin', 'entity_type': 'person'}]}),
Document(id=ee28eff307d3a1d435f0515195e0a86e592b72b5570dcaddc4d3108769632596, content: 'Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot', meta: {'entities': [{'entity': 'Peugeot', 'entity_type': 'company'}, {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Jean-Pierre Peugeot', 'entity_type': 'person'}]}),
Document(id=0a56bf794d37839113a73634cc0f3ecab33744eeea7b682b49fd2dc51737aed8, content: 'Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded i...', meta: {'entities': [{'entity': 'Siemens', 'entity_type': 'company'}, {'entity': 'Germany', 'entity_type': 'country'}, {'entity': 'Munich', 'entity_type': 'city'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Werner von Siemens', 'entity_type': 'person'}]})],
'failed_documents': []}
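Because we set raise_on_failure=False, documents that the LLM could not process are returned in failed_documents instead of interrupting the run. A minimal sketch of how you might inspect them, assuming failed documents carry the failure reason under a metadata_extraction_error meta key (treat that key name as an assumption and check the docs for your Haystack version):
# Inspect any documents for which metadata extraction failed (empty list in this run).
for failed_doc in result["failed_documents"]:
    print(failed_doc.content)
    # Assumed meta key holding the failure reason.
    print(failed_doc.meta.get("metadata_extraction_error"))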
Indexing Pipeline with Extraction
Now let's build an indexing pipeline, where we simply give the documents as input and get a document store with the documents indexed together with the extracted metadata.
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
doc_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=metadata_extractor, name="metadata_extractor")
p.add_component(instance=SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"), name="embedder")
p.add_component(instance=DocumentWriter(document_store=doc_store), name="writer")
p.connect("metadata_extractor.documents", "embedder.documents")
p.connect("embedder.documents", "writer.documents")
<haystack.core.pipeline.pipeline.Pipeline object at 0x320d71010>
🚅 Components
- metadata_extractor: LLMMetadataExtractor
- embedder: SentenceTransformersDocumentEmbedder
- writer: DocumentWriter
🛤️ Connections
- metadata_extractor.documents -> embedder.documents (List[Document])
- embedder.documents -> writer.documents (List[Document])
Let's try it!
p.run(data={"metadata_extractor": {"documents": docs}})
{'metadata_extractor': {'failed_documents': []},
'writer': {'documents_written': 5}}
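As a quick sanity check, the document store can report how many documents it now holds; this should match the documents_written count above.
# Should print 5, matching the writer's output.
print(doc_store.count_documents())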
Let's inspect the metadata of the documents in the document store.
for doc in doc_store.storage.values():
    print(doc.content)
    print(doc.meta)
    print("\n---------")
deepset was founded in 2018 in Berlin, and is known for its Haystack framework
{'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Haystack', 'entity_type': 'product'}]}
---------
Hugging Face is a company that was founded in New York, USA and is known for its Transformers library
{'entities': [{'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'}, {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers library', 'entity_type': 'product'}]}
---------
Google was founded in 1998 by Larry Page and Sergey Brin
{'entities': [{'entity': 'Google', 'entity_type': 'company'}, {'entity': 'Larry Page', 'entity_type': 'person'}, {'entity': 'Sergey Brin', 'entity_type': 'person'}]}
---------
Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot
{'entities': [{'entity': 'Peugeot', 'entity_type': 'company'}, {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Jean-Pierre Peugeot', 'entity_type': 'person'}]}
---------
Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded in 1847 by Werner von Siemens
{'entities': [{'entity': 'Siemens', 'entity_type': 'company'}, {'entity': 'Germany', 'entity_type': 'country'}, {'entity': 'Munich', 'entity_type': 'city'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Werner von Siemens', 'entity_type': 'person'}]}
---------
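Since the extracted entities are stored in each document's meta, they can be used downstream, for example to restrict results to documents that mention a particular entity. A minimal sketch, filtering in plain Python over the in-memory store (the entity name "Berlin" is just an illustrative choice):
# Collect documents whose extracted entities include a given name.
target_entity = "Berlin"
matching = [
    doc
    for doc in doc_store.storage.values()
    if any(e["entity"] == target_entity for e in doc.meta.get("entities", []))
]
for doc in matching:
    print(doc.content)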
