Extracting Metadata with an LLM
Last updated: July 8, 2025
Notebook by: David S. Batista
This notebook shows how to use the LLMMetadataExtractor: we will use a large language model to extract metadata from documents.
Setup
!uv pip install haystack-ai
!uv pip install "sentence-transformers>=3.0.0"
Initializing the LLMMetadataExtractor
Let's define the kind of metadata we want to extract from our documents. We will do this through an LLM prompt, which is then used by the LLMMetadataExtractor component. In this case, we want to extract named entities from the documents.
NER_PROMPT = """
-Goal-
Given text and a list of entity types, identify all entities of those types from the text.
-Steps-
1. Identify all entities. For each identified entity, extract the following information:
- entity: Name of the entity
- entity_type: One of the following types: [organization, product, service, industry]
Format each entity as a JSON like: {"entity": <entity_name>, "entity_type": <entity_type>}
2. Return output in a single list with all the entities identified in step 1.
-Examples-
######################
Example 1:
entity_types: [organization, person, partnership, financial metric, product, service, industry, investment strategy, market trend]
text: Another area of strength is our co-brand issuance. Visa is the primary network partner for eight of the top
10 co-brand partnerships in the US today and we are pleased that Visa has finalized a multi-year extension of
our successful credit co-branded partnership with Alaska Airlines, a portfolio that benefits from a loyal customer
base and high cross-border usage.
We have also had significant co-brand momentum in CEMEA. First, we launched a new co-brand card in partnership
with Qatar Airways, British Airways and the National Bank of Kuwait. Second, we expanded our strong global
Marriott relationship to launch Qatar's first hospitality co-branded card with Qatar Islamic Bank. Across the
United Arab Emirates, we now have exclusive agreements with all the leading airlines marked by a recent
agreement with Emirates Skywards.
And we also signed an inaugural Airline co-brand agreement in Morocco with Royal Air Maroc. Now newer digital
issuers are equally
------------------------
output:
{"entities": [{"entity": "Visa", "entity_type": "company"}, {"entity": "Alaska Airlines", "entity_type": "company"}, {"entity": "Qatar Airways", "entity_type": "company"}, {"entity": "British Airways", "entity_type": "company"}, {"entity": "National Bank of Kuwait", "entity_type": "company"}, {"entity": "Marriott", "entity_type": "company"}, {"entity": "Qatar Islamic Bank", "entity_type": "company"}, {"entity": "Emirates Skywards", "entity_type": "company"}, {"entity": "Royal Air Maroc", "entity_type": "company"}]}
#############################
-Real Data-
######################
entity_types: [company, organization, person, country, product, service]
text: {{ document.content }}
######################
output:
"""
Let's initialize an instance of LLMMetadataExtractor using OpenAI as the LLM provider and the prompt defined above for the metadata extraction.
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
We also need to set the OPENAI_API_KEY.
import os
from getpass import getpass
if "OPENAI_API_KEY" not in os.environ:
os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
Enter OpenAI API key: ········
We will instantiate the LLMMetadataExtractor using OpenAI as the LLM provider. Note that the prompt parameter is set to the prompt we defined above, and we also need to specify the keys that should be present in the JSON output, in this case "entities".
Another important aspect is raise_on_failure=False: if the LLM fails for a given document (e.g., a network error, or it doesn't return a valid JSON object), we continue processing all the documents in the input.
from haystack.components.extractors.llm_metadata_extractor import LLMMetadataExtractor
from haystack.components.generators.chat import OpenAIChatGenerator
chat_generator = OpenAIChatGenerator(
    generation_kwargs={
        "max_tokens": 500,
        "temperature": 0.0,
        "seed": 0,
        "response_format": {"type": "json_object"},
    },
    max_retries=1,
    timeout=60.0,
)
metadata_extractor = LLMMetadataExtractor(
    prompt=NER_PROMPT,
    chat_generator=chat_generator,
    expected_keys=["entities"],
    raise_on_failure=False,
)
Let's define the documents from which the component will extract metadata, i.e., named entities.
from haystack import Document
docs = [
    Document(content="deepset was founded in 2018 in Berlin, and is known for its Haystack framework"),
    Document(content="Hugging Face is a company that was founded in New York, USA and is known for its Transformers library"),
    Document(content="Google was founded in 1998 by Larry Page and Sergey Brin"),
    Document(content="Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot"),
    Document(content="Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded in 1847 by Werner von Siemens"),
]
Now let's extract! :)
result = metadata_extractor.run(documents=docs)
result
{'documents': [Document(id=05fe6674dd4faf3dcaa991f9e6d520c9185d5644c4ac2b8b52276e6b70a831f2, content: 'deepset was founded in 2018 in Berlin, and is known for its Haystack framework', meta: {'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Haystack', 'entity_type': 'product'}]}),
Document(id=37364c858185cf02abc43b43db613d236baa4dd501453d6942681842863c313a, content: 'Hugging Face is a company that was founded in New York, USA and is known for its Transformers librar...', meta: {'entities': [{'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'}, {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers library', 'entity_type': 'product'}]}),
Document(id=eb4e2410115dfb7edc47b84853d0cdc845699120509346383896ed7d47354e2d, content: 'Google was founded in 1998 by Larry Page and Sergey Brin', meta: {'entities': [{'entity': 'Google', 'entity_type': 'company'}, {'entity': 'Larry Page', 'entity_type': 'person'}, {'entity': 'Sergey Brin', 'entity_type': 'person'}]}),
Document(id=ee28eff307d3a1d435f0515195e0a86e592b72b5570dcaddc4d3108769632596, content: 'Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot', meta: {'entities': [{'entity': 'Peugeot', 'entity_type': 'company'}, {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Jean-Pierre Peugeot', 'entity_type': 'person'}]}),
Document(id=0a56bf794d37839113a73634cc0f3ecab33744eeea7b682b49fd2dc51737aed8, content: 'Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded i...', meta: {'entities': [{'entity': 'Siemens', 'entity_type': 'company'}, {'entity': 'Germany', 'entity_type': 'country'}, {'entity': 'Munich', 'entity_type': 'city'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Werner von Siemens', 'entity_type': 'person'}]})],
'failed_documents': []}
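Because we set raise_on_failure=False, documents that the LLM could not process are returned in failed_documents instead of interrupting the run. A minimal sketch of how you might inspect them, assuming failed documents carry the failure reason under a metadata_extraction_error meta key (treat that key name as an assumption and check the docs for your Haystack version):
# Inspect any documents for which metadata extraction failed (empty list in this run).
for failed_doc in result["failed_documents"]:
    print(failed_doc.content)
    # Assumed meta key holding the failure reason.
    print(failed_doc.meta.get("metadata_extraction_error"))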
Indexing Pipeline with Extraction
Now let's build an indexing pipeline, where we simply give the documents as input and get a document store with the documents indexed together with the extracted metadata.
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
doc_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=metadata_extractor, name="metadata_extractor")
p.add_component(instance=SentenceTransformersDocumentEmbedder(model="sentence-transformers/all-MiniLM-L6-v2"), name="embedder")
p.add_component(instance=DocumentWriter(document_store=doc_store), name="writer")
p.connect("metadata_extractor.documents", "embedder.documents")
p.connect("embedder.documents", "writer.documents")
<haystack.core.pipeline.pipeline.Pipeline object at 0x320d71010>
🚅 Components
- metadata_extractor: LLMMetadataExtractor
- embedder: SentenceTransformersDocumentEmbedder
- writer: DocumentWriter
🛤️ Connections
- metadata_extractor.documents -> embedder.documents (List[Document])
- embedder.documents -> writer.documents (List[Document])
Let's try it!
p.run(data={"metadata_extractor": {"documents": docs}})
{'metadata_extractor': {'failed_documents': []},
'writer': {'documents_written': 5}}
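As a quick sanity check, the document store can report how many documents it now holds; this should match the documents_written count above.
# Should print 5, matching the writer's output.
print(doc_store.count_documents())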
Let's inspect the metadata of the documents in the document store.
for doc in doc_store.storage.values():
    print(doc.content)
    print(doc.meta)
    print("\n---------")
deepset was founded in 2018 in Berlin, and is known for its Haystack framework
{'entities': [{'entity': 'deepset', 'entity_type': 'company'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Haystack', 'entity_type': 'product'}]}
---------
Hugging Face is a company that was founded in New York, USA and is known for its Transformers library
{'entities': [{'entity': 'Hugging Face', 'entity_type': 'company'}, {'entity': 'New York', 'entity_type': 'city'}, {'entity': 'USA', 'entity_type': 'country'}, {'entity': 'Transformers library', 'entity_type': 'product'}]}
---------
Google was founded in 1998 by Larry Page and Sergey Brin
{'entities': [{'entity': 'Google', 'entity_type': 'company'}, {'entity': 'Larry Page', 'entity_type': 'person'}, {'entity': 'Sergey Brin', 'entity_type': 'person'}]}
---------
Peugeot is a French automotive manufacturer that was founded in 1810 by Jean-Pierre Peugeot
{'entities': [{'entity': 'Peugeot', 'entity_type': 'company'}, {'entity': 'France', 'entity_type': 'country'}, {'entity': 'Jean-Pierre Peugeot', 'entity_type': 'person'}]}
---------
Siemens is a German multinational conglomerate company headquartered in Munich and Berlin, founded in 1847 by Werner von Siemens
{'entities': [{'entity': 'Siemens', 'entity_type': 'company'}, {'entity': 'Germany', 'entity_type': 'country'}, {'entity': 'Munich', 'entity_type': 'city'}, {'entity': 'Berlin', 'entity_type': 'city'}, {'entity': 'Werner von Siemens', 'entity_type': 'person'}]}
---------
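Since the extracted entities are stored in each document's meta, they can be used downstream, for example to restrict results to documents that mention a particular entity. A minimal sketch, filtering in plain Python over the in-memory store (the entity name "Berlin" is just an illustrative choice):
# Collect documents whose extracted entities include a given name.
target_entity = "Berlin"
matching = [
    doc
    for doc in doc_store.storage.values()
    if any(e["entity"] == target_entity for e in doc.meta.get("entities", []))
]
for doc in matching:
    print(doc.content)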
