Integration: Notion Extractor

A component for extracting Notion pages into Haystack Documents. Useful in indexing pipelines.

Authors

Bogdan Kostić

GitHub Repository, PyPI Package

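This Haystack component lets you export your Notion pages as Haystack Documents. All you need to provide is a Notion API token.

Because the Notion API is subject to rate limits, the component automatically retries failed requests and waits for the rate limit to reset before retrying. This is especially useful when exporting a large number of pages. The component also uses asyncio to send requests concurrently, which can significantly speed up the export.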
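The retry-and-wait behaviour combined with concurrent requests can be pictured with a small standalone sketch. This is not the component's actual implementation: `RateLimited`, `fetch_page`, `fetch_with_retry`, and `export_pages` are hypothetical names, and the "API" here is a stand-in that rejects the first request for each page so that the retry path is exercised.

```python
import asyncio


class RateLimited(Exception):
    """Raised by the stand-in client when the simulated API reports a rate limit."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


async def fetch_page(page_id: str, attempts: dict) -> str:
    """Hypothetical stand-in for an API call: fails once per page, then succeeds."""
    attempts[page_id] = attempts.get(page_id, 0) + 1
    await asyncio.sleep(0.01)  # simulate network latency
    if attempts[page_id] == 1:
        raise RateLimited(retry_after=0.05)
    return f"content of {page_id}"


async def fetch_with_retry(page_id: str, attempts: dict, max_retries: int = 3) -> str:
    """Retry a rate-limited request, waiting for the advertised reset time first."""
    for attempt in range(max_retries + 1):
        try:
            return await fetch_page(page_id, attempts)
        except RateLimited as err:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            await asyncio.sleep(err.retry_after)


async def export_pages(page_ids: list, max_concurrency: int = 3):
    """Fetch all pages concurrently, capped by a semaphore."""
    attempts: dict = {}
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(page_id: str) -> str:
        async with semaphore:
            return await fetch_with_retry(page_id, attempts)

    # gather() preserves the order of page_ids in its results
    results = await asyncio.gather(*(bounded(p) for p in page_ids))
    return list(results), attempts


pages, attempts = asyncio.run(export_pages(["page-a", "page-b", "page-c"]))
```

Capping concurrency with a semaphore is one common way to avoid hitting the rate limit in the first place. Whether the component does this internally is not stated here.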

Installation

pip install notion-haystack

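Usage

To use this component, you need a Notion API token. Follow the steps outlined in the Notion documentation to create a new Notion integration, connect it to your pages, and obtain your API token.

The following minimal example shows how to export a list of pages as Haystack Documents.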

from notion_haystack import NotionExporter

exporter = NotionExporter(api_token="<your-token>")
# run() takes the page IDs via the page_ids parameter and returns a dict of outputs
exported_pages = exporter.run(page_ids=["<list-of-page-ids>"])["documents"]

# exported_pages will be a list of Haystack Documents where each Document corresponds to a Notion page

The following example shows how to use the NotionExporter in an indexing pipeline.

from haystack import Pipeline

from notion_haystack import NotionExporter
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
exporter = NotionExporter(api_token="YOUR_API_KEY")
splitter = DocumentSplitter()
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=exporter, name="exporter")
indexing_pipeline.add_component(instance=splitter, name="splitter")
indexing_pipeline.add_component(instance=writer, name="writer")

indexing_pipeline.connect("exporter.documents", "splitter.documents")
indexing_pipeline.connect("splitter", "writer")

indexing_pipeline.run(data={"exporter": {"page_ids": ["your_page_id"] }})

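The NotionExporter class takes the following arguments:

api_token: Your Notion API token. The Notion documentation explains how to obtain one.
export_child_pages: Whether to recursively export all child pages of the provided page IDs. Defaults to False.
extract_page_metadata: Whether to extract metadata from the page and add it as frontmatter to the markdown. Extracted metadata includes the page's title, author, path, URL, last editor, and last edited time. Defaults to False.
exclude_title_containing: If specified, pages whose title contains this string are excluded. This can be useful, for example, to exclude archived pages. Defaults to None.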
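As a sketch of how the arguments listed above might be combined, the snippet below collects them in a dictionary and passes them to the constructor. The helper names `build_exporter_kwargs` and `make_exporter`, the `NOTION_API_TOKEN` environment variable, and the `"[Archived]"` marker are assumptions made for illustration. Only the four parameter names come from the list above.

```python
import os


def build_exporter_kwargs(api_token: str) -> dict:
    """Collect the NotionExporter arguments in one place (hypothetical helper)."""
    return {
        "api_token": api_token,
        "export_child_pages": True,      # also export all child pages recursively
        "extract_page_metadata": True,   # add title, author, URL, etc. as frontmatter
        "exclude_title_containing": "[Archived]",  # assumed marker for archived pages
    }


def make_exporter(api_token: str):
    """Create the exporter; imported lazily so this sketch loads without the package."""
    from notion_haystack import NotionExporter

    return NotionExporter(**build_exporter_kwargs(api_token))


# NOTION_API_TOKEN is an assumed variable name; reading the token from the
# environment keeps it out of source control.
kwargs = build_exporter_kwargs(os.environ.get("NOTION_API_TOKEN", "<your-token>"))
```

Calling `make_exporter(token)` then returns a configured exporter that can be added to a pipeline in the same way as in the indexing example.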

The NotionExporter.run method takes the following arguments:

page_ids: The list of page IDs to export. If export_child_pages is True, all child pages of these pages are exported as well.