Integration: Docling

Use Docling to locally parse and chunk PDF, DOCX, and other document types in Haystack

Authors
DS4SD

Overview

Docling can locally parse PDF, DOCX, HTML, and other document formats into a rich, standardized representation (including layout, tables, etc.), which it can then export to Markdown, JSON, and other formats.

Check out the Docling docs for more details.

This integration introduces Docling support, enabling Haystack users to

  • use various document types in LLM applications with ease and speed, and
  • leverage Docling’s rich format for advanced, document-native grounding.

Installation

pip install docling-haystack

Usage

Components

This integration introduces DoclingConverter, a component which reads document file paths (local or URL) and outputs Haystack Document objects.

DoclingConverter supports two export modes; see the export_type initialization argument below.

Use Docling Converter

Docling Converter Initialization

DoclingConverter creation can be parametrized via the following __init__() arguments, most of which refer to the initialization and usage of the underlying Docling DocumentConverter and chunker instances:

  • converter: The Docling DocumentConverter to use; if not set, a system default is used.
  • convert_kwargs: Any parameters to pass to Docling conversion; if not set, a system default is used.
  • export_type: The export mode to use: ExportType.DOC_CHUNKS (default) chunks each input document (see chunker) and captures each individual chunk as a separate Haystack Document, while ExportType.MARKDOWN captures each input document as a separate Haystack Document (in which case splitting is likely required downstream).
  • md_export_kwargs: Any parameters to pass to Markdown export (in case of ExportType.MARKDOWN).
  • chunker: The Docling chunker instance to use; if not set, a system default is used (in case of ExportType.DOC_CHUNKS).
  • meta_extractor: The extractor instance to use for populating the output document metadata; if not set, a system default is used.
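
As noted for ExportType.MARKDOWN, each input file becomes one long Document, so some splitting step is usually needed before indexing. The sketch below illustrates that downstream step in plain Python by splitting a Markdown string at heading lines. The helper name split_markdown_by_heading and the sample text are hypothetical; this is not how Docling's own chunker works, and it does not special-case "#" lines inside fenced code blocks. In a real Haystack pipeline, a splitter component would typically fill this role.

```python
import re
from typing import List


def split_markdown_by_heading(markdown: str) -> List[str]:
    """Split Markdown text into sections, starting a new one at each ATX heading.

    Hypothetical helper for illustration only; not part of docling-haystack.
    """
    sections: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        # A line such as "# Title" or "## Section" opens a new section,
        # so flush whatever has been collected so far.
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that contain only whitespace.
    return [s for s in sections if s]


# Made-up sample standing in for the content of a Markdown-exported Document.
sample = "# Title\nIntro text.\n\n## Methods\nDetails here.\n\n## Results\nNumbers."
chunks = split_markdown_by_heading(sample)
for chunk in chunks:
    print(repr(chunk))
```

Heading-based splitting keeps each section's title together with its body, which tends to give retrieval more context than fixed-size windows, at the cost of uneven chunk lengths.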

Standalone

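from docling_haystack.converter import DoclingConverter

# With no arguments, the converter uses the default export mode (ExportType.DOC_CHUNKS)
converter = DoclingConverter()
# run() accepts local paths or URLs and returns the converted Haystack Documents
documents = converter.run(paths=["https://arxiv.org/pdf/2408.09869"])["documents"]

# Each chunk is a separate Haystack Document; inspect the content of one of them
print(repr(documents[2].content))
# -> Abstract\nThis technical report introduces Docling [...]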

In a Pipeline

Check out this notebook illustrating usage in a complete example with indexing and RAG pipelines.

License

MIT License.