Integration: Docling
Use Docling to locally parse and chunk PDF, DOCX, and other document types in Haystack
Overview
Docling can locally parse PDF, DOCX, HTML, and other document formats into a rich standardized representation (including layout, tables, etc.), which it can then export to formats such as Markdown and JSON.
Check out the Docling docs for more details.
This integration introduces Docling support, enabling Haystack users to
- use various document types in LLM applications with ease and speed, and
- leverage Docling’s rich format for advanced, document-native grounding.
Installation
pip install docling-haystack
Usage
Components
This integration introduces DoclingConverter, a component which reads document file paths (local or URL) and outputs Haystack Document objects.
DoclingConverter supports two export modes; see the export_type initialization argument below.
Using DoclingConverter
DoclingConverter Initialization
DoclingConverter creation can be parametrized via the following __init__() arguments, most of which refer to the initialization and usage of the underlying Docling DocumentConverter and chunker instances:
- converter: The Docling DocumentConverter to use; if not set, a system default is used.
- convert_kwargs: Any parameters to pass to Docling conversion; if not set, a system default is used.
- export_type: The export mode to use:
  - ExportType.DOC_CHUNKS (default) chunks each input document (see chunker) and captures each individual chunk as a separate Haystack Document, while
  - ExportType.MARKDOWN captures each input document as a separate Haystack Document (in which case splitting is likely required downstream).
- md_export_kwargs: Any parameters to pass to Markdown export (in case of ExportType.MARKDOWN).
- chunker: The Docling chunker instance to use; if not set, a system default is used (in case of ExportType.DOC_CHUNKS).
- meta_extractor: The extractor instance to use for populating the output document metadata; if not set, a system default is used.
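With ExportType.MARKDOWN, each output Document holds an entire input document as Markdown, so some form of splitting usually has to happen afterwards. The sketch below is only an illustration of what such a downstream step could look like, using the standard library alone. The helper split_markdown_by_headings is a hypothetical name introduced here; it is not part of Docling or Haystack, and a real pipeline would more likely rely on a dedicated splitter component.

```python
import re


def split_markdown_by_headings(markdown: str) -> list[str]:
    """Split Markdown text into sections, each starting at a heading line.

    Hypothetical helper for illustration; not part of Docling or Haystack.
    """
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines():
        # An ATX heading ("# ", "## ", ...) closes the section in progress.
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that ended up empty after stripping whitespace.
    return [s for s in sections if s]


# Stand-in for the Markdown content of one converted Document.
sample = "# Title\nIntro text.\n\n## Abstract\nThis report introduces a tool.\n"
for section in split_markdown_by_headings(sample):
    print(repr(section))
```

This naive version treats any line that starts with one to six "#" characters followed by whitespace as a heading, so it would also split on comment lines inside fenced code blocks. It is meant to show the shape of the downstream step, not to replace a proper splitter.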
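Standalone
from docling_haystack.converter import DoclingConverter
# With no arguments, the default export mode (ExportType.DOC_CHUNKS) is used.
converter = DoclingConverter()
# Paths may be local files or URLs; the result holds the Haystack Documents.
documents = converter.run(paths=["https://arxiv.org/pdf/2408.09869"])["documents"]
# Inspect the content of one of the resulting chunks.
print(repr(documents[2].content))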
# -> Abstract\nThis technical report introduces Docling [...]
In a Pipeline
Check out this notebook illustrating usage in a complete example with indexing and RAG pipelines.
License
MIT License.
