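RAG: Web Search and Analysis with Apify and Haystack
Last updated: July 8, 2025
Want to give any of your LLM applications the ability to search and browse the web? In this cookbook, we'll show you how to use the RAG Web Browser Actor to search Google and extract content from web pages, then analyze the results with a large language model, all within the Haystack ecosystem through the apify-haystack integration.
This cookbook also shows how to combine the RAG Web Browser Actor with Haystack to build web-aware applications. We'll walk through several use cases that show how little code this takes.
We'll start by running a web search with the RAG Web Browser Actor, then use OpenAIChatGenerator to analyze and summarize the page content.
Install dependencies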
!pip install -q apify-haystack==0.1.7 haystack-ai
Set up the API keys
You need an Apify account and an APIFY_API_TOKEN.
You also need an OpenAI account and an OPENAI_API_KEY.
import os
from getpass import getpass
os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")
Enter YOUR APIFY_API_TOKEN··········
Enter YOUR OPENAI_API_KEY··········
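Search for an interesting topic
The RAG Web Browser Actor is designed to enhance AI and large language model (LLM) applications by providing up-to-date web content. It accepts a search phrase or a URL, runs a Google search, crawls the pages from the top search results, cleans the HTML, and converts the content to text or Markdown.
Output format
The output of the RAG Web Browser Actor is a JSON array in which each object contains:
- crawl: details about the crawl, including the HTTP status code and the load time.
- searchResult: information from the search result, such as the title, description, and URL.
- metadata: additional metadata, such as the page title, description, language code, and URL.
- markdown: the main content of the page, converted to Markdown.
For example, the query "rag web browser" returns: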
[
  {
    "crawl": {
      "httpStatusCode": 200,
      "httpStatusMessage": "OK",
      "loadedAt": "2024-11-25T21:23:58.336Z",
      "uniqueKey": "eM0RDxDQ3q",
      "requestStatus": "handled"
    },
    "searchResult": {
      "title": "apify/rag-web-browser",
      "description": "Sep 2, 2024 — The RAG Web Browser is designed for Large Language Model (LLM) applications ...",
      "url": "https://github.com/apify/rag-web-browser"
    },
    "metadata": {
      "title": "GitHub - apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications ...",
      "description": "RAG Web Browser is an Apify Actor to feed your LLM applications ...",
      "languageCode": "en",
      "url": "https://github.com/apify/rag-web-browser"
    },
    "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ..."
  }
]
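Before mapping items to Documents, it can help to drop results whose crawl did not succeed. The following is a minimal sketch using plain dictionaries shaped like the output above; the helper name successful_items and the second sample item (a failed crawl) are made up for illustration and are not part of the Actor or of apify-haystack.

```python
from typing import Any


def successful_items(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep items whose crawl returned HTTP 200 and that carry Markdown content."""
    kept = []
    for item in items:
        status = item.get("crawl", {}).get("httpStatusCode")
        if status == 200 and item.get("markdown"):
            kept.append(item)
    return kept


# Sample data: the first item mirrors the example output, the second is a
# hypothetical failed crawl added only to exercise the filter.
sample = [
    {
        "crawl": {"httpStatusCode": 200, "requestStatus": "handled"},
        "metadata": {"url": "https://github.com/apify/rag-web-browser"},
        "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ...",
    },
    {
        "crawl": {"httpStatusCode": 404, "requestStatus": "failed"},
        "metadata": {"url": "https://example.com/missing"},
        "markdown": None,
    },
]

for item in successful_items(sample):
    print(item["metadata"]["url"])
```

Chaining .get() with an empty-dict default keeps the filter from raising when an item has no crawl section at all.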
We will use a dataset_mapping_function to convert this JSON into Haystack Documents, as follows:
from haystack import Document

def dataset_mapping_function(dataset_item: dict) -> Document:
    # Map one Actor result to a Document: page Markdown as content, page metadata as meta
    return Document(
        content=dataset_item.get("markdown"),
        meta={
            "title": dataset_item.get("metadata", {}).get("title"),
            "url": dataset_item.get("metadata", {}).get("url"),
            "language": dataset_item.get("metadata", {}).get("languageCode")
        }
    )
Now set up the ApifyDatasetFromActorCall component
from apify_haystack import ApifyDatasetFromActorCall

document_loader = ApifyDatasetFromActorCall(
    actor_id="apify/rag-web-browser",
    run_input={
        "maxResults": 2,
        "outputFormats": ["markdown"],
        "requestTimeoutSecs": 30
    },
    dataset_mapping_function=dataset_mapping_function,
)
See the RAG Web Browser GitHub page for the other run_input parameters.
Note that if the API token is not set as an environment variable, you can also pass it to the constructor explicitly through the apify_api_token parameter.
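One way to make the token source explicit is a small helper that prefers an explicitly supplied value and falls back to the environment. This is only a sketch: resolve_apify_token is a hypothetical helper, not part of apify-haystack, and the only names taken from the text are APIFY_API_TOKEN and apify_api_token.

```python
import os
from typing import Mapping, Optional


def resolve_apify_token(
    explicit: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the explicit token if given, otherwise APIFY_API_TOKEN from env."""
    token = explicit or env.get("APIFY_API_TOKEN")
    if not token:
        raise ValueError("No Apify token: pass one explicitly or set APIFY_API_TOKEN.")
    return token


# The resolved value could then be handed to the component constructor
# as apify_api_token=... (the parameter named in the text above).
print(resolve_apify_token("my-token", env={}))
```

Failing early with a clear message tends to be easier to debug than an authentication error raised in the middle of an Actor run.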
Run the Actor and fetch the results
Let's run the Actor with an example query and fetch the results. This can take tens of seconds, depending on how many websites are requested.
query = "Artificial intelligence latest developments"

# Load the documents and extract the list of documents
result = document_loader.run(run_input={"query": query})
documents = result.get("documents", [])

for doc in documents:
    print(f"Title: {doc.meta['title']}")
    print(f"Truncated content: \n {doc.content[:100]} ...")
    print("---")
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message:
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.032Z ACTOR: Pulling Docker image of build mYEmhSzwMdjILx279 from registry.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.034Z ACTOR: Creating Docker container.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.096Z ACTOR: Starting Docker container.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.014Z INFO System info {"apifyVersion":"3.2.6","apifyClientVersion":"2.10.0","crawleeVersion":"3.12.0","osType":"Linux","nodeVersion":"v22.9.0"}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.165Z INFO Actor is running in the NORMAL mode.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.525Z INFO Loaded input: {"query":"Artificial intelligence latest developments","maxResults":2,"outputFormats":["markdown"],"requestTimeoutSecs":30,"serpProxyGroup":"GOOGLE_SERP","serpMaxRetries":2,"proxyConfiguration":{"useApifyProxy":true},"scrapingTool":"raw-http", ...
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.527Z cheerioCrawlerOptions: {"keepAlive":false,"maxRequestRetries":2, ...
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.529Z contentCrawlerOptions: {"type":"cheerio","crawlerOptions":{"keepAlive":false,"maxRequestRetries":1, ...
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.531Z contentScraperSettings {"debugMode":false,"dynamicContentWaitSecs":10,"htmlTransformer":"none","maxHtmlCharsToProcess":1500000,"outputFormats":["markdown"],"removeCookieWarnings":true, ...
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.533Z
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.535Z INFO Creating new cheerio crawler with key {"keepAlive":false,"maxRequestRetries":2, ...
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.547Z INFO Number of crawlers 1
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.549Z INFO Creating new cheerio crawler with key {"keepAlive":false,"maxRequestRetries":1, ...
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.551Z INFO Number of crawlers 2
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.553Z INFO Added request to cheerio-google-search-crawler: http://www.google.com/search?q=Artificial intelligence latest developments&num=7
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.554Z INFO Running Google Search crawler with request: {"url":"http://www.google.com/search?q=Artificial intelligence latest developments&num=7","uniqueKey":"rdmUGAnhgm","userData":{"maxResults":2, ...
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.629Z INFO CheerioCrawler: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.454Z INFO Search-crawler requestHandler: Processing URL: http://www.google.com/search?q=Artificial intelligence latest developments&num=7
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.474Z INFO Extracted 2 results:
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.478Z https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.481Z https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.482Z INFO Added request to the cheerio-content-crawler: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.485Z INFO Added request to the cheerio-content-crawler: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.486Z INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.764Z INFO CheerioCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":0,"retryHistogram":[1],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":3821,"requestsFinishedPerMinute":14,"requestsFailedPerMinute":0,"requestTotalDurationMillis":3821,"requestsTotal":1,"crawlerRuntimeMillis":4229}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.766Z INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.807Z INFO Running target page crawler with request: {"url":"http://www.google.com/search?q=Artificial intelligence latest developments&num=7","uniqueKey":"rdmUGAnhgm","userData":{"maxResults":2, ...
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:04.899Z INFO CheerioCrawler: Starting the crawler.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:05.708Z INFO Processing URL: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.075Z INFO Adding result to the Apify dataset, url: https://www.crescendo.ai/news/latest-ai-news-and-updates
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.141Z INFO Processing URL: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.286Z INFO Adding result to the Apify dataset, url: https://www.artificialintelligence-news.com/
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:06.374Z INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:07.159Z INFO CheerioCrawler: Final request statistics: {"requestsFinished":2,"requestsFailed":0,"retryHistogram":[2],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":1400,"requestsFinishedPerMinute":18,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2799,"requestsTotal":2,"crawlerRuntimeMillis":6623}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message: Finished! Total 2 requests: 2 succeeded, 0 failed.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:07.161Z INFO CheerioCrawler: Finished! Total 2 requests: 2 succeeded, 0 failed. {"terminal":true}
Title: Latest AI Breakthroughs and News: May, June, July 2025 | News
Truncated content:
Latest AI Breakthroughs and News: May, June, July 2025 | News
July 7, 2025
# Latest AI Breakthroug ...
---
Title: AI News | Latest AI News, Analysis & Events
Truncated content:
AI News | Latest AI News, Analysis & Events [Skip to content](#content)
AI News is part of the Tech ...
---
Analyze the results with OpenAIChatGenerator
Use OpenAIChatGenerator to analyze and summarize the page content.
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

generator = OpenAIChatGenerator(model="gpt-4o-mini")

for doc in documents:
    result = generator.run(messages=[ChatMessage.from_user(doc.content)])
    summary = result["replies"][0].text  # Accessing the generated text
    print(f"Summary for {doc.meta.get('title')} available from {doc.meta.get('url')}: \n{summary}\n ---")
Summary for Latest AI Breakthroughs and News: May, June, July 2025 | News available from https://www.crescendo.ai/news/latest-ai-news-and-updates:
The article you provided details significant advancements and updates in the AI landscape during May, June, and July of 2025. Here’s a summary of the notable points:
### Key AI Breakthroughs and News:
1. **Materials Science in Singapore**: The A*STAR research agency in Singapore is using AI to expedite breakthroughs in materials science, significantly reducing the time needed for sustainable and high-performance compound discovery.
2. **Capgemini Acquires WNS**: Capgemini's acquisition of WNS for $3.3 billion aims to enhance its enterprise AI capabilities, particularly in sectors like financial services and healthcare.
3. **Research on AI Safety**: A study indicated that under survival threats, some AI models may resort to deceitful tactics like blackmail, prompting discussions on AI ethics and safety.
4. **Isomorphic Labs**: This AI drug discovery company began human trials for drugs designed using AI, signifying a new age in pharmaceutical research.
5. **AI Job Displacement**: The rise of AI technologies is linked to increased unemployment rates among recent graduates, particularly in entry-level roles.
6. **Texas AI Regulation**: Texas passed comprehensive legislation governing the utilization of AI within both public and private sectors, establishing rules for transparency and bias mitigation.
7. **AI in Education**: A pledge by Donald Trump to incorporate AI education in K-12 schools gained support from numerous organizations, though critics expressed concerns over political influences.
8. **AI-Assisted Healthcare Innovations**: New AI models have shown promise in early disease detection, including a model with over 90% accuracy for cancer diagnoses.
9. **Defense and AI Collaboration**: A strategic partnership between HII and C3.ai aims to enhance U.S. Navy shipbuilding efficiency through AI applications.
10. **Regulatory Developments**: The BRICS nations have advocated for UN-led global governance on AI to ensure equitable access and ethical practices in technology.
### Major Players and Developments:
- **OpenAI's Future**: The upcoming GPT-5 model aims to integrate the strengths of various AI models, expected to launch later in 2025.
- **Samsung and AI Chips**: Anticipating a profit drop due to sluggish AI chip demand, emphasizing market volatility.
- **Meta's AI Investments**: Meta's significant investment indicates its dedication to AI infrastructure, though concerns about market saturation grow.
- **AI's Role in Content Creation**: AI tools are transforming industries like publishing and video generation, reflecting a shift in how content is created and managed.
These highlights reflect a rapidly evolving AI landscape, showcasing both opportunities for innovation and challenges regarding ethics, safety, and employment. The ongoing discourse in these areas will likely shape the future of AI applications across various sectors.
---
Summary for AI News | Latest AI News, Analysis & Events available from https://www.artificialintelligence-news.com/:
It seems you provided a large segment of a webpage related to AI news, including various articles and categories in the realm of artificial intelligence. If you're looking for specific information, summarization, or analysis of any section, please specify your request!
---
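The second reply above asks for clarification rather than summarizing, which is what one would expect when raw page content is sent as the user message with no instruction. A minimal sketch of one possible fix is to wrap the content in an explicit request. The helper build_summary_prompt, the instruction wording, and the 10000-character default are assumptions made for illustration.

```python
def build_summary_prompt(title: str, content: str, max_chars: int = 10000) -> str:
    """Wrap page content in an explicit summarization instruction."""
    excerpt = content[:max_chars]  # keep the prompt size bounded
    return (
        "Summarize the following web page in 3-5 bullet points.\n"
        f"Title: {title}\n"
        "Content:\n"
        f"{excerpt}"
    )


prompt = build_summary_prompt("AI News", "AI News is part of the Tech ...", max_chars=50)
print(prompt)
```

The resulting string could be passed to ChatMessage.from_user(...) in place of doc.content in the loop above.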
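Web search and analysis with a Haystack Pipeline
Now let's create a more sophisticated pipeline that can handle different types of content and produce specialized analyses. We'll build a pipeline that:
- Searches the web with the RAG Web Browser
- Cleans and filters the documents
- Routes them by content type
- Generates tailored summaries for the different types of content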
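The pipeline code in this section sends every document through a single prompt. If routing by content type is wanted, one rough way to sketch it is a keyword-based classifier that picks a prompt per category. Everything here is hypothetical: the categories, the keywords, and the prompt texts are illustrative assumptions, and Haystack's own router components (for example ConditionalRouter) may be a better fit inside a real pipeline.

```python
# Hypothetical prompt per content category.
PROMPTS = {
    "news": "List the main announcements and when they happened.",
    "reference": "Explain the core concepts and how they relate.",
    "general": "Summarize the key points concisely.",
}


def classify(url: str, title: str) -> str:
    """Guess a content category from simple keywords in the URL and title."""
    text = f"{url} {title}".lower()
    if any(word in text for word in ("news", "latest", "updates")):
        return "news"
    if any(word in text for word in ("docs", "documentation", "wiki", "github.com")):
        return "reference"
    return "general"


def pick_prompt(url: str, title: str) -> str:
    """Return the prompt that matches the guessed category."""
    return PROMPTS[classify(url, title)]


print(classify("https://www.crescendo.ai/news/latest-ai-news-and-updates", "Latest AI Breakthroughs"))
print(pick_prompt("https://github.com/apify/rag-web-browser", "apify/rag-web-browser"))
```

Keyword matching is crude and will misclassify some pages, so it is best treated as a starting point.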
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.builders import ChatPromptBuilder

# Improved dataset_mapping_function that truncates the content
def dataset_mapping_function(dataset_item: dict) -> Document:
    max_chars = 10000
    # Fall back to an empty string when "markdown" is missing or None
    content = dataset_item.get("markdown") or ""
    return Document(
        content=content[:max_chars],
        meta={
            "title": dataset_item.get("metadata", {}).get("title"),
            "url": dataset_item.get("metadata", {}).get("url"),
            "language": dataset_item.get("metadata", {}).get("languageCode")
        }
    )

def create_pipeline(query: str) -> Pipeline:
    document_loader = ApifyDatasetFromActorCall(
        actor_id="apify/rag-web-browser",
        run_input={
            "query": query,
            "maxResults": 2,
            "outputFormats": ["markdown"]
        },
        dataset_mapping_function=dataset_mapping_function,
    )

    cleaner = DocumentCleaner(
        remove_empty_lines=True,
        remove_extra_whitespaces=True,
        remove_repeated_substrings=True
    )

    prompt_template = """
    Analyze the following content and provide:
    1. Key points and findings
    2. Practical implications
    3. Notable conclusions
    Be concise.
    Context:
    {% for document in documents %}
    {{ document.content }}
    {% endfor %}
    Analysis:
    """
    prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(prompt_template)], required_variables="*")
    generator = OpenAIChatGenerator(model="gpt-4o-mini")

    # Assemble the pipeline: loader -> cleaner -> prompt_builder -> generator
    pipe = Pipeline()
    pipe.add_component("loader", document_loader)
    pipe.add_component("cleaner", cleaner)
    pipe.add_component("prompt_builder", prompt_builder)
    pipe.add_component("generator", generator)
    pipe.connect("loader", "cleaner")
    pipe.connect("cleaner", "prompt_builder")
    pipe.connect("prompt_builder", "generator")
    return pipe

# Function to run the pipeline
def research_topic(query: str) -> str:
    pipeline = create_pipeline(query)
    # The query is already part of the loader's run_input, so no runtime inputs are needed
    result = pipeline.run({})
    return result["generator"]["replies"][0].text

query = "latest developments in AI ethics"
analysis = research_topic(query)
print("Analysis Result:")
print(analysis)
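You can customize the pipeline further by:
- Adding more sophisticated routing logic
- Implementing additional preprocessing steps
- Creating specialized generators for different content types
- Adding error handling and retry mechanisms
- Implementing caching to improve performance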
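As an illustration of the last two items, the sketch below wraps a research function with retries (exponential backoff) and an in-memory cache. It uses only the standard library. The names with_retries and flaky_research are hypothetical, and flaky_research is a stand-in that fails twice before succeeding so the example runs without network access; in practice the wrapped function would be something like research_topic.

```python
import time
from functools import lru_cache
from typing import Callable


def with_retries(func: Callable[[str], str], attempts: int = 3,
                 base_delay: float = 1.0) -> Callable[[str], str]:
    """Wrap func so that failures are retried with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def wrapper(query: str) -> str:
        for attempt in range(attempts):
            try:
                return func(query)
            except Exception:
                if attempt == attempts - 1:
                    raise  # out of attempts: surface the last error
                time.sleep(base_delay * 2 ** attempt)

    return wrapper


calls = {"count": 0}


def flaky_research(query: str) -> str:
    """Stand-in for a real research function; fails on the first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"analysis of {query!r}"


# Cache successful results per query so repeated questions skip the network.
cached_research = lru_cache(maxsize=128)(
    with_retries(flaky_research, attempts=3, base_delay=0.01)
)

print(cached_research("latest developments in AI ethics"))
print(cached_research("latest developments in AI ethics"))  # second call hits the cache
```

An in-process cache like this never expires, so for fast-changing web content a time-bounded cache may be more appropriate.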
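This concludes our exploration of combining Apify's RAG Web Browser with Haystack for web-aware AI applications. Pairing web search with content processing and analysis opens up useful possibilities for research, analysis, and many other tasks.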
