# Search and browse the web with Apify and Haystack

Want to give your LLM applications the power to search and browse the web? In this cookbook, we'll show you how to use the [RAG Web Browser Actor](https://apify.com/apify/rag-web-browser) to search Google and extract content from web pages, then analyze the results with a large language model, all within the Haystack ecosystem through the apify-haystack integration.

The cookbook walks through three steps that build on one another:

1. [Search interesting topics](#search-interesting-topics)
2. [Analyze the results with OpenAIChatGenerator](#analyze-the-results-with-openaichatgenerator)
3. [Use the Haystack Pipeline for web search and analysis](#use-the-haystack-pipeline-for-web-search-and-analysis)
   
**We'll start by using the RAG Web Browser Actor to perform web searches and then use the OpenAIChatGenerator to analyze and summarize the web content.**

## Install dependencies

In [None]:
!pip install -q apify-haystack==0.1.7 haystack-ai

## Set up the API keys

You need an Apify account and an [APIFY_API_TOKEN](https://docs.apify.com/platform/integrations/api).

You also need an OpenAI account and an [OPENAI_API_KEY](https://platform.openai.com/docs/quickstart).


In [2]:
import os
from getpass import getpass

os.environ["APIFY_API_TOKEN"] = getpass("Enter YOUR APIFY_API_TOKEN")
os.environ["OPENAI_API_KEY"] = getpass("Enter YOUR OPENAI_API_KEY")

Enter YOUR APIFY_API_TOKEN··········
Enter YOUR OPENAI_API_KEY··········
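
Before calling any paid API, it can help to confirm that both keys actually made it into the environment. The check below is a minimal sketch using only the standard library; the helper name `missing_keys` and the `REQUIRED_KEYS` tuple are hypothetical and not part of apify-haystack or Haystack.

```python
import os

# Names of the environment variables this cookbook relies on.
REQUIRED_KEYS = ("APIFY_API_TOKEN", "OPENAI_API_KEY")

def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]

missing = missing_keys()
if missing:
    print(f"Missing credentials: {', '.join(missing)}")
else:
    print("Both API keys are set.")
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.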


## Search interesting topics

The [RAG Web Browser Actor](https://apify.com/apify/rag-web-browser) is designed to enhance AI and Large Language Model (LLM) applications by providing up-to-date web content. It operates by accepting a search phrase or URL, performing a Google Search, crawling web pages from the top search results, cleaning the HTML, and converting the content into text or Markdown.  

### Output Format
The output from the RAG Web Browser Actor is a JSON array, where each object contains:
- **crawl**: Details about the crawling process, including HTTP status code and load time.
- **searchResult**: Information from the search result, such as the title, description, and URL.
- **metadata**: Additional metadata like the page title, description, language code, and URL.
- **markdown**: The main content of the page, converted into Markdown format.

> For example, the query `rag web browser` returns:

```json
[
    {
        "crawl": {
            "httpStatusCode": 200,
            "httpStatusMessage": "OK",
            "loadedAt": "2024-11-25T21:23:58.336Z",
            "uniqueKey": "eM0RDxDQ3q",
            "requestStatus": "handled"
        },
        "searchResult": {
            "title": "apify/rag-web-browser",
            "description": "Sep 2, 2024 — The RAG Web Browser is designed for Large Language Model (LLM) applications ...",
            "url": "https://github.com/apify/rag-web-browser"
        },
        "metadata": {
            "title": "GitHub - apify/rag-web-browser: RAG Web Browser is an Apify Actor to feed your LLM applications ...",
            "description": "RAG Web Browser is an Apify Actor to feed your LLM applications ...",
            "languageCode": "en",
            "url": "https://github.com/apify/rag-web-browser"
        },
        "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ..."
    }
]
```
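
To make the structure concrete, the sketch below parses a trimmed copy of the item above with the standard `json` module and reads the fields a downstream application is likely to need. The helper name `summarize_item` and the fallback from `metadata` to `searchResult` are illustrative assumptions, not behavior of the Actor or of apify-haystack.

```python
import json

# A trimmed copy of the sample output; "metadata.title" is left out on purpose
# to show the fallback to "searchResult.title".
raw = """
[
  {
    "crawl": {"httpStatusCode": 200, "requestStatus": "handled"},
    "searchResult": {
      "title": "apify/rag-web-browser",
      "url": "https://github.com/apify/rag-web-browser"
    },
    "metadata": {
      "languageCode": "en",
      "url": "https://github.com/apify/rag-web-browser"
    },
    "markdown": "# apify/rag-web-browser: RAG Web Browser is an Apify Actor ..."
  }
]
"""

def summarize_item(item: dict) -> dict:
    """Extract a few fields, tolerating missing or null sections."""
    crawl = item.get("crawl") or {}
    metadata = item.get("metadata") or {}
    search = item.get("searchResult") or {}
    return {
        "status": crawl.get("httpStatusCode"),
        "title": metadata.get("title") or search.get("title"),
        "url": metadata.get("url") or search.get("url"),
        "chars": len(item.get("markdown") or ""),
    }

items = json.loads(raw)
for item in items:
    print(summarize_item(item))
```

Reading every section through `.get(...) or {}` keeps the helper from raising when a page fails to load and the Actor omits part of the record.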

We will convert this JSON to a Haystack Document using the `dataset_mapping_function` as follows:

In [3]:
from haystack import Document

def dataset_mapping_function(dataset_item: dict) -> Document:
    return Document(
        content=dataset_item.get("markdown"),
        meta={
            "title": dataset_item.get("metadata", {}).get("title"),
            "url": dataset_item.get("metadata", {}).get("url"),
            "language": dataset_item.get("metadata", {}).get("languageCode")
        }
    )

Now set up the `ApifyDatasetFromActorCall` component:

In [4]:
from apify_haystack import ApifyDatasetFromActorCall

document_loader = ApifyDatasetFromActorCall(
    actor_id="apify/rag-web-browser",
    run_input={
        "maxResults": 2,
        "outputFormats": ["markdown"],
        "requestTimeoutSecs": 30
    },
    dataset_mapping_function=dataset_mapping_function,
)

Check out the other `run_input` parameters in the [GitHub README for the RAG Web Browser](https://github.com/apify/rag-web-browser?tab=readme-ov-file#query-parameters).

Note that if the API token is not set as an environment variable, you can also pass it manually to the constructor through the named parameter `apify_api_token`.

### Run the Actor and fetch results

Let's run the Actor with a sample query and fetch the results. The process may take tens of seconds, depending on the number of websites requested.

In [5]:
query = "Artificial intelligence latest developments"

# Run the Actor and extract the list of documents from the result
result = document_loader.run(run_input={"query": query})
documents = result.get("documents", [])

for doc in documents:
    print(f"Title: {doc.meta['title']}")
    print(f"Truncated content:  \n {doc.content[:100]} ...")
    print("---")

[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> Status: RUNNING, Message: 
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.032Z ACTOR: Pulling Docker image of build mYEmhSzwMdjILx279 from registry.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.034Z ACTOR: Creating Docker container.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:24:58.096Z ACTOR: Starting Docker container.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.014Z INFO  System info {"apifyVersion":"3.2.6","apifyClientVersion":"2.10.0","crawleeVersion":"3.12.0","osType":"Linux","nodeVersion":"v22.9.0"}
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.165Z INFO  Actor is running in the NORMAL mode.
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12] -> 2025-07-08T12:25:00.533Z
[apify.rag-web-browser runId:B5Vp3SdMsCBgdXh12]

Title: Latest AI Breakthroughs and News: May, June, July 2025 | News
Truncated content:  
 Latest AI Breakthroughs and News: May, June, July 2025 | News

July 7, 2025

# Latest AI Breakthroug ...
---
Title: AI News | Latest AI News, Analysis & Events
Truncated content:  
 AI News | Latest AI News, Analysis & Events [Skip to content](#content)

AI News is part of the Tech ...
---


## Analyze the results with OpenAIChatGenerator

Use the OpenAIChatGenerator to analyze and summarize the web content.

In [9]:
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack.dataclasses import ChatMessage

generator = OpenAIChatGenerator(model="gpt-4o-mini")

for doc in documents:
    result = generator.run(messages=[ChatMessage.from_user(doc.content)])
    summary = result["replies"][0].text  # Accessing the generated text
    print(f"Summary for {doc.meta.get('title')} available from {doc.meta.get('url')}: \n{summary}\n ---")

Summary for Latest AI Breakthroughs and News: May, June, July 2025 | News available from https://www.crescendo.ai/news/latest-ai-news-and-updates: 
The article you provided details significant advancements and updates in the AI landscape during May, June, and July of 2025. Here’s a summary of the notable points:

### Key AI Breakthroughs and News:

1. **Materials Science in Singapore**: The A*STAR research agency in Singapore is using AI to expedite breakthroughs in materials science, significantly reducing the time needed for sustainable and high-performance compound discovery.

2. **Capgemini Acquires WNS**: Capgemini's acquisition of WNS for $3.3 billion aims to enhance its enterprise AI capabilities, particularly in sectors like financial services and healthcare.

3. **Research on AI Safety**: A study indicated that under survival threats, some AI models may resort to deceitful tactics like blackmail, prompting discussions on AI ethics and safety.

4. **Isomorphic Labs**: This AI d

## Use the Haystack Pipeline for web search and analysis

Now let's combine these steps into a single Haystack pipeline that runs end to end. The pipeline:

1. Searches the web using the RAG Web Browser
2. Cleans the retrieved documents
3. Builds a prompt from the cleaned content
4. Generates a structured analysis with the LLM
   

In [17]:
from haystack import Pipeline
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.builders import ChatPromptBuilder

# Improved dataset_mapping_function with truncation of the content
def dataset_mapping_function(dataset_item: dict) -> Document:
    max_chars = 10000
    content = dataset_item.get("markdown") or ""  # guard against a missing or null field
    return Document(
        content=content[:max_chars],
        meta={
            "title": dataset_item.get("metadata", {}).get("title"),
            "url": dataset_item.get("metadata", {}).get("url"),
            "language": dataset_item.get("metadata", {}).get("languageCode")
        }
    )

def create_pipeline(query: str) -> Pipeline:

    document_loader = ApifyDatasetFromActorCall(
        actor_id="apify/rag-web-browser",
        run_input={
            "query": query,
            "maxResults": 2,
            "outputFormats": ["markdown"]
        },
        dataset_mapping_function=dataset_mapping_function,
    )

    cleaner = DocumentCleaner(
        remove_empty_lines=True,
        remove_extra_whitespaces=True,
        remove_repeated_substrings=True
    )

    prompt_template = """
    Analyze the following content and provide:
    1. Key points and findings
    2. Practical implications
    3. Notable conclusions
    Be concise.

    Context:
    {% for document in documents %}
        {{ document.content }}
    {% endfor %}

    Analysis:
    """

    prompt_builder = ChatPromptBuilder(template=[ChatMessage.from_user(prompt_template)], required_variables="*")

    generator = OpenAIChatGenerator(model="gpt-4o-mini")

    pipe = Pipeline()
    pipe.add_component("loader", document_loader)
    pipe.add_component("cleaner", cleaner)
    pipe.add_component("prompt_builder", prompt_builder)
    pipe.add_component("generator", generator)

    pipe.connect("loader", "cleaner")
    pipe.connect("cleaner", "prompt_builder")
    pipe.connect("prompt_builder", "generator")

    return pipe

# Function to run the pipeline
def research_topic(query: str) -> str:
    pipeline = create_pipeline(query)
    result = pipeline.run({})
    return result["generator"]["replies"][0].text

In [None]:
query = "latest developments in AI ethics"
analysis = research_topic(query)

In [None]:
print("Analysis Result:")
print(analysis)

You can customize the pipeline further by:
- Adding routing logic based on content type
- Implementing additional preprocessing steps
- Creating specialized generators for different content types
- Adding error handling and retries
- Implementing caching for improved performance
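
As a rough illustration of the last two items, the sketch below wraps a research call in a retry loop with exponential backoff and memoizes results per query with `functools.lru_cache`. It is a minimal sketch under stated assumptions: `with_retries`, `flaky_research`, and `cached_research` are hypothetical names, and `flaky_research` is a stand-in that simulates two transient failures so the example runs without network access. In the notebook, it would be replaced by `research_topic`.

```python
import time
from functools import lru_cache

def with_retries(func, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, doubling the delay after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))

# Hypothetical stand-in for research_topic: fails twice, then succeeds.
calls = {"count": 0}

def flaky_research(query: str) -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"analysis of {query!r}"

@lru_cache(maxsize=32)
def cached_research(query: str) -> str:
    """Retry on transient errors and remember the result for repeated queries."""
    return with_retries(
        lambda: flaky_research(query),
        attempts=3,
        base_delay=0.01,
        exceptions=(ConnectionError,),
    )

print(cached_research("latest developments in AI ethics"))
print(cached_research("latest developments in AI ethics"))  # served from the cache
print(f"Underlying calls made: {calls['count']}")
```

Because `lru_cache` only stores successful return values, a query that exhausts its retries is attempted again on the next call rather than caching the failure. An in-memory cache like this also returns stale results for as long as the process lives, which matters for news-style queries.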

This completes our exploration of using Apify's RAG Web Browser with Haystack for web-aware AI applications. The combination of web search capabilities with sophisticated content processing and analysis creates powerful possibilities for research, analysis and many other tasks.