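Getting a Daily Digest From Tech Websites
Last Updated: September 24, 2024
Motivation: We want to stay on top of the latest developments in tech. However, with so many websites and so much news published every day, it is impossible to keep up with everything. But what if we could summarize the latest developments, and do it all by running an off-the-shelf LLM locally with a few lines of code?
Let's see how Haystack, combined with TitanML's Takeoff Inference Server, can help us achieve this.
Running the Titan Takeoff Inference Server Image
Remember that you must download this notebook and run it in your local environment. The Titan Takeoff Inference Server lets you run modern open-source LLMs on your own infrastructure.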
docker run --gpus all -e TAKEOFF_MODEL_NAME=TheBloke/Llama-2-7B-Chat-AWQ \
-e TAKEOFF_DEVICE=cuda \
-e TAKEOFF_MAX_SEQUENCE_LENGTH=256 \
-it \
-p 3000:3000 tytn/takeoff-pro:0.11.0-gpu
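Before building the pipeline, it can help to confirm that something is listening on the published port. The following is a minimal sketch, not part of Takeoff or Haystack: `is_port_open` is a hypothetical helper, and the host and port are assumptions taken from the `-p 3000:3000` mapping above. An open port only shows that the container accepts connections; the model may still be loading.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved hosts.
        return False


if __name__ == "__main__":
    # Assumption: the container is published on localhost:3000, per `-p 3000:3000`.
    print("Takeoff port reachable:", is_port_open("localhost", 3000))
```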
Daily Digest From Top Tech Websites Using Deepset Haystack and Titan Takeoff
!pip install feedparser
!pip install takeoff_haystack
from typing import Dict, List
from haystack import Document, Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
import feedparser
from takeoff_haystack import TakeoffGenerator
# Dict of website RSS feeds
urls = {
'theverge': 'https://www.theverge.com/rss/frontpage/',
'techcrunch': 'https://techcrunch.com/feed',
'mashable': 'https://mashable.com/feeds/rss/all',
'cnet': 'https://cnet.com/rss/news',
'engadget': 'https://engadget.com/rss.xml',
'zdnet': 'https://zdnet.com/news/rss.xml',
'venturebeat': 'https://feeds.feedburner.com/venturebeat/SZYF',
'readwrite': 'https://readwrite.com/feed/',
'wired': 'https://wired.com/feed/rss',
'gizmodo': 'https://gizmodo.com/rss',
}
# Configurable parameters
NUM_WEBSITES = 3
NUM_TITLES = 1
def get_titles(urls: Dict[str, str], num_sites: int, num_titles: int) -> List[str]:
    """Return the first `num_titles` headlines from each of the first `num_sites` feeds."""
    titles: List[str] = []
    sites = list(urls.keys())[:num_sites]
    for site in sites:
        # Download and parse the RSS feed for this site
        feed = feedparser.parse(urls[site])
        entries = feed.entries[:num_titles]
        for entry in entries:
            titles.append(entry.title)
    return titles
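Feeds sometimes carry entries without a title, and different sites can syndicate the same headline. As a hedged sketch of one way to handle that, the variant below skips empty titles and drops duplicates. `get_unique_titles` and its `parse` parameter are hypothetical names introduced here so the function can be exercised offline; by default it falls back to `feedparser.parse`, as `get_titles` does.

```python
from types import SimpleNamespace
from typing import Callable, Dict, List, Optional


def get_unique_titles(
    urls: Dict[str, str],
    num_sites: int,
    num_titles: int,
    parse: Optional[Callable] = None,
) -> List[str]:
    """Like get_titles, but skips entries without a title and drops duplicates."""
    if parse is None:
        # Imported lazily so the function can be tried without feedparser installed.
        import feedparser

        parse = feedparser.parse

    seen = set()
    titles: List[str] = []
    for site in list(urls)[:num_sites]:
        feed = parse(urls[site])
        for entry in getattr(feed, "entries", [])[:num_titles]:
            title = (getattr(entry, "title", "") or "").strip()
            if title and title not in seen:
                seen.add(title)
                titles.append(title)
    return titles


# Offline demo: a stub parser stands in for feedparser.parse.
stub_feeds = {
    "a": SimpleNamespace(entries=[SimpleNamespace(title="Same headline")]),
    "b": SimpleNamespace(entries=[SimpleNamespace(title="Same headline ")]),
}
print(get_unique_titles({"a": "a", "b": "b"}, 2, 1, parse=lambda url: stub_feeds[url]))
# prints ['Same headline']
```

Passing the parser in as an argument is a design choice for testability only; the notebook's own `get_titles` is sufficient when the feeds are well formed.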
titles = get_titles(urls, NUM_WEBSITES, NUM_TITLES)
document_store = InMemoryDocumentStore()
document_store.write_documents(
    [
        Document(content=title) for title in titles
    ]
)
template = """
HEADLINES:
{% for document in documents %}
{{ document.content }}
{% endfor %}
REQUEST: {{ query }}
"""
pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
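# Assumption: the Takeoff container started above is reachable on localhost; adjust base_url if it runs elsewhere.
pipe.add_component("llm", TakeoffGenerator(base_url="http://localhost", port=3000))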
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")
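To get a feel for what the `prompt_builder` step hands to the LLM, the template can be approximated in plain Python. This is only an illustrative sketch: `preview_prompt` is a hypothetical helper, it does not call Jinja or Haystack, and the exact whitespace of the real rendered prompt may differ.

```python
from typing import List


def preview_prompt(headlines: List[str], query: str) -> str:
    """Approximate the text the template produces: headlines first, then the request."""
    lines = ["HEADLINES:"]
    lines.extend(headlines)
    lines.append(f"REQUEST: {query}")
    return "\n".join(lines)


# Placeholder headlines, used only to show the layout.
print(preview_prompt(["Headline one", "Headline two"], "Summarize each headline in three words."))
```

With `TAKEOFF_MAX_SEQUENCE_LENGTH=256` set on the server, a preview like this is also a quick way to judge how long the prompt gets as `NUM_WEBSITES` and `NUM_TITLES` grow.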
query = f"Summarize each of the {NUM_WEBSITES * NUM_TITLES} provided headlines in three words."
# Join the fetched headlines into one string to inspect them
titles_string = " - ".join(titles)
titles_string
'Two words: poker roguelike - Former Twitter engineers are building Particle, an AI-powered news reader - Best laptops of MWC 2024, including a 2-in-1 that broke a world record'
response = pipe.run({"prompt_builder": {"query": query}, "retriever": {"query": query}})
print(response["llm"]["replies"])
Ranking by BM25...: 0%| | 0/1 [00:00<?, ? docs/s]
['\n\n\nANSWER:\n\n1. Poker Roguelike - Exciting gameplay\n2. AI-powered news reader - Personalized feed\n3. Best laptops MWC 2024 - Powerful devices']
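The reply arrives as a single string, so a small post-processing step can turn it into one summary per headline. The sketch below assumes the model answers with a numbered list, as in the sample output above; that format is not guaranteed, and `split_numbered_reply` is a hypothetical helper, not part of Haystack or Takeoff.

```python
import re
from typing import List


def split_numbered_reply(reply: str) -> List[str]:
    """Extract the text of each line that starts with 'N.' or 'N)'."""
    pattern = re.compile(r"^[ \t]*\d+[.)][ \t]*(.+)$", flags=re.MULTILINE)
    return [match.group(1).strip() for match in pattern.finditer(reply)]


# Sample reply copied from the output shown above.
sample = (
    "\n\n\nANSWER:\n\n"
    "1. Poker Roguelike - Exciting gameplay\n"
    "2. AI-powered news reader - Personalized feed\n"
    "3. Best laptops MWC 2024 - Powerful devices"
)
for item in split_numbered_reply(sample):
    print(item)
```

In the notebook, the same function could be applied to `response["llm"]["replies"][0]`; if the list comes back empty, the model used a different layout and the raw reply should be shown instead.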
