URL
Fetches and extracts content from web pages using URLs
How it Works
The URL component takes one or more web addresses, makes a request to each page, and pulls the page’s text.
If you provide CSS selectors in the Selectors to extract field, the component uses BeautifulSoup to keep only the parts of the page that match those selectors.
The result is a list of Data objects that contain the extracted text and some basic page metadata (URL, title, etc.). No external APIs are called; everything happens inside Nappai.
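The selector step described above can be sketched roughly as follows. This is an illustrative sketch, not Nappai's actual code: the function name `extract`, the sample HTML, and the plain dictionary standing in for a Data object are all assumptions. It works on an HTML string that has already been fetched, so it makes no network request.

```python
from bs4 import BeautifulSoup

def extract(html: str, url: str, selectors: str = "") -> dict:
    """Return page text plus basic metadata (hypothetical stand-in for a Data object)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    if selectors.strip():
        # Keep only the elements that match the CSS selectors, in document order.
        nodes = soup.select(selectors)
        text = "\n".join(node.get_text(" ", strip=True) for node in nodes)
    else:
        # No selectors given: keep the text of the whole page.
        text = soup.get_text(" ", strip=True)
    return {"text": text, "metadata": {"url": url, "title": title}}

# Made-up sample page used only for this illustration.
SAMPLE_HTML = """<html><head><title>Demo</title></head>
<body><h1>Headline</h1><p class="article">Body text.</p><p>Footer note.</p></body></html>"""

selected = extract(SAMPLE_HTML, "https://example.com", "h1, p.article")
full_page = extract(SAMPLE_HTML, "https://example.com")
print(selected["text"])
```

With the selectors `h1, p.article`, the unclassed footer paragraph is dropped; with no selectors, the whole page text is kept.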
Inputs
- Selectors to extract: Enter CSS selectors (e.g., `h1, p.article`) to keep only specific parts of the page. Leave blank to keep the whole page.
- URLs: Enter one or more URLs; click the "+" button to add more. URLs can be placed on separate lines or separated by commas.
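As a rough illustration of how a URLs value with mixed separators could be turned into a list, here is a minimal sketch. The helper name `split_urls` is hypothetical, and the exact parsing rules Nappai applies are not documented here, so treat this as an assumption about the behavior.

```python
import re

def split_urls(raw: str) -> list[str]:
    """Split a raw field value on commas and newlines, dropping empty entries."""
    parts = re.split(r"[,\n]+", raw)
    return [part.strip() for part in parts if part.strip()]

urls = split_urls("https://example.com, https://example.org\nhttps://example.net")
print(urls)
```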
Outputs
- Data: A list of Data objects. Each object holds the page content that was fetched (or the extracted parts) and metadata such as the source URL and page title. This output can be fed into other components like text processors, storage modules, or AI models.
Usage Example
- Drag the URL component onto the canvas.
- In the URLs field, type `https://example.com`.
- Leave Selectors to extract empty to get the full page.
- Connect the Data output to a Text Splitter component to break the content into smaller chunks.
- (Optional) If you only want headlines and paragraphs, set Selectors to extract to `h1, p` and run the workflow again.
Related Components
- Text Splitter – Breaks large text blocks into smaller pieces for easier processing.
- Document Loader – Loads documents from various sources; the URL component is a specialized loader for web pages.
- BeautifulSoup Transformer – The underlying tool that extracts content based on CSS selectors.
Tips and Best Practices
- Always include the protocol (`http://` or `https://`). The component adds `http://` automatically if it is missing, but adding it yourself avoids confusion.
- Use specific selectors to reduce the amount of data you process, which speeds up downstream steps.
- If you need to fetch many URLs, consider batching them in separate runs to avoid hitting rate limits or overloading the system.
- Test the component with a single URL first to confirm the selectors work before scaling up.
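Two of the tips above, adding a missing protocol and splitting a long URL list into smaller runs, can be sketched as below if you prepare URL lists outside Nappai. The helper names `ensure_protocol` and `batches` and the batch size of 2 are hypothetical choices for illustration; the component's own normalization may differ in detail.

```python
def ensure_protocol(url: str) -> str:
    """Prefix http:// when the URL has no http:// or https:// protocol."""
    url = url.strip()
    if url.startswith(("http://", "https://")):
        return url
    return "http://" + url

def batches(items: list, size: int):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

raw_urls = ["example.com", "https://example.org", "http://example.net"]
normalized = [ensure_protocol(u) for u in raw_urls]
groups = list(batches(normalized, 2))  # batch size chosen arbitrarily
print(groups)
```

Each group can then be pasted into the URLs field for a separate run.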
Security Considerations
- The component fetches content from any URL you provide, so avoid entering URLs that point to untrusted or malicious sites.
- Nappai runs the requests in a sandboxed environment, but always validate URLs against your organization’s security policies.
- If your workflow includes sensitive data, avoid sending the fetched content to external services without proper encryption or access controls.
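One way to validate URLs against a policy before entering them is an allowlist check. The sketch below is only an example of such a check, not a Nappai feature: the `ALLOWED_HOSTS` set and the `is_allowed` helper are hypothetical, and your organization's policy may require different rules.

```python
from urllib.parse import urlparse

# Hypothetical allowlist; replace with the hosts your policy permits.
ALLOWED_HOSTS = {"example.com", "docs.example.com"}

def is_allowed(url: str) -> bool:
    """Accept only http(s) URLs whose host is on the allowlist."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = parsed.hostname or ""
    return host in ALLOWED_HOSTS

candidates = [
    "https://example.com/page",
    "ftp://example.com/file",
    "http://192.168.0.1/admin",
]
approved = [u for u in candidates if is_allowed(u)]
print(approved)
```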