URL Advanced
URL Advanced is a simple tool that lets you pull useful information from web pages.
You can give it a list of web addresses or a set of data records that contain URLs, and it will return the page title, the main text, any images, and any links found on the page.
The component is handy for building dashboards that need to display or analyze content from the web.
How it Works
When you run the component, it first checks each URL you provide.
If a URL doesn’t start with “http://” or “https://”, the component automatically adds “http://” so the address is valid.
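If you want to mimic that normalization outside Nappai, a minimal sketch looks like the following (the helper name normalize_url is hypothetical, not part of the component):

```python
def normalize_url(url: str) -> str:
    """Prepend a protocol when one is missing, mirroring the component's behavior."""
    url = url.strip()
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    return url

print(normalize_url("example.com"))  # -> "http://example.com"
```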
It then uses a small web‑scraping helper called UrlWebScrapper to load the page and look for the parts you want:
- Title – the page’s headline.
- Text – the main body of the page.
- Images – any image URLs that match the selectors you give.
- Links – any hyperlink URLs that match the selectors you give.
You can tell the scraper which parts of the page to look at by entering CSS selectors for text, images, and links.
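The internals of UrlWebScrapper aren’t exposed in this documentation, but conceptually the selector-based extraction works like the sketch below. It assumes the requests and BeautifulSoup libraries and a hypothetical scrape_page function, purely for illustration:

```python
import requests
from bs4 import BeautifulSoup

def scrape_page(url: str, text_sel: str, image_sel: str, link_sel: str) -> dict:
    """Fetch one page and pull out title, text, image URLs, and link URLs via CSS selectors."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "title": soup.title.string.strip() if soup.title and soup.title.string else "",
        "text": " ".join(el.get_text(strip=True) for el in soup.select(text_sel)),
        "images": [img.get("src") for img in soup.select(image_sel) if img.get("src")],
        "links": [a.get("href") for a in soup.select(link_sel) if a.get("href")],
        "source": url,
    }

# Example: paragraphs and articles as text, every image, every link
result = scrape_page("https://example.com", "p, article", "img", "a")
```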
If you give it a list of data records, it will look for a field named by URL Data Key (default is “url”) and use that as the address to scrape.
The component can handle many URLs at once, but it limits the number of simultaneous requests to avoid overloading the web server.
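The exact limit and HTTP client the component uses aren’t documented here; the sketch below only illustrates the general idea of capping simultaneous requests, assuming aiohttp and an arbitrary limit of 5:

```python
import asyncio
import aiohttp

MAX_CONCURRENT = 5  # assumed cap; the component's actual limit is not documented here

async def fetch_one(session: aiohttp.ClientSession, url: str, sem: asyncio.Semaphore) -> str:
    # The semaphore ensures no more than MAX_CONCURRENT requests run at the same time.
    async with sem:
        async with session.get(url) as resp:
            return await resp.text()

async def fetch_all(urls: list[str]) -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_one(session, u, sem) for u in urls))

pages = asyncio.run(fetch_all(["https://example.com", "https://openai.com"]))
```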
The result is a list of Data objects.
Each object contains:
- title – the page title
- text – the extracted text (plus any images or links you asked for)
- links – a list of URLs found on the page
- images – a list of image URLs
- source – the original URL that was scraped
These objects can be fed into other components in your Nappai workflow.
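As a mental model, one output record can be pictured roughly as the dictionary below; the values are illustrative, not literal component output:

```python
# Illustrative shape of one output Data object (values are examples only)
scraped_page = {
    "title": "Example Domain",                           # the page title
    "text": "This domain is for use in examples...",     # extracted text blocks joined together
    "links": ["https://www.iana.org/domains/example"],   # URLs matched by the link selectors
    "images": [],                                        # image URLs matched by the image selectors
    "source": "https://example.com",                     # the URL that was scraped
}
```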
Inputs
- Data: Data records that contain the URL to fetch. Provide a list of records that include a URL field; the component reads the URL from the field named by URL Data Key.
- Base URL: Base URL to use for the URLs. If your URLs are relative, enter the base address here so the component can build full addresses.
- image selectors: Selectors to extract images. Enter one or more CSS selectors that match the images you want to capture.
- link selectors: Selectors to extract links. Enter one or more CSS selectors that match the hyperlinks you want to capture.
- text selectors: Selectors to extract text. Enter one or more CSS selectors that match the text blocks you want to capture.
- URL Data Key: Key used to read the URL from each data record. This is the name of the field inside each record that holds the URL; the default is “url”.
- URLs: Enter one or more URLs by clicking the “+” button. Type or paste web addresses here, separating multiple URLs with commas or adding them one at a time with the “+” button.
Outputs
- Data (method: fetch_content)
A list of Data objects, each containing the scraped title, text, links, images, and source URL.
These can be used to feed other components, display in a dashboard, or store for later analysis.
Usage Example
- Drag the URL Advanced component onto your canvas.
- In the URLs field, click “+” and paste https://example.com and https://openai.com.
- (Optional) In text selectors, type “p, article” to grab paragraph and article tags.
- In image selectors, type “img” to capture all images.
- In link selectors, type “a” to capture all links.
- Connect the Data output to a Table component to display the results in a grid.
The component will load each page, extract the requested information, and output a table that shows the title, a preview of the text, and lists of images and links.
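If you’d like to confirm that the example selectors actually match something before building the workflow, a quick check outside Nappai (assuming requests and BeautifulSoup) might look like this:

```python
import requests
from bs4 import BeautifulSoup

# Count how many elements each example selector matches on one of the pages.
soup = BeautifulSoup(requests.get("https://example.com", timeout=10).text, "html.parser")
print("text blocks:", len(soup.select("p, article")))
print("images:", len(soup.select("img")))
print("links:", len(soup.select("a")))
```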
Related Components
- Web Scraper – A more general scraper that lets you extract arbitrary HTML elements.
- URL List – Generates a list of URLs that can be fed into URL Advanced.
- Data Filter – Filters the output Data objects based on conditions (e.g., only keep pages with more than 5 images).
Tips and Best Practices
- Keep the number of URLs moderate; the component can handle many, but very large lists may take time.
- Use specific CSS selectors to avoid pulling too much data.
- If you’re scraping a site that requires authentication, make sure the URLs include the necessary session cookies or tokens.
- Test with a single URL first to confirm the selectors work before scaling up.
- Remember that the component adds “http://” if a protocol is missing, so URLs like example.com will still work.
Security Considerations
- Only scrape URLs from trusted sources to avoid exposing your system to malicious content.
- The component does not store or log the fetched page content, but the data you receive can contain sensitive information. Handle it according to your organization’s data‑handling policies.
- If you’re running the component in a shared environment, be mindful of the load it places on external websites; excessive requests can trigger rate limits or bans.