URL

This component in Nappai lets you extract specific information from one or more websites. Think of it as a web scraper that pulls only the data you need. You’ll give it website addresses and instructions on what to find, and it will return that information for you to use in your automation workflows.

Relationship with BeautifulSoup

This component uses BeautifulSoup, a powerful tool for parsing web pages. It helps the component understand the structure of a website and extract the exact information you specify, even from complex web pages. You don’t need to know about BeautifulSoup to use this component; it handles all the technical details for you.

Inputs

  • URLs: Enter the web addresses (URLs) of the websites you want to extract data from. To process several sites, separate the URLs with commas (e.g., www.example.com, www.anothersite.com). If you omit the “http://” prefix, Nappai adds it automatically.
  • Selectors to extract: This is where you tell the component exactly what information to grab from each website. This uses CSS selectors, a way to pinpoint specific parts of a webpage (like a specific paragraph or a table). Don’t worry if you don’t know CSS; you can often find the selectors by inspecting the webpage’s source code (usually by right-clicking and selecting “Inspect” or “Inspect Element”). If you leave this blank, the component will return the entire page content.
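
The URL handling described above can be pictured with a short Python sketch. This is an illustration only, not Nappai's actual code: the function name normalize_urls is hypothetical, and leaving “https://” addresses unchanged is an assumption.

```python
def normalize_urls(raw: str) -> list[str]:
    """Split a comma-separated string of URLs and add a scheme where missing."""
    urls = []
    for part in raw.split(","):
        url = part.strip()
        if not url:
            continue  # skip empty entries, e.g. after a trailing comma
        # Assumption: addresses that already carry a scheme are left as-is.
        if not url.startswith(("http://", "https://")):
            url = "http://" + url
        urls.append(url)
    return urls


print(normalize_urls("www.example.com, https://www.anothersite.com"))
# ['http://www.example.com', 'https://www.anothersite.com']
```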

Outputs

The component produces a list of “Data” objects. Each object contains the information extracted from a website. This data can then be used by other components in Nappai to perform further actions, such as summarizing the text, analyzing it, or storing it.

Usage Example

Let’s say you want to get the title and the main article text from a news website.

  1. In the “URLs” input, you would enter the URL of the news article (e.g., https://www.example-news.com/article).
  2. In the “Selectors to extract” input, you would enter the CSS selectors to find the title and the article text (e.g., #article-title, #article-body). You might need to inspect the website’s source code to find the correct selectors.
  3. The component will then return a “Data” object containing the title and the article text. You can then use other Nappai components to, for example, summarize the article or send it in an email.
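
For readers curious about what happens behind the scenes, the steps above roughly correspond to the following BeautifulSoup sketch. It runs against a hard-coded sample page instead of a live site. The extract function, the sample HTML, and the shape of the returned dictionary are all hypothetical. Splitting the selector input on commas is also an assumption, and the real “Data” object may be structured differently.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page; a real run would fetch this over HTTP.
SAMPLE_HTML = """
<html><body>
  <h1 id="article-title">Local Team Wins Final</h1>
  <div id="article-body"><p>The match ended 2-1.</p><p>Fans celebrated downtown.</p></div>
</body></html>
"""


def extract(html: str, selectors: str) -> dict[str, list[str]]:
    """Return the text matched by each comma-separated CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    wanted = [s.strip() for s in selectors.split(",") if s.strip()]
    if not wanted:
        # No selectors given: fall back to the text of the whole page.
        return {"page": [soup.get_text(" ", strip=True)]}
    return {
        sel: [el.get_text(" ", strip=True) for el in soup.select(sel)]
        for sel in wanted
    }


result = extract(SAMPLE_HTML, "#article-title, #article-body")
print(result["#article-title"])  # ['Local Team Wins Final']
print(result["#article-body"])   # ['The match ended 2-1. Fans celebrated downtown.']
```

A selector that matches nothing yields an empty list here, which is a useful signal that the selector needs adjusting.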

Templates

This component is used in the following Nappai templates:

  • Blog Writer
  • URL content to X

Its output is also commonly passed on to other Nappai components:

  • Summarizer: Use this to create a concise summary of the extracted text.
  • Categorizer: Use this to automatically categorize the extracted information.
  • Entities extraction: Use this to extract specific entities (like names, dates, locations) from the extracted text.
  • Google Drive Writer: Save the extracted data to a Google Drive file.
  • Many more: The extracted data can be used as input for a wide variety of other Nappai components.

Tips and Best Practices

  • Inspect the webpage: Use your browser’s developer tools to find the correct CSS selectors for the information you want to extract.
  • Test with one URL first: Before processing many URLs, test the component with a single URL to ensure it’s extracting the correct information.
  • Be mindful of website terms of service: Always respect the website’s robots.txt file and terms of service when scraping data. Excessive scraping can overload a website’s server.
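
If you want to check a site's robots.txt rules yourself before adding its URL, Python's standard library can do it. The sketch below parses a sample robots.txt held in a string, so it runs without network access. The is_allowed helper and the sample rules are hypothetical, and nothing here implies that the component performs this check for you.

```python
from urllib.robotparser import RobotFileParser

# Sample rules; a real site serves these at https://<host>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Report whether the given robots.txt rules permit fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "https://www.example.com/articles/1"))    # True
print(is_allowed(ROBOTS_TXT, "https://www.example.com/private/data"))  # False
```

To check a live site, call parser.set_url(...) followed by parser.read() instead of parse().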

Security Considerations

  • Do not scrape websites that explicitly prohibit scraping, whether in their terms of service or in their robots.txt file.
  • Be aware that the content of websites can change, so your selectors might need adjustments over time. Regularly check the output of this component to ensure it’s still extracting the correct information.
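
One way to notice that a selector has stopped matching after a site redesign is to test each selector against a saved copy of the page. The following is a minimal sketch using BeautifulSoup; the missing_selectors helper and the sample HTML are hypothetical, and this check is not a built-in feature of the component.

```python
from bs4 import BeautifulSoup


def missing_selectors(html: str, selectors: list[str]) -> list[str]:
    """Return the selectors that match no element in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [sel for sel in selectors if soup.select_one(sel) is None]


# The page was redesigned: the body container now uses a different id.
redesigned_page = '<h1 id="article-title">Title</h1><div id="story">Text</div>'
print(missing_selectors(redesigned_page, ["#article-title", "#article-body"]))
# ['#article-body']
```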