Web Crawler

The Web Crawler component automatically collects information from websites. You provide a starting web address, and it explores that site and gathers data such as text and links. This is useful for gathering information for analysis or for use in other parts of Nappai.

Relationship with Firecrawl API

This component uses the Firecrawl API to do the actual web crawling. Firecrawl is a powerful tool that allows us to efficiently and safely collect information from websites. You don’t need to know anything about Firecrawl to use this component; Nappai handles all the technical details.
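
For the curious, here is a minimal sketch of the interaction Nappai performs on your behalf: start a crawl job, then poll until Firecrawl reports it is finished. The endpoint paths, payload shape, and response fields follow Firecrawl’s public v0 REST API and are assumptions about what Nappai uses internally.

```python
# Sketch of the Firecrawl interaction Nappai performs for you: start a crawl
# job, then poll its status until it completes. Endpoint paths and response
# fields follow Firecrawl's v0 REST API and are assumptions, not Nappai code.
import os
import time

import requests

headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

# Start the crawl job for the given URL.
job = requests.post(
    "https://api.firecrawl.dev/v0/crawl",
    headers=headers,
    json={"url": "https://www.example.com"},
    timeout=30,
).json()

# Poll until the job is done, then read the collected pages.
while True:
    status = requests.get(
        f"https://api.firecrawl.dev/v0/crawl/status/{job['jobId']}",
        headers=headers,
        timeout=30,
    ).json()
    if status.get("status") == "completed":
        break
    time.sleep(2)

print(status["data"])  # list of crawled pages
```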

Inputs

  • URL: This is the web address (e.g., https://www.example.com) you want the crawler to start from. This is required.
  • Timeout: This sets how long (in milliseconds) the crawler should try to collect information before stopping. The default is 30 seconds (30000 milliseconds). You can increase this if a website is slow to load.
  • Crawler Options (Advanced): These are more detailed settings that control how the crawler behaves. Unless you need very specific control, you can leave these at their default settings. These options include things like how many pages deep the crawler should go, and whether it should follow links to other websites.
  • Page Options (Advanced): These settings control what information is collected from each web page. By default, the crawler focuses on the main content of the page. You can adjust these options if you need more or less information. The sketch after this list shows how these inputs can map onto a crawl request.
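
To make these inputs concrete, the sketch below shows how they could map onto a single crawl request payload. The field names (maxDepth, allowExternalLinks, onlyMainContent) follow Firecrawl’s v0 naming and are assumptions; the component’s actual defaults and schema may differ.

```python
# Hedged illustration of how the component's inputs could translate into a
# crawl payload. Field names follow Firecrawl v0 conventions (assumptions).
payload = {
    "url": "https://www.example.com",        # URL input (required)
    "crawlerOptions": {                      # Crawler Options (Advanced)
        "maxDepth": 2,                       # how many pages deep to crawl
        "allowExternalLinks": False,         # do not follow links off-site
    },
    "pageOptions": {                         # Page Options (Advanced)
        "onlyMainContent": True,             # skip navigation, ads, footers
    },
}

timeout_ms = 30000  # Timeout input; the 30-second default, in milliseconds
```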

Outputs

The component produces a single output called “Data”. This output contains all the information collected from the website, such as text, links, and potentially images (depending on your Page Options). You can then use this “Data” output in other Nappai components to process and analyze the information. For example, you could use it with the Summarizer to get a summary of the collected text.
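
To give a feel for what arrives in “Data”, here is an illustrative record. The field names are assumptions based on typical Firecrawl output; the exact shape depends on your Page Options and Firecrawl version.

```python
# Assumed shape of one record in the "Data" output (illustrative only).
record = {
    "url": "https://www.example.com/about",
    "markdown": "# About us\n...",               # main page text
    "links": ["https://www.example.com/team"],   # links found on the page
    "metadata": {"title": "About us"},
}

# A downstream component such as a summarizer would consume the text field:
text_for_summary = record["markdown"]
```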

Usage Example

Let’s say you want to collect information from the Wikipedia page about cats. You would:

  1. Enter https://en.wikipedia.org/wiki/Cat into the URL input.
  2. (Optional) Adjust the Timeout if you expect the page to take longer to load.
  3. Leave the Crawler Options and Page Options at their default settings unless you have specific needs.
  4. Run the component.
  5. The “Data” output will contain the information collected from the Wikipedia page. You can then connect this output to other components to further process the data, as sketched below.
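
As a hedged sketch of step 5, this is how downstream processing of the “Data” output could look once the crawl finishes. The record fields are the same assumptions as in the Outputs section; the stand-in list replaces the component’s real output.

```python
# Sketch of consuming the "Data" output downstream (field names assumed).
records = [
    {"url": "https://en.wikipedia.org/wiki/Cat",
     "markdown": "The cat (Felis catus) is a domesticated species...",
     "links": ["https://en.wikipedia.org/wiki/Felidae"]},
]  # stand-in for the component's real "Data" output

# Gather all page text so a summarizer-style component can work on it.
full_text = "\n\n".join(r["markdown"] for r in records)
all_links = sorted({link for r in records for link in r.get("links", [])})
print(f"{len(full_text)} characters of text, {len(all_links)} unique links")
```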

Templates

This component is used in the ‘Eurocup 2024’ template. Its “Data” output also pairs well with other Nappai components, for example:

  • Summarizer: Use this to create a concise summary of the text collected by the Web Crawler.
  • Entities extraction: Extract key information (like names, dates, locations) from the collected data.
  • Categorizer: Automatically categorize the collected information into relevant topics.
  • Google Drive Writer: Save the collected data to a Google Drive file.
  • Many more!: The “Data” output can be used as input for a wide variety of Nappai components.

Tips and Best Practices

  • Start with the default settings for Crawler Options and Page Options. Only adjust them if you need more specific control.
  • Be mindful of the website’s terms of service and robots.txt file. Excessive crawling can overload a website’s server. A robots.txt pre-check sketch follows this list.
  • Use the timeout setting to prevent the crawler from running indefinitely on slow or unresponsive websites.
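
For the robots.txt tip above, here is a small pre-check you can run before crawling, using only the Python standard library. It is independent of Nappai and Firecrawl.

```python
# Check whether robots.txt allows crawling a URL before you start.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://en.wikipedia.org/robots.txt")
robots.read()  # fetch and parse the site's robots.txt

url = "https://en.wikipedia.org/wiki/Cat"
if robots.can_fetch("*", url):          # "*" = any generic user agent
    print(f"Crawling {url} is allowed.")
else:
    print(f"robots.txt disallows {url}; choose another source.")
```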

Security Considerations

  • Avoid crawling websites that contain sensitive or private information.
  • Always respect the website’s terms of service and robots.txt file. Unauthorized crawling can have legal consequences.
  • Be aware that the data collected might contain inaccuracies or outdated information. Always verify the information before using it for critical decisions.