Web Scraper
Extract data from websites automatically. This component lets you grab information from web pages and use it in your Nappai workflows.
Relationship with Firecrawl API
This component uses the Firecrawl API to fetch and extract data from websites. Firecrawl is a powerful tool that handles the complexities of web scraping, allowing you to easily get the information you need.
Inputs
- URL: The web address of the page you want to scrape. This is required. For example: https://www.example.com
- Timeout: How long (in milliseconds) to wait for the website to respond before giving up. The default is 10,000 milliseconds (10 seconds). Increase this if a website is slow to load.
- Page Options (Advanced): These settings let you fine-tune how the scraper works. Unless you need specific control, leave these at their default values. These options include:
- Include HTML: Whether to include a cleaned-up HTML version of the page in the results. (Default: False)
- Include Raw HTML: Whether to include the page's unprocessed, raw HTML in the results. (Default: False)
- Only Main Content: Whether to only extract the main content of the page, filtering out sidebars and other less relevant sections. (Default: True)
- Only Include Tags: A list of HTML tags to include (leave empty for default behavior).
- Remove Tags: A list of HTML tags to remove. (Default: ["script"])
- Screenshot: Whether to take a screenshot of the page. (Default: False)
- Wait For: How long (in milliseconds) to wait for elements on the page to load before scraping. (Default: 0)
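Under the hood, these inputs roughly correspond to the parameters of a Firecrawl scrape request. The sketch below shows how they might map onto a request payload; the field names follow Firecrawl's public API conventions, and the exact payload Nappai builds is an assumption:

```python
# Sketch of how the component's inputs might map onto a Firecrawl-style
# scrape request. Field names are based on Firecrawl's public API; the
# actual payload Nappai sends may differ.

def build_scrape_payload(url, timeout_ms=10_000, **page_options):
    """Build a scrape request payload using the documented defaults."""
    options = {
        "includeHtml": False,     # Include HTML (default: False)
        "includeRawHtml": False,  # Include Raw HTML (default: False)
        "onlyMainContent": True,  # Only Main Content (default: True)
        "onlyIncludeTags": [],    # Only Include Tags (default: empty)
        "removeTags": ["script"], # Remove Tags (default: ["script"])
        "screenshot": False,      # Screenshot (default: False)
        "waitFor": 0,             # Wait For (default: 0 ms)
    }
    options.update(page_options)
    return {"url": url, "timeout": timeout_ms, "pageOptions": options}

# Example: a slow page that needs a longer timeout and a render delay.
payload = build_scrape_payload(
    "https://www.example.com", timeout_ms=15_000, waitFor=2_000
)
```

Only the options you override change; everything else keeps the defaults listed above.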
Outputs
The component produces a “Data” output containing the information scraped from the website. This data can then be used by other Nappai components, such as those that categorize, summarize, or analyze text. The data will be structured in a way that’s easy to use in your workflows.
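Firecrawl scrape results commonly carry the extracted page text in a "markdown" field alongside page metadata. The record below is a hypothetical example of what the "Data" output might look like; the real field names in Nappai may differ:

```python
# Hypothetical shape of the "Data" output; real field names may differ.
# Firecrawl results typically include a "markdown" field with the page text.
sample_data = {
    "markdown": "# Example Domain\n\nThis domain is for use in examples.",
    "metadata": {
        "sourceURL": "https://www.example.com",
        "title": "Example Domain",
    },
}

def get_text(data):
    """Pull the main text from a scraped-data record, or "" if absent."""
    return data.get("markdown", "")

text = get_text(sample_data)
```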
Usage Example
Let’s say you want to extract product prices from an e-commerce website. You would:
- Enter the website’s URL in the “URL” input field.
- (Optional) Adjust the “Timeout” if the website is slow.
- Run the component.
- The “Data” output will contain the extracted product prices, which you can then use in other parts of your Nappai automation.
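A downstream step such as "Extract Data" could then pull the prices out of the scraped text. A plain-Python sketch of that kind of extraction (the sample text and the price pattern are illustrative, not taken from a real site):

```python
import re

# Illustrative scraped text; in a real workflow this would come from the
# component's "Data" output.
scraped_text = """
Wireless Mouse - $24.99
Mechanical Keyboard - $89.50
USB-C Hub - $34.00
"""

# Match dollar amounts like $24.99. The pattern is an assumption about
# how prices are formatted on the target site.
prices = [float(m) for m in re.findall(r"\$(\d+(?:\.\d{2})?)", scraped_text)]
```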
Templates
[List of templates where the component is used - This section will be populated based on actual template usage.]
Related Components
- Extract Data: Use this component to pull specific pieces of information (like prices or product names) from the scraped data.
- Summarizer: Summarize the scraped text using AI.
- Categorizer: Automatically categorize the extracted data.
- Google Sheet Writer: Write the scraped data to a Google Sheet.
- Many more: The scraped data can be used as input for a wide variety of Nappai components depending on your needs.
Tips and Best Practices
- Start simple: Begin with basic scraping and gradually add more advanced options as needed.
- Respect robots.txt: Be mindful of the website’s robots.txt file, which specifies which parts of the site should not be scraped. Excessive scraping can overload a website’s server.
- Test thoroughly: Always test your scraping configuration on a small scale before running it on a large dataset.
- Handle errors: Websites can change, so be prepared to handle potential errors and adjust your configuration accordingly.
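The "handle errors" advice can be as simple as retrying a failed scrape a few times before giving up. The wrapper below is a generic sketch; the scrape function and the error it raises are placeholders, not part of the component:

```python
import time

def with_retries(fn, attempts=3, delay_s=1.0):
    """Call fn(), retrying on failure with a short pause between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:  # in practice, catch the specific scrape error
            if attempt == attempts:
                raise
            time.sleep(delay_s)

# Example: a flaky placeholder "scrape" that fails twice, then succeeds.
calls = {"n": 0}

def flaky_scrape():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("site did not respond")
    return {"markdown": "ok"}

result = with_retries(flaky_scrape, attempts=3, delay_s=0.0)
```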
Security Considerations
- Avoid scraping sensitive information (like passwords or credit card details).
- Always respect the website’s terms of service and privacy policy.
- Be aware of the legal implications of web scraping; unauthorized scraping can lead to legal issues.