Web Extract Agent (Legacy)
The Web Extract Agent (Legacy) is an intelligent tool designed to help you gather information from the internet. Instead of manually copying and pasting text from web pages, this agent uses artificial intelligence to understand your specific request and automatically extracts the relevant data for you. It is particularly useful when you need to collect specific details from complex websites or dynamic content that changes frequently.
Please note that this is a “Legacy” version of the agent. While it remains fully functional for backward compatibility, we recommend checking if a newer version of this component is available for future projects.
How it Works
This component acts as an autonomous assistant that follows a multi-step process to retrieve your data:
- Understanding Your Request: You provide a website address (URL) and describe what you want to extract using simple, everyday language (e.g., “Extract all product names and prices”). The AI analyzes this instruction to determine the best way to get the information.
- Smart Navigation: The agent visits the specified website. It uses standard web requests for simple pages and advanced browser simulation for pages that load content dynamically (like modern apps or sites with complex scripts).
- Data Extraction: Using structural cues in the page markup (CSS selectors) together with AI language models, it reads the page content and identifies the specific information you asked for.
- Organization: The extracted data is cleaned, organized into a structured format, and saved in a local database to prevent duplicates. Finally, it sends this organized data to the next step in your workflow.
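The duplicate-prevention part of the Organization step can be illustrated with a minimal sketch. This is not the agent's actual implementation: the table name `extracted`, the helper names, and the choice of SQLite with a SHA-256 fingerprint are all assumptions made for illustration.

```python
import hashlib
import json
import sqlite3


def record_key(record: dict) -> str:
    """Stable fingerprint of a record: SHA-256 of its JSON with sorted keys."""
    canonical = json.dumps(record, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def save_new_records(conn: sqlite3.Connection, records: list[dict]) -> list[dict]:
    """Store records locally and return only those not seen before."""
    # Hypothetical table layout; the real agent's schema is not documented here.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extracted "
        "(key TEXT PRIMARY KEY, payload TEXT NOT NULL)"
    )
    fresh = []
    for record in records:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO extracted (key, payload) VALUES (?, ?)",
            (record_key(record), json.dumps(record, ensure_ascii=False)),
        )
        if cursor.rowcount == 1:  # 0 means the key already existed
            fresh.append(record)
    conn.commit()
    return fresh


conn = sqlite3.connect(":memory:")
first = save_new_records(conn, [{"title": "Product Name 1", "price": "$29.99"}])
second = save_new_records(conn, [
    {"title": "Product Name 1", "price": "$29.99"},  # already stored
    {"title": "Product Name 2", "price": "$49.99"},  # new
])
print(len(first), len(second))
```

Fingerprinting the whole record means a changed price counts as a new record, which suits monitoring use cases; keying on a single field such as the title would instead treat it as a duplicate.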
Connection & Credentials
This component does not require configuring external API keys or credentials to function. It is designed to work immediately once added to your workflow.
Inputs
The following fields are available to configure this component. Because this is a legacy component, its input fields are not set up individually; they are derived automatically from the system’s base configuration. Typically, you will need to provide:
- Target URL: The web address you want to extract data from.
- Instruction: A natural language description of the data you want to extract (e.g., “Extract all email addresses from this page”).
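Before passing values into these two fields, it can help to sanity-check them in an upstream step. The sketch below is only an illustration: the key names `target_url` and `instruction`, the function name, and the three-word minimum are assumptions, not part of the component's documented interface.

```python
from urllib.parse import urlparse


def validate_inputs(target_url: str, instruction: str) -> dict:
    """Return a cleaned input payload or raise ValueError."""
    parsed = urlparse(target_url.strip())
    # Require an absolute http(s) address such as https://example.com/page
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"Target URL must be an absolute http(s) address: {target_url!r}"
        )
    # Collapse repeated whitespace in the instruction
    cleaned = " ".join(instruction.split())
    # Arbitrary threshold (an assumption): very short instructions tend to be vague
    if len(cleaned.split()) < 3:
        raise ValueError("Instruction is too vague; name the fields you want extracted")
    return {"target_url": parsed.geturl(), "instruction": cleaned}


payload = validate_inputs(
    " https://example.com/products ",
    "Extract  all product names and prices",
)
print(payload)
```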
Outputs
When the component finishes its task, it produces a structured data object containing the information you requested. This output is ready to be used by subsequent steps in your Nappai dashboard, such as sending emails, saving to a spreadsheet, or triggering further automation.
Output Data Example (JSON)
```json
{
  "status": "success",
  "data": [
    {
      "title": "Product Name 1",
      "price": "$29.99",
      "description": "A high-quality item"
    },
    {
      "title": "Product Name 2",
      "price": "$49.99",
      "description": "Another great item"
    }
  ],
  "metadata": {
    "source_url": "https://example.com",
    "extraction_date": "2023-10-27T10:00:00Z",
    "record_count": 2
  }
}
```
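A downstream step might read this output as sketched below. The sketch assumes the output always has the shape of the example above (`status`, `data`, `metadata.record_count`) and that prices are dollar strings; the function names are hypothetical.

```python
import json
from decimal import Decimal

# Abbreviated copy of the example output above
raw_output = """
{
  "status": "success",
  "data": [
    {"title": "Product Name 1", "price": "$29.99", "description": "A high-quality item"},
    {"title": "Product Name 2", "price": "$49.99", "description": "Another great item"}
  ],
  "metadata": {"source_url": "https://example.com", "record_count": 2}
}
"""


def parse_price(text: str) -> Decimal:
    """Convert a string like '$1,299.00' to a Decimal (assumes dollar formatting)."""
    return Decimal(text.replace("$", "").replace(",", ""))


def summarize(output: dict) -> list[tuple[str, Decimal]]:
    """Return (title, price) pairs after basic consistency checks."""
    if output.get("status") != "success":
        raise RuntimeError(f"Extraction did not succeed: {output.get('status')!r}")
    records = output.get("data", [])
    expected = output.get("metadata", {}).get("record_count")
    if expected is not None and expected != len(records):
        raise ValueError(f"Expected {expected} records, got {len(records)}")
    return [(r["title"], parse_price(r["price"])) for r in records]


for title, price in summarize(json.loads(raw_output)):
    print(f"{title}: {price}")
```

Checking `status` and `record_count` before using the data keeps a failed or partial extraction from silently flowing into later steps.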
Connectivity
This component typically receives its input from a trigger or a previous step that provides the URL and the specific instruction (e.g., a text input from a user or a CRM record).
Its output is typically connected to:
- Data Processing Nodes: To further clean or transform the extracted data.
- Storage Nodes: Such as Google Sheets, Excel, or a database, to save the extracted information.
- Communication Nodes: Such as Email or Slack, to notify users about the newly found data.
Usage Example
Imagine you want to monitor competitor prices for a specific product.
- Add the Web Extract Agent (Legacy) to your dashboard.
- In the Target URL field, enter the URL of the competitor’s product page.
- In the Instruction field, type: “Extract the product name, price, and availability status.”
- Connect the output of this agent to a Google Sheets component.
- When the workflow runs, the agent will visit the site, find the prices, and automatically add them to your spreadsheet.
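To make the last step concrete, the sketch below shows how extracted records could map onto spreadsheet rows. In Nappai the Google Sheets component handles this for you; the code only illustrates the mapping, and the field names `name`, `price`, and `availability` are assumptions based on the instruction in this example.

```python
import csv
import io


def records_to_csv(records: list[dict], columns: list[str]) -> str:
    """Render records as CSV text with one row per record and a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells instead of raising an error
        writer.writerow({column: record.get(column, "") for column in columns})
    return buffer.getvalue()


rows = records_to_csv(
    [{"name": "Widget", "price": "$29.99", "availability": "In stock"}],
    ["name", "price", "availability"],
)
print(rows)
```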
Tips and Best Practices
- Be Specific in Instructions: Instead of saying “get data,” say “extract the product title and price.” The more specific your instruction, the more accurate the AI’s extraction will be.
- Check Page Structure: Make sure the data you want is actually present in the page’s HTML. Pages that rely heavily on complex animations or sit behind paywalls may be harder for the agent to scrape.
- Use Legacy Wisely: Since this is a legacy component, verify whether your workflow relies on older data structures. If you are starting a new project, check for the non-legacy version of this component to ensure you are using the most up-to-date features.
Security Considerations
- Website Terms of Service: Ensure you have the right to scrape the target website. Some sites prohibit automated data extraction in their terms of service.
- Data Privacy: Be cautious when extracting personal information. Ensure that the data you are collecting complies with data protection regulations (such as GDPR or CCPA).
- Authentication: This legacy agent is designed for publicly accessible web content. It does not support logging in to password-protected areas without additional configuration.
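As a complementary check before pointing the agent at a site, you can inspect the site's robots.txt rules. This is a sketch using Python's standard `urllib.robotparser`; a robots.txt file is not the same thing as the terms of service, the source does not state whether the agent consults it, and the rules and URLs below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real check would load
# https://<site>/robots.txt with set_url() followed by read().
example_rules = """
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(example_rules, "https://example.com/products"))
print(is_allowed(example_rules, "https://example.com/private/report"))
```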