Entities extraction

This component extracts specific pieces of information (called “entities”) from your data. Think of it as an AI-powered search tool that reads through a document or data object and pulls out what you asked for, such as customer names, invoice numbers, or product types. It uses a Large Language Model (LLM) to interpret the context and structure of your data, so extraction quality depends on the model you connect and on how precisely you define the keys.

How it Works

The component takes your raw data and uses the connected language model to identify the information you need. The process has four steps:

Input: You provide the data object you want to analyze.
Analysis: The Language Model connected to the component reads the data, guided by the “Extract keys” you specify.
Extraction: The model looks for the information defined in your keys. You can also provide additional instructions to guide it if the data is complex.
Output: The component returns a list of the extracted data, formatted according to your needs.

The component automatically splits large texts into smaller chunks when necessary, so inputs that exceed the model's text length limit are still processed in full.
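The splitting behaviour described above can be pictured with a small sketch. The component's actual splitter is not documented here, so this assumes a plain sliding window over characters; the function name `split_text` is hypothetical, and only the defaults (1,500 and 150) come from this page.

```python
def split_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 150) -> list[str]:
    """Split text into character chunks; neighbours share `chunk_overlap` characters.

    Assumption: a simple sliding window, not necessarily what the component does.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


sample = "x" * 4000
chunks = split_text(sample)
print([len(c) for c in chunks])  # [1500, 1500, 1300]
```

Each chunk starts 1,350 characters after the previous one, so the last 150 characters of one chunk reappear at the start of the next.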

Connection & Credentials

This component does not require credentials to be configured in the node itself. Instead, you must connect a Language Model (LLM) to it, typically a model instance you have already set up in your Nappai workspace (e.g., an OpenAI model or a local LLM).

Inputs

The following fields are available to configure this component. Note that some advanced settings are hidden by default and may require enabling “Advanced Mode” in your Nappai dashboard.

Data: The main data object you want to analyze. This is the source from which the AI will extract information.
- Visible in: All operations
Model: The AI language model that performs the analysis. You must connect a valid Language Model here for the component to work.
- Visible in: All operations
Extract keys: A list of specific items or properties you want to find in the data (e.g., “Customer Name”, “Date”, “Email”). This is the most important setting; be precise with your spelling and structure.
- Visible in: All operations
Additional instructions: Optional text that provides extra context to the AI. For example, you might add “Ignore dates before 2023” to refine the results.
- Visible in: All operations
Output instructions: Optional text that tells the AI how you want the final result formatted (e.g., “Return as a JSON list” or “Return as a comma-separated string”).
- Visible in: All operations
chunk size: The size of the text segments the AI processes at one time. The default is 1,500 characters. Increasing it for long documents reduces the number of segments and can improve speed, as long as each segment stays within the model's limits.
- Visible in: All operations
chunk overlap: The number of characters repeated between text segments to ensure context isn’t lost when splitting long texts. Default is 150.
- Visible in: All operations
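To keep the inputs straight while planning a workflow, it can help to write them down as a small settings object. The sketch below is not the Nappai API: the class `ExtractionSettings` and its field names are hypothetical, and only the two defaults are taken from the list above.

```python
from dataclasses import dataclass


@dataclass
class ExtractionSettings:
    """Hypothetical container mirroring the documented inputs (not the Nappai API)."""

    extract_keys: list[str]
    additional_instructions: str = ""
    output_instructions: str = ""
    chunk_size: int = 1500     # documented default, in characters
    chunk_overlap: int = 150   # documented default, in characters

    def validate(self) -> None:
        """Raise ValueError for settings that cannot produce a sensible extraction."""
        if not self.extract_keys or any(not key.strip() for key in self.extract_keys):
            raise ValueError("extract_keys must contain at least one non-empty key")
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")


settings = ExtractionSettings(extract_keys=["Customer Name", "Email"])
settings.validate()  # passes with the documented defaults
```

The overlap check reflects a general property of overlapping chunks: if the overlap were as large as the chunk itself, the window would never advance.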

Outputs

The component produces two main outputs that you can connect to other parts of your workflow:

Extracted Data: This is the primary output. It contains the specific information you requested, organized as a list of data objects. You can use this to feed into other components for further processing, saving to a database, or displaying in a dashboard.
Tool: This output allows the component to be used as a reusable “Tool” by other AI agents in your workflow. If you are building complex autonomous agents, you can connect this output to an agent that expects a tool interface.

Output Data Example (JSON)

Here is an example of what the Extracted Data output might look like if you were extracting customer names and emails from a text:

    [
      {"customer_name": "John Doe", "email": "john.doe@example.com"},
      {"customer_name": "Jane Smith", "email": "jane.smith@company.org"}
    ]
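A list shaped like the example above is easy to consume downstream. The sketch below assumes the output reaches your code as JSON text; inside a workflow it may instead arrive as data objects, in which case the parsing step is unnecessary.

```python
import json

# The example output from above, as JSON text.
raw = """
[
  {"customer_name": "John Doe", "email": "john.doe@example.com"},
  {"customer_name": "Jane Smith", "email": "jane.smith@company.org"}
]
"""

records = json.loads(raw)

# Build a lookup from customer name to email address.
emails_by_name = {record["customer_name"]: record["email"] for record in records}
print(emails_by_name["Jane Smith"])  # jane.smith@company.org
```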

Connectivity

In a typical workflow, this component acts as a bridge between raw data sources and your application’s logic.

Input Connections: Connect a Data component (like a Document Loader or a Database Output) to the Data input. You must also connect a Language Model (from your Model Registry or a previous node) to the Model input.
Output Connections: Connect the Extracted Data output to components that need to use this structured information, such as:
- Database Writers: To save the extracted entities into a CRM or database.
- AI Agents: To allow an agent to reason over the extracted facts.
- Output Nodes: To display the extracted information in a final user interface.
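As an illustration of the database-writer connection, the sketch below stores extracted entities in SQLite using only the Python standard library. The table name, column names, and the helper `save_entities` are assumptions made for this example; a real workflow would use whichever database component you have connected.

```python
import sqlite3

# Sample records shaped like the Extracted Data output.
extracted = [
    {"customer_name": "John Doe", "email": "john.doe@example.com"},
    {"customer_name": "Jane Smith", "email": "jane.smith@company.org"},
]


def save_entities(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Insert rows, skipping emails already stored; return the table's row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers (customer_name TEXT, email TEXT UNIQUE)"
    )
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO customers (customer_name, email) "
            "VALUES (:customer_name, :email)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]


conn = sqlite3.connect(":memory:")
total = save_entities(conn, extracted)
print(total)  # 2
```

The `UNIQUE` constraint with `INSERT OR IGNORE` makes repeated runs safe: writing the same records twice leaves the table unchanged.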

Usage Example

Imagine you have a database of customer support tickets (the Data). You want to automatically identify every customer who mentioned “Refund” and get their email addresses.

Connect your Customer Support Tickets data to the Data input.
Connect your OpenAI Model to the Model input.
In Extract keys, enter: ["Customer Name", "Email", "Refund Mention"].
In Additional instructions, enter: "Only extract the email of the person requesting the refund."
The component will process the tickets and output a list of customers who requested refunds along with their contact info, which you can then send to your email marketing tool.
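To see how the keys and instructions from this example might come together, here is a rough sketch. Everything in it is an assumption: the component's real prompt is not documented on this page, `build_prompt` and `extract_entities` are hypothetical helpers, and `fake_model` stands in for a real language model so the example can run offline with invented sample data.

```python
import json
from typing import Callable


def build_prompt(text: str, extract_keys: list[str], additional_instructions: str = "") -> str:
    """Assemble an illustrative prompt (the component's real prompt is not documented)."""
    lines = [
        "Extract the following keys from the text and answer with a JSON list of objects.",
        "Keys: " + ", ".join(extract_keys),
    ]
    if additional_instructions:
        lines.append("Instructions: " + additional_instructions)
    lines.append("Text:")
    lines.append(text)
    return "\n".join(lines)


def extract_entities(
    text: str,
    extract_keys: list[str],
    model: Callable[[str], str],
    additional_instructions: str = "",
) -> list[dict]:
    """Send the prompt to `model` and parse its JSON reply into a list of dicts."""
    reply = model(build_prompt(text, extract_keys, additional_instructions))
    data = json.loads(reply)
    return data if isinstance(data, list) else [data]


def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned answer for the sample ticket.
    return json.dumps([
        {"Customer Name": "Ana Ruiz", "Email": "ana@example.com",
         "Refund Mention": "I would like a refund"}
    ])


ticket = "Ana Ruiz (ana@example.com) wrote: I would like a refund for my last order."
keys = ["Customer Name", "Email", "Refund Mention"]
result = extract_entities(
    ticket, keys, fake_model,
    "Only extract the email of the person requesting the refund.",
)
print(result[0]["Email"])  # ana@example.com
```

Swapping `fake_model` for a function that calls your actual model is the only change needed to try the same flow against real tickets.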

Important Notes

🔒 Sensitive data is sent to the LLM (🔴 high): All input data is processed by the supplied language model. Make sure the model runs in a secure, privacy-compliant environment if you handle confidential information.

📋 Provide a LangChain-compatible LLM (🔴 high): The component requires a LanguageModel instance that implements the LangChain interface. Without a valid model, extraction will fail.

⚠️ Data size may affect performance (🟡 medium): Large Data objects are split into 1,500-character chunks by default. Extremely large inputs can increase processing time and memory usage. Consider pre-splitting the data or reducing its size before use.

⚠️ Component is in development (🟡 medium): The component is marked is_development=True, so it may have bugs or incomplete features. Use it with caution in production workflows.

⚠️ Output format applies only to async extraction (🟢 low): When the component is called through its Tool output, the output_format input is ignored. Only the main async method supports custom output formatting. Use the method that fits your workflow.

⚠️ Specify precise entity keys (🟢 low): List the exact entity names you want extracted. Vague or misspelled keys will result in missing data. Double-check spelling and structure.

⚠️ Adjust chunk size for long texts (🟡 medium): The default chunk size is 1,500 characters. For very long documents, increasing this value reduces the number of LLM calls, which can improve speed. However, too large a value may exceed model limits.

⚠️ Additional context guides extraction (🟢 low): Any text provided in Additional instructions influences how the LLM identifies entities. Use clear, concise instructions to improve accuracy.

⚠️ Hidden advanced inputs may be overlooked (🟢 low): Inputs like Additional instructions, Output instructions, chunk size, and chunk overlap are hidden by default. Enable advanced settings if you need fine-tuning.
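The chunk size trade-off noted above can be made concrete with a little arithmetic. The sketch assumes a sliding-window splitter and one LLM call per chunk; neither is confirmed by this page, and `estimate_chunks` is a hypothetical helper.

```python
def estimate_chunks(text_length: int, chunk_size: int = 1500, chunk_overlap: int = 150) -> int:
    """Estimate how many chunks a text of `text_length` characters produces.

    Assumption: each new chunk starts `chunk_size - chunk_overlap` characters
    after the previous one.
    """
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    remaining = text_length - chunk_size
    return -(-remaining // step) + 1  # integer ceiling division, plus the first chunk


# A 60,000-character document at three chunk sizes (overlap left at 150).
for size in (1500, 3000, 6000):
    print(size, estimate_chunks(60_000, chunk_size=size))
# 1500 45
# 3000 21
# 6000 11
```

Under these assumptions, doubling the chunk size roughly halves the number of calls, which is why larger chunks can be faster as long as each one still fits within the model's limits.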

Tips and Best Practices

Be Specific: When entering Extract keys, be as precise as possible. Instead of just “Name,” use “Customer Name” or “Invoice Number.”
Use Instructions for Edge Cases: If your data has inconsistent formats (e.g., dates written as “Jan 1” vs “01/01/2024”), use Additional instructions to guide the AI on how to handle these variations.
Monitor Chunks: If your documents are very long, consider increasing the chunk size to reduce processing time, but test carefully to ensure accuracy isn’t compromised.
Test with Small Samples: Before running on large datasets, test the component with a small sample of data to verify that the Extract keys are capturing the correct information.
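One way to judge a small test run is to measure how often each key actually came back with a value. The helper `key_coverage` below is a hypothetical sketch that works on any list of dictionaries shaped like the Extracted Data output; the sample records are invented for illustration.

```python
def key_coverage(records: list[dict], extract_keys: list[str]) -> dict[str, float]:
    """Return, per key, the share of records where the key has a non-empty value."""
    if not records:
        return {key: 0.0 for key in extract_keys}
    coverage: dict[str, float] = {}
    for key in extract_keys:
        filled = 0
        for record in records:
            value = record.get(key)
            if value is not None and str(value).strip():
                filled += 1
        coverage[key] = filled / len(records)
    return coverage


sample_records = [
    {"Customer Name": "John Doe", "Email": "john.doe@example.com"},
    {"Customer Name": "Jane Smith", "Email": ""},
]
report = key_coverage(sample_records, ["Customer Name", "Email", "Invoice Number"])
print(report)  # {'Customer Name': 1.0, 'Email': 0.5, 'Invoice Number': 0.0}
```

A key with coverage near zero often points to a misspelled or overly vague entry in Extract keys rather than to missing data.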

Security Considerations

Because this component sends your data to the connected Language Model, which may be an external service, make sure the data does not contain sensitive personal information (PII) whose transfer would violate privacy laws or company policies, unless the model you are using is known to be secure and compliant with those regulations. If you mask sensitive values in the data, also review the Additional instructions so that the same values are not reintroduced into the prompt.
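If you decide to mask values before they reach the model, a pre-processing step along the following lines is one option. This is a minimal sketch, not a complete PII solution: the regular expression only catches common email shapes, and `mask_emails` is a hypothetical helper that would run before the Data input, outside the component.

```python
import re

# Deliberately simple pattern; real PII detection needs far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email with a placeholder; return text and placeholder map."""
    token_for: dict[str, str] = {}  # original value -> placeholder

    def replace(match: re.Match) -> str:
        value = match.group(0)
        if value not in token_for:
            token_for[value] = f"<EMAIL_{len(token_for) + 1}>"
        return token_for[value]

    masked = EMAIL_RE.sub(replace, text)
    # Invert the map so placeholders can be swapped back after extraction.
    return masked, {token: value for value, token in token_for.items()}


masked, mapping = mask_emails(
    "Contact john.doe@example.com or jane.smith@company.org; "
    "john.doe@example.com wrote again."
)
print(masked)  # Contact <EMAIL_1> or <EMAIL_2>; <EMAIL_1> wrote again.
```

Because the same address always maps to the same placeholder, the model can still tell which mentions belong together, and the returned map lets you restore the real values in the extracted results locally.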