Contextual Text Splitter
The Contextual Text Splitter breaks long text into chunks and uses a language model to generate a short summary, or “context”, for each one. Standard text splitters only cut text into pieces; the added context helps your automation system (Nappai) understand the content better, making it easier to find and retrieve the right information later.
How it Works
When you connect a text source to this component, it performs two main steps:
- Splitting: It breaks your long text into smaller segments (chunks) using the separator and chunk size you set. Some splitting options shown in the UI currently have no effect; see Important Notes.
- Context Generation: For each chunk, it sends the text to a Language Model (LLM). The LLM reads the chunk along with the original document to create a brief summary explaining what that specific chunk is about.
The result is a set of text chunks that are not only separated logically but also enriched with AI-generated descriptions. This “enriched” data is then passed on to other parts of your workflow, such as databases or search tools.
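The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the component's actual code: `contextual_split`, `generate_context`, and `fake_llm` are hypothetical names, and the stand-in model only reports sizes instead of calling a real LLM.

```python
from typing import Callable, Dict, List


def contextual_split(
    document: str,
    generate_context: Callable[[str, str], str],
    separator: str = "\n\n",
) -> List[Dict[str, str]]:
    """Split `document` on `separator`, then attach a context to each chunk."""
    # Step 1: splitting. Empty segments are dropped.
    chunks = [part.strip() for part in document.split(separator) if part.strip()]

    records = []
    for chunk in chunks:
        # Step 2: context generation. The callable receives the chunk
        # together with the full document, as described above.
        context = generate_context(chunk, document)
        records.append(
            {"chunk": chunk, "context": context, "text": f"{context}\n\n{chunk}"}
        )
    return records


def fake_llm(chunk: str, document: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"Excerpt of {len(chunk)} characters from a {len(document)}-character document."


doc = "Clause 1. Either party may terminate.\n\nClause 2. Fees are due monthly."
records = contextual_split(doc, fake_llm)
for record in records:
    print(record["text"])
    print("---")
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM provider.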
Connection & Credentials
This component requires a Language Model to function. You must connect a valid LLM (such as OpenAI, Anthropic, or a local model) to the Language Model input field. Without a connected language model, the component cannot generate the context summaries, and the context field of each output object will contain an error message instead.
Inputs
Input Fields
The following fields are available to configure this component. Which fields are visible may depend on the selected operation:
- Data Inputs: [REQUIRED] A list of text data objects to be processed. This is where you connect your source documents or text files.
- Language Model: [REQUIRED] The AI model used to generate context for each text chunk. You must connect a functional LLM here.
- Chunk Size: An integer value defining the maximum length (in characters) for each text chunk. Default is 1000.
- Chunk Overlap: An integer value defining how many characters should be repeated between chunks to maintain continuity. Default is 200. This setting is currently ignored (see Important Notes).
- Max Sentence Length: An integer value setting the maximum length for a single sentence before it is forcibly split. Default is 500. This setting currently has no effect (see Important Notes).
- Preserve Paragraphs: A toggle (True/False) indicating whether the splitter should try to keep paragraphs intact. Default is True. This setting currently has no effect (see Important Notes).
- Separator: A text field defining the string used to split the text. Default is a double newline (a blank line).
- Split Method: A dropdown menu to select the algorithm for splitting text. Options include “recursive”, “token_based”, and “sentence”. Default is “recursive”. This setting currently has no effect (see Important Notes).
Outputs
Output Data Example (JSON)
The component produces a list of data objects. Each object contains three key pieces of information: the raw chunk, the AI-generated context, and a combined version of both.

```json
[
  {
    "chunk": "This is the first part of the text…",
    "context": "Summary of what the first part discusses…",
    "text": "Summary of what the first part discusses…\n\nThis is the first part of the text…"
  },
  {
    "chunk": "This is the second part of the text…",
    "context": "Summary of what the second part discusses…",
    "text": "Summary of what the second part discusses…\n\nThis is the second part of the text…"
  }
]
```
- chunk: The original segment of text.
- context: The AI-generated summary.
- text: A combination of the context and the original chunk, often used for search indexing.
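Downstream code may want to confirm that each object has the expected shape before indexing it. The sketch below assumes, based on the example above, that `text` is the context, a blank line, and then the chunk; `validate_record` is a hypothetical helper, not part of the component.

```python
from typing import Dict, List

REQUIRED_KEYS = ("chunk", "context", "text")


def validate_record(record: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one output object (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]
    if problems:
        return problems
    # Assumption taken from the example output: context, blank line, chunk.
    expected_text = f"{record['context']}\n\n{record['chunk']}"
    if record["text"] != expected_text:
        problems.append("text is not context + blank line + chunk")
    return problems


good = {
    "chunk": "This is the first part of the text.",
    "context": "Summary of what the first part discusses.",
    "text": "Summary of what the first part discusses.\n\nThis is the first part of the text.",
}
incomplete = {"chunk": "Only the raw chunk is present."}

print(validate_record(good))
print(validate_record(incomplete))
```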
Connectivity
This component is typically part of a Retrieval-Augmented Generation (RAG) workflow.
- Incoming Connections: It usually receives data from Data Loaders, Text Splitters (basic ones), or Vector Store retrievers that provide raw text.
- Outgoing Connections: The output (Chunks) is typically sent to:
  - Vector Stores: To index the text along with its context for semantic search.
  - Retrievers: To fetch relevant context for answering questions.
  - LLMs: To provide enriched context for chatbots or summarization tasks.
Usage Example
Imagine you have a large legal contract and you want to build a chatbot that can answer questions about it.
- Input: You connect the text of the contract to the Data Inputs of the Contextual Text Splitter.
- Configuration: You connect your OpenAI model to the Language Model input and set the Chunk Size to 1500 characters.
- Process: The splitter breaks the contract into paragraphs. Then, it asks the AI to summarize each paragraph.
- Output: You get a list of paragraphs, each with a short summary. You can then send this to a Vector Store. When a user asks “What are the termination clauses?”, the system uses the context to find the most relevant paragraph and answer the question accurately.
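To see why the generated context can help retrieval, here is a toy comparison. It swaps the Vector Store for simple word-overlap scoring, so it only illustrates the idea; the sample records, `tokenize`, and `best_match` are made up for this sketch.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def best_match(query: str, records: List[Dict[str, str]], field: str) -> Dict[str, str]:
    """Return the record whose `field` shares the most words with the query."""
    query_words = tokenize(query)
    return max(records, key=lambda r: len(query_words & tokenize(r[field])))


# Invented sample data in the component's output shape.
records = [
    {
        "chunk": "Either party may end this agreement with 30 days written notice.",
        "context": "Termination clauses: how the contract can be ended.",
    },
    {
        "chunk": "Invoices are payable within 15 days of receipt.",
        "context": "Payment terms and invoice deadlines.",
    },
]
for record in records:
    record["text"] = f"{record['context']}\n\n{record['chunk']}"

query = "What are the termination clauses?"
print(best_match(query, records, "text")["chunk"])   # searched with context
print(best_match(query, records, "chunk")["chunk"])  # searched without context
```

In this sample the first chunk never uses the word “termination”, so searching the raw `chunk` field picks the payment paragraph, while searching the combined `text` field finds the right one. A real semantic search would behave differently, but the enrichment serves the same purpose.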
Important Notes
🔒 Sensitive data is sent to the LLM [high]: Any text you input is passed to the connected language model for context generation. If the LLM is hosted externally, ensure it complies with your privacy requirements.
⚠️ Chunk overlap parameter is ignored [medium]: The component does not use the ‘Chunk Overlap’ setting. All chunks are created without any overlapping text, which may affect context continuity if you rely on overlap.
⚠️ Advanced split options not implemented [medium]: Options such as ‘Split Method’, ‘Preserve Paragraphs’, and ‘Max Sentence Length’ are present in the UI but have no effect on how the text is split. The component always splits on the chosen separator.
⚠️ Component marked as in development [medium]: The component is flagged as ‘is_development = True’, meaning it may contain bugs or incomplete features. Use it with caution in production workflows.
💡 Provide a functioning language model [medium]: The component relies on a language model to generate contextual summaries. If no LLM or an unsupported LLM is connected, the context field will contain an error message.
⚙️ Separator defaults to double newline [medium]: The default separator is a blank line. If your source text uses single newlines to separate paragraphs, the splitter may produce unexpectedly large chunks.
⚙️ Set chunk size relative to model limits [medium]: Choose a chunk size that balances context depth with the token limits of your language model. Very large chunks may exceed model capacity, while very small chunks might lose meaningful context.
ℹ️ Output includes combined context and chunk [medium]: Each output Data object contains a ‘chunk’ string, a separate ‘context’ string, and a ‘text’ field that concatenates the context and chunk. Downstream components should be aware of this format.
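The separator note is easy to reproduce with plain string splitting. The sketch below assumes the splitter behaves like Python's `str.split` on the separator and ignores chunk size; `split_on_separator` is a hypothetical helper, not the component's code.

```python
from typing import List


def split_on_separator(text: str, separator: str = "\n\n") -> List[str]:
    """Split on the separator only; no overlap is added between chunks."""
    return [part for part in text.split(separator) if part.strip()]


double_newlines = "Paragraph one.\n\nParagraph two.\n\nParagraph three."
single_newlines = "Paragraph one.\nParagraph two.\nParagraph three."

print(len(split_on_separator(double_newlines)))                  # 3
print(len(split_on_separator(single_newlines)))                  # 1
print(len(split_on_separator(single_newlines, separator="\n")))  # 3
```

With the default separator, text that uses single newlines stays in one large chunk; changing the separator to a single newline restores the expected three.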
Tips and Best Practices
- Use Appropriate Chunk Sizes: Smaller chunks are easier for the AI to summarize accurately, but chunks that are too small can lose context. Medium sizes (around 1000-1500 characters) often work well for documents.
- Check Your Data Format: Ensure your input text is clean and consistent. If your text uses single line breaks for paragraphs but you keep the default double-newline separator, the splitter might not separate sections as expected.
- Monitor LLM Costs: Since this component calls an LLM for every chunk, large documents will incur higher usage costs. Keep an eye on the number of chunks generated.
- Verify Context Quality: After running your workflow, inspect the “context” output field to ensure the AI is generating useful summaries. If summaries are vague, consider adjusting the chunk size or switching to a different LLM.
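The cost and quality checks above can be combined into a small post-run report. This is a rough sketch under stated assumptions: each LLM call is assumed to include the chunk plus the full document, four characters per token is only a common rule of thumb for English text, and matching on the word “error” is a guess at what a failed context looks like. `review_run` is a hypothetical helper.

```python
import math
from typing import Dict, List

CHARS_PER_TOKEN = 4  # rough heuristic, not a property of any specific model


def review_run(records: List[Dict[str, str]], document_chars: int) -> Dict[str, object]:
    """Summarize LLM usage and flag contexts that look empty or failed."""
    # Assumption: every call sends the chunk together with the whole document.
    input_chars = sum(len(r["chunk"]) + document_chars for r in records)
    suspicious = [
        index
        for index, r in enumerate(records)
        if not r["context"].strip() or "error" in r["context"].lower()
    ]
    return {
        "llm_calls": len(records),  # one call per chunk
        "estimated_input_tokens": math.ceil(input_chars / CHARS_PER_TOKEN),
        "suspicious_contexts": suspicious,
    }


sample = [
    {"chunk": "a" * 1000, "context": "Covers the termination terms."},
    {"chunk": "b" * 500, "context": "Error: no language model connected."},
]
report = review_run(sample, document_chars=1500)
print(report)
```

The “error” check will also flag legitimate summaries that mention errors, so treat the flagged indexes as items to inspect by hand rather than confirmed failures.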
Security Considerations
Since this component sends your raw text to an external or internal Language Model for processing, be mindful of sensitive information. Do not input confidential personal data, passwords, or proprietary secrets if you are using a public or third-party LLM service. Ensure your chosen LLM provider adheres to your organization’s data privacy and security policies.