Natural Language Text Splitter
The Natural Language Text Splitter takes a long piece of text and breaks it into manageable chunks. It uses language‑aware rules to keep sentences and paragraphs intact, making it easier to feed the text into AI models or other downstream processes.
How it Works
The component uses the NLTKTextSplitter from the LangChain library. When you provide the text, it:
- Applies sentence-tokenization rules for the language you specify (default is English).
- Looks for natural boundaries such as sentence ends or paragraph breaks.
- Splits the text into pieces that are no longer than the Chunk Size you set.
- Adds a small overlap of characters between consecutive pieces, as defined by Chunk Overlap, so that context isn’t lost.
- Uses the Separator you choose (or `\n\n` by default) to decide where to cut the text.
All of this happens locally on your machine; no external API calls are made.
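The steps above can be sketched in plain Python. This is an illustrative simplification, not the actual NLTKTextSplitter implementation: the real component uses NLTK's trained sentence tokenizer, while this sketch stands in a simple punctuation regex.

```python
import re

def split_text(text, chunk_size=1000, chunk_overlap=200, separator="\n\n"):
    """Sketch of the splitter's logic: cut at natural boundaries,
    cap each chunk at chunk_size characters, and carry chunk_overlap
    characters of trailing context into the next chunk."""
    # First split on the configured separator (paragraph breaks by
    # default), then on sentence-ending punctuation as a stand-in for
    # NLTK's sentence tokenizer.
    pieces = []
    for block in text.split(separator):
        pieces.extend(s for s in re.split(r"(?<=[.!?])\s+", block) if s)

    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + 1 + len(piece) > chunk_size:
            chunks.append(current)
            # Keep the last chunk_overlap characters so context isn't lost.
            current = current[-chunk_overlap:] if chunk_overlap else ""
        current = (current + " " + piece).strip() if current else piece
    if current:
        chunks.append(current)
    return chunks
```

Calling `split_text(text, chunk_size=1000, chunk_overlap=200)` returns a list of strings, each at most roughly `chunk_size` characters, with neighbouring chunks sharing a short tail of context.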
Inputs
- Input: The text data to be split.
- Chunk Overlap: The number of characters that overlap between consecutive chunks.
- Chunk Size: The maximum number of characters in each chunk after splitting.
- Language: The language of the text. Default is “English”. Supports multiple languages for better text boundary recognition.
- Separator: The character(s) to use as a delimiter when splitting text. Defaults to `\n\n` if left empty.
Outputs
- Data: The split text pieces, returned as a Data object via the `split_data` method. Each piece can be used independently in later steps of your workflow.
Usage Example
- Add the component to your workflow.
- Connect a Document Loader (or any component that outputs text) to the Input field.
- Set Chunk Size to 1000 characters and Chunk Overlap to 200 characters.
- Leave Language as the default or change it to match your text.
- Run the workflow. The component will output a list of text chunks that you can feed into an LLM or store for later use.
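To see how the 1000/200 settings above interact, here is a simplified fixed-stride sketch. It is not how the component actually cuts text (the real splitter breaks at sentence boundaries, so chunk lengths vary), but it shows the arithmetic of size and overlap:

```python
def fixed_stride_chunks(text, chunk_size=1000, chunk_overlap=200):
    # Each new chunk starts chunk_size - chunk_overlap characters after
    # the previous one, so consecutive chunks share chunk_overlap
    # characters of context.
    stride = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]

doc = "".join(str(i % 10) for i in range(2500))
chunks = fixed_stride_chunks(doc)
print(len(chunks), [len(c) for c in chunks])  # → 4 [1000, 1000, 900, 100]
```

Note that the last 200 characters of one chunk equal the first 200 characters of the next, which is what lets an LLM keep context across chunk boundaries.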
Related Components
- Simple Text Splitter – Splits text by a fixed number of characters without language awareness.
- Custom Text Splitter – Allows you to define your own regex or delimiter for splitting.
- Document Loader – Loads documents from files or databases, often used as the source for the Natural Language Text Splitter.
Tips and Best Practices
- Match LLM limits: Choose a chunk size that fits the token limit of the AI model you plan to use. Chunk Size is measured in characters while model limits are measured in tokens, so leave some headroom.
- Use overlap wisely: A small overlap (e.g., 50–200 characters) helps preserve context across chunks.
- Set the right language: Selecting the correct language improves sentence detection and reduces awkward splits.
- Custom separators: If your text uses a unique delimiter (e.g., `---`), set it in the Separator field to get cleaner splits.
- Test with sample text: Run the splitter on a short excerpt first to verify that the chunks look correct before processing large documents.
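For example, a `---` delimiter cuts the text much like plain string splitting does (a minimal illustration of what the Separator field controls):

```python
text = "Section A\n---\nSection B\n---\nSection C"
# Splitting on the custom delimiter yields one piece per section.
parts = [p.strip() for p in text.split("---")]
print(parts)  # → ['Section A', 'Section B', 'Section C']
```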
Security Considerations
The component processes data locally and does not send any information outside your environment. Ensure that any sensitive text is handled according to your organization’s data‑handling policies.