CharacterTextSplitter
The CharacterTextSplitter component takes a long piece of text and breaks it into smaller, more manageable chunks. This is useful when you need to process or analyze large documents in parts, such as feeding them into a language model or storing them in a database.
How it Works
The component works entirely inside your dashboard. It reads the text you provide and divides it into pieces that are no longer than the Chunk Size you set. If you also set a Chunk Overlap, each new chunk will share a few characters with the previous one, which helps preserve context when the chunks are processed separately. You can choose which characters to split on with the Separator field; if you leave it empty, the component splits on two consecutive line breaks ("\n\n").
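The behaviour described above can be sketched in plain Python. This is a minimal illustration under the stated assumptions (split on a separator, pack pieces up to Chunk Size, repeat the last Chunk Overlap characters), not the component's actual implementation; the function name `split_text` is hypothetical:

```python
def split_text(text, chunk_size=1000, chunk_overlap=200, separator="\n\n"):
    """Split text on a separator, then pack the pieces into chunks of
    at most chunk_size characters, repeating chunk_overlap characters
    at the start of each new chunk to preserve context."""
    pieces = text.split(separator) if separator else [text]
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + separator + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Carry the tail of the previous chunk into the next one.
            overlap = current[-chunk_overlap:] if chunk_overlap else ""
            current = (overlap + separator + piece) if overlap else piece
    if current:
        chunks.append(current)
    # Note: a single piece longer than chunk_size is kept whole in this sketch.
    return chunks
```

For example, splitting `"aaa\n\nbbb\n\nccc"` with a chunk size of 8 and an overlap of 2 yields two chunks, the second beginning with the last two characters of the first.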
Inputs
- Input: The texts to split. Provide the document or data you want to divide.
- Chunk Overlap: The amount of overlap between chunks. Set how many characters should be repeated at the start of each new chunk.
- Chunk Size: The maximum length of each chunk. Choose the largest number of characters each piece can contain.
- Separator: The characters to split on. If left empty, defaults to "\n\n".
Outputs
- Data: The split text pieces. The component returns a list of text chunks that you can feed into other components, such as a language model or a storage component.
Usage Example
- Drag the CharacterTextSplitter onto your workflow.
- Connect the output of a document‑loading component (e.g., PDFLoader) to the Input field.
- Set Chunk Size to 1000 and Chunk Overlap to 200.
- Leave Separator empty to use the default double‑line‑break split.
- Connect the Data output to the next component in your workflow, such as a TextEmbedding component.
The workflow will now split your long document into 1,000‑character chunks, each overlapping the previous chunk by 200 characters, ready for further processing.
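Before wiring the Data output further, you can sanity-check the chunks on a sample document. The helper below is a hypothetical illustration (not part of the component) of what correct chunks look like, assuming the overlap is copied verbatim from the end of the previous chunk:

```python
def check_chunks(chunks, chunk_size=1000, chunk_overlap=200):
    """Verify that every chunk respects the size limit and that each
    chunk after the first begins with the last chunk_overlap characters
    of its predecessor (the repeated context described above)."""
    for i, chunk in enumerate(chunks):
        assert len(chunk) <= chunk_size, f"chunk {i} exceeds {chunk_size} chars"
        if i > 0 and chunk_overlap:
            expected = chunks[i - 1][-chunk_overlap:]
            assert chunk.startswith(expected), f"chunk {i} is missing its overlap"
    return True
```

Running this on the split output of a sample document catches misconfigured sizes or overlaps before you process the full workflow.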
Related Components
- TextSplitter – A generic splitter that can use different strategies (e.g., by sentence or paragraph).
- DocumentLoader – Loads documents from files or URLs into the workflow.
- TextEmbedding – Converts text chunks into vector embeddings for similarity search or clustering.
Tips and Best Practices
- Choose a sensible chunk size: Too small and you lose context; too large and you may hit token limits in downstream models.
- Use overlap sparingly: Overlap helps preserve context but increases the amount of data processed.
- Test with a sample document: Verify that the chunks look correct before running the full workflow.
- Keep separators consistent: If your text uses a different delimiter, set the Separator accordingly to avoid unintended splits.
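The separator tip is easy to see with plain string splitting. This standalone snippet (not tied to the component) shows why a separator that never occurs in your text leaves everything in one oversized piece:

```python
text = "First item; second item; third item"

# The default "\n\n" separator never occurs in this text, so it stays
# in a single piece and chunking cannot respect the size limit:
assert text.split("\n\n") == [text]

# Matching the document's actual delimiter yields sensible pieces:
assert text.split("; ") == ["First item", "second item", "third item"]
```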
Security Considerations
This component processes text locally and does not send data outside your environment. Ensure that any sensitive documents are handled in accordance with your organization’s data‑handling policies.