Semantic Text Splitter
The Semantic Text Splitter helps you break long documents or paragraphs into smaller, logically related pieces. It uses the meaning of the text, not just word counts, so each chunk stays coherent and useful for later steps like summarization or analysis.
How it Works
The component takes a list of Data objects that contain text and optional metadata. It uses an embeddings model you provide to convert each sentence or phrase into a numerical vector that represents its meaning. By comparing the vectors of neighboring sentences, the splitter finds natural breakpoints where the meaning shifts. It then starts a new chunk wherever the shift exceeds the breakpoint threshold or, if you set Number of Chunks, divides the text into that many chunks. You can adjust how strict the threshold is and how many chunks you want. The result is a new list of Data objects, each holding one chunk of the original text.
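The idea of breaking where meaning shifts can be illustrated with a minimal sketch. It is not the component's implementation: `embed` is a toy bag-of-words stand-in for a real embeddings model, and `semantic_split` with its fixed `threshold` is a hypothetical helper that replaces the component's configurable threshold settings.

```python
import math
import re
from collections import Counter
from typing import List


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embeddings model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity; 0 means identical, 1 means unrelated."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_split(text: str, threshold: float = 0.9) -> List[str]:
    """Start a new chunk wherever adjacent sentences differ by more than threshold."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, next_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev_vec, next_vec) > threshold:
            # Meaning shifted enough: close the current chunk.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sample = (
    "The cat sat on the mat. The cat chased the mouse. "
    "Stock prices fell sharply today. Stock markets fell in Asia."
)
for chunk in semantic_split(sample):
    print(chunk)
```

With a real embeddings model the distances would reflect meaning rather than shared words, but the breakpoint logic would have the same shape.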
Inputs
- Data Inputs: A list of Data objects containing the text and metadata to split.
- Embeddings: The embeddings model to use for semantic similarity.
- Breakpoint Threshold Amount: The numeric value used with the selected Breakpoint Threshold Type to decide how large a shift in meaning must be before the text is split.
- Breakpoint Threshold Type: Method to determine the breakpoints. Options are ‘percentile’, ‘standard_deviation’, ‘interquartile’. Defaults to ‘percentile’.
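- Buffer Size: The size of the context buffer, typically the number of neighboring sentences combined with each sentence before it is embedded.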
- Number of Chunks: The number of chunks to split the text into.
- Sentence Split Regex: Regular expression to split sentences. Optional.
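To make the relationship between Breakpoint Threshold Type and Breakpoint Threshold Amount more concrete, the sketch below shows one plausible way a cutoff could be derived from the distances between adjacent sentences. The function name `breakpoint_cutoff` and the exact formulas are assumptions for illustration; the component's actual calculations are not stated here.

```python
import statistics
from typing import List


def breakpoint_cutoff(
    distances: List[float],
    threshold_type: str = "percentile",
    amount: float = 95.0,
) -> float:
    """Return the distance above which a breakpoint would be placed (assumed formulas)."""
    if threshold_type == "percentile":
        # amount is a percentile in [0, 100]; linear interpolation between ranks.
        ordered = sorted(distances)
        rank = (len(ordered) - 1) * amount / 100.0
        lower = int(rank)
        upper = min(lower + 1, len(ordered) - 1)
        return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
    if threshold_type == "standard_deviation":
        # amount is a multiplier on the standard deviation above the mean.
        return statistics.mean(distances) + amount * statistics.pstdev(distances)
    if threshold_type == "interquartile":
        # amount is a multiplier on the interquartile range above the mean.
        q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
        return statistics.mean(distances) + amount * (q3 - q1)
    raise ValueError(f"Unknown threshold type: {threshold_type!r}")


distances = [0.1, 0.2, 0.3, 0.4, 1.0]
for kind, amount in [("percentile", 95.0), ("standard_deviation", 1.5), ("interquartile", 1.0)]:
    print(kind, round(breakpoint_cutoff(distances, kind, amount), 3))
```

Under these assumptions, a higher amount raises the cutoff in every mode, so fewer breakpoints are found and the chunks are larger.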
Outputs
- Data: A list of Data objects (method: split_text). Each object contains one chunk of the original text, ready to be passed to other components such as summarizers or analyzers.
Usage Example
You have a long customer support transcript and want to feed it into a summarization component.
- Connect your transcript Data to the Semantic Text Splitter.
- Choose an embeddings model (e.g., OpenAI’s text-embedding-ada-002).
- Set Number of Chunks to 5.
- Leave other settings at their defaults.
- The splitter will output 5 coherent chunks.
- Connect the output to a Summarizer component to get a concise summary of each chunk.
Related Components
- Text Splitter – Splits text by fixed size or delimiter.
- Summarizer – Generates concise summaries of text chunks.
- Embedding Generator – Creates embeddings for text, which can be reused by the splitter.
Tips and Best Practices
- Use a high‑quality embeddings model for better chunk quality.
- Keep Number of Chunks moderate; too many small chunks can lose context.
- If your text contains special formatting, adjust Sentence Split Regex to correctly split sentences.
- Test with a small sample before running on large datasets to fine‑tune thresholds.
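As an illustration of the Sentence Split Regex tip, the sketch below compares a punctuation-based pattern with one that also splits on line breaks, which can help with transcripts whose lines do not end in punctuation. Both patterns and the `split_sentences` helper are assumptions for illustration, not the component's documented defaults.

```python
import re
from typing import List

# Assumed punctuation-based pattern: split on whitespace after ., ?, or !
PUNCTUATION_REGEX = r"(?<=[.?!])\s+"
# Hypothetical variant for transcripts: also split on line breaks.
TRANSCRIPT_REGEX = r"(?<=[.?!])\s+|\n+"


def split_sentences(text: str, pattern: str) -> List[str]:
    """Split text with the given regex and drop empty pieces."""
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]


transcript = (
    "Agent: Hello. How can I help?\n"
    "Customer: My order is late\n"
    "Agent: Let me check."
)

for name, pattern in [("punctuation", PUNCTUATION_REGEX), ("transcript", TRANSCRIPT_REGEX)]:
    print(name, split_sentences(transcript, pattern))
```

Trying both patterns on a small sample like this one shows whether unpunctuated lines are being merged into their neighbors before you run the splitter on a large dataset.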
Security Considerations
If you use an external embeddings service, the text may be sent over the network. Ensure that the service complies with your organization’s data‑privacy policies and that any sensitive information is handled appropriately.