Semantic Text Splitter

The Semantic Text Splitter breaks text into meaningful parts using semantic similarity. Each resulting segment stays coherent because related content is kept together, which makes the text easier to process and analyze.

Relationship with Semantic Similarity Models

This component uses an embeddings model to measure semantic similarity and decide where to split the text. By comparing how closely different parts of the text are related, it places boundaries where the meaning shifts, so each segment is coherent.
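
As a rough illustration of how embeddings can expose a topic shift, the sketch below scores each pair of neighboring sentences by cosine distance. It is only a minimal sketch: `toy_embed` is a hypothetical bag-of-words stand-in for a real embeddings model, and `adjacent_distances` is an illustrative helper, not part of the component.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Stand-in embedding: a bag-of-words count vector (NOT a real model)."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity: near 0.0 for similar text, 1.0 for no overlap."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def adjacent_distances(sentences: list[str]) -> list[float]:
    """Distance between each sentence and the one that follows it."""
    vectors = [toy_embed(s) for s in sentences]
    return [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]


sentences = [
    "The cat sat on the mat.",
    "The cat slept on the mat.",
    "Interest rates rose sharply this quarter.",
]
distances = adjacent_distances(sentences)
print(distances)
```

The first pair shares most of its words and scores a small distance, while the second pair shares none and scores the maximum. A large distance between neighbors is the kind of signal a semantic splitter can treat as a candidate split point.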

Inputs

  • Data Inputs: A list of data objects containing the text and metadata to be split.
  • Embeddings: The model used to evaluate semantic similarity.
  • Breakpoint Threshold Type: Method to determine split points, such as ‘percentile’, ‘standard deviation’, or ‘interquartile’.
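  • Breakpoint Threshold Amount: Numerical value for the breakpoint threshold, interpreted according to the selected Breakpoint Threshold Type (for example, a percentile or a number of standard deviations).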
  • Number of Chunks: Specifies how many segments the text should be divided into.
  • Sentence Split Regex: Optional advanced setting for splitting sentences using regular expressions.
  • Buffer Size: Advanced setting for the buffer size, typically the number of neighboring sentences grouped with each sentence before its embedding is computed.
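
To make the two breakpoint settings more concrete, the sketch below turns a threshold type and amount into a cut-off distance. The formulas are assumptions based on common conventions (percentile of the distances, mean plus a multiple of the standard deviation, mean plus a multiple of the interquartile range); the source does not confirm that the component uses exactly these, and the function names and the `"standard_deviation"` key are hypothetical.

```python
import statistics


def percentile(values: list[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in the range 0 to 100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def breakpoint_threshold(distances: list[float], kind: str, amount: float) -> float:
    """Map a threshold type and amount to a cut-off distance (assumed formulas)."""
    if kind == "percentile":
        # Split where the distance exceeds the given percentile of all distances.
        return percentile(distances, amount)
    if kind == "standard_deviation":
        # Split where the distance is `amount` standard deviations above the mean.
        return statistics.mean(distances) + amount * statistics.pstdev(distances)
    if kind == "interquartile":
        # Split where the distance is `amount` interquartile ranges above the mean.
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + amount * iqr
    raise ValueError(f"unknown threshold type: {kind}")


distances = [0.1, 0.2, 0.1, 0.9, 0.2]
for kind, amount in [("percentile", 95), ("standard_deviation", 1.0), ("interquartile", 1.5)]:
    cutoff = breakpoint_threshold(distances, kind, amount)
    breakpoints = [i for i, d in enumerate(distances) if d > cutoff]
    print(kind, round(cutoff, 3), breakpoints)
```

Under these assumptions, a higher amount raises the cut-off and yields fewer, larger segments, while a lower amount produces more splits.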

Outputs

The component outputs data containing the input text split into semantically meaningful segments. Use this output in workflows where each segment must stay coherent, such as data analysis or content management.

Usage Example

Imagine you have a long document that needs to be analyzed. By using the Semantic Text Splitter, you can break down the document into smaller, meaningful parts, making it easier to focus on specific sections and extract valuable insights.
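
The scenario above can be sketched end to end in a few lines. This is an illustrative approximation, not the component's implementation: the `Data` class is a hypothetical stand-in for the component's data objects, `toy_embed` replaces a real embeddings model with word counts, and the fixed `threshold` argument stands in for the breakpoint settings.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Data:
    """Hypothetical stand-in for a data object holding text and metadata."""
    text: str
    metadata: dict = field(default_factory=dict)


def toy_embed(text: str) -> Counter:
    """Bag-of-words vector used in place of a real embeddings model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def distance(a: Counter, b: Counter) -> float:
    """Cosine distance between two count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def semantic_split(item: Data, threshold: float) -> list[Data]:
    """Start a new chunk wherever neighboring sentences differ by more than threshold."""
    sentences = re.split(r"(?<=[.?!])\s+", item.text.strip())
    vectors = [toy_embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    # Carry the original metadata over to every chunk and record its position.
    return [
        Data(text=chunk, metadata={**item.metadata, "chunk_index": n})
        for n, chunk in enumerate(chunks)
    ]


document = Data(
    text=(
        "The cat sat on the mat. The cat slept on the mat. "
        "Interest rates rose sharply. Rates rose again in March."
    ),
    metadata={"source": "demo.txt"},
)
chunks = semantic_split(document, threshold=0.8)
for chunk in chunks:
    print(chunk.metadata["chunk_index"], chunk.text)
```

With this toy input, the two sentences about the cat stay together and the two sentences about interest rates form a second segment, each keeping the original `source` metadata.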

Templates

Currently, there are no specific templates where this component is pre-configured. However, it can be integrated into any workflow that requires text segmentation.

Related Components

  • libSQLRetrieverTool: Interacts with libSQL Retriever for database operations.
  • Embedding Similarity: Computes similarity between two embedding vectors.
  • Natural Language Text Splitter: Splits text based on natural language boundaries.
  • Recursive Character Text Splitter: Splits text while keeping related text together.

Tips and Best Practices

  • Ensure that the embeddings model is well-suited for your text type to achieve the best results.
  • Adjust the breakpoint threshold settings to fine-tune the segmentation according to your needs.
  • Use advanced settings such as Sentence Split Regex and Buffer Size for finer control over the splitting process.
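
The two advanced settings can be pictured with the sketch below. It assumes, without confirmation from this page, that the sentence regex marks the boundaries between sentences and that the buffer size is the number of neighboring sentences joined to each sentence before comparison. `split_sentences` and `with_buffer` are hypothetical helper names.

```python
import re


def split_sentences(text: str, pattern: str = r"(?<=[.?!])\s+") -> list[str]:
    """Split text at every match of the pattern, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text.strip()) if piece]


def with_buffer(sentences: list[str], buffer_size: int = 1) -> list[str]:
    """Join each sentence with up to buffer_size neighbors on each side."""
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


# Default pattern: split after sentence-ending punctuation.
sentences = split_sentences("First point. Second point! Third point?")
print(with_buffer(sentences, buffer_size=1))

# Custom pattern: treat each line of a log or transcript as one sentence.
lines = split_sentences("step one\nstep two\nstep three", pattern=r"\n+")
print(with_buffer(lines, buffer_size=0))
```

Under this assumption, a larger buffer smooths out noise from very short sentences, at the cost of making individual topic shifts less sharp.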

Security Considerations

When using this component, ensure that any sensitive data within the text is handled according to your organization’s data protection policies. Consider anonymizing data if necessary before processing.
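
As one possible way to anonymize text before it reaches the splitter, the sketch below masks email addresses and phone-like numbers with regular expressions. The patterns are illustrative assumptions and will miss many kinds of sensitive data, so treat this as a starting point rather than a complete safeguard.

```python
import re

# Simple illustrative patterns; real policies usually need broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


print(redact("Contact jane.doe@example.com or +1 (555) 123-4567."))
```

Running the redaction step first means the embeddings model never receives the original identifiers.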