Language Recursive Text Splitter
This component in Nappai helps you break down large blocks of text (like long documents or code) into smaller, more easily processed pieces. It takes the selected programming language into account, so splits tend to fall at natural boundaries in the code rather than in the middle of a statement.
Relationship with RecursiveCharacterTextSplitter
This component uses a technique called RecursiveCharacterTextSplitter to divide the text. The method tries a list of separators in order, from coarse (such as blank lines) to fine (such as single spaces), so related parts of the text tend to stay together in the same chunk, even when splitting long documents or code.
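The idea behind recursive splitting can be sketched in a few lines of plain Python. This is a simplified illustration only, not Nappai's or any library's actual implementation: the function name `recursive_split` and the separator list are assumptions chosen for the example, and the sketch leaves out chunk overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Hypothetical helper for illustration: tries the coarsest separator first
    and only falls back to finer ones for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator not present, so try the next finer one.
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep related parts together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # A single part is too long: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


source = "def a():\n    return 1\n\ndef b():\n    return 2\n\ndef c():\n    return 3"
chunks = recursive_split(source, chunk_size=25)
for chunk in chunks:
    print(repr(chunk))
```

With a chunk size of 25, each small function ends up in its own chunk, because the blank line between functions is tried as a split point before line breaks or spaces.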
Inputs
- Chunk Size: This sets the maximum length (in characters) of each smaller text chunk. The default is 1000 characters. Increase this for larger chunks, decrease it for smaller ones.
- Chunk Overlap: This determines how many characters from the end of one chunk are repeated at the beginning of the next. The default is 200 characters. This overlap helps to maintain context between chunks.
- Input: This is where you provide the text you want to split. You can input text from various sources supported by Nappai.
- Code Language: Select the programming language of the text (e.g., Python, Java, JavaScript). This helps the component make more accurate splits, especially for code. If it’s not code, select “None” or a suitable general language option.
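To see how Chunk Size and Chunk Overlap interact, here is a minimal sketch that uses a plain fixed-width sliding window. It is an assumption made for illustration: the real component breaks at natural boundaries rather than at fixed positions, and the name `sliding_chunks` is hypothetical. The defaults mirror the ones listed above.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-width chunks that share chunk_overlap characters.

    Hypothetical helper: illustrates the two settings only, not the
    boundary-aware splitting the component performs.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new chunk advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

In this small example every chunk after the first begins with the last two characters of the previous one, which is the repeated context that Chunk Overlap controls.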
Outputs
The component produces a set of smaller text chunks. These chunks are then available for use by other components in your Nappai workflow. For example, you might use them as input for a summarization component or a sentiment analysis component.
Usage Example
Imagine you have a 5000-word research paper, which is roughly 30,000 characters. Because Chunk Size is measured in characters, a Chunk Size of 1000 and a Chunk Overlap of 0 split it into about 30 chunks of up to 1000 characters each (often a few more, since the splitter prefers to break at natural boundaries). Then, you could use Nappai’s summarizer to create a summary of each chunk, and finally combine those summaries for a concise overview of the entire paper.
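Since Chunk Size counts characters, a rough chunk count can be estimated from the length of the text in characters. The sketch below assumes every chunk is filled to the maximum, which the real splitter will not always do, so treat the result as an approximate lower bound. The function name `estimate_chunk_count` is hypothetical.

```python
def estimate_chunk_count(text_length, chunk_size=1000, chunk_overlap=200):
    """Estimate how many chunks a text of text_length characters produces.

    Assumes fully filled chunks; each chunk after the first adds
    (chunk_size - chunk_overlap) new characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if text_length <= chunk_size:
        return 1 if text_length > 0 else 0
    step = chunk_size - chunk_overlap
    remaining = text_length - chunk_overlap
    return -(-remaining // step)  # ceiling division with integers


no_overlap = estimate_chunk_count(30_000, chunk_size=1000, chunk_overlap=0)
with_overlap = estimate_chunk_count(30_000, chunk_size=1000, chunk_overlap=200)
print(no_overlap, with_overlap)
```

The same text yields more chunks once overlap is enabled, because each chunk contributes fewer new characters.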
Templates
This component is not specifically tied to any pre-built templates, but it can be used in many workflows. You’ll add it directly to your custom workflow within the Nappai dashboard.
Related Components
- Summarizer: Use this component to summarize the smaller text chunks created by the Language Recursive Text Splitter.
- Categorizer: Categorize the content of each chunk after splitting.
- Entities extraction: Extract key entities from each chunk.
- Many other components: The output of this component (the smaller text chunks) can be used as input for many other components in Nappai, depending on your workflow.
Tips and Best Practices
- Start with the default values for Chunk Size and Chunk Overlap. Adjust them only if necessary.
- Experiment with different Chunk Overlap values to find the right balance between context preservation and redundancy. A larger overlap helps maintain context but results in more repeated text and more chunks to process.
- Choose the correct Code Language to ensure accurate splitting, especially for code.
Security Considerations
This component does not handle sensitive data directly. The security of your data depends on the security of the input data and the other components in your Nappai workflow.