BM25 Retriever

The BM25 Retriever acts like a smart keyword finder for your documents. When you connect your data collection and type a search phrase, the component scans through your files using a fast, reliable matching system called BM25. Instead of relying solely on complex AI predictions, it focuses on how often your specific keywords appear and how unique they are across all documents. You can control how many results you receive, and the system will either return the matched documents immediately or package the search tool for you to use in later steps of your workflow.

How it Works

When you run this component, it performs three simple steps:

  1. Connects to your document collection: It reads from the vector store you link to it, pulling in the text and labels stored there.
  2. Matches your query: When the component runs, it checks each document against the keywords in your search phrase using BM25, an established ranking method. It calculates a relevance score from how often each keyword appears in a document and how rare that keyword is across the whole collection.
  3. Returns structured results: The system either delivers the top matching documents directly to the next step in your dashboard, or saves the search setup so you can reuse it without rebuilding it later. If it encounters a problem (like an empty collection or missing data), it safely logs the issue and passes a clear error message instead of breaking your workflow.
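The scoring in step 2 can be sketched in plain Python. This is a minimal illustration of Okapi BM25, not the component's actual implementation: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the non-negative IDF variant, and the function names are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption; real tokenization may differ).
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Terms that are rare across the collection get a higher weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Length normalization: long documents are penalized slightly.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "reset your password in account settings",
    "billing invoices are sent monthly",
    "password reset link is sent by email",
]
print(bm25_scores("reset password", docs))
```

Documents that contain none of the query keywords score exactly zero, which is why keyword-focused queries matter for this component.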

Inputs

Input Fields

  • Parent document vectorstore: The main collection of documents you want to search through. Connect your document library, knowledge base, or database here.
  • Search Query: The keywords or phrase you want to find. This field is required and drives the entire search process.
  • Top K: The maximum number of results you want to receive. Defaults to 5. Adjust this number based on how much context you need.

Outputs

The BM25 Retriever produces two types of outputs depending on your workflow needs:

  • Resultados (Results): A structured list of the matching documents. Each item includes the document text, any attached labels, a relevance score, and a position number. Use this when you need immediate answers for chatbots, summaries, or displays.
  • Retriever: A ready-to-use search tool that you can pass to other components. This is useful when building multi-step pipelines, as it saves you from reconfiguring the search setup repeatedly.

Output Data Example (JSON)

[
  {
    "page_content": "Your password can be reset through the account settings page. Visit the login portal and click 'Forgot Password' to receive a reset link.",
    "metadata": {
      "source": "support_knowledge_base.pdf",
      "author": "IT_Support_Team",
      "last_updated": "2023-11-15"
    },
    "score": 0.82,
    "rank": 1
  }
]
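A downstream step could consume this structure roughly as follows. This is a sketch that assumes the output arrives as a JSON string with the fields shown in the example; the `select_results` helper, the `0.5` threshold, and the second sample record are hypothetical. A sensible threshold depends on the score scale your collection actually produces.

```python
import json


def select_results(payload, min_score=0.5):
    """Parse the JSON payload, keep items at or above min_score, order by rank."""
    items = json.loads(payload)
    kept = [item for item in items if item.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda item: item["rank"])


# Sample payload; the second record is made up for illustration.
payload = json.dumps([
    {
        "page_content": "Visit the login portal and click 'Forgot Password'.",
        "metadata": {"source": "support_knowledge_base.pdf"},
        "score": 0.82,
        "rank": 1,
    },
    {
        "page_content": "Invoices are emailed monthly.",
        "metadata": {"source": "billing_faq.pdf"},
        "score": 0.21,
        "rank": 2,
    },
])

for item in select_results(payload):
    print(item["rank"], item["metadata"]["source"], item["score"])
```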

Connectivity

In a typical Nappai workflow, this component connects to document or data storage nodes via the Parent document vectorstore input. Once connected, you link a text or message field to the Search Query input. The Resultados output is usually sent to language model (LLM) nodes for summarization, text formatting components for display, or chat interfaces for direct user responses. The Retriever output is best used when chaining multiple search or analysis steps together, allowing downstream components to reuse the same search configuration without rebuilding it.

Usage Example

Scenario: Automating Customer Support Responses

  1. Connect your internal knowledge base or document library to the Parent document vectorstore input.
  2. In your workflow, connect a text prompt or user message to the Search Query input.
  3. Set Top K to 3 to limit results to the three most relevant documents.
  4. Connect the Resultados output to a text summarizer or chatbot response node.
  5. When a user submits a query, the component quickly pulls the top 3 matching support articles and passes them to the chatbot, enabling faster and more accurate automated replies.
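What the chatbot node does with the three retrieved articles might look like the sketch below. It assumes each result is a dictionary with the `page_content`, `metadata`, and `rank` fields from the output example; `build_context`, `build_prompt`, and the prompt wording are hypothetical and not part of Nappai.

```python
def build_context(results, max_docs=3):
    """Join the top-ranked documents into one context string for an LLM prompt."""
    top = sorted(results, key=lambda r: r["rank"])[:max_docs]
    parts = []
    for r in top:
        # Fall back to "unknown" when a document carries no source label.
        source = r.get("metadata", {}).get("source", "unknown")
        parts.append(f"[{r['rank']}] ({source}) {r['page_content']}")
    return "\n".join(parts)


def build_prompt(question, results):
    """Combine the retrieved context and the user question into one prompt."""
    context = build_context(results)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


sample_results = [
    {
        "page_content": "Click 'Forgot Password' on the login portal.",
        "metadata": {"source": "support_knowledge_base.pdf"},
        "rank": 1,
    },
]
print(build_prompt("How do I reset my password?", sample_results))
```

Numbering each snippet by rank and naming its source makes it easier for the language model, and for anyone reviewing the reply, to trace an answer back to a specific article.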

Important Notes

🔒 Error logging exposure (🟡 Medium): The system saves detailed error messages for troubleshooting. If your documents contain sensitive or confidential information, please ensure logging is properly managed or disabled in production environments.

⚠️ Maximum documents loaded (🟡 Medium): The component initially fetches up to 1,000 documents to build its search index. If your collection contains more than 1,000 items, additional documents will not be checked, which might cause you to miss some relevant results.

⚠️ Empty vectorstore error (🟡 Medium): If the connected document collection is empty or has not been populated yet, the component will return an error message instead of search results. Always ensure your data source has content before running searches.
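The "error message instead of search results" behavior can be pictured with the sketch below. The `safe_search` wrapper, the logger name, and the `{"error": ...}` record shape are assumptions for illustration; the component's real error format may differ. The log line records only the exception type, which is one way to keep document text out of the logs.

```python
import logging

logger = logging.getLogger("bm25_retriever_sketch")


def safe_search(documents, query, search_fn):
    """Return search results, or a single error record if the search cannot run."""
    try:
        if not documents:
            raise ValueError("The connected document collection is empty.")
        return search_fn(documents, query)
    except Exception as exc:
        # Log the exception type only, not document content (it may be sensitive).
        logger.error("BM25 search failed: %s", type(exc).__name__)
        return [{"error": str(exc)}]


# A trivial stand-in search function: keep documents containing the query text.
def contains(documents, query):
    return [doc for doc in documents if query in doc]


print(safe_search([], "password", contains))
print(safe_search(["reset your password"], "password", contains))
```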

📋 VectorStore compatibility requirement (🟡 Medium): This component only works with document collections that support standard search and retrieval functions. Collections lacking these basic features will trigger an error.

📋 Provide a compatible VectorStore (🟡 Medium): You must connect a valid vector store object that can search and retrieve documents. Without a properly configured data source, the component cannot function.

📋 Install langchain_community (🟡 Medium): This tool requires the langchain_community package. Ensure it is installed in your Python environment before using this component in your workflows.
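One way to confirm the dependency before running a workflow is a quick import check using only the standard library. The `is_installed` helper is hypothetical. The install hint assumes the package is published on PyPI as `langchain-community`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("langchain_community"):
    print("Missing dependency: try `pip install langchain-community`")
```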

💡 Craft specific search queries (🟢 Low): Use clear, keyword-focused phrases instead of vague or overly broad questions. Specific terms help the system rank documents more accurately and reduce irrelevant matches.

💡 Set an appropriate Top K value (🟢 Low): Choose a number that balances speed and accuracy. Setting it too high may return many weak matches and slow down your workflow.

💡 Maintain consistent document metadata (🟢 Low): Keep labels, tags, and extra information uniform across your documents. Consistent metadata makes it much easier to filter, organize, and interpret results later.

⚙️ Do not leave search query empty (🟡 Medium): The search query field is required. Leaving it blank will trigger an error. Always enter a valid phrase before running the component.

⚙️ Top K should not exceed available documents (🟡 Medium): If you request more results than actually exist in your collection, the system will simply return fewer items. Adjust the number to match your available data and avoid confusion.
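The two input rules above (a non-blank query, and Top K capped by what is available) can be checked before a search runs. This is a sketch with a hypothetical `validate_inputs` helper. It raises an exception for bad input, whereas the component itself may report the problem differently.

```python
def validate_inputs(query, top_k, available_docs):
    """Return (clean_query, effective_k), or raise ValueError for bad input."""
    clean_query = (query or "").strip()
    if not clean_query:
        raise ValueError("Search Query is required and cannot be blank.")
    if top_k < 1:
        raise ValueError("Top K must be at least 1.")
    # Asking for more results than exist simply yields fewer items.
    return clean_query, min(top_k, available_docs)


print(validate_inputs("  reset password ", top_k=5, available_docs=3))
```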

ℹ️ Search results include rank and score (🟢 Low): Each returned document comes with a position number (rank) and a relevance score. You can use these values to display ordered results, filter by confidence, or prioritize high-quality matches.

Tips and Best Practices

  • Keep your connected document collection up to date and well-organized before running searches.
  • Use precise keywords rather than long, conversational sentences for better matching.
  • Start with a lower Top K value (e.g., 3–5) and increase it only if you need more context.
  • Use the Retriever output when building complex, multi-step automations to save time and avoid duplicate setup configurations.
  • Regularly test your workflows with sample queries to ensure the component returns the expected data structure.

Security Considerations

  • Review your logging settings if working with sensitive or private data to prevent accidental exposure of confidential information during debugging.
  • Verify that connected document collections comply with your organization’s data access and privacy policies before enabling automated retrieval.
  • Avoid embedding personally identifiable information (PII) in documents unless absolutely necessary, as retrieved content may be passed to downstream AI nodes.