Gemini OCR

A tool that reads and converts text from images and PDF files into editable digital text using Google’s advanced AI vision models.

How it Works

When you upload an image or PDF to this component, it first verifies that the file is valid and ready for processing. It then securely connects to Google’s Gemini AI vision models to analyze the visual content. The AI scans the document or picture, identifies letters, numbers, and formatting, and converts them into plain, editable text. The component handles both single images and multi-page PDFs. If you connect multiple files at once, it processes them together and returns the results in the same order as the input files, making it easy to keep your documents organized.
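
The flow described above (validate each file, then process a batch while keeping the input order) can be sketched in Python roughly as follows. This is only an illustration, not the component's actual implementation: `validate_file`, `fake_ocr`, and `process_batch` are hypothetical names, the magic-byte checks cover only PDF, PNG, and JPEG, and the OCR call is a stub standing in for the request to Gemini.

```python
from concurrent.futures import ThreadPoolExecutor

# Leading "magic bytes" for a few common formats (an assumed subset, for illustration only).
MAGIC_BYTES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def validate_file(data: bytes) -> str:
    """Return the detected file type, or raise ValueError if it is not recognized."""
    if not data:
        raise ValueError("empty file")
    for magic, kind in MAGIC_BYTES.items():
        if data.startswith(magic):
            return kind
    raise ValueError("unsupported file type")


def fake_ocr(data: bytes) -> str:
    """Stub standing in for the real vision-model call (hypothetical)."""
    kind = validate_file(data)
    return f"<text extracted from {kind}, {len(data)} bytes>"


def process_batch(files: list[bytes]) -> list[str]:
    """Process files concurrently; Executor.map yields results in input order."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(fake_ocr, files))
```

Using `Executor.map` rather than collecting futures as they complete is what keeps the results aligned with the inputs, even when some files finish sooner than others.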

Connection & Credentials

Before this component can call the external service, you must configure a credential in the Nappai panel:

1. Go to the Credentials section in your Nappai panel.
2. Create a new credential of the type Google Gemini and fill in the required fields (Google API Key).
3. In your workflow, select the saved credential in the Credential input field of this node.

Inputs

Mapping Mode

This component has a special mode called “Mapping Mode”. When you enable this mode using the toggle switch, an additional input called “Mapping Data” is activated, and each input field offers you three different ways to provide data:

Fixed: You type the value directly into the field.
Mapped: You connect the output of another component to use its result as the value.
Javascript: You write Javascript code to dynamically calculate the value.

This flexibility allows you to create more dynamic and connected workflows.

Input Fields

The following fields are available to configure this component. Which fields are visible may depend on the selected operation:

Vision Model: Choose the specific AI version that will analyze your document. Different models offer varying levels of accuracy and processing speed.
Max Tokens: Sets the maximum length of the text the AI can return. Higher limits allow for longer documents but may use more resources.
Multimedia Input: Upload or drag-and-drop the image or PDF file you want to extract text from. This is the primary file the AI will read.
JSON Parameters: Advanced settings for fine-tuning how the AI reads the text (e.g., temperature, penalties). Leave blank for default behavior.
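
As a rough illustration of how a JSON Parameters value might be interpreted, the Python sketch below treats a blank field as "use the defaults" and otherwise overlays the supplied keys on top of them. The parameter names and default values (`temperature`, `presence_penalty`, `frequency_penalty`) and the `resolve_params` helper are assumptions chosen for the example; check which keys your selected model actually accepts.

```python
import json

# Hypothetical defaults, for illustration only.
DEFAULT_PARAMS = {"temperature": 0.0, "presence_penalty": 0.0, "frequency_penalty": 0.0}


def resolve_params(raw: str) -> dict:
    """Merge user-supplied JSON over the defaults; blank input keeps the defaults."""
    if not raw.strip():
        return dict(DEFAULT_PARAMS)
    parsed = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(parsed, dict):
        raise ValueError("JSON Parameters must be a JSON object")
    return {**DEFAULT_PARAMS, **parsed}
```

For example, `resolve_params('{"temperature": 0.2}')` changes only the temperature and leaves the other assumed defaults in place.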

Outputs

This component returns the recognized text in a clean, plain format that can be easily used by other nodes in your workflow. It also provides optional metadata that tells you how confident the AI was about the extraction and how many pages were processed. You can connect this output to text generators, databases, or document creators to continue automating your processes.

Output Data Example (JSON)

{
  "text": "Invoice #1024\nDate: 2023-10-15\nTotal Due: $1,250.00\nThank you for your business.",
  "metadata": {
    "confidence_score": 0.94,
    "pages_processed": 1,
    "file_type": "PDF"
  }
}
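
A downstream step could consume an output shaped like the example above along these lines. This Python sketch assumes the output carries the `text` and `metadata` fields shown; the 0.9 review threshold and the `needs_review` helper are hypothetical.

```python
import json

# Output copied from the example above (raw string so "\n" stays a JSON escape).
raw_output = r"""
{
  "text": "Invoice #1024\nDate: 2023-10-15\nTotal Due: $1,250.00\nThank you for your business.",
  "metadata": {"confidence_score": 0.94, "pages_processed": 1, "file_type": "PDF"}
}
"""


def needs_review(result: dict, threshold: float = 0.9) -> bool:
    """Flag results whose reported confidence is missing or below the threshold."""
    score = result.get("metadata", {}).get("confidence_score")
    return score is None or score < threshold


result = json.loads(raw_output)
lines = result["text"].splitlines()  # one entry per line of recognized text
print(lines[0])              # Invoice #1024
print(needs_review(result))  # False
```

Because the metadata is optional, the helper treats a missing confidence score as a reason for manual review rather than raising an error.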

Connectivity

This component is typically placed early in a document processing pipeline. It usually receives files from Image Upload, Document Storage, or File Processing nodes. After extracting the text, it commonly connects to:

Text LLMs or AI Assistants: To summarize, translate, or classify the extracted content.
Database or Spreadsheet Connectors: To save the extracted data for record-keeping.
Document Generators: To fill out templates or create new reports based on the recognized text.

Usage Example

Automated Invoice Processing

Imagine you receive dozens of scanned invoices in PDF format each week. Instead of reading them manually, you can connect a Document Upload node to this component. The Gemini OCR node will read each invoice and output its contents, including the invoice number, date, and total amount, as plain text. You can then connect that output to an AI Assistant node that calculates taxes and summarizes the total, and finally route the results to a spreadsheet to automatically update your accounting records.
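
If the downstream step is plain code rather than an AI Assistant, the invoice fields could be pulled from the extracted text with regular expressions. The Python sketch below assumes the text layout of the sample output in this document (`Invoice #…`, `Date: …`, `Total Due: $…`); real invoices vary, so the patterns and the `parse_invoice` helper are illustrative only.

```python
import re
from decimal import Decimal

# Patterns matching the sample layout only (an assumption, not a general invoice parser).
PATTERNS = {
    "invoice_number": re.compile(r"Invoice\s*#\s*(\d+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total_due": re.compile(r"Total Due:\s*\$([\d,]+\.\d{2})"),
}


def parse_invoice(text: str) -> dict:
    """Return the fields that were found; missing fields map to None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    if fields["total_due"] is not None:
        # Decimal avoids float rounding issues with currency amounts.
        fields["total_due"] = Decimal(fields["total_due"].replace(",", ""))
    return fields
```

Returning `None` for fields that were not found lets a later step decide whether to skip the row or send the document for manual review.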

Tips and Best Practices

Use clear, high-resolution scans or photos for the best accuracy.
Adjust the Max Tokens field based on your typical document length to avoid cutting off long text or wasting processing power.
Test with a few sample pages before running large batches to ensure the AI reads the layout exactly as you expect.
Keep your Google Gemini API key updated and rotate it periodically for account security.
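
For the Max Tokens tip above, a rough budget can be estimated from the expected character count of a document. The Python sketch below uses the common rule of thumb of roughly four characters per token for English text; that ratio, the 20% headroom, and the `estimate_max_tokens` helper are assumptions, and actual token counts depend on the model and the language of the text.

```python
import math


def estimate_max_tokens(expected_chars: int, chars_per_token: int = 4, headroom_pct: int = 20) -> int:
    """Estimate a Max Tokens value from an expected character count, with headroom."""
    if expected_chars < 0:
        raise ValueError("expected_chars must be non-negative")
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    # Scale by the headroom percentage, then convert characters to tokens and round up.
    return math.ceil(expected_chars * (100 + headroom_pct) / (100 * chars_per_token))
```

Under these assumptions, a document of about 12,000 characters would suggest a limit of around 3,600 tokens; treat the figure as a starting point and adjust it after testing with sample pages.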

Security Considerations

Always ensure that the Google Gemini API key you configure is kept private and is only used within your authorized Nappai workflows. Avoid uploading highly sensitive or regulated documents (like tax returns or personal IDs) unless your organization’s data compliance policies explicitly allow AI-powered text extraction. The component does not store your documents after processing; it only holds them temporarily in memory to perform the reading operation.