PDF Extractor
The PDF Extractor is a tool designed to help you get information out of PDF documents. PDFs are great for storing documents, but they are often difficult to edit or analyze because their content is laid out for display on a page rather than organized as structured data. This component acts as a bridge: it reads the PDF and converts its text, tables, and embedded images into a format you can use in the rest of your automation workflows.
Think of it as a digital scanner that doesn’t just take a picture of the page, but actually reads the words and organizes the data inside it.
How it Works
Internally, this component uses the PyMuPDF library (imported in Python as fitz) to analyze your file. It processes the PDF one page at a time:
- Reading Text: It scans the pages to find words and sentences, extracting them into plain text.
- Finding Tables: It looks for grid lines and structured layouts to identify tables, pulling that data out so you can use it in spreadsheets or databases.
- Extracting Images: If the PDF contains embedded images (like logos or charts), it pulls those images out as well.
Because PyMuPDF reads the document's internal structure directly, the component can handle many complex documents that confuse simpler tools. The result is a structured collection of text, table data, and image files that you can connect to other parts of your Nappai automation.
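The page-by-page process described above can be sketched with PyMuPDF as follows. This is a minimal illustration, not the component's actual source code: the function names `extract_from_document` and `extract_pdf` and the keys of the returned dictionary are assumptions made for this example. The PyMuPDF calls used (`get_text`, `find_tables`, `get_images`, `extract_image`) are part of the library; `find_tables` requires PyMuPDF 1.23 or newer.

```python
from typing import Any


def extract_from_document(doc: Any) -> dict:
    """Collect text, tables, and images from an opened PyMuPDF document.

    Hypothetical helper for illustration; the result keys are assumptions.
    """
    text_parts = []
    tables = []
    images = []
    for page_number, page in enumerate(doc, start=1):
        # Plain text of the page, in the order PyMuPDF reports it.
        text_parts.append(page.get_text())

        # Each detected table is extracted as a list of rows,
        # and each row is a list of cell values.
        for table in page.find_tables().tables:
            tables.append(table.extract())

        # get_images() lists embedded images; the first tuple item is the xref.
        for image_info in page.get_images(full=True):
            extracted = doc.extract_image(image_info[0])
            images.append({
                "page": page_number,
                "type": "image/" + extracted["ext"],
                "data": extracted["image"],  # raw image bytes
            })
    return {"text": "\n".join(text_parts), "tables": tables, "images": images}


def extract_pdf(path: str) -> dict:
    """Open a PDF from disk and run the extraction above."""
    import fitz  # PyMuPDF, imported here so the helper above has no hard dependency

    doc = fitz.open(path)
    try:
        return extract_from_document(doc)
    finally:
        doc.close()
```

Keeping the per-page logic separate from file opening is a design choice for this sketch: it lets the extraction loop be exercised with any object that behaves like a PyMuPDF document.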
Connection & Credentials
This component does not require external API keys, passwords, or specific credential configurations. It works directly on the files you provide within your workflow.
Inputs
The following fields are available to configure this component. Note that the specific input fields (such as file path) are managed by the base system settings, but typically you will need to provide the source PDF file.
- File: The PDF document you want to extract data from. This is usually the primary input the component needs in order to function.
- Visible in: All operations
Outputs
When the component finishes processing, it produces a structured output containing the extracted elements. You can connect these outputs to other components in your workflow to further process or store the data.
- Text Content: The raw text found in the PDF.
- Tables: Structured data representing any tables found in the document.
- Images: The visual elements extracted from the PDF.
Output Data Example (JSON)
Here is an example of how the data might look after a successful extraction. It shows a simple structure containing the text, one table (a header row and two data rows), and one image entry.

```json
{
  "text": "This is the main text content extracted from the PDF pages.",
  "tables": [
    [
      ["Product", "Price", "Stock"],
      ["Laptop", "$999", "5"],
      ["Mouse", "$25", "50"]
    ]
  ],
  "images": [
    {
      "id": "img_001",
      "type": "image/png",
      "note": "Extracted image data"
    }
  ]
}
```
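If the output follows the shape in the example above, where each table is a list of rows and the first row holds the column names, a downstream step could turn each table into records keyed by column name. The sketch below assumes exactly that shape; the helper name `table_to_records` is hypothetical and not part of the component.

```python
import json


def table_to_records(table):
    """Turn [header, row, row, ...] into a list of dicts keyed by header."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


# Sample data copied from the example output above.
sample_output = json.loads("""
{
  "text": "This is the main text content extracted from the PDF pages.",
  "tables": [
    [
      ["Product", "Price", "Stock"],
      ["Laptop", "$999", "5"],
      ["Mouse", "$25", "50"]
    ]
  ],
  "images": []
}
""")

records = [table_to_records(table) for table in sample_output["tables"]]
print(records[0][0])  # {'Product': 'Laptop', 'Price': '$999', 'Stock': '5'}
```

Records in this form map more directly onto database rows or spreadsheet columns than nested lists do. Note that the cell values stay as strings, so prices and stock counts would still need parsing before any arithmetic.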
Connectivity
This component is typically placed at the beginning of a workflow that involves document processing.
- Connects To:
- Data Processors / Transformers: To clean up the extracted text or format the tables for database entry.
- Storage Components: To save the extracted text or tables into a database, spreadsheet, or cloud storage.
- AI Assistants: To send the extracted text to an AI model for summarization, translation, or analysis.
- Logical Flow:
- File Upload/Source: Your workflow usually starts by providing the PDF file (via a file trigger or previous step).
- PDF Extractor: Breaks down the file into text and data.
- Next Steps: The output flows into components that decide what to do with that text (e.g., “Summarize this text” or “Save this table”).
Usage Example
Scenario: Automating Invoice Processing
- Input: You have a folder of new PDF invoices.
- Process:
- Use the PDF Extractor on one of these invoices.
- The component identifies the Tables containing item names, prices, and quantities.
- It also extracts the Text for the invoice number and date.
- Output Usage:
- Connect the Tables output to a “Create Row” component to add the items to your inventory database.
- Connect the Text output to an AI Assistant to calculate the total tax automatically.
Important Notes
- Development Status: This component is currently in the development phase. This means its features or interface might change in future updates. It is recommended to test it thoroughly in a non-critical workflow before relying on it for important business processes.
- File Requirements: The input file must be a valid PDF. Scanned PDFs that are not searchable contain only pictures of the pages, so text extraction may return little or nothing from them unless text recognition (OCR) has been applied to the file first.
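One way to spot pages that have no searchable text is to check whether extraction returns anything for them. The sketch below assumes a document opened with PyMuPDF (or any object whose pages expose `get_text()`); the helper name `pages_without_text` is hypothetical and not part of the component.

```python
def pages_without_text(doc):
    """Return the 1-based numbers of pages whose extracted text is empty."""
    return [
        page_number
        for page_number, page in enumerate(doc, start=1)
        if not page.get_text().strip()
    ]
```

A non-empty result suggests that those pages are scanned images or otherwise lack a text layer, which is a useful signal to check before trusting the Text Content output.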
Tips and Best Practices
- Check Output Quality: After running the extractor, review the extracted text to ensure it reads logically. Complex layouts sometimes result in mixed-up text order.
- Use Tables for Data: If your PDF contains structured data (like a receipt or a report), prioritize connecting the “Tables” output to database or spreadsheet components for accurate data entry.
- Handle Large Files: For very large PDFs, extraction might take longer. Consider processing documents in batches if you are automating high-volume workflows.
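For the batching tip above, a small helper can split a list of files into fixed-size groups so that each group is processed separately. This is a generic sketch; the name `in_batches` and the batch size are assumptions, and how each batch is fed into a workflow depends on your setup.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def in_batches(paths: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(paths)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


for batch in in_batches(["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"], 2):
    print(batch)
```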
Security Considerations
- Local Processing: The extraction happens locally within the Nappai environment using the PyMuPDF library. Ensure you trust the PDF files you are uploading, as extracting text from malicious PDFs can sometimes pose risks if the underlying library has vulnerabilities.
- Data Privacy: Be mindful of the sensitive data (like personal info or financial records) contained in the PDFs you process, as the extracted text and tables will be available for downstream components to access.