Amazon S3

The Amazon S3 component lets you pull files or folders from an S3 bucket into your Nappai workflow. You can choose to read a single file, a whole folder, or a list of files that are referenced in other data objects. The component can also download the file contents or just return the file paths.

How it Works

When you run the component, it connects to Amazon S3 using the credentials you set up in Nappai.

If you give it an S3 Object (a single file), it downloads that file.
If you give it an S3 Prefix (a folder path), it walks through the folder, optionally filtering files with regular expressions.
If you provide a list of data objects that contain S3 keys or prefixes, the component reads each one and pulls the corresponding files.
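
The regular-expression filtering during a prefix crawl can be pictured with a minimal Python sketch. It assumes that patterns are split on commas, that a pattern may match anywhere in the S3 key, and that an Always Include match overrides any Ignore Files match. The function names `parse_patterns` and `should_load` are hypothetical and are not part of Nappai.

```python
import re

def parse_patterns(raw: str) -> list[re.Pattern]:
    """Split a comma-separated string into compiled regexes, skipping blanks."""
    return [re.compile(part.strip()) for part in raw.split(",") if part.strip()]

def should_load(key: str, always_include: list[re.Pattern],
                ignore: list[re.Pattern]) -> bool:
    """Decide whether an S3 key passes the filters (assumed precedence)."""
    # Assumption: an "always include" match wins over any ignore pattern.
    if any(p.search(key) for p in always_include):
        return True
    # Otherwise the key is loaded only if no ignore pattern matches it.
    return not any(p.search(key) for p in ignore)

include = parse_patterns(r"summary\.csv$")
ignore = parse_patterns(r"\.csv$, ^tmp/")
for key in ["reports/2024/summary.csv", "reports/2024/raw.csv", "tmp/notes.txt"]:
    print(key, should_load(key, include, ignore))
```

Because the list is split on commas, a pattern that itself contains a comma (for example a quantifier such as `{1,3}`) would be broken apart under this assumption, so such patterns are best avoided or rewritten.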

The component can also keep track of which files it has already processed using a small SQLite database (the “Check Point” settings). This is useful when you run the component many times and only want to process new files.
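
The checkpoint idea can be illustrated with a small sketch built on Python's standard `sqlite3` module. The real schema used by the component is not documented here, so the table name `processed_files`, the column `s3_key`, and the helper names below are all hypothetical.

```python
import sqlite3

def open_checkpoint(db_path: str, table: str) -> sqlite3.Connection:
    """Open (or create) the checkpoint database and its table."""
    # Table names cannot be bound as SQL parameters, so validate before quoting.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    conn = sqlite3.connect(db_path)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (s3_key TEXT PRIMARY KEY)')
    conn.commit()
    return conn

def filter_new(conn: sqlite3.Connection, table: str, keys: list[str]) -> list[str]:
    """Return only keys not seen before, and record them as processed."""
    new_keys = []
    for key in keys:
        seen = conn.execute(
            f'SELECT 1 FROM "{table}" WHERE s3_key = ?', (key,)
        ).fetchone()
        if seen is None:
            conn.execute(f'INSERT INTO "{table}" (s3_key) VALUES (?)', (key,))
            new_keys.append(key)
    conn.commit()
    return new_keys

conn = open_checkpoint(":memory:", "processed_files")  # in-memory DB for the demo
print(filter_new(conn, "processed_files", ["a.txt", "b.txt"]))  # first run
print(filter_new(conn, "processed_files", ["b.txt", "c.txt"]))  # second run
```

In this sketch a key is recorded as soon as it is listed. A real implementation might instead record it only after the file has been processed successfully, so that failed files are retried on the next run.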

Inputs

Data with file: A list of data objects that contain S3 keys or prefixes. The component will read each entry and fetch the referenced files.
S3 Object: Select a single file from your S3 bucket.
S3 Prefix: Select a folder (prefix) inside your S3 bucket.
Always Include Regex: A comma‑separated list of regular expressions. Files that match any of these patterns will always be included, even if they would otherwise be excluded.
S3 Session Token: Optional session token for temporary credentials.
Batch Size: Number of files to process in one batch when crawling a folder.
Check Point Database: Name of the SQLite database that stores which files have already been processed.
Check Point Database Directory: Folder where the checkpoint database file is stored.
Check Point Table: Table name inside the checkpoint database.
S3 Object Key: Key name inside a data object that holds the S3 object path to load.
S3 Prefix Key: Key name inside a data object that holds the S3 prefix (folder) to load.
Download File content: If checked, the component will download the file contents and return them as part of the output.
File Type: Choose “all” to load every file type, or “documents” to load only text and image files.
Ignore Files Regex: A comma‑separated list of regular expressions. Files that match any of these patterns will be skipped.
Max Concurrency: Maximum number of parallel operations when crawling the bucket.
Max Concurrency Download: Maximum number of parallel downloads when pulling files.
Max Depth: How deep the crawler will go into sub‑folders.
Crawl Only Folders: If checked, the crawler will only list folders and ignore files.
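
The relationship between Data with file, S3 Object Key, and S3 Prefix Key may be easier to see in code. The sketch below assumes each incoming data object behaves like a dictionary and that the two key settings simply name the fields to read. The field names `s3_object` and `s3_prefix` and the function `collect_targets` are hypothetical.

```python
def collect_targets(items: list[dict], object_key: str, prefix_key: str):
    """Split incoming data objects into single-file keys and folder prefixes."""
    objects, prefixes = [], []
    for item in items:
        # A data object may carry a file path, a prefix, both, or neither.
        if item.get(object_key):
            objects.append(item[object_key])
        if item.get(prefix_key):
            prefixes.append(item[prefix_key])
    return objects, prefixes

incoming = [
    {"s3_object": "reports/2024/q1.pdf"},
    {"s3_prefix": "reports/2023/"},
    {"note": "no S3 reference here"},
]
objects, prefixes = collect_targets(incoming, "s3_object", "s3_prefix")
print(objects)   # single files to download
print(prefixes)  # folders to crawl
```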

Credential
This component requires an AWS S3 bucket credential.

Go to the Credentials section of Nappai and create a new credential of type “AWS S3 bucket”.

Enter your AWS Access Key, AWS Secret Access Key, S3 Bucket name, and AWS Region.

In the component, select this credential from the Credential dropdown.
The credential fields (Access Key, Secret Key, Bucket, Region) are not shown in the input list because they are supplied through the credential.

Outputs

Data: A list of Data objects that contain the text content of each file and metadata such as file name and path.
Files: A list of Data objects that contain file paths (and optionally the file content if “Download File content” is enabled).

Usage Example

Create a credential: In the Credentials tab, add an “AWS S3 bucket” credential with your AWS keys, bucket name, and region.
Add the Amazon S3 component to your workflow.
Select the credential from the component’s Credential field.
Choose an S3 Prefix (e.g., reports/2024/) to load all files in that folder.
Set File Type to “documents” if you only want text and image files.
Enable “Download File content” if you need the actual file data in the output.
Connect the “Data” output to the next component that will process the text (e.g., a text summarizer).
Run the workflow. The component will list the files in the chosen folder, download them if requested, and pass the results downstream.

Related Components

Text Extractor – Pulls text from PDFs or images.
File Processor – Performs operations on a list of file paths.
Data Enricher – Adds metadata to data objects.

Tips and Best Practices

Use Always Include Regex to guarantee that critical files are never missed, even if you set a broad ignore pattern.
Keep Batch Size and Max Concurrency moderate (e.g., 10–20) to avoid overwhelming your network or the S3 service.
Store the checkpoint database in a shared location if you run the workflow on multiple machines.
If you only need file names, leave “Download File content” unchecked to speed up the run.
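
To see why Batch Size and Max Concurrency matter, here is a rough sketch of how batching and a bounded worker pool interact. It is not the component's implementation: `fetch` stands in for whatever downloads one file, and `batched` and `process_in_batches` are hypothetical helpers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def batched(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch

def process_in_batches(keys, fetch, batch_size=10, max_concurrency=10):
    """Run `fetch` over keys batch by batch, with a capped number of threads."""
    results = []
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        for batch in batched(keys, batch_size):
            # pool.map returns results in input order; at most
            # max_concurrency calls to fetch run at the same time.
            results.extend(pool.map(fetch, batch))
    return results

keys = [f"reports/2024/file_{i}.txt" for i in range(25)]
sizes = process_in_batches(keys, fetch=len, batch_size=10, max_concurrency=4)
print(len(sizes))  # one result per key
```

With these settings, at most four fetches are in flight at once, and at most ten results are produced per batch before the next batch starts. Raising either number increases throughput only until the network or S3 request limits become the bottleneck.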

Security Considerations

Store your AWS credentials in Nappai’s secure credential store; never hard‑code them in the workflow.
The component uses the standard AWS SDK, so it respects IAM policies attached to the credentials.
If you enable “Download File content”, the file data is held in memory only for the duration of the workflow run.
Use the checkpoint database to avoid re‑processing sensitive files multiple times.