Batch Processing

Learn how to efficiently process large numbers of documents with Structurify.

Overview

Structurify processes documents in batches within a project. Upload all your documents to a project, then run a single extraction job to process them all.

Workflow

  1. Create a project with the appropriate template
  2. Upload all documents to the project
  3. Run one extraction job (processes all documents)
  4. Export all results at once

Example: Process 100 Invoices

import os
from pathlib import Path
from structurify import Structurify

client = Structurify(api_key=os.environ['STRUCTURIFY_API_KEY'])

# Create project
project = client.projects.create(
    name="Q1 2026 Invoices",
    template_id="tpl_invoice"
)

# Upload all invoices
invoice_dir = Path("./invoices")
for pdf in invoice_dir.glob("*.pdf"):
    doc = client.documents.upload(
        project_id=project["id"],
        file_path=str(pdf)
    )
    print(f"Uploaded: {doc['name']}")

# Run extraction (one job for all documents)
job = client.extraction.run(project_id=project["id"])

# Wait for all to complete
completed = client.extraction.wait_for_completion(
    job["id"],
    timeout=600  # 10 minutes for large batches
)

print(f"Completed: {completed['completedTasks']}")
print(f"Failed: {completed['failedTasks']}")

# Export all results
export = client.exports.create(
    project_id=project["id"],
    format="csv"
)
csv_data = client.exports.download(export["export"]["id"])

Performance Tips

Parallel Uploads

Speed up uploads by running them concurrently. The upload_single helper below is a sketch that assumes the SDK's upload call is synchronous and off-loads it to a worker thread:

import asyncio
import os
from structurify import Structurify

client = Structurify(api_key=os.environ['STRUCTURIFY_API_KEY'])

async def upload_single(project_id: str, file_path: str):
    # Sketch: assumes the SDK upload call is synchronous,
    # so run it in a worker thread to avoid blocking the event loop.
    return await asyncio.to_thread(
        client.documents.upload, project_id=project_id, file_path=file_path
    )

async def upload_documents(project_id: str, files: list):
    tasks = []
    for file_path in files:
        task = asyncio.create_task(
            upload_single(project_id, file_path)
        )
        tasks.append(task)

    return await asyncio.gather(*tasks)
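
Run the coroutine with asyncio.run. A minimal usage sketch, reusing the project created in the invoice example above:

from pathlib import Path

files = [str(p) for p in Path("./invoices").glob("*.pdf")]
docs = asyncio.run(upload_documents(project["id"], files))
print(f"Uploaded {len(docs)} documents")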

Chunking Large Batches

For very large batches (1000+ documents), split into multiple projects:

def chunk_list(lst, chunk_size):
    for i in range(0, len(lst), chunk_size):
        yield lst[i:i + chunk_size]

files = list(Path("./documents").glob("*.pdf"))

for i, chunk in enumerate(chunk_list(files, 100)):
    project = client.projects.create(
        name=f"Batch {i+1}",
        template_id="tpl_invoice"
    )
    # Upload and process chunk...
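
Since each project runs its own extraction job and export, chunking also isolates batches: a failure in one chunk doesn't stall the others.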

Credit Usage

  • 1 credit = 1 document extracted
  • Credits are consumed when extraction runs
  • Failed extractions don't consume credits

Check your credit balance before large batches:

# Check in dashboard: https://app.structurify.ai
# Or handle InsufficientCreditsError
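
If a job starts without enough credits, catch the InsufficientCreditsError mentioned above. A minimal sketch; the structurify.errors import path is an assumption, not confirmed here:

from structurify.errors import InsufficientCreditsError  # assumed import path

try:
    job = client.extraction.run(project_id=project["id"])
except InsufficientCreditsError:
    print("Not enough credits; top up at https://app.structurify.ai")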

Progress Monitoring

Poll the job status to monitor progress while the batch runs:

import time

job = client.extraction.run(project_id=project["id"])

while True:
    status = client.extraction.get(job["id"])
    progress = status.get("progress", 0)
    completed = status.get("completedTasks", 0)
    total = status.get("totalTasks", 0)

    print(f"Progress: {progress}% ({completed}/{total})")

    if status["status"] in ["done", "error", "cancelled"]:
        break

    time.sleep(5)

Handling Failures

Some documents may fail extraction. Check the job results:

completed = client.extraction.wait_for_completion(job["id"])

if completed["failedTasks"] > 0:
    print(f"Warning: {completed['failedTasks']} documents failed")
    # Log failures for manual review
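
One way to capture the failures for review; the per-task tasks list and its status, name, and error fields are hypothetical here, not documented on this page:

# Hypothetical payload shape: assumes the job result exposes a per-task list
for task in completed.get("tasks", []):
    if task.get("status") == "error":
        print(f"Failed: {task.get('name')} - {task.get('error')}")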