Batch Processing Multi-Page Hindi PDFs: A Developer Guide

BharatOCR Team · 5 min read

If you've ever worked with Indian government documents, you know the pain. A single property registration file can run 40 pages. Court orders stretch to 80. Insurance claims pile up in stacks of PDFs, each one stuffed with Hindi text that needs to be digitized.

Processing these one page at a time is not practical. You need batch processing for Hindi PDFs — the ability to send an entire multi-page document and get structured text back for every page.

Why Batch Processing Hindi PDFs Matters

Most Indian administrative documents are not single-page affairs. Here's what real-world volumes look like:

  • Property registration deeds: 20-60 pages of mixed Hindi-English text
  • Court case files: 50-100+ pages, often scanned at varying quality
  • Insurance claim bundles: 10-30 pages combining forms, medical reports, and declarations
  • Bank loan applications: 15-40 pages of KYC documents, income proofs, and agreements

Sending each page as a separate API call creates unnecessary overhead — network latency, authentication on every request, and the hassle of stitching results back together. Batch processing solves all of this.

How BharatOCR Handles Multi-Page PDFs

BharatOCR accepts multi-page PDFs up to 50 pages in a single API call. The engine — built on PaddleOCR PP-OCRv5 — processes each page independently and returns results in page order, with confidence scores for every detected text block.

Here's a Python example to get you started:

import requests

API_KEY = "boc_your_api_key_here"
PDF_PATH = "property_deed_42pages.pdf"

# Upload the whole PDF in one request; "hi" selects Hindi.
with open(PDF_PATH, "rb") as f:
    response = requests.post(
        "https://api.bharatocr.com/api/v1/ocr",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"file": (PDF_PATH, f, "application/pdf")},
        data={"language": "hi"},
        timeout=120,  # generous, since a large PDF can take around a minute
    )

# Stop on HTTP errors instead of parsing an error body as a result.
response.raise_for_status()
result = response.json()

# Pages come back in page order, each with its own text blocks.
for page in result["pages"]:
    print(f"Page {page['page_number']}:")
    print(f"  Confidence: {page['confidence']:.2f}")
    print(f"  Text blocks: {len(page['blocks'])}")
    for block in page["blocks"]:
        print(f"    [{block['confidence']:.2f}] {block['text']}")

The response includes a pages array, where each entry contains the page number, an overall confidence score, and individual text blocks with their bounding box coordinates and per-block confidence.
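Once the pages are back, a common next step is stitching the blocks into per-page text. Here's a minimal sketch, assuming the response shape described above (pages, page_number, blocks, text) and assuming blocks arrive in reading order, which this guide does not confirm. The helper names page_text and document_text are illustrative, and the sample data is hand-written rather than real API output.

```python
def page_text(page):
    """Join the text of every block on one page, one block per line."""
    return "\n".join(block["text"] for block in page["blocks"])


def document_text(result):
    """Return {page_number: text}, with pages in ascending order."""
    pages = sorted(result["pages"], key=lambda p: p["page_number"])
    return {p["page_number"]: page_text(p) for p in pages}


# Hand-written sample in the assumed response shape (not real API output).
sample = {
    "pages": [
        {"page_number": 2, "confidence": 0.93,
         "blocks": [{"text": "पृष्ठ दो", "confidence": 0.93}]},
        {"page_number": 1, "confidence": 0.97,
         "blocks": [{"text": "विक्रय पत्र", "confidence": 0.98},
                    {"text": "Sale Deed", "confidence": 0.96}]},
    ]
}

texts = document_text(sample)
print(texts[1])
```

Sorting by page_number is defensive: results are described as arriving in page order, but the sort keeps the output stable if you merge results from several calls.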

Understanding Confidence Scores

Every text block comes with a confidence score between 0 and 1. For printed Hindi text in decent scan quality, you can expect scores above 0.95. Here's a rough guide:

| Confidence Range | What It Means | Action |
|---|---|---|
| 0.95 - 1.00 | High quality, reliable text | Use directly |
| 0.85 - 0.94 | Good quality, minor uncertainty | Acceptable for most workflows |
| 0.70 - 0.84 | Moderate quality, possible errors | Flag for human review |
| Below 0.70 | Poor scan or unusual font | Manual verification needed |

You can use these scores to build smart review workflows. For example, auto-approve pages above 0.90 and route lower-confidence pages to a human operator.
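The auto-approve idea can be sketched in a few lines. This assumes each page dict carries the page_number and confidence fields shown earlier; the route_pages helper, the choice to treat a missing score as 0.0, and the sample data are illustrative assumptions, so tune the threshold against your own documents.

```python
AUTO_APPROVE_THRESHOLD = 0.90  # cut-off from the example in the text


def route_pages(pages, threshold=AUTO_APPROVE_THRESHOLD):
    """Split page numbers into (auto_approved, needs_review) by confidence."""
    auto_approved, needs_review = [], []
    for page in pages:
        # "Above 0.90" is read as a strict comparison; a missing score
        # is treated as 0.0 so the page goes to a human (an assumption).
        if page.get("confidence", 0.0) > threshold:
            auto_approved.append(page["page_number"])
        else:
            needs_review.append(page["page_number"])
    return auto_approved, needs_review


# Hand-written sample pages (not real API output).
sample_pages = [
    {"page_number": 1, "confidence": 0.97},
    {"page_number": 2, "confidence": 0.90},
    {"page_number": 3, "confidence": 0.72},
]

approved, review = route_pages(sample_pages)
print("Auto-approved:", approved)
print("Needs review:", review)
```

Routing on the page-level score is the simplest option. If a single bad block matters (a name or a survey number, say), you may prefer to route on the lowest block confidence instead.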

Handling Mixed Hindi-English Pages

Indian documents are rarely pure Hindi. You'll find English headers, dates in both scripts, section numbers in Latin digits, and legal terms in English. BharatOCR handles this natively.

The PP-OCRv5 model recognizes both Devanagari and Latin scripts in a single pass. You don't need to specify "mixed mode" or make separate calls. Each text block in the response is tagged with the detected script, so you can filter or process them differently if needed.

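# process_hindi_text and process_english_text stand in for your own handlers.
for page in result["pages"]:
    for block in page["blocks"]:
        if block.get("script") == "devanagari":
            process_hindi_text(block["text"])
        else:
            process_english_text(block["text"])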
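If a block ever arrives without a script tag, one fallback is to look at the characters themselves: Devanagari occupies the Unicode range U+0900 to U+097F. The sketch below is a rough heuristic, not part of the BharatOCR API. The detect_script and block_script helpers, the majority-vote rule, and the "latin" label are all assumptions made for illustration.

```python
def detect_script(text):
    """Classify text as 'devanagari' or 'latin' by counting characters."""
    devanagari = sum(1 for ch in text if "\u0900" <= ch <= "\u097f")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    # Ties and letter-free text (digits, punctuation) fall back to 'latin'.
    return "devanagari" if devanagari > latin else "latin"


def block_script(block):
    """Prefer the tag in the response; guess from the text only if it is missing."""
    return block.get("script") or detect_script(block["text"])


print(detect_script("विक्रय पत्र"))
print(detect_script("Sale Deed 2024"))
```

Counting characters rather than checking the first one keeps a Hindi line with a Latin-digit date or an English abbreviation classified as Hindi.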

Batch Processing Hindi PDFs: Error Handling

Real-world PDFs are messy. Pages get corrupted during scanning, some pages are blank, and occasionally a page is just a photograph with no text. Your code needs to handle all of this gracefully.

BharatOCR returns per-page status codes. A corrupt page won't kill the entire request — you'll get results for the valid pages and error details for the problematic ones.

result = response.json()

successful_pages = []
failed_pages = []

for page in result["pages"]:
    if page.get("status") == "success":
        successful_pages.append(page)
    else:
        failed_pages.append({
            "page_number": page["page_number"],
            "error": page.get("error", "Unknown error")
        })

print(f"Processed: {len(successful_pages)}/{len(result['pages'])} pages")
if failed_pages:
    print(f"Failed pages: {[p['page_number'] for p in failed_pages]}")

For documents exceeding 50 pages, split the PDF into chunks on your end and make sequential calls. Libraries like PyMuPDF or pikepdf make this straightforward.
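Here's one way the splitting step might look with PyMuPDF. Treat it as a sketch under assumptions: chunk_ranges and split_pdf are illustrative helper names, the _partN.pdf naming scheme is arbitrary, and PyMuPDF is imported inside split_pdf so the range calculation runs even where the library is not installed.

```python
def chunk_ranges(total_pages, max_pages=50):
    """Return inclusive, zero-based (first, last) ranges of at most max_pages."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    return [
        (start, min(start + max_pages, total_pages) - 1)
        for start in range(0, total_pages, max_pages)
    ]


def split_pdf(path, max_pages=50):
    """Write each chunk to '<stem>_partN.pdf' and return the new file paths."""
    import fitz  # PyMuPDF, imported lazily (pip install pymupdf)

    stem = path[:-4] if path.lower().endswith(".pdf") else path
    out_paths = []
    with fitz.open(path) as src:
        ranges = chunk_ranges(src.page_count, max_pages)
        for n, (first, last) in enumerate(ranges, start=1):
            part = fitz.open()  # new, empty PDF
            part.insert_pdf(src, from_page=first, to_page=last)
            out_path = f"{stem}_part{n}.pdf"
            part.save(out_path)
            part.close()
            out_paths.append(out_path)
    return out_paths


# A 120-page court file becomes three chunks: 50, 50, and 20 pages.
print(chunk_ranges(120))
```

Remember that page numbers in each chunk's response restart at 1, so add the chunk's starting offset when you merge results back into one document.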

Performance Expectations

BharatOCR processes each page in under 2 seconds on average. For a 50-page PDF, expect total processing time of around 30-60 seconds depending on page complexity and scan quality.

A few tips to get the best results:

  • Scan at 300 DPI — lower resolutions hurt accuracy, especially for smaller Hindi text
  • Use grayscale or black-and-white scans when possible — color adds file size without improving text recognition
  • Straighten skewed scans before sending — while BharatOCR handles mild skew, heavily rotated pages drop in accuracy

Monitoring Your Usage

When you're processing high volumes, keep an eye on your consumption. The usage endpoint tells you exactly where you stand:

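usage_response = requests.get(
    "https://api.bharatocr.com/api/v1/usage",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
# Surface HTTP errors (such as a bad API key) before parsing the body.
usage_response.raise_for_status()
usage = usage_response.json()

print(f"Pages used this month: {usage['pages_used']}")
print(f"Pages remaining: {usage['pages_remaining']}")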

At Rs 5 per page, a 50-page property deed costs Rs 250 to digitize. Compare that to manual transcription, which could take a data entry operator an entire day.
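The same arithmetic scales to a whole batch. The sketch below uses only figures quoted in this guide (Rs 5 per page, 50 pages per call, under 2 seconds per page on average) and deliberately ignores the free pages and monthly plans. The estimate_batch helper is illustrative, and the time figure is a rough ceiling, not a guarantee.

```python
import math

PRICE_PER_PAGE_RS = 5      # pay-per-page rate quoted in this guide
MAX_PAGES_PER_CALL = 50    # per-request page limit quoted in this guide
MAX_SECONDS_PER_PAGE = 2   # "under 2 seconds on average", used as a ceiling


def estimate_batch(page_counts):
    """Estimate pages, API calls, cost, and rough time for a list of PDFs."""
    total_pages = sum(page_counts)
    # Each PDF needs one call per 50-page chunk, rounded up.
    api_calls = sum(math.ceil(n / MAX_PAGES_PER_CALL) for n in page_counts)
    return {
        "pages": total_pages,
        "api_calls": api_calls,
        "cost_rs": total_pages * PRICE_PER_PAGE_RS,
        "max_seconds": total_pages * MAX_SECONDS_PER_PAGE,
    }


# One 50-page deed, matching the Rs 250 figure in the text.
print(estimate_batch([50]))

# A mixed batch: a 42-page deed, an 80-page court order, a 12-page claim.
print(estimate_batch([42, 80, 12]))
```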

How BharatOCR Helps

BharatOCR was built for exactly this use case — high-volume Hindi document processing. The batch API accepts PDFs up to 50 pages, returns per-page confidence scores so you can build smart review pipelines, and handles mixed Hindi-English text without any extra configuration.

The pricing is straightforward: 3 free pages to test, then Rs 5 per page or monthly plans from Rs 999 to Rs 9,999 depending on your volume. Every page is processed through PaddleOCR PP-OCRv5, which delivers 95%+ accuracy on printed Hindi text.

If you're building a document processing pipeline for Indian government files, legal documents, or financial records, batch processing through BharatOCR is the most practical path from scanned PDFs to structured, searchable text.

Start with the free tier, test with your actual documents, and scale from there.
