How to Extract Text from Hindi Documents Using an OCR API

A Hindi OCR API lets you send a scanned document, photo, or PDF and receive the extracted Devanagari text back as structured JSON. No local model setup, no GPU provisioning, no training data headaches. You make an HTTP request, and you get text.

If you are building a fintech app that needs to read Hindi ID proofs, a legal tech product that processes court orders, or a government digitization pipeline, this guide will show you exactly how to extract Hindi text using an OCR API — with working code examples you can copy and run.

What You Need Before Starting

To follow along, you will need:

A BharatOCR API key (sign up at bharatocr.com — you get 3 free pages to test)
A Hindi document in JPEG, PNG, PDF, TIFF, or BMP format
Python 3.7+ or cURL installed on your machine

Your API key will look like boc_xxxxxxxxxxxx. Keep it in an environment variable, not hardcoded in your source files. The examples in this guide read it from BHARATOCR_API_KEY.
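As a minimal pre-flight sketch, the Python snippet below checks that the key is available and that a file has one of the supported formats before any request is made. The helper names load_api_key and is_supported, and the mapping from formats to file extensions, are illustrative assumptions and not part of the BharatOCR API.

```python
import os
from pathlib import Path

# Extensions assumed to correspond to the formats listed above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".pdf", ".tif", ".tiff", ".bmp"}


def load_api_key(env=os.environ):
    """Return the API key from the environment or fail with a clear message."""
    key = env.get("BHARATOCR_API_KEY", "")
    if not key:
        raise RuntimeError("Set the BHARATOCR_API_KEY environment variable first.")
    return key


def is_supported(path):
    """Check the file extension against the supported formats (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("hindi_document.jpg"))
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the server.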

Sending Your First Hindi OCR API Request with cURL

The simplest way to test is with cURL. Here is a request that sends a base64-encoded image to the BharatOCR endpoint:

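# Encode your image to base64 (GNU coreutils syntax; on macOS use: base64 -i hindi_document.jpg)
BASE64_IMAGE=$(base64 -w 0 hindi_document.jpg)

# Send the OCR request. The JSON body is piped through stdin (-d @-)
# because a base64-encoded scan is often too large to pass as a command-line argument.
printf '{"image": "%s", "language": "hi", "output_format": "json"}' "$BASE64_IMAGE" |
curl -X POST https://api.bharatocr.com/api/v1/ocr \
  -H "Authorization: Bearer $BHARATOCR_API_KEY" \
  -H "Content-Type: application/json" \
  -d @-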

That is it. The API accepts the base64-encoded image, runs OCR, and returns the extracted text.

Understanding the Hindi OCR API Response

Here is what a typical JSON response looks like:

{
  "status": "success",
  "processing_time_ms": 1340,
  "pages": 1,
  "results": [
    {
      "page": 1,
      "blocks": [
        {
          "text": "भारत सरकार",
          "confidence": 0.97,
          "bbox": [120, 45, 380, 90],
          "language": "hi"
        },
        {
          "text": "Ministry of Home Affairs",
          "confidence": 0.99,
          "bbox": [100, 95, 420, 130],
          "language": "en"
        },
        {
          "text": "प्रमाण पत्र संख्या: 2024/HIN/00451",
          "confidence": 0.94,
          "bbox": [80, 150, 500, 185],
          "language": "mixed"
        }
      ]
    }
  ]
}

A few things to notice:

Confidence scores are returned per text block. Anything above 0.90 is highly reliable. Scores between 0.70 and 0.90 may need human review.
Bounding boxes (bbox) give you the pixel coordinates of each text block — useful if you need to highlight or annotate the original image.
Language detection happens automatically. The API identifies whether each block is Hindi, English, or mixed — no configuration needed.
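Processing time is typically under 2 seconds per page. The response reports it in milliseconds in the processing_time_ms field (1340 ms in the sample above).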
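As a minimal sketch of how these confidence bands might be applied, the Python snippet below sorts blocks into three buckets. The function name triage_blocks and the "low" bucket for scores under 0.70 are assumptions made for illustration; the sample blocks are copied from the response above.

```python
# Sample blocks copied from the response shown above.
sample_blocks = [
    {"text": "भारत सरकार", "confidence": 0.97, "language": "hi"},
    {"text": "Ministry of Home Affairs", "confidence": 0.99, "language": "en"},
    {"text": "प्रमाण पत्र संख्या: 2024/HIN/00451", "confidence": 0.94, "language": "mixed"},
]


def triage_blocks(blocks, reliable_above=0.90, review_from=0.70):
    """Group blocks into 'reliable', 'review', and 'low' by confidence.

    The 'low' bucket (below review_from) is an assumption of this sketch.
    """
    buckets = {"reliable": [], "review": [], "low": []}
    for block in blocks:
        score = block["confidence"]
        if score > reliable_above:
            buckets["reliable"].append(block)
        elif score >= review_from:
            buckets["review"].append(block)
        else:
            buckets["low"].append(block)
    return buckets


buckets = triage_blocks(sample_blocks)
print(len(buckets["reliable"]), len(buckets["review"]), len(buckets["low"]))
```

Keeping the cutoffs as parameters makes it easy to tighten or relax them for a particular document type.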

Extracting Hindi Text with Python

For production use, here is a Python example using the requests library:

import base64
import os
import requests

API_KEY = os.environ["BHARATOCR_API_KEY"]
API_URL = "https://api.bharatocr.com/api/v1/ocr"

def extract_hindi_text(image_path: str) -> dict:
    """Extract text from a Hindi document image."""
    with open(image_path, "rb") as f:
        image_base64 = base64.b64encode(f.read()).decode("utf-8")

    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "image": image_base64,
            "language": "hi",
            "output_format": "json",
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


# Usage
result = extract_hindi_text("hindi_document.jpg")

for block in result["results"][0]["blocks"]:
    print(f"[{block['confidence']:.2f}] {block['text']}")

This gives you clean, structured output with confidence scores you can use to flag low-quality extractions for manual review.

Handling PDFs and Multi-Page Documents

The Hindi OCR API supports multi-page PDFs natively. You can send PDFs of up to 50 pages in a single batch request. The results array in the response will contain one entry per page, each with its own page number and blocks.

def extract_from_pdf(pdf_path: str) -> dict:
    """Extract text from a multi-page Hindi PDF."""
    with open(pdf_path, "rb") as f:
        pdf_base64 = base64.b64encode(f.read()).decode("utf-8")

    response = requests.post(
        API_URL,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        json={
            "file": pdf_base64,
            "file_type": "pdf",
            "language": "hi",
            "output_format": "json",
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()

For large PDFs, set a longer timeout. Processing time scales roughly linearly at about 1.5 to 2 seconds per page, so a 50-page PDF can take 75 to 100 seconds. That is why the example above uses a 120-second timeout.
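Once a multi-page response comes back, you will usually want the text grouped per page. The sketch below assumes the response structure shown earlier in this guide; the helper names text_by_page and estimate_timeout, the 20-second safety margin, and the hand-built two-page sample are illustrative assumptions.

```python
def text_by_page(result):
    """Map each page number to its block texts joined by newlines."""
    return {
        page["page"]: "\n".join(block["text"] for block in page["blocks"])
        for page in result["results"]
    }


def estimate_timeout(page_count, seconds_per_page=2.0, margin=20.0):
    """Rough request timeout: worst-case per-page time plus a safety margin."""
    return page_count * seconds_per_page + margin


# Hand-built two-page sample that follows the response structure shown earlier.
sample_result = {
    "status": "success",
    "pages": 2,
    "results": [
        {"page": 1, "blocks": [{"text": "भारत सरकार"}, {"text": "Ministry of Home Affairs"}]},
        {"page": 2, "blocks": [{"text": "प्रमाण पत्र संख्या: 2024/HIN/00451"}]},
    ],
}

for page_number, text in text_by_page(sample_result).items():
    print(f"--- page {page_number} ---")
    print(text)

print(estimate_timeout(50))
```

With the default values, the estimate for a 50-page PDF works out to the same 120 seconds used in extract_from_pdf.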

Working with Mixed Hindi-English Documents

Indian documents almost always contain both Hindi and English text. Property papers have Hindi descriptions with English case numbers. Bank statements mix Hindi headers with English transaction data. Government IDs have bilingual fields.

The BharatOCR API handles this automatically. You do not need to specify regions or zones. The engine detects language switches within the same line and returns each block with its detected language.

This is where most generic OCR tools break down. They try to read the entire document in one language and produce mangled output whenever the script changes. A purpose-built Hindi OCR API treats mixed-language text as a first-class scenario.
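Because each block carries its detected language, you can route Hindi and English text to different downstream steps. The following is a minimal sketch that relies on the language field shown in the sample response; the function name group_by_language is an illustrative assumption.

```python
from collections import defaultdict


def group_by_language(result):
    """Collect block texts under their detected language code."""
    grouped = defaultdict(list)
    for page in result["results"]:
        for block in page["blocks"]:
            grouped[block["language"]].append(block["text"])
    return dict(grouped)


# Trimmed copy of the sample response from earlier in this guide.
sample_result = {
    "results": [
        {
            "page": 1,
            "blocks": [
                {"text": "भारत सरकार", "language": "hi"},
                {"text": "Ministry of Home Affairs", "language": "en"},
                {"text": "प्रमाण पत्र संख्या: 2024/HIN/00451", "language": "mixed"},
            ],
        }
    ]
}

for language, texts in group_by_language(sample_result).items():
    print(language, texts)
```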

Extracting Tables from Hindi Documents

Many Hindi documents contain tabular data — land records with column headers in Hindi, government rate charts, bank passbook entries. BharatOCR uses PP-StructureV3 for table detection and extraction.

When the API detects a table in your document, it returns structured row-column data alongside the regular text blocks. This saves you from writing complex post-processing logic to reconstruct table structure from raw text.

Error Handling and Best Practices

A few tips for production integrations:

Check confidence scores. Set a threshold (we recommend 0.85) and flag blocks below it for human review.
Retry on timeouts. Network issues happen. Implement exponential backoff with 2-3 retries.
Validate input images. Blurry, rotated, or very low-resolution images will produce poor results regardless of the OCR engine. Minimum 150 DPI is recommended for scanned documents.
Store the raw response. Keep the full JSON response including bounding boxes and confidence scores. You may need them later for auditing or reprocessing.
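The retry tip can be sketched as a small wrapper with exponential backoff. The helper name call_with_backoff and its default values are assumptions chosen for illustration, and the flaky function only simulates a timeout so the example runs without network access.

```python
import time


def call_with_backoff(func, retries=3, base_delay=1.0,
                      retry_on=(TimeoutError,), sleep=time.sleep):
    """Call func(); on a retryable error wait base_delay * 2**attempt and retry.

    Makes one initial attempt plus up to `retries` retries, then re-raises.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))


# Simulated call that times out twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


delays = []  # record the waits instead of actually sleeping
outcome = call_with_backoff(flaky, sleep=delays.append)
print(outcome, delays)
```

To use it with the Python example from this guide, you could pass func=lambda: extract_hindi_text("hindi_document.jpg") and retry_on=(requests.exceptions.Timeout, requests.exceptions.ConnectionError), leaving sleep at its default.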

How BharatOCR Helps

BharatOCR gives you a single API endpoint that handles Hindi, English, and mixed-language documents with 95%+ accuracy on printed text. Built on PaddleOCR PP-OCRv5, it processes pages in under 2 seconds and supports JPEG, PNG, PDF, TIFF, and BMP.

Start free with 3 pages, then pay Rs 5 per page on the pay-as-you-go tier. For higher volumes, monthly plans start at Rs 999/month. There is no infrastructure to manage, no models to train, and no GPU costs to worry about.

Send a POST to /api/v1/ocr and start extracting Hindi text in minutes, not months. Sign up at bharatocr.com and get your API key today.
