Building a Hindi Document Scanner with Python and an OCR API
A Hindi document scanner written in Python does not need to be complicated. If you can send an HTTP request, you can extract Devanagari text from a document. No ML libraries to install, no model weights to download, no GPU required.
We will build a complete working script that reads a Hindi document image, sends it to BharatOCR's OCR API, and prints the extracted text. The whole thing fits in under 20 lines of code.
What You Will Need
Before we start, make sure you have:
- Python 3.8+ installed
- The requests library (pip install requests)
- A BharatOCR API key (sign up at bharatocr.com; you get 3 free pages)
- A Hindi document image (JPEG, PNG, PDF, TIFF, or BMP)
That is it. No TensorFlow, no PyTorch, and no OpenCV for the scanner itself (OpenCV appears later only in the optional preprocessing tips). The OCR processing happens on BharatOCR's servers, so your local machine just needs to send the image and receive the results.
The Complete Hindi Document Scanner Script
Here is the full working script:
import base64
import requests

API_URL = "https://api.bharatocr.com/api/v1/ocr"
API_KEY = "boc_your_api_key_here"

# Read and encode the document image
with open("hindi_document.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

# Send to BharatOCR API
response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"image": image_data, "language": "hi"},
    timeout=30,
)
response.raise_for_status()

# Extract and print the text
result = response.json()
for block in result["data"]["text_blocks"]:
    print(f"[{block['confidence']:.2f}] {block['text']}")
That is 19 lines of code, not counting blank lines. Let us break down what each part does.
Step-by-Step Breakdown
Reading the Image File
with open("hindi_document.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")
We read the image file in binary mode and encode it as a base64 string. Base64 encoding converts binary data into ASCII text, which is safe to include in a JSON request body. The .decode("utf-8") converts the bytes object to a regular string.
This works with any file format BharatOCR supports: JPEG, PNG, TIFF, BMP, or PDF.
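To make the encoding step concrete, here is a minimal sketch using only the standard library. The helper name encode_image_bytes and the stand-in bytes are illustrative assumptions, not part of BharatOCR's API.

```python
import base64


def encode_image_bytes(raw: bytes) -> str:
    """Return the base64 text form of raw file bytes."""
    return base64.b64encode(raw).decode("utf-8")


# Stand-in data (hypothetical): four JPEG-style header bytes plus padding.
sample = b"\xff\xd8\xff\xe0" + b"\x00" * 8
encoded = encode_image_bytes(sample)

print(encoded.isascii())                    # the encoded form is plain ASCII
print(len(sample), "->", len(encoded))      # every 3 input bytes become 4 characters
print(base64.b64decode(encoded) == sample)  # decoding restores the original bytes
```

Because base64 turns every 3 bytes into 4 characters, the request body is roughly a third larger than the file on disk, which is worth keeping in mind for large scans.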
Sending the API Request
response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"image": image_data, "language": "hi"},
    timeout=30,
)
response.raise_for_status()
We send a POST request to /api/v1/ocr with the API key in the Authorization header. The request body contains the base64-encoded image and the language code ("hi" for Hindi).
The timeout=30 prevents the script from hanging indefinitely if there is a network issue. raise_for_status() raises requests.exceptions.HTTPError if the API returns an error (4xx or 5xx status code), so a failed call stops the script instead of passing unnoticed.
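If you want to inspect the request before sending it, the headers and body can be built separately. This is a small sketch, not an official client: build_ocr_request is a hypothetical helper, and the only field names it uses ("image" and "language") are the ones from the script above.

```python
import base64
import json
from typing import Dict, Tuple


def build_ocr_request(
    image_bytes: bytes, api_key: str, language: str = "hi"
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (headers, payload) for the OCR call without sending anything."""
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "image": base64.b64encode(image_bytes).decode("utf-8"),
        "language": language,
    }
    return headers, payload


headers, payload = build_ocr_request(b"fake image bytes", "boc_your_api_key_here")
body = json.dumps(payload)  # roughly what requests sends when given json=payload
print(headers["Authorization"])
print(sorted(json.loads(body)))
```

The returned pair can be passed straight to requests.post(API_URL, headers=headers, json=payload, timeout=30).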
Parsing the Response
result = response.json()
for block in result["data"]["text_blocks"]:
    print(f"[{block['confidence']:.2f}] {block['text']}")
The API returns JSON with the extracted text organized into blocks. Each block has the recognized text, a confidence score (0.0 to 1.0), and the bounding box coordinates on the image. We print each block with its confidence score.
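One practical use of the confidence score is to separate blocks you can trust from blocks that need a manual check. The sketch below runs on a made-up response that contains only the "text" and "confidence" keys used in the script; the sample values, the 0.8 cut-off, and the helper name split_by_confidence are assumptions for illustration.

```python
from typing import Any, Dict, List, Tuple

Block = Dict[str, Any]


def split_by_confidence(
    blocks: List[Block], threshold: float = 0.8
) -> Tuple[List[Block], List[Block]]:
    """Split blocks into (accepted, flagged) around a confidence threshold."""
    accepted = [b for b in blocks if b["confidence"] >= threshold]
    flagged = [b for b in blocks if b["confidence"] < threshold]
    return accepted, flagged


# Made-up sample shaped like the response the script reads (hypothetical values).
sample_result = {
    "data": {
        "text_blocks": [
            {"text": "भारत सरकार", "confidence": 0.97},
            {"text": "जन्म प्रमाण पत्र", "confidence": 0.91},
            {"text": "क्रमांक ४५", "confidence": 0.62},
        ]
    }
}

accepted, flagged = split_by_confidence(sample_result["data"]["text_blocks"])
print("Accepted:", [b["text"] for b in accepted])
print("Needs review:", [b["text"] for b in flagged])
```

The right threshold depends on your documents, so it is worth checking a few flagged blocks by hand before settling on a value.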
Adding Error Handling
The basic script works, but a scanner you intend to run in production needs proper error handling:
# Lets the list[dict] annotation below work on Python 3.8 as well
from __future__ import annotations

import base64
import sys

import requests

API_URL = "https://api.bharatocr.com/api/v1/ocr"
API_KEY = "boc_your_api_key_here"


def extract_text(image_path: str) -> list[dict]:
    """Extract Hindi text from a document image."""
    # Read and encode the file, exiting cleanly if it does not exist
    try:
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode("utf-8")
    except FileNotFoundError:
        print(f"Error: File not found: {image_path}")
        sys.exit(1)

    # Send the request and translate common failures into readable messages
    try:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={"image": image_data, "language": "hi"},
            timeout=30,
        )
        response.raise_for_status()
    except requests.exceptions.Timeout:
        print("Error: API request timed out. Try again.")
        sys.exit(1)
    except requests.exceptions.HTTPError as e:
        print(f"Error: API returned {e.response.status_code}")
        if e.response.status_code == 401:
            print("Check your API key.")
        elif e.response.status_code == 429:
            print("Rate limit exceeded. Wait and retry.")
        sys.exit(1)

    result = response.json()
    return result["data"]["text_blocks"]


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python scanner.py <image_path>")
        sys.exit(1)
    blocks = extract_text(sys.argv[1])
    full_text = "\n".join(b["text"] for b in blocks)
    print(full_text)
Now you can run it from the command line:
python scanner.py hindi_document.jpg
This version handles missing files, timeouts, authentication errors, and rate limits gracefully.
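The "wait and retry" advice for timeouts and rate limits can be automated with a small retry helper. The following is a generic sketch using exponential backoff, not something BharatOCR prescribes: retry_with_backoff and its parameters are hypothetical, and it assumes the wrapped call raises an exception on failure instead of calling sys.exit.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on the given exceptions with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a simulated flaky call; a no-op sleep keeps the demo instant.
state = {"calls": 0}


def flaky_call() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


print(retry_with_backoff(flaky_call, (TimeoutError,), sleep=lambda seconds: None))
```

To use it with the scanner, you would wrap a variant of extract_text that lets requests.exceptions.Timeout (or an HTTPError with status 429) propagate, and pass those exception types as retry_on.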
Tips for Better OCR Accuracy
The quality of your input image directly affects OCR accuracy. Here are practical tips for getting cleaner input into your Python scanner.
Resolution Matters
Aim for 300 DPI when scanning documents. Phone cameras usually produce sufficient resolution, but if you are processing old or faded documents, higher resolution helps. Anything below 150 DPI will noticeably hurt accuracy.
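For a phone photo, which carries no reliable DPI setting, you can estimate the effective resolution from the pixel dimensions and the physical page size. The sketch below assumes the image is cropped tightly to a full A4 page (8.27 by 11.69 inches); estimate_dpi is a hypothetical helper, and the 300 and 150 DPI marks are the ones mentioned above.

```python
def estimate_dpi(
    width_px: int,
    height_px: int,
    page_width_in: float = 8.27,    # A4 width in inches (assumption)
    page_height_in: float = 11.69,  # A4 height in inches (assumption)
) -> float:
    """Effective DPI, assuming the image covers exactly one full page."""
    return min(width_px / page_width_in, height_px / page_height_in)


for width, height in [(2480, 3508), (3000, 4000), (1000, 1414)]:
    dpi = estimate_dpi(width, height)
    if dpi < 150:
        verdict = "too low, expect accuracy loss"
    elif dpi < 300:
        verdict = "usable, but below the 300 DPI target"
    else:
        verdict = "meets the 300 DPI target"
    print(f"{width}x{height}: about {dpi:.0f} DPI, {verdict}")
```

If the page fills only part of the frame, the effective resolution of the text is lower than this estimate, so crop first or photograph from closer.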
Straighten the Image
Skewed documents reduce accuracy. If you are capturing documents with a phone camera, try to keep the camera perpendicular to the page. For programmatic deskewing, OpenCV's minAreaRect can estimate the skew angle, and getRotationMatrix2D together with warpAffine rotates the page back:
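import cv2
import numpy as np


def deskew(image_path: str) -> np.ndarray:
    """Estimate the skew angle of a scanned page and rotate it upright."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    # Binarize with Otsu and invert so that text pixels are non-zero;
    # on a raw scan, img > 0 would also select the white background.
    _, binary = cv2.threshold(img, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
    # minAreaRect reports the angle in [-90, 0) or (0, 90] depending on the
    # OpenCV version, so fold it into (-45, 45] before rotating.
    angle = cv2.minAreaRect(coords)[-1] % 90
    if angle > 45:
        angle -= 90
    h, w = img.shape
    center = (w // 2, h // 2)
    # Sign follows the (row, col) point order from np.where; verify on a sample
    matrix = cv2.getRotationMatrix2D(center, -angle, 1.0)
    return cv2.warpAffine(img, matrix, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)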
Improve Contrast
Faded or low-contrast documents benefit from simple contrast enhancement before sending to OCR. A basic threshold or adaptive threshold in OpenCV can make a significant difference:
import cv2

img = cv2.imread("faded_document.jpg", cv2.IMREAD_GRAYSCALE)
# Threshold each pixel against a Gaussian-weighted 11x11 neighbourhood mean minus 2
enhanced = cv2.adaptiveThreshold(
    img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
    cv2.THRESH_BINARY, 11, 2
)
cv2.imwrite("enhanced.jpg", enhanced)
Remove Noise
Stamps, stains, and background patterns confuse OCR engines. A light Gaussian blur followed by thresholding can clean up noisy documents.
Processing Multiple Documents
For batch processing, loop through a directory of images:
from pathlib import Path

# Reuses extract_text() from the error-handling script above
input_dir = Path("documents")
for image_file in input_dir.glob("*.jpg"):
    blocks = extract_text(str(image_file))
    text = "\n".join(b["text"] for b in blocks)
    output_file = image_file.with_suffix(".txt")
    output_file.write_text(text, encoding="utf-8")
    print(f"Processed: {image_file.name}")
This creates a .txt file alongside each image with the extracted Hindi text.
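The loop above only picks up files ending in .jpg. If your folder mixes formats, a small helper can collect every supported file. This is a sketch: find_documents is a hypothetical name, and the suffix list is an assumption derived from the formats listed earlier (JPEG, PNG, PDF, TIFF, BMP). The demo builds a temporary folder so it runs without any real scans.

```python
import tempfile
from pathlib import Path
from typing import Iterator

# Assumed file suffixes for the formats the article lists.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".pdf", ".tif", ".tiff", ".bmp"}


def find_documents(input_dir: Path) -> Iterator[Path]:
    """Yield supported files in input_dir, sorted, ignoring suffix case."""
    for path in sorted(input_dir.iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            yield path


# Demo on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["a.JPG", "b.png", "c.pdf", "notes.txt"]:
        (folder / name).touch()
    found = [p.name for p in find_documents(folder)]

print(found)
```

One thing to watch: with_suffix(".txt") maps scan.jpg and scan.png to the same scan.txt, so in a mixed folder you may prefer an output name that keeps the original suffix, for example image_file.name + ".txt".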
Checking Your API Usage
Keep track of how many pages you have processed with the usage endpoint:
usage = requests.get(
    "https://api.bharatocr.com/api/v1/usage",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,  # same safeguard as the OCR call
).json()
print(f"Pages used: {usage['data']['pages_used']}")
How BharatOCR Helps
Building a Hindi document scanner in Python is straightforward with BharatOCR because the hard part — accurate Devanagari text recognition — is handled by our API.
BharatOCR uses PaddleOCR PP-OCRv5, optimized for Hindi and Devanagari script. You get 95%+ accuracy on printed Hindi text with sub-2-second processing per page. The API accepts JPEG, PNG, PDF, TIFF, and BMP, so you do not need to worry about format conversion.
For developers, the integration is minimal: one HTTP POST, one JSON response. No SDKs to install, no dependencies to manage, no model updates to track. Your code stays simple while the OCR engine handles the complexity.
Start with 3 free pages to test your integration. Pay-as-you-go is Rs 5 per page, or choose a monthly plan from Rs 999 to Rs 9,999 for production workloads. BharatOCR is built by Meridian Intelligence Pvt. Ltd.