How to Add Table Extraction to Your Indian Document Workflow

Table extraction from Indian documents is one of those problems that seems simple until you try it. A human glances at a government rate list and instantly understands the rows and columns. A machine sees pixels — and turning those pixels into structured data requires more than standard OCR.

Regular OCR gives you text. Table extraction gives you text with structure — rows, columns, headers, and cell values preserved in a format you can actually work with. If you deal with Indian government forms, financial statements, or official records, this distinction matters enormously.

Types of Tables in Indian Documents

Indian documents are full of tabular data. Here are the most common types you will encounter.

Government Rate Lists and Schedule of Rates

Public Works Departments (PWD), irrigation departments, and municipal corporations publish Schedule of Rates (SOR) documents listing approved rates for construction materials and labor. These are massive tables — sometimes hundreds of rows — with columns for item code, description (often in Hindi), unit, and rate.

These tables drive infrastructure budgets across the country. Extracting them manually means days of data entry for a single document.

Bank Statements and Financial Records

Indian bank statements, whether from SBI, PNB, or any nationalized bank, present transactions in tabular format: date, description, debit, credit, and balance. Hindi-medium bank branches often issue statements with Hindi headers and transaction descriptions.

Land Revenue Records

Khatauni, khasra, and jamabandi records contain tabular data about land ownership, area, crop details, and revenue amounts. These are among the most requested documents in Indian government digitization projects. The tables follow state-specific formats and are almost always in Hindi or the regional language.

Tax Documents and Returns

GST invoices, TDS certificates (Form 16/16A), and property tax assessment orders all contain structured tables. The challenge is that each type has a different layout, and many mix Hindi and English within the same table.

Official Government Forms

Application forms, inspection reports, and assessment orders from state and central government bodies frequently present data in table format. Think of a food safety inspection report with columns for parameters tested, acceptable limits, and observed values.

Why Regular OCR Is Not Enough for Tables

Standard OCR processes a document page and returns all the text it finds, typically in reading order (left to right, top to bottom). This works fine for paragraphs and letters. It fails for tables.

Consider a simple 3-column table with headers "Item", "Quantity", and "Price". Regular OCR might return:

Item Quantity Price
Cement 50 bags 350
Steel 20 kg 55

That looks okay as raw text, but there is no structure. Your code cannot reliably determine that "350" belongs to "Cement" and "55" belongs to "Steel" — especially when the table has 200 rows and some cells span multiple lines.

Table extraction solves this by first detecting table boundaries, then identifying rows and columns, and finally extracting text from each cell with its position in the grid.
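To make the ambiguity concrete, here is a minimal sketch using the three-line example above. The helper name naive_split and the hand-written grid are illustrative assumptions, not API output; the point is only that whitespace is an unreliable column separator.

```python
# Flat OCR text for the three-column example (Item, Quantity, Price).
ocr_lines = [
    "Item Quantity Price",
    "Cement 50 bags 350",
    "Steel 20 kg 55",
]


def naive_split(line):
    """Split a line on whitespace, the only cue that flat OCR text offers."""
    return line.split()


header = naive_split(ocr_lines[0])
for line in ocr_lines[1:]:
    tokens = naive_split(line)
    # The header has 3 columns, but "50 bags" and "20 kg" each split into
    # two tokens, so the data rows no longer line up with the header.
    print(len(header), len(tokens), tokens)

# What table extraction aims to return instead: one value per cell.
# This grid is written by hand here purely for illustration.
grid = [
    ["Item", "Quantity", "Price"],
    ["Cement", "50 bags", "350"],
    ["Steel", "20 kg", "55"],
]
price_by_item = {row[0]: row[2] for row in grid[1:]}
print(price_by_item)
```

With a grid, looking up the price for an item is a plain index operation. With flat text, every multi-word cell shifts the columns that follow it.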

How PP-StructureV3 Handles Table Extraction

BharatOCR uses PP-StructureV3 from PaddlePaddle for table detection and extraction. The process works in stages.

Table detection identifies where tables exist on the page. A document might have text paragraphs above and below a table — the detector finds the table region and isolates it.

Cell detection maps out the grid structure within the table. It identifies row boundaries, column boundaries, and merged cells. This is the hard part — Indian documents often have inconsistent line styles, partial borders, or borderless tables where alignment is the only structural cue.

Text recognition runs OCR on each detected cell. Because the cell boundaries are already known, the text within each cell is attributed to the correct row and column.

The output is a structured representation of the table that preserves the grid layout.
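The sketch below illustrates why known cell boundaries make attribution straightforward. It is a toy stand-in, not PP-StructureV3's implementation: instead of running OCR on each cell crop as described above, it takes words that are already recognized and assigns each one to the cell whose box contains the word's center. The Box type, the build_grid function, and all coordinates are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def center(box):
    """Return the (x, y) midpoint of a box."""
    return ((box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2)


def contains(box, point):
    """True if the point lies inside the box (edges included)."""
    x, y = point
    return box.x0 <= x <= box.x1 and box.y0 <= y <= box.y1


def build_grid(cells, words, n_rows, n_cols):
    """Place each word in the first cell whose box contains its center.

    cells: list of (row, col, Box); words: list of (text, Box) in reading order.
    """
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for text, word_box in words:
        point = center(word_box)
        for row, col, cell_box in cells:
            if contains(cell_box, point):
                grid[row][col] = (grid[row][col] + " " + text).strip()
                break
    return grid


# One row with three cells, and four words; "50" and "bags" share a cell.
cells = [
    (0, 0, Box(0, 0, 100, 20)),
    (0, 1, Box(100, 0, 200, 20)),
    (0, 2, Box(200, 0, 300, 20)),
]
words = [
    ("Cement", Box(5, 5, 60, 15)),
    ("50", Box(105, 5, 120, 15)),
    ("bags", Box(125, 5, 160, 15)),
    ("350", Box(205, 5, 230, 15)),
]
grid = build_grid(cells, words, n_rows=1, n_cols=3)
print(grid)
```

Once the grid geometry is known, "50 bags" stays in a single cell regardless of how many words it contains.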

Using the Table Extraction API

The API endpoint for table extraction is:

POST /api/v1/ocr/table

Here is a Python example:

import base64
import requests

API_URL = "https://api.bharatocr.com/api/v1/ocr/table"
API_KEY = "boc_your_api_key_here"

# Read the document and base64-encode it for the JSON request body.
with open("government_rate_list.pdf", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"image": image_data, "language": "hi"},  # "hi" = Hindi
    timeout=60,
)
response.raise_for_status()  # stop here on any HTTP error

result = response.json()
for table in result["data"]["tables"]:
    # Derive the column count from the widest row of cells.
    n_cols = max((len(row["cells"]) for row in table["rows"]), default=0)
    print(f"Table with {len(table['rows'])} rows, {n_cols} columns")
    for row in table["rows"]:
        cells = [cell["text"] for cell in row["cells"]]
        print(" | ".join(cells))

The response JSON contains an array of detected tables, each with its rows and cells. Every cell includes the extracted text, a confidence score, and bounding box coordinates.
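Since every cell carries a confidence score, one natural use is to flag uncertain cells for manual review. The sketch below is a hedged example: the key name "confidence", the 0 to 1 scale, and the 0.8 threshold are all assumptions, so check them against an actual response before relying on this.

```python
def low_confidence_cells(table, threshold=0.8):
    """Return (row_index, col_index, text, score) for cells below the threshold."""
    flagged = []
    for r, row in enumerate(table["rows"]):
        for c, cell in enumerate(row["cells"]):
            # "confidence" is an assumed key name; a missing score counts as 0.0
            # so that the cell is flagged rather than silently trusted.
            score = cell.get("confidence", 0.0)
            if score < threshold:
                flagged.append((r, c, cell["text"], score))
    return flagged


# Hand-written sample in the assumed response shape, not real API output.
sample_table = {
    "rows": [
        {
            "cells": [
                {"text": "सीमेंट", "confidence": 0.97},
                {"text": "350", "confidence": 0.62},
            ]
        },
    ]
}
for r, c, text, score in low_confidence_cells(sample_table):
    print(f"Review row {r}, column {c}: {text!r} (score {score})")
```

Flagging numeric cells is especially worthwhile for rate lists and bank statements, where a single misread digit changes the value.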

Integration Patterns

Once you have structured table data from the API, here are common ways to use it.

Export to CSV

The most straightforward integration — convert the table JSON to a CSV file:

import csv

for i, table in enumerate(result["data"]["tables"]):
    with open(f"table_{i}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in table["rows"]:
            writer.writerow([cell["text"] for cell in row["cells"]])

This is useful for one-off conversions where someone needs the data in a spreadsheet.
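Two practical details often come up with spreadsheet exports, sketched below under stated assumptions. First, rows can come back with different cell counts, so this hypothetical table_to_rows helper pads short rows to the widest one. Second, Excel often misreads Devanagari in a UTF-8 CSV that lacks a byte order mark, so the hypothetical write_csv_for_excel helper uses Python's "utf-8-sig" encoding.

```python
import csv


def table_to_rows(table):
    """Flatten a table into lists of cell text, padded to the widest row."""
    rows = [[cell["text"] for cell in row["cells"]] for row in table["rows"]]
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]


def write_csv_for_excel(rows, path):
    """Write rows as CSV with a UTF-8 byte order mark so Excel detects UTF-8."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)


# Hand-written sample in the same shape as the API examples above.
sample_table = {
    "rows": [
        {"cells": [{"text": "Item"}, {"text": "Unit"}, {"text": "Rate"}]},
        {"cells": [{"text": "Cement"}, {"text": "350"}]},
    ]
}
padded = table_to_rows(sample_table)
print(padded)
```

Padding appends empty cells at the end of a row. If a cell is missing from the middle of a row, the values after it will still be shifted, so ragged rows are worth a manual check.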

Push to a Database

For automated pipelines, insert extracted table data directly into PostgreSQL or any database:

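import psycopg2

# Load the first detected table; loop over result["data"]["tables"] for all of them.
table = result["data"]["tables"][0]

conn = psycopg2.connect("dbname=documents user=app")
cur = conn.cursor()

# Row 0 is treated as the header row, as in the DataFrame and JSON examples.
for row in table["rows"][1:]:
    cells = [cell["text"] for cell in row["cells"]]
    if len(cells) < 4:
        continue  # skip rows that do not fit the four-column schema
    cur.execute(
        "INSERT INTO rate_list (item_code, description, unit, rate) "
        "VALUES (%s, %s, %s, %s)",
        tuple(cells[:4]),
    )
conn.commit()
cur.close()
conn.close()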

Convert to Pandas DataFrame

For data analysis workflows:

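import pandas as pd

table = result["data"]["tables"][0]  # first detected table

rows_data = [[cell["text"] for cell in row["cells"]] for row in table["rows"]]

# Treat the first row as the header. Cell values arrive as strings, so
# describe() reports count/unique/top/freq until numeric columns are
# converted, for example with pd.to_numeric.
df = pd.DataFrame(rows_data[1:], columns=rows_data[0])
print(df.describe())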

Generate JSON for APIs

If the extracted table feeds into another API or microservice:

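import json

table = result["data"]["tables"][0]  # first detected table

# Treat the first row as headers and map each later row to a dict.
headers = [cell["text"] for cell in table["rows"][0]["cells"]]
structured = []
for row in table["rows"][1:]:
    values = [cell["text"] for cell in row["cells"]]
    # zip stops at the shorter sequence, so an extra cell cannot raise IndexError
    structured.append(dict(zip(headers, values)))

# ensure_ascii=False keeps Hindi text readable instead of \u escapes
print(json.dumps(structured, ensure_ascii=False, indent=2))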

Tips for Better Table Extraction Results

Scan quality matters. Tables with thin or faint grid lines need higher resolution scans (300 DPI minimum). If the lines are barely visible to the human eye, the detector will struggle too.

Borderless tables are harder. If your documents use whitespace instead of lines to separate columns, ensure the spacing is consistent. Irregular spacing confuses column detection.

Merged cells need attention. Header rows that span multiple columns are common in Indian documents. The extraction handles standard merges, but unusual layouts may need post-processing.

Multi-page tables. If a table spans multiple pages, process each page separately and concatenate the results. Check for repeated headers on continuation pages and remove duplicates.
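The multi-page tip can be sketched as follows. This is a minimal example that assumes each page's table has already been flattened to a list of rows of cell text, and that the first row of the first page is the header. The function name merge_page_tables is hypothetical.

```python
def merge_page_tables(page_tables):
    """Concatenate per-page row lists, dropping repeated header rows.

    page_tables: one list of rows per page, each row a list of cell strings.
    """
    merged = []
    header = None
    for rows in page_tables:
        for i, row in enumerate(rows):
            normalized = [cell.strip() for cell in row]
            if header is None:
                # The first row seen is taken as the header and kept once.
                header = normalized
                merged.append(row)
                continue
            if i == 0 and normalized == header:
                continue  # repeated header at the top of a continuation page
            merged.append(row)
    return merged


pages = [
    [["Item", "Rate"], ["Cement", "350"]],
    [["Item", "Rate"], ["Steel", "55"]],  # header repeated on page 2
    [["Sand", "40"]],  # page 3 continues without a header
]
merged = merge_page_tables(pages)
print(merged)
```

The comparison is an exact match after trimming whitespace. If OCR reads a repeated header slightly differently on a later page, it will not be detected, so a fuzzier comparison may be needed for noisy scans.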

How BharatOCR Helps

BharatOCR's table extraction handles the specific challenges of Indian documents — Hindi text in cells, inconsistent formatting across government departments, and mixed-language tables.

PP-StructureV3 powers the table detection and cell extraction. Combined with PaddleOCR PP-OCRv5 for the text recognition within each cell, you get structured data with 95%+ text accuracy and proper row-column attribution.

Processing supports up to 50 pages per batch request across JPEG, PNG, PDF, TIFF, and BMP formats. The API endpoints are:

POST /api/v1/ocr/table for table extraction
POST /api/v1/ocr for regular text extraction
GET /api/v1/usage for monitoring your consumption
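For documents longer than the 50-page batch limit mentioned above, the pages need to be split before submission. The sketch below only covers the splitting step; how a batch is actually submitted is not described here, so that part is left out. The names MAX_PAGES_PER_BATCH and chunk_pages are illustrative.

```python
MAX_PAGES_PER_BATCH = 50  # batch limit stated above


def chunk_pages(pages, size=MAX_PAGES_PER_BATCH):
    """Split a list of pages into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [pages[i:i + size] for i in range(0, len(pages), size)]


# Example: a 120-page document becomes batches of 50, 50, and 20 pages.
page_numbers = list(range(1, 121))
batches = chunk_pages(page_numbers)
print([len(batch) for batch in batches])
```

Keeping page order within and across batches makes it easier to stitch multi-page tables back together afterwards.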

Start free with 3 pages, then Rs 5 per page or monthly plans from Rs 999 to Rs 9,999. BharatOCR is built by Meridian Intelligence Pvt. Ltd.

Structured data from Indian documents should not require manual data entry. Send us the document, and we will send you the table.
