How to Add Table Extraction to Your Indian Document Workflow
Table extraction from Indian documents is one of those problems that seem simple until you try them. A human glances at a government rate list and instantly understands the rows and columns. A machine sees only pixels, and turning those pixels into structured data requires more than standard OCR.
Regular OCR gives you text. Table extraction gives you text with structure — rows, columns, headers, and cell values preserved in a format you can actually work with. If you deal with Indian government forms, financial statements, or official records, this distinction matters enormously.
Types of Tables in Indian Documents
Indian documents are full of tabular data. Here are the most common types you will encounter.
Government Rate Lists and Schedule of Rates
Public Works Departments (PWD), irrigation departments, and municipal corporations publish Schedule of Rates (SOR) documents listing approved rates for construction materials and labor. These are massive tables — sometimes hundreds of rows — with columns for item code, description (often in Hindi), unit, and rate.
These tables drive infrastructure budgets across the country. Extracting them manually means days of data entry for a single document.
Bank Statements and Financial Records
Indian bank statements, whether from SBI, PNB, or any other nationalized bank, present transactions in tabular format: date, description, debit, credit, and balance. Hindi-medium bank branches often issue statements with Hindi headers and transaction descriptions.
Land Revenue Records
Khatauni, khasra, and jamabandi records contain tabular data about land ownership, area, crop details, and revenue amounts. These are among the most requested documents in Indian government digitization projects. The tables follow state-specific formats and are almost always in Hindi or the regional language.
Tax Documents and Returns
GST invoices, TDS certificates (Form 16/16A), and property tax assessment orders all contain structured tables. The challenge is that each type has a different layout, and many mix Hindi and English within the same table.
Official Government Forms
Application forms, inspection reports, and assessment orders from state and central government bodies frequently present data in table format. Think of a food safety inspection report with columns for parameters tested, acceptable limits, and observed values.
Why Regular OCR Is Not Enough for Tables
Standard OCR processes a document page and returns all the text it finds, typically in reading order (left to right, top to bottom). This works fine for paragraphs and letters. It fails for tables.
Consider a simple 3-column table with headers "Item", "Quantity", and "Price". Regular OCR might return:
Item Quantity Price
Cement 50 bags 350
Steel 20 kg 55
That looks okay as raw text, but there is no structure. Your code cannot reliably determine that "350" belongs to "Cement" and "55" belongs to "Steel" — especially when the table has 200 rows and some cells span multiple lines.
Table extraction solves this by first detecting table boundaries, then identifying rows and columns, and finally extracting text from each cell with its position in the grid.
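To make the difference concrete, here is a minimal Python sketch that stores the same three-row table as cells with grid positions instead of flat text. The `Cell` class and the `lookup` helper are hypothetical names used only for illustration; they are not part of any OCR library or of the BharatOCR response format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: int   # zero-based row index in the grid
    col: int   # zero-based column index in the grid
    text: str


# The same table as the flat OCR output above, with a position on every value.
cells = [
    Cell(0, 0, "Item"), Cell(0, 1, "Quantity"), Cell(0, 2, "Price"),
    Cell(1, 0, "Cement"), Cell(1, 1, "50 bags"), Cell(1, 2, "350"),
    Cell(2, 0, "Steel"), Cell(2, 1, "20 kg"), Cell(2, 2, "55"),
]


def lookup(cells, item, column):
    """Return the value in `column` for the row whose first cell equals `item`."""
    grid = {(c.row, c.col): c.text for c in cells}
    # Map header text to column index using row 0.
    headers = {text: col for (row, col), text in grid.items() if row == 0}
    for (row, col), text in grid.items():
        if col == 0 and text == item:
            return grid[(row, headers[column])]
    raise KeyError(item)


print(lookup(cells, "Cement", "Price"))
print(lookup(cells, "Steel", "Quantity"))
```

With positions attached, the question "what is the price of cement?" becomes a dictionary lookup rather than guesswork about whitespace.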
How PP-StructureV3 Handles Table Extraction
BharatOCR uses PP-StructureV3 from PaddlePaddle for table detection and extraction. The process works in stages.
Table detection identifies where tables exist on the page. A document might have text paragraphs above and below a table — the detector finds the table region and isolates it.
Cell detection maps out the grid structure within the table. It identifies row boundaries, column boundaries, and merged cells. This is the hard part — Indian documents often have inconsistent line styles, partial borders, or borderless tables where alignment is the only structural cue.
Text recognition runs OCR on each detected cell. Because the cell boundaries are already known, the text within each cell is attributed to the correct row and column.
The output is a structured representation of the table that preserves the grid layout.
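The last stage can be illustrated with a toy Python sketch. It is not how PP-StructureV3 is implemented: the "page" here is a list of text lines where characters stand in for pixels, `fake_ocr` is a stand-in for a real recognizer, and the cell coordinates are made up. The point is only that once cell boxes are known, each recognized string lands at a known (row, column), even when a cell spans two lines.

```python
def crop(page, box):
    """Cut the region (top, left, bottom, right) out of a page of text lines."""
    top, left, bottom, right = box
    return [line[left:right] for line in page[top:bottom]]


def fake_ocr(region):
    """Stand-in for a real recognizer: joins the non-empty lines of a region."""
    return " ".join(line.strip() for line in region if line.strip())


def extract_cells(page, cell_boxes, recognize):
    """Run `recognize` on each cell region and key the text by (row, col)."""
    return {pos: recognize(crop(page, box)) for pos, box in cell_boxes.items()}


# Toy page: three scan lines, 20 characters wide; "50 bags" wraps onto two lines.
page = [
    "Item".ljust(10) + "Quantity".ljust(10),
    "Cement".ljust(10) + "50".ljust(10),
    "".ljust(10) + "bags".ljust(10),
]

# Cell boxes as a cell detector might report them (hypothetical coordinates).
cell_boxes = {
    (0, 0): (0, 0, 1, 10), (0, 1): (0, 10, 1, 20),
    (1, 0): (1, 0, 3, 10), (1, 1): (1, 10, 3, 20),
}

grid = extract_cells(page, cell_boxes, fake_ocr)
print(grid)
```

Because the second table row's boxes cover both scan lines, the wrapped quantity is read as a single cell value instead of two unrelated fragments.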
Using the Table Extraction API
The API endpoint for table extraction is:
POST /api/v1/ocr/table
Here is a Python example:
import base64
import requests

API_URL = "https://api.bharatocr.com/api/v1/ocr/table"
API_KEY = "boc_your_api_key_here"

# Read the document and base64-encode it for the JSON request body.
with open("government_rate_list.pdf", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"image": image_data, "language": "hi"},
    timeout=60,
)
response.raise_for_status()
result = response.json()

# Print each detected table, one row per line.
for table in result["data"]["tables"]:
    print(f"Table with {len(table['rows'])} rows, {len(table['columns'])} columns")
    for row in table["rows"]:
        cells = [cell["text"] for cell in row["cells"]]
        print(" | ".join(cells))
The response JSON contains an array of detected tables, each with its rows and cells. Every cell includes the extracted text, confidence score, and bounding box coordinates.
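Since every cell carries a confidence score, one practical use is to flag doubtful cells for human review. The sketch below assumes the score is stored under a key named `confidence` with values between 0 and 1; the source does not state the key name or the scale, so treat both as assumptions and adjust them to the real response. The sample data is hand-written, not real API output.

```python
def low_confidence_cells(result, threshold=0.8):
    """Return (table, row, column, text, score) for cells scoring below threshold."""
    flagged = []
    for t, table in enumerate(result["data"]["tables"]):
        for r, row in enumerate(table["rows"]):
            for c, cell in enumerate(row["cells"]):
                # "confidence" is an assumed key name; a missing score counts as 0.
                score = cell.get("confidence", 0.0)
                if score < threshold:
                    flagged.append((t, r, c, cell["text"], score))
    return flagged


# Hand-written sample in the shape used by the example above.
sample = {"data": {"tables": [{"rows": [
    {"cells": [
        {"text": "Cement", "confidence": 0.98},
        {"text": "350", "confidence": 0.62},
    ]},
]}]}}

flagged = low_confidence_cells(sample)
print(flagged)
```

The threshold of 0.8 is arbitrary; a sensible value depends on how costly a wrong rate or amount is in your workflow.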
Integration Patterns
Once you have structured table data from the API, here are common ways to use it.
Export to CSV
The most straightforward integration — convert the table JSON to a CSV file:
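import csv

# Write each detected table to its own CSV file.
# If the file will be opened in Excel, encoding="utf-8-sig" helps it detect Hindi text.
for i, table in enumerate(result["data"]["tables"]):
    with open(f"table_{i}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for row in table["rows"]:
            writer.writerow([cell["text"] for cell in row["cells"]])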
This is useful for one-off conversions where someone needs the data in a spreadsheet.
Push to a Database
For automated pipelines, insert extracted table data directly into PostgreSQL or any database:
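import psycopg2

conn = psycopg2.connect("dbname=documents user=app")
cur = conn.cursor()

# Insert the data rows of the first extracted table, skipping its header row.
table = result["data"]["tables"][0]
for row in table["rows"][1:]:
    cells = [cell["text"] for cell in row["cells"]]
    if len(cells) < 4:
        continue  # skip rows that lack the four expected columns
    cur.execute(
        "INSERT INTO rate_list (item_code, description, unit, rate) "
        "VALUES (%s, %s, %s, %s)",
        (cells[0], cells[1], cells[2], cells[3]),
    )

conn.commit()
cur.close()
conn.close()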
Convert to Pandas DataFrame
For data analysis workflows:
import pandas as pd

# `table` is one entry of result["data"]["tables"]; its first row holds the headers.
rows_data = []
for row in table["rows"]:
    rows_data.append([cell["text"] for cell in row["cells"]])

df = pd.DataFrame(rows_data[1:], columns=rows_data[0])
print(df.describe())
Generate JSON for APIs
If the extracted table feeds into another API or microservice:
import json

# Turn each data row into a dict keyed by the text of the header row.
structured = []
headers = [cell["text"] for cell in table["rows"][0]["cells"]]
for row in table["rows"][1:]:
    entry = {}
    for j, cell in enumerate(row["cells"]):
        entry[headers[j]] = cell["text"]
    structured.append(entry)

print(json.dumps(structured, ensure_ascii=False, indent=2))
Tips for Better Table Extraction Results
Scan quality matters. Tables with thin or faint grid lines need higher resolution scans (300 DPI minimum). If the lines are barely visible to the human eye, the detector will struggle too.
Borderless tables are harder. If your documents use whitespace instead of lines to separate columns, ensure the spacing is consistent. Irregular spacing confuses column detection.
Merged cells need attention. Header rows that span multiple columns are common in Indian documents. The extraction handles standard merges, but unusual layouts may need post-processing.
Multi-page tables. If a table spans multiple pages, process each page separately and concatenate the results. Check for repeated headers on continuation pages and remove duplicates.
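The multi-page tip can be sketched in a few lines of Python. This assumes each page has already been reduced to a list of rows, each row a list of cell text, and that the first row of the first page is the header. `merge_pages` is a hypothetical helper, not part of the API. Note that it drops any later row identical to the header, not only the first row of each page.

```python
def merge_pages(pages):
    """Concatenate per-page row lists, dropping header rows repeated on later pages.

    pages: list of tables, one per page, each a list of rows (lists of cell text).
    """
    if not pages or not pages[0]:
        return []
    header = [text.strip() for text in pages[0][0]]
    merged = list(pages[0])  # keep the first page whole, header included
    for page in pages[1:]:
        for row in page:
            if [text.strip() for text in row] == header:
                continue  # repeated header on a continuation page
            merged.append(row)
    return merged


page_1 = [["Item", "Unit", "Rate"], ["Cement", "bag", "350"]]
page_2 = [["Item", "Unit", "Rate"], ["Steel", "kg", "55"]]

merged = merge_pages([page_1, page_2])
for row in merged:
    print(" | ".join(row))
```

Exact matching is deliberately strict. If OCR noise makes repeated headers differ slightly between pages, a looser comparison would be needed.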
How BharatOCR Helps
BharatOCR's table extraction handles the specific challenges of Indian documents — Hindi text in cells, inconsistent formatting across government departments, and mixed-language tables.
PP-StructureV3 powers the table detection and cell extraction. Combined with PaddleOCR PP-OCRv5 for the text recognition within each cell, you get structured data with 95%+ text accuracy and proper row-column attribution.
Processing supports up to 50 pages per batch request across JPEG, PNG, PDF, TIFF, and BMP formats. The API endpoints are:
POST /api/v1/ocr/table for table extraction
POST /api/v1/ocr for regular text extraction
GET /api/v1/usage for monitoring your consumption
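For documents longer than the 50-page batch limit mentioned above, the pages have to be grouped before sending. The sketch below shows only that grouping step in Python. The source does not describe the batch request format, so no request is made here, and the `batches` helper and file names are hypothetical.

```python
def batches(pages, size=50):
    """Yield consecutive slices of `pages`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


# 120 hypothetical page images split into groups of at most 50.
page_files = [f"page_{n:03d}.png" for n in range(1, 121)]
sizes = [len(chunk) for chunk in batches(page_files)]
print(sizes)
```

Keeping pages in their original order within each group makes it easier to stitch multi-page tables back together afterwards.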
Start free with 3 pages, then Rs 5 per page or monthly plans from Rs 999 to Rs 9,999. BharatOCR is built by Meridian Intelligence Pvt. Ltd.
Structured data from Indian documents should not require manual data entry. Send us the document, and we will send you the table.