
Gemini-Powered OCR & Intelligent Mapping
An AI-powered platform to extract, validate, and map business data (e.g., product codes, quantities) from multi-modal inputs like text, images, and large PDFs.
Defining the core problem and identified pain points that necessitated this technical intervention.
Manual data entry for sales orders was a significant bottleneck. Staff had to decipher handwritten notes, photos, and multi-page scanned PDFs to find product codes and quantities. They then cross-referenced this data by hand against complex mapping tables (both general and customer-specific) before ERP entry. The process was slow, expensive, and prone to frequent, costly errors.
The architectural and implementation strategy developed to resolve the challenge.
Built two distinct, stateful agents: one for general industry mapping and one for customer-specific mapping. Each agent accepts multi-modal (text/image) inputs, uses Gemini's structured output to extract order items (e.g., 'Code-123 X10'), maps them to internal reference codes, and generates an import-ready CSV after a human-in-the-loop confirmation.
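The mapping and CSV step can be illustrated with a minimal sketch. It assumes extraction has already produced structured items, and every name in it (`OrderItem`, `normalize`, `map_items`, `to_csv`, the `internal_code` column) is hypothetical rather than taken from the project. The case-insensitive lookup is also an assumption about how codes might be matched.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderItem:
    """One extracted line item, e.g. 'Code-123 X10' -> ('Code-123', 10)."""
    source_code: str
    quantity: int


def normalize(code: str) -> str:
    # Assumed normalization: trim whitespace and ignore case.
    return code.strip().upper()


def map_items(items, mapping):
    """Split items into (mapped rows, unmapped items) using one mapping table."""
    table = {normalize(key): value for key, value in mapping.items()}
    mapped, unmapped = [], []
    for item in items:
        internal = table.get(normalize(item.source_code))
        if internal is None:
            unmapped.append(item)  # left for the human reviewer to resolve
        else:
            mapped.append((internal, item.quantity))
    return mapped, unmapped


def to_csv(rows) -> str:
    """Serialize confirmed rows to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["internal_code", "quantity"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Returning unmapped items separately, instead of dropping them or raising, keeps them visible for the confirmation step.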
Created an asynchronous API endpoint that accepts large PDF uploads. A background task worker splits the PDF into images (using PyMuPDF), runs the same Gemini extraction logic on every page, and aggregates all results into a single, downloadable Excel file.
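A rough sketch of the per-page fan-out and aggregation, using only `asyncio`. It assumes the pages have already been rendered to image bytes. `extract_page` is a hypothetical stand-in for the Gemini call, and the bounded concurrency (`max_concurrency`) is an illustrative choice, not a documented detail of the project.

```python
import asyncio


async def extract_page(page_number: int, image: bytes) -> list[dict]:
    """Hypothetical stand-in for the per-page Gemini extraction call."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    # Dummy row so the sketch runs; a real call would return extracted items.
    return [{"page": page_number, "code": f"Code-{page_number}", "quantity": len(image)}]


async def process_pages(images: list[bytes], max_concurrency: int = 4) -> list[dict]:
    """Extract every page with bounded concurrency and return rows in page order."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(number: int, image: bytes) -> list[dict]:
        async with semaphore:  # cap the number of in-flight extraction calls
            return await extract_page(number, image)

    # gather() returns results in input order, so rows stay sorted by page.
    per_page = await asyncio.gather(
        *(worker(number, image) for number, image in enumerate(images, start=1))
    )
    return [row for rows in per_page for row in rows]
```

Calling `asyncio.run(process_pages(images))` from the background worker yields one flat list of rows, which can then be written to the spreadsheet in a single pass.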
Built the entire system as a secure FastAPI application with full user/admin roles, JWT/OTP authentication, and dedicated APIs for managing the mapping databases.
My specific roles, responsibilities, and the technical value I added to the project lifecycle.
Architected the complete FastAPI backend, including the secure JWT/OTP authentication and admin role-based access control.
Designed and built the dual LangGraph conversational agent framework (general vs. customer-specific) for stateful, multi-modal data extraction.
Developed the high-throughput, asynchronous batch processing pipeline for PDFs, using PyMuPDF for image splitting and background tasks for reliable processing.
Engineered the core multi-modal input system, enabling the Gemini-powered tools to extract structured data from text, images, and scanned PDF pages.
Created the full suite of CRUD APIs for managing the complex mapping databases, including features for bulk CSV upload and export.
Implemented the session logic for the conversational agents to support a human-in-the-loop confirmation workflow before final mapping.
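The confirmation workflow can be pictured as a small state machine. The sketch below is plain Python rather than LangGraph, and `ReviewSession`, `Status`, `correct`, and `confirm` are hypothetical names chosen for illustration. It assumes a reviewer may adjust proposed quantities until the session is confirmed, after which the rows are locked.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    AWAITING_CONFIRMATION = "awaiting_confirmation"
    CONFIRMED = "confirmed"


@dataclass
class ReviewSession:
    """Holds proposed rows until a human confirms them."""
    proposed: dict[str, int]  # source code -> quantity
    status: Status = Status.AWAITING_CONFIRMATION

    def correct(self, code: str, quantity: int) -> None:
        """Apply a reviewer's correction; only allowed before confirmation."""
        if self.status is not Status.AWAITING_CONFIRMATION:
            raise RuntimeError("session already confirmed")
        self.proposed[code] = quantity

    def confirm(self) -> dict[str, int]:
        """Lock the session and return the rows that may proceed to mapping."""
        self.status = Status.CONFIRMED
        return dict(self.proposed)
```

Rejecting corrections after confirmation means the rows that reach the final mapping are the ones the reviewer approved.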
Gemini-powered OCR achieved 98% field-level accuracy across handwritten notes, images, and scanned PDFs.
Batch-processed a 20-page sales order PDF in under 3 minutes, down from a 2-hour manual effort.
Automated data entry saved the client team over 1,500 hours of manual work in the first year alone.
Dual agent framework covered both general industry mapping and customer-specific code tables from one platform.
The human-in-the-loop confirmation step eliminated costly ERP data entry mistakes by catching them at the source.
The async batch pipeline handles large, multi-page PDFs without blocking the API or hitting request timeouts, because processing runs in a background task worker.