Multi-Modal AI System for Data Extraction & Mapping

Gemini-Powered OCR & Intelligent Mapping

An AI-powered platform to extract, validate, and map business data (e.g., product codes, quantities) from multi-modal inputs like text, images, and large PDFs.

Python, FastAPI, Pydantic, LangGraph, LangChain, MongoDB, PyMuPDF, JWT, OAuth2, BackgroundTasks

Timeline

2 Weeks

Intensity

Individual

Access

Proprietary

Revision

Mar 2026

The Challenge

The core problem and the pain points that made this technical intervention necessary.

Manual data entry for sales orders was a significant bottleneck. Staff had to decipher handwritten notes, photos, and multi-page scanned PDFs to find product codes and quantities, then manually cross-reference that data against complex mapping tables (both general and customer-specific) before ERP entry. The process was slow, expensive, and prone to frequent, costly errors.

Technical Solution

The architectural and implementation strategy developed to resolve the challenge.

Conversational AI Agents (LangGraph)

Built two distinct, stateful agents — one for general industry mapping and one for customer-specific mapping. The agents accept multi-modal (text/image) inputs, use Gemini's structured output to extract order items (e.g., 'Code-123 X10'), map them to internal reference codes, and generate a final, import-ready CSV after a human-in-the-loop confirmation.
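The mapping and CSV step can be sketched as follows. This is a minimal illustration, not the project's code: it starts from items that have already been extracted (standing in for the model's structured output), uses plain dataclasses where the real system would likely use Pydantic models, and treats the mapping table as an in-memory dict. The names `OrderItem`, `MappedItem`, `map_items`, `to_csv`, and the CSV column names are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrderItem:
    """One extracted line item, e.g. 'Code-123 X10' -> ('Code-123', 10)."""
    code: str
    quantity: int


@dataclass
class MappedItem:
    source_code: str
    internal_code: Optional[str]  # None means no mapping was found
    quantity: int


def map_items(items: list, table: dict) -> list:
    """Look up each extracted code in a mapping table (assumed to be a dict).

    Each agent would pass its own table: the general one or a
    customer-specific one. Unmapped items are kept and flagged with None
    so a human can resolve them during confirmation.
    """
    return [MappedItem(i.code, table.get(i.code), i.quantity) for i in items]


def to_csv(rows: list) -> str:
    """Serialize confirmed rows to CSV text (column names are assumptions)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source_code", "internal_code", "quantity"])
    for row in rows:
        writer.writerow([row.source_code, row.internal_code or "", row.quantity])
    return buffer.getvalue()
```

Keeping unmapped items in the result, rather than dropping them, gives the confirmation step something concrete to show the reviewer.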

Batch OCR Pipeline

Created an asynchronous API endpoint that accepts large PDF uploads. A background task worker splits the PDF into images (using PyMuPDF), runs the same Gemini extraction logic on every page, and aggregates all results into a single, downloadable Excel file.
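The shape of such a batch job can be sketched with the standard library alone. The sketch below assumes a simple in-memory job registry and injects the two expensive steps as callables: `render_pages` (which in the described system would be backed by PyMuPDF) and `extract_page` (which would call the Gemini extraction logic). `Job`, `JOBS`, `submit_job`, and `run_job` are hypothetical names, and the Excel export is left out.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class Job:
    status: str = "queued"  # queued -> running -> done | failed
    pages_done: int = 0
    rows: list = field(default_factory=list)
    error: Optional[str] = None


# In-memory registry; a real deployment would persist this (assumption).
JOBS: dict = {}


def submit_job() -> str:
    """Register a job and return its ID so the upload request can return at once."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = Job()
    return job_id


def run_job(
    job_id: str,
    pdf_bytes: bytes,
    render_pages: Callable[[bytes], Iterable[bytes]],
    extract_page: Callable[[bytes], list],
) -> None:
    """Worker body: render each page, extract its rows, aggregate the results."""
    job = JOBS[job_id]
    job.status = "running"
    try:
        for page_number, image in enumerate(render_pages(pdf_bytes), start=1):
            for row in extract_page(image):
                # Tag every row with its page so reviewers can trace it back.
                job.rows.append({"page": page_number, **row})
            job.pages_done = page_number
        job.status = "done"
    except Exception as exc:  # record the failure instead of losing it
        job.status = "failed"
        job.error = str(exc)
```

In this arrangement the upload handler would call `submit_job()`, schedule `run_job(...)` as a background task, and return the job ID; a separate status endpoint could then read `JOBS[job_id]`.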

Secure Platform Layer

Built the entire system as a secure FastAPI application with full user/admin roles, JWT/OTP authentication, and dedicated APIs for managing the mapping databases.
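To make the token and role check concrete, here is a standard-library sketch of HS256 JWT signing, verification, and a role guard. It only illustrates the mechanism: the claim names (`sub`, `role`, `exp`), the helper names, and the 15-minute default lifetime are assumptions, the OTP step is not shown, and a production service would normally rely on a maintained JWT library instead.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used by the JWT compact form."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(secret: bytes, signing_input: str) -> str:
    return _b64url(hmac.new(secret, signing_input.encode(), hashlib.sha256).digest())


def issue_token(subject, role, secret, ttl_seconds=900, now=None):
    """Create a signed token carrying the user's identity and role."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    claims = {"sub": subject, "role": role, "exp": now + ttl_seconds}
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_sign(secret, header + '.' + payload)}"


def verify_token(token, secret, now=None):
    """Return the claims if the signature is valid and the token is unexpired."""
    now = int(time.time()) if now is None else now
    parts = token.split(".")
    if len(parts) != 3:
        raise PermissionError("malformed token")
    header, payload, signature = parts
    expected = _sign(secret, header + "." + payload)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        raise PermissionError("bad signature")
    claims = json.loads(_b64url_decode(payload))
    if claims["exp"] <= now:
        raise PermissionError("token expired")
    return claims


def require_role(claims, role):
    """Guard for admin-only operations such as editing mapping tables."""
    if claims.get("role") != role:
        raise PermissionError(f"role {role!r} required")
```

In a FastAPI application, checks like `verify_token` and `require_role` would typically sit in dependencies so that every protected route applies them the same way.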

Key Contributions

My specific roles, responsibilities, and the technical value I added to the project lifecycle.

Backend Architecture

Architected the complete FastAPI backend, including the secure JWT/OTP authentication and admin role-based access control.

Dual Agent Framework

Designed and built the dual LangGraph conversational agent framework (general vs. customer-specific) for stateful, multi-modal data extraction.

Batch Processing Pipeline

Developed the high-throughput, asynchronous batch processing pipeline for PDFs, using PyMuPDF for image splitting and background tasks for reliable processing.

Multi-Modal Input System

Engineered the core multi-modal input system, enabling the Gemini-powered tools to extract structured data from text, images, and scanned PDF pages.
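One way to route mixed inputs to the right handler is to inspect the payload itself rather than trust a file extension. The sketch below is an assumption about how such routing could work, not a description of the project's code; `detect_kind` is a hypothetical helper, and only the PDF, PNG, and JPEG file signatures are checked.

```python
from typing import Union

PDF_MAGIC = b"%PDF-"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def detect_kind(payload: Union[str, bytes]) -> str:
    """Classify an input as 'text', 'image', or 'pdf' from its leading bytes."""
    if isinstance(payload, str):
        return "text"
    if payload.startswith(PDF_MAGIC):
        return "pdf"
    if payload.startswith(PNG_MAGIC) or payload.startswith(JPEG_MAGIC):
        return "image"
    raise ValueError("unsupported input type")
```

A PDF would then go through the page-splitting path, while text and images can be passed to the extraction step directly.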

Mapping Database APIs

Created the full suite of CRUD APIs for managing the complex mapping databases, including features for bulk CSV upload and export.
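A bulk upload and export round trip might look like the sketch below. It is illustrative only: the column names `source_code` and `internal_code` are assumptions, a dict stands in for the MongoDB collection, and `import_mappings` and `export_mappings` are hypothetical helpers.

```python
import csv
import io

# Hypothetical column names for a mapping file.
REQUIRED_COLUMNS = ("source_code", "internal_code")


def import_mappings(csv_text: str, table: dict):
    """Upsert rows from CSV text into the table.

    Returns (number of rows upserted, line numbers of rejected rows).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    upserted, rejected = 0, []
    for row in reader:
        source = (row["source_code"] or "").strip()
        internal = (row["internal_code"] or "").strip()
        if not source or not internal:
            # Report the offending line instead of failing the whole upload.
            rejected.append(reader.line_num)
            continue
        table[source] = internal  # later rows overwrite earlier ones
        upserted += 1
    return upserted, rejected


def export_mappings(table: dict) -> str:
    """Write the table back out as CSV, sorted for stable diffs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(REQUIRED_COLUMNS)
    for source in sorted(table):
        writer.writerow([source, table[source]])
    return buffer.getvalue()
```

Returning the rejected line numbers lets an admin fix a handful of bad rows without re-uploading the whole file.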

HITL Session Logic

Implemented the session logic for the conversational agents to support a human-in-the-loop confirmation workflow before final mapping.
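The confirmation gate can be pictured as a small state machine. The sketch below is independent of LangGraph and only illustrates the idea; `Stage`, `Session`, and the method names are hypothetical, and the real agents keep this state inside the graph rather than in a standalone class.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    COLLECTING = "collecting"
    AWAITING_CONFIRMATION = "awaiting_confirmation"
    CONFIRMED = "confirmed"


@dataclass
class Session:
    stage: Stage = Stage.COLLECTING
    proposed: list = field(default_factory=list)

    def propose(self, items: list) -> None:
        """Agent presents extracted items and waits for the human."""
        if self.stage is Stage.CONFIRMED:
            raise RuntimeError("session already confirmed")
        self.proposed = list(items)
        self.stage = Stage.AWAITING_CONFIRMATION

    def respond(self, approved: bool) -> None:
        """Human approves the proposal or sends the agent back to collecting."""
        if self.stage is not Stage.AWAITING_CONFIRMATION:
            raise RuntimeError("nothing to confirm")
        if approved:
            self.stage = Stage.CONFIRMED
        else:
            self.proposed = []
            self.stage = Stage.COLLECTING

    def final_items(self) -> list:
        """Only confirmed items may flow on to mapping and export."""
        if self.stage is not Stage.CONFIRMED:
            raise RuntimeError("items have not been confirmed by a human")
        return list(self.proposed)
```

Because `final_items` refuses to return anything before confirmation, the downstream mapping and export steps cannot run on unreviewed data by accident.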

Performance Metrics

Results & Impact

Extraction Accuracy

98%

Gemini-powered OCR achieved 98% field-level accuracy across handwritten notes, images, and scanned PDFs.

Per-PDF Processing

3 min

Batch-processed a 20-page sales order PDF in under 3 minutes — down from a 2-hour manual effort.

Man-Hours Saved

1,500+

Automated data entry saved the client team over 1,500 hours of manual work in the first year alone.

Agent Modes

Dual agent framework covered both general industry mapping and customer-specific code tables from one platform.

ERP Entry Errors

The human-in-the-loop confirmation step eliminated costly ERP data entry mistakes by catching them at the source.

PDF Page Scale

∞

The async batch pipeline processes PDFs page by page in a background task, so uploads of any size neither block the API nor time out the request.