Agentic AI for Grocery Receipt Analytics

Overview

GroceryBot automates grocery receipt processing, extracting structured data from photos for personal expense tracking and analytics. Manual data entry is slow and error-prone, and standard parsing solutions don’t handle semi-structured receipts with varying formats well.

Early Prototype & Validation

I built a quick prototype using n8n (visual workflow automation) and a minimal web UI.

Goal: Validate that OCR + basic extraction could work for real receipts and gauge user value.

Result: Proved technical feasibility, uncovered critical UX and accuracy gaps, and built conviction for further investment.

The early prototype worked well most of the time, but it often produced parsing errors and offered no structured way to evaluate prompt and model changes.

Golden Set Creation & Model Evaluation

As part of moving from a low-code platform (n8n) to Python, I created a golden set of real receipts with hand-labeled outputs for evaluating and validating models and prompts. Building on the earlier experimentation with n8n, I iterated on prompts and system messages until output quality improved markedly.

F1 values for Item Fields

Improvement in F1 with each version of Parserator Agent (via Prompts, System Message, and model improvements)

By focusing on the primary metrics, Error Rate (%) and Overall Accuracy (helpfulness), the initial parser agent went from a prototype with inconsistent, error-prone extraction to one that produces robust, nearly error-free, and highly normalized output.

Item field extraction F1 score improved from ~0.80 to ~0.96.
Top-level field extraction (store name, date, amount) is now consistently perfect.
Normalization and unit handling are now highly accurate, with most remaining errors isolated to rare or ambiguous tag assignments.
JSON and parsing errors have been virtually eliminated.
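
A minimal sketch of how an item-field F1 score could be computed against one golden-set receipt. The field names, the sample items, and the choice to compare (field, value) pairs as a multiset are assumptions for illustration; the original evaluation code is not shown in this write-up.

```python
from collections import Counter

# Hypothetical item fields; the real schema may differ.
ITEM_FIELDS = ("name", "quantity", "price")


def item_pairs(items, fields=ITEM_FIELDS):
    """Flatten line items into hashable (field, value) pairs."""
    return [(f, item[f]) for item in items for f in fields if f in item]


def field_f1(predicted_pairs, expected_pairs):
    """Micro-averaged F1 over (field, value) pairs, treated as multisets."""
    pred = Counter(predicted_pairs)
    gold = Counter(expected_pairs)
    true_pos = sum((pred & gold).values())  # pairs present in both
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative data only: one hand-labeled receipt and one parser output.
gold_items = [
    {"name": "milk", "quantity": 1, "price": 2.49},
    {"name": "eggs", "quantity": 12, "price": 3.99},
]
predicted_items = [
    {"name": "milk", "quantity": 1, "price": 2.49},
    {"name": "egg", "quantity": 12, "price": 3.99},  # one wrong field
]

score = field_f1(item_pairs(predicted_items), item_pairs(gold_items))
print(f"item-field F1: {score:.3f}")
```

Here five of the six predicted pairs match the labels, so precision and recall are both 5/6. Averaging such scores over the whole golden set gives one number per parser version to compare.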

Agentic System Design

The system evolved into a modular, multi-agent design:

Orchestrator: Coordinates agent flow.
Parserator: Extracts initial data.
Metador: Matches line items using retrieval-augmented generation (RAG) backed by a vector DB (ChromaDB). The store grows over time, reducing the need to manually enter or correct data for repeat purchases.
Namer/Tagger: Handle ambiguous or unmatched items.
Guardrails: Schema validation and logging at every stage.
Memory: Persistent (DB), ephemeral (Redis), and vector (ChromaDB).
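
The idea behind Metador's matching of repeat purchases can be sketched as follows. This is not the ChromaDB-based implementation: it swaps embedding search for plain string similarity from Python's standard library so the example runs on its own. The class name, the threshold, and the sample receipt strings are all hypothetical.

```python
from difflib import SequenceMatcher


class KnownItemStore:
    """In-memory stand-in for the vector store (assumption, not ChromaDB)."""

    def __init__(self, threshold=0.8):
        self.threshold = threshold  # minimum similarity to accept a match
        self._items = {}            # raw receipt text -> confirmed metadata

    def remember(self, raw_text, metadata):
        """Store metadata the user has confirmed for a receipt line."""
        self._items[raw_text.lower()] = metadata

    def lookup(self, raw_text):
        """Return metadata of the closest known line, or None if too far."""
        query = raw_text.lower()
        best_score, best_meta = 0.0, None
        for known, meta in self._items.items():
            score = SequenceMatcher(None, query, known).ratio()
            if score > best_score:
                best_score, best_meta = score, meta
        return best_meta if best_score >= self.threshold else None


store = KnownItemStore()
store.remember("ORG WHL MLK 1GAL", {"name": "Organic whole milk", "tags": ["dairy"]})

repeat = store.lookup("ORG WHL MLK 1 GAL")  # near-duplicate of a known line
unknown = store.lookup("BANANAS")           # nothing similar stored yet
```

A hit lets the pipeline reuse confirmed names and tags; a miss is what would fall through to the Namer/Tagger agents and, after user review, be added to the store.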

Evaluation & Iteration

Every system update was benchmarked against the golden set, and minor changes were evaluated via an agent dashboard to spot potential regressions and visualize improvements over time.

Logged metrics included extraction accuracy, error rates, latency and token usage - enabling safe, data-driven iterations.
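
Per-agent logging of latency, token usage, and errors might look roughly like the sketch below. The `MetricsLog` class and its fields are hypothetical names, and the token count is a placeholder for whatever usage figure the LLM response reports.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class AgentMetrics:
    calls: int = 0
    errors: int = 0
    total_seconds: float = 0.0
    total_tokens: int = 0


class MetricsLog:
    """Accumulates simple counters per agent name."""

    def __init__(self):
        self.by_agent = {}

    @contextmanager
    def track(self, agent):
        stats = self.by_agent.setdefault(agent, AgentMetrics())
        usage = {"tokens": 0}  # the caller fills this in during the call
        start = time.perf_counter()
        try:
            yield usage
        except Exception:
            stats.errors += 1
            raise
        finally:
            # Runs for both successful and failed calls.
            stats.calls += 1
            stats.total_seconds += time.perf_counter() - start
            stats.total_tokens += usage["tokens"]


log = MetricsLog()

with log.track("parserator") as usage:
    usage["tokens"] = 1250  # placeholder value for illustration

try:
    with log.track("parserator"):
        raise ValueError("malformed JSON from model")  # simulated failure
except ValueError:
    pass

print(log.by_agent["parserator"])
```

Counters like these can then be written to whatever database the dashboard reads from.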

Overall Helpfulness Metrics - The North Star drives prioritization:

Performance Tracking - Latency and token usage tracking by agent supports continuous optimization and cost control. (Recently Introduced)

User Experience

Designed a clean mobile UI for easy photo capture, rapid review, and human-in-the-loop correction, which is essential for responsible AI.

Receipt Upload Process:

Receipt Verification Process:

Users gain actionable insights from a rich analytics dashboard powered by agentic data extraction.

Reflection

This project was a self-guided masterclass in identifying, validating, and systematically solving a problem uniquely suited to agentic AI. My approach was shaped by several core lessons from agentic product management:

1. Identify Where Agentic AI Adds Unique Value

Rather than defaulting to classic automation, I focused on a pain point where LLMs and multi-agent systems can truly outperform traditional rule-based solutions: extracting structured data from semi-structured receipts with highly variable layouts. The problem is similar to parsing and classifying invoices or other business documents.

2. Start Lean, Validate Early, and Measure Objectively

Inspired by agentic PM best practices, I started with a rapid prototype (n8n + web UI) to prove feasibility and create a working concept without utilizing engineering resources. Afterwards, I created a “golden set” of real receipts and hand-labeled outputs to move beyond guesswork and ensure that every model, prompt, and system change was measured objectively. This allowed rapid iteration, confidence in improvements, and a reliable baseline for scaling complexity.

3. Decompose the Solution into Agents and Guardrails

Rather than building a monolith, I architected the system as a set of collaborating agents (Parserator, Metador, Namer, Tagger) - each with a specific goal and instructions, orchestrated for division of labor and extensibility. At each hand-off, schema validation and system message rules (“guardrails”) ensured reliability—a core lesson from agentic frameworks. This agentic structure made it possible to adapt, improve, and debug each part independently.
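
A minimal sketch of orchestration with a schema check at every hand-off. The schema fields, the stub agents, and the function names are hypothetical stand-ins; the real agents call an LLM and are not shown here.

```python
# Hypothetical receipt schema: required key -> expected Python type.
RECEIPT_SCHEMA = {"store": str, "date": str, "total": float, "items": list}


def validate(payload, schema=RECEIPT_SCHEMA):
    """Guardrail: fail fast if a required field is missing or mistyped."""
    for key, expected_type in schema.items():
        if key not in payload:
            raise ValueError(f"missing field: {key}")
        if not isinstance(payload[key], expected_type):
            raise ValueError(f"wrong type for field: {key}")
    return payload


def run_pipeline(raw_text, parse, stages):
    """Orchestrator: parse once, then pass the receipt through each stage."""
    receipt = validate(parse(raw_text))
    for stage in stages:
        receipt = validate(stage(receipt))  # check every hand-off
    return receipt


def stub_parserator(raw_text):
    """Stand-in for the LLM-backed parser; returns fixed sample data."""
    return {
        "store": "Corner Market",
        "date": "2024-01-15",
        "total": 6.48,
        "items": [{"raw": "ORG WHL MLK", "price": 2.49},
                  {"raw": "EGGS 12CT", "price": 3.99}],
    }


def stub_tagger(receipt):
    """Stand-in for the Tagger agent; adds a placeholder tag to each item."""
    tagged = [dict(item, tags=["grocery"]) for item in receipt["items"]]
    return {**receipt, "items": tagged}


def broken_stage(receipt):
    """Simulates an agent that drops a required field."""
    return {k: v for k, v in receipt.items() if k != "total"}


result = run_pipeline("raw OCR text", stub_parserator, [stub_tagger])
```

Because each stage is just a function from receipt to receipt, one agent can be swapped or debugged without touching the others, and a stage like `broken_stage` is caught at the hand-off instead of further downstream.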

4. Prioritize Metrics, Responsible AI, and UX from the Start

Guided by frameworks like “North Star metrics,” “HHH” (Helpful, Honest, Harmless), and guardrail evaluation, I built the stack to capture real metrics (accuracy, error rates, latency, cost) early and often. This data-driven rigor enabled systematic improvement and transparent, user-centric iteration. Responsible AI was not an afterthought: human-in-the-loop review, error handling, and clear user feedback channels were prioritized throughout. Even though the stakes are low, user trust and data correctness remain essential.

5. Roadmap and Scope with Focus

Following the agentic PM advice to “validate riskiest assumptions first,” I narrowed initial scope to core extraction and verification flows. As confidence and capability grew, I incrementally added advanced features (RAG, memory, analytics dashboard), always measuring value added and complexity introduced.

6. Bridge Classic and Agentic PM

This project fused classic PM discipline—clear goals, user empathy, iterative design—with agentic AI PM’s focus on modular agents, systematic evaluation, and robust guardrails. The result was not just a “smarter” feature but a robust, extensible, and differentiated product.

GroceryBot is proof that applying agentic AI PM—layered with classic product craft—can turn even a “messy data” problem into a robust, user-centered solution.

Tools and Technologies

Productivity: Trello (custom automations)

UI Design: Adobe Fresco + Storybook

Front End: React (Dockerized), Storybook (Component Dev & Docs)

Back End: Python, Flask, LangChain (Dockerized)

AI/LLM: OpenAI API, Retrieval-Augmented Generation (RAG), ChromaDB

Testing/Quality: Jest & pytest (Unit & Integration testing)

Dashboards/Analytics: Metabase (Dockerized)

Workflow Automation (Concept): n8n

Video Demo

Receipt processing still takes a while, but it gets faster as more items from each store are added to the vector database.