Cooklist
About Cooklist
Cooklist is the AI grocery‑intelligence platform powering meal planning and shopping for millions of consumers across our consumer app and white‑label enterprise suite. Our mission is to combine the intelligence of a personal shopper, chef, and nutritionist to help people save time, eat better, and enjoy happier lives.
We’re profitable, process billions of dollars in transactions at the nation’s largest retailers, and our mobile experiences reach millions of people. We’re backed by Techstars, Mercury Fund, and industry leaders including the former CTO of Kroger and the Chief Product Officer of Amazon Fresh.
Role Overview
We’re hiring a Senior AI Engineer (LLMs & Agents) to design and operationalize the intelligence behind Cooklist’s AI grocery shopping assistant, which is embedded in the digital experiences of top US retailers and in the Cooklist app. Your primary job: own the evals, reliability, and workflow architecture that make an agentic system trustworthy at scale.
You’ll design prompts, tool‑calling strategies, retrieval pipelines, and safety guardrails; build the eval harness and real‑time monitoring systems; and drive model/latency tradeoffs that turn demos into production‑grade performance. This is a founding‑level role on a tiny team where your decisions directly impact millions of shoppers and where accuracy around allergens and nutrition is critical.
We are an AI‑leveraged org: our question is always “How do we use AI to build AI?” You’ll set patterns, tests, and guardrails that allow AI assistants to safely contribute to the codebase, compounding our output.
Responsibilities
Own LLM reliability end‑to‑end: architect prompts, tools, and reasoning workflows that meet strict accuracy, safety, and latency requirements.
Design robust evals: build offline/online eval suites for structured output, factuality, grounding, allergen sensitivity, and user‑goal attainment; define gold sets, synthetic data pipelines, and automatic failure taxonomies.
Productionize agent workflows: retrieval‑augmented generation, tool calling over GraphQL/WebSockets, function/tool schemas, and strict JSON output contracts.
Model strategy: evaluate and deploy model mixes (reasoning vs. fast paths), caching strategies, and guardrails to balance quality, latency, and cost.
Monitoring & observability: ship real‑time conversation analytics, drift detection, canary/shadow testing, incident taxonomies, and auto‑triage for misbehavior.
Safety & compliance: encode domain policies for allergens, dietary restrictions, and nutrition; implement red‑team tests, constraints, and fallbacks.
Tight product loop: partner with mobile/backend to ship agent features, collect outcome‑level telemetry, and iterate quickly (“build first, refine fast”).
Scale the system: create libraries, prompts, schemas, and tests that let AI coding assistants contribute safely; document playbooks and upgrade paths.
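To make the eval work above concrete: a minimal sketch of an offline eval harness in the spirit of this role, checking parseability and field-level accuracy against a gold set and bucketing failures into a simple taxonomy. All names here (`GOLD_CASES`, `fake_model`, `run_eval`, and the failure categories) are illustrative assumptions, not Cooklist's actual system; `fake_model` stands in for a real LLM call.

```python
import json

# Hypothetical gold set: each case pairs a user request with expected structured fields.
GOLD_CASES = [
    {"prompt": "Add peanut-free granola to my list",
     "expected": {"item": "granola", "allergen_free": ["peanut"]}},
    {"prompt": "Swap in a gluten-free pasta",
     "expected": {"item": "pasta", "allergen_free": ["gluten"]}},
]

def fake_model(prompt: str) -> str:
    """Stand-in for an LLM call; returns a JSON string."""
    if "granola" in prompt:
        return json.dumps({"item": "granola", "allergen_free": ["peanut"]})
    return json.dumps({"item": "pasta", "allergen_free": ["gluten"]})

def run_eval(model, cases):
    """Score parseability and field accuracy; bucket failures by type."""
    failures = {"parse_error": 0, "wrong_item": 0, "allergen_miss": 0}
    passed = 0
    for case in cases:
        try:
            out = json.loads(model(case["prompt"]))
        except json.JSONDecodeError:
            failures["parse_error"] += 1
            continue
        if out.get("item") != case["expected"]["item"]:
            failures["wrong_item"] += 1
        elif set(out.get("allergen_free", [])) != set(case["expected"]["allergen_free"]):
            failures["allergen_miss"] += 1
        else:
            passed += 1
    return {"pass_rate": passed / len(cases), "failures": failures}

print(run_eval(fake_model, GOLD_CASES))
```

A production version would replace the stub with real model calls, expand the failure taxonomy, and feed results into dashboards and regression gates.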
Qualifications
You’ve shipped LLM systems to production with real user impact, ideally including agentic loops, tool calling, and structured outputs at scale.
You’re fluent in Python and have built eval harnesses, automated datasets, and dashboards for LLM quality.
You’ve implemented RAG (indexing, chunking, embeddings, reranking) and understand failure modes (hallucination, grounding, duplication, drift).
You can design and enforce strict output schemas, guarantee parseability, and build deterministic fallbacks.
You’re comfortable making model tradeoffs (reasoning models vs. smaller/cheaper paths; latency budgets; cost controls) and can prove the impact.
You care about safety (allergens, dietary needs, policy adherence) and can translate product risk into tests, gates, and roll‑out controls.
You move with founder energy: high ownership, high bar for polish, gritty, and calm under production pressure.
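The "strict schemas, guaranteed parseability, deterministic fallbacks" requirement can be sketched in a few lines. The contract shape (`REQUIRED_FIELDS`, `SAFE_FALLBACK`, `parse_with_fallback`) is a hypothetical illustration, not the actual Cooklist contract: a model response either validates against the contract or is replaced by a safe, deterministic default.

```python
import json

# Hypothetical output contract for a shopping-assistant action.
REQUIRED_FIELDS = {"action": str, "item": str, "quantity": int}

# Deterministic fallback: ask the user to clarify rather than guess.
SAFE_FALLBACK = {"action": "clarify", "item": "", "quantity": 0}

def parse_with_fallback(raw: str) -> dict:
    """Validate a model response against the contract; fall back deterministically."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return SAFE_FALLBACK
    for field, ftype in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), ftype):
            return SAFE_FALLBACK
    return data
```

In practice this pattern sits behind every tool call: downstream code never sees malformed output, only a valid action or an explicit clarification path.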
Our Stack
What We Offer