2021 Mercury Fund. All Rights Reserved.
Website by Darien Group

Senior AI Engineer (LLMs & Agents)

Cooklist

Software Engineering, Data Science
Austin, TX, USA
Posted on Aug 21, 2025

About Cooklist

Cooklist is the AI grocery‑intelligence platform powering meal planning and shopping for millions of consumers across our consumer app and white‑label enterprise suite. Our mission is to combine the intelligence of a personal shopper, chef, and nutritionist to help people save time, eat better, and enjoy happier lives.

We’re profitable, process billions of dollars in transactions at the nation’s largest retailers, and our mobile experiences reach millions of people. We’re backed by Techstars, Mercury Fund, and industry leaders including the former CTO of Kroger and the Chief Product Officer of Amazon Fresh.

Role Overview

We’re hiring a Senior AI Engineer (LLMs & Agents) to design and operationalize the intelligence behind Cooklist’s AI grocery shopping assistant, which is embedded at top US retailers and in the Cooklist app. Your primary job is to own the evals, reliability, and workflow architecture that make an agentic system trustworthy at scale.

You’ll design prompts, tool‑calling strategies, retrieval pipelines, and safety guardrails; build the eval harness and real‑time monitoring systems; and drive model/latency tradeoffs that turn demos into production‑grade performance. This is a founding‑level role on a tiny team where your decisions directly impact millions of shoppers and where accuracy around allergens and nutrition is critical.

We are an AI‑leveraged org: our question is always “How do we use AI to build AI?” You’ll set patterns, tests, and guardrails that allow AI assistants to safely contribute to the codebase, compounding our output.

Responsibilities

  • Own LLM reliability end‑to‑end: architect prompts, tools, and reasoning workflows that meet strict accuracy, safety, and latency requirements.

  • Design robust evals: build offline/online eval suites for structured output, factuality, grounding, allergen sensitivity, and user‑goal attainment; define gold sets, synthetic data pipelines, and automatic failure taxonomies.

  • Productionize agent workflows: retrieval‑augmented generation, tool calling over GraphQL/WebSockets, function/tool schemas, and strict JSON output contracts.

  • Model strategy: evaluate and deploy model mixes (reasoning vs. fast paths), caching strategies, and guardrails to balance quality, latency, and cost.

  • Monitoring & observability: ship real‑time conversation analytics, drift detection, canary/shadow testing, incident taxonomies, and auto‑triage for misbehavior.

  • Safety & compliance: encode domain policies for allergens, dietary restrictions, and nutrition; implement red‑team tests, constraints, and fallbacks.

  • Tight product loop: partner with mobile/backend to ship agent features, collect outcome‑level telemetry, and iterate quickly (“build first, refine fast”).

  • Scale the system: create libraries, prompts, schemas, and tests that let AI coding assistants contribute safely; document playbooks and upgrade paths.

Qualifications

  • You’ve shipped LLM systems to production with real user impact, ideally including agentic loops, tool calling, and structured outputs at scale.

  • You’re fluent in Python and have built eval harnesses, automated datasets, and dashboards for LLM quality.

  • You’ve implemented RAG (indexing, chunking, embeddings, reranking) and understand failure modes (hallucination, grounding, duplication, drift).

  • You can design and enforce strict schemas, guarantee parseability, and create deterministic fallbacks.

  • You’re comfortable making model tradeoffs (reasoning models vs. smaller/cheaper paths; latency budgets; cost controls) and can prove the impact.

  • You care about safety (allergens, dietary needs, policy adherence) and can translate product risk into tests, gates, and roll‑out controls.

  • You move with founder energy: high ownership, high bar for polish, gritty, and calm under production pressure.

Our Stack

  • Language: Python/Django backend; JavaScript/React Native frontend
  • APIs/Data: GraphQL; real‑time streaming over WebSockets
  • Mobile: React Native (close collaboration with the mobile team)
  • LLM engineering: internally built prompt/tool libraries, RAG pipelines & eval system

What We Offer

  • Competitive compensation + meaningful equity
  • Based in Austin, TX, with WFH flexibility
  • Work directly with founders and an elite, tight‑knit team
  • Ship experiences that materially improve the lives of millions
  • A high‑intensity, high‑ownership environment designed for builders