Listing Description Generator: AI Content That Sellers Actually Keep
Marketplace Application
Empty or thin listing descriptions are one of the biggest conversion killers on any marketplace. Sellers skip writing them because it takes time. This system generates complete, market-appropriate descriptions automatically from structured listing data — and crucially, it measures whether sellers actually keep the output or rewrite it, so you can continuously improve quality. Applicable to any marketplace where listing quality drives buyer engagement: automotive, property, e-commerce, rental, or B2B.
Project Summary
Domain: Online Marketplace / Content Automation
Role: ML Engineer
Scope: 7 locales, production API + evaluation dashboard
The Problem
Creating compelling listing descriptions is time-consuming and a major friction point for sellers on marketplace platforms. Many listings go live with minimal or no descriptions, reducing buyer engagement and overall listing quality. The platform needed a way to automatically generate high-quality, structured descriptions that sellers would actually keep and use.
The goal: Build an AI system that generates listing descriptions so good that sellers adopt them with minimal editing, and build the evaluation infrastructure to prove it.
System Architecture
(Architecture diagram: listing attributes fetched via GraphQL API and locale field filter → YAML-config prompt assembly with trace logging → Bedrock invoke_model() → schema-validated description; generation events logged to Datadog → matcher → scoring (semantic score, Claude Sonnet judge, 0–100) → evaluation dashboard; deployed via container registry, DNS/publishing, and access control.)
My Approach
Stakeholder Workshops & Locale-Aware Prompt Engineering
Before writing a single prompt, I ran workshops with internal stakeholders — marketplace operations, seller success, and locale market managers — to understand what "good" looked like in each market. These sessions surfaced requirements that no spec document would have captured: tone expectations, preferred phrasing, and the vocabulary sellers in each country actually use when writing descriptions themselves.
This translated directly into prompt engineering decisions:
- Locale-specific lexicon: Each locale config encodes preferred phrasing and restricted words, so the model writes the way local sellers and buyers speak — not generic AI-English
- Accent and dialect calibration: For markets with regional language variants, the prompt steers toward the accepted standard rather than a literal translation
- YAML-based hierarchical configuration: System prompt defines role and output schema; locale configs layer on tone, field availability, and vocabulary rules; field configs exclude attributes not tracked in that market (e.g., accident history absent in some locales) to prevent hallucination
This architecture makes onboarding a new locale a configuration change, not a code change; the sketch below illustrates how the layers compose.
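A minimal sketch of how the layering could compose at request time; file paths, keys, and the merge logic are simplified assumptions, not the production code:

```python
# Hypothetical sketch of hierarchical prompt-config assembly; paths and keys are illustrative.
from pathlib import Path
import yaml

def load_layer(path: str) -> dict:
    return yaml.safe_load(Path(path).read_text())

def build_prompt_config(locale: str) -> dict:
    config = load_layer("config/system.yaml")                    # role definition + output schema, shared by all markets
    config.update(load_layer(f"config/locales/{locale}.yaml"))   # tone, preferred phrasing, restricted words
    field_rules = load_layer(f"config/fields/{locale}.yaml")     # which listing attributes exist in this market
    excluded = set(field_rules.get("exclude", []))                # e.g. accident_history where it is not tracked
    config["fields"] = [f for f in config["fields"] if f not in excluded]
    return config
```

Adding an eighth locale would then mean dropping two new YAML files into the config tree rather than touching the generation code.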
AWS Bedrock with Claude Haiku 4.5 — Cost-Optimised at Scale
The generation API calls Claude Haiku 4.5 via AWS Bedrock with two key optimisations:
- Adaptive retries (5 attempts): the API retries with adjusted parameters on malformed or low-quality outputs before failing, improving reliability without manual intervention
- System prompt caching: the system prompt — the largest and most stable part of each request — is cached at the Bedrock layer, reducing the cost of cached tokens by 90% at production call volumes
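A minimal sketch of this call pattern, assuming the Bedrock Converse API with a cache checkpoint after the system prompt; the model ID, retry policy, and validation helper are placeholders rather than the production values:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

MODEL_ID = "anthropic.claude-haiku-example"  # placeholder, not the production model ID

def is_valid_description(text: str) -> bool:
    # Placeholder quality gate; production validates against the output schema
    return len(text.split()) > 30

def generate_description(system_prompt: str, listing_payload: str, max_attempts: int = 5) -> str:
    temperature = 0.3
    for attempt in range(max_attempts):
        response = bedrock.converse(
            modelId=MODEL_ID,
            # The large, stable system prompt is followed by a cache checkpoint so Bedrock
            # reuses it across requests instead of re-billing the full token count.
            system=[{"text": system_prompt}, {"cachePoint": {"type": "default"}}],
            messages=[{"role": "user", "content": [{"text": listing_payload}]}],
            inferenceConfig={"temperature": temperature, "maxTokens": 1024},
        )
        text = response["output"]["message"]["content"][0]["text"]
        if is_valid_description(text):
            return text
        temperature = min(temperature + 0.2, 1.0)  # adjust parameters before retrying
    raise RuntimeError("description generation failed after retries")
```

Everything before the cache checkpoint is reused across requests, which is what drives the saving on cached input tokens.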
Evaluation System with DeepEval — Three Scoring Methods
Generating descriptions is straightforward. Proving they work in production required a separate engineering effort: an evaluation pipeline built on DeepEval, pulling real logs from Datadog and scoring AI output against what sellers actually published.
The evaluation logs are extracted from Datadog (generation events + final seller descriptions), matched by listing ID, then scored three ways:
Method 1 — Word overlap (fast, structural): Computes overlapping words, word overlap %, and AI words retained %. Crude but fast: it gives an immediate signal on how much of the AI text survived seller editing.
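A hedged sketch of these overlap metrics, assuming simple whitespace tokenisation; the exact tokenisation and denominators in production may differ:

```python
def word_overlap_stats(ai_text: str, seller_text: str) -> dict:
    ai_words = set(ai_text.lower().split())
    seller_words = set(seller_text.lower().split())
    overlap = ai_words & seller_words
    return {
        "overlapping_words": len(overlap),
        # Share of the seller's final text that also appears in the AI draft
        "word_overlap_pct": 100 * len(overlap) / max(len(seller_words), 1),
        # Share of the AI draft that survived into the seller's final text
        "ai_words_retained_pct": 100 * len(overlap) / max(len(ai_words), 1),
    }
```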
Method 2 — Cosine similarity (semantic): Catches cases where sellers paraphrase rather than copy verbatim — high meaning adoption that word overlap would miss.
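A sketch of the semantic scoring, assuming a multilingual sentence-transformers embedding model (the production embedding choice is not specified here):

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual model assumed here because the system spans 7 locales
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_similarity(ai_text: str, seller_text: str) -> float:
    ai_vec, seller_vec = embedder.encode([ai_text, seller_text], convert_to_tensor=True)
    # Stays high when the seller paraphrases the AI draft rather than copying it verbatim
    return util.cos_sim(ai_vec, seller_vec).item()
```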
Method 3 — LLM as judge (DeepEval): Claude Sonnet evaluates each AI/seller description pair and returns a structured score: adoption_score (0–100), adoption_category, and reasoning. Categories: adopted | partially_adopted | replaced | non_relevant. This catches nuanced rewrites that the other two methods miss.
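A hedged sketch of how such a judge could be wired up with DeepEval's GEval metric; the criteria text and sample strings are illustrative, and plugging in Claude Sonnet as the judge goes through DeepEval's custom-model interface, which is omitted here:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Default judge shown for brevity; production uses Claude Sonnet via a custom model wrapper.
adoption_metric = GEval(
    name="Seller Adoption",
    criteria=(
        "Compare the AI-generated draft (input) with the description the seller published "
        "(actual output) and judge how much of the AI content the seller adopted."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

ai_description = "Well-maintained family car with full service history and two owners."
seller_description = "Family car, two owners, full service history, very well maintained."

case = LLMTestCase(input=ai_description, actual_output=seller_description)
adoption_metric.measure(case)
# GEval yields a 0-1 score plus reasoning; the production pipeline maps this onto the
# 0-100 adoption_score and derives adoption_category (adopted / partially_adopted / ...)
print(adoption_metric.score, adoption_metric.reason)
```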
Interactive Evaluation Dashboard — Production Feedback Loop
Built a Streamlit dashboard as a feedback channel from the production environment back to the development team:
- Surfaces failing cases — descriptions where sellers replaced or discarded the AI output — so prompt issues can be investigated with real examples
- Per-locale and per-user-type (professional dealers vs. private sellers) breakdowns expose where the prompt underperforms
- Automated recommendation engine analyses patterns (what sellers keep, remove, add) and proposes targeted prompt changes
- Monitoring successful and failing cases alike is what makes it possible to measure an AI product's real success metrics: not just output quality in isolation, but actual adoption in the field
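A rough sketch of the dashboard's failing-case view, assuming a flat table of scored results (the column names and results file are hypothetical):

```python
import pandas as pd
import streamlit as st

# Assumed schema: one row per listing with scores and metadata from the evaluation pipeline
df = pd.read_csv("evaluation_results.csv")

locale = st.sidebar.selectbox("Locale", sorted(df["locale"].unique()))
user_type = st.sidebar.selectbox("User type", ["professional", "private"])

view = df[(df["locale"] == locale) & (df["user_type"] == user_type)]
st.metric("Adoption rate", f"{(view['adoption_category'] == 'adopted').mean():.0%}")

# Failing cases first: descriptions the sellers replaced or discarded
failing = view[view["adoption_category"].isin(["replaced", "non_relevant"])]
st.subheader(f"Failing cases ({len(failing)})")
st.dataframe(failing[["listing_id", "adoption_score", "ai_description", "seller_description"]])
```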
Production System
Generation API (FastAPI):
- Accepts vehicle ID, fetches attributes from the listing service via GraphQL
- Applies locale-specific field filtering and YAML-driven prompt assembly
- Calls Claude Haiku 4.5 via AWS Bedrock with adaptive retry logic (5 attempts) and system prompt caching
- Returns description + token usage metrics
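A hedged sketch of the endpoint shape; the route, request model, and helper functions are illustrative stand-ins for the production code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    vehicle_id: str
    locale: str

class GenerateResponse(BaseModel):
    description: str
    input_tokens: int
    output_tokens: int

@app.post("/descriptions", response_model=GenerateResponse)
def generate(req: GenerateRequest) -> GenerateResponse:
    attributes = fetch_listing_attributes(req.vehicle_id)    # hypothetical GraphQL call to the listing service
    config = build_prompt_config(req.locale)                 # YAML-driven prompt assembly (see earlier sketch)
    prompt = assemble_prompt(attributes, config)             # hypothetical helper; drops fields not tracked in this locale
    text, usage = call_bedrock_with_retries(prompt, config)  # Claude Haiku via Bedrock, adaptive retries
    return GenerateResponse(
        description=text,
        input_tokens=usage["input"],
        output_tokens=usage["output"],
    )
```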
Evaluation Pipeline (DeepEval + Streamlit + Datadog):
- Fetches generation events and final seller descriptions from Datadog
- Scores pairs using word overlap, cosine similarity, and LLM-as-judge (DeepEval)
- Surfaces failing cases in the dashboard for prompt iteration
- Generates actionable recommendations for prompt tuning per locale
Tech Stack
Python · FastAPI · Streamlit · Claude Haiku 4.5 · Claude Sonnet · AWS Bedrock · DeepEval · Datadog · Plotly · Pydantic · YAML Config · Docker
Key Takeaways
- Stakeholder workshops before prompts: understanding locale-specific vocabulary and tone expectations shaped every prompt decision — no spec document would have surfaced this
- Prompt engineering for locale accents and lexicon is a distinct discipline from general prompt engineering — what reads as natural in one market sounds foreign in another
- System prompt caching turned AWS Bedrock costs from a scale concern into a non-issue — 90% reduction on cached token costs
- Evaluation is harder than generation: building the proof that AI content gets adopted required more engineering than generating the content itself
- Three scoring methods (word overlap, cosine similarity, LLM-as-judge) catch different adoption patterns — no single method is sufficient
- A production→dev feedback loop via the evaluation dashboard is what turns a one-time launch into a continuously improving product