Listing Description Generator: AI Content That Sellers Actually Keep
Marketplace Application
Empty or thin listing descriptions are one of the biggest conversion killers on any marketplace. Sellers skip writing them because it takes time. This system generates complete, market-appropriate descriptions automatically from structured listing data — and crucially, it measures whether sellers actually keep the output or rewrite it, so you can continuously improve quality. Applicable to any marketplace where listing quality drives buyer engagement: automotive, property, e-commerce, rental, or B2B.
Project Summary
Domain: Online Marketplace / Content Automation
Role: ML Engineer
Scope: 7 locales, production API + evaluation dashboard
The Problem
Creating compelling listing descriptions is time-consuming and a major friction point for sellers on marketplace platforms. Many listings go live with minimal or no descriptions, reducing buyer engagement and overall listing quality. The platform needed a way to automatically generate high-quality, structured descriptions that sellers would actually keep and use.
The goal: Build an AI system that generates listing descriptions so good that sellers adopt them with minimal editing, and build the evaluation infrastructure to prove it.
System Architecture
(Architecture diagram: listing attributes fetched via GraphQL API and locale field filter → YAML-config prompt assembly with trace logging → Bedrock invoke_model() → schema-validated description; generation events logged to Datadog → matcher → scoring (semantic score, Claude Sonnet judge, 0–100) → evaluation dashboard; deployed via container registry, DNS/publishing, and access control.)
My Approach
Stakeholder Workshops & Locale-Aware Prompt Engineering
Before writing a single prompt, I ran workshops with internal stakeholders — marketplace operations, seller success, and locale market managers — to understand what "good" looked like in each market. These sessions surfaced requirements that no spec document would have captured: tone expectations, preferred phrasing, and the vocabulary sellers in each country actually use when writing descriptions themselves.
This translated directly into prompt engineering decisions:
- Locale-specific lexicon: Each locale config encodes preferred phrasing and restricted words, so the model writes the way local sellers and buyers speak — not generic AI-English
- Accent and dialect calibration: For markets with regional language variants, the prompt steers toward the accepted standard rather than a literal translation
- YAML-based hierarchical configuration: System prompt defines role and output schema; locale configs layer on tone, field availability, and vocabulary rules; field configs exclude attributes not tracked in that market (e.g., accident history absent in some locales) to prevent hallucination
This architecture makes onboarding a new locale a configuration change, not a code change; the sketch below illustrates how the layers compose.
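A minimal sketch of how the layering could compose at request time; file paths, keys, and the merge logic are simplified assumptions, not the production code:

```python
# Hypothetical sketch of hierarchical prompt-config assembly; paths and keys are illustrative.
from pathlib import Path
import yaml

def load_layer(path: str) -> dict:
    return yaml.safe_load(Path(path).read_text())

def build_prompt_config(locale: str) -> dict:
    config = load_layer("config/system.yaml")                    # role definition + output schema, shared by all markets
    config.update(load_layer(f"config/locales/{locale}.yaml"))   # tone, preferred phrasing, restricted words
    field_rules = load_layer(f"config/fields/{locale}.yaml")     # which listing attributes exist in this market
    excluded = set(field_rules.get("exclude", []))                # e.g. accident_history where it is not tracked
    config["fields"] = [f for f in config["fields"] if f not in excluded]
    return config
```

Adding an eighth locale would then mean dropping two new YAML files into the config tree rather than touching the generation code.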
AWS Bedrock with Claude Haiku 4.5 — Cost-Optimised at Scale
The generation API calls Claude Haiku 4.5 via AWS Bedrock with two key optimisations:
- Adaptive retries (5 attempts): the API retries with adjusted parameters on malformed or low-quality outputs before failing, improving reliability without manual intervention
- System prompt caching: the system prompt — the largest and most stable part of each request — is cached at the Bedrock layer, reducing the cost of cached tokens by 90% at production call volumes
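A minimal sketch of this call pattern, assuming the Bedrock Converse API with a cache checkpoint after the system prompt; the model ID, retry policy, and validation helper are placeholders rather than the production values:

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

MODEL_ID = "anthropic.claude-haiku-example"  # placeholder, not the production model ID

def is_valid_description(text: str) -> bool:
    # Placeholder quality gate; production validates against the output schema
    return len(text.split()) > 30

def generate_description(system_prompt: str, listing_payload: str, max_attempts: int = 5) -> str:
    temperature = 0.3
    for attempt in range(max_attempts):
        response = bedrock.converse(
            modelId=MODEL_ID,
            # The large, stable system prompt is followed by a cache checkpoint so Bedrock
            # reuses it across requests instead of re-billing the full token count.
            system=[{"text": system_prompt}, {"cachePoint": {"type": "default"}}],
            messages=[{"role": "user", "content": [{"text": listing_payload}]}],
            inferenceConfig={"temperature": temperature, "maxTokens": 1024},
        )
        text = response["output"]["message"]["content"][0]["text"]
        if is_valid_description(text):
            return text
        temperature = min(temperature + 0.2, 1.0)  # adjust parameters before retrying
    raise RuntimeError("description generation failed after retries")
```

Everything before the cache checkpoint is reused across requests, which is what drives the saving on cached input tokens.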
Evaluation System with DeepEval — Three Scoring Methods
Generating descriptions is straightforward. Proving they work in production required a separate engineering effort: an evaluation pipeline built on DeepEval, pulling real logs from Datadog and scoring AI output against what sellers actually published.
The evaluation logs are extracted from Datadog (generation events + final seller descriptions), matched by listing ID, then scored three ways:
Method 1 — Word overlap (fast, structural): Computes overlapping words, word overlap %, and AI words retained %. Crude but fast: it gives an immediate signal on how much of the AI text survived seller editing.
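A hedged sketch of these overlap metrics, assuming simple whitespace tokenisation; the exact tokenisation and denominators in production may differ:

```python
def word_overlap_stats(ai_text: str, seller_text: str) -> dict:
    ai_words = set(ai_text.lower().split())
    seller_words = set(seller_text.lower().split())
    overlap = ai_words & seller_words
    return {
        "overlapping_words": len(overlap),
        # Share of the seller's final text that also appears in the AI draft
        "word_overlap_pct": 100 * len(overlap) / max(len(seller_words), 1),
        # Share of the AI draft that survived into the seller's final text
        "ai_words_retained_pct": 100 * len(overlap) / max(len(ai_words), 1),
    }
```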
Method 2 — Cosine similarity (semantic): Catches cases where sellers paraphrase rather than copy verbatim — high meaning adoption that word overlap would miss.
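A sketch of the semantic scoring, assuming a multilingual sentence-transformers embedding model (the production embedding choice is not specified here):

```python
from sentence_transformers import SentenceTransformer, util

# Multilingual model assumed here because the system spans 7 locales
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_similarity(ai_text: str, seller_text: str) -> float:
    ai_vec, seller_vec = embedder.encode([ai_text, seller_text], convert_to_tensor=True)
    # Stays high when the seller paraphrases the AI draft rather than copying it verbatim
    return util.cos_sim(ai_vec, seller_vec).item()
```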
Method 3 — LLM as judge (DeepEval): Claude Sonnet evaluates each AI/seller description pair and returns a structured score: adoption_score (0–100), adoption_category, and reasoning. Categories: adopted | partially_adopted | replaced | non_relevant. This catches nuanced rewrites that the other two methods miss.
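A hedged sketch of how such a judge could be wired up with DeepEval's GEval metric; the criteria text and sample strings are illustrative, and plugging in Claude Sonnet as the judge goes through DeepEval's custom-model interface, which is omitted here:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Default judge shown for brevity; production uses Claude Sonnet via a custom model wrapper.
adoption_metric = GEval(
    name="Seller Adoption",
    criteria=(
        "Compare the AI-generated draft (input) with the description the seller published "
        "(actual output) and judge how much of the AI content the seller adopted."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

ai_description = "Well-maintained family car with full service history and two owners."
seller_description = "Family car, two owners, full service history, very well maintained."

case = LLMTestCase(input=ai_description, actual_output=seller_description)
adoption_metric.measure(case)
# GEval yields a 0-1 score plus reasoning; the production pipeline maps this onto the
# 0-100 adoption_score and derives adoption_category (adopted / partially_adopted / ...)
print(adoption_metric.score, adoption_metric.reason)
```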
Interactive Evaluation Dashboard — Production Feedback Loop
Built a Streamlit dashboard as a feedback channel from the production environment back to the development team:
- Surfaces failing cases — descriptions where sellers replaced or discarded the AI output — so prompt issues can be investigated with real examples
- Per-locale and per-user-type (professional dealers vs. private sellers) breakdowns expose where the prompt underperforms
- Automated recommendation engine analyses patterns (what sellers keep, remove, add) and proposes targeted prompt changes
- Monitoring successful and failing cases alike is what makes it possible to measure an AI product's real success metrics: not just output quality in isolation, but actual adoption in the field
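A rough sketch of the dashboard's failing-case view, assuming a flat table of scored results (the column names and results file are hypothetical):

```python
import pandas as pd
import streamlit as st

# Assumed schema: one row per listing with scores and metadata from the evaluation pipeline
df = pd.read_csv("evaluation_results.csv")

locale = st.sidebar.selectbox("Locale", sorted(df["locale"].unique()))
user_type = st.sidebar.selectbox("User type", ["professional", "private"])

view = df[(df["locale"] == locale) & (df["user_type"] == user_type)]
st.metric("Adoption rate", f"{(view['adoption_category'] == 'adopted').mean():.0%}")

# Failing cases first: descriptions the sellers replaced or discarded
failing = view[view["adoption_category"].isin(["replaced", "non_relevant"])]
st.subheader(f"Failing cases ({len(failing)})")
st.dataframe(failing[["listing_id", "adoption_score", "ai_description", "seller_description"]])
```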
Production System
Generation API (FastAPI):
- Accepts vehicle ID, fetches attributes from the listing service via GraphQL
- Applies locale-specific field filtering and YAML-driven prompt assembly
- Calls Claude Haiku 4.5 via AWS Bedrock with adaptive retry logic (5 attempts) and system prompt caching
- Returns description + token usage metrics
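A hedged sketch of the endpoint shape; the route, request model, and helper functions are illustrative stand-ins for the production code:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    vehicle_id: str
    locale: str

class GenerateResponse(BaseModel):
    description: str
    input_tokens: int
    output_tokens: int

@app.post("/descriptions", response_model=GenerateResponse)
def generate(req: GenerateRequest) -> GenerateResponse:
    attributes = fetch_listing_attributes(req.vehicle_id)    # hypothetical GraphQL call to the listing service
    config = build_prompt_config(req.locale)                 # YAML-driven prompt assembly (see earlier sketch)
    prompt = assemble_prompt(attributes, config)             # hypothetical helper; drops fields not tracked in this locale
    text, usage = call_bedrock_with_retries(prompt, config)  # Claude Haiku via Bedrock, adaptive retries
    return GenerateResponse(
        description=text,
        input_tokens=usage["input"],
        output_tokens=usage["output"],
    )
```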
Evaluation Pipeline (DeepEval + Streamlit + Datadog):
- Fetches generation events and final seller descriptions from Datadog
- Scores pairs using word overlap, cosine similarity, and LLM-as-judge (DeepEval)
- Surfaces failing cases in the dashboard for prompt iteration
- Generates actionable recommendations for prompt tuning per locale
Tech Stack
Python · FastAPI · Streamlit · Claude Haiku 4.5 · Claude Sonnet · AWS Bedrock · DeepEval · Datadog · Plotly · Pydantic · YAML Config · Docker
Key Takeaways
- Stakeholder workshops before prompts: understanding locale-specific vocabulary and tone expectations shaped every prompt decision — no spec document would have surfaced this
- Prompt engineering for locale accents and lexicon is a distinct discipline from general prompt engineering — what reads as natural in one market sounds foreign in another
- System prompt caching turned AWS Bedrock costs from a scale concern into a non-issue — 90% reduction on cached token costs
- Evaluation is harder than generation: building the proof that AI content gets adopted required more engineering than generating the content itself
- Three scoring methods (word overlap, cosine similarity, LLM-as-judge) catch different adoption patterns — no single method is sufficient
- A production→dev feedback loop via the evaluation dashboard is what turns a one-time launch into a continuously improving product