Pricing Intelligence: Detecting Misleading Listing Prices Across 5 Markets
Marketplace Application
Online marketplaces routinely receive listings where the advertised price is misleading: tied to financing deals, leasing conditions, or VAT exclusions that most buyers don't qualify for. This system automatically detects those listings at scale, protecting buyer trust and flagging them for review before they go live. Applicable to any marketplace where price transparency matters: automotive, property, rental, B2B, or e-commerce.
Project Summary
Domain: Online Marketplace / Pricing Transparency
Role: ML Engineer (sole data scientist)
Scope: 5 markets, 6 languages, production API + batch pipeline
Key Result: In-house model is 20x faster and >2x cheaper than Claude while maintaining comparable detection performance
The Problem
In online marketplaces, some listings advertise prices that come with conditions - financing requirements, leasing terms, special buyer restrictions, or VAT exclusions. These conditional prices are misleading for standard buyers who expect transparent pricing. Manually reviewing thousands of listings across multiple markets and languages was unsustainable.
The goal: Build an automated system that flags conditional pricing across 5 international markets with high accuracy and low latency.
Model Architecture
*[Architecture diagram]* A keyword-extraction step pulls pricing-relevant sentences from each listing and feeds one of two fine-tuned encoders: a 12-layer, 86M-parameter multilingual encoder (used for DE) or a 24-layer, 304M-parameter encoder (used for IT, AT, CA, BE). Both produce two outputs: a binary Conditional / Non-Conditional decision and the condition categories (Financing, Incentives, Leasing, Special Buyers, Others).
My Approach
Dual-Head Transformer Encoder Fine-Tuning
Rather than fine-tuning two separate models, I designed a single dual-head classifier that learns both tasks simultaneously in one forward pass:
- Binary head: Conditional vs. Non-Conditional, detecting whether the advertised price is achievable by a standard buyer
- Multi-label head: 7 condition categories (Financing, Leasing, Incentives, Special Buyers, VAT Excluded, Other, OK)
Joint fine-tuning improved accuracy on both tasks: the binary signal sharpens category boundaries, and the category signal anchors the binary decision. One model, two outputs, better performance than either standalone.
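A minimal sketch of the dual-head setup in PyTorch, assuming standard HuggingFace components (class names, pooling choice, and loss weighting here are illustrative, not the production code):

```python
import torch.nn as nn
from transformers import AutoModel

class DualHeadClassifier(nn.Module):
    """Shared encoder with a binary head and a multi-label category head."""

    def __init__(self, encoder_name: str, num_categories: int = 7):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.binary_head = nn.Linear(hidden, 1)                  # Conditional vs. Non-Conditional
        self.category_head = nn.Linear(hidden, num_categories)   # multi-label categories

    def forward(self, input_ids, attention_mask,
                binary_labels=None, category_labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS]-token pooling
        binary_logits = self.binary_head(pooled).squeeze(-1)
        category_logits = self.category_head(pooled)
        loss = None
        if binary_labels is not None and category_labels is not None:
            # Joint objective: both heads trained in one forward/backward pass.
            bce = nn.functional.binary_cross_entropy_with_logits
            loss = bce(binary_logits, binary_labels.float()) + \
                   bce(category_logits, category_labels.float())
        return {"loss": loss,
                "binary_logits": binary_logits,
                "category_logits": category_logits}
```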
Language-Aware Model Selection
Rather than forcing a single model across all markets, I selected architecturally appropriate models:
| Model | Markets | Rationale |
|---|---|---|
| DeBERTa V3 Large | IT, AT, CA, BE | Superior performance on English-centric and Romance language text |
| mDeBERTa V3 Base | DE | Better multilingual representations for German compound words and syntax |
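In code, the per-market selection reduces to a simple routing table. A sketch using the public HuggingFace checkpoints (the mapping mirrors the table above; the function name is hypothetical):

```python
# Market-to-checkpoint routing for model loading at inference time.
MODEL_BY_MARKET = {
    "DE": "microsoft/mdeberta-v3-base",   # multilingual, 86M backbone params
    "IT": "microsoft/deberta-v3-large",   # 304M params
    "AT": "microsoft/deberta-v3-large",
    "CA": "microsoft/deberta-v3-large",
    "BE": "microsoft/deberta-v3-large",
}

def model_for(country_code: str) -> str:
    return MODEL_BY_MARKET[country_code.upper()]
```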
Feature Engineering
- Country-specific keyword extraction: 70–71 priority stems per market to extract pricing-relevant sentences before feeding into the model (see the sketch below)
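A hypothetical sketch of this extraction step: keep only sentences containing a priority stem, so the encoder sees pricing-relevant text rather than the full listing. The stems shown are examples, not the production lists of ~70 per market:

```python
import re

# Example stems only; the real system uses 70-71 curated stems per market.
PRIORITY_STEMS = {
    "DE": ["finanzier", "leasing", "anzahlung", "mwst"],
    "IT": ["finanziament", "leasing", "anticipo", "iva"],
}

def extract_pricing_sentences(text: str, country: str) -> str:
    # Naive sentence split, then keep sentences matching any priority stem.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    stems = PRIORITY_STEMS[country]
    hits = [s for s in sentences if any(stem in s.lower() for stem in stems)]
    return " ".join(hits)
```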
Hybrid Inference: Transformers + LLM Fallback
For predictions where model confidence falls below 0.8, the system routes to Claude via AWS Bedrock for validation.
*[Flow diagram]* Transformer inference (sub-second) → confidence check → Claude fallback for low-confidence cases (+2–5 s).
This hybrid approach optimises for speed on high-confidence predictions while using LLM reasoning only for genuinely ambiguous edge cases.
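A minimal sketch of the routing rule, with the two predictors injected as callables (names and return shapes are assumptions, not the production interfaces):

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.8  # routing rule from the text

def classify(text: str,
             transformer_predict: Callable[[str], dict],
             claude_validate: Callable[[str, dict], dict]) -> dict:
    # Fast path: sub-second transformer prediction.
    pred = transformer_predict(text)
    if pred["confidence"] >= CONFIDENCE_THRESHOLD:
        return {**pred, "source": "transformer"}
    # Slow path: escalate ambiguous cases to Claude (+2-5 s of latency).
    return {**claude_validate(text, pred), "source": "claude_fallback"}
```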
Automated Evaluation Pipeline
I built a stratified validation system using Claude as ground truth:
- Sample 250 listings per country (stratified by predicted class)
- Generate Claude labels with structured reasoning
- Evaluate transformer predictions against Claude ground truth
- Track per-country precision/recall over time
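A hedged sketch of the per-country evaluation step, assuming a DataFrame with `predicted_class` and a Claude-generated `claude_label` column (both column names are assumptions; the Claude labelling prompt is not shown):

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def evaluate_country(df: pd.DataFrame, n: int = 250, seed: int = 42) -> dict:
    # Stratified sample: roughly equal representation of each predicted class.
    per_class = n // df["predicted_class"].nunique()
    sample = (df.groupby("predicted_class", group_keys=False)
                .apply(lambda g: g.sample(min(len(g), per_class), random_state=seed)))
    # Score transformer predictions against Claude ground-truth labels.
    y_true = sample["claude_label"]
    y_pred = sample["predicted_class"]
    return {
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }
```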
Production System
Deployment Pipeline
*[Deployment diagram]* The FastAPI classifier service is containerised with Docker; images are built and pushed to a container registry and pulled on deploy. A stack launch brings up an EC2 GPU group behind a load balancer, exposed via a DNS endpoint.
Real-Time API (FastAPI):
- Accepts listing description, price, and country code
- Returns classification with confidence score in <2s
- Falls back to Claude for uncertain predictions
- Includes business rule overrides (e.g., known leasing providers auto-flagged)
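An illustrative shape of the endpoint, showing where the business-rule override short-circuits model inference (route, field, helper, and provider names are placeholders, not the production code):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ListingRequest(BaseModel):
    description: str
    price: float
    country_code: str

class ClassificationResponse(BaseModel):
    conditional: bool
    categories: list[str]
    confidence: float
    source: str  # "business_rule", "transformer", or "claude_fallback"

# Business-rule override: known leasing providers are auto-flagged.
KNOWN_LEASING_PROVIDERS = frozenset({"example-leasing-provider"})  # placeholder

def route_prediction(req: ListingRequest) -> dict:
    # Stub for the transformer-first / Claude-fallback routing described above.
    return {"conditional": False, "categories": [],
            "confidence": 0.99, "source": "transformer"}

@app.post("/classify", response_model=ClassificationResponse)
def classify_listing(req: ListingRequest) -> ClassificationResponse:
    if any(p in req.description.lower() for p in KNOWN_LEASING_PROVIDERS):
        return ClassificationResponse(conditional=True, categories=["Leasing"],
                                      confidence=1.0, source="business_rule")
    return ClassificationResponse(**route_prediction(req))
```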
Monitoring: Full Datadog integration tracking prediction distributions, Claude API costs, conditional listing rates per market, and model latency.
Fine-Tuning Under Constraints
Fine-tuning a 304M parameter transformer encoder on a single 24 GB GPU required engineering around every memory bottleneck. Real constraints from the fine-tuning run:
| Constraint | Problem | Solution |
|---|---|---|
| 24 GB VRAM | Model + activations exceeded memory | Gradient checkpointing: recompute activations on the backward pass instead of storing them |
| Max batch size 8 | Too noisy for stable convergence | Gradient accumulation over 4 steps: effective batch of 32 without extra memory |
| Long input sequences | Full 512-token inputs too slow | Input truncation to 384 tokens: cut training time significantly with negligible accuracy loss |
| FP32 precision | Doubled memory for weights/activations | FP16 mixed precision training throughout |
| DeBERTa-v3-large size | 304M params barely fit | All four techniques combined to make fine-tuning feasible on a single GPU |
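The four techniques map directly onto HuggingFace `TrainingArguments`. A sketch mirroring the table (epoch count and learning rate are illustrative, not the exact production config):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./deberta-v3-large-pricing",  # hypothetical path
    per_device_train_batch_size=8,    # max that fits in 24 GB VRAM
    gradient_accumulation_steps=4,    # effective batch of 32
    gradient_checkpointing=True,      # recompute activations on backward pass
    fp16=True,                        # mixed-precision training
    num_train_epochs=3,               # illustrative
    learning_rate=2e-5,               # illustrative
)
# Truncation is applied at tokenisation time:
# tokenizer(texts, truncation=True, max_length=384)
```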
The disentangled attention mechanism in DeBERTa encodes each token using separate content and position embeddings, which is what gives it strong contextual understanding relative to its parameter count.
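For reference, the disentangled attention score from the DeBERTa paper (He et al., 2021) decomposes into three terms; notation follows the paper, and this is general background rather than anything project-specific:

```latex
% Q^c, K^c: content projections; Q^r, K^r: relative-position projections;
% \delta(i,j): bucketed relative distance between tokens i and j.
A_{i,j} = \underbrace{Q_i^{c} {K_j^{c}}^{\top}}_{\text{content-to-content}}
        + \underbrace{Q_i^{c} {K_{\delta(i,j)}^{r}}^{\top}}_{\text{content-to-position}}
        + \underbrace{K_j^{c} {Q_{\delta(j,i)}^{r}}^{\top}}_{\text{position-to-content}}
```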
Tech Stack
Python PyTorch HuggingFace Transformers DeBERTa V3 mDeBERTa V3 FastAPI AWS Bedrock Claude AWS Athena AWS S3 AWS EC2 (GPU) Papermill Datadog Docker
Why In-House Over Claude?
The core business case: fine-tuning a transformer model instead of routing everything through Claude delivered dramatic cost and speed gains at no meaningful accuracy cost.
Detection Performance
| Method | Detection Rate | False Positive Rate |
|---|---|---|
| Claude | 99% | 1% |
| Transformer Model | 90–95% | 4–7% |
A small accuracy trade-off in exchange for 20x faster inference and a cost curve that stays flat as volume scales.
Inference Speed
| Method | Listings per Day | Inference Time |
|---|---|---|
| Claude | 60K | 2–3 hours |
| Transformer Model | 60K | ~10 minutes |
Cost at Scale
| Listings per Day | GPU Cost p.a. (Transformer) | Claude Cost p.a. |
|---|---|---|
| 20K | $14K | $10K |
| 40K | $14K | $23K |
| 60K | $14K | $42K |
| 80K | $14K | $56K |
| 100K | $14K | $70K |
The transformer's GPU cost is fixed at $14K/year regardless of volume, while Claude's cost scales roughly linearly with listings. Break-even therefore falls between 20K and 40K listings/day, and at 100K listings/day the in-house model is 5x cheaper.
Key Takeaways
- Right-sizing models matters: DeBERTa vs mDeBERTa selection per market improved accuracy without unnecessary compute
- Confidence-based routing between fast local models and LLMs is a production pattern I now use everywhere
- Country-specific feature engineering (keyword stems, text extraction) outperformed language-agnostic approaches
- Automated evaluation pipelines with LLM-as-judge provide scalable quality assurance across markets