Pricing Intelligence: Detecting Misleading Listing Prices Across 5 Markets
Marketplace Application
Online marketplaces routinely receive listings where the advertised price is misleading: tied to financing deals, leasing conditions, or VAT exclusions that most buyers don't qualify for. This system automatically detects those listings at scale, protecting buyer trust and flagging them for review before they go live. Applicable to any marketplace where price transparency matters: automotive, property, rental, B2B, or e-commerce.
Project Summary
Domain: Online Marketplace / Pricing Transparency
Role: ML Engineer (sole data scientist)
Scope: 5 markets, 6 languages, production API + batch pipeline
Key Result: In-house model is 20x faster and >2x cheaper than Claude while maintaining comparable detection performance
The Problem
In online marketplaces, some listings advertise prices that come with conditions - financing requirements, leasing terms, special buyer restrictions, or VAT exclusions. These conditional prices are misleading for standard buyers who expect transparent pricing. Manually reviewing thousands of listings across multiple markets and languages was unsustainable.
The goal: Build an automated system that flags conditional pricing across 5 international markets with high accuracy and low latency.
Model Architecture
*[Architecture diagram]* A keyword-extraction step pulls pricing-relevant sentences from each listing and feeds one of two fine-tuned encoders: a 12-layer, 86M-parameter multilingual encoder (used for DE) or a 24-layer, 304M-parameter encoder (used for IT, AT, CA, BE). Both produce two outputs: a binary Conditional / Non-Conditional decision and the condition categories (Financing, Incentives, Leasing, Special Buyers, Others).
My Approach
Dual-Head Transformer Encoder Fine-Tuning
Rather than fine-tuning two separate models, I designed a single dual-head classifier that learns both tasks simultaneously in one forward pass:
- Binary head: Conditional vs. Non-Conditional, detecting whether the advertised price is achievable by a standard buyer
- Multi-label head: 7 condition categories (Financing, Leasing, Incentives, Special Buyers, VAT Excluded, Other, OK)
Joint fine-tuning improved accuracy on both tasks: the binary signal sharpens category boundaries, and the category signal anchors the binary decision. One model, two outputs, better performance than either standalone.
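A minimal sketch of the dual-head setup in PyTorch, assuming standard HuggingFace components (class names, pooling choice, and loss weighting here are illustrative, not the production code):

```python
import torch.nn as nn
from transformers import AutoModel

class DualHeadClassifier(nn.Module):
    """Shared encoder with a binary head and a multi-label category head."""

    def __init__(self, encoder_name: str, num_categories: int = 7):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.binary_head = nn.Linear(hidden, 1)                  # Conditional vs. Non-Conditional
        self.category_head = nn.Linear(hidden, num_categories)   # multi-label categories

    def forward(self, input_ids, attention_mask,
                binary_labels=None, category_labels=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]  # [CLS]-token pooling
        binary_logits = self.binary_head(pooled).squeeze(-1)
        category_logits = self.category_head(pooled)
        loss = None
        if binary_labels is not None and category_labels is not None:
            # Joint objective: both heads trained in one forward/backward pass.
            bce = nn.functional.binary_cross_entropy_with_logits
            loss = bce(binary_logits, binary_labels.float()) + \
                   bce(category_logits, category_labels.float())
        return {"loss": loss,
                "binary_logits": binary_logits,
                "category_logits": category_logits}
```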
Language-Aware Model Selection
Rather than forcing a single model across all markets, I selected architecturally appropriate models:
| Model | Markets | Rationale |
|---|---|---|
| DeBERTa V3 Large | IT, AT, CA, BE | Superior performance on English-centric and Romance language text |
| mDeBERTa V3 Base | DE | Better multilingual representations for German compound words and syntax |
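In code, the per-market selection reduces to a simple routing table. A sketch using the public HuggingFace checkpoints (the mapping mirrors the table above; the function name is hypothetical):

```python
# Market-to-checkpoint routing for model loading at inference time.
MODEL_BY_MARKET = {
    "DE": "microsoft/mdeberta-v3-base",   # multilingual, 86M backbone params
    "IT": "microsoft/deberta-v3-large",   # 304M params
    "AT": "microsoft/deberta-v3-large",
    "CA": "microsoft/deberta-v3-large",
    "BE": "microsoft/deberta-v3-large",
}

def model_for(country_code: str) -> str:
    return MODEL_BY_MARKET[country_code.upper()]
```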
Feature Engineering
- Country-specific keyword extraction: 70–71 priority stems per market to extract pricing-relevant sentences before feeding into the model (see the sketch below)
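A hypothetical sketch of this extraction step: keep only sentences containing a priority stem, so the encoder sees pricing-relevant text rather than the full listing. The stems shown are examples, not the production lists of ~70 per market:

```python
import re

# Example stems only; the real system uses 70-71 curated stems per market.
PRIORITY_STEMS = {
    "DE": ["finanzier", "leasing", "anzahlung", "mwst"],
    "IT": ["finanziament", "leasing", "anticipo", "iva"],
}

def extract_pricing_sentences(text: str, country: str) -> str:
    # Naive sentence split, then keep sentences matching any priority stem.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    stems = PRIORITY_STEMS[country]
    hits = [s for s in sentences if any(stem in s.lower() for stem in stems)]
    return " ".join(hits)
```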
Hybrid Inference: Transformers + LLM Fallback
For predictions where model confidence falls below 0.8, the system routes to Claude via AWS Bedrock for validation.
*[Flow diagram]* Transformer inference (sub-second) → confidence check → Claude fallback for low-confidence cases (+2–5 s).
This hybrid approach optimises for speed on high-confidence predictions while using LLM reasoning only for genuinely ambiguous edge cases.
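A minimal sketch of the routing rule, with the two predictors injected as callables (names and return shapes are assumptions, not the production interfaces):

```python
from typing import Callable

CONFIDENCE_THRESHOLD = 0.8  # routing rule from the text

def classify(text: str,
             transformer_predict: Callable[[str], dict],
             claude_validate: Callable[[str, dict], dict]) -> dict:
    # Fast path: sub-second transformer prediction.
    pred = transformer_predict(text)
    if pred["confidence"] >= CONFIDENCE_THRESHOLD:
        return {**pred, "source": "transformer"}
    # Slow path: escalate ambiguous cases to Claude (+2-5 s of latency).
    return {**claude_validate(text, pred), "source": "claude_fallback"}
```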
Automated Evaluation Pipeline
I built a stratified validation system using Claude as ground truth:
- Sample 250 listings per country (stratified by predicted class)
- Generate Claude labels with structured reasoning
- Evaluate transformer predictions against Claude ground truth
- Track per-country precision/recall over time
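A hedged sketch of the per-country evaluation step, assuming a DataFrame with `predicted_class` and a Claude-generated `claude_label` column (both column names are assumptions; the Claude labelling prompt is not shown):

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

def evaluate_country(df: pd.DataFrame, n: int = 250, seed: int = 42) -> dict:
    # Stratified sample: roughly equal representation of each predicted class.
    per_class = n // df["predicted_class"].nunique()
    sample = (df.groupby("predicted_class", group_keys=False)
                .apply(lambda g: g.sample(min(len(g), per_class), random_state=seed)))
    # Score transformer predictions against Claude ground-truth labels.
    y_true = sample["claude_label"]
    y_pred = sample["predicted_class"]
    return {
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
    }
```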
Production System
Deployment Pipeline
*[Deployment diagram]* The FastAPI classifier service is containerised with Docker; images are built and pushed to a container registry and pulled on deploy. A stack launch brings up an EC2 GPU group behind a load balancer, exposed via a DNS endpoint.
Real-Time API (FastAPI):
- Accepts listing description, price, and country code
- Returns classification with confidence score in <2s
- Falls back to Claude for uncertain predictions
- Includes business rule overrides (e.g., known leasing providers auto-flagged)
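An illustrative shape of the endpoint, showing where the business-rule override short-circuits model inference (route, field, helper, and provider names are placeholders, not the production code):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ListingRequest(BaseModel):
    description: str
    price: float
    country_code: str

class ClassificationResponse(BaseModel):
    conditional: bool
    categories: list[str]
    confidence: float
    source: str  # "business_rule", "transformer", or "claude_fallback"

# Business-rule override: known leasing providers are auto-flagged.
KNOWN_LEASING_PROVIDERS = frozenset({"example-leasing-provider"})  # placeholder

def route_prediction(req: ListingRequest) -> dict:
    # Stub for the transformer-first / Claude-fallback routing described above.
    return {"conditional": False, "categories": [],
            "confidence": 0.99, "source": "transformer"}

@app.post("/classify", response_model=ClassificationResponse)
def classify_listing(req: ListingRequest) -> ClassificationResponse:
    if any(p in req.description.lower() for p in KNOWN_LEASING_PROVIDERS):
        return ClassificationResponse(conditional=True, categories=["Leasing"],
                                      confidence=1.0, source="business_rule")
    return ClassificationResponse(**route_prediction(req))
```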
Monitoring: Full Datadog integration tracking prediction distributions, Claude API costs, conditional listing rates per market, and model latency.
Fine-Tuning Under Constraints
Fine-tuning a 304M parameter transformer encoder on a single 24 GB GPU required engineering around every memory bottleneck. Real constraints from the fine-tuning run:
| Constraint | Problem | Solution |
|---|---|---|
| 24 GB VRAM | Model + activations exceeded memory | Gradient checkpointing: recompute activations on the backward pass instead of storing them |
| Max batch size 8 | Too noisy for stable convergence | Gradient accumulation over 4 steps: effective batch of 32 without extra memory |
| Long input sequences | Full 512-token inputs too slow | Input truncation to 384 tokens: cut training time significantly with negligible accuracy loss |
| FP32 precision | Doubled memory for weights/activations | FP16 mixed precision training throughout |
| DeBERTa-v3-large size | 304M params barely fit | All four techniques combined to make fine-tuning feasible on a single GPU |
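The four techniques map directly onto HuggingFace `TrainingArguments`. A sketch mirroring the table (epoch count and learning rate are illustrative, not the exact production config):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./deberta-v3-large-pricing",  # hypothetical path
    per_device_train_batch_size=8,    # max that fits in 24 GB VRAM
    gradient_accumulation_steps=4,    # effective batch of 32
    gradient_checkpointing=True,      # recompute activations on backward pass
    fp16=True,                        # mixed-precision training
    num_train_epochs=3,               # illustrative
    learning_rate=2e-5,               # illustrative
)
# Truncation is applied at tokenisation time:
# tokenizer(texts, truncation=True, max_length=384)
```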
The disentangled attention mechanism in DeBERTa encodes each token using separate content and position embeddings, which is what gives it strong contextual understanding relative to its parameter count.
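For reference, the disentangled attention score from the DeBERTa paper (He et al., 2021) decomposes into three terms; notation follows the paper, and this is general background rather than anything project-specific:

```latex
% Q^c, K^c: content projections; Q^r, K^r: relative-position projections;
% \delta(i,j): bucketed relative distance between tokens i and j.
A_{i,j} = \underbrace{Q_i^{c} {K_j^{c}}^{\top}}_{\text{content-to-content}}
        + \underbrace{Q_i^{c} {K_{\delta(i,j)}^{r}}^{\top}}_{\text{content-to-position}}
        + \underbrace{K_j^{c} {Q_{\delta(j,i)}^{r}}^{\top}}_{\text{position-to-content}}
```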
Tech Stack
Python PyTorch HuggingFace Transformers DeBERTa V3 mDeBERTa V3 FastAPI AWS Bedrock Claude AWS Athena AWS S3 AWS EC2 (GPU) Papermill Datadog Docker
Why In-House Over Claude?
The core business case: fine-tuning a transformer model instead of routing everything through Claude delivered dramatic cost and speed gains at no meaningful accuracy cost.
Detection Performance
| Method | Detection Rate | False Positive Rate |
|---|---|---|
| Claude | 99% | 1% |
| Transformer Model | 90–95% | 4–7% |
A small accuracy trade-off in exchange for 20x faster inference and a cost curve that stays flat as volume scales.
Inference Speed
| Method | Listings per Day | Inference Time |
|---|---|---|
| Claude | 60K | 2–3 hours |
| Transformer Model | 60K | ~10 minutes |
Cost at Scale
| Listings per Day | GPU Cost p.a. (Transformer) | Claude Cost p.a. |
|---|---|---|
| 20K | $14K | $10K |
| 40K | $14K | $23K |
| 60K | $14K | $42K |
| 80K | $14K | $56K |
| 100K | $14K | $70K |
The transformer's GPU cost is fixed at $14K/year regardless of volume, while Claude's cost scales roughly linearly with listings. Break-even therefore falls between 20K and 40K listings/day, and at 100K listings/day the in-house model is 5x cheaper.
Key Takeaways
- Right-sizing models matters: DeBERTa vs mDeBERTa selection per market improved accuracy without unnecessary compute
- Confidence-based routing between fast local models and LLMs is a production pattern I now use everywhere
- Country-specific feature engineering (keyword stems, text extraction) outperformed language-agnostic approaches
- Automated evaluation pipelines with LLM-as-judge provide scalable quality assurance across markets