Designing AI Quality Systems

Part 1: Floor / Ceiling / Style for AI-Generated Commerce Video

December 9, 2025 · Kenneth Hung · 20 min read

I've spent 15+ years productizing emerging technology, including 6.5 years at Meta where I shipped industry-first AI/ML experiences in AR and Avatars that reached 1.3B+ users across Messenger, Instagram, and Facebook. The work was always the same shape: take a novel technical capability, productize it through experiences and tools, then grow the ecosystem that scales it.

This case study applies that practice to generative AI, using TikTok Shop as the test case. Earlier capabilities didn't force the question generative AI does:

How do you specify acceptable behavior for multi-modal AI-generated content at scale? And how do you build the systems creators, ML pipelines, and global markets can all rely on and trust?

The substance of this piece isn't TikTok Shop. It's a framework called Floor / Ceiling / Style for translating human judgment into machine-learnable signals.

1. The Problem: Why Generative AI Needs a Quality Discipline

Generative AI has solved a generation problem. It hasn't solved a quality problem.

In social media e-commerce video, that quality problem is a trust problem. Plausible video, text, and images can now be generated at near-zero marginal cost. What hasn't kept pace is trust. Users sense AI when they see it. They scroll past, distrust, or actively reject it. And when buyers don't trust the video, they don't buy.

The model is the easy part. Specifying behavior is the hard part.

Without a quality infrastructure underneath the generation, three things happen at scale:

  1. Safety failures (Floor). Some output is technically broken or policy-violating. At scale, even a 0.1% failure rate becomes daily PR and legal exposure.

  2. Quality drift (Ceiling). Without a benchmark for excellence, average quality regresses toward the mean of training data that's increasingly AI-generated itself. The feedback loop degrades.

  3. Context mismatch (Style). Generic output doesn't serve specific contexts. An excellent beauty video and an excellent enterprise demo share almost no surface features.

This is what responsible AI looks like in practice: not a separate ethics review at the end, but a quality system woven into how content gets generated, evaluated, and shipped.

Four reasons TikTok Shop is the right test case for this framework.

The Universal Challenge

The challenge isn't generating commerce video. That's solved across every major platform.

The challenge is generating commerce video that converts, in markets where attention costs money and ecosystem trust is the long-term asset. Without a quality framework, AI-generated commerce content at scale becomes "AI slop" on any platform that ships it. High views, low conversion, ecosystem damage.

This piece is about the infrastructure that prevents that. Built once, applicable everywhere.

2. The Framework: Floor / Ceiling / Style

The function that codifies AI behavior goes by different names across the industry: Model Behavior, Model Policy, Responsible AI, Trust & Safety, Content Integrity. The naming differs. The work is converging. I've written separately about AI Behavior Design as a discipline.

For TikTok Shop specifically, the mission is narrower:

Build AIGC for commerce conversion, not entertainment virality.
Learn from high-GMV videos, not high-view videos.

High-GMV videos and high-view videos rarely overlap, and the difference between them is what the framework has to encode.

The Foundation

Three-Layer Quality Framework

Floor must pass · Ceiling to optimize · Style matches context

Vertical axis: Quality (measurable, better/worse)

  • 📈 Ceiling: numeric benchmark, 0–100, set by top performers. Signals: 3-second hook retention, completion rate, CTR / CVR, structure score, product timing, CTA timing. 🤖 AI should LEARN these patterns.

  • 🚫 Floor: binary 0/1, pass / fail, the basic quality threshold. Signals: visual quality, audio quality, compliance, product info, captions, early exit patterns. ⚠️ AI must NEVER go below.

Pass(Floor) × Score(Ceiling) = Quality

Horizontal axis: Style (contextual, different not better)

  • Format (content structures): 📦 Unboxing, 📚 Tutorial, 🔄 Before/After, 🎧 ASMR, 📖 Narrative, ⚡ Fast-cut, ✨ Lifestyle, 🎭 Skit, 🎤 Vox Pop

  • Aesthetic (visual styles): 💧 Clean Girl, 🌈 Dopamine, 🪩 Y2K, 🤍 Quiet Lux, 🌃 Cyberpunk, 🍃 Japandi, 🌾 Cottagecore, 🏛️ Old Money, ⚙️ Industrial

Format × Aesthetic × Category = Optimal Style

Category → Style Match

  • Premium Skincare: 📚 Tutorial + 🤍 Quiet Lux
  • Electronics: 📦 Unboxing + 🌃 Cyberpunk
  • Home Decor: ✨ Lifestyle + 🍃 Japandi
  • Food/Snacks: 🎧 ASMR + 🌾 Cottagecore
  • Fashion: ⚡ Fast-cut + 🪩 Y2K
  • Color Cosmetics: 📚 Tutorial + 🌈 Dopamine
  • Fitness Gear: 🔄 Before/After + ⚙️ Industrial
  • Luxury Goods: ✨ Lifestyle + 🏛️ Old Money
  • Pet Supplies: ✨ Lifestyle + 🌾 Cottagecore
  • Collectibles: 📦 Unboxing + 🌈 Dopamine

How Three Layers Drive Conversion Metrics

  • 🚫 Floor → Basic Retention. Bad quality = instant swipe. Protects 3s retention.

  • 📈 Ceiling → Conversion Lift. Top patterns = higher CTR/CVR. Maximize watch → click → buy.

  • 🎨 Style → Match Expectation. Right format + aesthetic = resonance. Boosts completion & trust.

Why three layers, not one

Most quality systems start as a single pass/fail threshold. This works briefly, then breaks. Single thresholds can't distinguish "policy violation" from "underperforming output" from "off-brand for this category." Each requires a different intervention.

Three layers separate three distinct questions:

  • Floor: Is this output safe to ship? (binary, hard constraint)

  • Ceiling: Does this output match our top-performing examples? (numeric, optimization target)

  • Style: Does this output match its context? (categorical, conditioning signal)
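
A minimal sketch of how the three answer types could sit in one record, assuming the framework's Pass(Floor) × Score(Ceiling) = Quality formula. The `Assessment` class and `quality` function are illustrative names, not part of any real pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """One video's result on the three layers (hypothetical structure)."""
    floor_pass: bool               # Floor: binary, hard constraint
    ceiling_score: float           # Ceiling: numeric target, 0-100
    style: tuple[str, str, str]    # Style: (format, aesthetic, category) tags


def quality(a: Assessment) -> float:
    """Pass(Floor) x Score(Ceiling): a failed Floor zeroes the score."""
    if not 0 <= a.ceiling_score <= 100:
        raise ValueError("ceiling_score must be in [0, 100]")
    return (1 if a.floor_pass else 0) * a.ceiling_score


top = Assessment(True, 94, ("tutorial", "clean girl", "beauty"))
rejected = Assessment(False, 94, ("fast-cut", "y2k", "fashion"))

print(quality(top))       # 94
print(quality(rejected))  # 0
```

Style never enters the arithmetic. It conditions what gets generated and compared, which is why it is stored as tags rather than a number.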

Ceiling signals must come from data, not assumptions

A repeatable five-step process for discovering Ceiling patterns per category:

  1. Extract the top 10% (high-CVR videos).

  2. Compare against the bottom 10% (low-CVR videos).

  3. Extract the patterns that differ.

  4. Find statistical correlations.

  5. A/B validate the signals.

The work is forensic: extract top performers, contrast against the bottom, identify differentiating patterns, validate via A/B testing. The output is a numeric quality score that ML pipelines can optimize against.

Example: data reveals that "Product in use within 3s" correlates with Beauty CTR at r = 0.68 (strong).
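
Steps 1 through 4 of the process can be sketched in a few lines. This is an illustration under stated assumptions, not the production method: the data is synthetic and invented for the example, the field names (`product_in_use_3s`, `cvr`) are hypothetical, and a plain Pearson correlation stands in for whatever statistics a real analysis would use. Step 5 (A/B validation) needs live traffic and is not shown.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def decile_contrast(videos, feature, metric="cvr"):
    """Share of videos showing `feature` in the top vs bottom 10% by `metric`."""
    ranked = sorted(videos, key=lambda v: v[metric], reverse=True)
    k = max(1, len(ranked) // 10)

    def rate(group):
        return sum(v[feature] for v in group) / len(group)

    return rate(ranked[:k]), rate(ranked[-k:])


# Synthetic data, invented for illustration only: videos that show the
# product in use within 3 s get a higher simulated CVR on average.
random.seed(0)
videos = []
for _ in range(500):
    early = random.random() < 0.5
    cvr = max(0.0, random.gauss(0.03 if early else 0.01, 0.008))
    videos.append({"product_in_use_3s": int(early), "cvr": cvr})

top_rate, bottom_rate = decile_contrast(videos, "product_in_use_3s")
r = pearson([v["product_in_use_3s"] for v in videos],
            [v["cvr"] for v in videos])
print(f"top 10%: {top_rate:.2f}  bottom 10%: {bottom_rate:.2f}  r = {r:.2f}")
```

A pattern that is common in the top decile, rare in the bottom decile, and correlated with the metric is a candidate Ceiling signal. Correlation alone does not establish that the pattern causes the lift, which is why the process ends with A/B validation.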

3. Annotation in Practice: Turning Judgment into Training Signal

The framework defines what to measure. Annotation is how that measurement becomes data the system can learn from.

This is the most leveraged part of the system, and the part most often underbuilt.

Each video gets labeled across three layers, each with a different data type and annotation method. Floor uses auto-detection. Ceiling derives from real performance. Style is human classification.

🚫 Floor · Pass / Fail
Auto-detected · all six must pass

The non-negotiable baseline. Most signals are detectable automatically using CV, ASR, NLP, and OCR. Human annotation enters only at boundary cases.

floor: {
  visual: 1, audio: 1,
  compliance: 1, product_info: 1,
  captions: 1, early_exit: 1,
  pass: true
}

  • visual: Resolution ≥ 720p, stable framing. Checks resolution, stability, brightness. Method: CV detection.
  • audio: SNR ≥ 12 dB, dialogue clear. Checks signal-to-noise, clarity score. Method: ASR analysis.
  • compliance: No banned words, no regulated claims. Method: NLP filter.
  • product_info: Price + features verified vs catalog. Checks price and feature accuracy. Method: OCR + validation.
  • captions: Legible, no occlusion of product. Checks readability, occlusion detection. Method: OCR + analysis.
  • early_exit: Drop-off does not match the bottom-10% pattern. Compares to the bottom 10% drop-off. Method: pattern match.
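
A minimal sketch of the "all six must pass" rule, using the thresholds stated in this section (720p, 12 dB SNR). Everything else is an assumption for illustration: the `FloorInputs` fields stand in for upstream detector outputs, and the substring match against a placeholder `BANNED_TERMS` set is a deliberately crude stand-in for a real NLP compliance filter.

```python
from dataclasses import dataclass

BANNED_TERMS = {"cures", "guaranteed results"}  # hypothetical placeholder list


@dataclass
class FloorInputs:
    """Upstream detector outputs for one video (all field names hypothetical)."""
    height_px: int
    framing_stable: bool
    snr_db: float
    transcript: str
    catalog_match: bool                  # price and features verified vs catalog
    captions_legible: bool
    captions_occlude_product: bool
    matches_bottom_decile_dropoff: bool


def floor_record(x: FloorInputs) -> dict:
    """Return the six binary Floor signals plus the overall pass flag."""
    text = x.transcript.lower()
    signals = {
        "visual": int(x.height_px >= 720 and x.framing_stable),
        "audio": int(x.snr_db >= 12.0),
        "compliance": int(not any(term in text for term in BANNED_TERMS)),
        "product_info": int(x.catalog_match),
        "captions": int(x.captions_legible and not x.captions_occlude_product),
        "early_exit": int(not x.matches_bottom_decile_dropoff),
    }
    # The Floor is a hard gate: one failed signal fails the whole video.
    return {**signals, "pass": all(signals.values())}


ok = FloorInputs(1080, True, 18.0, "Long-wear lipstick in three shades",
                 True, True, False, False)
noisy = FloorInputs(1080, True, 9.0, "This serum cures acne overnight",
                    True, True, False, False)

print(floor_record(ok)["pass"])     # True
print(floor_record(noisy)["pass"])  # False (audio and compliance both fail)
```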

📈 Ceiling · 0–100
Performance-derived · 0–100 score

Derived from actual top-performing content per category. Labels come from real platform data, not human judgment. Reward models learn what high-GMV content does differently.

ceiling: {
  hook_3s: 0.92, completion: 0.68,
  cvr: 0.042, structure: 0.87,
  product_seconds: 2, cta_position: 0.85,
  score: 94
}

  • hook_3s: 3s hook ≥ 80% retention. Retention curve at the critical drop-off. Source: platform data.
  • completion: Completion > category baseline. Play-completion vs category benchmark. Source: platform data.
  • cvr: CTR + CVR > category baseline. Click-through and conversion rates. Source: platform data.
  • structure: Cosine similarity ≥ 0.85 vs templates. Compared to high-CVR pattern templates. Source: ML model.
  • product_seconds: First product appearance ≤ 5s. First product appearance timestamp. Source: CV detection.
  • cta_position: CTA in last 30% of video. Call-to-action placement detection. Source: CV + NLP.
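
The per-signal thresholds in this panel can be sketched as plain comparisons. This section does not say how the six signals are weighted into the final 0–100 score, so the sketch stops at per-signal checks rather than guessing a formula. The toy 3-dimensional embeddings, the CTR values, and the dictionary layout are assumptions for illustration; the completion and CVR baselines reuse the 45% and 1.8% figures from the sample cards in this piece.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def structure_score(embedding, templates):
    """Best cosine match against the high-CVR pattern templates."""
    return max(cosine_similarity(embedding, t) for t in templates)


def ceiling_checks(m, baseline, structure):
    """Compare one video's metrics against the thresholds in this section."""
    return {
        "hook_3s": m["hook_3s"] >= 0.80,
        "completion": m["completion"] > baseline["completion"],
        "ctr_cvr": m["ctr"] > baseline["ctr"] and m["cvr"] > baseline["cvr"],
        "structure": structure >= 0.85,
        "product_timing": m["product_seconds"] <= 5,
        "cta_timing": m["cta_position"] >= 0.70,  # CTA in the last 30%
    }


# Toy vectors and CTR numbers are invented for the example.
templates = [[1.0, 0.0, 0.9], [0.0, 1.0, 0.0]]
video_embedding = [1.0, 0.0, 1.0]
metrics = {"hook_3s": 0.92, "completion": 0.68, "ctr": 0.05, "cvr": 0.042,
           "product_seconds": 2, "cta_position": 0.85}
baseline = {"completion": 0.45, "ctr": 0.02, "cvr": 0.018}

structure = structure_score(video_embedding, templates)
checks = ceiling_checks(metrics, baseline, structure)
print(all(checks.values()))  # True
```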

🎨 Style · Tags
Tags only · no quality judgment

Style is conditional, not absolute. A Y2K aesthetic isn't better than Quiet Luxury; it's better for a particular category. The framework tags style instead of scoring it.

style: {
  format: "tutorial",
  aesthetic: "clean girl",
  category: "beauty"
}

  • format: Video classifier. Tags one of 9 format types (Tutorial, Unboxing, ASMR…). ~92% accuracy.
  • aesthetic: CLIP embedding match. Cosine similarity against 10 reference videos per aesthetic. Prototype-based.
  • category: SKU metadata. Vertical comes from the linked product, not the video itself. Catalog truth.

κ ≥ 0.70 · Cohen's kappa · annotator agreement
Two annotators tag each video independently. Below the threshold, a third reviewer breaks the tie. Filters out ambiguous edge cases.

Key insight: The model can only learn what the annotation pipeline reliably labels.
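
The κ ≥ 0.70 agreement check uses the standard Cohen's kappa formula, (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch follows; the ten format tags are invented for the example, and sending disagreements to a third reviewer is one reading of the tie-break rule rather than a documented workflow.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example tags from two hypothetical annotators.
annotator_1 = ["tutorial", "tutorial", "unboxing", "asmr", "unboxing",
               "tutorial", "asmr", "skit", "skit", "tutorial"]
annotator_2 = ["tutorial", "tutorial", "unboxing", "asmr", "tutorial",
               "tutorial", "asmr", "skit", "skit", "tutorial"]

kappa = cohens_kappa(annotator_1, annotator_2)
needs_third_review = [i for i, (a, b) in enumerate(zip(annotator_1, annotator_2))
                      if a != b]

print(round(kappa, 3))     # 0.857
print(kappa >= 0.70)       # True
print(needs_third_review)  # [4]
```

Raw agreement here is 90%, but kappa is lower because some of that agreement would happen by chance given how often both annotators choose "tutorial".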

Three layers, three annotation methods

Each layer needs a different annotation method. Floor uses multi-modal detection. Ceiling joins platform performance data to each video. Style requires human judgment.

Sample Annotations

Three learning signals from one schema

Top performers teach the model what to reproduce. Floor failures teach what to refuse. Low performers teach where the gap is between average and excellent.

TOP 5% · High CVR
30s Beauty Lipstick Tutorial

  • Floor: All pass (Visual, Audio, Compliance, Product, Captions, Exit)
  • Ceiling: 94/100. 3s Hook 92% (≥ 65%), Completion 68% (≥ 45%), CVR 4.2% (≥ 1.8%), Product 2s (≤ 5s)
  • Style tags: 📚 Tutorial + 💧 Clean + 💄 Beauty

FLOOR FAIL · Rejected
45s Fashion Showcase

  • Floor: 2 failed. Compliance: exaggerated claims. Captions: blocks product.
  • Ceiling: Not evaluated. Floor failed, so Ceiling evaluation is skipped.
  • Style tags: ⚡ Fast-cut + 🪩 Y2K + 👗 Fashion

LOW PERF · Needs work
60s Electronics Unboxing

  • Floor: All pass (Visual, Audio, Compliance, Product, Captions, Exit)
  • Ceiling: 38/100. 3s Hook 45% (≥ 65%), Completion 22% (≥ 45%), CVR 0.3% (≥ 1.8%), Product 12s (≤ 5s)
  • Style tags: 📦 Unboxing + 🌃 Cyber + 📱 Tech

Annotation Pipeline

From raw video to a single annotated record

Five stages turn raw video into the records shown above. Reproducible at scale, auditable per stage. Output: one combined record per video.

  1. 📥 Collect: Top 5% + Bottom 10%

  2. 🤖 Auto-Detect: Floor signals via classifiers

  3. 📊 Label Perf: Ceiling from platform data

  4. 🏷️ Classify: Format × Aesthetic × Category

  5. QA Check: Cohen's kappa ≥ 0.70

Output: one record per video
{
  "video_id": "sv_12345",
  "floor": {
    "visual": 1, "audio": 1, "compliance": 1,
    "product_info": 1, "captions": 1, "early_exit": 1,
    "pass": true
  },
  "ceiling": {
    "hook_3s": 0.92, "completion": 0.68,
    "cvr": 0.042, "product_appearance_seconds": 2,
    "score": 94
  },
  "style": {
    "format": "tutorial", "aesthetic": "clean", "category": "beauty"
  }
}
The framework is the ML architecture. Floor becomes a binary classifier. Ceiling becomes a reward model. Style becomes categorical features for downstream ranking. Three layers, three training primitives.
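
A minimal sketch of how combined records might be routed to those three training primitives. The function name, the 0–1 rescaling of the score, and the second record (`sv_00002`, a Floor failure with no Ceiling block) are assumptions for illustration; only `sv_12345` mirrors the record shown above.

```python
def split_training_signals(records):
    """Route annotated records to the three training primitives.

    Floor labels feed a binary classifier, Ceiling scores of Floor-passing
    videos feed a reward model, and Style tags become categorical features.
    """
    floor_labels, reward_targets, style_features = [], [], {}
    for rec in records:
        vid = rec["video_id"]
        floor_labels.append((vid, int(rec["floor"]["pass"])))
        style_features[vid] = rec["style"]
        # Ceiling is skipped when the Floor fails, so no reward target exists.
        if rec["floor"]["pass"] and rec.get("ceiling") is not None:
            reward_targets.append((vid, rec["ceiling"]["score"] / 100))
    return floor_labels, reward_targets, style_features


records = [
    {"video_id": "sv_12345",
     "floor": {"pass": True},
     "ceiling": {"score": 94},
     "style": {"format": "tutorial", "aesthetic": "clean", "category": "beauty"}},
    {"video_id": "sv_00002",  # hypothetical Floor failure
     "floor": {"pass": False},
     "ceiling": None,
     "style": {"format": "fast-cut", "aesthetic": "y2k", "category": "fashion"}},
]

floor_labels, reward_targets, style_features = split_training_signals(records)
print(floor_labels)    # [('sv_12345', 1), ('sv_00002', 0)]
print(reward_targets)  # [('sv_12345', 0.94)]
```

Keeping Floor failures out of the reward targets reflects the rule that Ceiling evaluation is skipped when the Floor fails. They still carry a label for the classifier and still carry Style tags.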

Why this transfers beyond TikTok Shop

The same annotation discipline applies to any generative AI product, single-modal or multi-modal:

  • For an LLM assistant: Floor is policy compliance. Ceiling is helpfulness on top-rated responses. Style is tone match.

  • For multi-modal models: The same three layers apply per modality and across modalities.

The framework transfers because the underlying problem (translating judgment into signal) transfers.

4. Trust, Bias, & Responsible AI

The framework above defines what good looks like and how to measure it. Now the harder question:

What happens when the system gets it wrong, or gets it right for the wrong people?

Quality systems are usually framed as a conversion problem. That's the surface answer. The deeper reason quality systems matter is trust — and trust gets broken in two distinct ways:

  1. Externally, when AI-generated content erodes consumer confidence in the platform.

  2. Internally, when the system produces high-quality output for one demographic and lower-quality output for everyone else, without anyone noticing.

Both failure modes share a root cause: codifying behavior without specifying who that behavior serves.

This section covers both.

The trust contrast happening right now

  • Instagram. Meta has been pushing AI-generated content into feeds aggressively. AI characters with profiles. AI-generated comments. AI imagery in recommendations. The user backlash is visible and loud. The "dead internet" discourse moved from niche to mainstream. The technical capability is there. The trust isn't.

  • TikTok. The brand is built on the opposite. Authenticity. "Real People, Real Reviews." Creators talking to camera. The platform's value proposition collapses if AI-generated content erodes that perception.

The cost of getting it wrong:

  • At platform scale, a 0.1% Floor failure rate means legal exposure every day.

  • A 5% drop in perceived authenticity is an existential threat.

Quality is the surface goal.
Trust is what's actually at stake.

The Three-Layer Quality framework treats trust as something the system actively protects, not something it hopes to maintain. Three places trust shows up in the framework:

Floor as a trust boundary.

Floor is binary because trust is binary.

One viral AI video making medical claims it can't back up doesn't just lose a sale. It loses platform credibility for a whole category.

These are the guardrails the system can't cross.

Ceiling as a trust signal.

Completion rate and CVR aren't just performance metrics. They're trust proxies.

People stay because they trust the content.

People buy because they trust the seller.

Style as a trust match.

Generic AI output is recognizable.

Distinctive AI output that matches its category and audience doesn't feel out of place.

Style prevents the uncanny-valley sensation of "this content doesn't belong here."

A structural parallel

Different labs encode AI behavior using structurally similar mechanisms. Anthropic's Constitutional AI. OpenAI's Model Spec and RLHF. Google DeepMind's safety rules and preference learning. Each one names the same architecture: hard constraints, learned preferences, human judgment translated into machine-learnable signals.

The domains differ. The structural shape is the same: specifying acceptable AI behavior in a way the system can learn and the audience can trust.

Where the system quietly breaks: bias

If the previous part of this section was about why trust matters externally, this part is about how trust gets quietly broken internally.

Responsible AI isn't a separate discipline from quality. It's the part of quality that asks: who is this system working for, and who is it leaving out?

A quality framework that doesn't account for bias produces high-quality biased output. Bias isn't binary. It's not "have it or don't." It's "more or less." The goal isn't claiming neutrality, which is unachievable. The goal is building mechanisms that continuously detect and reduce bias as the system runs.

Codifying behavior without asking "behavior for whom" is how biased output ships unnoticed.

A quality framework without this layer ships biased output more efficiently. With it, the framework becomes responsible AI infrastructure.
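
One possible detection mechanism, sketched under loud assumptions: compare the mean Ceiling score per group and flag any group that trails the best one by more than a chosen margin. The `segment` attribute, the 5-point margin, and the sample scores are all hypothetical. A real audit would need agreed group definitions, adequate sample sizes, and a significance test, none of which this piece specifies.

```python
from collections import defaultdict
from statistics import mean


def group_gaps(records, group_key, max_gap=5.0):
    """Flag groups whose mean Ceiling score trails the best group by > max_gap."""
    scores = defaultdict(list)
    for rec in records:
        scores[rec[group_key]].append(rec["ceiling_score"])
    means = {group: mean(values) for group, values in scores.items()}
    best = max(means.values())
    return {group: round(best - m, 2)
            for group, m in means.items() if best - m > max_gap}


# Invented scores for two hypothetical audience segments.
sample = [
    {"segment": "segment_a", "ceiling_score": 80.0},
    {"segment": "segment_a", "ceiling_score": 84.0},
    {"segment": "segment_b", "ceiling_score": 70.0},
    {"segment": "segment_b", "ceiling_score": 72.0},
]

print(group_gaps(sample, "segment"))  # {'segment_b': 11.0}
```

Run on a schedule against fresh output, a check like this surfaces the "without anyone noticing" failure mode as a number someone has to look at.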

Continue to Part 2

The framework above is general. Most quality systems break the moment they meet specifics: different verticals demand different signals, different markets demand different rules.

Part 2 covers how the framework scales — eight verticals with their own Floor / Ceiling / Style configurations, three markets launching the same product under different regulatory regimes, and the 90-day execution plan for building this from zero.