Designing
AI Quality Systems

Part 1: Floor / Ceiling / Style for
AI-Generated Commerce Video

December 9, 2025 · Kenneth Hung · 20 min read

I've spent 15+ years productizing emerging technology, including 6.5 years at Meta where I shipped industry-first AI/ML experiences in AR and Avatars that reached 1.3B+ users across Messenger, Instagram, and Facebook. The work was always the same shape: take a novel technical capability, productize it through experiences and tools, then grow the ecosystem that scales it.

This case study applies that practice to generative AI, using TikTok Shop as the test case. Earlier capabilities didn't force the question generative AI does:

How do you specify acceptable behavior for multi-modal AI-generated content at scale? And how do you build the systems creators, ML pipelines, and global markets can all rely on and trust?

The substance of this piece isn't TikTok Shop. It's a framework called Floor / Ceiling / Style for translating human judgment into machine-learnable signals.

This case study runs across four parts.

  1. Part 1 (this piece) introduces the framework.

  2. Part 2 scales it across verticals and markets.

  3. Part 3 (coming soon!) applies it to cross-regional adaptation.

  4. Part 4 (coming soon!) applies it to making AI content feel human.

1. The Problem:
Why Generative AI Needs a Quality Discipline


Generative AI has solved a generation problem. It hasn't solved a quality problem.

The technology to generate plausible video, text, and images at near-zero marginal cost now exists. What doesn't exist, at most companies, is a shared specification for acceptable AI behavior. And the infrastructure to enforce that specification consistently.

The model is the easy part. Specifying behavior is the hard part.

Without that infrastructure, three things happen at scale:

  • Floor failures. Some output is technically broken or policy-violating. At scale, even a 0.1% failure rate becomes daily PR and legal exposure.

  • Ceiling drift. Without a benchmark for excellence, average quality regresses toward the mean of training data that's increasingly AI-generated itself. The feedback loop degrades.

  • Style mismatch. Generic output doesn't serve specific contexts. An excellent beauty video and an excellent enterprise demo share almost no surface features.

This is what responsible AI looks like in practice: not a separate ethics review at the end, but a quality system woven into how content gets generated, evaluated, and shipped.

Why TikTok Shop

TikTok Shop is the sharpest version of the AI quality problem available right now, but it's not the only one. TikTok Shop has proven the shoppable-video commerce model at Western scale, and the entire industry is now chasing it. Amazon launched its AI Video Generator for sellers in June 2025, making AI-generated product videos free for any Amazon merchant. Instagram Shop and YouTube Shop follow the same shoppable-video playbook. Social commerce as a category is projected to grow from $2.6T globally in 2026 to $8.5T by 2030. Generative AI is how every one of these platforms plans to fill the content gap.

TikTok Shop is the leading edge of this shift. In 2025, TikTok Shop hit $64.3B in global GMV (+94% year-over-year), built a $15.1B US market across 803,500 active stores and 15.4M creators, and grew to $13.1B in Indonesia, its second-largest market globally. As of early 2024, TikTok Shop held 68.1% of US social shopping GMV (more than Instagram Checkout and Facebook Shop combined).

The structural significance is bigger than the numbers. TikTok Shop has inverted the e-commerce funnel. Products no longer wait to be searched for. They find consumers through content the consumers were already watching. That model only works if the content is good across categories, markets, and regulatory regimes.

Generative AI is the only realistic path to producing that volume of category-specific, market-adapted commerce video. And commerce video is multi-modal by nature: visual, audio, text, and temporal pacing all working together. That makes TikTok Shop the largest live test of whether multi-modal AI-generated content can hold up at consumer scale, and what works here will define how Amazon, Meta, and YouTube approach the same problem next.

The Universal Challenge

The challenge isn't generating commerce video. That's solved across every major platform.

The challenge is generating commerce video that converts, in markets where attention costs money and ecosystem trust is the long-term asset. Without a quality framework, AI-generated commerce content at scale becomes "AI slop" on any platform that ships it. High views, low conversion, ecosystem damage.

This piece is about the infrastructure that prevents that. Built once, applicable everywhere.

2. The Framework:
Floor / Ceiling / Style

The function that codifies AI behavior goes by different names across the industry: Model Behavior, Model Policy, Responsible AI, Trust & Safety, Content Integrity. The naming differs. The work is converging. I've written separately about AI Behavior Design as a discipline.

For TikTok Shop specifically, the mission is narrower: build AIGC for commerce conversion, not entertainment virality. Learn from high-GMV videos, not high-view videos.

Those rarely overlap, and the difference between them is what the framework has to encode.

The Foundation

Three-Layer Quality Framework

Floor must pass · Ceiling to optimize · Style matches context

Vertical axis: Quality (measurable, better/worse)

  • 📈 Ceiling. Numeric benchmark, 0-100, derived from top performers: 3-second hook retention, completion rate, CTR/CVR, structure score, product timing, CTA timing. These are the patterns the AI should learn.

  • 🚫 Floor. Binary pass/fail against basic quality thresholds: visual quality, audio quality, compliance, product info, captions, early-exit patterns. These are the limits the AI must never go below.

Pass(Floor) × Score(Ceiling) = Quality

Horizontal axis: Style (contextual, different not better)

  • Format (content structure): 📦 Unboxing, 📚 Tutorial, 🔄 Before/After, 🎧 ASMR, 📖 Narrative, ⚡ Fast-cut, ✨ Lifestyle, 🎭 Skit, 🎤 Vox Pop.

  • Aesthetic (visual style): 💧 Clean Girl, 🌈 Dopamine, 🪩 Y2K, 🤍 Quiet Lux, 🌃 Cyberpunk, 🍃 Japandi, 🌾 Cottagecore, 🏛️ Old Money, ⚙️ Industrial.

Format × Aesthetic × Category = Optimal Style

Category → Style match:

  • Premium Skincare: Tutorial + Quiet Lux

  • Electronics: Unboxing + Cyberpunk

  • Home Decor: Lifestyle + Japandi

  • Food/Snacks: ASMR + Cottagecore

  • Fashion: Fast-cut + Y2K

  • Color Cosmetics: Tutorial + Dopamine

  • Fitness Gear: Before/After + Industrial

  • Luxury Goods: Lifestyle + Old Money

  • Pet Supplies: Lifestyle + Cottagecore

  • Collectibles: Unboxing + Dopamine

How the three layers drive conversion metrics:

  • 🚫 Floor → basic retention. Bad quality is an instant swipe; Floor protects 3-second retention.

  • 📈 Ceiling → conversion lift. Top patterns raise CTR/CVR and maximize watch → click → buy.

  • 🎨 Style → matched expectation. The right format and aesthetic create resonance, boosting completion and trust.

Why three layers, not one

Most quality systems start as a single pass/fail threshold. This works briefly, then breaks. Single thresholds can't distinguish "policy violation" from "underperforming output" from "off-brand for this category." Each requires a different intervention.

Three layers separate three distinct questions:

  • Floor: Is this output safe to ship? (binary, hard constraint)

  • Ceiling: Does this output match our top-performing examples? (numeric, optimization target)

  • Style: Does this output match its context? (categorical, conditioning signal)

The Floor catches catastrophic failures. Technical defects, policy violations, structural breakage. Floor is binary by design. There is no "70% compliant." Either output ships or it doesn't.

Implementation is classifier gating, post-generation processing, and rule-based filters. For TikTok Shop, Floor signals span every modality of the video:

  • Visual. Resolution, stability, brightness

  • Audio. Clarity, signal-to-noise ratio

  • Text. Banned-word and regulated-claim detection, OCR for product info accuracy

  • Temporal. Caption readability, absence of early-exit patterns

If output fails Floor on any modality, downstream evaluation doesn't run. It doesn't matter how high it would have scored elsewhere.

The Ceiling is what excellent output looks like, derived from actual top-performing content in each category. This is where the system learns excellence, not just compliance.

Implementation is numeric scoring (0-100) via reward models trained on top-decile examples.

For TikTok Shop, Ceiling signals come from high-GMV examples (not high-view), and include:

  • 3-second hook retention ≥ 80%

  • First product appearance ≤ 5 seconds

  • Completion rate above category average

  • CTR/CVR above category average

  • Pain → solution narrative structure

  • Optimized CTA placement and timing

Ceiling signals must come from data, not assumptions. The work is forensic: extract top performers, contrast against the bottom, identify differentiating patterns, validate via A/B testing. The output is a numeric quality score that ML pipelines can optimize against.

Style isn't about quality. It's about match. A quietly elegant beauty tutorial and a high-energy fast-cut sneaker drop can both be excellent, and both fail if you swap their styles.

Implementation is categorical labels via style embeddings or conditional generation. For TikTok Shop, Style breaks into two axes:

  1. Format (how content is structured): Unboxing, Tutorial, Before/After, ASMR, Narrative, Fast-cut, Lifestyle, Skit, Vox Pop.

  2. Aesthetic (how it looks): Clean Girl, Dopamine, Y2K, Quiet Luxury, Cyberpunk, Japandi, Cottagecore, Old Money, Industrial.

The product of Format × Aesthetic × Category produces an "optimal style" recommendation. Style is highly localized, with the lowest reuse rate across markets and the highest refresh rate over time.

Every generated piece passes through the pipeline in order:

  1. Floor check. Pass or reject.

  2. Ceiling score. 0-100, optimization signal.

  3. Style match. Categorical fit to context.

Floor is what the system must do. Ceiling is what the system aims for. Style is how the system adapts. The layered structure is what makes the framework scale.
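The pipeline order can be sketched in code. This is a minimal illustration under stated assumptions, not TikTok Shop's implementation: the field names, thresholds, and score weights are placeholders chosen to mirror the signals listed earlier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Evaluation:
    floor_pass: bool
    ceiling_score: Optional[float] = None  # 0-100, computed only if Floor passes
    style: Optional[dict] = None           # categorical tags, no quality judgment

def floor_check(video: dict) -> bool:
    """Binary gate: every modality must pass; any single failure rejects the video."""
    return all([
        video["resolution_p"] >= 720,        # visual
        video["audio_snr_db"] >= 12,         # audio
        not video["has_banned_claims"],      # compliance
        video["product_info_verified"],      # product info vs catalog
        video["captions_legible"],           # captions
        not video["matches_early_exit"],     # early-exit pattern
    ])

def ceiling_score(video: dict) -> float:
    """Weighted 0-100 score over performance-derived signals (weights illustrative)."""
    weights = {"hook_3s_retention": 0.40, "completion_rate": 0.35,
               "structure_similarity": 0.25}
    return 100 * sum(w * video[k] for k, w in weights.items())

def evaluate(video: dict) -> Evaluation:
    if not floor_check(video):
        return Evaluation(floor_pass=False)  # Floor failed: nothing downstream runs
    return Evaluation(
        floor_pass=True,
        ceiling_score=ceiling_score(video),
        style={"format": video["format"], "aesthetic": video["aesthetic"]},
    )
```

The short-circuit in `evaluate` is the point: a Floor failure skips Ceiling scoring entirely, exactly as the pipeline order above requires.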

Ceiling signals must come from data, not assumptions

A repeatable five-step process discovers Ceiling patterns per category:

  1. Extract the top 10% of videos by CVR.

  2. Compare against the bottom 10%.

  3. Extract the patterns that differ.

  4. Find statistical correlations.

  5. A/B validate the signals.

Example: the data reveals that "product in use within 3s" correlates at r = 0.68 (strong) with Beauty CTR.
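Steps 1 through 4 can be sketched on synthetic records. The field names (`product_in_3s`, `cvr`) and the decile fraction are illustrative assumptions, not real platform data or schema:

```python
def decile_contrast(videos, signal, metric="cvr", frac=0.10):
    """Mean difference of `signal` between the top and bottom `frac` of videos by `metric`."""
    ranked = sorted(videos, key=lambda v: v[metric])
    n = max(1, int(len(ranked) * frac))
    bottom, top = ranked[:n], ranked[-n:]
    mean = lambda grp: sum(v[signal] for v in grp) / len(grp)
    return mean(top) - mean(bottom)

def pearson_r(videos, signal, metric="cvr"):
    """Pearson correlation between a candidate signal and the target metric."""
    xs = [float(v[signal]) for v in videos]
    ys = [float(v[metric]) for v in videos]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A strong `pearson_r` on a candidate signal is a hypothesis, not a conclusion; step 5 (A/B validation) is what separates correlation from a usable Ceiling signal.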

3. Annotation in Practice:
Turning Judgment into Training Signal

The framework defines what to measure. Annotation is how that measurement becomes data the system can learn from.

This is the most leveraged part of the system, and the part most often underbuilt.


Each video gets labeled across three layers, each with a different data type and annotation method. Floor uses auto-detection. Ceiling derives from real performance. Style is human classification.

🚫 Floor. Pass/fail, auto-detected; all six signals must pass.

The non-negotiable baseline. Most signals are detectable automatically using CV, ASR, NLP, and OCR. Human annotation enters only at boundary cases.

floor: {
  visual: 1, audio: 1,
  compliance: 1, product_info: 1,
  captions: 1, early_exit: 1,
  pass: true
}

  • Visual: resolution ≥ 720p, stable framing, adequate brightness (CV detection)

  • Audio: SNR ≥ 12dB, clear dialogue (ASR analysis)

  • Compliance: no banned words, no regulated claims (NLP filter)

  • Product info: price and features verified against the catalog (OCR + validation)

  • Captions: legible, no occlusion of the product (OCR + analysis)

  • Early exit: drop-off curve doesn't match the bottom-10% pattern (pattern match)

📈 Ceiling. Performance-derived 0-100 score.

Derived from actual top-performing content per category. Labels come from real platform data, not human judgment. Reward models learn what high-GMV content does differently.

ceiling: {
  hook_3s: 0.92, completion: 0.68,
  cvr: 0.042, structure: 0.87,
  product_seconds: 2, cta_position: 0.85,
  score: 94
}

  • 3s hook retention ≥ 80%: retention curve at the critical drop-off (platform data)

  • Completion above category baseline: play-completion vs category benchmark (platform data)

  • CTR and CVR above category baseline (platform data)

  • Structure: cosine similarity ≥ 0.85 against high-CVR pattern templates (ML model)

  • First product appearance ≤ 5s: first product appearance timestamp (CV detection)

  • CTA in the last 30% of the video: call-to-action placement detection (CV + NLP)

🎨 Style. Tags only; no quality judgment.

Style is conditional, not absolute. A Y2K aesthetic isn't better than Quiet Luxury; it's better for a category. The framework tags style instead of scoring it.

style: {
  format: "tutorial",
  aesthetic: "clean girl",
  category: "beauty"
}

  • format: a video classifier tags one of 9 format types (Tutorial, Unboxing, ASMR…), ~92% accuracy

  • aesthetic: CLIP embedding match, cosine similarity against 10 reference videos per aesthetic (prototype-based)

  • category: from SKU metadata; the vertical comes from the linked product, not the video itself (catalog truth)

Annotator agreement: Cohen's kappa κ ≥ 0.70. Two annotators tag each video independently. Below the threshold, a third reviewer breaks the tie. This filters out ambiguous edge cases.

Key insight: the model can only learn what the annotation pipeline reliably labels.

Three layers, three annotation methods

Each layer needs a different annotation method. Floor uses multi-modal detection. Ceiling joins data. Style requires human judgment.

  • Floor: computer vision for visual quality, ASR for audio, NLP for compliance, OCR for product info. Human annotation enters only at boundary cases.

  • Ceiling: labels come from real platform data. Actual retention. Actual CVR. The work is building the pipeline that joins generated content to its performance metrics.

  • Style: annotators don't decide if a video is good. They classify it as Tutorial-format, Clean-aesthetic, Beauty-category. Inter-rater agreement (Cohen's kappa) above 0.7 is the quality bar.
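That agreement bar is easy to make concrete. A minimal sketch of Cohen's kappa computed from scratch for two annotators' format labels; the label values below are illustrative:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

Kappa corrects for the agreement two annotators would reach by chance, which is why it, rather than raw percent agreement, is the right bar for categorical style labels.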

Sample Annotations

Three learning signals from one schema

Top performers teach the model what to reproduce. Floor failures teach what to refuse. Low performers teach where the gap is between average and excellent.

Top 5%, high CVR: 30s Beauty Lipstick Tutorial

  • Floor: all six checks pass (visual, audio, compliance, product, captions, exit).

  • Ceiling: 94/100. 3s hook 92% (target ≥ 65%), completion 68% (≥ 45%), CVR 4.2% (≥ 1.8%), product appears at 2s (≤ 5s).

  • Style tags: 📚 Tutorial + 💧 Clean + 💄 Beauty.

Floor fail, rejected: 45s Fashion Showcase

  • Floor: two of six checks failed. Compliance: exaggerated claims. Captions: block the product.

  • Ceiling: not evaluated. Floor failed, so Ceiling evaluation is skipped.

  • Style tags: Fast-cut + 🪩 Y2K + 👗 Fashion.

Low performer, needs work: 60s Electronics Unboxing

  • Floor: all six checks pass.

  • Ceiling: 38/100. 3s hook 45% (target ≥ 65%), completion 22% (≥ 45%), CVR 0.3% (≥ 1.8%), product appears at 12s (≤ 5s).

  • Style tags: 📦 Unboxing + 🌃 Cyber + 📱 Tech.
Annotation Pipeline

From raw video to a single annotated record

Five stages turn raw video into the records shown above. Reproducible at scale, auditable per stage. Output: one combined record per video.

  1. Collect: the top 5% and bottom 10% of videos.

  2. Auto-detect: Floor signals via classifiers.

  3. Label performance: Ceiling signals from platform data.

  4. Classify: Format × Aesthetic × Category.

  5. QA check: Cohen's kappa > 0.7.

Output, one record per video:
{
  "video_id": "sv_12345",
  "floor": {
    "visual": 1, "audio": 1, "compliance": 1,
    "product_info": 1, "captions": 1, "early_exit": 1,
    "pass": true
  },
  "ceiling": {
    "hook_3s": 0.92, "completion": 0.68,
    "cvr": 0.042, "product_appearance_seconds": 2,
    "score": 94
  },
  "style": {
    "format": "tutorial", "aesthetic": "clean", "category": "beauty"
  }
}
The framework is the ML architecture. Floor becomes a binary classifier. Ceiling becomes a reward model. Style becomes categorical features for downstream ranking. Three layers, three training primitives.
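Under the record schema above, that mapping is nearly mechanical. A sketch, assuming the JSON fields shown; the [0, 1] reward scaling is an illustrative choice, not a prescribed one:

```python
def to_training_primitives(record: dict):
    """Map one annotated record to the three training primitives."""
    floor_label = int(record["floor"]["pass"])            # binary classifier target
    reward_target = (record["ceiling"]["score"] / 100.0   # reward-model target in [0, 1]
                     if floor_label else None)            # Floor failures carry no reward
    style_features = dict(record["style"])                # categorical ranking features
    return floor_label, reward_target, style_features
```

One annotation pass thus feeds three separate training pipelines without re-labeling.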

Why this transfers beyond TikTok Shop

The same annotation discipline applies to any generative AI product, single-modal or multi-modal:

  • For an LLM assistant: Floor is policy compliance. Ceiling is helpfulness on top-rated responses. Style is tone match.

  • For multi-modal models: The same three layers apply per modality and across modalities.

The framework transfers because the underlying problem (translating judgment into signal) transfers.

4. Trust, Bias,
& Responsible AI

The framework above defines what good looks like and how to measure it. Now the harder question:

  • What happens when the system gets it wrong, or gets it right for the wrong people?

Quality systems are usually framed as a conversion problem. That's the surface answer. The deeper reason quality systems matter is trust — and trust gets broken in two distinct ways:

  1. Externally, when AI-generated content erodes consumer confidence in the platform.

  2. Internally, when the system produces high-quality output for one demographic and lower-quality output for everyone else, without anyone noticing.

Both failure modes share a root cause: codifying behavior without specifying who that behavior serves.

This section covers both.

The trust contrast happening right now

  • Instagram. Meta has been pushing AI-generated content into feeds aggressively. AI characters with profiles. AI-generated comments. AI imagery in recommendations. The user backlash is visible and loud. The "dead internet" discourse moved from niche to mainstream. The technical capability is there. The trust isn't.

  • TikTok. The brand is built on the opposite. Authenticity. "Real People, Real Reviews." Creators talking to camera. The platform's value proposition collapses if AI-generated content erodes that perception.

The cost of getting it wrong:

  • A 0.1% Floor failure rate is a daily lawsuit.

  • A 5% drop in perceived authenticity is an existential threat.

Quality is the surface goal.
Trust is what's actually at stake.

The Three-Layer Quality framework treats trust as something the system actively protects, not something it hopes to maintain. Three places trust shows up in the framework:

Floor as a trust boundary.

Floor is binary because trust is binary.

One viral AI video making medical claims it can't back up doesn't lose a sale. It loses platform credibility for a category.

These are the guardrails the system can't cross.

Ceiling as a trust signal.

Completion rate and CVR aren't just performance metrics. They're trust proxies.

People stay because they trust the content.

People buy because they trust the seller.

Style as a trust match.

Generic AI output is recognizable.

Distinctive AI output that matches its category and audience doesn't feel out of place.

Style prevents the uncanny-valley sensation of "this content doesn't belong here."

A structural parallel

Different labs encode AI behavior using structurally similar mechanisms. Anthropic's Constitutional AI. OpenAI's Model Spec and RLHF. Google DeepMind's safety rules and preference learning. Each one names the same architecture: hard constraints, learned preferences, human judgment translated into machine-learnable signals.

The domains differ. The structural shape is the same: specifying acceptable AI behavior in a way the system can learn and the audience can trust.

Where the system quietly breaks: bias

If the previous part of this section was about why trust matters externally, this part is about how trust gets quietly broken internally.

Responsible AI isn't a separate discipline from quality. It's the part of quality that asks: who is this system working for, and who is it leaving out?

A quality framework that doesn't account for bias produces high-quality biased output. Codifying behavior without asking "behavior for whom" is exactly how that happens. Bias isn't binary. It's not "have it or don't." It's "more or less." The goal isn't claiming neutrality, which is unachievable. The goal is building mechanisms that continuously detect and reduce bias as the system runs. Without this layer, a quality framework ships biased output more efficiently. With it, the framework becomes responsible AI infrastructure.

Where bias enters:

  • Training data bias. The Top 10% of high-converting content reflects what the platform has historically rewarded. If past high performers were mostly young, light-skinned, conventionally attractive models, Ceiling signals encode that pattern. The system doesn't learn what makes a video convert. It learns what convertible videos have looked like so far.

  • Prompt-implicit assumptions. "A beauty influencer" defaults to young and female. "A professional look" implies a specific skin tone, age, and dress. These assumptions aren't explicit, but they shape generation.

  • Style label bias. Style categories themselves can encode bias. Does "high-end" map to one specific aesthetic? Does "professional" implicitly exclude certain body types?

Where to intervene:

  • Data layer. Audit demographic distribution per category. Supplement deliberately where coverage is thin.

  • Prompt layer. Review templates for implicit assumptions. "A skincare user" instead of "a young woman with clear skin."

  • Detection layer. Build classifiers that flag patterns at the distribution level, not just individual outputs.

  • Output layer. Audit regularly. Distribution analysis, not random spot checks.

  • Feedback layer. User reporting. Internal red teams. Find problems before users do.

And structurally:

  • A diverse quality team. Different perspectives find different blind spots.

  • External audits. Internal teams develop collective blind spots.

  • Bias bounty mechanisms. Crowdsourced detection beats centralized review.
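The output-layer audit is the easiest of these to make concrete. A minimal sketch of a distribution-level check; the attribute name, target shares, and tolerance are illustrative assumptions, not recommended values:

```python
from collections import Counter

def distribution_gap(outputs, target, attribute):
    """Largest absolute gap between observed and target shares of `attribute` values."""
    counts = Counter(o[attribute] for o in outputs)
    total = len(outputs)
    return max(abs(counts.get(value, 0) / total - share)
               for value, share in target.items())

def flag_drift(outputs, target, attribute, tolerance=0.10):
    """True when generated output drifts beyond `tolerance` from the target distribution."""
    return distribution_gap(outputs, target, attribute) > tolerance
```

The point of the design is the unit of analysis: each individual video can look unobjectionable while the batch as a whole skews, which is why the check runs over distributions rather than single outputs.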

Continue to Part 2

The framework above is general. Most quality systems break the moment they meet specifics: different verticals demand different signals, different markets demand different rules.

Part 2 covers how the framework scales — eight verticals with their own Floor / Ceiling / Style configurations, three markets launching the same product under different regulatory regimes, and the 90-day execution plan for building this from zero.