Designing AI Quality Systems
Part 2: Scaling the Framework Across Verticals, Markets, and Execution
December 18, 2025 · Kenneth Hung · 20 min read
This is Part 2 of a four-part case study.
Part 1 introduced the Floor / Ceiling / Style framework: a way to translate human judgment about AI-generated content into machine-learnable signals.
Part 2 (this piece) picks up where Part 1 left off: how do you scale this framework across verticals, markets, and operational reality?
Part 3 (coming soon!) applies the framework to cross-regional adaptation: one product, three markets, one quality system.
Part 4 (coming soon!) applies the same framework to a different use case: making AI-generated content feel like authentic human creation.
The framework is general. The reality of deployment is not. A health supplement video in Indonesia has nothing in common with a sneaker drop in California except their three-layer structure. This part covers what stays universal, what localizes, and how to build the system from zero.
1. Category Specificity: Why One Model Can't Serve All Verticals
Category-Specific Signals
The three-layer structure is universal. The signal values inside it are not. Every category has a different core challenge — and each needs its own signal definitions and training data. One generic model can't serve them all.
The structure is universal.
The signals are not.
A single model trained on all categories' data will average toward the largest category and underserve the others. Per-category Ceiling signals keep performance balanced across the vertical mix.
Floor
The Floor catches catastrophic failures. Technical defects, policy violations, structural breakage. Floor is binary by design. There is no "70% compliant." Either output ships or it doesn't.
Implementation is classifier gating, post-generation processing, and rule-based filters. For TikTok Shop, Floor signals span every modality of the video:
Visual. Resolution, stability, brightness
Audio. Clarity, signal-to-noise ratio
Text. Banned-word and regulated-claim detection, OCR for product info accuracy
Temporal. Caption readability, absence of early-exit patterns
If output fails Floor on any modality, downstream evaluation doesn't run. It doesn't matter how high it would have scored elsewhere.
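Mechanically, the Floor is a short-circuit gate: run every check, reject on any failure, never compute a partial score. A minimal sketch in Python; the check names, thresholds, and data shape are illustrative assumptions, not the production implementation:

```python
from dataclasses import dataclass, field

@dataclass
class FloorResult:
    passed: bool
    failed_checks: list = field(default_factory=list)

def floor_gate(video, checks):
    """Binary Floor gate: every check must pass; any failure rejects the output."""
    failed = [name for name, check in checks.items() if not check(video)]
    return FloorResult(passed=not failed, failed_checks=failed)

# Illustrative checks over a dict-shaped video record (names and thresholds are assumptions).
CHECKS = {
    "visual.resolution":  lambda v: v["height"] >= 720,
    "audio.snr_db":       lambda v: v["snr_db"] >= 20,
    "text.banned_claims": lambda v: not v["banned_claim_hits"],
}

video = {"height": 1080, "snr_db": 25, "banned_claim_hits": ["cures colds"]}
result = floor_gate(video, CHECKS)
# result.passed is False: one banned claim is enough, regardless of visual/audio quality.
```

Because Floor is binary, the gate reports which checks failed but never a score; only outputs with `passed=True` reach downstream evaluation.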
Ceiling
The Ceiling is what excellent output looks like, derived from actual top-performing content in each category. This is where the system learns excellence, not just compliance.
Implementation is numeric scoring (0-100) via reward models trained on top-decile examples.
For TikTok Shop, Ceiling signals come from high-GMV examples (not high-view), and include:
3-second hook retention ≥ 80%
First product appearance ≤ 5 seconds
Completion rate above category average
CTR/CVR above category average
Pain → solution narrative structure
Optimized CTA placement and timing
Ceiling signals must come from data, not assumptions.
The work is forensic: extract top performers, contrast against the bottom, identify differentiating patterns, validate via A/B testing. The output is a numeric quality score that ML pipelines can optimize against.
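As a toy illustration of the end state, the extracted signals can be normalized to [0, 1] and aggregated into the 0-100 score; in production the weighting is learned by a reward model rather than hand-set. The signal names and weights below are assumptions for illustration:

```python
def ceiling_score(signals, weights):
    """Weighted aggregate of normalized signals (each in [0, 1]) into a 0-100 score."""
    total = sum(weights.values())
    return round(100 * sum(weights[k] * signals.get(k, 0.0) for k in weights) / total, 1)

# Hypothetical signals for one video, mirroring the data-driven patterns above.
signals = {
    "hook_retention_3s": 0.82,   # fraction of viewers retained at 3 seconds
    "product_in_5s": 1.0,        # 1.0 if first product appearance <= 5 seconds
    "completion_vs_avg": 0.60,   # percentile vs. category average
    "ctr_vs_avg": 0.55,
}
weights = {"hook_retention_3s": 3, "product_in_5s": 2,
           "completion_vs_avg": 2, "ctr_vs_avg": 2}

score = ceiling_score(signals, weights)  # ~75 on this example
```

Unlike the Floor gate, this is an optimization target: the pipeline can trade off signals against each other, which a binary check never allows.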
Style
Style isn't about quality. It's about match. A quietly elegant beauty tutorial and a high-energy fast-cut sneaker drop can both be excellent, and both fail if you swap their styles.
Implementation is categorical labels via style embeddings or conditional generation. For TikTok Shop, Style breaks into two axes:
Format (how content is structured): Unboxing, Tutorial, Before/After, ASMR, Narrative, Fast-cut, Lifestyle, Skit, Vox Pop.
Aesthetic (how it looks): Clean Girl, Dopamine, Y2K, Quiet Luxury, Cyberpunk, Japandi, Cottagecore, Old Money, Industrial.
Crossing Format × Aesthetic × Category yields an "optimal style" recommendation. Style is highly localized, with the lowest reuse rate across markets and the highest refresh rate over time.
The pipeline
Every generated piece passes through the pipeline in order:
Floor check. Pass or reject.
Ceiling score. 0-100, optimization signal.
Style match. Categorical fit to context.
Floor is what the system must do. Ceiling is what the system aims for. Style is how the system adapts. The layered structure is what makes the framework scale.
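Wired together, the three layers are just an ordered evaluation. A minimal sketch, with hypothetical function names; the real pipeline plugs in trained classifiers and reward models where the lambdas sit:

```python
def evaluate(video, floor_checks, ceiling_fn, style_fn):
    """Run the layers in order: Floor gates, Ceiling scores, Style tags."""
    failed = [name for name, check in floor_checks.items() if not check(video)]
    if failed:
        # Floor failure short-circuits: no Ceiling score is ever computed.
        return {"verdict": "reject", "floor_failures": failed}
    return {
        "verdict": "ship",
        "ceiling": ceiling_fn(video),   # 0-100 optimization signal
        "style": style_fn(video),       # categorical fit, e.g. (format, aesthetic)
    }

floor_checks = {"visual.resolution": lambda v: v["height"] >= 720}
report = evaluate(
    {"height": 1080},
    floor_checks,
    ceiling_fn=lambda v: 75.0,
    style_fn=lambda v: ("Tutorial", "Clean Girl"),
)
# report["verdict"] == "ship"
```

The ordering is the point: scaling to a new vertical means swapping in that vertical's `floor_checks`, `ceiling_fn`, and `style_fn` without touching the orchestration.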
Adding a new vertical
New verticals can be added by defining their three layers without rebuilding the system. The cold-start problem becomes:
Identify the closest adjacent category
Transfer its structural signals
Redefine Style
Validate with a small dataset
Iterate
The three-layer framework operationalizes the same way for each vertical, from signal definitions to AIGC production roles to measurable outcomes. The verticals covered: Beauty & Personal Care, Pet Supplies, Fashion & Apparel, Food & Beverages, Toys & Collectibles, Tech & Electronics, Health Products, and Home & Living.
2. Scaling Globally: What Stays Universal, What Localizes
The framework is designed for global scale by separating what transfers from what doesn't.
Global Framework Architecture
Some layers are globally unified; others are market-customized.
Worked example: launching Immunity Gummies in three markets
Three Markets, One Framework
The framework operating on a real product: immunity gummies launching in the US, Brazil, and Indonesia. The structure is global. Every signal value is local.
United States
- Dietary supplements need no pre-market approval
- Can say "supports immune health" (structure/function claim)
- Cannot say "boosts immunity" beyond support, or "prevents cold"
- Banned: "cure," "treat," "prevent disease"
- Ingredients in Supplement Facts format
- DSHEA disclaimer required: "Not evaluated by FDA. Not intended to diagnose, treat, cure, or prevent any disease."
Brazil
- Food supplements require prior ANVISA notification (not registration)
- Can say "fonte de vitamina C" (vitamin C source)
- Cannot say "previne gripes" (prevents colds)
- Cannot say "fortalece imunidade" without clinical evidence
- Strict upper nutrient limits per population group (IN 28/2018, Annex IV)
- 100% Portuguese labels + mandatory warning "Este produto não é um medicamento"
Indonesia
- BPOM ML certification is the basic entry requirement
- Halal certification mandatory (87% Muslim market). BPJPH issues, LPPOM MUI inspects
- Gelatin source is the core question:
- ✗ Pork gelatin → automatic rejection (haram)
- ✓ Bovine (Halal-slaughtered) / fish / plant-based
- Halal logo (BPJPH purple, or legacy MUI green through Oct 2026) must be visible
Three-layer framework for health products: Floor is non-negotiable, Ceiling adapts to trust culture, Style is highly localized.
10 Failure modes to design for upfront
Failure modes from the world
The framework can break in predictable ways when scaling globally. The first set of failures comes from the diversity of markets, cultures, and commerce contexts the framework must adapt to.
US-centric defaults
A global team building "global" templates is actually building US-centric templates (English-first copy, dollar pricing, nuclear family imagery, gestures and color symbolism assumed universal). The defaults bake in cultural assumptions that fail elsewhere — a "thumbs up" reads positive in the US, offensive in parts of the Middle East; red signals luck in China, danger in Western contexts.
Mitigation: cultural gates with local sign-off authority before any template ships globally; visual symbol audit per market.
Regulatory fragmentation
Regulations vary by country (GDPR in EU, FTC in US, Halal certification in Indonesia/Malaysia, royal/political content rules in Thailand and Vietnam) and update faster than centralized teams can track.
Mitigation: rules-as-code per country, conservative defaults, local compliance team sign-off.
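One way to realize "rules-as-code" is a per-country rule table with conservative global defaults that always apply. The patterns below are illustrative fragments, not real compliance rule sets:

```python
import re

# Hypothetical per-country banned-claim patterns. Global rules are the
# conservative default and always stack under the country-specific set.
RULES = {
    "GLOBAL": [r"\bcures?\b", r"\bprevents? disease\b"],
    "US":     [r"\bboosts? immunity\b"],
    "BR":     [r"\bprevine gripes\b", r"\bfortalece imunidade\b"],
}

def banned_claim_hits(text, country):
    """Return the rule patterns the text violates for this market."""
    patterns = RULES["GLOBAL"] + RULES.get(country, [])
    return [p for p in patterns if re.search(p, text, re.IGNORECASE)]
```

Keeping rules as data (reviewable, diffable, owned per country) is what lets local compliance teams sign off on changes without touching pipeline code.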
Language-variant collapse
"Spanish" is not one language, and neither is Portuguese: Mexican Spanish, Argentine Spanish, and Brazilian Portuguese all differ in idiom, register, and connotation. The same holds for Indonesian, Thai, and Vietnamese in APAC. Machine translation produces compliant-sounding but culturally wrong output.
Mitigation: native-speaker quality control per language variant, not per language family.
Untranslatable style references
Direct translation loses cultural meaning — "fresh" in US beauty means cool/trendy; the Mandarin literal 新鮮 means "not stale." "Y2K aesthetic" has no Mandarin equivalent and needs to remap to 復古 (retro) or 平成 (Heisei era) depending on market. Style references decay fast: what was "clean girl" in 2024 is "boring" in 2026.
Mitigation: local style curation per market (not translation), quarterly refresh cycles, trend-decay monitoring tied to engagement metrics.
Commerce-infrastructure mismatch
Generated CTAs assume payment methods, logistics timelines, and bandwidth conditions that vary by market (Pix in Brazil, OXXO in Mexico, Touch'n Go in Malaysia; same-day delivery in Jakarta, multi-day in rural markets; high-bitrate video in mature markets, low-bitrate fallbacks elsewhere).
Mitigation: CTA generation tied to market metadata; video transcoding pipelines for bandwidth-constrained regions.
Failure modes from the system
The second set comes from how the framework operates internally — its data, its scale, its people, and how the optimization itself can go wrong.
Cold start in new markets
New markets have no top-performer data — Ceiling signals trained on US Beauty videos don't transfer to Indonesia (different skin tones, different aspirational references, different humor). Cold-start period is 3-6 months of noisy signal while a market accumulates 1000+ labeled examples per vertical.
Mitigation: seed with adjacent-market signals (Mexico for LATAM, Indonesia for SEA), tag early data as "low-confidence," calibrate weights as native data accumulates.
Configuration explosion
10 markets × 8 categories × 3 layers = 240 configurations, each potentially overriding the others. Changing a single global Beauty Ceiling signal cascades to 10 market-specific rubrics, each requiring re-validation. Without tooling, every refinement triggers weeks of re-labeling.
Mitigation: hierarchical inheritance (global → regional → country), change-impact propagation alerts, eval-set versioning so old eval data doesn't pollute new rubrics.
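The inheritance itself can be as simple as ordered dict merging, where each more-local layer overrides only the keys it defines. A sketch with made-up signal names:

```python
def resolve_rubric(*layers):
    """Merge rubric layers global -> regional -> country; later layers win."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

GLOBAL_BEAUTY = {"hook_retention_3s": 0.80, "product_in_5s": True}
APAC          = {"hook_retention_3s": 0.85}    # regional override only
INDONESIA     = {"halal_logo_visible": True}   # country-only addition

rubric = resolve_rubric(GLOBAL_BEAUTY, APAC, INDONESIA)
# hook threshold 0.85 (APAC wins), product_in_5s inherited, halal check added
```

A change to GLOBAL_BEAUTY then propagates automatically to every market that doesn't override the key, which is what makes change-impact alerts tractable instead of re-validating all 240 configurations by hand.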
Local-expertise scarcity
Local expertise doesn't exist in centralized hubs. A US-based Beauty team won't know Indonesian Halal cosmetic requirements or that K-beauty references work differently in Japan than in Korea. Hiring locally is slow (12+ months to onboard a Beauty expert in a new market).
Mitigation: regional hubs in 3-4 anchor markets (e.g., Singapore for SEA, São Paulo for LATAM, Berlin for EU), shared knowledge bases with versioned cultural context, two-way feedback loops where regional teams contribute back to the global framework.
Mode collapse and scroll fatigue
Optimizing toward a Ceiling rubric narrows the model's output distribution toward a small set of high-scoring patterns (same hook structures, same pacing, same aesthetic). Users habituate quickly; what scored well in Q1 is "scroll fatigue" in Q3.
Mitigation: diversity rewards in the rubric (penalize over-similarity to recent outputs), multi-objective optimization (Ceiling AND novelty), rotation of "unconventional" examples in the training/eval set to prevent distribution collapse.
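A minimal form of the multi-objective idea: subtract a similarity penalty from the Ceiling score, so that between two equally high-scoring candidates the less derivative one wins. The weight and the similarity measure (an embedding distance to recent outputs, in a real system) are placeholders:

```python
def diversified_reward(ceiling, similarity_to_recent, novelty_weight=30.0):
    """Ceiling in [0, 100], similarity in [0, 1]; higher reward is better."""
    return ceiling - novelty_weight * similarity_to_recent

# Equal Ceiling scores, different novelty: the fresher pattern now outranks.
clone   = diversified_reward(85.0, 0.9)   # near-duplicate of recent winners
fresher = diversified_reward(85.0, 0.2)
# fresher > clone
```

The design choice is where the penalty applies: at candidate ranking (as here) it is cheap and reversible, while baking it into the training reward reshapes the output distribution itself.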
Adversarial gaming
Merchants gaming the rubric (e.g., inserting compliant-sounding phrases that mask non-compliant claims), prompt injection in product descriptions, or Style templates being exploited to generate borderline content.
Mitigation: red-team probes as part of the eval dataset, anomaly detection on rubric pass-rates per merchant, periodic adversarial rubric reviews.
How the Framework Generates a Market-Ready Video
The Floor / Ceiling / Style framework operating end-to-end as a generation pipeline. Indonesia, the most complex of the three markets, is walked through step by step.
Input and context
- Brief: launch immunity gummies in Indonesia
- Resolves to: BPOM ML + Halal + Indonesian
- Channel: TikTok Shop
- Locale stack: Global + APAC + ID
Floor gates
- BPOM ML cert?
- Halal cert (BPJPH)?
- Gelatin source?
- Pork gelatin → reject
Ceiling targets
- 15–30 second fast pace
- Product + Halal within 3s
- KOL recommend hook
- Urgency + social proof CTA
Style choices
- Green / white palette
- Halal logo upper right
- Hijab version
- Pop Indo music
Final Floor verification
- Forbidden medical claims removed?
- Halal logo visible 3+ seconds?
- Models religiously appropriate?
Outcome
- → Pass: ship
- ↻ Single fail: regenerate
- ⟲ Pattern fail: framework refinement
3. The Practice of AI Behavior Design
The AI Behavior Designer's role
The Floor / Ceiling / Style framework doesn't write itself. Someone has to decide that "no medical claims" is a hard binary, not a 0–100 score, and that the line shifts when the category does.
That decision is behavior design. It isn't policy work (which writes the rules upstream) and it isn't engineering (which trains the models). It's the judgment work that translates platform standards into AIGC pipeline behavior. Across all the diagrams above, that work breaks into four responsibilities:
Drawing the Floor
The Floor is the hard binary, the line the AI must never cross.
Health has 5 human-label markers in its Floor because medical claim detection can't be reliably automated.
Toys has 1 because most toy-safety signals are visually verifiable.
Those asymmetries are the design output: deciding where automation is trustworthy, and where human judgment is required.
Codifying the Ceiling
The Ceiling layer codifies what top 5% content actually looks like in this category right now. Not generally good content. Category-specific patterns.
Beauty Ceiling has Before/After.
Tech has Hidden Features.
Food has ASMR Audio.
These aren't intuitions. They're patterns observed in high-performing content, validated against platform data, and updated as platform norms shift.
Designing Style matches
Style refuses easy answers.
Y2K aesthetic isn't better than Quiet Luxury. Y2K is right for fashion targeting students. Quiet Luxury is right for health targeting professionals.
The match rules between Format, Aesthetic, and Category are explicit design decisions.
When Avatar generates a "Tech blogger / Gadget pro" persona, that persona was matched to Tech audience because of a rule someone wrote.
Building diagnosis into the structure
Quality assurance isn't downstream of the pipeline. It's built into the same structure that generates content.
Every output gets evaluated against Floor (pass/fail), scored against Ceiling (0–100), and tagged for Style.
When CVR drops in a category, the framework tells you whether it's a Floor failure, a Ceiling regression, or a Style mismatch.
The same diagram that defines what to generate also defines how to diagnose what went wrong.
Where AI Behavior Design Operates
The framework runs as code. The AI Behavior Designer designs how it should behave. Here is what the role owns, decides, and updates across the pipeline.
AI Behavior Design is a discipline in formation. Across companies it goes by many names — Model Behavior, Model Policy, Responsible AI, Trust & Safety, Content Integrity. The work converges on the same problem: ensuring AI-generated content quality at scale, where the surface of design isn't UI but the system's behavior itself. At each framework layer, the AI Behavior Designer translates domain expertise into artifacts that ship as code. Across all layers, the role aggregates findings into framework refinements.
The system generates videos. The framework defines what "good" means. The AI Behavior Designer keeps that definition accurate as markets evolve, content trends shift, and regulations change.
These rubrics live at the evaluation layer, not the training layer. The model itself is trained upstream via RLHF, DPO, RLAIF, Constitutional AI, fine-tuning, and reward modeling — those are the lab-level levers. Floor / Ceiling / Style rubrics define "good" for the system that runs in production, and failure patterns reveal what to update at the framework layer.
Sample rubric criterion: BC-04 Result-First Hook (Beauty Vertical, Ceiling Layer)
Illustrative excerpt showing the depth and format of production rubric authoring. Real rubrics span 8-12 criteria per vertical-market combination, with parallel artifacts for calibration data, annotator guidelines, and eval pipeline specs. Part 3 and 4 will apply criteria like this one to AI-generated videos.
The first 90 days
The first 90 days for an AI Behavior Designer standing up the function: audit, MVP, validation. The scope isn't the whole AIGC pipeline. It's the eval/rubric layer that turns Floor / Ceiling / Style into a working signal system, starting with one pilot vertical.
The 90-day deliverable: a replicable framework for one category, with data showing it works.
Phase 1: Audit
Audit existing systems
Map current failure modes from production data and merchant escalations
Interview engineering, operations, and merchant teams
Pick one high-GMV vertical as pilot (Beauty is defensible: most data, most variation)
Phase 2: MVP
Define Floor / Ceiling / Style signals for the pilot
Deliver annotation guidelines and labeled samples
Calibrate rubrics with annotators (target inter-rater κ > 0.7)
Align with engineering on technical implementation
Establish a quality evaluation baseline
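The κ target above is checkable with a few lines. A minimal Cohen's kappa for two annotators labeling the same items (the data here is a toy example, not real labels):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    ca, cb = Counter(rater_a), Counter(rater_b)
    expected = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]
kappa = cohens_kappa(a, b)  # 0.75 on this toy data: above the 0.7 bar
```

In practice this runs per criterion, and any criterion that stays below 0.7 triggers a rewrite of the annotation guideline rather than more labeling.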
Phase 3: Validation
Run the full pipeline end-to-end
A/B test framework outputs against current production
Document failure modes and refinement hypotheses for the next iteration
Build the playbook for vertical two
Org structure
How the AI Behavior Design role lives within an organization: who it works with, how the work gets measured, and how the function scales beyond MVP. The role isn't a solo function. It's the connective tissue between policy, engineering, data science, and operations.
Who the role works with
A quality framework can't be built by a single function:
Product Design + UX. Quality definition, UX, labeling specs.
Product Management. Strategy, prioritization, roadmap.
Engineering. Tool development, model integration, pipeline.
Data Science. Signal validation, analytics, A/B infrastructure.
Operations. Creator relations, content ops, edge case escalation.
Trust & Safety. Compliance, content policy, regional regulations.
Product Marketing. Go-to-market, enablement, launch comms.
The orchestration is what the AI Behavior Designer owns: specifying behavior in a way engineering can implement, operations can enforce, and merchants can adopt.
How the work gets measured
Four categories:
Content Quality. Signal compliance, Floor pass rate, Ceiling distribution.
Creator Efficiency. Production time, adoption, satisfaction.
Content Performance. 3-second retention, completion, CVR vs. human baseline, GMV.
System economics. Inference cost per generated video, rejection rate × inference cost (cost of waste), Floor / Ceiling threshold sensitivity to cost (where the bar is set determines how much compute the system burns). Quality decisions are also cost decisions.
The metrics ladder: signal compliance → content quality → creator adoption → content volume → business outcomes.
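The "cost of waste" term is simple arithmetic worth making explicit, because it turns threshold tuning into a budget conversation. All numbers below are hypothetical:

```python
def cost_of_waste(videos_per_day, cost_per_generation, rejection_rate):
    """Daily spend on generations the Floor gate rejects (pure waste)."""
    return videos_per_day * cost_per_generation * rejection_rate

# A Floor tightening that moves rejections from 10% to 25%,
# at a hypothetical $0.40 inference cost per generated video:
loose = cost_of_waste(100_000, 0.40, 0.10)   # ~ $4,000/day of waste
tight = cost_of_waste(100_000, 0.40, 0.25)   # ~ $10,000/day of waste
```

The same threshold move that raises quality multiplies compute burn, which is why Floor and Ceiling bars are set with the economics in view, not quality alone.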
How the function scales
Once proven on one vertical, scaling follows the global architecture:
A central team owns framework definition, core annotation methodology, and quality systems infrastructure.
Regional teams own market-specific Floor rules, Ceiling calibration, and Style curation.
Category specialists own per-vertical signal definition.
This mirrors how the labs structure equivalent functions.
Continue to Part 3 & 4
The framework above is theory until applied. Parts 3 and 4 each apply Floor / Ceiling / Style to a different use case.
Part 3: Cross-Regional Adaptation (coming soon!) One product, three markets, one quality system. The framework in scaling mode.
Part 4: Authenticity Through Imperfection (coming soon!) Making AI-generated content feel human. The framework in trust mode.
Read in any order. Both stand alone.
Together they demonstrate two things at once:
- The framework working
- The AI fluency that defining quality systems for generative AI now requires
Appendix
- Floor: product visible in 2s
- Ceiling: hook must attract
- Style: beauty uses aspirational tone
- Good/bad criteria
- Rating rubric (1-5 scale)
- Edge case handling
- Reusable patterns across verticals
- Recommend rubric reweighting
- Flag preference data gaps
- Output distribution shifts
- Behavioral regressions to investigate
- Next-iteration hypotheses
B: "Can't cover those dark circles?" (pain-point)
CVR: A 3.2% vs B 1.8%
Low tail (0-1) = 11.1% of outputs. Right shift vs. prior run.
| ID | Criterion | Mean | Δ prior |
|---|---|---|---|
| BC-09 | Claim verifiability | 2.41 | −0.34 |
| BC-07 | Face visibility | 3.12 | −0.18 |
| BC-11 | Pattern saturation | 3.28 | ±0.02 |
| BC-04 | Result-first hook | 3.89 | +0.22 |
| BC-02 | Brand aesthetic | 4.14 | +0.07 |