Spec-driven QC (not filename-driven)

At the studio I'm contracting with, Epic Games, there's a creative-services QC review that used to take a person hours. A delivery would land in a Google Drive folder: twenty to fifty image files for a typical one, four hundred or more for the big ones. Each file had to be checked against the right spec: dimensions, file type, file size, ratings icon, 1px border, localization handling. The QC artist did all of it by eye.

An earlier automation I'd built solved part of the problem. It worked by parsing filenames: the system checked each file's name against a config of known conventions to figure out which spec the file was supposed to match. That worked when filenames followed the convention. It broke any time a producer formatted a filename slightly differently. Maintenance was constant.

Then the picture changed. The team's project-management Airtable base started exposing the specs as data: each Asset Spec record explicitly described the expected dimensions, file type, max size, ratings requirement, border requirement, and naming convention, all linked to the originating Task. The spec wasn't something to infer from a filename anymore. It was already there.

This system uses the specs directly.

What got built

A Lambda that receives an Approval record ID, reads the specs linked through it, traverses the delivery's Google Drive folder, matches each file to its target spec via a deterministic three-step pipeline, runs metadata and AI vision checks, and writes structured results back to two places: a QC base for the review team and a summary on the originating Approval record.

165 unit tests; live-validated against four scenarios (happy-path, multilingual, no-specs, wrong-type) before any team-facing rollout.

The load-bearing decision: spec-driven, not filename-driven

The earlier filename parser was carrying weight that didn't belong to it. It was encoding business rules in a string parser. The platform name, the expected dimensions, the file type, all inferred from substrings of the filename, all subject to drift the moment a producer decided to format a filename slightly differently.

This system's matching pipeline is trivial because it uses the actual data:

Filename prefix/suffix. Compare the filename against the spec's Filename Prefix Nomenclature and Filename Suffix fields. Most explicit signal; resolves immediately when conventions are followed.
Detected dimensions. Read pixel dimensions from the file's metadata (no full decode) and compare against the spec's Dimensions field.
File type. Use the file extension as a tiebreaker against the spec's expected format when two specs share dimensions.
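As a rough illustration of the three steps above, here is a minimal Python sketch. The `Spec` class, its field names, and `match_spec` are hypothetical stand-ins, not the system's actual code. In particular, narrowing to the name matches before comparing dimensions is an assumption about how the steps chain together.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class Spec:
    # Illustrative stand-ins for the Airtable Asset Spec fields.
    name: str
    prefix: str                  # "Filename Prefix Nomenclature"
    suffix: str                  # "Filename Suffix"
    dimensions: tuple[int, int]  # (width, height) in pixels
    file_type: str               # expected extension without the dot, e.g. "png"


def match_spec(
    filename: str, dimensions: tuple[int, int], specs: list[Spec]
) -> Optional[Spec]:
    """Return the single spec a file maps to, or None if it stays ambiguous."""
    path = PurePosixPath(filename)
    stem = path.stem.lower()
    ext = path.suffix.lstrip(".").lower()

    # Step 1: filename prefix/suffix, the most explicit signal.
    # Specs with neither field set are skipped so they cannot match everything.
    by_name = [
        s for s in specs
        if (s.prefix or s.suffix)
        and stem.startswith(s.prefix.lower())
        and stem.endswith(s.suffix.lower())
    ]
    if len(by_name) == 1:
        return by_name[0]

    # Step 2: detected pixel dimensions (assumption: search within the
    # name matches if there were several, otherwise across all specs).
    candidates = by_name or specs
    by_dims = [s for s in candidates if s.dimensions == dimensions]
    if len(by_dims) == 1:
        return by_dims[0]

    # Step 3: file extension as a tiebreaker between specs sharing dimensions.
    by_type = [s for s in by_dims if s.file_type.lower() == ext]
    return by_type[0] if len(by_type) == 1 else None
```

Returning `None` instead of guessing keeps the match deterministic: an unmatched file can be reported as such rather than checked against the wrong spec.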

The match is essentially a database lookup. The QC rules live where they should: on the spec record in Airtable.

The bigger pattern this surfaces: when business logic looks fragile in code, check whether the data layer can hold it instead. The filename parser was complex because the spec data wasn't accessible from where the matching happened. Once the data was accessible, the parser shrank to "compare to spec field." Applies broadly beyond QC.

The two-phase pipeline: deterministic plus AI vision

The QC checks split into two categories with two different shapes of code.

Deterministic checks (code, sub-second per file):

File type matches expected?
File size under cap?
Dimensions correct?

These are objective, well-understood checks. Code handles them fast and reliably.
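A minimal sketch of what such checks might look like, assuming PNG input only and a spec passed as a plain dict. The keys `file_type`, `max_bytes`, and `dimensions` and both function names are hypothetical. The dimension read uses the fixed layout of the PNG header (signature, then the IHDR chunk), so no pixel data is decoded.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(header: bytes) -> tuple[int, int]:
    """Read (width, height) from the first 24 bytes of a PNG file."""
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    # Width and height are big-endian unsigned 32-bit ints at bytes 16..24.
    return struct.unpack(">II", header[16:24])


def deterministic_checks(
    filename: str, size_bytes: int, header: bytes, spec: dict
) -> dict[str, bool]:
    """Run the three metadata checks against one spec (hypothetical dict keys)."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    try:
        dims = png_dimensions(header)
    except ValueError:
        dims = None  # unreadable header counts as a dimension failure
    return {
        "file_type": ext == spec["file_type"],
        # Assumption: a file exactly at the cap still passes.
        "file_size": size_bytes <= spec["max_bytes"],
        "dimensions": dims == tuple(spec["dimensions"]),
    }
```

Other formats such as JPEG would need their own header readers; this sketch leaves them out.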

AI vision checks (Claude Opus, several seconds per file):

Ratings icon present and placed correctly?
1px border on the image (where required)?
Localization text behaving as expected (when the spec restricts it)?

These are perceptual. A human spots them by eye. Code can't do them well alone.

The decision: use deterministic code for what it does well, use AI vision for what it doesn't. Each check runs only when the spec calls for it. If a spec doesn't require a ratings icon, the system doesn't run the ratings check. The spec is the gate.
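The gating idea can be sketched in a few lines. Everything here is illustrative: the boolean spec keys, the check names, and `ask_model`, which is an injected callable standing in for whatever function actually sends the image to the vision model.

```python
from typing import Callable

# Hypothetical mapping: boolean spec field -> name of the vision check it enables.
VISION_CHECKS = {
    "requires_ratings_icon": "ratings_icon",
    "requires_border": "border_1px",
    "restricts_localization": "localization_text",
}


def plan_vision_checks(spec: dict) -> list[str]:
    """List only the vision checks this spec actually calls for."""
    return [check for field, check in VISION_CHECKS.items() if spec.get(field)]


def run_vision_checks(
    spec: dict, image_url: str, ask_model: Callable[[str, str], dict]
) -> dict[str, dict]:
    """Call the model once per required check; skipped checks cost nothing."""
    return {
        check: ask_model(check, image_url)
        for check in plan_vision_checks(spec)
    }
```

Because the plan is derived from the spec alone, a spec with no perceptual requirements produces an empty plan and no model calls at all.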

Drawing the line between these two categories is its own engineering judgment, and it's a relatively new kind of decision. Easy mistake in one direction: ask a model to verify the file type. Easy mistake in the other: ask code to judge whether the rating icon is artistically placed. Both fail in their own ways. The right answer is to use each tool for the kind of work it actually does, and to be honest when the answer is "this one needs the model even though it costs more."

The S3 transit layer

Each file needs to be available to two consumers, and both need a URL they can fetch without credentials: Claude Vision to read the image, and Airtable's attachment ingest to copy it. Google Drive doesn't expose one; its links are auth-protected.

S3 solves both at once. Each file goes:

Drive → S3 (transit, short-lived) → presigned URL → Claude Vision
                                                  → Airtable attachment ingest

After Airtable ingests the file into its own CDN, S3 isn't holding anything load-bearing; the bucket gets cleaned at the end of each run. S3 is transit, not durable storage. The QC team interacts with the files via Airtable; S3 is never in their workflow.
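A minimal sketch of the transit step, assuming `s3_client` is a boto3 S3 client (for example `boto3.client("s3")`). The function names, the `transit/` key prefix, and the 15-minute expiry are assumptions for illustration; the three client calls used are `put_object`, `generate_presigned_url`, and `delete_object`.

```python
import uuid


def stage_for_review(
    s3_client, bucket: str, filename: str, data: bytes, expires_in: int = 900
) -> tuple[str, str]:
    """Upload one file to the transit bucket and return (key, presigned URL)."""
    # A random path segment keeps same-named files from different folders apart.
    key = f"transit/{uuid.uuid4().hex}/{filename}"
    s3_client.put_object(Bucket=bucket, Key=key, Body=data)
    url = s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # seconds; long enough for both consumers to fetch
    )
    return key, url


def cleanup(s3_client, bucket: str, keys: list[str]) -> None:
    """Delete the transit objects once Airtable has ingested its own copies."""
    for key in keys:
        s3_client.delete_object(Bucket=bucket, Key=key)
```

The same URL can be handed to both consumers. Taking the client as a parameter also makes the step easy to unit-test with a stub.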

What I'd take to the next system

Three transferable principles.

Spec-driven beats filename-driven, when the data is available. When business logic in code looks fragile, check whether the data layer can hold it instead. The filename parser shrank to "compare to spec field" the moment the spec field was reachable. Applies broadly beyond QC.

Use deterministic code for what it does well; use AI vision for what it doesn't. Two-phase pipelines are clean when the line between the two phases is drawn cleanly. The skill is knowing which side each check belongs on, and being honest when the answer is "this needs the model, even though it costs more."

AI vision is augmentation, not replacement. The model finds the issues; the human reviews and decides. In this system, the approval status is never touched by code. A human reads the QC summary and sets the next status themselves. The separation matters for trust, accuracy, and accountability.
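One way to make that separation hard to break by accident is a guard in the write-back path. The sketch below is hypothetical: the field names `Status`, `QC Summary`, and `QC Files Failed` and the shape of `results` are assumptions, not the system's real schema.

```python
# Fields only a human reviewer may change (hypothetical field name).
HUMAN_ONLY_FIELDS = frozenset({"Status"})


def guard_update(fields: dict) -> dict:
    """Refuse any write-back payload that touches a human-only field."""
    touched = HUMAN_ONLY_FIELDS.intersection(fields)
    if touched:
        raise ValueError(f"code must not write: {sorted(touched)}")
    return fields


def build_approval_update(results: list[dict]) -> dict:
    """Summarize per-file results for the Approval record; findings only."""
    failed = [r["file"] for r in results if not r["passed"]]
    summary = f"{len(results) - len(failed)}/{len(results)} files passed QC"
    if failed:
        summary += "; needs review: " + ", ".join(failed)
    return guard_update({"QC Summary": summary, "QC Files Failed": len(failed)})
```

The payload carries findings only, so the next status always comes from a person reading the summary.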

Where it stands

Code-complete, tested, deployed to AWS Lambda. The trigger integration that wires it into the team's workflow is the last piece.

A delivery that used to take a QC artist hours runs through the pipeline in minutes. The artist still makes the call. But now they're reading findings instead of finding them.