Engineering for a production Airtable system (not building one)

At Epic Games, the studio I'm contracting with, one of the Airtable automations I maintain started timing out in production. The script was loading all 45,000 Task Assignment records on every trigger to find the 1-5 records relevant to a given Task. It worked fine in development. It failed silently at scale. The fix was a rewrite: read the Task's linked Task Assignments field to get those 1-5 records directly instead of scanning the whole table.

That kind of rewrite, same logic, different shape, designed for production scale, is most of what the work is, day to day.

The system

Two large internal teams at the studio handle their creative-production work through an Airtable-based project-management system. Around 200,000 records across the core tables (Tasks, Task Assignments, Timesheets). Hundreds of users. 24/7 concurrent load. Airtable's automation runtime limit is 30 seconds, and a script that stalls past it fails silently.

My engineering work on this system is keeping it running. Writing new automations when a workflow needs one. Fixing the automations that are starting to stall at scale. Building admin tooling for the operations that aren't Airtable-native (contractor offboarding, bulk timesheet cleanups). Running one-off resolution projects when data drifts.

It's not "build an app, ship it, move on." There's no version 1.0. There's a continuously running production system, and the work is keeping it correct and performant under load.

The shift this work represents

For most of my career, the engineering work has been building applications. A new product. A feature. A refactor. Discrete shipped artifacts. This work at Epic is a different shape, and it took me a while to recognize that as its own thing worth naming.

This is the engineering mode of most mature production systems. Big org, big system, mostly working, always one quiet incident away from a problem. The work isn't "what's the new feature." It's recognizing the mode and operating well in it. Write code at production scale. Use the system's own structure to find what you need. Treat hard runtime constraints as design constraints, not quirks.

Most engineering writing celebrates app building. The maintenance and support shape gets undersold. Both modes are real. The difference matters.

What I actually build

Four categories of work:

Production automations. Triggered by record changes (new Task Owner, end date pushed out, approval status update) or scheduled (weekly batches for things like timesheet generation). They run inside Airtable's 30-second runtime budget, every time. Every script has a DRY_RUN flag at the top by default: Airtable's built-in "Test automation" feature runs against live data, not a sandbox, so a code-level dry-run is the only safe way to verify behavior before going live.
CLI backfill tools. Standalone scripts for catch-up operations: finding tasks with missing dependencies, generating records that should exist but don't. Run from the command line with explicit --live flags so nothing happens accidentally.
Shared scripting extensions. Admin tools that run inside the Airtable interface. Designed to work in either base via runtime detection, so the same script ships to both. Example: a contractor offboarding cleanup that previews future timesheets and task assignments to delete before the admin clicks Delete.
One-off resolution projects. Self-contained cleanup efforts when data drifts. The most recent one fixed about 35,000 broken timesheet links after an upstream system migration. Analysis scripts, resolution tooling, cached data, and run logs all preserved in the repo as a pattern reference for future work.
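The dry-run-by-default idea behind both the DRY_RUN flag and the --live flag can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the actual tooling: find_missing_records, run_backfill, the writer callback, and the in-memory records are all hypothetical names invented for the example. (Airtable automation scripts themselves are written in JavaScript; the pattern is the same.)

```python
import argparse


def find_missing_records(tasks, assignments):
    """Return IDs of tasks with no assignment record (hypothetical rule)."""
    assigned = {a["task_id"] for a in assignments}
    return [t["id"] for t in tasks if t["id"] not in assigned]


def run_backfill(tasks, assignments, live, writer):
    """Plan the backfill; only call writer when live is True."""
    missing = find_missing_records(tasks, assignments)
    for task_id in missing:
        if live:
            writer(task_id)  # the only line that changes any data
        else:
            print(f"[dry run] would create assignment for {task_id}")
    return missing


def parse_args(argv=None):
    """Writes are opt-in: without --live the tool only reports."""
    parser = argparse.ArgumentParser(description="Backfill sketch")
    parser.add_argument("--live", action="store_true",
                        help="actually write records (default: dry run)")
    return parser.parse_args(argv)
```

Because the flag defaults to off, forgetting it produces a report rather than a write. The opposite design, a --dry-run flag, makes the dangerous path the default.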

The performance constraints are real

The system at the time of writing:

~54,000 Task records
~45,000 Task Assignments
~80,000 Timesheet records
Hundreds of users, 24/7 concurrent load
30-second automation runtime limit, with no longer-running tier available

The hard rules every script follows:

Never load a full table when a targeted query will do. Fetch by recordIds or filterByFormula only.
Use the system's own structure to find related records. A Task already has a linked field pointing at its Task Assignments. Read that link to get 1-5 records. Don't scan the 45,000-row Task Assignment table looking for matches.
Set-based lookups, not nested array scans. An O(1) lookup per item instead of a linear scan per item turns O(n²) work into O(n), which makes the difference between completing in 2 seconds and timing out at 30.
Every script is designed for 100,000 records and 500 concurrent triggers. That is the test before any change ships.

These aren't guidelines. They're load-bearing constraints. A script that ignores them will fail silently in production, and "silently" is the part that matters. The runtime kills the script mid-execution, the user sees nothing, and the work the script was supposed to do quietly stops happening.
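To make the set-based rule concrete, here is a small Python sketch comparing the two shapes. The field name assignment_id and the sample records are assumptions made up for illustration; only the technique is the point.

```python
def match_with_list(timesheets, active_ids):
    """Nested scan: each membership test walks the list, so the
    total work is len(timesheets) * len(active_ids) comparisons."""
    return [t for t in timesheets if t["assignment_id"] in active_ids]


def match_with_set(timesheets, active_ids):
    """Set-based: build the set once, then each membership test is
    O(1) on average, so the total work is one pass over each input."""
    active = set(active_ids)
    return [t for t in timesheets if t["assignment_id"] in active]


# Hypothetical sample data.
timesheets = [
    {"id": "ts1", "assignment_id": "a1"},
    {"id": "ts2", "assignment_id": "a9"},
    {"id": "ts3", "assignment_id": "a2"},
]
active_ids = ["a1", "a2", "a3"]

matches = match_with_set(timesheets, active_ids)
```

Both functions return the same records. With three rows the difference is invisible; with tens of thousands of rows on each side, the list version does billions of comparisons and the set version does not.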

What I'd take to the next system

Three transferable principles from this work.

Design for production scale, on every script. Not just the big ones. Every change gets the "would this hold up at 100K records and 500 concurrent triggers" treatment before it ships. A script that works in dev and dies in prod is a bug in the design, not an unforeseen scale problem.

Use the system's own structure. Walk linked fields. Don't scan sibling tables for what's already pointed to from the record you're holding. True for Airtable, true for SQL databases with foreign keys, true for any data model with explicit relationships. The system already knows how its pieces connect. The script should use that.
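A minimal Python sketch of the difference, using plain dictionaries as a stand-in for two linked tables. The record shapes and the assignment_ids field are hypothetical. In an Airtable script, the equivalent move would be reading the linked field's cell value and fetching only those record IDs.

```python
# Hypothetical stand-in for a Task Assignments table, keyed by record ID.
ASSIGNMENTS = {
    "asg1": {"id": "asg1", "task_id": "tsk1"},
    "asg2": {"id": "asg2", "task_id": "tsk2"},
    "asg3": {"id": "asg3", "task_id": "tsk1"},
}

# A Task record that already carries links to its assignments.
TASK = {"id": "tsk1", "assignment_ids": ["asg1", "asg3"]}


def assignments_by_scan(task, assignments):
    """Anti-pattern: touches every row in the table to find a few."""
    return [a for a in assignments.values() if a["task_id"] == task["id"]]


def assignments_by_link(task, assignments):
    """Follow the task's own links: cost is proportional to the
    number of linked records, not the size of the table."""
    return [assignments[aid] for aid in task["assignment_ids"]]
```

Both functions return the same records for this data, but only the second keeps its cost flat as the assignments table grows.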

Hard runtime constraints are design constraints. Airtable's 30-second automation timeout dictates most of the architecture in this work. Same shape as AWS Lambda's 15-minute limit, or any platform with a wall clock. The constraint is a design input, not a quirk to work around.
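One way to treat a wall clock as a design input is to have the script track its own budget and stop cleanly before the platform kills it. The Python sketch below illustrates that idea; it is not a description of the production scripts. process_within_budget, the 25-second default, and the injectable clock are assumptions chosen for the example.

```python
import time


def process_within_budget(items, handle, budget_seconds=25.0,
                          clock=time.monotonic):
    """Process items until the time budget is spent.

    Returns (done, remaining) so the caller can log or reschedule
    the leftover work instead of being cut off mid-run.
    """
    deadline = clock() + budget_seconds
    done = []
    for index, item in enumerate(items):
        if clock() >= deadline:
            # Out of budget: stop here and hand back what is left.
            return done, list(items[index:])
        handle(item)
        done.append(item)
    return done, []
```

Leaving headroom below the hard limit (25 seconds against a 30-second cap in this sketch) gives the script time to record what it did not finish, which turns a silent failure into a visible one.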

Where it stands

All production automations are running. Both bases. Most have been rewritten at least once to meet scale constraints, and the current versions hold. The CLI backfills are complete; their main use now is catching up when data drifts again. The big one-off resolution project is done.

New automations and fixes get added as workflows evolve. That's the shape of the work. There's no version 1.0 to ship.

Building an app is one engineering mode. Keeping one running at scale is another. Both are real, and recognizing which mode you're in changes the right next move.