VeePeenini, Part 9: The Plan, and All the Ways It Was Wrong | Vitor Pontual

On paper the plan was one sentence: copy the database from the old box to the new one, point the front door at the new box, done. Everything that mattered lived in the gap between that sentence and a version I would actually trust with my friends’ data. Closing that gap is most of what this entry is.

The boring problems that were actually the dangerous ones

The tunnel wasn’t only the app’s. The Cloudflare Tunnel in front of the game also carried a couple of my other personal services. “Just move the tunnel to the new box” would have quietly knocked those over too. So the call was to give the app its own dedicated tunnel, and at switch time repoint only the app’s address to it, leaving everything else untouched.

The new box is a different computer. Because the second machine is ARM and the home server is x86, you can’t copy the built app across; the image has to be rebuilt for the new chip. Easy once you know it, a confusing failure if you don’t.

A missing key, caught by being boring. When we diffed the new box’s configuration against the old one, the keys that sign push notifications were absent. Flip without noticing and notifications would have died silently, and “silently” is the worst word that can appear in that sentence. Nobody would have reported it; it would just slowly stop working. Diffing two config files line by line is dull work. Dull work is what catches this.
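The dull check is also a small one. Here is a minimal sketch of that kind of diff, assuming both boxes keep their settings in plain KEY=VALUE env files. The key names and values are invented for illustration and are not the app's real configuration.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        entries[key.strip()] = value.strip()
    return entries


def missing_keys(old_text: str, new_text: str) -> list[str]:
    """Keys present in the old config but absent from the new one."""
    old, new = parse_env(old_text), parse_env(new_text)
    return sorted(set(old) - set(new))


# Hypothetical configs: the new box lacks the push-signing keys.
OLD_ENV = """
DATABASE_URL=postgres://old
PUSH_PUBLIC_KEY=abc
PUSH_PRIVATE_KEY=def
"""
NEW_ENV = """
DATABASE_URL=postgres://new
"""

for key in missing_keys(OLD_ENV, NEW_ENV):
    print(f"missing on new box: {key}")
```

Comparing key names rather than whole lines keeps the noise down: values such as hostnames are expected to differ between boxes, but a key that exists on one side only is almost always a mistake.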

The scoring jobs, not just the app. The app is not only a website. A set of background jobs does the real game work. One freezes the betting odds the instant a match kicks off, because the scoring math depends on those frozen odds. Another grades the match when it ends and hands out the points and the loot packs. Those jobs had to move too, and be running on the new box before the next kickoff, or points would come out wrong with no error anywhere to warn me. Moving “the app” was the easy eighty percent. These jobs were the dangerous twenty.

A writer that bypassed the app. One nightly job writes standings history straight into the database instead of going through the app. That punched a hole in my first instinct, which was “just stop the app and writes stop.” Not quite. That job had to be stopped by hand as well, or it could scribble a new row into the old database after I’d already taken my copy.

How do I prove nothing was lost?

Not “believe,” prove. Row counts are a trap here: two tables can have the same number of rows while the contents are quietly different. So we verified with content checksums: an md5 fingerprint of the actual data in every table that holds something a player cares about, meaning the points, the stickers, and the trades.

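-- a single fingerprint of every prediction's points and graded state
-- coalesce guards a nullable column: NULL || text is NULL, and string_agg skips NULL inputs
md5(string_agg(id || ':' || coalesce(cast(points_awarded AS text), 'NULL') || ':' || is_graded, ',' ORDER BY id))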

Compute that on both boxes. If the fingerprints match, the fingerprinted columns hold the same values in every row, not just the same number of rows. That one idea is what later let me tell my friends “nothing was lost” and actually mean it.
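To see why a count alone proves nothing, here is a minimal Python sketch of the same idea outside the database. It assumes rows arrive as (id, points_awarded, is_graded) tuples. The sample rows are invented, and the way each value is rendered as text is my own choice, so these digests will not match what the SQL produces.

```python
import hashlib


def fingerprint(rows):
    """md5 over rows sorted by id, joined as 'id:points:graded' with commas."""
    parts = []
    for row_id, points, graded in sorted(rows, key=lambda r: r[0]):
        # Render None explicitly so a missing value still changes the digest.
        points_text = "NULL" if points is None else str(points)
        parts.append(f"{row_id}:{points_text}:{graded}")
    return hashlib.md5(",".join(parts).encode("utf-8")).hexdigest()


# Invented data: same row count on both boxes, but one value differs.
old_box = [(1, 10, True), (2, 0, True), (3, None, False)]
new_box = [(3, None, False), (1, 10, True), (2, 5, True)]

same_count = len(old_box) == len(new_box)
same_content = fingerprint(old_box) == fingerprint(new_box)
print(same_count, same_content)  # prints: True False
```

Sorting by id before hashing matters as much as the hash itself: without a fixed order, two identical tables could produce different fingerprints simply because the rows came back in a different sequence.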

The hardest call was when, not how

The technical steps were nearly settled. The genuinely hard decision was timing.

My first idea was to do it at halftime. The live card-drop windows I built back in Part 5 mean nothing is claimable during the break, so halftime looked like a natural dead zone. It wasn’t. Two things killed it. At halftime the match isn’t graded yet, so the new box would have had to settle its first real match in production with everyone watching. And halftime is exactly when people pick their phones back up, since nothing is happening on the pitch. The quiet was an illusion.

So we landed on the gap between games, after a match had completely finished and its points were already in. Then the new box inherits a clean, settled, verified state, and the only thing ahead of it is the next game.
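That rule is simple enough to write down as a check. This is a sketch only, assuming each match has a kickoff time and a graded flag. The Match type, its field names, and the 30-minute safety margin are all hypothetical and not taken from the app.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Match:
    kickoff: datetime
    is_graded: bool


def safe_to_cut_over(matches, now, margin=timedelta(minutes=30)):
    """True when every started match is graded and no kickoff is within the margin."""
    for match in matches:
        started = match.kickoff <= now
        if started and not match.is_graded:
            return False  # live or ungraded match, which includes halftime
        if not started and match.kickoff - now < margin:
            return False  # next kickoff is too close to swap calmly
    return True


# Invented schedule for illustration.
now = datetime(2024, 6, 1, 18, 0)
finished = Match(kickoff=now - timedelta(hours=3), is_graded=True)
halftime = Match(kickoff=now - timedelta(minutes=50), is_graded=False)
upcoming = Match(kickoff=now + timedelta(hours=2), is_graded=False)

print(safe_to_cut_over([finished, upcoming], now))            # prints: True
print(safe_to_cut_over([finished, halftime, upcoming], now))  # prints: False
```

Halftime fails the first condition by construction: the match has started and is not graded, however quiet the app looks.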

Turning “please don’t” into “can’t”

The decision that made the whole thing safe rather than merely likely: take the front door down. Instead of asking everyone to please not touch anything for a few minutes and hoping they comply, I had us stop the app entirely during the swap. It’s ugly, because the app is briefly unreachable. But it converts “please don’t” into “can’t,” for every person and every kind of action at once. That short blackout is the only window in which both databases are provably identical, which is the entire reason it exists. The downtime is not a flaw in the plan. It is the plan.

Then we rehearsed it for real: dump the database to a file, restore it onto the new box, and time the whole thing. It came back in about three seconds, and the restore didn’t even require stopping the new box’s app first. Knowing that before the live attempt is the whole difference between confidence and hope.
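A rehearsal like that is easier to trust when the timing is recorded per step rather than eyeballed. Below is a minimal sketch of a timing harness. The step functions are placeholders that only sleep; in a real rehearsal they would invoke the actual dump and restore commands, which are not shown here.

```python
import time


def timed_rehearsal(steps):
    """Run (name, callable) steps in order; return per-step and total seconds."""
    timings = []
    start = time.perf_counter()
    for name, step in steps:
        step_start = time.perf_counter()
        step()  # any exception stops the rehearsal at the failing step
        timings.append((name, time.perf_counter() - step_start))
    return timings, time.perf_counter() - start


# Placeholder steps standing in for the real dump and restore.
def fake_dump():
    time.sleep(0.01)


def fake_restore():
    time.sleep(0.02)


timings, total = timed_rehearsal([("dump", fake_dump), ("restore", fake_restore)])
for name, seconds in timings:
    print(f"{name}: {seconds:.3f}s")
print(f"total: {total:.3f}s")
```

Letting an exception abort the run is deliberate: a rehearsal that keeps going after a failed dump would time a restore of nothing.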

The plan was finally one I trusted. Running it is Part 10.