24 Hours in the Trenches of Speedrun Data
Over the last day SrTDb went from five planned games to an autonomous pipeline crawling a 307-game library. Here's what actually held up, and what didn't.
The pipeline that worked
The winning pattern was a division of labor. A local LLM (qwen3:8b on an RTX 3060) reads every caption and casts a wide net for trick mentions; then the verifier signs off only on candidates backed by two genuinely independent sources. The local model is cheap and tireless but noisy: 91–97% of what it extracts is junk, such as bare level names, "cat jam", and commentary fragments. The magic isn't the extraction; it's the filter. Requiring ≥2 distinct video sources before a careful verifier even looks collapses thousands of drafts into a handful of real tricks. We verified 40 cited tricks across the top five games this way.
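The two-source filter is simple enough to sketch. This is a minimal illustration, not the pipeline's actual code: the `corroborated` function, the `(trick, video_id)` tuple shape, and the sample data are all assumptions made for the example. It also treats "distinct video ID" as a stand-in for "independent source", which a real verifier would need to tighten (re-uploads and same-channel videos are not truly independent).

```python
from collections import defaultdict


def corroborated(mentions, min_sources=2):
    """Keep only trick names mentioned in at least `min_sources` distinct videos.

    `mentions` is an iterable of (trick_name, video_id) pairs, a hypothetical
    shape for whatever the extraction step emits.
    """
    sources = defaultdict(set)
    for trick, video_id in mentions:
        # Normalize case and whitespace so trivial variants collapse together.
        key = " ".join(trick.lower().split())
        sources[key].add(video_id)
    return {
        trick: sorted(videos)
        for trick, videos in sources.items()
        if len(videos) >= min_sources
    }


# Made-up sample data: one trick seen in two videos, one piece of junk seen once.
mentions = [
    ("Zip Glitch", "vidA"),
    ("zip glitch", "vidB"),
    ("cat jam", "vidA"),
    ("zip  glitch", "vidA"),  # a repeat within the same video adds no new source
]
candidates = corroborated(mentions)
```

Only "zip glitch" survives here, because repeats within one video never count as a second witness.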
Rate limits are the real boss
YouTube doesn't care how polite your per-request delay is. The 429 we hit wasn't about spacing. It was cumulative per-IP load: two fresh-game harvests in one hour, each enumerating thousands of search candidates plus audio downloads. The lesson: budget the window, not the request. Harvest one game at a time, space them hours apart, back off exponentially. Patience is a feature, not a workaround.
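"Budget the window, not the request" could look roughly like the sketch below. Everything in it is an assumption for illustration: the `WindowBudget` class, the `backoff_delays` helper, and every number (request caps, window length, base delay). YouTube's real thresholds are not known to us, so these values are placeholders to be tuned from observation.

```python
import time


class WindowBudget:
    """Caps total requests inside a sliding time window.

    This limits cumulative load, which per-request delays alone do not.
    """

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the budget can be tested without sleeping
        self.stamps = []

    def try_spend(self, cost=1):
        """Record `cost` requests if the window has room; otherwise refuse."""
        now = self.clock()
        # Forget requests that have aged out of the window.
        self.stamps = [t for t in self.stamps if now - t < self.window_seconds]
        if len(self.stamps) + cost > self.max_requests:
            return False
        self.stamps.extend([now] * cost)
        return True


def backoff_delays(base=60.0, factor=2.0, cap=3600.0, attempts=6):
    """Seconds to wait after each successive 429, growing exponentially up to `cap`."""
    return [min(cap, base * factor**i) for i in range(attempts)]
```

A harvester would call `try_spend()` before each request and pause the whole game harvest when it returns `False`, instead of shaving the per-request delay.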
Identity is harder than it looks
Is a video actually a speedrun of this game? Sounds trivial; it's a minefield. "Super Star Wars: The Empire Strikes Back" (SNES) shares its subtitle with a completely different NES game; the only text clue is the word "Super." Sonic 2 on Genesis isn't Sonic 2 on Master System, and the string "Sonic the Hedgehog 2" happily matches "Sonic the Hedgehog 2006." It took a franchise-prefix and subtitle match, platform guards, and the operator's sharpest insight: a video already tagged for one game is almost certainly a false positive for another. Cross-referencing flags across games caught contamination nothing else did.
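Here is one way those guards could fit together, as a rough sketch and not the matcher we actually run. The `Game` record, the `PLATFORM_WORDS` table, the game IDs, and `match_game` itself are hypothetical names invented for the example. The idea is whole-phrase matching (so "2" never matches "2006"), preferring the longer of two nested titles, rejecting platform contradictions, and returning `None` whenever more than one game is still plausible.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Game:
    game_id: str
    title: str
    platform: str


# Hypothetical keyword table; a real one would cover far more platforms.
PLATFORM_WORDS = {
    "genesis": {"genesis", "mega drive"},
    "sms": {"master system", "sms"},
    "snes": {"snes", "super nintendo"},
    "nes": {"nes"},
}


def normalize(text):
    """Lowercase and turn every run of punctuation or spaces into one space."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def has_phrase(haystack, phrase):
    """True if `phrase` appears as whole words, so '2' does not match '2006'."""
    pattern = rf"(?<![a-z0-9]){re.escape(phrase)}(?![a-z0-9])"
    return re.search(pattern, haystack) is not None


def match_game(video_title, games, already_tagged=None):
    """Return the one game the title plausibly belongs to, or None to leave it flagged."""
    text = normalize(video_title)
    titles = {g: normalize(g.title) for g in games}
    hits = [g for g in games if has_phrase(text, titles[g])]
    # Subtitle guard: drop a hit whose title sits inside a longer hit's title.
    hits = [
        g for g in hits
        if not any(
            titles[h] != titles[g] and has_phrase(titles[h], titles[g])
            for h in hits
        )
    ]
    # Platform guard: if the title names a platform, the game must be on it.
    named = {
        platform
        for platform, words in PLATFORM_WORDS.items()
        if any(has_phrase(text, w) for w in words)
    }
    if named:
        hits = [g for g in hits if g.platform in named]
    # Cross-game guard: a video already tagged for another game is rejected.
    if already_tagged is not None:
        hits = [g for g in hits if g.game_id == already_tagged]
    return hits[0] if len(hits) == 1 else None


games = [
    Game("sw-snes", "Super Star Wars: The Empire Strikes Back", "snes"),
    Game("sw-nes", "Star Wars: The Empire Strikes Back", "nes"),
    Game("s2-gen", "Sonic the Hedgehog 2", "genesis"),
    Game("s2-sms", "Sonic the Hedgehog 2", "sms"),
]
```

With this sample list, a bare "Sonic the Hedgehog 2 any%" title returns `None`, because nothing in the text separates the Genesis and Master System games. That is deliberate: an ambiguous title stays flagged instead of being guessed.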
Vision was a confident liar
The obvious fix for ambiguous titles, "just look at the gameplay," failed. The local 7B vision model called an 8-bit Master System frame "Genesis, Super Mario World" at 0.95 confidence. Overconfident and wrong is worse than honestly uncertain. Real disambiguation will come from OCR of marathon overlays, which literally print the game's name on screen, not from asking a model to recognize a game from pixels.
The unglamorous truth
The biggest blind spot isn't any algorithm. It's that 93% of a popular game's candidate videos have no title metadata at all, and most games were never harvested at the source. So the highest-leverage move turned out to be the least exciting one: collect a thorough baseline per game first (the registered-run list, platforms, run lengths) and verify everything against it.
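To make "verify everything against it" concrete, here is a small sketch of a baseline check. The `Baseline` fields, the `baseline_problems` function, the 25% slack, and the reason labels are all assumptions for illustration; the real baseline schema may look quite different. The check only reports disagreements and leaves the decision to a later step or a human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    platforms: frozenset   # platforms that have registered runs
    shortest_run_s: int    # fastest registered run, in seconds
    longest_run_s: int     # slowest registered run, in seconds


def baseline_problems(duration_s, platform, baseline, slack=0.25):
    """List the ways a candidate video disagrees with the baseline (empty = consistent).

    `platform` may be None when it is unknown; in that case it is not checked.
    """
    problems = []
    if platform is not None and platform not in baseline.platforms:
        problems.append("platform")
    # A video much shorter than the fastest run cannot contain a full run.
    if duration_s < baseline.shortest_run_s * (1 - slack):
        problems.append("too_short")
    # A video much longer than the slowest run is probably a marathon or compilation.
    if duration_s > baseline.longest_run_s * (1 + slack):
        problems.append("too_long")
    return problems
```

For a baseline of 15 to 60 minute Genesis runs, a five-minute Master System video would come back with both `"platform"` and `"too_short"`.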
If there's a single meta-lesson, it's that the hard part of automation is rarely the model; it's knowing what to trust. Trust two independent witnesses over one confident one. Trust a registered run over a matching title. Trust the rate limiter over your impatience. And when a system is uncertain, the honest move is to leave it flagged for a human, not to guess.
Twenty-four hours in: 307 games classified, ~4,800 flags, a durable harvester grinding away, and a clear map of where the hard problems actually live. Onward.