17 Videos Later — The Optimization Maze of Local AI Video
Part 2 of 3 — Previously: Can Two 3090s Keep Up? / Next: Hitting the Ceiling
If the output is bad, you must have the wrong settings. Tweak the sampler, bump the step count, adjust CFG, and eventually the model will produce what you want. Right?
Sometimes that’s true. Sometimes the biggest improvement has nothing to do with settings at all.
After getting LTX-Video 2 to produce 4-second clips on the dual 3090s, I had a working pipeline but mediocre output. What followed was a 3-hour optimization marathon — 17 test videos, two critical bugs, one spectacular failure, and a breakthrough that came from the last place I expected.
The Two-Stage Hypothesis
The plan sounded great on paper: generate at half resolution (a quarter of the pixels per frame), spatially upscale 2x, then refine with the distilled LoRA in a second pass. Lower initial resolution means fewer pixels to compute, faster generation, and less VRAM pressure. The upscaler adds detail back. The refinement pass cleans up artifacts.
First two-stage tests — “a cat playing piano” at 512x320, upscaled to 1024x640:
| Test | Enhancement | CFG | Gen Time | Result |
|---|---|---|---|---|
| 16 | Off | 1.0 | 54s | First two-stage attempt |
| 17 | On (Qwen3:14b) | 1.0 | 45s | Same seed, enhanced prompt |
The output was… fine. Not better, not worse. The upscaler added resolution but not detail — smoother blurriness at a higher pixel count. I switched to a consistent test subject — “plastic action figure toy dancing on a desk” — and started systematic testing.
Two Bugs in Five Minutes
Test 20 introduced a major settings overhaul: CFG bumped to 4.0, STG and FetaEnhance nodes added, resolution adjusted to 960x544 with a proper 480x272 base stage. The res_2s sampler replaced the basic euler.
Test 21 immediately broke: the prompt enhancement step came back empty.
Bug #1: Ollama had updated its API. The /no_think flag I’d been appending to prompts to suppress Qwen3’s chain-of-thought output no longer worked. The fix: a "think": false parameter in the API request body. A one-line change that took 20 minutes to diagnose because the error was silent — Ollama just returned nothing instead of throwing an error.
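For reference, the request shape after the fix (a minimal sketch against Ollama’s /api/generate; the prompt here is just the earlier test subject, and stream: false keeps the reply in a single JSON object):
# "think": false replaces the old /no_think prompt flag on newer Ollama builds
curl -s http://localhost:11434/api/generate \
  -d '{"model": "qwen3:14b", "prompt": "Rewrite as a detailed video prompt: a cat playing piano", "think": false, "stream": false}'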
Bug #2: ComfyUI was holding all 48GB of VRAM even when idle. The text encoder (Gemma 3 12B FP8) on GPU 1 and the video model (LTX-2 19B FP8) on GPU 0 stayed loaded indefinitely. When Ollama tried to load Qwen3:14b for prompt enhancement, there was nowhere to put it.
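A quick way to watch the standoff on a dual-GPU box (plain nvidia-smi, nothing ComfyUI-specific):
# Per-GPU memory usage; handy for confirming which models are still resident
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv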
The fix for the VRAM problem:
# Force ComfyUI to unload its models and release cached VRAM
curl -X POST http://localhost:8188/free \
  -H "Content-Type: application/json" \
  -d '{"unload_models": true, "free_memory": true}'
ComfyUI’s /free endpoint dumps everything from VRAM. After enhancement, Ollama unloads its model, and ComfyUI reloads on the next generation request. That added ~15 seconds of model-loading overhead per video, but without it, generations either OOMed or Ollama couldn’t load Qwen3 at all.
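Put together, the per-video order looks roughly like this (a sketch; workflow_api.json is a placeholder for the exported API-format workflow with the enhanced prompt already spliced into its text node):
# 1. POST /free (above) so Qwen3:14b has room to load
# 2. Enhance the prompt via Ollama (the /api/generate call above); "keep_alive": 0
#    there makes Ollama drop Qwen3 immediately instead of waiting out its idle timeout
# 3. Queue the generation; ComfyUI reloads the text encoder and video model here
curl -s -X POST http://localhost:8188/prompt \
  -H "Content-Type: application/json" \
  -d "{\"prompt\": $(cat workflow_api.json)}"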
This is the kind of problem that doesn’t exist in a tutorial. Nobody’s YouTube demo mentions VRAM conflicts between inference engines. Running locally, you’re the one deciding which model gets which GPU, and when.
The CFG 4.0 Disaster
Test 22 was the first clean baseline: proper VRAM management, working prompt enhancement, all the new nodes in place. Generation time: 2:03 at 960x544. A real two-stage pipeline finally running end to end.
Test 23 was where I made the mistake. Stage 1 was using CFG 4.0 for the initial generation — reasonable for the full model. I thought: “Stage 2 refinement should also use higher guidance to add detail.” So I set Stage 2 CFG to 4.0 as well, with 8 refinement steps instead of the default 3.
Generation time jumped to 3:15. The output was chaos.
Stage 2 at CFG 4.0 doesn’t refine — it regenerates from scratch. The entire motion coherence from Stage 1 was destroyed. The model treated the upscaled video as noise to be overwritten rather than structure to be enhanced. It was like asking someone to touch up a painting and having them paint over it entirely.
Test 24 confirmed it: reverting Stage 2 to CFG 1.0 brought everything back. Generation time dropped to 1:57, and the output respected the Stage 1 structure.
| Test | Stage 2 CFG | Gen Time | Result |
|---|---|---|---|
| 22 | 1.0 (default) | 2:03 | Clean baseline |
| 23 | 4.0 | 3:15 | Destroyed — regenerated instead of refining |
| 24 | 1.0 (reverted) | 1:57 | Confirmed: Stage 2 must be gentle |
Key rule for anyone building two-stage video pipelines: refinement CFG must stay at 1.0. Higher values don’t “add detail” — they tell the model to ignore what’s already there.
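For background, this is just how classifier-free guidance works, nothing LTX-specific: each sampling step blends the conditional and unconditional predictions as prediction = uncond + CFG × (cond − uncond). At CFG 1.0 that collapses to the conditional prediction alone, so the refinement pass follows the prompt without amplifying anything and Stage 1’s structure survives. At 4.0, every step gets shoved hard toward the prompt, which is exactly the “ignore what’s already there” behavior Test 23 produced.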
The Breakthrough Nobody Expected
Tests 25-28 simplified the approach. I dropped the upscaler entirely and went back to single-stage generation at 768x512 with the full model: 30 steps, CFG 4.0, and the dpmpp_2m sampler.
Test 25 with Qwen3 prompt enhancement: decent. But Test 26 changed everything.
Instead of letting Qwen3 enhance the prompt, I wrote one manually:
“A small plastic action figure toy standing on a wooden desk, slowly moving its arms up and down. The toy is colorful with visible joints. Warm desk lamp lighting, shallow depth of field. Static tripod shot, 4K quality.”
The result: “Actually looks like an action figure.”
Twenty-six test videos. Two sessions. Two-stage pipelines, VRAM fixes, sampler changes, resolution experiments — and the single biggest improvement came from writing a better prompt by hand.
The problem with AI-enhanced prompts wasn’t the AI. It was that Qwen3 produced beautiful but vague prose — “soft golden-hour light filtering through lace curtains” — when what the video model needed was concrete, grounded instructions. Specific object. Specific action. Specific camera. Specific lighting. No poetry.
Test 27 confirmed it: a woman walking down a city sidewalk, manually prompted with concrete details, produced something that actually looked like a person walking — not a shapeshifting figure.
The Official Recipe
With prompting solved, I went back to settings — this time using the official Lightricks workflow instead of guessing.
| Component | What I Had | Official Lightricks |
|---|---|---|
| Stage 1 sampler | dpmpp_2m | res_2s |
| LoRA strength | 1.0 | 0.6 |
| Stage 2 scheduler | Simple | ManualSigmas (hand-tuned noise schedule for refinement) |
| GPU optimization | Basic split | SequenceChunkedBlock (spills to GPU 1 when GPU 0 runs out of memory) |
| Resolution | 768x512 | 768x512 → 1536x1024 (2x spatial upscale) |
Test 31 implemented everything: both stages using the res_2s sampler, LoRA strength at 0.6 instead of 1.0, a hand-tuned noise schedule for Stage 2 refinement, and a chunking strategy that uses the second GPU as overflow memory. Output resolution jumped to 1536x1024 — four times the pixel count of the 768x512 single-stage runs.
Test 32 pushed CFG to 4.5 for tighter prompt adherence. A knight in silver armor on a dark wooden table, studio lighting, product photography framing. Best output of the session — and this time, the settings actually mattered because the prompt was already doing its job.
What 17 Videos Taught Me
I started this session convinced it was a settings problem. Two-stage pipelines, sampler swaps, CFG tuning, resolution experiments — the kind of knob-turning that feels like progress.
Some of it was real. Fixing the VRAM conflicts was necessary. Learning that Stage 2 CFG must stay at 1.0 saved me from destroying every future generation. The official Lightricks workflow pushed quality up at the end.
But the single biggest jump — the one that turned shapeshifting blobs into recognizable subjects — was writing a prompt by hand instead of letting an AI do it. Twenty-six tests to learn that the model doesn’t need poetry. It needs instructions.
The output was sharper, the pipeline was solid, the settings were dialed. But the motion was still wooden — subjects moved like they were underwater, surfaces shimmered like they couldn’t decide what material they were made of. YouTube creators were posting smoother results with the same model. Someone had figured something out. Time to find out what.