AI Video Generation at Home — Can Two 3090s Keep Up?
Part 1 of 3 — Next: The Optimization Maze and Hitting the Ceiling
OpenAI’s Sora generates cinematic video from text. Runway does it in seconds. Google’s Veo 2 produces footage that’s hard to tell apart from real camera footage. These are enterprise models running on warehouse-scale GPU clusters — thousands of H100s at $30,000 each, racks of them linked together in facilities that cost more than most buildings. When you see a 10-second clip of a woman walking through Tokyo and think “that looks real,” you’re looking at the output of hundreds of millions of dollars of infrastructure.
That makes AI video look like a solved problem. It isn’t. Not for the rest of us.
I wanted to find out what happens when you try to close that gap with two gaming GPUs and a Linux box in your office.
This is the story of getting LTX-Video 2 running on a dual RTX 3090 server. The VRAM arithmetic that almost killed the project before it started, the multi-GPU problems nobody warned about, and the moment the first video came out looking like a fever dream instead of a cat.
The Setup
Hardware:
- AI server: Dual NVIDIA RTX 3090 (24GB VRAM each, 48GB total), Linux
- Daily driver: NVIDIA RTX 4080 (16GB VRAM)
- Both machines on the same network, shared storage via NFS
Software:
- ComfyUI — node-based inference engine
- LTX-Video 2 — 19B parameter open-source video model by Lightricks
- Ollama with Qwen3:14b — local LLM for prompt enhancement
LTX-Video 2 was the target because it’s the fastest open-source video model in early 2026. The official benchmarks claimed 2-3 second generation times on an H100. Consumer hardware would be slower, but it seemed realistic to get clips in under a minute.
The VRAM Crisis
The first surprise: LTX-Video 2 doesn’t use CLIP or T5 for text encoding like older video models. It requires Gemma 3 12B — Google’s 12-billion parameter language model — as the text encoder.
Here’s the VRAM math that stopped me cold:
| Component | VRAM |
|---|---|
| LTX-2 19B model (FP8 quantized) | ~27 GB |
| Gemma 3 12B text encoder (BF16) | ~23 GB |
| VAE + working memory | ~4-6 GB |
| Total | ~54 GB |
My daily driver has 16GB. Not even close. Even the dual 3090 server at 48GB was technically short.
The fix came from quantization. An FP8 version of Gemma 3 12B — gemma_3_12B_it_fp8_scaled.safetensors — cuts the encoder from 23GB to 13GB with negligible quality loss for text encoding. That brought the total to ~44GB, which fits in 48GB with a few gigabytes of headroom.
Key learning for anyone attempting this: FP8 quantization on the text encoder is essentially free. The encoder’s job is converting text to embeddings — it doesn’t need full precision for that. Save your VRAM budget for the generative model.
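The arithmetic is easy to sanity-check. A minimal sketch (the helper function is mine, not from any of the tools involved); the measured figures above come out a couple of gigabytes higher than the raw weights because of embeddings kept at higher precision, activations, and framework overhead:

```python
def weights_gib(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight footprint in GiB: parameter count times bytes per weight."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

# Gemma 3 12B as the LTX-2 text encoder:
print(f"BF16: {weights_gib(12, 2):.1f} GiB")  # ~22 GiB of raw weights, ~23 GB loaded
print(f"FP8:  {weights_gib(12, 1):.1f} GiB")  # ~11 GiB of raw weights, ~13 GB loaded
```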
Multi-GPU: The Part Nobody Talks About
ComfyUI has a generic multi-GPU plugin. I tried it. It loaded the model, then immediately threw shape mismatch errors trying to split the LTX-2 architecture across devices.
The problem: LTX-Video 2’s transformer architecture has internal dependencies that generic model-splitting doesn’t handle. The community fix came from a purpose-built extension: ComfyUI-LTX2-MultiGPU by DreamFast, designed specifically for this model’s architecture.
The final GPU layout:
| GPU | Components | VRAM Used |
|---|---|---|
| cuda:0 | LTX-2 19B checkpoint + VAE | ~30 GB |
| cuda:1 | Gemma 3 12B FP8 encoder | ~13 GB |
This is the part the YouTube tutorials gloss over. They show results on single H100s or A100s with 80GB of VRAM. If you’re running consumer GPUs, multi-GPU support isn’t a convenience — it’s mandatory, and it requires model-specific tooling.
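Conceptually, the split is simple: one model per GPU, with only activations crossing the bus. The sketch below is a toy illustration of that idea, not the ComfyUI-LTX2-MultiGPU API; the nn.Linear layers are stand-ins for the real models.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the sketch runs on any dual-GPU box; in the real workflow these
# are the LTX-2 transformer (cuda:0) and the Gemma 3 FP8 encoder (cuda:1).
text_encoder = nn.Linear(512, 4096).to("cuda:1")
transformer = nn.Linear(4096, 4096).to("cuda:0")

@torch.no_grad()
def encode_then_handoff(tokens: torch.Tensor) -> torch.Tensor:
    emb = text_encoder(tokens.to("cuda:1"))   # text encoding stays on GPU 1
    # transformer(...) stands in for the denoising loop on GPU 0;
    # only the small embedding tensor moves between cards.
    return transformer(emb.to("cuda:0"))
```

The point of the split is that the 19B and 12B weight sets never share a card; only a few megabytes of embeddings cross the PCIe bus per generation.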
Automating the Pipeline
Clicking through ComfyUI’s node editor for every generation wasn’t going to scale. I wrote a CLI tool — vidgen "a cat playing piano" — that handles prompt enhancement (Ollama Qwen3:14b expands short prompts into detailed scene descriptions), VRAM choreography, generation via ComfyUI’s API, and output collection.
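The enhancement step is one call to Ollama’s local REST API. A minimal sketch, assuming the default port and the qwen3:14b tag; the instruction wording here is illustrative, not vidgen’s actual prompt:

```python
import requests

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # Ollama's default local endpoint

def enhance(short_prompt: str) -> str:
    """Expand a terse prompt into a detailed scene description with Qwen3 14B."""
    resp = requests.post(OLLAMA_URL, json={
        "model": "qwen3:14b",
        "prompt": ("Rewrite this as a detailed video scene description, covering "
                   "subject, motion, camera, and lighting: " + short_prompt),
        "stream": False,   # one JSON response instead of a token stream
        "keep_alive": 0,   # unload Qwen3 right after; the VRAM is needed elsewhere
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"]
```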
The VRAM choreography matters more than it sounds. ComfyUI loads models into VRAM and holds them indefinitely. Ollama does the same. With 48GB shared between both, I had to force ComfyUI to unload (/free endpoint) before Ollama could load, then reverse the process for generation. Without this dance, every run either OOMed or Ollama couldn’t load its model.
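A sketch of that dance, assuming a local ComfyUI instance on its default port 8188. ComfyUI’s /free endpoint and Ollama’s keep_alive field (used above) do the unloading; the ordering is the part you have to get right:

```python
import requests

COMFY_URL = "http://127.0.0.1:8188"  # default ComfyUI port; adjust for your server

def free_comfyui_vram() -> None:
    """Ask ComfyUI to unload cached models and release VRAM so Ollama can load."""
    requests.post(f"{COMFY_URL}/free",
                  json={"unload_models": True, "free_memory": True}, timeout=60)

def queue_generation(workflow: dict) -> str:
    """Submit an API-format workflow; ComfyUI reloads LTX-2 and Gemma on demand."""
    resp = requests.post(f"{COMFY_URL}/prompt", json={"prompt": workflow}, timeout=60)
    resp.raise_for_status()
    return resp.json()["prompt_id"]

# Order matters on 48GB: free ComfyUI, run the LLM (keep_alive=0 unloads it),
# then queue the generation so the video models can come back into VRAM.
```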
The First 15 Videos Were Terrifying
Generation time: 36-54 seconds per clip at 768x512, 97 frames (4 seconds at 24fps). Incredibly fast for local hardware.
The videos themselves: morphing geometry, liquid textures, subjects that shifted between human and animal mid-frame. The kind of output that makes you check the model card to see if you downloaded the right file.
I spent an hour blaming the FP8 encoder, the multi-GPU split, VRAM pressure. None of that was the problem.
The Real Problem: Wrong Settings for the Wrong Model
LTX-Video 2 ships in two variants:
- Base model (ltx-2-19b-dev-fp8.safetensors) — needs 20-30 steps, CFG 3.0-4.0
- Distilled LoRA (ltx-2-19b-distilled-lora-384.safetensors) — a student model trained to produce good results in 8 steps at CFG 1.0
I was running the distilled LoRA with base model settings: 20 steps, CFG 4.0. The model was trained to converge in 8 steps — forcing it through 20 was like overcooking it. Every additional step degraded the output.
The fix:
| Setting | Wrong (what I used) | Correct (for distilled) |
|---|---|---|
| Steps | 20 | 8 |
| CFG | 4.0 | 1.0 |
| Sampler | euler | euler |
| LoRA strength | 0.6 | 0.6 |
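In practice the fix was patching two values in the API-format workflow the CLI submits. A sketch, assuming the workflow uses a standard KSampler node (node class names and the file path depend on how your workflow was exported, so check your own JSON):

```python
import json

# Path is illustrative: an API-format workflow exported from ComfyUI.
with open("ltx2_workflow_api.json") as f:
    workflow = json.load(f)

# Patch every sampler node to the distilled-LoRA settings.
for node in workflow.values():
    if node.get("class_type") == "KSampler":
        node["inputs"]["steps"] = 8             # the distilled LoRA is trained for 8 steps
        node["inputs"]["cfg"] = 1.0             # and for CFG 1.0, not 4.0
        node["inputs"]["sampler_name"] = "euler"
```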
With corrected settings, the same prompts went from nightmare fuel to recognizable — still rough, but intentionally so. An action figure moving against a backdrop instead of a fever dream.
Session 1 Final Config
- Model: ltx-2-19b-dev-fp8.safetensors (cuda:0)
- Encoder: gemma_3_12B_it_fp8_scaled.safetensors (cuda:1)
- LoRA: ltx-2-19b-distilled-lora-384.safetensors @ 0.6
- Steps: 8
- CFG: 1.0
- Sampler: euler
- Res: 768x512, 97 frames (4s @ 24fps)
- Enhance: Ollama Qwen3:14b
- Gen time: 36-54 seconds per clip
15 test videos generated. Quality went from unwatchable to “recognizable but stiff.” Gen time under a minute.
What 3.5 Hours Gets You
Sora generates 1080p video with smooth motion, consistent subjects, and natural lighting. It took me 3.5 hours just to get coherent 4-second clips at 768x512 — and “coherent” is doing heavy lifting in that sentence.
But everything works. The multi-GPU split, the VRAM math, the prompt pipeline, the CLI tool — all of it running, all of it producing output in under a minute. The foundation is solid. The output isn’t.
The next session would be about finding out whether better settings could fix that.