Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation

Both teaser effects are out-of-distribution — unseen by the hypernetwork during training.

TL;DR: Prompt2Effect is a weight-driven hypernetwork that, conditioned on the frozen base-model weights, synthesizes effect-specific LoRA weights in a single forward pass. It matches or beats conventional LoRA fine-tuning in quality and effect alignment while cutting per-effect cost from 56 GPU-hours of training to 3.3 seconds of inference.

Abstract

Personalizing Image-to-Video (I2V) diffusion models with specific visual effects is increasingly demanded for high-end video generation. Current practice requires training a separate Low-Rank Adaptation (LoRA) module for each effect, incurring substantial data curation and iterative optimization costs that hinder interactive control. We present Prompt2Effect, a weight-driven hypernetwork that amortizes per-effect training by directly synthesizing effect-specific LoRA weights in a single forward pass. Unlike prior hypernetworks that regress adapter weights purely from semantics, Prompt2Effect is explicitly conditioned on the frozen base model weights, grounding weight prediction in the structural geometry of each layer. Furthermore, instead of predicting raw LoRA matrices, we introduce an SVD-canonicalized parameterization that resolves factorization ambiguity and stabilizes large-scale weight synthesis. Together, these design principles enable accurate and scalable LoRA prediction for high-dimensional I2V diffusion models. Extensive experiments demonstrate that Prompt2Effect achieves on-par or superior video quality and effect alignment compared to conventional LoRA fine-tuning, while reducing the computational cost from 56 GPU training hours to 3.3 seconds of hypernetwork inference. When used as initialization for subsequent fine-tuning, our predicted weights further improve final performance and accelerate optimization by approximately 10x.

Method

Trained once, the hypernetwork predicts all effect-specific LoRA weights for a frozen I2V backbone in a single forward pass — zero-shot, with optional fast fine-tuning from the prediction (about 10× faster than training from scratch).

Prompt2Effect pipeline overview — **(1)** The frozen base weights W₀ are tokenized into row/column tokens and compressed by learnable queries. **(2)** A Transformer backbone fuses them with the effect-prompt semantics. **(3)** A head outputs SVD-canonicalized LoRA factors A, B for all layers in one pass. **(4)** Optional fast adaptation fine-tunes from this initialization.
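Step (1) of the pipeline can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released implementation: the projection matrices `P_row` and `P_col`, the token width `d`, the single-head attention without key/value projections, and all function names are hypothetical, and random matrices stand in for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tokenize_weight(W0, P_row, P_col):
    """Turn one weight matrix into row and column tokens of width d.

    P_row and P_col are hypothetical learned projections that map rows
    (length d_in) and columns (length d_out) to a shared token width.
    """
    row_tokens = W0 @ P_row      # (d_out, d_in) @ (d_in, d)  -> (d_out, d)
    col_tokens = W0.T @ P_col    # (d_in, d_out) @ (d_out, d) -> (d_in, d)
    return np.concatenate([row_tokens, col_tokens], axis=0)

def compress(tokens, queries):
    """Cross-attend n queries over the weight tokens (assumed mechanism)."""
    d = queries.shape[1]
    attn = softmax(queries @ tokens.T / np.sqrt(d), axis=-1)  # (n, num_tokens)
    return attn @ tokens, attn                                # (n, d)

rng = np.random.default_rng(0)
d_out, d_in, d, n = 96, 64, 32, 8             # toy sizes, not the paper's
W0 = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
P_row = rng.standard_normal((d_in, d)) / np.sqrt(d_in)
P_col = rng.standard_normal((d_out, d)) / np.sqrt(d_out)
queries = rng.standard_normal((n, d))         # stand-in for learnable queries

tokens = tokenize_weight(W0, P_row, P_col)    # shape (160, 32)
summary, attn = compress(tokens, queries)     # summary has shape (8, 32)
print(tokens.shape, summary.shape)
```

Every row and column of W₀ contributes one token here, so no direction of the weight is dropped before the queries compress the sequence. The query count `n` plays the role of the compression setting studied in the ablations; the value 8 is only for a small demo.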

Weight-Driven Analysis

Unlike prior hypernetworks that regress weights purely from semantic embeddings, Prompt2Effect is weight-driven: it conditions on the frozen base weights W₀, slicing each layer into row and column tokens (inspired by CUR decomposition) so that prediction is grounded in the layer’s native structure. Instead of raw LoRA matrices, it predicts an SVD-canonicalized parameterization of the update: writing the SVD of the update as ΔW = U S V^⊤, the targets are B^★ = U S^{1/2} and A^★ = S^{1/2} V^⊤. This removes the factorization ambiguity of ΔW = BA (any invertible G yields the same update through (BG)(G⁻¹A)) and stabilizes large-scale weight synthesis.
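The canonicalization can be checked numerically. The sketch below is an illustration rather than the authors' code: the function name `canonicalize` and the toy shapes are assumptions, and it forms the full ΔW before the SVD, which is convenient for a demo but not necessarily how a large-scale pipeline would do it.

```python
import numpy as np

def canonicalize(B, A):
    """Map any rank-r factorization (B, A) of dW = B @ A to
    B_star = U S^{1/2}, A_star = S^{1/2} V^T from the SVD of dW."""
    r = B.shape[1]
    U, s, Vt = np.linalg.svd(B @ A, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r, :]     # keep the top-r triplets
    sqrt_s = np.sqrt(s)
    B_star = U * sqrt_s                       # scales column j by sqrt(s_j)
    A_star = sqrt_s[:, None] * Vt             # scales row i by sqrt(s_i)
    return B_star, A_star

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4                    # toy sizes
B = rng.standard_normal((d_out, r))
A = rng.standard_normal((r, d_in))

# A second factorization of the same update: (B G)(G^{-1} A).
G = np.eye(r) + 0.1 * rng.standard_normal((r, r))
B2, A2 = B @ G, np.linalg.inv(G) @ A

B_star, A_star = canonicalize(B, A)
B2_star, A2_star = canonicalize(B2, A2)

print(np.allclose(B_star @ A_star, B @ A))            # same update
print(np.allclose(B2_star @ A2_star, B @ A))          # same update again
print(np.allclose(B_star.T @ B_star, A_star @ A_star.T))  # balanced factors
```

Both raw factorizations map to factors that reproduce the same ΔW and share the same diagonal Gram matrix S. One caveat worth keeping in mind: the SVD itself is unique only up to paired sign flips of singular vectors (and rotations within repeated singular values), so a practical target pipeline would presumably also fix a sign convention.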

Spectral compressibility gap between the base weights and the LoRA update, which motivates full-rank weight tokenization. Results are aggregated over all LoRA–layer pairs across our library of 70 dynamic effects. For each layer we take the top-k singular subspaces of the base weight W₀ and measure the cumulative energy of the LoRA update ΔW projected into them. W₀ concentrates most of its spectral energy in the top singular components, whereas ΔW spreads its energy across a much broader set of singular directions. Aggressively compressing W₀ would discard directions that are informative for predicting ΔW, so Prompt2Effect tokenizes the base weights at full rank.
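One way to compute such curves is sketched below. The exact projection used for the figure is not spelled out here, so this assumes a two-sided projection of ΔW onto the top-k left and right singular vectors of W₀; the function names and the random toy matrices are hypothetical.

```python
import numpy as np

def spectral_energy_curve(W):
    """Fraction of W's own squared Frobenius norm in its top-k singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    return np.cumsum(s ** 2) / np.sum(s ** 2)

def projected_energy_curve(W0, dW):
    """Fraction of ||dW||_F^2 captured by the top-k singular subspaces of W0.

    Entry k-1 equals ||U_k^T dW V_k||_F^2 / ||dW||_F^2, where U_k and V_k
    hold the top-k left and right singular vectors of W0 (assumed definition).
    """
    U, _, Vt = np.linalg.svd(W0, full_matrices=False)
    C = U.T @ dW @ Vt.T                      # dW expressed in W0's bases
    E = C ** 2
    # Double cumulative sum: entry (k-1, k-1) is the sum of E[:k, :k].
    block = np.cumsum(np.cumsum(E, axis=0), axis=1)
    return np.diag(block) / np.sum(dW ** 2)

rng = np.random.default_rng(0)
d, r = 64, 4                                 # toy square layer, toy LoRA rank
W0 = rng.standard_normal((d, d))
dW = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))

own = spectral_energy_curve(W0)              # how compressible W0 is
proj = projected_energy_curve(W0, dW)        # how much of dW those subspaces keep
print(own[d // 4], proj[d // 4])
```

With random toy matrices the numbers carry no meaning; the point is the measurement itself. If `proj` rises much more slowly than `own`, truncating W₀ to its top-k subspaces keeps most of W₀ but loses much of ΔW.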

Video Results

Ablations

Hypernetwork design. A modular per-block design with separate small predictors (“Ensemble”) underperforms our unified 1.3B-parameter hypernetwork, which models all layer weights jointly — indicating that global coordination across layers matters for consistent effect synthesis.

Weight-driven input. Replacing the weight-driven input with abstract noise embeddings (“Noise”) causes a steep drop in VLM metrics and aesthetic quality: mapping text directly to functional weights is hard without structural priors. Exposing the hypernetwork to the principal subspaces of the base weights is crucial, and full-rank tokenization works best.

SVD-canonicalized prediction. Predicting SVD-canonicalized targets (A^★, B^★) rather than raw LoRA matrices accelerates convergence and improves stability: the NMSE of the reconstructed ΔW decreases smoothly, whereas the non-canonicalized variant oscillates and converges to a substantially higher final error.

Compression ratio. Among learned-query counts for weight-token compression, n = 256 gives the best compute–fidelity trade-off, outperforming n = 128 and marginally improving over n = 512.

SVD canonicalization training-dynamics ablation — **SVD canonicalization & weight-driven input.** Training dynamics (NMSE of the reconstructed ΔW vs. epoch). Predicting SVD-canonicalized targets converges fastest and lowest; removing canonicalization or the weight-driven input slows convergence and raises the final error.

Weight-tokenization rank ablation — **Base-weight tokenization rank.** Full-rank weight tokenization reaches the lowest NMSE(ΔW); half-rank is slightly worse; removing weight conditioning (“Without Weight”) plateaus far higher — confirming that full-rank tokenization is needed to predict ΔW.
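For readers reproducing the curves, the NMSE of the reconstructed ΔW can be computed as below. The page does not give the formula, so this assumes the common definition, squared Frobenius error normalized by the squared Frobenius norm of the target; the function name `nmse_delta_w` is hypothetical.

```python
import numpy as np

def nmse_delta_w(B_pred, A_pred, dW_true):
    """Normalized MSE between the reconstructed update B_pred @ A_pred
    and the target update dW_true (assumed definition)."""
    dW_pred = B_pred @ A_pred
    return np.sum((dW_pred - dW_true) ** 2) / np.sum(dW_true ** 2)

rng = np.random.default_rng(0)
B_true = rng.standard_normal((64, 4))
A_true = rng.standard_normal((4, 48))
dW_true = B_true @ A_true

perfect = nmse_delta_w(B_true, A_true, dW_true)            # exact prediction
trivial = nmse_delta_w(np.zeros_like(B_true), A_true, dW_true)  # predicts zero
print(perfect, trivial)
```

Under this definition a perfect prediction scores 0 and predicting no update at all scores 1, which gives a scale for reading the ablation curves. Because the error is measured on the product BA, it is unaffected by which factorization the predictor outputs.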

BibTeX

@inproceedings{yang2026prompt2effect,
  title     = {Prompt2Effect: Training-Free Image-to-Video Model Specialization via LoRA Generation},
  author    = {Yang, Xiaomeng and Li, Yanyu and Qian, Gordon Guocheng and Skorokhodov, Ivan
               and Ivanov, Viacheslav and Vinella, Avalon and Zhang, Xuan and Wang, Yanzhi
               and Tulyakov, Sergey and Kag, Anil},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2026}
}