# Stable-Fast Guide for LightDiffusion-Next
Welcome to the Stable-Fast guide for LightDiffusion-Next. This document will help you understand how to use Stable-Fast to accelerate your image generation process.
## Table of Contents
- Introduction
- Prerequisites
- Installation
- How to Enable Stable-Fast
- Technical Overview
- Performance Benefits
- Compatibility Notes
- Tips and Tricks
- Troubleshooting
## Introduction

### What is this?

`stable-fast` is an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. It provides very fast inference by combining several key techniques and features, described in the Technical Overview below.
## Installation

NOTE: `stable-fast` is currently only tested on Linux, and on WSL2 under Windows, with NVIDIA GPUs.

Running the `run.sh` or `pipeline.sh` script should work on most systems. If you encounter any issues, install the dependencies manually with the command below:
```bash
pip install --index-url "https://download.pytorch.org/whl/cu121" \
    torch==2.2.2 torchvision "xformers>=0.0.22" "triton>=2.1.0" \
    stable_fast-1.0.5+torch222cu121-cp310-cp310-manylinux2014_x86_64.whl
```
## How to Enable Stable-Fast

### In the GUI

1. **Launch LightDiffusion-Next**: Start the application using `run.bat` (Windows) or `run.sh` (Linux).
2. **Enable Stable-Fast**: Check the "Stable-Fast" checkbox in the LightDiffusion-Next interface.
3. **Configure Settings**: Set your desired parameters for image generation.
4. **Generate Images**: Click the "Generate" button to create images with accelerated processing.
### In the CLI

To enable Stable-Fast in the command-line interface, append the `--stable-fast` flag to the pipeline script:

```bash
./pipeline.sh "your prompt here" width height number_of_images batch_size --stable-fast
```

For example:

```bash
./pipeline.sh "A beautiful sunset over the ocean" 1024 768 1 1 --stable-fast
```
## Technical Overview

- **CUDNN Convolution Fusion**: `stable-fast` implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of `Conv + Bias + Add + Act` computation patterns.
- **Low Precision & Fused GEMM**: `stable-fast` implements a series of fused GEMM operators that compute in `fp16` precision, which is faster than PyTorch's default behavior (read and write in `fp16` but compute in `fp32`).
- **Fused Linear GEGLU**: `stable-fast` is able to fuse `GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)` into a single CUDA kernel.
- **NHWC & Fused GroupNorm**: `stable-fast` implements a highly optimized fused NHWC `GroupNorm + SiLU` operator with OpenAI's `Triton`, which eliminates the need for memory-format permutation operators.
- **Fully Traced Model**: `stable-fast` improves the `torch.jit.trace` interface to make it better suited to tracing complex models. Nearly every part of `StableDiffusionPipeline`/`StableVideoDiffusionPipeline` can be traced and converted to TorchScript. It is more stable than `torch.compile`, has significantly lower CPU overhead, and supports ControlNet and LoRA.
- **CUDA Graph**: `stable-fast` can capture the `UNet`, `VAE` and `TextEncoder` into CUDA Graph format, which reduces CPU overhead when the batch size is small. This implementation also supports dynamic shapes.
- **Fused Multihead Attention**: `stable-fast` simply uses xformers and makes it compatible with TorchScript.
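To make the GEGLU fusion above concrete, here is a minimal NumPy sketch of the *unfused* computation that `stable-fast` collapses into one CUDA kernel. The variable names (`x`, `W`, `V`, `b`, `c`) follow the formula above; the tanh-based GELU approximation is an assumption for illustration, not `stable-fast`'s exact kernel.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as commonly used in transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def geglu(x, W, V, b, c):
    # GEGLU(x, W, V, b, c) = GELU(xW + b) * (xV + c)
    # stable-fast fuses these three steps (two GEMMs plus the gating multiply)
    # into one kernel; here they run as separate NumPy ops for clarity.
    return gelu(x @ W + b) * (x @ V + c)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))   # (batch, d_in)
W = rng.standard_normal((8, 16))  # gate projection
V = rng.standard_normal((8, 16))  # value projection
b = np.zeros(16)
c = np.zeros(16)

out = geglu(x, W, V, b, c)
print(out.shape)  # (2, 16)
```

Running the three steps separately means two extra round-trips through GPU memory for the intermediate GEMM results; fusing them is where the speedup comes from.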
My next goal is to keep `stable-fast` as one of the fastest inference optimization frameworks for `diffusers`, and also to provide both speedup and VRAM reduction for `transformers`. In fact, I already use `stable-fast` to optimize LLMs and achieve a significant speedup, but I still need to do some work to make it more stable, easier to use, and to provide a stable user interface.
## Differences With Other Acceleration Libraries

- **Fast**: `stable-fast` is specially optimized for HuggingFace Diffusers and achieves high performance across many library versions. It also compiles very quickly, within only a few seconds: significantly faster than `torch.compile`, `TensorRT` and `AITemplate` in compilation time.
- **Minimal**: `stable-fast` works as a plugin framework for `PyTorch`. It utilizes existing `PyTorch` functionality and infrastructure, and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.
- **Maximum Compatibility**: `stable-fast` is compatible with all kinds of `HuggingFace Diffusers` and `PyTorch` versions. It is also compatible with `ControlNet` and `LoRA`, and it even supports the latest `StableVideoDiffusionPipeline` out of the box!
## Performance Benefits

Using Stable-Fast can provide:

- Up to 70% speedup in generation time
- Reduced memory usage
- Lower CPU overhead compared to `torch.compile`
- Improved batch-processing efficiency
## Compatibility Notes
Stable-Fast is compatible with:
- Most SD1.x and SD2.x models
- ControlNet extensions
- LoRA adapters
- Various sampling methods
It may have limited compatibility with some specialized or custom models.
## Tips and Tricks
- First Run Warmup: The first generation with Stable-Fast may take slightly longer due to compilation overhead. Subsequent generations will be faster.
- Resolution Impact: The performance gains from Stable-Fast are more pronounced at higher resolutions.
- Combine with HiresFix: Stable-Fast works particularly well with HiresFix, allowing for higher resolution outputs with less performance penalty.
- Memory Management: While Stable-Fast reduces memory usage, it’s still recommended to close other applications when generating at high resolutions.
- Testing: Run comparison tests with and without Stable-Fast to find the optimal configuration for your specific hardware.
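The comparison test suggested in the last tip can be as simple as timing a few runs of each configuration. Below is a minimal, hypothetical timing helper; the two lambda workloads are placeholders standing in for your actual generation calls (they are not LightDiffusion-Next APIs), so swap in real runs with and without the "Stable-Fast" option.

```python
import time
import statistics

def benchmark(fn, warmup=1, runs=3):
    """Time fn after `warmup` untimed calls; return the median of `runs` timings."""
    for _ in range(warmup):  # warm-up absorbs one-time compilation overhead
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Placeholder workloads standing in for generation without / with Stable-Fast.
baseline = benchmark(lambda: sum(i * i for i in range(200_000)))
optimized = benchmark(lambda: sum(i * i for i in range(100_000)))
speedup = baseline / optimized
print(f"speedup: {speedup:.2f}x")
```

Using a warm-up run matters here: the first Stable-Fast generation includes compilation overhead (see the first tip), so timing it would understate the steady-state speedup.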
## Troubleshooting
If you encounter issues with Stable-Fast, try these solutions:
- Driver Updates: Ensure your NVIDIA drivers are up to date.
- CUDA Toolkit: Verify your CUDA toolkit installation.
- Model Compatibility: Some models may not be fully compatible with all Stable-Fast optimizations. Try a different model.
For additional help, please refer to the GitHub repository or raise an issue on the LightDiffusion-Next issues page.
Wishing you good generations!