
Stable-Fast Guide for LightDiffusion-Next

Welcome to the Stable-Fast guide for LightDiffusion-Next. This document will help you understand how to use Stable-Fast to accelerate your image generation process.

Table of Contents

  1. Introduction
  2. Installation
  3. How to Enable Stable-Fast
  4. Technical Overview
  5. Performance Benefits
  6. Compatibility Notes
  7. Tips and Tricks
  8. Troubleshooting

Introduction

What is this?

stable-fast is an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. It achieves very fast inference by combining several key techniques, described in the Technical Overview section below.
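Under the hood, LightDiffusion-Next applies stable-fast to its Diffusers pipeline when the option is enabled. For reference, this is roughly what that looks like in plain Python, following the upstream stable-fast README (the model name and config toggles here are illustrative):

import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import compile, CompileConfig

# Load an ordinary Diffusers pipeline in half precision.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# These toggles mirror the optimizations described in the
# Technical Overview section below.
config = CompileConfig.Default()
config.enable_xformers = True    # fused multi-head attention
config.enable_triton = True      # Triton-fused NHWC GroupNorm + SiLU
config.enable_cuda_graph = True  # capture UNet/VAE/TextEncoder as CUDA Graphs

pipe = compile(pipe, config)  # returns the optimized pipeline

image = pipe("A beautiful sunset over the ocean").images[0]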

Installation

NOTE: stable-fast is currently only tested on Linux and on WSL2 under Windows, with NVIDIA GPUs.

Running the run.sh or pipeline.sh scripts should work on most systems. If you encounter any issues, fall back to installing the dependencies manually with the command below:

pip install --index-url "https://download.pytorch.org/whl/cu121" \
        torch==2.2.2 torchvision "xformers>=0.0.22" "triton>=2.1.0" \
        stable_fast-1.0.5+torch222cu121-cp310-cp310-manylinux2014_x86_64.whl
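
After installation, you can verify that the stack imports correctly and sees your GPU (a minimal sanity check; note that the stable-fast package imports as sfast):

import torch
import sfast  # stable-fast's import name

print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())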

How to Enable Stable-Fast

In the GUI

  1. Launch LightDiffusion-Next: Start the application using run.bat (Windows) or run.sh (Linux).
  2. Enable Stable-Fast: Check the “Stable-Fast” checkbox in the LightDiffusion-Next interface.
  3. Configure Settings: Set your desired parameters for image generation.
  4. Generate Images: Click the “Generate” button to create images with accelerated processing.

In the CLI

To enable Stable-Fast in the command-line interface, append the --stable-fast flag (use pipeline.bat instead of ./pipeline.sh on Windows):

./pipeline.sh "your prompt here" width height number_of_images batch_size --stable-fast

For example:

./pipeline.sh "A beautiful sunset over the ocean" 1024 768 1 1 --stable-fast

Technical Overview

  • CUDNN Convolution Fusion: stable-fast implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of Conv + Bias + Add + Act computation patterns.
  • Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute in fp16, which is faster than PyTorch's default behavior (reading and writing fp16 while computing in fp32).
  • Fused Linear GEGLU: stable-fast fuses GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c) into a single CUDA kernel (a reference sketch of the unfused computation follows this list).
  • NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC GroupNorm + SiLU operator with OpenAI's Triton, which eliminates the need for memory-format permutation operators.
  • Fully Traced Model: stable-fast improves the torch.jit.trace interface to make it better suited to tracing complex models. Nearly every part of StableDiffusionPipeline/StableVideoDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has significantly lower CPU overhead, and supports ControlNet and LoRA.
  • CUDA Graph: stable-fast can capture the UNet, VAE and TextEncoder into CUDA Graph format, which reduces CPU overhead when the batch size is small. This implementation also supports dynamic shapes.
  • Fused Multihead Attention: stable-fast uses xformers and makes it compatible with TorchScript.
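
To make the Fused Linear GEGLU item above concrete, here is the unfused computation in plain PyTorch; stable-fast collapses these separate GEMM, add, GELU, and multiply launches into a single CUDA kernel (the shapes below are illustrative):

import torch
import torch.nn.functional as F

def geglu_reference(x, W, V, b, c):
    # GEGLU(x, W, V, b, c) = GELU(xW + b) * (xV + c)
    # Unfused, this costs two GEMMs plus elementwise GELU and multiply.
    return F.gelu(x @ W + b) * (x @ V + c)

x = torch.randn(2, 77, 320, device="cuda", dtype=torch.float16)
W = torch.randn(320, 1280, device="cuda", dtype=torch.float16)
V = torch.randn(320, 1280, device="cuda", dtype=torch.float16)
b = torch.randn(1280, device="cuda", dtype=torch.float16)
c = torch.randn(1280, device="cuda", dtype=torch.float16)

out = geglu_reference(x, W, V, b, c)  # stable-fast emits one kernel for this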

The upstream author's stated goal is to keep stable-fast one of the fastest inference optimization frameworks for Diffusers, and also to provide both speedup and VRAM reduction for transformers. stable-fast is already used upstream to optimize LLMs with significant speedups, though more work remains to make that path stable, easy to use, and covered by a stable user interface.

Differences With Other Acceleration Libraries

  • Fast: stable-fast is specially optimized for HuggingFace Diffusers and achieves high performance across many models. It also compiles very quickly, within only a few seconds, significantly faster than torch.compile, TensorRT, and AITemplate.
  • Minimal: stable-fast works as a plugin framework for PyTorch. It utilizes existing PyTorch functionality and infrastructure, and is compatible with other acceleration techniques as well as popular fine-tuning techniques and deployment solutions.
  • Maximum Compatibility: stable-fast is compatible with a wide range of HuggingFace Diffusers and PyTorch versions. It is also compatible with ControlNet and LoRA, and it even supports the latest StableVideoDiffusionPipeline out of the box!

Performance Benefits

Using Stable-Fast can provide:

  • Up to 70% speedup in generation time
  • Reduced memory usage
  • Lower CPU overhead compared to torch.compile
  • Improved batch processing efficiency

Compatibility Notes

Stable-Fast is compatible with:

  • Most SD1.x and SD2.x models
  • ControlNet extensions
  • LoRA adapters
  • Various sampling methods

It may have limited compatibility with some specialized or custom models.
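
For example, when using a LoRA adapter, the usual pattern is to load (and optionally fuse) it before stable-fast compiles the pipeline, so the traced graph already contains the adapted weights. A minimal sketch (the LoRA path is hypothetical, and LightDiffusion-Next handles this internally when Stable-Fast is enabled):

import torch
from diffusers import StableDiffusionPipeline
from sfast.compilers.diffusion_pipeline_compiler import compile, CompileConfig

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the adapter first so its weights are baked into the traced model.
pipe.load_lora_weights("path/to/my_lora.safetensors")  # hypothetical path
pipe.fuse_lora()  # fold the LoRA deltas into the base weights

pipe = compile(pipe, CompileConfig.Default())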

Tips and Tricks

  • First Run Warmup: The first generation with Stable-Fast may take slightly longer due to compilation overhead. Subsequent generations will be faster.
  • Resolution Impact: The performance gains from Stable-Fast are more pronounced at higher resolutions.
  • Combine with HiresFix: Stable-Fast works particularly well with HiresFix, allowing for higher resolution outputs with less performance penalty.
  • Memory Management: While Stable-Fast reduces memory usage, it’s still recommended to close other applications when generating at high resolutions.
  • Testing: Run comparison tests with and without Stable-Fast to find the optimal configuration for your specific hardware (see the timing sketch below).
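
A minimal timing harness for such comparisons might look like this (the prompt and step count are illustrative; note the warm-up run, which absorbs the one-time compilation overhead mentioned above):

import time
import torch

def time_generation(pipe, prompt, n_runs=3):
    # Warm-up: with stable-fast, the first call triggers compilation.
    pipe(prompt, num_inference_steps=20)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompt, num_inference_steps=20)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

# Call time_generation on the same pipeline before and after compile()
# to measure the actual speedup on your hardware.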

Troubleshooting

If you encounter issues with Stable-Fast, try these solutions:

  • Driver Updates: Ensure your NVIDIA drivers are up to date.
  • CUDA Toolkit: Verify your CUDA toolkit installation.
  • Model Compatibility: Some models may not be fully compatible with all Stable-Fast optimizations. Try a different model.

For additional help, please refer to the GitHub repository or raise an issue on the LightDiffusion-Next issues page.

Wish you good generations!