diff --git a/asset/content/incremental.jpg b/asset/content/incremental.jpg
deleted file mode 100644
index c6a395d..0000000
Binary files a/asset/content/incremental.jpg and /dev/null differ
diff --git a/asset/content/model-incremental.jpg b/asset/content/model-incremental.jpg
new file mode 100644
index 0000000..88bfe96
Binary files /dev/null and b/asset/content/model-incremental.jpg differ
diff --git a/asset/content/model.jpg b/asset/content/model.jpg
deleted file mode 100644
index fd96387..0000000
Binary files a/asset/content/model.jpg and /dev/null differ
diff --git a/asset/content/performance.jpg b/asset/content/performance.jpg
new file mode 100644
index 0000000..b387fb5
Binary files /dev/null and b/asset/content/performance.jpg differ
diff --git a/index.html b/index.html
index d5eec48..95611f1 100644
--- a/index.html
+++ b/index.html
@@ -30,13 +30,13 @@
margin: 0.2em 0;
}
.hero h2 {
- font-size: 3em;
+ font-size: 2.8em;
margin: 0.2em 0;
font-weight: normal;
line-height: 1.4; /* Adjust this value to make the spacing larger */
}
.hero p {
- font-size: 1.5em;
+ font-size: 1.4em;
margin-bottom: 1em;
}
.button {
@@ -220,11 +220,11 @@
.inserted-image {
max-width: 80%; /* Set the maximum width for the image */
height: auto; /* Ensure the height adjusts automatically to maintain aspect ratio */
- margin: 20px 0; /* Add space above and below the image */
+ margin: 30px; /* Add space above and below the image */
+ margin-top: 10px;
display: block; /* Make sure the image is treated as a block-level element */
margin-left: auto; /* Center the image horizontally */
margin-right: auto;
- margin-top: -10px;
border-radius: 10px;
box-shadow: 2px 2px 12px 4px #00000012;
}
@@ -459,11 +459,11 @@
- We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 x 4096 resolution.
+ We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution.
Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU. Core designs include:
Deep compression autoencoder: unlike traditional AEs, which compress images only 8x,
- we trained an AE that can compress images 32x, effectively reducing the number of latent tokens.
+ we trained an AE that can compress images 32×, effectively reducing the number of latent tokens.
Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.
Decoder-only text encoder: we replaced T5 with modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance the image-text alignment.
@@ -471,8 +471,8 @@
• Deep Compression Autoencoder:
We introduce a new Autoencoder (AE) that aggressively increases the scaling factor to 32.
- Compared with AE-F8, our AE-F32 outputs 16x fewer latent tokens, which is crucial for efficient training
+ Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens, which is crucial for efficient training
and generating ultra-high-resolution images, such as 4K resolution.
• Efficient Linear DiT:
We introduce a new linear DiT, replacing vanilla quadratic attention and reducing complexity from O(N²) to O(N)
- Mix-FFN, with 3x3 depth-wise convolution in MLP, enhances the local information of tokens.
- Linear attention achieves comparable results to vanilla, improving 4K generation by 1.7x in latency.
+ Mix-FFN, with 3×3 depth-wise convolution in MLP, enhances the local information of tokens.
+ Linear attention achieves comparable results to vanilla, improving 4K generation by 1.7× in latency.
Mix-FFN also removes the need for positional encoding (NoPE) without quality loss, marking the first DiT without positional embedding.
• Decoder-only Small LLM as Text Encoder:
We use Gemma, a decoder-only LLM, as the text encoder to enhance understanding and reasoning in prompts.
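The Linear DiT bullets above describe two ingredients: linear attention in place of quadratic softmax attention, and a Mix-FFN whose 3×3 depth-wise convolution mixes each token with its spatial neighbours. The snippet below is a minimal PyTorch sketch of how those two pieces can fit together, assuming a ReLU kernel feature map for the linear attention; module names, shapes, and hyperparameters are illustrative and not taken from the Sana codebase.

```python
# Rough sketch of the two Linear DiT ingredients described above.
# Not the Sana implementation: names and defaults are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearAttention(nn.Module):
    """ReLU linear attention: cost grows linearly with the number of tokens N."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = (
            self.qkv(x).reshape(B, N, 3, self.heads, C // self.heads).permute(2, 0, 3, 1, 4)
        )
        q, k = F.relu(q), F.relu(k)                    # kernel feature map
        kv = torch.einsum("bhnd,bhne->bhde", k, v)     # d x d summary, independent of N
        z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + 1e-6)
        out = torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))


class MixFFN(nn.Module):
    """MLP with a 3x3 depth-wise conv so each token mixes with its 2-D neighbours."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        B, N, _ = x.shape                              # N == h * w latent tokens
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(B, -1, h, w)     # back to a 2-D grid
        x = self.dwconv(x).flatten(2).transpose(1, 2)  # local mixing = implicit position cue
        return self.fc2(F.gelu(x))


if __name__ == "__main__":
    # A 4096 x 4096 image under an F32 autoencoder gives a 128 x 128 latent grid
    # (16,384 tokens) versus 512 x 512 (262,144 tokens) under F8 -- the 16x token
    # reduction cited above; linear attention keeps the cost proportional to N.
    h = w = 128
    tokens = torch.randn(1, h * w, 64)
    attn, ffn = LinearAttention(64, heads=4), MixFFN(64, 256)
    print(ffn(attn(tokens), h, w).shape)               # torch.Size([1, 16384, 64])
```

Because the key-value summary is a d×d matrix rather than an N×N attention map, the cost stays linear in the token count, and the depth-wise convolution supplies the positional cue that lets such a block drop explicit positional embeddings (the NoPE point above).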
@@ -506,16 +506,22 @@
Our mission is to develop efficient, lightweight, and accelerated AI technologies that address practical challenges and deliver fast, open-source solutions...
+We compare Sana with the most advanced text-to-image diffusion models in Table 7. For 512 × 512 resolution,
+ Sana-0.6 demonstrates a throughput that is 5× faster than PixArt-Σ, which has a similar model size,
+ and significantly outperforms it in FID, Clip Score, GenEval, and DPG-Bench. For 1024 × 1024 resolution,
+ Sana is considerably stronger than most models with <3B parameters and excels in inference latency.
+ Our models achieve competitive performance even when compared to the most advanced large model FLUX-dev.
+ For instance, while the accuracy on DPG-Bench is equivalent and slightly lower on GenEval,
+ Sana-0.6B’s throughput is 39× faster, and Sana-1.6B is 23× faster.
@misc{xie2024sana,
- title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
- author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
- year={2024},
- eprint={0000.0000},
- archivePrefix={arXiv},
- primaryClass={cs.CV}
+ title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
+ author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
+ year={2024},
+ eprint={0000.0000},
+ archivePrefix={arXiv},
+ primaryClass={cs.CV}
}