diff --git a/asset/content/incremental.jpg b/asset/content/incremental.jpg
deleted file mode 100644
index c6a395d..0000000
Binary files a/asset/content/incremental.jpg and /dev/null differ
diff --git a/asset/content/model-incremental.jpg b/asset/content/model-incremental.jpg
new file mode 100644
index 0000000..88bfe96
Binary files /dev/null and b/asset/content/model-incremental.jpg differ
diff --git a/asset/content/model.jpg b/asset/content/model.jpg
deleted file mode 100644
index fd96387..0000000
Binary files a/asset/content/model.jpg and /dev/null differ
diff --git a/asset/content/performance.jpg b/asset/content/performance.jpg
new file mode 100644
index 0000000..b387fb5
Binary files /dev/null and b/asset/content/performance.jpg differ
diff --git a/index.html b/index.html
index d5eec48..95611f1 100644
--- a/index.html
+++ b/index.html
@@ -30,13 +30,13 @@
       margin: 0.2em 0;
     }
     .hero h2 {
-      font-size: 3em;
+      font-size: 2.8em;
       margin: 0.2em 0;
       font-weight: normal;
       line-height: 1.4; /* Adjust this value to make the spacing larger */
     }
     .hero p {
-      font-size: 1.5em;
+      font-size: 1.4em;
       margin-bottom: 1em;
     }
     .button {
@@ -220,11 +220,11 @@
     .inserted-image {
       max-width: 80%; /* Set the maximum width for the image */
       height: auto; /* Ensure the height adjusts automatically to maintain aspect ratio */
-      margin: 20px 0; /* Add space above and below the image */
+      margin: 30px; /* Add space above and below the image */
+      margin-top: 10px;
       display: block; /* Make sure the image is treated as a block-level element */
       margin-left: auto; /* Center the image horizontally */
       margin-right: auto;
-      margin-top: -10px;
       border-radius: 10px;
       box-shadow: 2px 2px 12px 4px #00000012;
     }
@@ -459,11 +459,11 @@

Efficient High-Resolution Image Synthesis

About Sana

-            We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 x 4096 resolution.
+            We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution.
             Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on a laptop GPU.
             Core designs include:
             Deep compression autoencoder: unlike traditional AEs, which compress images only 8x,
-            we trained an AE that can compress images 32x, effectively reducing the number of latent tokens.
+            we trained an AE that can compress images 32×, effectively reducing the number of latent tokens.
             Linear DiT: we replace all vanilla attention in DiT with linear attention, which is more efficient at high resolutions without sacrificing quality.
             Decoder-only text encoder: we replaced T5 with a modern decoder-only small LLM as the text encoder and designed complex human instruction with in-context learning to enhance image-text alignment.
@@ -471,8 +471,8 @@

About Sana

with efficient caption labeling and selection to accelerate convergence.
             As a result, Sana-0.6B is very competitive with modern giant diffusion models (e.g., Flux-12B), being 20 times smaller and 100+ times faster in measured throughput.
-            Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 x 1024 resolution image.
-            Sana enables content creation at low cost. Code and model will be publicly released.
+            Moreover, Sana-0.6B can be deployed on a 16GB laptop GPU, taking less than 1 second to generate a 1024 × 1024 resolution image.
+            Sana enables content creation at low cost.
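For a concrete sense of what the 32× deep-compression autoencoder mentioned above buys, here is a back-of-the-envelope token count. This is an illustrative sketch only, not code from the Sana repository; the 1024 × 1024 resolution, the patch size of 1, and the function name are assumptions for the example:

```python
# Illustrative token-count arithmetic (assumed resolution and patch size).
# An AE with downsampling factor f maps an H x W image to (H/f) x (W/f) latents;
# with a DiT patch size of 1, each latent position becomes one token.

def latent_tokens(height: int, width: int, ae_factor: int, patch_size: int = 1) -> int:
    """Number of tokens the diffusion transformer sees."""
    return (height // (ae_factor * patch_size)) * (width // (ae_factor * patch_size))

print(latent_tokens(1024, 1024, ae_factor=8))   # AE-F8  -> 16384 tokens
print(latent_tokens(1024, 1024, ae_factor=32))  # AE-F32 ->  1024 tokens, i.e. 16x fewer
```

The 16× reduction in sequence length is what the later bullets credit for efficient training and tractable 4K generation.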

@@ -487,12 +487,12 @@

Several core design details for Efficiency

     •   Deep Compression Autoencoder: We introduce a new Autoencoder (AE) that aggressively increases the scaling factor to 32.
-        Compared with AE-F8, our AE-F32 outputs 16x fewer latent tokens, which is crucial for efficient training
+        Compared with AE-F8, our AE-F32 outputs 16× fewer latent tokens, which is crucial for efficient training
         and generating ultra-high-resolution images, such as 4K resolution.
     •   Efficient Linear DiT: We introduce a new linear DiT, replacing vanilla quadratic attention and reducing complexity from O(N²) to O(N).
-        Mix-FFN, with 3x3 depth-wise convolution in MLP, enhances the local information of tokens.
-        Linear attention achieves comparable results to vanilla, improving 4K generation by 1.7x in latency.
+        Mix-FFN, with 3×3 depth-wise convolution in MLP, enhances the local information of tokens.
+        Linear attention achieves comparable results to vanilla attention while improving 4K generation latency by 1.7×.
         Mix-FFN also removes the need for positional encoding (NoPE) without quality loss, marking the first DiT without positional embedding.
     •   Decoder-only Small LLM as Text Encoder: We use Gemma, a decoder-only LLM, as the text encoder to enhance understanding and reasoning in prompts.
@@ -506,16 +506,22 @@
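Since the Linear DiT bullet is the core efficiency change, here is a minimal, generic sketch of the linear-attention idea it refers to: apply a positive feature map and use associativity of matrix products so the N × N attention matrix is never formed. This is not Sana's implementation; the ReLU feature map, tensor shapes, and function names are assumptions for illustration:

```python
import torch

def vanilla_attention(q, k, v):
    # O(N^2): materializes an N x N attention matrix.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # O(N): positive feature map (ReLU), then associativity phi(Q) @ (phi(K)^T @ V),
    # so no N x N matrix is ever formed.
    q, k = torch.relu(q), torch.relu(k)
    kv = k.transpose(-2, -1) @ v                  # (d, d): independent of N
    z = q @ k.sum(dim=-2).unsqueeze(-1) + eps     # per-token normalizer, shape (..., N, 1)
    return (q @ kv) / z

# Token count N grows quadratically with image resolution, which is why an
# O(N) attention path matters at 4K. Shapes below are illustrative.
q = k = v = torch.randn(2, 1024, 64)              # (batch, N tokens, head dim)
print(linear_attention(q, k, v).shape)            # torch.Size([2, 1024, 64])
```

The quadratic attention matrix is never materialized, which is what keeps attention memory and latency manageable as the token count grows with resolution.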

Several core design details for Efficiency

-            details of difference parts for efficiency improvement
+            pipeline for Sana
-

Our Mission

-

Our mission is to develop efficient, lightweight, and accelerated AI technologies that address practical challenges and deliver fast, open-source solutions...

+

Performance

+

+            We compare Sana with the most advanced text-to-image diffusion models in Table 7. For 512 × 512 resolution,
+            Sana-0.6B demonstrates a throughput that is 5× faster than PixArt-Σ, which has a similar model size,
+            and significantly outperforms it in FID, CLIP Score, GenEval, and DPG-Bench. For 1024 × 1024 resolution,
+            Sana is considerably stronger than most models with <3B parameters and excels in inference latency.
+            Our models achieve competitive performance even when compared to the most advanced large model, FLUX-dev.
+            For instance, while its accuracy is on par on DPG-Bench and slightly lower on GenEval,
+            Sana-0.6B’s throughput is 39× faster, and Sana-1.6B is 23× faster.

-            pipeline for Sana
+            Sana performance
@@ -540,12 +546,12 @@

Sana-0.6B is deployable on a consumer-grade 4090 GPU

BibTeX

@misc{xie2024sana,
-        title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
-        author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
-        year={2024},
-        eprint={0000.0000},
-        archivePrefix={arXiv},
-        primaryClass={cs.CV}
+      title={Sana: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer},
+      author={Enze Xie and Junsong Chen and Junyu Chen and Han Cai and Yujun Lin and Zhekai Zhang and Muyang Li and Ligeng Zhu and Yao Lu and Song Han},
+      year={2024},
+      eprint={0000.0000},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV}
     }