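HuggingFace Diffusers 0.12 : Get Started : The Stable Diffusion Guide 🎨 (translation with commentary)

Translation: ClassCat Co., Ltd., Sales Information
Created: 02/08/2023 (v0.12.1)

* This page is a translation, with supplementary explanation where appropriate, of the following HuggingFace Diffusers document:

Get Started : The Stable Diffusion Guide 🎨

* The sample code has been checked to run, but has been extended or modified where necessary.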

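Intro

Stable Diffusion is a latent diffusion model developed by researchers from the Machine Vision & Learning group at LMU Munich, a.k.a. CompVis.
The model checkpoints were publicly released at the end of August 2022 by a collaboration of Stability AI, CompVis, and Runway, with support from EleutherAI and LAION. For more information, see the official blog post.

Since the public release, the community has done an incredible job of working together to make the stable diffusion checkpoints faster, more memory efficient, and more performant.

🧨 Diffusers offers a simple API to run stable diffusion with all of the memory, compute, and quality improvements.

This notebook walks through the improvements one by one so that you can make the best use of StableDiffusionPipeline for inference.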

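Prompt Engineering 🎨

When running *Stable Diffusion* in inference, we usually want to generate a certain type or style of image and then improve upon it. Improving a previously generated image means running inference over and over again with a different prompt, and potentially a different seed, until we are happy with the generation. So, to begin with, it is most important to speed up stable diffusion as much as possible in order to generate as many images as possible in a given amount of time.

This is done by improving both the computational efficiency (speed) and the memory efficiency (GPU RAM).

Let's start by looking into computational efficiency.

Throughout this notebook, we focus on runwayml/stable-diffusion-v1-5: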

model_id = "runwayml/stable-diffusion-v1-5"

Speed Optimization

Let's load the pipeline.

from diffusers import StableDiffusionPipeline                                                                                                                                                                                                 
                                                                                                                                                                                                                                              
pipe = StableDiffusionPipeline.from_pretrained(model_id)

We aim to generate a beautiful photograph of an old warrior chief, and will later try to find the best prompt for generating such a photograph. For now, let's keep the prompt simple:

prompt = "portrait photo of a old warrior chief"

First, we should make sure we run inference on the GPU, so we move the pipeline to the GPU, just as you would with any PyTorch module.

pipe = pipe.to("cuda")

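To generate an image, use the StableDiffusionPipeline.__call__ method, that is, call the pipeline object directly.

To make sure we can reproduce more or less the same image in every call, let's make use of a generator. See the reproducibility section of the documentation for details.

import torch

generator = torch.Generator("cuda").manual_seed(0)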

Now, let’s take a spin on it.

image = pipe(prompt, generator=generator).images[0]                                                                                                                                                                                           
image

Cool, this now took roughly 30 seconds on a T4 GPU (you might see faster inference if your allocated GPU is better than a T4).
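To compare the optimizations in the rest of this guide, it helps to measure wall-clock time per image in a consistent way. The following is a minimal sketch using only the standard library. seconds_per_image is a hypothetical helper (it is not part of Diffusers), and generate stands for any zero-argument callable that returns a list of images, for example lambda: pipe(prompt, generator=generator).images.

```python
import time


def seconds_per_image(generate, repeats=1):
    """Average wall-clock seconds per image produced by `generate`.

    `generate` is any zero-argument callable returning a list of images
    (hypothetical interface chosen for this sketch).
    """
    total_images = 0
    start = time.perf_counter()
    for _ in range(repeats):
        total_images += len(generate())
    elapsed = time.perf_counter() - start
    if total_images == 0:
        raise ValueError("generate() returned no images")
    return elapsed / total_images


# Stand-in workload so the sketch runs without a GPU.
def fake_generate():
    time.sleep(0.01)
    return ["image-a", "image-b"]


per_image = seconds_per_image(fake_generate, repeats=3)
print(f"{per_image:.4f} s/image")
```

The first call on a fresh pipeline often includes one-time warm-up costs, so it may be worth discarding the first measurement.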

The default run above used full float32 precision and ran the default number of inference steps (50). The easiest speed-ups come from switching to float16 (or half) precision and simply running fewer inference steps. Let's load the model now in float16 instead.

import torch                                                                                                                                                                                                                                  

pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)                                                                                                                                                           
pipe = pipe.to("cuda")

And we can again call the pipeline to generate an image.

generator = torch.Generator("cuda").manual_seed(0)                                                                                                                                                                                            

image = pipe(prompt, generator=generator).images[0]                                                                                                                                                                                           
image

Cool, this is almost three times as fast for arguably the same image quality.

We strongly recommend always running your pipelines in float16, as so far we have very rarely seen any degradation in quality because of it.
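One reason half precision helps is that each weight is stored in 2 bytes instead of 4, so the model weights occupy half as much GPU memory. The arithmetic can be sketched as follows. The parameter count of one billion is a round number assumed purely for illustration, not a measured figure for this checkpoint, and weights_gib is a hypothetical helper.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weights_gib(num_params, dtype):
    """Memory needed to store `num_params` weights in `dtype`, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


assumed_params = 1_000_000_000  # illustrative round number, NOT the real count
for dtype in ("float32", "float16"):
    print(f"{dtype}: {weights_gib(assumed_params, dtype):.2f} GiB")
```

Activations and temporary buffers add to this, so the estimate covers the weights only.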

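Next, let's see whether we need to use 50 inference steps or whether we could use significantly fewer. The number of inference steps depends on the denoising scheduler we use. Choosing a more efficient scheduler can help reduce the number of steps.

Let's have a look at all the schedulers the stable diffusion pipeline is compatible with.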

pipe.scheduler.compatibles

    [diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler,                                                                                                                                                       
     diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler,                                                                                                                                                                       
     diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler,                                                                                                                                                                     
     diffusers.schedulers.scheduling_pndm.PNDMScheduler,                                                                                                                                                                                      
     diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler,                                                                                                                                                                   
     diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler,                                                                                                                                                
     diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler,                                                                                                                                                         
     diffusers.schedulers.scheduling_ddpm.DDPMScheduler,                                                                                                                                                                                      
     diffusers.schedulers.scheduling_ddim.DDIMScheduler]

Cool, that’s a lot of schedulers.

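🧨 Diffusers is constantly adding new schedulers/samplers that can be used with Stable Diffusion. For more information, we recommend taking a look at the official scheduler documentation.

Alright, right now Stable Diffusion uses the PNDMScheduler, which usually requires around 50 inference steps. However, other schedulers such as DPMSolverMultistepScheduler or DPMSolverSinglestepScheduler seem to get away with just 20 to 25 inference steps. Let's try them out.

You can set a new scheduler by making use of the from_config function.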

from diffusers import DPMSolverMultistepScheduler                                                                                                                                                                                             
                                                                                                                                                                                                                                              
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
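If the same notebook is meant to run against different pipelines, one option is to pick the scheduler from the compatibles list instead of hard-coding the import. The sketch below assumes only what the output above shows, namely that compatibles is a list of scheduler classes. pick_scheduler_class and the preference order are hypothetical, and stand-in classes are used so the example runs without Diffusers installed.

```python
PREFERRED_SCHEDULERS = (
    "DPMSolverMultistepScheduler",
    "DPMSolverSinglestepScheduler",
)


def pick_scheduler_class(compatibles, preferred=PREFERRED_SCHEDULERS):
    """Return the first class in `compatibles` whose name is in `preferred`, else None."""
    by_name = {cls.__name__: cls for cls in compatibles}
    for name in preferred:
        if name in by_name:
            return by_name[name]
    return None


# Stand-in classes so the sketch runs without Diffusers installed.
class PNDMScheduler: ...
class DPMSolverMultistepScheduler: ...


chosen = pick_scheduler_class([PNDMScheduler, DPMSolverMultistepScheduler])
print(chosen.__name__)

# With a real pipeline, the result would be applied like this:
# cls = pick_scheduler_class(pipe.scheduler.compatibles)
# if cls is not None:
#     pipe.scheduler = cls.from_config(pipe.scheduler.config)
```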

Now, let’s try to reduce the number of inference steps to just 20.

generator = torch.Generator("cuda").manual_seed(0)                                                                                                                                                                                            
                                                                                                                                                                                                                                              
image = pipe(prompt, generator=generator, num_inference_steps=20).images[0]                                                                                                                                                                   
image

The image now looks a little different, but it is arguably still of equally high quality. We now cut inference time to just 4 seconds though 😍.
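The 4-second figure is consistent with a simple back-of-the-envelope estimate. If the float16 run at 50 steps takes about 10 seconds (roughly a third of the 30-second float32 run) and the time scales linearly with the number of steps, then 20 steps should take about 4 seconds. Both the 10-second figure and the linear scaling are assumptions made for illustration; fixed costs such as text encoding and VAE decoding are ignored.

```python
def estimate_seconds(base_seconds, base_steps, new_steps):
    """Scale a measured run time linearly with the number of inference steps."""
    return base_seconds * new_steps / base_steps


fp32_seconds = 30.0              # float32 run reported earlier in this guide
fp16_seconds = fp32_seconds / 3  # "almost three times as fast" (approximation)
print(estimate_seconds(fp16_seconds, base_steps=50, new_steps=20))  # prints 4.0
```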

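Memory Optimization

Using less memory during generation indirectly means more speed, because we are often trying to maximize how many images we can generate per second. Usually, the more images we can generate per inference run, the more images we can generate per second.

The easiest way to see how many images we can generate at once is to simply try it out and see when we get an "Out-of-memory (OOM)" error.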
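The trial-and-error search described above can be automated. The sketch below doubles the batch size until a run fails with an out-of-memory error. largest_working_batch_size and run_batch are hypothetical names, and the sketch assumes that an OOM surfaces as a RuntimeError whose message contains "out of memory", which is how PyTorch CUDA OOM errors usually appear. A stand-in function is used so the example runs without a GPU.

```python
def largest_working_batch_size(run_batch, start=1, limit=64):
    """Double the batch size until `run_batch(batch_size)` runs out of memory.

    Returns the largest batch size that succeeded, or 0 if even `start` failed.
    """
    best = 0
    batch_size = start
    while batch_size <= limit:
        try:
            run_batch(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: do not hide it
            break
        best = batch_size
        batch_size *= 2
    return best


# Stand-in that "fits" at most 4 images, so the sketch runs without a GPU.
def fake_run_batch(batch_size):
    if batch_size > 4:
        raise RuntimeError("CUDA out of memory (simulated)")


print(largest_working_batch_size(fake_run_batch))  # prints 4
```

With a real pipeline, run_batch could be something like lambda n: pipe(**get_inputs(batch_size=n)), using the get_inputs helper defined in this section.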

We can run batched inference by simply passing a list of prompts and a list of generators. Let's define a quick function that generates a batch for us.

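def get_inputs(batch_size=1):
    # One generator per image, seeded with the image's index, so that any
    # single image in the batch can be reproduced later from its seed.
    generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)]
    prompts = batch_size * [prompt]  # uses the module-level prompt defined above
    num_inference_steps = 20

    return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps}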

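This function returns a list of prompts and a list of generators, so we can reuse the generator that produced a result we like.

We also need a method that lets us easily display a batch of images.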

from PIL import Image                                                                                                                                                                                                                         

def image_grid(imgs, rows=2, cols=2):                                                                                                                                                                                                         
    w, h = imgs[0].size                                                                                                                                                                                                                       
    grid = Image.new('RGB', size=(cols*w, rows*h))                                                                                                                                                                                            
                                                                                                                                                                                                                                              
    for i, img in enumerate(imgs):                                                                                                                                                                                                            
        grid.paste(img, box=(i%cols*w, i//cols*h))                                                                                                                                                                                            
    return grid
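image_grid expects rows and cols to match the number of images. A small helper can derive the grid shape from the image count, which avoids mismatches when the batch size changes. The sketch below is illustrative only; grid_shape and cell_origin are hypothetical helpers, and cell_origin simply mirrors the paste-box arithmetic used in image_grid.

```python
import math


def grid_shape(num_images, cols=4):
    """Rows and columns needed to lay out `num_images` in a grid at most `cols` wide."""
    if num_images <= 0:
        raise ValueError("need at least one image")
    cols = min(cols, num_images)
    rows = math.ceil(num_images / cols)
    return rows, cols


def cell_origin(index, cols, width, height):
    """Top-left pixel of the cell for image `index`, as in image_grid's paste box."""
    return (index % cols * width, index // cols * height)


print(grid_shape(8))                                   # (2, 4)
print(grid_shape(4, cols=2))                           # (2, 2)
print(cell_origin(5, cols=4, width=512, height=512))   # (512, 512)
```

A possible use is rows, cols = grid_shape(len(images)) followed by image_grid(images, rows=rows, cols=cols).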

Cool, let’s see how much memory we can use starting with batch_size=4.

images = pipe(**get_inputs(batch_size=4)).images                                                                                                                                                                                              
image_grid(images)

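Going over a batch_size of 4 will error out in this notebook (assuming we are running it on a T4 GPU). We can also see that batching makes generation only slightly faster: 3.75 s/image compared with 4 s/image before.

However, the community has found some nice tricks to relax the memory constraints further. After stable diffusion was released, the community found improvements within days and shared them freely on GitHub: open-source at its finest! I believe the original idea came from a GitHub discussion thread.

By far most of the memory is taken up by the cross-attention layers. Instead of running this operation in batch, it can be run sequentially to save a significant amount of memory.

It can easily be enabled by calling enable_attention_slicing, as described in the documentation.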

pipe.enable_attention_slicing()
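The idea behind running attention sequentially can be illustrated with a toy example. The sketch below is pure Python on tiny matrices and is not how Diffusers implements it internally (the real code operates on GPU tensors). attention_in_slices is a hypothetical function that processes attention heads a few at a time and records how many score matrices are alive at once, which stands in for peak memory.

```python
import math


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(matrix):
    out = []
    for row in matrix:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def scaled_scores(q, k):
    """q @ k^T / sqrt(d) for one attention head."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    return [[s * scale for s in row] for row in matmul(q, k_t)]


def attention_in_slices(heads, slice_size):
    """heads: list of (q, k, v). Returns (outputs, peak number of live score matrices)."""
    outputs, peak = [], 0
    for start in range(0, len(heads), slice_size):
        chunk = heads[start:start + slice_size]
        # Only this slice's attention-weight matrices exist at the same time.
        weights = [softmax_rows(scaled_scores(q, k)) for q, k, _ in chunk]
        peak = max(peak, len(weights))
        outputs.extend(matmul(w, v) for w, (_, _, v) in zip(weights, chunk))
    return outputs, peak


heads = [
    (
        [[0.1 * (h + 1), 0.2], [0.3, 0.4 * (h + 1)]],  # queries
        [[0.5, 0.1 * h], [0.2, 0.7]],                  # keys
        [[1.0, 0.0 + h], [0.0, 1.0]],                  # values
    )
    for h in range(4)
]

full, full_peak = attention_in_slices(heads, slice_size=len(heads))
sliced, sliced_peak = attention_in_slices(heads, slice_size=1)
print(full == sliced, full_peak, sliced_peak)  # prints: True 4 1
```

The outputs are identical; only the number of score matrices held at once changes, which is the trade of a little speed for a lower memory peak.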

Great, now that attention slicing is enabled, let’s try to double the batch size again, going for batch_size=8.

images = pipe(**get_inputs(batch_size=8)).images                                                                                                                                                                                              
image_grid(images, rows=2, cols=4)

Nice, it works. The speed gain is again not very big (it might, however, be much more significant on other GPUs).

We’re at roughly 3.5 seconds per image 🔥 which is probably the fastest we can be with a simple T4 without sacrificing quality.
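To put the timings quoted so far side by side, the per-image times can be converted into images per minute and a speed-up over the float32 baseline. The numbers below are the approximate T4 figures reported in this guide, not new measurements, and the labels are shorthand chosen for this sketch.

```python
timings = {
    "float32, 50 steps, batch 1": 30.0,
    "float16, 20 steps, batch 1": 4.0,
    "float16, 20 steps, batch 4": 3.75,
    "float16, 20 steps, batch 8 + attention slicing": 3.5,
}

baseline = timings["float32, 50 steps, batch 1"]
for name, seconds in timings.items():
    images_per_minute = 60 / seconds
    speedup = baseline / seconds
    print(f"{name}: {images_per_minute:.1f} images/min, {speedup:.1f}x vs baseline")
```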

Next, let’s look into how to improve the quality!

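Quality Improvements

Now that our image generation pipeline is blazing fast, let's try to get maximum image quality.

First of all, image quality is extremely subjective, so it is difficult to make general claims here.

The most obvious step to take to improve quality is to use better checkpoints. Since the release of Stable Diffusion, many improved versions have been released, which are summarized here:

Official Release – 22 Aug 2022: Stable-Diffusion 1.4
20 Oct 2022: Stable-Diffusion 1.5
24 Nov 2022: Stable-Diffusion 2.0
7 Dec 2022: Stable-Diffusion 2.1

A newer version does not necessarily mean better image quality with the same parameters. It has been reported that 2.0 is slightly worse than 1.5 for certain prompts, but given the right prompt engineering, 2.0 and 2.1 seem to be better.

Overall, we strongly recommend just trying the models out and reading up on advice online (for example, it has been shown that using negative prompts is very important for 2.0 and 2.1 to get the highest possible quality). See, for example, the blog posts the community has written on this topic.

Additionally, the community has started fine-tuning many of the above versions on certain styles, with some of them reaching extremely high quality and gaining a lot of traction. We recommend having a look at all diffusers checkpoints sorted by downloads and trying out the different checkpoints.

For the following, we will stick to v1.5 for simplicity.

Next, we can also try to optimize single components of the pipeline, for example by switching out the latent decoder. For more details on how the whole Stable Diffusion pipeline works, please have a look at the Stable Diffusion blog post.

Let's load stabilityai's newest autoencoder (VAE).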

from diffusers import AutoencoderKL                                                                                                                                                                                                           

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")

Now we can set it to the vae of the pipeline to use it.

pipe.vae = vae

Let’s run the same prompt as before to compare quality.

images = pipe(**get_inputs(batch_size=8)).images                                                                                                                                                                                              
image_grid(images, rows=2, cols=4)

The difference seems to be only very minor, but the new generations are arguably a bit sharper.

Cool, finally, let’s look a bit into prompt engineering.

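Our goal was to generate a photo of an old warrior chief. Let's now try to bring a bit more color into the photos and make them look more impressive.

Originally our prompt was "portrait photo of an old warrior chief".

To improve the prompt, it often helps to add cues that could have been used online to save high quality photos, as well as to add more details.

Essentially, when doing prompt engineering, one has to think:

How was the photo I want, or similar photos, probably stored on the internet?
What additional detail can I give that steers the model toward the style I want?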

Cool, let’s add more details.

prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes"

and let’s also add some cues that usually help to generate higher quality images.

prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta"                                                                                                                                         
prompt

Cool, let’s now try this prompt.

images = pipe(**get_inputs(batch_size=8)).images                                                                                                                                                                                              
image_grid(images, rows=2, cols=4)

Pretty impressive! We got some very high quality image generations there. The 2nd image is my personal favorite, so I will re-use its seed and see whether I can tweak the prompt slightly by using "oldest warrior", "old", "", and "young" in place of "old".

prompts = [                                                                                                                                                                                                                                   
    "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta",                                                                                                                                                                                                                                                                   
    "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta",                                                                                                                                                                                                                                                                        
    "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta",                                                                                                                                                                                                                                                                            
    "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta",                                                                                                                                                                                                                                                                      
]                                                                                                                                                                                                                                             

generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))]  # 1 because we want the 2nd image                                                                                                                          

images = pipe(prompt=prompts, generator=generator, num_inference_steps=25).images                                                                                                                                                             
image_grid(images)
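The four prompts above differ only in the words before "warrior chief", so they can also be generated from a template rather than typed out. This is a minimal sketch; prompt_variants and DETAILS are hypothetical names, and the suffix is copied from the prompts used in this guide.

```python
DETAILS = (
    ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
    " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3  --beta --upbeta"
)


def prompt_variants(subjects, template="portrait photo of {subject} warrior chief"):
    """Build one full prompt per subject phrase, sharing the same detail suffix."""
    return [template.format(subject=s) + DETAILS for s in subjects]


variants = prompt_variants(["the oldest", "a old", "a", "a young"])
for p in variants:
    print(p)
```

Keeping the shared suffix in one place makes it less likely that the variants drift apart while experimenting.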

The first picture looks nice! The eye movement slightly changed and looks nice. This finishes our 101 guide on how to use Stable Diffusion 🤗.

For more information on optimization or other guides, we recommend taking a look at the following:

Blog post about Stable Diffusion : In-detail blog post explaining Stable Diffusion.
FlashAttention : XFormers flash attention can optimize your model even further with more speed and memory improvements.
Dreambooth – Quickly customize the model by fine-tuning it.
General info on Stable Diffusion – Info on other tasks that are powered by Stable Diffusion.
