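HuggingFace Diffusers 0.12 : API : Pipelines – Overview (Translation / Commentary)

Translation : ClassCat Co., Ltd., Sales Information
Created : 03/11/2023 (v0.14.0)

* This page is a translation of the following HuggingFace Diffusers documentation, with supplementary explanations added where appropriate:

API : Pipelines – Overview

* The sample code has been verified to run, but it has been extended or modified where necessary.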

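HuggingFace Diffusers 0.12 : API : Pipelines – Overview

Pipelines provide a simple way to run state-of-the-art diffusion models in inference. Most diffusion systems consist of multiple independently trained models and highly adaptable scheduler components, all of which are needed for an end-to-end diffusion system to function.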

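As an example, Stable Diffusion has three independently trained models:

* an autoencoder
* a conditional Unet
* a CLIP text encoder

In addition, it uses a scheduler component (scheduler), a CLIPFeatureExtractor, and a safety checker. All of these components are necessary to run Stable Diffusion in inference, even though they were trained or created independently of each other.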

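To that end, we strive to offer all open-sourced, state-of-the-art diffusion systems under a unified API. More specifically, we strive to provide pipelines that:

* can load the officially published weights and yield 1-to-1 the same outputs as the original implementation according to the corresponding paper (e.g. LDMTextToImagePipeline, which uses the officially released weights of High-Resolution Image Synthesis with Latent Diffusion Models),
* have a simple user interface to run the model in inference (see the Pipelines API section),
* are easy to understand, with code that is self-explanatory and can be read alongside the official paper (see the Pipelines summary),
* can easily be contributed to by the community (see the Contribution section).

Note that pipelines do not (and should not) offer any training functionality. If you are looking for official training examples, please have a look at examples.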

🧨 Diffusers Summary

The following table summarizes all officially supported pipelines, their corresponding papers, and, if available, a colab notebook to try them out directly.

| Pipeline | Task | Paper | Colab |
|---|---|---|---|
| alt_diffusion | Image-to-Image Text-Guided Generation | AltDiffusion (2022/11) | - |
| audio_diffusion | Unconditional Audio Generation | Audio Diffusion | Colab |
| controlnet | Image-to-Image Text-Guided Generation | ControlNet with Stable Diffusion (2023/02) | - |
| cycle_diffusion | Image-to-Image Text-Guided Generation | Cycle Diffusion (2020/10) | - |
| dance_diffusion | Unconditional Audio Generation | Dance Diffusion | - |
| ddpm | Unconditional Image Generation | Denoising Diffusion Probabilistic Models (2020/06) | - |
| ddim | Unconditional Image Generation | Denoising Diffusion Implicit Models (2020/10) | Colab |
| latent_diffusion | Text-to-Image Generation | High-Resolution Image Synthesis with Latent Diffusion Models (2021/12) | - |
| latent_diffusion | Super Resolution Image-to-Image | High-Resolution Image Synthesis with Latent Diffusion Models (2021/12) | - |
| latent_diffusion_uncond | Unconditional Image Generation | High-Resolution Image Synthesis with Latent Diffusion Models | - |
| paint_by_example | Image-Guided Image Inpainting | Paint by Example: Exemplar-based Image Editing with Diffusion Models (2022/11) | - |
| pndm | Unconditional Image Generation | Pseudo Numerical Methods for Diffusion Models on Manifolds (2022/02) | - |
| score_sde_ve | Unconditional Image Generation | Score-Based Generative Modeling through Stochastic Differential Equations (2020/09) | - |
| score_sde_vp (translator's note: broken link) | Unconditional Image Generation | Score-Based Generative Modeling through Stochastic Differential Equations (2020/09) | - |
| semantic_stable_diffusion | Text-to-Image Generation | SEGA: Instructing Diffusion using Semantic Dimensions (2023/01) | - |
| stable_diffusion_text2img | Text-to-Image Generation | Stable Diffusion (2022/08) | Colab |
| stable_diffusion_img2img | Image-to-Image Text-Guided Generation | Stable Diffusion (2022/08) | Colab |
| stable_diffusion_inpaint | Text-Guided Image Inpainting | Stable Diffusion (2022/08) | Colab |
| stable_diffusion_panorama | Text-Guided Panorama View Generation | MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation (2023/02) | - |
| stable_diffusion_pix2pix | Text-Based Image Editing | InstructPix2Pix: Learning to Follow Image Editing Instructions (2022/11) | - |
| stable_diffusion_pix2pix_zero | Text-Based Image Editing | Zero-shot Image-to-Image Translation (2023/02) | - |
| stable_diffusion_attend_and_excite | Text-to-Image Generation | Attend and Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models (2023/01) | - |
| stable_diffusion_self_attention_guidance | Text-to-Image Generation | Self-Attention Guidance (2022/10) | - |
| stable_diffusion_image_variation | Image-to-Image Generation | Stable Diffusion Image Variations (2022) | - |
| stable_diffusion_latent_upscale | Text-Guided Super Resolution Image-to-Image | Stable Diffusion Latent Upscaler | - |
| stable_diffusion_2 | Text-to-Image Generation | Stable Diffusion 2 | - |
| stable_diffusion_2 | Text-Guided Image Inpainting | Stable Diffusion 2 | - |
| stable_diffusion_2 | Depth-to-Image Text-Guided Generation | Stable Diffusion 2 | - |
| stable_diffusion_2 | Text-Guided Super Resolution Image-to-Image | Stable Diffusion 2 | - |
| stable_diffusion_safe | Text-Guided Generation | Safe Stable Diffusion | Colab |
| stable_unclip | Text-to-Image Generation | Stable unCLIP | - |
| stable_unclip | Image-to-Image Text-Guided Generation | Stable unCLIP | - |
| stochastic_karras_ve | Unconditional Image Generation | Elucidating the Design Space of Diffusion-Based Generative Models (2022/06) | - |
| unclip | Text-to-Image Generation | Hierarchical Text-Conditional Image Generation with CLIP Latents (2022/04) | - |
| versatile_diffusion | Text-to-Image Generation | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model (2022/11) | - |
| versatile_diffusion | Image Variations Generation | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model (2022/11) | - |
| versatile_diffusion | Dual Image and Text Guided Generation | Versatile Diffusion: Text, Images and Variations All in One Diffusion Model (2022/11) | - |
| vq_diffusion | Text-to-Image Generation | Vector Quantized Diffusion Model for Text-to-Image Synthesis (2021/11) | - |

Note: Pipelines are simple examples of how to play around with the diffusion systems as described in the corresponding papers.

However, most of them can be adapted to use different scheduler components or even different model components. Some pipeline examples are shown in the Examples section below.

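Pipelines API

Diffusion models often consist of multiple independently trained models or other previously existing components.

Each model has been trained independently on a different task, and the scheduler can easily be swapped out and replaced with a different one. During inference, however, we want to be able to easily load all components and use them together, even if one of the components, such as CLIP's text encoder, originates from a different library such as Transformers. To that end, all pipelines provide the following functionality: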

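* from_pretrained method, which accepts a Hugging Face Hub repository id, e.g. runwayml/stable-diffusion-v1-5, or a path to a local directory, e.g. "./stable-diffusion". To correctly determine which models and components should be loaded, one has to provide a model_index.json file, e.g. runwayml/stable-diffusion-v1-5/model_index.json, which defines all components that should be loaded into the pipeline. More specifically, for each model/component one needs to define an entry of the form <name>: ["<library>", "<class name>"]. <name> is the attribute name given to the loaded instance of <class name>, which can be found in the library or pipeline folder called "<library>".
* save_pretrained method, which accepts a local path, e.g. ./stable-diffusion, under which all models/components of the pipeline are saved. For each component/model a folder is created inside the local path, named after the given attribute name, e.g. ./stable-diffusion/unet. In addition, a model_index.json file is created at the root of the local path, e.g. ./stable-diffusion/model_index.json, so that the complete pipeline can be instantiated again from the local path.
* to method, which accepts a string or a torch.device and moves all models of type torch.nn.Module to the given device. The behavior is fully analogous to PyTorch's to method.
* __call__ method, which runs the pipeline in inference. __call__ defines the inference logic of the pipeline and should ideally encompass all aspects of it, from pre-processing, through forwarding tensors to the different models and schedulers, to post-processing. The API of the __call__ method can vary strongly from pipeline to pipeline. For example, a text-to-image pipeline such as StableDiffusionPipeline should accept, among other things, the text prompt used to generate the image. A pure image generation pipeline such as DDPMPipeline, on the other hand, can be run without providing any inputs. To better understand which inputs can be adapted for each pipeline, one should look directly into the respective pipeline.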
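To make the <name>: ["<library>", "<class name>"] layout of model_index.json more concrete, here is a minimal sketch that parses such a file with the standard library only. The JSON content is an abbreviated, hypothetical example written for this illustration, not a copy of any published file, and `list_components` is a helper name introduced here; the actual loading logic inside Diffusers is more involved.

```python
import json

# Hypothetical, abbreviated model_index.json content (illustration only).
MODEL_INDEX = """
{
  "_class_name": "StableDiffusionPipeline",
  "scheduler": ["diffusers", "PNDMScheduler"],
  "text_encoder": ["transformers", "CLIPTextModel"],
  "unet": ["diffusers", "UNet2DConditionModel"],
  "vae": ["diffusers", "AutoencoderKL"]
}
"""


def list_components(raw_json):
    """Map each attribute name to its (library, class name) pair."""
    index = json.loads(raw_json)
    components = {}
    for name, value in index.items():
        # Keys starting with "_" hold metadata, not components.
        if name.startswith("_"):
            continue
        library, class_name = value
        components[name] = (library, class_name)
    return components


components = list_components(MODEL_INDEX)
for name, (library, class_name) in components.items():
    print(f"{name}: load {class_name} from {library}")
```

Each key becomes an attribute of the pipeline, which is also why save_pretrained writes one folder per attribute name.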

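Note : All pipelines have PyTorch's autograd disabled by decorating the __call__ method with a torch.no_grad decorator, because pipelines should not be used for training. If you want to store the gradients during the forward pass, we recommend writing your own pipeline; see also the community examples.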
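The effect of the torch.no_grad decorator can be seen with a small toy class. This is a sketch assuming PyTorch is installed; `ToyPipeline` and `forward_with_grad` are hypothetical names used only for illustration and are not part of Diffusers.

```python
import torch


class ToyPipeline:
    """Hypothetical stand-in for a pipeline; not a Diffusers class."""

    def __init__(self):
        self.model = torch.nn.Linear(4, 4)

    @torch.no_grad()
    def __call__(self, x):
        # Autograd is disabled here, so no computation graph is recorded.
        return self.model(x)

    def forward_with_grad(self, x):
        # Same computation without the decorator: gradients can flow.
        return self.model(x)


pipe = ToyPipeline()
x = torch.randn(1, 4)

out_no_grad = pipe(x)
out_with_grad = pipe.forward_with_grad(x)

print(out_no_grad.requires_grad)    # False
print(out_with_grad.requires_grad)  # True
```

A custom pipeline that needs gradients would follow the second pattern and leave the decorator off its __call__ method.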

Contribution

(Translator's note: see the original documentation)

Examples

Text-to-Image generation with Stable Diffusion

# make sure you're logged in with `huggingface-cli login`
from diffusers import StableDiffusionPipeline

# Load the pretrained pipeline from the Hub and move it to the GPU
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("cuda")

# Run inference; the pipeline output holds a list of PIL images
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")

Image-to-Image text-guided generation with Stable Diffusion

The StableDiffusionImg2ImgPipeline lets you pass a text prompt and an initial image to condition the generation of new images.

import requests
import torch
from io import BytesIO
from PIL import Image

from diffusers import StableDiffusionImg2ImgPipeline

# Load the pipeline in half precision and move it to the GPU
device = "cuda"
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)

# Download an initial image and resize it
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

prompt = "A fantasy landscape, trending on artstation"

# strength sets how much the initial image is altered;
# guidance_scale sets how closely the result follows the prompt
images = pipe(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images

images[0].save("fantasy_landscape.png")

You can also run this example on colab

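Tweak prompts reusing seeds and latents

You can generate your own latents to reproduce results, or tweak your prompt on a specific result you liked. This notebook shows how to do it step by step. You can also run it in Google Colab.

In-painting using Stable Diffusion

The StableDiffusionInpaintPipeline lets you edit specific parts of an image by providing a mask and a text prompt.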

import requests
import torch
from io import BytesIO
from PIL import Image

from diffusers import StableDiffusionInpaintPipeline


def download_image(url):
    # Fetch an image over HTTP and return it as an RGB PIL image
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")


img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

# The image and the mask must have the same size
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

# Load the inpainting pipeline in half precision and move it to the GPU
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# White pixels in the mask are repainted; black pixels are kept
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]