HuggingFace Diffusers 0.16 : API : パイプライン : IF (翻訳/解説)

翻訳 : (株)クラスキャットセールスインフォメーション
作成日時 : 05/12/2023 (v0.16.0)

* 本ページは、HuggingFace Diffusers の以下のドキュメントを翻訳した上で適宜、補足説明したものです：

API : Pipelines – IF

* サンプルコードの動作確認はしておりますが、必要な場合には適宜、追加改変しています。
* ご自由にリンクを張って頂いてかまいませんが、sales-info@classcat.com までご一報いただけると嬉しいです。

クラスキャット人工知能研究開発支援サービス

◆ クラスキャットは人工知能・テレワークに関する各種サービスを提供しています。お気軽にご相談ください :

人工知能研究開発支援
1. 人工知能研修サービス(経営者層向けオンサイト研修)
2. テクニカルコンサルティングサービス
3. 実証実験(プロトタイプ構築)
4. アプリケーションへの実装
人工知能研修サービス
PoC(概念実証)を失敗させないための支援

◆ 人工知能とビジネスをテーマに WEB セミナーを定期的に開催しています。スケジュール。

お住まいの地域に関係なく Web ブラウザからご参加頂けます。事前登録 が必要ですのでご注意ください。

◆ お問合せ : 本件に関するお問い合わせ先は下記までお願いいたします。

株式会社クラスキャット セールス・マーケティング本部セールス・インフォメーション
sales-info@classcat.com ; Web: www.classcat.com ; ClassCatJP

HuggingFace Diffusers 0.16 : API : パイプライン : IF

概要

DeepFloyd IF は新しい最先端のオープンソースのテキスト-to-画像変換モデルで、高度なフォトレアリズムと言語理解を備えています。モデルは、凍結されたテキストエンコーダと 3 つのカスケード (接続) されたピクセル拡散モジュールから構成されるモジュールです :

ステージ 1 : テキストプロンプトに基づいて 64×64 px 画像を生成するベースモデル。
ステージ 2 : 64×64 px => 256×256 px 超解像度モデル、そして
ステージ 3 : 256×256 px => 1024×1024 px 超解像度モデル。ステージ 1 とステージ 2 は埋め込みを抽出するために T5 transformer に基づく凍結されたテキストエンコーダを利用し、これは交差アテンションとアテンション・プーリングで強化された UNet アーキテクチャに供給されます。ステージ 3 は Stability の x4 アップスケーリング・モデルです。結果は、COCO データセットで 6.66 のゼロショット FID スコアを達成し、現在の最先端モデルの性能を超える高度に効率的なモデルになります。私たちのワークはカスケード接続された拡散モデルの最初のステージにおけるより大きな UNet アーキテクチャの潜在的能力を明確に示し、テキスト-to-画像合成に対する有望な将来性を描きます。

使用方法

IF を使用する前に、使用条件を承認する必要があります。そのためには :

Hugging Face アカウントを持ちログインしていることを確実にしてください。
DeepFloyd/IF-I-XL-v1.0 と DeepFloyd/IF-II-L-v1.0 のモデルカードでライセンスを承認します。
ローカルでの Log-in を確実にします。huggingface_hub をインストールします :
```
pip install huggingface_hub --upgrade
```
Python シェルで login 関数を実行します :
```
from huggingface_hub import login

login()
```
そして貴方の Hugging Face Hub アクセストークンを入力します。

次に diffusers と依存関係をインストールします :

pip install diffusers accelerate transformers safetensors

利用可能なチェックポイント

ステージ 1
ステージ 2
- DeepFloyd/IF-II-L-v1.0
- DeepFloyd/IF-II-M-v1.0
ステージ 3
- stabilityai/stable-diffusion-x4-upscaler

デモ

Google Colab

テキスト-to-画像生成

デフォルトでは diffusers はモデル cpu オフローディングを利用して 14 GB ほどの少ない VRAM で IF パイプライン全体を実行します。

from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

# stage 1
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_I.png")

# stage 2
image = stage_2(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

# stage 3
image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images
image[0].save("./if_stage_III.png")

stage 1

stage 2

stage 3

テキスト誘導-画像-to-画像生成

同じ IF モデル重みがテキスト誘導画像-to-画像変換や画像バリエーションのために利用できます。この場合、IFInpaintingPipeline と IFInpaintingSuperResolutionPipeline パイプラインを使用して重みを確実にロードするだけです。

Note : ここで說明されるように、~DiffusionPipeline.components() 関数を利用して、テキスト-to-画像変換パイプラインの重みを画像-to-画像変換パイプラインに二度ロードすることなく直接移動することもできます。

from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil

import torch

from PIL import Image
import requests
from io import BytesIO

# download image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image = original_image.resize((768, 512))

# stage 1
stage_1 = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFImg2ImgSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "A fantasy landscape in style minecraft"
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
image = stage_1(
    image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_I.png")

# stage 2
image = stage_2(
    image=image,
    original_image=original_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

# stage 3
image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")

テキスト誘導インペインティング生成

from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch

from PIL import Image
import requests
from io import BytesIO

# download image
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png"
response = requests.get(url)
original_image = Image.open(BytesIO(response.content)).convert("RGB")
original_image = original_image

# download mask
url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png"
response = requests.get(url)
mask_image = Image.open(BytesIO(response.content))
mask_image = mask_image

# stage 1
stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()

# stage 2
stage_2 = IFInpaintingSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

# stage 3
safety_modules = {
    "feature_extractor": stage_1.feature_extractor,
    "safety_checker": stage_1.safety_checker,
    "watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()

prompt = "blue sunglasses"
generator = torch.manual_seed(1)

# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

# stage 1
image = stage_1(
    image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_I.png")

# stage 2
image = stage_2(
    image=image,
    original_image=original_image,
    mask_image=mask_image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    generator=generator,
    output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")

# stage 3
image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images
image[0].save("./if_stage_III.png")

異なるパイプライン間の変換

from_pretrained でロードされることに加えて、パイプラインは互いに直接ロードすることもできます。

from diffusers import IFPipeline, IFSuperResolutionPipeline

pipe_1 = IFPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0")
pipe_2 = IFSuperResolutionPipeline.from_pretrained("DeepFloyd/IF-II-L-v1.0")


from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline

pipe_1 = IFImg2ImgPipeline(**pipe_1.components)
pipe_2 = IFImg2ImgSuperResolutionPipeline(**pipe_2.components)


from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline

pipe_1 = IFInpaintingPipeline(**pipe_1.components)
pipe_2 = IFInpaintingSuperResolutionPipeline(**pipe_2.components)

スピードに対する最適化

IF を高速に実行する最も単純な最適化はすべてのモデルコンポーネントを GPU に移動することです。

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

より短い時間ステップ数について拡散過程を実行することもできます。

これは num_inference_steps 引数を使用して成されるか :

pipe("<prompt>", num_inference_steps=30)

あるいは timesteps 引数を使用します :

from diffusers.pipelines.deepfloyd_if import fast27_timesteps

pipe("", timesteps=fast27_timesteps)

画像バリエーションやインペインティングを行なうとき、strength 引数を使用して timesteps の数を減らすこともできます。strength 引数は入力画像に追加するノイズの総量で、ノイズ除去過程でどれだけのステップを実行するかも決定します。より小さい数は画像の変化を小さくしますがより高速に実行されます。

pipe = IFImg2ImgPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(image=image, prompt="<prompt>", strength=0.3).images

torch.compile を利用することもできます。IF で torch.compile を徹底的にテストはしていませんので、期待される結果を与えないかもしれないことに注意してください。

import torch

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.to("cuda")

pipe.text_encoder = torch.compile(pipe.text_encoder)
pipe.unet = torch.compile(pipe.unet)

メモリに対する最適化

GPU メモリに対して最適化するとき、標準的な diffusers の cpu オフローディング API を利用できます。

モデルベースの CPU ローディングか :

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

あるいはより積極的な層ベースの CPU オフローディングです。

pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-IF-v1.0", variant="fp16", torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()

更に、T5 は 8 bit 精度でロードできます :

from transformers import T5EncoderModel

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-IF-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-IF-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
)

prompt_embeds, negative_embeds = pipe.encode_prompt("")

すべてのモデルコンポーネントを CPU に一度にロードできない、google colab free tier のような CPU RAM が制約されたマシンについては、対応するモデルコンポーネントが必要なときにテキストエンコーダや unet を持つパイプラインだけを手動でロードすることができます。

from diffusers import IFPipeline, IFSuperResolutionPipeline
import torch
import gc
from transformers import T5EncoderModel
from diffusers.utils import pt_to_pil

text_encoder = T5EncoderModel.from_pretrained(
    "DeepFloyd/IF-I-IF-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit"
)

# text to image

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-IF-v1.0",
    text_encoder=text_encoder,  # pass the previously instantiated 8bit text encoder
    unet=None,
    device_map="auto",
)

prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
prompt_embeds, negative_embeds = pipe.encode_prompt(prompt)

# Remove the pipeline so we can re-load the pipeline with the unet
del text_encoder
del pipe
gc.collect()
torch.cuda.empty_cache()

pipe = IFPipeline.from_pretrained(
    "DeepFloyd/IF-I-IF-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
image = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

pt_to_pil(image)[0].save("./if_stage_I.png")

# Remove the pipeline so we can load the super-resolution pipeline
del pipe
gc.collect()
torch.cuda.empty_cache()

# First super resolution

pipe = IFSuperResolutionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16, device_map="auto"
)

generator = torch.Generator().manual_seed(0)
image = pipe(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
    generator=generator,
).images

pt_to_pil(image)[0].save("./if_stage_II.png")

以上

月	火	水	木	金	土	日
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31