HuggingFace Diffusers 0.12 : Usage : Pipelines for Inference (3) (Translation/Commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 02/15/2023 (v0.12.1)

* This page is a translation, with supplementary explanations where appropriate, of the following HuggingFace Diffusers documentation:

Using Diffusers : Custom Pipelines

* The sample code has been checked to run, with additions and modifications made where necessary.
* Feel free to link to this page; we would appreciate a note to sales-info@classcat.com if you do.

HuggingFace Diffusers 0.12 : Usage : Pipelines for Inference (3)

Custom Pipelines

For more information about community pipelines, please have a look at this issue.

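Community examples consist of both inference and training examples that have been added by the community. See the table below for an overview of all community examples. Click on Code Example to get a copy-and-paste ready code example that you can try out. If a community pipeline doesn’t work as expected, please open an issue and ping the author on it.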

Example – Code Example – Author – Colab
    Description

CLIP Guided Stable Diffusion – CLIP Guided Stable Diffusion – Suraj Patil – Colab
    Performs CLIP guidance for text-to-image generation with Stable Diffusion.
One Step U-Net (Dummy) – One Step U-Net – Patrick von Platen
    Example showcasing how to use community pipelines (see https://github.com/huggingface/diffusers/issues/841).
Stable Diffusion Interpolation – Stable Diffusion Interpolation – Nate Raw
    Interpolates the latent space of Stable Diffusion between different prompts/seeds.
Stable Diffusion Mega – Stable Diffusion Mega – Patrick von Platen
    One Stable Diffusion pipeline with all the functionality of Text2Image, Image2Image, and Inpainting.
Long Prompt Weighting Stable Diffusion – Long Prompt Weighting Stable Diffusion – SkyTNT
    One Stable Diffusion pipeline without a token length limit, with support for parsing weighting in the prompt.
Speech to Image – Speech to Image – Mikail Duzenli
    Uses automatic speech recognition to transcribe text and Stable Diffusion to generate images.

To load a custom pipeline, you only need to pass the custom_pipeline argument to DiffusionPipeline, set to the name of one of the files in diffusers/examples/community. Feel free to send a PR with your own pipelines; we will merge them quickly.

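from diffusers import DiffusionPipeline

# custom_pipeline names one of the files in diffusers/examples/community
pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder"
)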
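To make the relationship between the custom_pipeline value and the community folder concrete, here is a minimal sketch that builds the GitHub page where a given community file can be read. The helper community_file_url and the URL template are illustrative assumptions, not part of the diffusers API, and the sketch only covers names from the community folder.

```python
# Hypothetical helper, not part of diffusers: maps a custom_pipeline name
# to the page of the matching file in diffusers/examples/community.
# The URL template is an assumption about where the file can be browsed.
COMMUNITY_FILE_TEMPLATE = (
    "https://github.com/huggingface/diffusers/blob/main/examples/community/{name}.py"
)


def community_file_url(custom_pipeline: str) -> str:
    """Return the assumed GitHub location of a community pipeline file."""
    # This sketch only handles bare file names such as "one_step_unet".
    if "/" in custom_pipeline or custom_pipeline.endswith(".py"):
        raise ValueError("pass the bare file name, e.g. 'one_step_unet'")
    return COMMUNITY_FILE_TEMPLATE.format(name=custom_pipeline)


print(community_file_url("clip_guided_stable_diffusion"))
```

Reading the file behind a name before loading it is a reasonable habit, since a community pipeline is arbitrary Python code that runs on your machine.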

Example usage : CLIP Guided Stable Diffusion

CLIP guided Stable Diffusion can help generate more realistic images by guiding Stable Diffusion at every denoising step with an additional CLIP model.
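Guidance of this kind generally needs a score for how well the image being denoised matches the prompt in CLIP embedding space, so that each step can be nudged toward a better match. The sketch below shows one such score, a spherical distance between two embedding vectors, in plain Python. That the community pipeline uses exactly this form is an assumption here, and the tiny three-dimensional "embeddings" are hypothetical stand-ins for real CLIP outputs.

```python
import math


def spherical_dist_loss(x, y):
    """Distance between the directions of two embedding vectors.

    Both vectors are normalized to unit length first, so only direction
    matters. The result equals (angle between them) ** 2 / 2.
    """
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    unit_x = [a / norm_x for a in x]
    unit_y = [b / norm_y for b in y]
    # Length of the straight line (chord) between the two unit vectors.
    chord = math.sqrt(sum((a - b) ** 2 for a, b in zip(unit_x, unit_y)))
    # Clamp to guard against rounding pushing the argument above 1.
    half_angle = math.asin(min(1.0, chord / 2))
    return 2 * half_angle ** 2


# Hypothetical toy embeddings; real CLIP embeddings have hundreds of dimensions.
text_embedding = [1.0, 0.0, 0.0]
aligned_image = [2.0, 0.0, 0.0]    # same direction as the text
unrelated_image = [0.0, 3.0, 0.0]  # orthogonal to the text

print(spherical_dist_loss(aligned_image, text_embedding))
print(spherical_dist_loss(unrelated_image, text_embedding))
```

The aligned vector scores zero and the orthogonal one scores higher, which is the property a guidance term relies on: lowering the loss pulls the image toward the prompt.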

The following code requires roughly 12GB of GPU RAM.

from diffusers import DiffusionPipeline
from transformers import CLIPFeatureExtractor, CLIPModel
import torch


feature_extractor = CLIPFeatureExtractor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
clip_model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", torch_dtype=torch.float16)


guided_pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="clip_guided_stable_diffusion",
    clip_model=clip_model,
    feature_extractor=feature_extractor,
    torch_dtype=torch.float16,
)
guided_pipeline.enable_attention_slicing()
guided_pipeline = guided_pipeline.to("cuda")

prompt = "fantasy book cover, full moon, fantasy forest landscape, golden vector elements, fantasy magic, dark light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Albert Bierstadt, masterpiece"

generator = torch.Generator(device="cuda").manual_seed(0)
images = []
for i in range(4):
    image = guided_pipeline(
        prompt,
        num_inference_steps=50,
        guidance_scale=7.5,
        clip_guidance_scale=100,
        num_cutouts=4,
        use_cutouts=False,
        generator=generator,
    ).images[0]
    images.append(image)

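# save images locally (create the output directory first so img.save does not fail)
import os

os.makedirs("./clip_guided_sd", exist_ok=True)
for i, img in enumerate(images):
    img.save(f"./clip_guided_sd/image_{i}.png")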

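The images list contains PIL images that can be saved locally or displayed directly in a Google Colab. Generated images tend to be of higher quality than those produced by using Stable Diffusion natively. For example, the script above generates the following images: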

One Step Unet

The dummy “one-step-unet” can be run as follows:

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="one_step_unet")
pipe()

Note : This community pipeline is not useful as a feature, but rather just serves as an example of how community pipelines can be added (see https://github.com/huggingface/diffusers/issues/841).

Stable Diffusion Interpolation

The following code can be run on a GPU with at least 8GB of VRAM and should take approximately 5 minutes.

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    safety_checker=None,  # Very important for videos...lots of false positives while interpolating
    custom_pipeline="interpolate_stable_diffusion",
).to("cuda")
pipe.enable_attention_slicing()

frame_filepaths = pipe.walk(
    prompts=["a dog", "a cat", "a horse"],
    seeds=[42, 1337, 1234],
    num_interpolation_steps=16,
    output_dir="./dreams",
    batch_size=4,
    height=512,
    width=512,
    guidance_scale=8.5,
    num_inference_steps=50,
)

For more detailed information on how to create videos using Stable Diffusion, as well as more feature-complete functionality, please have a look at https://github.com/nateraw/stable-diffusion-videos.
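Interpolating between seeds can be pictured as moving along an arc between two noise vectors rather than along a straight line, which keeps the intermediate vectors at a similar magnitude. The sketch below implements spherical linear interpolation (slerp) in plain Python. It assumes, without confirmation from this page, that the community pipeline interpolates its starting noise in a comparable way and spaces the steps evenly between 0 and 1; noise_a and noise_b are hypothetical two-dimensional stand-ins for the real latent tensors.

```python
import math


def slerp(t, v0, v1, dot_threshold=0.9995):
    """Spherical linear interpolation between two equal-length vectors."""
    norm0 = math.sqrt(sum(a * a for a in v0))
    norm1 = math.sqrt(sum(b * b for b in v1))
    dot = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    if abs(dot) > dot_threshold:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    theta = math.acos(dot)
    weight0 = math.sin((1 - t) * theta) / math.sin(theta)
    weight1 = math.sin(t * theta) / math.sin(theta)
    return [weight0 * a + weight1 * b for a, b in zip(v0, v1)]


# Hypothetical stand-ins for the noise drawn from two different seeds.
noise_a = [1.0, 0.0]
noise_b = [0.0, 1.0]
num_interpolation_steps = 5  # the example above uses 16

frames = [
    slerp(i / (num_interpolation_steps - 1), noise_a, noise_b)
    for i in range(num_interpolation_steps)
]
for frame in frames:
    print([round(value, 4) for value in frame])
```

Every intermediate vector in this toy run keeps unit length, whereas a straight-line blend of the same two vectors would shrink toward the midpoint.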

Stable Diffusion Mega

The Stable Diffusion Mega pipeline lets you use the main use cases of the Stable Diffusion pipeline in a single class.

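#!/usr/bin/env python3
from diffusers import DiffusionPipeline
from PIL import Image  # "import PIL" alone does not guarantee PIL.Image is loaded
import requests
from io import BytesIO
import torch


def download_image(url):
    # Fetch the image over HTTP and decode it as an RGB PIL image
    response = requests.get(url)
    response.raise_for_status()
    return Image.open(BytesIO(response.content)).convert("RGB")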


pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="stable_diffusion_mega",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
pipe.enable_attention_slicing()


### Text-to-Image

images = pipe.text2img("An astronaut riding a horse").images

### Image-to-Image

init_image = download_image(
    "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
)

prompt = "A fantasy landscape, trending on artstation"

images = pipe.img2img(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images

### Inpainting

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

prompt = "a cat sitting on a bench"
images = pipe.inpaint(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.75).images

As shown above, this one pipeline can run “text-to-image”, “image-to-image”, and “inpainting”.

Long Prompt Weighting Stable Diffusion

This pipeline lets you input a prompt without the 77 token length limit. You can increase the weighting of words by using “()” and decrease it by using “[]”. The pipeline also lets you use the main use cases of the Stable Diffusion pipeline in a single class.
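The bracket syntax described above can be illustrated with a small parser that turns a prompt into text pieces with weights. This is a simplified sketch, not the pipeline's own parser: the multipliers (1.1 for each “()” and 1/1.1 for each “[]”), the explicit “(text:1.3)” form being read as a direct weight, and the function name parse_prompt_weights are all assumptions for illustration, and escaped brackets are not handled.

```python
import re

# Assumed multipliers: each "(...)" scales by 1.1, each "[...]" by 1/1.1.
ROUND_MULT = 1.1
SQUARE_MULT = 1 / 1.1

# Tokens: single brackets, an explicit ":<number>)" weight, plain text, stray ":".
TOKEN_RE = re.compile(r"\(|\)|\[|\]|:([+-]?\d*\.?\d+)\s*\)|[^()\[\]:]+|:")


def parse_prompt_weights(prompt):
    """Split a prompt into [text, weight] pairs (simplified sketch)."""
    result = []
    round_starts, square_starts = [], []

    def scale(start, factor):
        # Multiply the weight of every piece added since the bracket opened.
        for piece in result[start:]:
            piece[1] *= factor

    for match in TOKEN_RE.finditer(prompt):
        token = match.group(0)
        if token == "(":
            round_starts.append(len(result))
        elif token == "[":
            square_starts.append(len(result))
        elif match.group(1) is not None and round_starts:
            scale(round_starts.pop(), float(match.group(1)))  # "(text:1.3)"
        elif token == ")" and round_starts:
            scale(round_starts.pop(), ROUND_MULT)
        elif token == "]" and square_starts:
            scale(square_starts.pop(), SQUARE_MULT)
        else:
            result.append([token, 1.0])

    # Treat brackets that were never closed as if they closed at the end.
    for start in round_starts:
        scale(start, ROUND_MULT)
    for start in square_starts:
        scale(start, SQUARE_MULT)
    return result


pieces = parse_prompt_weights("best_quality (1girl:1.3) bow")
print(pieces)
```

Nesting compounds in this sketch, so “((word))” ends up with a weight of 1.1 × 1.1.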

pytorch

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "hakurei/waifu-diffusion", custom_pipeline="lpw_stable_diffusion", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "best_quality (1girl:1.3) bow bride brown_hair closed_mouth frilled_bow frilled_hair_tubes frills (full_body:1.3) fox_ear hair_bow hair_tubes happy hood japanese_clothes kimono long_sleeves red_bow smile solo tabi uchikake white_kimono wide_sleeves cherry_blossoms"
neg_prompt = "lowres, bad_anatomy, error_body, error_hair, error_arm, error_hands, bad_hands, error_fingers, bad_fingers, missing_fingers, error_legs, bad_legs, multiple_legs, missing_legs, error_lighting, error_shadow, error_reflection, text, error, extra_digit, fewer_digits, cropped, worst_quality, low_quality, normal_quality, jpeg_artifacts, signature, watermark, username, blurry"

pipe.text2img(prompt, negative_prompt=neg_prompt, width=512, height=512, max_embeddings_multiples=3).images[0]

onnxruntime

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="lpw_stable_diffusion_onnx",
    revision="onnx",
    provider="CUDAExecutionProvider",
)

prompt = "a photo of an astronaut riding a horse on mars, best quality"
neg_prompt = "lowres, bad anatomy, error body, error hair, error arm, error hands, bad hands, error fingers, bad fingers, missing fingers, error legs, bad legs, multiple legs, missing legs, error lighting, error shadow, error reflection, text, error, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry"

pipe.text2img(prompt, negative_prompt=neg_prompt, width=512, height=512, max_embeddings_multiples=3).images[0]

If you see the message “Token indices sequence length is longer than the specified maximum sequence length for this model ( *** > 77 ) . Running this sequence through the model will result in indexing errors”, do not worry; it is normal.
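One way to see why a prompt longer than 77 tokens need not cause indexing errors is to split the token ids into windows that each fit the text encoder. The sketch below does this in plain Python. It is an illustration under stated assumptions, not the pipeline's actual code: the window layout (75 prompt tokens framed by a start and an end token), the padding with the end token, the special-token ids, and the function name chunk_token_ids are all assumed here.

```python
# All constants are assumptions for illustration.
MODEL_MAX_LENGTH = 77    # text encoder window, including start and end tokens
BOS, EOS = 49406, 49407  # assumed CLIP tokenizer start/end token ids


def chunk_token_ids(token_ids, max_embeddings_multiples=3):
    """Split prompt token ids into encoder-sized windows (hypothetical helper)."""
    body = MODEL_MAX_LENGTH - 2  # 75 usable positions per window
    # Anything beyond the allowed number of windows is truncated.
    token_ids = token_ids[: body * max_embeddings_multiples]
    chunks = []
    for start in range(0, len(token_ids), body):
        piece = token_ids[start : start + body]
        piece = piece + [EOS] * (body - len(piece))  # pad the last window
        chunks.append([BOS] + piece + [EOS])
    return chunks


# A hypothetical 100-token prompt needs two windows of 77 ids each.
chunks = chunk_token_ids(list(range(100)))
print(len(chunks), [len(chunk) for chunk in chunks])
```

Under these assumptions, max_embeddings_multiples=3 as used in the examples above would allow up to 3 × 75 = 225 prompt tokens before truncation.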

Speech to Image

The following code can generate an image from an audio sample using the pre-trained OpenAI whisper-small and Stable Diffusion.

import torch

import matplotlib.pyplot as plt
from datasets import load_dataset
from diffusers import DiffusionPipeline
from transformers import (
    WhisperForConditionalGeneration,
    WhisperProcessor,
)


device = "cuda" if torch.cuda.is_available() else "cpu"

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")

audio_sample = ds[3]

text = audio_sample["text"].lower()
speech_data = audio_sample["audio"]["array"]

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device)
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

diffuser_pipeline = DiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    custom_pipeline="speech_to_image_diffusion",
    speech_model=model,
    speech_processor=processor,
    torch_dtype=torch.float16,
)

diffuser_pipeline.enable_attention_slicing()
diffuser_pipeline = diffuser_pipeline.to(device)

output = diffuser_pipeline(speech_data)
plt.imshow(output.images[0])

This example produces the following image:

That's all.
