HuggingFace Blog : Stable Diffusion with 🧨 Diffusers (Translation/Commentary)
Translated by : ClassCat Sales Information, ClassCat Co., Ltd.
Date : 11/13/2022
* This page is a translation of the following HuggingFace Blog post, with supplementary explanations added where appropriate:
- Stable Diffusion with 🧨 Diffusers (Authors : Suraj Patil, Pedro Cuenca, Nathan Lambert, Patrick von Platen : 08/22/2022)
HuggingFace Blog : Stable Diffusion with 🧨 Diffusers
Stable Diffusion is a text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI and LAION. It is trained on 512×512 images from a subset of the LAION-5B database. LAION-5B is the largest, freely accessible multi-modal dataset that currently exists.
In this post, we show how to use Stable Diffusion with the 🧨 Diffusers library, explain how the model works, and finally dive a bit deeper into how diffusers allows one to customize the image generation pipeline.
Running Stable Diffusion
License
Before using the model, you need to accept the model license in order to download and use the weights.
The license is designed to mitigate the potential harmful effects of such a powerful machine learning system. We request users to read the license entirely and carefully. Here we offer a summary:
- You can't use the model to deliberately produce or share illegal or harmful outputs or content,
- We claim no rights on the outputs you generate; you are free to use them, but you are accountable for their use, which must not go against the provisions set in the license, and
- You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware that you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M with all your users.
Usage
First, you need to install diffusers==0.4.0 to run the following code snippets:
pip install diffusers==0.4.0 transformers scipy ftfy
In this post we'll use model version v1-4, so you need to visit its card, read the license, and tick the checkbox if you agree. You have to be a registered user on the 🤗 Hugging Face Hub, and you'll also need to use an access token for the code to work. For more information on access tokens, please refer to this section of the documentation. Once you have requested access, make sure to pass your user token as:
YOUR_TOKEN="/your/huggingface/hub/token"
After this one-time setup, we can proceed with Stable Diffusion inference.
The Stable Diffusion model can be run in inference with just a couple of lines using the StableDiffusionPipeline pipeline. The pipeline sets up everything you need to generate images from text with a simple from_pretrained function call.
from diffusers import StableDiffusionPipeline
# get your token at https://huggingface.co/settings/tokens
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", use_auth_token=YOUR_TOKEN)
If a GPU is available, let’s move it to one!
pipe.to("cuda")
Note : If you are limited by GPU memory and have less than 10GB of GPU RAM available, please make sure to load the StableDiffusionPipeline in float16 precision instead of the default float32 precision as done above. You can do so by loading the weights from the fp16 branch and by telling diffusers to expect the weights to be in float16 precision:
import torch
from diffusers import StableDiffusionPipeline
# get your token at https://huggingface.co/settings/tokens
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", revision="fp16", torch_dtype=torch.float16, use_auth_token=YOUR_TOKEN)
To run the pipeline, simply define the prompt and call pipe.
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt).images[0]
# you can save the image with
# image.save(f"astronaut_rides_horse.png")
The result would look as follows:
The previous code will give you a different image every time you run it.
If at some point you get a black image, it may be because the content filter built inside the model has detected an NSFW result. If you believe this shouldn't be the case, try tweaking your prompt or using a different seed. In fact, the model predictions include information about whether NSFW was detected for a particular result. Let's see what they look like:
result = pipe(prompt)
print(result)
StableDiffusionPipelineOutput(images=[<PIL.Image.Image image mode=RGB size=512x512>], nsfw_content_detected=[False])
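Since the output exposes these flags alongside the images, a simple retry loop is one way to handle flagged results. Below is a minimal sketch (not part of the original post; generate_safe is a hypothetical helper):
import torch

# Hypothetical helper (not from the original post): retry with a new seed
# whenever the safety checker flags the result as NSFW.
def generate_safe(pipe, prompt, max_tries=5):
    for seed in range(max_tries):
        generator = torch.Generator("cuda").manual_seed(seed)
        result = pipe(prompt, generator=generator)
        if not result.nsfw_content_detected[0]:
            return result.images[0]
    return None  # every attempt was flagged

image = generate_safe(pipe, "a photograph of an astronaut riding a horse")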
If you want deterministic output, you can seed a random generator and pass it to the pipeline. Every time you use a generator with the same seed you'll get the same image output.
import torch
generator = torch.Generator("cuda").manual_seed(1024)
image = pipe(prompt, guidance_scale=7.5, generator=generator).images[0]
# you can save the image with
# image.save(f"astronaut_rides_horse.png")
The result would look as follows:
You can change the number of inference steps using the num_inference_steps argument.
In general, results are better the more steps you use, but the more steps, the longer the generation takes. Stable Diffusion works quite well with a relatively small number of steps, so we recommend using the default of 50 inference steps. If you want faster results, you can use a smaller number; if you want potentially higher-quality results, you can use a larger number.
Let's try running the pipeline with fewer denoising steps.
import torch
generator = torch.Generator("cuda").manual_seed(1024)
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=15, generator=generator).images[0]
# you can save the image with
# image.save(f"astronaut_rides_horse.png")
Note how the structure is the same, but there are problems in the astronaut's suit and the general form of the horse. This shows that using only 15 denoising steps has significantly degraded the quality of the generation result. As stated earlier, 50 denoising steps is usually sufficient to generate high-quality images.
Besides num_inference_steps, we've been using another function argument, called guidance_scale, in all previous examples. guidance_scale is a way to increase the adherence to the conditional signal that guides the generation (text, in this case) as well as overall sample quality. It is also known as classifier-free guidance, which in simple terms forces the generation to better match the prompt, potentially at the cost of image quality or diversity. Values between 7 and 8.5 are usually good choices for Stable Diffusion; by default the pipeline uses a guidance_scale of 7.5.
If you use a very large value the images might look good, but will be less diverse. You can learn about the technical details of this parameter in this section of the post.
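To get a feel for this trade-off, a quick experiment (a sketch, not part of the original post) is to render the same prompt and seed at several guidance scales and compare the outputs:
import torch

# same prompt and seed, three different guidance scales
for scale in [1.0, 7.5, 15.0]:
    generator = torch.Generator("cuda").manual_seed(1024)
    image = pipe(prompt, guidance_scale=scale, generator=generator).images[0]
    image.save(f"astronaut_gs_{scale}.png")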
Next, let's see how we can generate multiple images of the same prompt at once. First, we'll create an image_grid function that helps us visualize them nicely in a grid.
from PIL import Image
def image_grid(imgs, rows, cols):
    assert len(imgs) == rows * cols
    # all images are assumed to have the same size
    w, h = imgs[0].size
    # create a blank canvas large enough to hold the full grid
    grid = Image.new('RGB', size=(cols * w, rows * h))
    # paste each image at its (row, column) position
    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid
We can generate multiple images for the same prompt by simply using a list with the same prompt repeated several times. We'll send the list to the pipeline instead of the string we used before.
num_images = 3
prompt = ["a photograph of an astronaut riding a horse"] * num_images
images = pipe(prompt).images
grid = image_grid(images, rows=1, cols=3)
# you can save the grid with
# grid.save(f"astronaut_rides_horse.png")
By default, stable diffusion produces images of 512 × 512 pixels. It's very easy to override the default using the height and width arguments to create rectangular images in portrait or landscape ratios.
When choosing image sizes, we advise the following:
- Make sure height and width are both multiples of 8.
- Going below 512 might result in lower-quality images.
- Going over 512 in both directions will repeat image areas (global coherence is lost).
- The best way to create non-square images is to use 512 in one dimension, and a value larger than that in the other one.
Let’s run an example:
prompt = "a photograph of an astronaut riding a horse"
image = pipe(prompt, height=512, width=768).images[0]
# you can save the image with
# image.save(f"astronaut_rides_horse.png")
How does Stable Diffusion work?
Having seen the high-quality images that stable diffusion can produce, let's try to understand a bit better how the model functions.
Stable Diffusion is based on a particular type of diffusion model called Latent Diffusion, proposed in High-Resolution Image Synthesis with Latent Diffusion Models.
Generally speaking, diffusion models are machine learning systems that are trained to denoise random Gaussian noise step by step, to get to a sample of interest, such as an image. For a more detailed overview of how they work, check this colab.
Diffusion models have been shown to achieve state-of-the-art results for generating image data. But one downside of diffusion models is that the reverse denoising process is slow because of its repeated, sequential nature. In addition, these models consume a lot of memory because they operate in pixel space, which becomes huge when generating high-resolution images. Therefore, it is challenging both to train these models and to use them for inference.
Latent diffusion can reduce the memory and compute complexity by applying the diffusion process over a lower-dimensional latent space, instead of using the actual pixel space. This is the key difference between standard diffusion and latent diffusion models: in latent diffusion the model is trained to generate latent (compressed) representations of the images.
There are three main components in latent diffusion:
- An autoencoder (VAE).
- A U-Net.
- A text encoder, e.g. CLIP's Text Encoder.
1. The autoencoder (VAE)
The VAE model has two parts, an encoder and a decoder. The encoder is used to convert the image into a low-dimensional latent representation, which will serve as the input to the U-Net model. The decoder, conversely, transforms the latent representation back into an image.
During latent diffusion training, the encoder is used to get the latent representations (latents) of the images for the forward diffusion process, which applies more and more noise at each step. During inference, the denoised latents generated by the reverse diffusion process are converted back into images using the VAE decoder. As we will see, we only need the VAE decoder during inference.
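As a minimal sketch of this round trip (assuming the vae component that the custom pipeline below loads and moves to GPU, and the 0.18215 latent scaling factor used in the decoding step at the end of this post):
import torch

# stand-in for a real image tensor normalized to [-1, 1]
image_tensor = torch.randn(1, 3, 512, 512).to("cuda")

with torch.no_grad():
    # encoder: pixels -> scaled latents of shape (1, 4, 64, 64)
    latents = 0.18215 * vae.encode(image_tensor).latent_dist.sample()
    # decoder: latents -> pixels of shape (1, 3, 512, 512)
    reconstruction = vae.decode(latents / 0.18215).sample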
2. The U-Net
The U-Net has an encoder part and a decoder part, both comprised of ResNet blocks. The encoder compresses an image representation into a lower-resolution image representation, and the decoder decodes the lower-resolution image representation back into the original, higher-resolution image representation that is supposedly less noisy. More specifically, the U-Net output predicts the noise residual, which can be used to compute the predicted denoised image representation.
To prevent the U-Net from losing important information while downsampling, short-cut connections are usually added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder. Additionally, the stable diffusion U-Net is able to condition its output on text embeddings via cross-attention layers. The cross-attention layers are added to both the encoder and decoder parts of the U-Net, usually between ResNet blocks.
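A short sketch of that interface, using random stand-ins for the inputs (the real latents and text embeddings are built in the custom pipeline below):
import torch

noisy_latents = torch.randn(1, 4, 64, 64).to("cuda")  # stand-in for noisy latents
text_emb = torch.randn(1, 77, 768).to("cuda")         # stand-in for CLIP text embeddings

with torch.no_grad():
    # the U-Net predicts the noise residual, with the same shape as its latent input
    noise_pred = unet(noisy_latents, 980, encoder_hidden_states=text_emb).sample
print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])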
3. The text encoder
The text encoder is responsible for transforming the input prompt, e.g. "An astronaut riding a horse", into an embedding space that can be understood by the U-Net. It is usually a simple transformer-based encoder that maps the sequence of input tokens to a sequence of latent text embeddings.
Inspired by Imagen, Stable Diffusion does not train the text encoder during training, and simply uses CLIP's already trained text encoder, CLIPTextModel.
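In code, this step looks as follows (the same CLIP checkpoint is loaded again in the custom pipeline below):
from transformers import CLIPTextModel, CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# tokenize the prompt and map it to a sequence of latent text embeddings
tokens = tokenizer("An astronaut riding a horse", padding="max_length",
                   max_length=tokenizer.model_max_length, return_tensors="pt")
embeddings = text_encoder(tokens.input_ids)[0]
print(embeddings.shape)  # torch.Size([1, 77, 768])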
Why is latent diffusion fast and efficient?
Since latent diffusion operates on a low-dimensional space, it greatly reduces the memory and compute requirements compared to pixel-space diffusion models. For example, the autoencoder used in Stable Diffusion has a reduction factor of 8. This means that an image of shape (3, 512, 512) becomes (4, 64, 64) in latent space, which requires 8 × 8 = 64 times less memory.
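The arithmetic behind that claim:
# a reduction factor of 8 in each spatial dimension shrinks every channel by 8 * 8 = 64
print((512 * 512) // (64 * 64))  # 64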
This is why it's possible to generate 512 × 512 images so quickly, even on 16GB Colab GPUs!
Stable Diffusion during inference
Putting it all together, let's now take a closer look at how the model works in inference by illustrating the logical flow.
The stable diffusion model takes both a latent seed and a text prompt as input. The latent seed is then used to generate random latent image representations of size 64 × 64, whereas the text prompt is transformed to text embeddings of size 77 × 768 via CLIP's text encoder.
Next, the U-Net iteratively denoises the random latent image representations while being conditioned on the text embeddings. The output of the U-Net, being the noise residual, is used to compute a denoised latent image representation via a scheduler algorithm. Many different scheduler algorithms can be used for this computation, each having its pros and cons. For Stable Diffusion, we recommend using one of the following:
- The PNDM scheduler (used by default)
- The DDIM scheduler
- The K-LMS scheduler
Theory on how the scheduler algorithms work is out of scope for this notebook, but in short one should remember that they compute the predicted denoised image representation from the previous noise representation and the predicted noise residual. For more information, we recommend looking into Elucidating the Design Space of Diffusion-Based Generative Models.
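If you only want to experiment with these schedulers, note that from_pretrained accepts component overrides, so you can swap in K-LMS without writing a custom loop (a sketch; the post builds the full pipeline by hand below):
from diffusers import LMSDiscreteScheduler, StableDiffusionPipeline

lms = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012,
                           beta_schedule="scaled_linear", num_train_timesteps=1000)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", scheduler=lms, use_auth_token=YOUR_TOKEN
).to("cuda")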
The denoising process is repeated ca. 50 times to step-by-step retrieve better latent image representations. Once complete, the latent image representation is decoded by the decoder part of the variational autoencoder.
After this brief introduction to Latent and Stable Diffusion, let's see how to make advanced use of the 🤗 Hugging Face Diffusers library!
Writing your own inference pipeline
Finally, we show how you can create custom diffusion pipelines with diffusers. Writing a custom inference pipeline is an advanced use of the diffusers library that can be useful to switch out certain components, such as the VAE or scheduler explained above.
For example, we'll show how to use Stable Diffusion with a different scheduler, namely Katherine Crowson's K-LMS scheduler (added in this PR).
The pre-trained model includes all the components required to set up a complete diffusion pipeline. They are stored in the following folders:
- text_encoder : Stable Diffusion uses CLIP, but other diffusion models may use other encoders such as BERT.
- tokenizer : It must match the one used by the text_encoder model.
- scheduler : The scheduling algorithm used to progressively add noise to the image during training.
- unet : The model used to generate the latent representation of the input.
- vae : The autoencoder module that we'll use to decode latent representations into real images.
We can load the components by referring to the folder they were saved in, using the subfolder argument to from_pretrained.
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler
# 1. Load the autoencoder model which will be used to decode the latents into image space.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_auth_token=YOUR_TOKEN)
# 2. Load the tokenizer and text encoder to tokenize and encode the text.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
# 3. The UNet model for generating the latents.
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", use_auth_token=YOUR_TOKEN)
Now, instead of loading the pre-defined scheduler, we load the K-LMS scheduler with some fitting parameters.
from diffusers import LMSDiscreteScheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
Next, let’s move the models to GPU.
torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)
Next, we define the parameters we'll use to generate images.
Note that guidance_scale is defined analogously to the guidance weight w of equation (2) in the Imagen paper. guidance_scale == 1 corresponds to doing no classifier-free guidance. Here we set it to 7.5 as done previously.
In contrast to the previous examples, we set num_inference_steps to 100 to get an even more defined image.
prompt = ["a photograph of an astronaut riding a horse"]
height = 512 # default height of Stable Diffusion
width = 512 # default width of Stable Diffusion
num_inference_steps = 100 # Number of denoising steps
guidance_scale = 7.5 # Scale for classifier-free guidance
generator = torch.manual_seed(0) # Seed generator to create the initial latent noise
batch_size = len(prompt)
First, we get the text_embeddings for the passed prompt. These embeddings will be used to condition the UNet model and guide the image generation towards something that should resemble the input prompt.
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
We'll also get the unconditional text embeddings for classifier-free guidance, which are just the embeddings for the padding token (empty text). They need to have the same shape as the conditional text_embeddings (batch_size and seq_length).
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
[""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)
uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
For classifier-free guidance, we need to do two forward passes: one with the conditioned input (text_embeddings), and another with the unconditional embeddings (uncond_embeddings). In practice, we can concatenate both into a single batch to avoid doing two forward passes.
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
Next, we generate the initial random noise.
latents = torch.randn(
(batch_size, unet.in_channels, height // 8, width // 8),
generator=generator,
)
latents = latents.to(torch_device)
If we examine the latents at this stage we'll see their shape is torch.Size([1, 4, 64, 64]), much smaller than the image we want to generate. The model will transform this latent representation (pure noise) into a 512 × 512 image later on.
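A quick check (not in the original code):
print(latents.shape)  # torch.Size([1, 4, 64, 64])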
Next, we initialize the scheduler with our chosen num_inference_steps. This will compute the sigmas and exact time step values to be used during the denoising process.
scheduler.set_timesteps(num_inference_steps)
The K-LMS scheduler needs to multiply the latents by its sigma values. Let's do this here:
latents = latents * scheduler.init_noise_sigma
We are ready to write the denoising loop.
from tqdm.auto import tqdm

for t in tqdm(scheduler.timesteps):
    # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
And we use the vae to decode the generated latents back into the image.
# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents
image = vae.decode(latents).sample
And finally, let's convert the image to PIL so we can display or save it.
image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
pil_images[0]