CLIP (Contrastive Language-Image Pre-Training): Overview (translation / commentary)

Translation: ClassCat Co., Ltd. Sales Information
Updated: 12/09/2022
Created: 09/09/2022

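* This page is a translation of the following CLIP document, with supplementary explanations added where appropriate:

README.md

* The sample code has been tested, but it has been extended or modified where necessary.
* Feel free to link to this page, though we would appreciate a short note to sales-info@classcat.com.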

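CLIP (Contrastive Language-Image Pre-Training): Overview

CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. Given an image, it can be instructed in natural language to predict the most relevant text snippet without being directly optimized for the task, similar to the zero-shot capabilities of GPT-2 and GPT-3. We found that CLIP matches the performance of the original ResNet50 on ImageNet "zero-shot" without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.

Approach

Usage

First, install PyTorch 1.7.1 (or later) and torchvision, along with a few small additional dependencies, and then install this repository as a Python package. On a CUDA GPU machine, the following will work: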

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

Replace cudatoolkit=11.0 above with the CUDA version appropriate for your machine, or with cpuonly when installing on a machine without a GPU.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

image.shape, text.shape

(torch.Size([1, 3, 224, 224]), torch.Size([3, 77]))

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    print(image_features.shape, text_features.shape)

    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

torch.Size([1, 512]) torch.Size([3, 512])
Label probs: [[0.9927   0.004253 0.002968]]

API

The CLIP module clip provides the following methods:

clip.available_models()

Returns the names of the available CLIP models.

clip.available_models()

['RN50',
 'RN101',
 'RN50x4',
 'RN50x16',
 'RN50x64',
 'ViT-B/32',
 'ViT-B/16',
 'ViT-L/14',
 'ViT-L/14@336px']

clip.load(name, device=…, jit=False)

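Returns the model and the TorchVision transform required by the model, specified by a model name returned by clip.available_models(). The model is downloaded if necessary. The name argument can also be a path to a local checkpoint.

The device on which to run the model can optionally be specified. By default, the first CUDA device is used if one is available, otherwise the CPU. When jit is False, a non-JIT version of the model is loaded.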
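The default device choice described above can also be written out explicitly, which is what the usage example on this page does. The following is a minimal sketch rather than the library's internal code; default_device is a hypothetical helper name introduced here for illustration, and the import is wrapped so the snippet also runs where PyTorch is not installed.

```python
def default_device() -> str:
    """Return "cuda" when PyTorch reports a usable CUDA device, else "cpu"."""
    try:
        import torch  # optional import so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = default_device()
print(device)
```

The resulting string can be passed as the device argument, as in clip.load("ViT-B/32", device=device).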

clip.tokenize(text: Union[str, List[str]], context_length=77)

Returns a LongTensor containing the tokenized sequences of the given text input(s). This can be used as the input to the model.
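To see why every tokenized text ends up with the same length, the fixed-length layout can be sketched in plain Python. This is only an illustration under stated assumptions: it assumes each sequence is wrapped in start and end markers and right-padded with zeros, and SOT_ID, EOT_ID, pad_token_ids and the input ids are hypothetical placeholders, not the real vocabulary or the library's function.

```python
from typing import List

CONTEXT_LENGTH = 77  # matches the default context_length shown above
SOT_ID = 1  # hypothetical start-of-text id, for illustration only
EOT_ID = 2  # hypothetical end-of-text id, for illustration only


def pad_token_ids(token_ids: List[int], context_length: int = CONTEXT_LENGTH) -> List[int]:
    """Wrap ids in start/end markers and right-pad with zeros to a fixed length."""
    wrapped = [SOT_ID] + token_ids + [EOT_ID]
    if len(wrapped) > context_length:
        raise ValueError(f"input is too long for context length {context_length}")
    return wrapped + [0] * (context_length - len(wrapped))


# Three texts of different lengths become three rows of equal length,
# mirroring the [3, 77] shape printed in the usage example.
batch = [pad_token_ids(ids) for ids in ([10, 11], [12], [13, 14, 15])]
print(len(batch), len(batch[0]))
```

Because every row has the same length, the rows can be stacked into a single 2-D tensor and fed to the text encoder as a batch.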

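The model returned by clip.load() supports the following methods:

model.encode_image(image: Tensor)

Given a batch of images, returns the image features encoded by the vision portion of the CLIP model.

model.encode_text(text: Tensor)

Given a batch of text tokens, returns the text features encoded by the language portion of the CLIP model.

model(image: Tensor, text: Tensor)

Given a batch of images and a batch of text tokens, returns two Tensors containing the logit scores corresponding to each image and each text input. The values are the cosine similarities between the corresponding image and text features, multiplied by 100.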
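The relationship between the two returned tensors can be illustrated with a small pure-Python sketch. It is not the model's implementation: cosine_logits and l2_normalize are hypothetical helpers, the features are toy 2-dimensional vectors, and the scale of 100 is taken from the description above. It assumes the text-side logits are simply the image-side logits with rows and columns swapped.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(image_feats: Matrix, text_feats: Matrix,
                  scale: float = 100.0) -> Tuple[Matrix, Matrix]:
    """Return (logits_per_image, logits_per_text) as scaled cosine similarities."""
    images = [l2_normalize(v) for v in image_feats]
    texts = [l2_normalize(v) for v in text_feats]
    per_image = [[scale * sum(a * b for a, b in zip(i, t)) for t in texts]
                 for i in images]
    # The text-side logits are the same scores with rows and columns swapped.
    per_text = [list(col) for col in zip(*per_image)]
    return per_image, per_text


# Toy features: one image and three texts in a 2-dimensional embedding space.
image_feats = [[1.0, 0.0]]
text_feats = [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]]
logits_per_image, logits_per_text = cosine_logits(image_feats, text_feats)
print(logits_per_image)
```

With one image and three texts, logits_per_image has shape 1 x 3 and logits_per_text has shape 3 x 1. Because the vectors are normalized first, the length of a feature vector does not affect its score, only its direction does.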

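More Examples

Zero-Shot Prediction

The code below performs zero-shot prediction using CLIP, as shown in Appendix B of the paper. This example takes an image from the CIFAR-100 dataset and predicts the most likely labels among the dataset's 100 textual labels.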

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

preprocess

Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=None)
    CenterCrop(size=(224, 224))
    
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

image

class_id

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

image_features.shape, text_features.shape

(torch.Size([1, 512]), torch.Size([100, 512]))

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

values, indices

(tensor([0.6519, 0.1244, 0.0392, 0.0191, 0.0171], device='cuda:0',
        dtype=torch.float16), tensor([78, 93, 83, 44, 27], device='cuda:0'))

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

The output will look like the following (the exact numbers may differ slightly depending on the compute device):

Top predictions:

           snake: 65.19%
          turtle: 12.44%
    sweet_pepper: 3.92%
          lizard: 1.91%
       crocodile: 1.71%

Note that this example uses the encode_image() and encode_text() methods, which return the encoded features of the given inputs.
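The final ranking step of the example (softmax over the scaled similarities, then picking the highest-scoring labels) can be sketched without PyTorch. The helpers softmax and top_k below are hypothetical stand-ins for the tensor methods used above, and the logits are made-up numbers, not real model output.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: List[float], labels: List[str], k: int) -> List[Tuple[str, float]]:
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scaled similarities for four labels (not real model output).
labels = ["snake", "turtle", "lizard", "crocodile"]
logits = [25.0, 23.0, 21.0, 20.0]
for label, prob in top_k(softmax(logits), labels, k=2):
    print(f"{label:>16s}: {100 * prob:.2f}%")
```

Subtracting the maximum logit before exponentiating leaves the result unchanged but avoids overflow, which matters because similarities scaled by 100 can be large.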

Linear-probe evaluation

The example below uses scikit-learn to perform logistic regression on image features.

import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)


def get_features(dataset):
    all_features = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.
print(f"Accuracy = {accuracy:.3f}")

Note that the C value should be determined via a hyperparameter sweep using a validation split.
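One way such a sweep could be organized is sketched below. This is an assumption-laden illustration, not the procedure used by the CLIP authors: log_spaced, sweep and toy_validation_accuracy are hypothetical names, and the toy scoring function stands in for "fit LogisticRegression with this C on the training split and return accuracy on a held-out validation split".

```python
import math
from typing import Callable, Iterable, List, Tuple


def log_spaced(low_exp: float, high_exp: float, count: int) -> List[float]:
    """Return `count` values evenly spaced on a log10 scale."""
    if count < 2:
        return [10.0 ** low_exp]
    step = (high_exp - low_exp) / (count - 1)
    return [10.0 ** (low_exp + i * step) for i in range(count)]


def sweep(candidates: Iterable[float],
          evaluate: Callable[[float], float]) -> Tuple[float, float]:
    """Return (best_value, best_score) over the candidates, scored by `evaluate`."""
    scored = [(evaluate(c), c) for c in candidates]
    best_score, best_value = max(scored)
    return best_value, best_score


def toy_validation_accuracy(c: float) -> float:
    """Stand-in score that peaks at C = 1.0; replace with a real fit-and-score."""
    return 1.0 / (1.0 + math.log10(c) ** 2)


best_c, best_score = sweep(log_spaced(-3, 3, 7), toy_validation_accuracy)
print(best_c, best_score)
```

In a real run, the validation split would be carved out of the training set so that the test set is touched only once, after C has been chosen.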

End
