Sentence Transformers 2.2 : Notebooks : Image Search – Duplicate Images (Removing Duplicate Images) (Translation/Commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 12/02/2022 (v2.2.2)

* This page is a translation of the following UKPLab/sentence-transformers document, with supplementary explanations added where appropriate:

Image Duplicates & Near Duplicates

* The sample code has been checked to run, and has been extended or modified where necessary.
* Feel free to link to this page; a short note to sales-info@classcat.com would be appreciated.

Sentence Transformers 2.2 : Notebooks : Image Search – Duplicate Images (Removing Duplicate Images)

This example shows how SentenceTransformer can be used to find duplicate and near-duplicate images.

As the model, we use the OpenAI CLIP model, which was trained on a large set of images paired with their alt texts.

As the source of photos, we use the Unsplash Dataset Lite, which contains about 25k images. See the license for details on the Unsplash images.

We encode all images into a vector space and then look for high-density regions in that space, that is, regions where the images are very similar to each other.
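
To make "similar in vector space" concrete, here is a minimal sketch of cosine similarity, the score used later on this page, computed in plain Python. The three-dimensional vectors and their names are made up for illustration; real CLIP embeddings have several hundred dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (assumption: not taken from the dataset)
photo_a = [0.9, 0.1, 0.0]
photo_a_copy = [0.9, 0.1, 0.0]   # the same photo uploaded twice
photo_b = [0.8, 0.2, 0.1]        # a similar scene
photo_c = [0.0, 0.1, 0.9]        # an unrelated photo

print("copy:     ", round(cosine_similarity(photo_a, photo_a_copy), 3))
print("similar:  ", round(cosine_similarity(photo_a, photo_b), 3))
print("unrelated:", round(cosine_similarity(photo_a, photo_c), 3))
```

Identical vectors score (up to floating-point rounding) 1, similar ones score slightly lower, and unrelated ones score close to 0. This is why duplicates appear at the top of a list sorted by score.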

from sentence_transformers import SentenceTransformer, util
from PIL import Image
import glob
import torch
import pickle
import zipfile
from IPython.display import display
from IPython.display import Image as IPImage
import os
from tqdm.autonotebook import tqdm

# First, we load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Next, we get about 25k images from Unsplash 
img_folder = 'photos/'
if not os.path.exists(img_folder) or len(os.listdir(img_folder)) == 0:
    os.makedirs(img_folder, exist_ok=True)
    
    photo_filename = 'unsplash-25k-photos.zip'
    if not os.path.exists(photo_filename):   # Download the dataset if it does not exist
        util.http_get('http://sbert.net/datasets/'+photo_filename, photo_filename)

    # Extract all images
    with zipfile.ZipFile(photo_filename, 'r') as zf:
        for member in tqdm(zf.infolist(), desc='Extracting'):
            zf.extract(member, img_folder)

# Now, we need to compute the embeddings.
# To speed things up, we distribute pre-computed embeddings.
# Otherwise, you can also encode the images yourself.
# To encode an image, you can use the following code:
# from PIL import Image
# img_emb = model.encode(Image.open(filepath))

use_precomputed_embeddings = True

if use_precomputed_embeddings: 
    emb_filename = 'unsplash-25k-photos-embeddings.pkl'
    if not os.path.exists(emb_filename):   # Download the embeddings if they do not exist
        util.http_get('http://sbert.net/datasets/'+emb_filename, emb_filename)
        
    with open(emb_filename, 'rb') as fIn:
        img_names, img_emb = pickle.load(fIn)  
    print("Images:", len(img_names))
else:
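    # Store bare file names (as in the pre-computed file) so that the display code
    # below, which joins img_folder and the name, resolves to the correct path
    img_names = [os.path.basename(filepath) for filepath in glob.glob(os.path.join(img_folder, '*.jpg'))]
    print("Images:", len(img_names))
    img_emb = model.encode([Image.open(os.path.join(img_folder, name)) for name in img_names], batch_size=128, convert_to_tensor=True, show_progress_bar=True)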

Images: 24996
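
If you encode the images yourself, it may be worth caching the result so that the encoding step runs only once. The sketch below stores a `(names, embeddings)` tuple with `pickle`, mirroring the layout that the loading code above unpacks. The helper names `save_embeddings` and `load_embeddings`, the file name, and the toy data are assumptions for illustration.

```python
import os
import pickle
import tempfile

def save_embeddings(path, img_names, img_emb):
    """Store names and embeddings together as one (names, embeddings) tuple."""
    with open(path, 'wb') as f_out:
        pickle.dump((img_names, img_emb), f_out)

def load_embeddings(path):
    """Load a (names, embeddings) tuple written by save_embeddings."""
    with open(path, 'rb') as f_in:
        img_names, img_emb = pickle.load(f_in)
    return img_names, img_emb

# Toy stand-ins for the real file names and embedding tensor (hypothetical values)
toy_names = ['a.jpg', 'b.jpg']
toy_emb = [[0.1, 0.2], [0.3, 0.4]]

cache_path = os.path.join(tempfile.mkdtemp(), 'my-embeddings.pkl')
save_embeddings(cache_path, toy_names, toy_emb)

loaded_names, loaded_emb = load_embeddings(cache_path)
print("Images:", len(loaded_names))
```

Note that unpickling can execute arbitrary code, so only load pickle files from sources you trust.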

# Now we mine the most similar image pairs.
# util.paraphrase_mining_embeddings compares the image embeddings against each other
# with cosine similarity and returns the highest-scoring pairs.

duplicates = util.paraphrase_mining_embeddings(img_emb)

# duplicates is a list of triplets (score, image_id1, image_id2), sorted in decreasing order of score
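
To illustrate the shape of this result, the following is a brute-force sketch in plain Python that scores every pair of toy vectors and sorts the triplets by decreasing score. It is not the library's implementation: `mine_pairs_brute_force` is a hypothetical helper, and the library function processes the embeddings in chunks and limits how many pairs it keeps, which matters for 25k images.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mine_pairs_brute_force(embeddings):
    """Return [score, i, j] for every pair i < j, highest score first."""
    pairs = [
        [cosine_similarity(embeddings[i], embeddings[j]), i, j]
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    pairs.sort(key=lambda entry: entry[0], reverse=True)
    return pairs

# Hypothetical 2-dimensional embeddings: items 0 and 1 are exact duplicates
toy_embeddings = [[1.0, 0.0], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_pairs = mine_pairs_brute_force(toy_embeddings)

for score, i, j in toy_pairs[:3]:
    print("{:.3f} {} {}".format(score, i, j))
```

The exact duplicates (items 0 and 1) come first, followed by the pairs involving the slightly different item 2.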

Duplicates

In the next cell, we output the 10 most similar image pairs. These are identical images, that is, the same photo was uploaded to Unsplash twice.

for score, idx1, idx2 in duplicates[0:10]:
    print("\nScore: {:.3f}".format(score))
    print(img_names[idx1])
    display(IPImage(os.path.join(img_folder, img_names[idx1]), width=200))
    print(img_names[idx2])
    display(IPImage(os.path.join(img_folder, img_names[idx2]), width=200))

Score: 1.000
0UtMDLOk0Vg.jpg


10OY7Od4YeQ.jpg

Score: 1.000
4f4e3hRnwKs.jpg


f3hDGOHptrM.jpg

Score: 1.000
Aq3NQwdqOU8.jpg


JO_6maFFeoQ.jpg
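
The notebook stops at listing the pairs. As a minimal sketch of how such pairs could be turned into an actual removal list, the code below groups indices that are linked by a high-scoring pair (using a small union-find) and keeps only the first member of each group. The helper names, the toy triplets, and the 0.99 cut-off are assumptions for illustration, not part of the original notebook.

```python
def find_duplicate_groups(pairs, num_items, threshold):
    """Group item indices connected by pairs with score >= threshold.

    `pairs` must be sorted by decreasing score, like the mined triplets.
    """
    parent = list(range(num_items))

    def find(x):
        # Follow parent links to the group root, shortening the path on the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for score, i, j in pairs:
        if score < threshold:
            break  # all remaining pairs score lower
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Attach the larger root to the smaller one
            parent[max(root_i, root_j)] = min(root_i, root_j)

    groups = {}
    for idx in range(num_items):
        groups.setdefault(find(idx), []).append(idx)
    return [members for members in groups.values() if len(members) > 1]

def indices_to_drop(groups):
    """Keep the first member of each group and mark the rest for removal."""
    return sorted(idx for members in groups for idx in members[1:])

# Hypothetical triplets (score, idx1, idx2), sorted by decreasing score
toy_pairs = [[1.0, 0, 3], [0.999, 3, 4], [0.989, 1, 2], [0.5, 0, 1]]

groups = find_duplicate_groups(toy_pairs, num_items=5, threshold=0.99)
print("groups:", groups)
print("drop:  ", indices_to_drop(groups))
```

Grouping is transitive: if A matches B and B matches C, all three end up in one group even when A and C were never compared directly. With a strict threshold this is usually what you want for exact duplicates, but a looser threshold can chain together images that are only loosely related.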

Near Duplicates

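We can also skip the exact duplicates and look for near duplicates. To do so, we only consider image pairs whose cosine similarity is below a certain threshold. In this example, we look at pairs with a cosine similarity below 0.99.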

threshold = 0.99
near_duplicates = [entry for entry in duplicates if entry[0] < threshold]

for score, idx1, idx2 in near_duplicates[0:10]:
    print("\nScore: {:.3f}".format(score))
    print(img_names[idx1])
    display(IPImage(os.path.join(img_folder, img_names[idx1]), width=200))
    print(img_names[idx2])
    display(IPImage(os.path.join(img_folder, img_names[idx2]), width=200))

Score: 0.989
ht07CBODJVY.jpg


TVpMXc3Urzg.jpg

Score: 0.989
mOLet_-xn2M.jpg


-zEDq4sRxRE.jpg

Score: 0.988
IXO5G1jR4h4.jpg


EheKbIZ8oAw.jpg
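
Because the filter above only sets an upper bound, the near-duplicate list eventually runs down into pairs that are merely related. One possible refinement, sketched here under the assumption that a lower bound is also wanted, is to keep only the pairs inside a score band. The helper name `select_score_band`, the toy triplets, and the 0.95 lower bound are hypothetical.

```python
def select_score_band(pairs, lower, upper):
    """Keep triplets whose score satisfies lower <= score < upper."""
    return [entry for entry in pairs if lower <= entry[0] < upper]

# Hypothetical triplets (score, idx1, idx2), sorted by decreasing score
toy_pairs = [[1.0, 0, 3], [0.989, 1, 2], [0.97, 2, 4], [0.80, 0, 1]]

band = select_score_band(toy_pairs, lower=0.95, upper=0.99)
for score, idx1, idx2 in band:
    print("{:.3f} {} {}".format(score, idx1, idx2))
```

Suitable bounds depend on the model and the dataset, so it is worth inspecting a few pairs around each boundary before settling on values.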

End
