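PyTorch 2.0 Tutorials : Text : Text Classification with the TorchText Library (Translation/Commentary)

Translation : ClassCat Co., Ltd. Sales Information
Created : 05/14/2023 (2.0.0)

* This page is a translation of the following PyTorch 2.0 Tutorials page, with supplementary explanations added where appropriate: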

Text : Text Classification with TorchText Library

* The sample code has been verified to run, and has been extended or modified where necessary.
* You are welcome to link to this page; we would appreciate a note to sales-info@classcat.com if you do.

This tutorial shows how to use the torchtext library to build a dataset for text classification analysis. Users have the flexibility to:

Access the raw data as an iterator
Build a data processing pipeline that converts raw text strings into torch.Tensor objects that can be used to train the model
Shuffle and iterate over the data with torch.utils.data.DataLoader

Access the raw dataset iterators

The torchtext library provides several raw dataset iterators, which yield raw text strings. For example, the AG_NEWS dataset iterator yields the raw data as a tuple of label and text.

To access torchtext datasets, install torchdata by following the instructions at https://github.com/pytorch/data.

import torch
from torchtext.datasets import AG_NEWS
train_iter = iter(AG_NEWS(split='train'))

next(train_iter)
>>> (3, "Fears for T N pension after talks Unions representing workers at Turner
Newall say they are 'disappointed' after talks with stricken parent firm Federal
Mogul.")

next(train_iter)
>>> (4, "The Race is On: Second Private Team Sets Launch Date for Human
Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\\team of
rocketeers competing for the  #36;10 million Ansari X Prize, a contest
for\\privately funded suborbital space flight, has officially announced
the first\\launch date for its manned rocket.")

next(train_iter)
>>> (4, 'Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded
by a chemistry researcher at the University of Louisville won a grant to develop
a method of producing better peptides, which are short chains of amino acids, the
building blocks of proteins.')
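
The iterator contract can be tried out without downloading any data. The following is a minimal sketch, assuming a hypothetical in-memory list SAMPLES and a hypothetical generator toy_news_iter (neither is part of torchtext). It only illustrates that each item is a (label, text) tuple and that a plain Python iterator is consumed once, which is one reason to re-create an iterator before each new pass.

```python
# Hypothetical stand-in for a raw dataset iterator; not part of torchtext.
SAMPLES = [
    (3, "Fears for T N pension after talks"),
    (4, "The Race is On: Second Private Team Sets Launch Date"),
    (4, "Ky. Company Wins Grant to Study Peptides"),
]

def toy_news_iter(samples):
    """Yield (label, text) tuples one at a time, like a raw dataset iterator."""
    for label, text in samples:
        yield label, text

train_iter = iter(toy_news_iter(SAMPLES))
first_label, first_text = next(train_iter)  # consumes the first sample
remaining = list(train_iter)                # consumes the rest
exhausted = list(train_iter)                # empty: a plain iterator is single-pass
```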

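Prepare data processing pipelines

We have revisited the very basic components of the torchtext library, including the vocabulary, word vectors, and tokenizer. These are the basic data processing building blocks for raw text strings.

Here is an example of typical NLP data processing with a tokenizer and a vocabulary. The first step is to build a vocabulary from the raw training dataset. Here we use the built-in factory function build_vocab_from_iterator, which accepts an iterator that yields lists of tokens or iterators of tokens. Users can also pass any special symbols to be added to the vocabulary.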

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')
train_iter = AG_NEWS(split='train')

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)

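# "<unk>" is registered as a special token so it can serve as the default index for unknown tokens.
vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])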

The vocabulary block converts a list of tokens into integers.

vocab(['here', 'is', 'an', 'example'])
>>> [475, 21, 30, 5297]
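
To make the token-to-integer lookup concrete, here is a minimal pure-Python sketch of the idea. The helpers build_toy_vocab and lookup are hypothetical names introduced for illustration, and the ordering rule (special symbols first, then tokens by descending frequency, ties broken alphabetically) is an assumption of this sketch rather than a statement about torchtext internals.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to integer ids: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Sort by frequency (descending), then alphabetically for a deterministic order.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    itos = list(specials) + [t for t in ordered if t not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

def lookup(stoi, tokens, default="<unk>"):
    """Convert tokens to ids, falling back to the default token's id for unknown tokens."""
    return [stoi.get(tok, stoi[default]) for tok in tokens]

corpus = [["here", "is", "an", "example"], ["here", "is", "another"]]
stoi = build_toy_vocab(corpus)
ids = lookup(stoi, ["here", "is", "unseen"])  # "unseen" maps to the "<unk>" id
```

In this sketch the fallback in lookup plays the role that the default index plays for the vocabulary object: any token missing from the table resolves to the id of the unknown symbol.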

Prepare the text processing pipeline with the tokenizer and vocabulary. The text and label pipelines are used to process the raw data strings from the dataset iterators.

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: int(x) - 1

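The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. The label pipeline converts the label into an integer. For example,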

text_pipeline('here is the an example')
>>> [475, 21, 2, 30, 5297]
label_pipeline('10')
>>> 9
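
The label pipeline subtracts 1 because AG_NEWS labels run from 1 to 4, while a loss such as CrossEntropyLoss expects class indices in the range 0 to C-1. The sketch below composes both pipelines on toy data. The whitespace tokenizer toy_tokenizer and the table toy_stoi are simplified, hypothetical stand-ins and do not reproduce the basic_english tokenizer or the real vocabulary.

```python
def toy_tokenizer(text):
    # Simplified assumption: lowercase and split on whitespace only.
    return text.lower().split()

# Hypothetical lookup table; real ids come from the vocabulary built on the training set.
toy_stoi = {"<unk>": 0, "here": 1, "is": 2, "an": 3, "example": 4}

def toy_text_pipeline(text):
    """Text string -> list of integer ids (unknown tokens map to the '<unk>' id)."""
    return [toy_stoi.get(tok, toy_stoi["<unk>"]) for tok in toy_tokenizer(text)]

def toy_label_pipeline(label):
    """Shift 1-based labels to 0-based class indices."""
    return int(label) - 1

ids = toy_text_pipeline("Here is an unknown example")
class_indices = [toy_label_pipeline(label) for label in (1, 2, 3, 4)]
```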

Generate data batches and iterators

torch.utils.data.DataLoader is recommended for PyTorch users (a dedicated tutorial is available). It works with a map-style dataset, which implements the __getitem__() and __len__() protocols and represents a map from indices/keys to data samples. It also works with an iterable dataset when the shuffle argument is False.
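
As an illustration of the map-style protocol, the following sketch defines a tiny dataset with __getitem__() and __len__() and shuffles it through a DataLoader. ToyMapDataset, identity_collate, and the sample data are hypothetical names made up for this example; the collate function is deliberately a top-level def.

```python
from torch.utils.data import DataLoader, Dataset

class ToyMapDataset(Dataset):
    """Hypothetical map-style dataset: integer index -> (label, text) sample."""

    def __init__(self, samples):
        self.samples = list(samples)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

def identity_collate(batch):
    """Top-level collate function that returns the list of samples unchanged."""
    return batch

samples = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (1, "e")]
loader = DataLoader(ToyMapDataset(samples), batch_size=2, shuffle=True,
                    collate_fn=identity_collate)

batch_sizes, seen = [], []
for batch in loader:
    batch_sizes.append(len(batch))  # the last batch is smaller: 5 samples, batch size 2
    seen.extend(batch)
```

Because the dataset knows its length and supports random access, the loader can draw a random permutation of indices, which is what makes shuffle=True possible here.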

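Before the data is sent to the model, the collate_fn function works on a batch of samples generated by the DataLoader. The input to collate_fn is a batch of data with the batch size set in the DataLoader, and collate_fn processes it according to the data processing pipelines declared previously. Note that collate_fn must be declared as a top-level def. This ensures that the function is available in each worker.

In this example, the text entries in the original data batch input are packed into a list and concatenated as a single tensor for the input of nn.EmbeddingBag. The offsets tensor holds the delimiters, that is, the beginning index of each individual sequence in the text tensor. The label tensor stores the labels of the individual text entries.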

from torch.utils.data import DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        # Record the length of each sequence; lengths become start offsets below.
        offsets.append(processed_text.size(0))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    # Drop the last length and take the cumulative sum to get each sequence's start index.
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # Concatenate all token ids into a single 1-D tensor for nn.EmbeddingBag.
    text_list = torch.cat(text_list)
    return label_list.to(device), text_list.to(device), offsets.to(device)

train_iter = AG_NEWS(split='train')
dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
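
The offsets computation is easier to follow on a small worked example. The sketch below uses three hypothetical token-id sequences of lengths 3, 2 and 4, applies the same "lengths, drop the last, cumulative sum" recipe, and then checks that the flat tensor can be cut back into the original sequences.

```python
import torch

# Three hypothetical token-id sequences of different lengths (3, 2 and 4).
sequences = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]

lengths = [0] + [len(seq) for seq in sequences]        # [0, 3, 2, 4]
offsets = torch.tensor(lengths[:-1]).cumsum(dim=0)     # start index of each sequence
flat = torch.cat([torch.tensor(seq, dtype=torch.int64) for seq in sequences])

# Recover each sequence from the flat tensor using consecutive offsets.
bounds = offsets.tolist() + [flat.size(0)]
recovered = [flat[start:end].tolist() for start, end in zip(bounds[:-1], bounds[1:])]
```

Here offsets is tensor([0, 3, 5]): the first sequence starts at index 0, the second at 3, and the third at 5 in the nine-element flat tensor.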

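Define the model

The model is composed of an nn.EmbeddingBag layer plus a linear layer for classification. nn.EmbeddingBag with the default mode of "mean" computes the mean value of a "bag" of embeddings. Although the text entries here have different lengths, the nn.EmbeddingBag module requires no padding because the text lengths are saved in offsets.

Additionally, since nn.EmbeddingBag accumulates the average across the embeddings on the fly, it can improve the performance and memory efficiency of processing a sequence of tensors.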

from torch import nn

class TextClassificationModel(nn.Module):

    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
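
To see what the "mean" mode computes, the following sketch compares the output of nn.EmbeddingBag with a manual average over rows of the same weight matrix. The sizes (10 embeddings of dimension 4) and the two toy sequences are arbitrary choices for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=4, mode="mean")

# Two sequences, [1, 2, 3] and [4, 5], stored back to back in one flat tensor.
text = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int64)
offsets = torch.tensor([0, 3], dtype=torch.int64)

pooled = bag(text, offsets)  # one pooled vector per sequence, shape (2, 4)

# Manual reference: look up rows of the same weight matrix and average per bag.
manual = torch.stack([bag.weight[text[0:3]].mean(dim=0),
                      bag.weight[text[3:5]].mean(dim=0)])
```

The two results agree up to floating-point tolerance, and no padding was needed even though the sequences have different lengths.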

Initiate an instance

The AG_NEWS dataset has four labels, so the number of classes is four.

1 : World
2 : Sports
3 : Business
4 : Sci/Tec

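We build a model with an embedding dimension of 64. The vocabulary size is equal to the length of the vocabulary instance. The number of classes is equal to the number of labels.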

train_iter = AG_NEWS(split='train')
num_class = len(set([label for (label, text) in train_iter]))
vocab_size = len(vocab)
emsize = 64
model = TextClassificationModel(vocab_size, emsize, num_class).to(device)

Define functions to train the model and evaluate results

import time

def train(dataloader):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print('| epoch {:3d} | {:5d}/{:5d} batches '
                  '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
                                              total_acc/total_count))
            total_acc, total_count = 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for idx, (label, text, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc/total_count
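
Two steps in these functions can be checked in isolation: the accuracy bookkeeping based on argmax, and gradient clipping with clip_grad_norm_. The sketch below uses made-up logits, labels, and a single toy parameter; none of these values come from the model above.

```python
import torch

# Accuracy bookkeeping: argmax over the class dimension, compared with the labels.
logits = torch.tensor([[2.0, 0.1, 0.0, 0.0],
                       [0.0, 0.2, 3.0, 0.0],
                       [0.5, 0.4, 0.3, 0.2]])
labels = torch.tensor([0, 2, 3])
correct = (logits.argmax(1) == labels).sum().item()  # rows 0 and 1 match, row 2 does not
accuracy = correct / labels.size(0)

# Gradient clipping: rescale gradients so that their total L2 norm is at most max_norm.
w = torch.nn.Parameter(torch.tensor([3.0, 4.0]))
(w * w).sum().backward()                             # gradient is 2 * w = [6, 8], norm 10
norm_before = torch.nn.utils.clip_grad_norm_([w], max_norm=0.1)  # returns the pre-clip norm
norm_after = w.grad.norm()
```

The direction of the gradient is preserved by clipping; only its length is scaled down to roughly max_norm.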

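Split the dataset and run the model

Since the original AG_NEWS has no validation dataset, we split the training dataset into train/valid sets with a split ratio of 0.95 (train) and 0.05 (valid). Here we use the torch.utils.data.dataset.random_split function from the PyTorch core library.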
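
A small sketch of the 0.95/0.05 split, assuming a plain list of 100 integers as a hypothetical stand-in for a map-style dataset. The seeded generator is an optional addition in this sketch to make the split reproducible.

```python
import torch
from torch.utils.data import random_split

data = list(range(100))  # hypothetical stand-in for a map-style dataset
num_train = int(len(data) * 0.95)

generator = torch.Generator().manual_seed(0)  # optional: reproducible split
train_part, valid_part = random_split(
    data, [num_train, len(data) - num_train], generator=generator)

# Every sample lands in exactly one of the two subsets.
train_items = [train_part[i] for i in range(len(train_part))]
valid_items = [valid_part[i] for i in range(len(valid_part))]
```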

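The CrossEntropyLoss criterion combines nn.LogSoftmax() and nn.NLLLoss() in a single class. It is useful when training a classification problem with C classes. SGD implements the stochastic gradient descent method as the optimizer. The initial learning rate is set to 5.0. StepLR is used here to adjust the learning rate through the epochs.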
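
The relationship between CrossEntropyLoss, LogSoftmax and NLLLoss can be verified numerically. The sketch below uses random logits for 5 hypothetical samples and C = 4 classes.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(5, 4)               # 5 samples, C = 4 classes
targets = torch.tensor([0, 1, 2, 3, 1])  # class indices in the range 0 to C-1

# One-step loss on raw logits.
ce = nn.CrossEntropyLoss()(logits, targets)

# Two-step equivalent: log-softmax over the class dimension, then negative log-likelihood.
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
```

Both values agree up to floating-point tolerance, which is why the model's forward pass can return raw scores without applying a softmax.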

from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset
# Hyperparameters
EPOCHS = 10 # epoch
LR = 5  # learning rate
BATCH_SIZE = 64 # batch size for training

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
total_accu = None
train_iter, test_iter = AG_NEWS()
train_dataset = to_map_style_dataset(train_iter)
test_dataset = to_map_style_dataset(test_iter)
num_train = int(len(train_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(train_dataset, [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE,
                             shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    accu_val = evaluate(valid_dataloader)
    if total_accu is not None and total_accu > accu_val:
        scheduler.step()
    else:
        total_accu = accu_val
    print('-' * 59)
    print('| end of epoch {:3d} | time: {:5.2f}s | '
          'valid accuracy {:8.3f} '.format(epoch,
                                           time.time() - epoch_start_time,
                                           accu_val))
    print('-' * 59)

| epoch   1 |   500/ 1782 batches | accuracy    0.686
| epoch   1 |  1000/ 1782 batches | accuracy    0.855
| epoch   1 |  1500/ 1782 batches | accuracy    0.877
-----------------------------------------------------------
| end of epoch   1 | time: 12.04s | valid accuracy    0.879
-----------------------------------------------------------
| epoch   2 |   500/ 1782 batches | accuracy    0.897
| epoch   2 |  1000/ 1782 batches | accuracy    0.899
| epoch   2 |  1500/ 1782 batches | accuracy    0.901
-----------------------------------------------------------
| end of epoch   2 | time:  9.90s | valid accuracy    0.887
-----------------------------------------------------------
| epoch   3 |   500/ 1782 batches | accuracy    0.916
| epoch   3 |  1000/ 1782 batches | accuracy    0.913
| epoch   3 |  1500/ 1782 batches | accuracy    0.913
-----------------------------------------------------------
| end of epoch   3 | time: 10.02s | valid accuracy    0.904
-----------------------------------------------------------
| epoch   4 |   500/ 1782 batches | accuracy    0.925
| epoch   4 |  1000/ 1782 batches | accuracy    0.925
| epoch   4 |  1500/ 1782 batches | accuracy    0.921
-----------------------------------------------------------
| end of epoch   4 | time:  9.98s | valid accuracy    0.912
-----------------------------------------------------------
| epoch   5 |   500/ 1782 batches | accuracy    0.930
| epoch   5 |  1000/ 1782 batches | accuracy    0.933
| epoch   5 |  1500/ 1782 batches | accuracy    0.928
-----------------------------------------------------------
| end of epoch   5 | time:  9.94s | valid accuracy    0.905
-----------------------------------------------------------
| epoch   6 |   500/ 1782 batches | accuracy    0.942
| epoch   6 |  1000/ 1782 batches | accuracy    0.943
| epoch   6 |  1500/ 1782 batches | accuracy    0.943
-----------------------------------------------------------
| end of epoch   6 | time:  9.94s | valid accuracy    0.917
-----------------------------------------------------------
| epoch   7 |   500/ 1782 batches | accuracy    0.944
| epoch   7 |  1000/ 1782 batches | accuracy    0.945
| epoch   7 |  1500/ 1782 batches | accuracy    0.944
-----------------------------------------------------------
| end of epoch   7 | time:  9.99s | valid accuracy    0.915
-----------------------------------------------------------
| epoch   8 |   500/ 1782 batches | accuracy    0.944
| epoch   8 |  1000/ 1782 batches | accuracy    0.947
| epoch   8 |  1500/ 1782 batches | accuracy    0.946
-----------------------------------------------------------
| end of epoch   8 | time: 10.21s | valid accuracy    0.915
-----------------------------------------------------------
| epoch   9 |   500/ 1782 batches | accuracy    0.946
| epoch   9 |  1000/ 1782 batches | accuracy    0.947
| epoch   9 |  1500/ 1782 batches | accuracy    0.945
-----------------------------------------------------------
| end of epoch   9 | time: 10.22s | valid accuracy    0.916
-----------------------------------------------------------
| epoch  10 |   500/ 1782 batches | accuracy    0.946
| epoch  10 |  1000/ 1782 batches | accuracy    0.946
| epoch  10 |  1500/ 1782 batches | accuracy    0.948
-----------------------------------------------------------
| end of epoch  10 | time: 10.26s | valid accuracy    0.916
-----------------------------------------------------------
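
The training loop only calls scheduler.step() when the validation accuracy falls below the best value seen so far, so the learning rate decays by a factor of gamma on those epochs only. The sketch below replays that rule on a hypothetical list of validation accuracies (these numbers are made up and are not taken from the log above), using a single dummy parameter in place of the model.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter standing in for the model
optimizer = torch.optim.SGD([param], lr=5.0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

# Hypothetical validation accuracies, one per epoch.
valid_accuracies = [0.88, 0.89, 0.87, 0.90, 0.89]

best, lrs = None, []
for acc in valid_accuracies:
    if best is not None and best > acc:
        optimizer.step()   # placeholder so the scheduler is stepped after the optimizer
        scheduler.step()   # decay: lr <- lr * gamma
    else:
        best = acc
    lrs.append(optimizer.param_groups[0]["lr"])
```

With these inputs the learning rate stays at 5.0 for the first two epochs, drops to 0.5 at the third, and drops to 0.05 at the fifth.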

Evaluate the model with the test dataset

Check the results on the test dataset:

print('Checking the results of test dataset.')
accu_test = evaluate(test_dataloader)
print('test accuracy {:8.3f}'.format(accu_test))

Checking the results of test dataset.
test accuracy    0.909

Test on a random news article

Use the best model so far to test a golf news article.

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tec"}

def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. – Four days ago, Jon Rahm was \
    enduring the season’s worst weather conditions on Sunday at The \
    Open on his way to a closing 75 at Royal Portrush, which \
    considering the wind and the rain was a respectable showing. \
    Thursday’s first round at the WGC-FedEx St. Jude Invitational \
    was another story. With temperatures in the mid-80s and hardly any \
    wind, the Spaniard was 13 strokes better in a flawless round. \
    Thanks to his best putting performance on the PGA Tour, Rahm \
    finished with an 8-under 62 for a three-stroke lead, which \
    was even more impressive considering he’d never played the \
    front nine at TPC Southwind."

model = model.to("cpu")

print("This is a %s news" %ag_news_label[predict(ex_text_str, text_pipeline)])

This is a Sports news

End