Flair 0.6 Tutorial 9: Training your own Flair Embeddings (translation/commentary)
Translation: ClassCat Co., Ltd. Sales Information
Created: 10/02/2020 (0.6.1)

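* This page is a translation of the following page of the Flair documentation, with supplementary explanations added where appropriate:

Tutorial 9: Training your own Flair Embeddings

* The sample code has been checked to run, but it has been extended or modified where necessary.
* Feel free to link to this page; we would appreciate a note to sales-info@classcat.com if you do.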

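Tutorial 9: Training your own Flair Embeddings

Flair Embeddings are the secret sauce in Flair, allowing it to reach state-of-the-art accuracy across a range of NLP tasks. This tutorial shows how to train your own Flair embeddings, which may come in handy if you want to apply Flair to new languages or domains.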

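Preparing a Text Corpus

Language models are trained on plain text. In the case of character LMs, we train them to predict the next character in a sequence of characters. To train your own model, you first need to identify a suitably large corpus. In our experiments, we used corpora of about 1 billion words.

You need to split your corpus into train, validation and test portions. Our trainer class assumes that there is a folder for the corpus containing a 'test.txt' and a 'valid.txt' with the test and validation data. Importantly, there is also a folder called 'train' that contains the training data in splits. For instance, the billion-word corpus is split into 100 parts. The splits are necessary if all the data does not fit into memory, in which case the trainer iterates through all splits in random order.

So, the folder structure must look like this: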

corpus/
corpus/train/
corpus/train/train_split_1
corpus/train/train_split_2
corpus/train/...
corpus/train/train_split_X
corpus/test.txt
corpus/valid.txt
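
A layout like the one above could be produced with a small script along the following lines. This is a minimal sketch, not part of Flair: the function name `split_corpus`, the number of splits, and the 10% hold-out fraction are all assumptions chosen for illustration. It reads every line into memory and does not shuffle, so it only suits a corpus that fits in RAM.

```python
import os
import tempfile


def split_corpus(lines, out_dir, n_train_splits=4, holdout_fraction=0.1):
    """Write lines to out_dir/valid.txt, out_dir/test.txt and out_dir/train/train_split_N.

    Hypothetical helper for illustration; each line is expected to end with a newline.
    """
    lines = list(lines)
    n_holdout = max(1, int(len(lines) * holdout_fraction))
    valid = lines[:n_holdout]
    test = lines[n_holdout:2 * n_holdout]
    train = lines[2 * n_holdout:]

    train_dir = os.path.join(out_dir, "train")
    os.makedirs(train_dir, exist_ok=True)

    with open(os.path.join(out_dir, "valid.txt"), "w", encoding="utf-8") as f:
        f.writelines(valid)
    with open(os.path.join(out_dir, "test.txt"), "w", encoding="utf-8") as f:
        f.writelines(test)

    # Distribute the training lines over the splits as contiguous chunks.
    chunk = -(-len(train) // n_train_splits)  # ceiling division
    for i in range(n_train_splits):
        part = train[i * chunk:(i + 1) * chunk]
        path = os.path.join(train_dir, f"train_split_{i + 1}")
        with open(path, "w", encoding="utf-8") as f:
            f.writelines(part)


# Demo on a tiny synthetic corpus written to a temporary folder.
demo_dir = tempfile.mkdtemp()
split_corpus([f"sentence number {i}\n" for i in range(100)], demo_dir)
print(sorted(os.listdir(demo_dir)))
print(sorted(os.listdir(os.path.join(demo_dir, "train"))))
```

For a corpus that does not fit in memory, the same idea would need to stream the input file and rotate output files instead of slicing a list.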

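In most cases, it is recommended to provide the corpus in an unstructured format, with no explicit separators for documents or sentences. If you want to make it easier for the LM to discern document boundaries, you can introduce a separator token such as "[SEP]".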
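
If you do decide to mark document boundaries, one simple way is to join the documents with the separator before writing them to the corpus files. The helper below is a hypothetical sketch (the name `join_documents` and the surrounding spaces are assumptions, not something Flair prescribes):

```python
def join_documents(documents, separator="[SEP]"):
    """Concatenate documents into one text stream with a separator token between them."""
    # Drop empty documents and trim surrounding whitespace first.
    cleaned = [doc.strip() for doc in documents if doc.strip()]
    return f" {separator} ".join(cleaned)


docs = ["First document.", "Second document.\n", ""]
print(join_documents(docs))
```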

Training the Language Model

Once you have this folder structure, simply point the LanguageModelTrainer class at it to start training a model.

from flair.data import Dictionary
from flair.models import LanguageModel
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus

# are you training a forward or backward LM?
is_forward_lm = True

# load the default character dictionary
dictionary: Dictionary = Dictionary.load('chars')

# get your corpus, process forward and at the character level
corpus = TextCorpus('/path/to/your/corpus',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# instantiate your language model, set hidden size and number of layers
language_model = LanguageModel(dictionary,
                               is_forward_lm,
                               hidden_size=128,
                               nlayers=1)

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=10,
              mini_batch_size=10,
              max_epochs=10)

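The parameters in this script are very small. We got good results with a hidden size of 1024 or 2048, a sequence length of 250, and a mini-batch size of 100. Depending on your resources, you can try training large models, but be aware that you need a very powerful GPU and a lot of time to train a model (we train for more than a week).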
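
As a rough illustration of why larger hidden sizes are costly, the sketch below counts the weights of a stacked LSTM using the parameter layout of PyTorch's nn.LSTM (four gates, two bias vectors per layer). The input embedding size of 100 is an assumption made for illustration, and the count ignores the embedding and output layers, so treat the numbers as an order-of-magnitude estimate rather than the exact size of a Flair model.

```python
def lstm_param_count(input_size, hidden_size, nlayers=1):
    """Number of trainable parameters in a unidirectional stacked LSTM."""
    total = 0
    layer_input = input_size
    for _ in range(nlayers):
        # Four gates, each with input-to-hidden and hidden-to-hidden weights.
        weights = 4 * hidden_size * (layer_input + hidden_size)
        # Two bias vectors (input and hidden) per gate.
        biases = 2 * 4 * hidden_size
        total += weights + biases
        # Layers after the first take the previous layer's hidden state as input.
        layer_input = hidden_size
    return total


EMBEDDING_SIZE = 100  # assumed for illustration

for hidden in (128, 1024, 2048):
    print(f"hidden_size={hidden}: {lstm_param_count(EMBEDDING_SIZE, hidden):,} LSTM parameters")
```

Going from a hidden size of 128 to 2048 multiplies the LSTM weight count by roughly 150 under these assumptions, which is consistent with the need for a strong GPU and long training times.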

Using the LM as Embeddings

Once you have trained the LM, using it as embeddings is easy. Just load the model into the FlairEmbeddings class and use it as you would any other embedding in Flair:

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings

sentence = Sentence('I love Berlin')

# init embeddings from your trained LM
char_lm_embeddings = FlairEmbeddings('resources/taggers/language_model/best-lm.pt')

# embed sentence
char_lm_embeddings.embed(sentence)

Done!

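Non-Latin Alphabets

If you train embeddings for a language that uses a non-Latin alphabet, such as Arabic or Japanese, you first need to create your own character dictionary. You can do this with the following code snippet: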

# make an empty character dictionary
from flair.data import Dictionary
char_dictionary: Dictionary = Dictionary()

# counter object
import collections
counter = collections.Counter()

processed = 0

import glob
files = glob.glob('/path/to/your/corpus/files/*.*')

print(files)
for file in files:
    print(file)

    with open(file, 'r', encoding='utf-8') as f:
        tokens = 0
        for line in f:

            processed += 1            
            chars = list(line)
            tokens += len(chars)

            # Add chars to the dictionary
            counter.update(chars)

            # comment this line in to speed things up (if the corpus is too large)
            # if tokens > 50000000: break

    # break

total_count = 0
for letter, count in counter.most_common():
    total_count += count

print(total_count)
print(processed)

# running total of character counts (named to avoid shadowing the built-in sum())
cumulative = 0
idx = 0
for letter, count in counter.most_common():
    cumulative += count
    percentile = (cumulative / total_count)

    # comment this line in to use only top X percentile of chars, otherwise filter later
    # if percentile < 0.00001: break

    char_dictionary.add_item(letter)
    idx += 1
    print('%d\t%s\t%7d\t%7d\t%f' % (idx, letter, count, cumulative, percentile))

print(char_dictionary.item2idx)

import pickle
with open('/path/to/your_char_mappings', 'wb') as f:
    mappings = {
        'idx2item': char_dictionary.idx2item,
        'item2idx': char_dictionary.item2idx
    }
    pickle.dump(mappings, f)

You can then use this dictionary instead of the default one in your code for training the language model:

from flair.data import Dictionary
dictionary = Dictionary.load_from_file('/path/to/your_char_mappings')
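
Before starting a long training run, it can be worth checking how much of your text a character mapping actually covers. The sketch below is a hypothetical helper (the name `dictionary_coverage` is not part of Flair) that works on a plain `item2idx` mapping, such as the one stored in the pickle file above. It assumes the keys may be either `str` or UTF-8 encoded `bytes`, since that can depend on how the dictionary was built.

```python
import collections


def dictionary_coverage(text, item2idx):
    """Return (fraction of characters in text found in item2idx, sorted missing characters)."""
    # Normalise keys to str; assumption: keys may be stored as UTF-8 bytes.
    known = {k.decode("utf-8") if isinstance(k, bytes) else k for k in item2idx}
    counts = collections.Counter(text)
    total = sum(counts.values())
    covered = sum(c for ch, c in counts.items() if ch in known)
    missing = sorted(ch for ch in counts if ch not in known)
    return (covered / total if total else 0.0), missing


# Toy mapping for demonstration; in practice it could be read from the pickle, e.g.
#   with open('/path/to/your_char_mappings', 'rb') as f:
#       item2idx = pickle.load(f)['item2idx']
toy_item2idx = {b"a": 0, b"b": 1, "c": 2}
coverage, missing = dictionary_coverage("abcd", toy_item2idx)
print(coverage, missing)
```

A low coverage value, or a long list of missing characters, suggests the dictionary was built from too small or too different a sample of the corpus.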

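Parameters

You might want to experiment with some of the learning parameters in the LanguageModelTrainer. For instance, we generally find that an initial learning rate of 20 and an annealing factor of 4 work well for most corpora. You might also want to modify the 'patience' value of the learning rate scheduler. We currently have it at 25, meaning that if the training loss does not improve for 25 splits, the learning rate is decreased.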
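
To make the interaction between learning rate, annealing factor and patience concrete, here is a simplified simulation in plain Python. It is a sketch of plateau-based annealing in general, not Flair's actual scheduler code: the function name is hypothetical, and it assumes the learning rate is divided by the annealing factor once the number of consecutive non-improving splits exceeds the patience.

```python
def simulate_annealing(losses, initial_lr=20.0, anneal_factor=4.0, patience=25):
    """Return the learning rate in effect after each observed loss value."""
    lr = initial_lr
    best = float("inf")
    bad_steps = 0
    history = []
    for loss in losses:
        if loss < best:
            # Improvement: remember the new best and reset the counter.
            best = loss
            bad_steps = 0
        else:
            bad_steps += 1
        if bad_steps > patience:
            # Plateau detected: anneal the learning rate and start counting again.
            lr /= anneal_factor
            bad_steps = 0
        history.append(lr)
    return history


# Small patience so the effect is visible on a short loss sequence.
demo_losses = [3.0, 2.0, 2.0, 2.0, 2.0, 1.5]
demo_history = simulate_annealing(demo_losses, patience=2)
print(demo_history)
```

With a larger patience the schedule reacts more slowly to plateaus; with a smaller one the learning rate can decay before the model has had a chance to recover from noisy losses.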

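Fine-Tuning an Existing LM

Sometimes it makes sense to fine-tune an existing language model instead of training from scratch. For instance, you may have a general LM for English and want to fine-tune it for a specific domain.

To fine-tune a LanguageModel, you only need to load an existing LanguageModel instead of instantiating a new one. The rest of the training code remains the same as above: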

from flair.data import Dictionary
from flair.embeddings import FlairEmbeddings
from flair.trainers.language_model_trainer import LanguageModelTrainer, TextCorpus


# instantiate an existing LM, such as one from the FlairEmbeddings
language_model = FlairEmbeddings('news-forward').lm

# are you fine-tuning a forward or backward LM?
is_forward_lm = language_model.is_forward_lm

# get the dictionary from the existing language model
dictionary: Dictionary = language_model.dictionary

# get your corpus, process forward and at the character level
corpus = TextCorpus('path/to/your/corpus',
                    dictionary,
                    is_forward_lm,
                    character_level=True)

# use the model trainer to fine-tune this model on your corpus
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=100,
              mini_batch_size=100,
              learning_rate=20,
              patience=10,
              checkpoint=True)

Note that when you fine-tune, you must use the same character dictionary as before and copy the direction (forward/backward).

Consider Contributing your LM

If you train a good LM for a language or domain we don't yet have in Flair, consider contacting us!
We would be happy to integrate more LMs into the library so that other people can use them!

That's all.
