HuggingFace Transformers 4.5 : Usage : Summary of the tasks (translation / commentary)
Translation : ClassCat Co., Ltd. Sales Information
Created : 05/06/2021 (4.5.1)

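* This page is a translation of the following HuggingFace Transformers document, with supplementary explanations added where appropriate:

Using Transformers : Summary of the tasks

* The sample code has been checked to run, and has been extended or modified where necessary.
* Feel free to link to this page; we would appreciate a short note at sales-info@classcat.com.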


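HuggingFace Transformers : Usage : Summary of the tasks

Translator's note : a TensorFlow version of this document is also available.

This page shows the most frequent use cases when using the library. The available models allow for many different configurations and great versatility in use cases. The simplest ones are presented here, showcasing usage for tasks such as question answering, sequence classification and named entity recognition.

These examples leverage auto-models, which are classes that instantiate a model according to a given checkpoint and automatically select the correct model architecture. Please check the AutoModel documentation for more information. Feel free to modify the code to be more specific and to adapt it to your own use case.

For a model to perform well on a task, it must be loaded from a checkpoint corresponding to that task. These checkpoints are usually pre-trained on a large corpus of data and then fine-tuned on a specific task. This means the following :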

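Not all models were fine-tuned on all tasks. If you want to fine-tune a model on a specific task, you can leverage one of the run_$TASK.py scripts in the examples directory.
Fine-tuned models were fine-tuned on a specific dataset. This dataset may or may not overlap with your use case and domain. As mentioned previously, you may leverage the examples scripts to fine-tune your model, or you may create your own training script.

To run inference on a task, the library provides several mechanisms :

Pipeline : very easy-to-use abstractions, which require as little as two lines of code.
Direct model use : fewer abstractions, but more flexibility and power through direct access to a tokenizer (PyTorch/TensorFlow) and full inference capacity.

Both approaches are showcased here.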

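Note: All tasks presented here leverage pre-trained checkpoints that were fine-tuned on specific tasks. Loading a checkpoint that was not fine-tuned on a specific task loads only the base transformer layers, not the additional head used for that task, and initializes the weights of that head randomly.

This produces random output.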

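Sequence Classification

Sequence classification is the task of classifying sequences according to a given number of classes. An example of sequence classification is the GLUE dataset, which is entirely based on that task. If you would like to fine-tune a model on a GLUE sequence classification task, you may leverage the run_glue.py, run_tf_glue.py, run_tf_text_classification.py or run_xnli.py scripts.

Here is an example of using pipelines to do sentiment analysis : identifying whether a sequence is positive or negative. It leverages a model fine-tuned on sst2, which is a GLUE task.

This returns a label ("POSITIVE" or "NEGATIVE") alongside a score, as follows :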

from transformers import pipeline
nlp = pipeline("sentiment-analysis")
result = nlp("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
result = nlp("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999

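Here is an example of doing sequence classification with a model that determines whether two sequences are paraphrases of each other. The process is the following :

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.
Build a sequence from the two sentences, with the correct model-specific separators, token type ids and attention masks (encode() and __call__() take care of this).
Pass this sequence through the model so that it is classified into one of the two available classes : 0 (not a paraphrase) and 1 (is a paraphrase).
Compute the softmax of the result to get probabilities over the classes.
Print the results.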

PyTorch

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased-finetuned-mrpc")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased-finetuned-mrpc")
classes = ["not paraphrase", "is paraphrase"]
sequence_0 = "The company HuggingFace is based in New York City"
sequence_1 = "Apples are especially bad for your health"
sequence_2 = "HuggingFace's headquarters are situated in Manhattan"
paraphrase = tokenizer(sequence_0, sequence_2, return_tensors="pt")
not_paraphrase = tokenizer(sequence_0, sequence_1, return_tensors="pt")
paraphrase_classification_logits = model(**paraphrase).logits
not_paraphrase_classification_logits = model(**not_paraphrase).logits
paraphrase_results = torch.softmax(paraphrase_classification_logits, dim=1).tolist()[0]
not_paraphrase_results = torch.softmax(not_paraphrase_classification_logits, dim=1).tolist()[0]
# Should be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(paraphrase_results[i] * 100))}%")
# Should not be paraphrase
for i in range(len(classes)):
    print(f"{classes[i]}: {int(round(not_paraphrase_results[i] * 100))}%")

not paraphrase: 10%
is paraphrase: 90%

not paraphrase: 94%
is paraphrase: 6%
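
The softmax step used in the example above can be sketched without any library. This is a minimal illustration, assuming two hypothetical logits chosen by hand (they are not real model output); the helper name softmax is ours and is not part of Transformers.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["not paraphrase", "is paraphrase"]
# Hypothetical logits for one sentence pair (illustration only).
example_logits = [-1.0, 1.2]
probs = softmax(example_logits)

for name, p in zip(classes, probs):
    print(f"{name}: {int(round(p * 100))}%")
```

Softmax preserves the ordering of the logits, so the class with the largest logit is also the class with the largest probability.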

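Extractive Question Answering

Extractive question answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the run_qa.py and run_tf_squad.py scripts.

Here is an example of using pipelines to do question answering : extracting an answer from a text given a question. It leverages a model fine-tuned on SQuAD.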

from transformers import pipeline
nlp = pipeline("question-answering")
context = r"""
Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a
question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune
a model on a SQuAD task, you may leverage the examples/question-answering/run_squad.py script.
"""

This returns the answer extracted from the text and a confidence score, alongside "start" and "end" values, which are the positions of the extracted answer in the text.

result = nlp(question="What is extractive question answering?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")
result = nlp(question="What is a good example of a question answering dataset?", context=context)
print(f"Answer: '{result['answer']}', score: {round(result['score'], 4)}, start: {result['start']}, end: {result['end']}")

Answer: 'the task of extracting an answer from a text given a question.', score: 0.6226, start: 34, end: 96
Answer: 'SQuAD dataset,', score: 0.5053, start: 147, end: 161
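
The "start" and "end" values are character offsets into the context string, so slicing the context with them recovers the answer. The sketch below uses only the first sentence of the context and copies the offsets from the first output line above; the result dictionary is written by hand here, not produced by the pipeline.

```python
# First sentence of the context; the leading newline comes from the
# triple-quoted string in the example and counts as character 0.
context = "\nExtractive Question Answering is the task of extracting an answer from a text given a question."

# Offsets copied from the first printed result above (hand-written here).
result = {"start": 34, "end": 96}

answer = context[result["start"]:result["end"]]
print(answer)
```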

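Here is an example of question answering using a model and a tokenizer. The process is the following :

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.
Define a text and a few questions.
Iterate over the questions and build a sequence from the text and the current question, with the correct model-specific separators, token type ids and attention masks.
Pass this sequence through the model. This outputs a range of scores across the entire sequence of tokens (question and text), for both the start and end positions.
Compute the softmax of the result to get probabilities over the tokens.
Fetch the tokens from the identified start and stop values, and convert those tokens to a string.
Print the results.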

PyTorch

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
    "How many pretrained models are available in 🤗 Transformers?",
    "What does 🤗 Transformers provide?",
    "🤗 Transformers provides interoperability between which frameworks?",
]
for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    outputs = model(**inputs)
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits
    answer_start = torch.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}")

Question: How many pretrained models are available in 🤗 Transformers?
Answer: over 32 +
Question: What does 🤗 Transformers provide?
Answer: general - purpose architectures
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: tensorflow 2 . 0 and pytorch
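
The span selection in the loop above can be sketched with plain Python lists. The tokens and scores below are hypothetical values written by hand, not real model output. Note that the example code takes the argmax of the raw scores directly; softmax does not change which position is largest, so the selected span is the same either way.

```python
# Hypothetical tokenized input: question, separator, then the text.
tokens = ["[CLS]", "where", "is", "it", "?", "[SEP]",
          "it", "is", "in", "new", "york", "[SEP]"]

# Hypothetical start and end scores, one per token (illustration only).
start_scores = [0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.3, 0.2, 1.0, 4.0, 0.5, 0.1]
end_scores = [0.1, 0.0, 0.1, 0.2, 0.0, 0.1, 0.2, 0.1, 0.4, 1.5, 3.8, 0.2]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

answer_start = argmax(start_scores)    # most likely first token of the answer
answer_end = argmax(end_scores) + 1    # +1 because slicing excludes the end index
answer = " ".join(tokens[answer_start:answer_end])
print(answer)
```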

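Language Modeling

Language modeling is the task of fitting a model to a corpus, which can be domain specific. All popular transformer-based models are trained using a variant of language modeling, e.g. BERT with masked language modeling and GPT-2 with causal language modeling.

Language modeling can also be useful outside of pre-training, for example to shift the model distribution towards a specific domain :
take a language model trained over a very large corpus and fine-tune it on a news dataset or on scientific papers, e.g. LysandreJik/arxiv-nlp.

Masked Language Modeling

Masked language modeling is the task of masking tokens in a sequence with a masking token and prompting the model to fill that mask with an appropriate token. This allows the model to attend to both the right context (tokens on the right of the mask) and the left context (tokens on the left of the mask). Such training creates a strong basis for downstream tasks that require bi-directional context, such as SQuAD (question answering, see Lewis, Liu, Goyal et al., part 4.2). If you would like to fine-tune a model on a masked language modeling task, you may leverage the run_mlm.py script.

Here is an example of using pipelines to replace a mask in a sequence :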

from transformers import pipeline
nlp = pipeline("fill-mask")

This outputs the sequences with the mask filled, the confidence score, and the token id in the tokenizer vocabulary :

from pprint import pprint
pprint(nlp(f"HuggingFace is creating a {nlp.tokenizer.mask_token} that the community uses to solve NLP tasks."))

[{'score': 0.1792745739221573,
  'sequence': '<s>HuggingFace is creating a tool that the community uses to '
              'solve NLP tasks.</s>',
  'token': 3944,
  'token_str': 'Ġtool'},
 {'score': 0.11349421739578247,
  'sequence': '<s>HuggingFace is creating a framework that the community uses '
              'to solve NLP tasks.</s>',
  'token': 7208,
  'token_str': 'Ġframework'},
 {'score': 0.05243554711341858,
  'sequence': '<s>HuggingFace is creating a library that the community uses to '
              'solve NLP tasks.</s>',
  'token': 5560,
  'token_str': 'Ġlibrary'},
 {'score': 0.03493533283472061,
  'sequence': '<s>HuggingFace is creating a database that the community uses '
              'to solve NLP tasks.</s>',
  'token': 8503,
  'token_str': 'Ġdatabase'},
 {'score': 0.02860250137746334,
  'sequence': '<s>HuggingFace is creating a prototype that the community uses '
              'to solve NLP tasks.</s>',
  'token': 17715,
  'token_str': 'Ġprototype'}]

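Here is an example of doing masked language modeling using a model and a tokenizer. The process is the following :

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a DistilBERT model and is loaded with the weights stored in the checkpoint.
Define a sequence with a masked token, placing tokenizer.mask_token instead of a word.
Encode that sequence into a list of IDs and find the position of the masked token in that list.
Retrieve the predictions at the index of the mask token : this tensor has the same size as the vocabulary, and the values are the scores attributed to each token. The model gives a higher score to tokens it deems probable in that context.
Retrieve the top 5 tokens using the PyTorch topk or TensorFlow top_k methods.
Replace the mask token with each of those tokens and print the results.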

PyTorch

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelWithLMHead.from_pretrained("distilbert-base-cased")
sequence = f"Distilled models are smaller than the models they mimic. Using them instead of the large versions would help {tokenizer.mask_token} our carbon footprint."
input = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input == tokenizer.mask_token_id)[1]
token_logits = model(input).logits
mask_token_logits = token_logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

This prints five sequences, each with one of the top 5 tokens predicted by the model :

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))

Distilled models are smaller than the models they mimic. Using them instead of the large versions would help reduce our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help increase our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help decrease our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help offset our carbon footprint.
Distilled models are smaller than the models they mimic. Using them instead of the large versions would help improve our carbon footprint.
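
The top-k selection and mask replacement can be sketched with a toy vocabulary. Everything below (the five-word vocabulary, the scores, the "[MASK]" string and the shortened sentence) is a hypothetical stand-in for the tokenizer and model output used above.

```python
import heapq

# Hypothetical five-word vocabulary and the scores a model might assign
# to each word at the masked position (illustration only).
vocab = ["reduce", "increase", "banana", "offset", "the"]
mask_scores = [5.1, 3.2, -2.0, 2.7, 0.3]

mask_token = "[MASK]"
sequence = f"Smaller models help {mask_token} our carbon footprint."

# Indices of the 3 highest-scoring vocabulary entries, best first.
top_ids = heapq.nlargest(3, range(len(vocab)), key=lambda i: mask_scores[i])

# Substitute each candidate word for the mask token.
filled = [sequence.replace(mask_token, vocab[i]) for i in top_ids]
for line in filled:
    print(line)
```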

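Causal Language Modeling

Causal language modeling is the task of predicting the token that follows a sequence of tokens. In this situation, the model attends only to the left context (tokens on the left of the mask). Such training is particularly interesting for generation tasks. If you would like to fine-tune a model on a causal language modeling task, you may leverage the run_clm.py script.

Usually, the next token is predicted by sampling from the logits of the last hidden state that the model produces from the input sequence.

Here is an example of using the tokenizer and model, leveraging the top_k_top_p_filtering() method to sample the next token following an input sequence of tokens.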

PyTorch

from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and"
input_ids = tokenizer.encode(sequence, return_tensors="pt")
# get logits of last hidden state
next_token_logits = model(input_ids).logits[:, -1, :]
# filter
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
# sample
probs = F.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated = torch.cat([input_ids, next_token], dim=-1)
resulting_string = tokenizer.decode(generated.tolist()[0])

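This outputs a (hopefully) coherent next token following the original sequence, which in our case is the word has. Because the token is sampled, the result can differ from run to run :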

print(resulting_string)

Hugging Face is based in DUMBO, New York City, and has
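
The filter-then-sample step can be sketched in plain Python. This is a simplified stand-in that implements only top-k filtering (the example above passes top_p=1.0, which leaves nucleus filtering inactive); the function names, vocabulary and logits are hypothetical.

```python
import math
import random

def top_k_filter(logits, k):
    """Keep the k largest logits and set the rest to -inf.
    Ties at the threshold are kept, so slightly more than k may survive."""
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def softmax(logits):
    """Probabilities from logits; exp(-inf) is 0.0, so filtered tokens get 0."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and next-token logits (illustration only).
vocab = [" has", " is", " was", " banana", " the"]
next_token_logits = [3.0, 2.5, 1.0, -4.0, 0.5]

filtered = top_k_filter(next_token_logits, k=2)
probs = softmax(filtered)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
next_token = rng.choices(vocab, weights=probs, k=1)[0]
print(next_token)
```

Only the two highest-scoring candidates can be drawn; every other token has probability zero after filtering.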

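The next section shows how this functionality is leveraged in generate() to generate multiple tokens up to a user-defined length.

Text Generation

In text generation (a.k.a. open-ended text generation) the goal is to create a coherent portion of text that continues a given context. The following example shows how GPT-2 can be used in pipelines to generate text. By default, all models apply Top-K sampling when used in pipelines, as configured in their respective configurations (see the gpt-2 config for example).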

from transformers import pipeline
text_generator = pipeline("text-generation")
print(text_generator("As far as I am concerned, I will", max_length=50, do_sample=False))

[{'generated_text': 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'}]

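Here, from the context "As far as I am concerned, I will", the model generates a continuation with a total maximum length of 50 tokens. Because do_sample=False is passed, decoding is deterministic rather than sampled. As shown above for the max_length argument, the default arguments of PreTrainedModel.generate() can be overridden directly in the pipeline.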
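
How max_length bounds the total length (prompt plus generated tokens) during deterministic decoding can be sketched with a toy model. The lookup table below is a hypothetical stand-in for a language model that always returns one fixed next token; greedy_generate is our own helper, not a Transformers function.

```python
# Hypothetical "model": maps the last token to its single most likely successor.
next_token_table = {
    "I": "will", "will": "be", "be": "the",
    "the": "first", "first": "to", "to": "be",
}

def greedy_generate(prompt_tokens, max_length):
    """Append the most likely next token until the total length reaches max_length."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        next_token = next_token_table.get(tokens[-1])
        if next_token is None:  # the toy model has no continuation
            break
        tokens.append(next_token)
    return tokens

generated = greedy_generate(["I", "will"], max_length=6)
print(" ".join(generated))
```

The two prompt tokens count towards the limit, so only four new tokens are produced here.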

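Here is an example of text generation using XLNet and its tokenizer.

PyTorch

from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("xlnet-base-cased")
tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
# Padding text helps XLNet with short prompts (proposed by Aman Rusia)
PADDING_TEXT = """In 1991, the remains of Russian Tsar Nicholas II and his family
(except for Alexei and Maria) are discovered.
The voice of Nicholas's young son, Tsarevich Alexei Nikolaevich, narrates the
remainder of the story. 1883 Western Siberia,
a young Grigori Rasputin is asked by his father and a group of men to perform magic.
Rasputin has a vision and denounces one of the men as a horse thief. Although his
father initially slaps him for making such an accusation, Rasputin watches as the
man is chased outside and beaten. Twenty years later, Rasputin sees a vision of
the Virgin Mary, prompting him to become a priest. Rasputin quickly becomes famous,
with people, even a bishop, begging for his blessing. <eod> </s> <eos>"""
prompt = "Today the weather is really nice and I am planning on "
inputs = tokenizer.encode(PADDING_TEXT + prompt, add_special_tokens=False, return_tensors="pt")
# Length of the decoded padding + prompt, used to cut the padding off the output
prompt_length = len(tokenizer.decode(inputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True))
outputs = model.generate(inputs, max_length=250, do_sample=True, top_p=0.95, top_k=60)
generated = prompt + tokenizer.decode(outputs[0])[prompt_length:]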

print(generated)

Today the weather is really nice and I am planning on anning on taking a nice...... of a great time!<eop>...............

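Text generation is currently possible with GPT-2, OpenAi-GPT, CTRL, XLNet, Transfo-XL and Reformer in PyTorch, and for most models in TensorFlow as well. As can be seen in the example above, XLNet and Transfo-XL often need to be padded to work well. GPT-2 is usually a good choice for open-ended text generation because it was trained on millions of web pages with a causal language modeling objective.

For more information on how to apply different decoding strategies for text generation, please also refer to the text generation blog post.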

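Named Entity Recognition

Named entity recognition (NER) is the task of classifying tokens according to a class, for example identifying a token as a person, an organisation or a location. An example of a named entity recognition dataset is the CoNLL-2003 dataset, which is entirely based on that task. If you would like to fine-tune a model on an NER task, you may leverage the run_ner.py script.

Here is an example of using pipelines to do named entity recognition, specifically trying to identify tokens as belonging to one of 9 classes :

O, Outside of a named entity
B-MISC, Beginning of a miscellaneous entity right after another miscellaneous entity
I-MISC, Miscellaneous entity
B-PER, Beginning of a person's name right after another person's name
I-PER, Person's name
B-ORG, Beginning of an organisation right after another organisation
I-ORG, Organisation
B-LOC, Beginning of a location right after another location
I-LOC, Location

It leverages a model fine-tuned on CoNLL-2003 by @stefan-it from dbmdz.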

from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge which is visible from the window."

This outputs a list of all words that have been identified as one of the entities from the 9 classes defined above. Here are the expected results :

print(nlp(sequence))

[
    {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
    {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
    {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
    {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
    {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
    {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
    {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
    {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
    {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
    {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
    {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
    {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]

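Note how the tokens of the sequence "Hugging Face" have been identified as an organisation, and "New York City", "DUMBO" and "Manhattan Bridge" have been identified as locations.

Here is an example of doing named entity recognition using a model and a tokenizer. The process is the following :

Instantiate a tokenizer and a model from the checkpoint name. The model is identified as a BERT model and is loaded with the weights stored in the checkpoint.
Define the label list with which the model was trained.
Define a sequence with known entities, such as "Hugging Face" as an organisation and "New York City" as a location.
Split words into tokens so that they can be mapped to predictions. We use a small hack by first completely encoding and decoding the sequence, so that we are left with a string that contains the special tokens.
Encode that sequence into IDs (special tokens are added automatically).
Retrieve the predictions by passing the input to the model and getting the first output. This results in a distribution over the 9 possible classes for each token. We take the argmax to retrieve the most likely class for each token.
Zip together each token with its prediction and print it.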

PyTorch

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs).logits
predictions = torch.argmax(outputs, dim=2)

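This outputs a list of each token mapped to its corresponding prediction. Unlike the pipeline, here every token has a prediction, because we do not remove the "0"th class, which means that no particular entity was found on that token. The following array should be the output :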

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy())])

[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'), ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'), ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'), ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'), ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'), ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]
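
The argmax-and-zip step, plus the filtering of the "O" class that the text attributes to the pipeline, can be sketched in plain Python. The short token list and the per-class scores are hypothetical and built by hand; toy_scores is our own helper, not part of Transformers.

```python
label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER",
              "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

# Hypothetical tokens for a short sentence (illustration only).
tokens = ["[CLS]", "Hu", "##gging", "Face", "is", "in", "New", "York", "[SEP]"]

def toy_scores(label):
    """Hand-made scores over the 9 classes that favour one label."""
    return [2.0 if name == label else 0.0 for name in label_list]

intended = ["O", "I-ORG", "I-ORG", "I-ORG", "O", "O", "I-LOC", "I-LOC", "O"]
logits = [toy_scores(label) for label in intended]

# argmax over the class dimension: one class index per token
predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]

# Pair every token with its predicted label, as in the example above.
tagged = [(token, label_list[p]) for token, p in zip(tokens, predictions)]

# Dropping the "O" class leaves only tokens that belong to an entity.
entities = [(token, label) for token, label in tagged if label != "O"]
print(entities)
```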

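Summarization

Summarization is the task of summarizing a document or an article into a shorter text. If you would like to fine-tune a model on a summarization task, you may leverage the run_summarization.py script.

An example of a summarization dataset is the CNN / Daily Mail dataset, which consists of long news articles and was created for the task of summarization. If you would like to fine-tune a model on a summarization task, various approaches are described in this document.

Here is an example of using pipelines to do summarization. It leverages a Bart model that was fine-tuned on the CNN / Daily Mail dataset.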

from transformers import pipeline
summarizer = pipeline("summarization")
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York.
A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  Her next court appearance is scheduled for May 18.
"""

Because the summarization pipeline depends on the PreTrainedModel.generate() method, the default arguments of PreTrainedModel.generate() can be overridden directly in the pipeline, as shown below for max_length and min_length. This outputs the following summary :

print(summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False))

[{'summary_text': 'Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002. She is believed to still be married to four men.'}]

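Here is an example of doing summarization using a model and a tokenizer. The process is the following :

Instantiate a tokenizer and a model from the checkpoint name. Summarization is usually done using an encoder-decoder model, such as Bart or T5.
Define the article that should be summarized.
Add the T5-specific prefix "summarize: ".
Use the PreTrainedModel.generate() method to generate the summary.

In this example we use Google's T5 model. Even though it was pre-trained only on a multi-task mixed dataset (including CNN / Daily Mail), it yields very good results.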

PyTorch

from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 uses a max_length of 512 so we cut the article to 512 tokens.
inputs = tokenizer.encode("summarize: " + ARTICLE, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(inputs, max_length=150, min_length=40, length_penalty=2.0, num_beams=4, early_stopping=True)

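Translation

Translation is the task of translating a text from one language to another. If you would like to fine-tune a model on a translation task, you may leverage the run_translation.py script.

An example of a translation dataset is the WMT English to German dataset, which has sentences in English as the input data and the corresponding sentences in German as the target data. If you would like to fine-tune a model on a translation task, various approaches are described in this document.

Here is an example of using pipelines to do translation. It leverages a T5 model that was pre-trained only on a multi-task mixed dataset (including WMT), yet yields impressive translation results.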

from transformers import pipeline
translator = pipeline("translation_en_to_de")
print(translator("Hugging Face is a technology company based in New York and Paris", max_length=40))

[{'translation_text': 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'}]

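Because the translation pipeline depends on the PreTrainedModel.generate() method, the default arguments of PreTrainedModel.generate() can be overridden directly in the pipeline, as shown above for max_length.

Here is an example of doing translation using a model and a tokenizer. The process is the following :

Instantiate a tokenizer and a model from the checkpoint name. Translation is usually done using an encoder-decoder model, such as Bart or T5.
Define the sentence that should be translated.
Add the T5-specific prefix "translate English to German: ".
Use the PreTrainedModel.generate() method to perform the translation.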

PyTorch

from transformers import AutoModelWithLMHead, AutoTokenizer
model = AutoModelWithLMHead.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")
inputs = tokenizer.encode("translate English to German: Hugging Face is a technology company based in New York and Paris", return_tensors="pt")
outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)

As with the pipeline example, we get the same translation :

print(tokenizer.decode(outputs[0]))

Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.

End
