Filtering stop words from a million-character text in 100 milliseconds

2022-02-21 03:55:02

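About the author: 暁明, a Pandas data-processing expert, is dedicated to solving the data-processing problems that countless data practitioners face.

A member of the group previously shared a tip for filtering stop words with Pandas.

However, that is not the most efficient approach. Today I will introduce a more efficient way to filter stop words.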

Preparation before filtering stop words

This time I use a novel of over a million characters as the sample data.

Reading the data

First, read the text of the novel.

with open(r"D:\hdfs\novels\天龙八部.txt", encoding="gb18030") as f:
    text = f.read()
print(len(text))

Result

The output shows that the novel contains about 1,272,000 characters in total.

Next, we segment the text into words and load the stop words.

Registering character names as words in jieba

To keep jieba from splitting the main characters' names incorrectly, we need to add the novel's character names to jieba's dictionary.

First, load the character names for the novel 天龙八部.

with open('D:/hdfs/novels/names.txt', encoding="utf-8") as f:
    for line in f:
        if line.startswith("天龙八部"):
            names = next(f).split()
            break

print(names[:20])

The first 20 character names are:

['Blade White Phoenix', 'Ding Chunqiu', 'Lady Ma', 'Ma Wu De', 'Xiao Cui', 'Yu Guanghao', 'Ba Tian Shi', 'Daoist of Inequality', 'Deng Baichuan', 'Feng Bo Evil', 'Gan Bao Bao', 'Gong Ye Qian', 'Mu Wanqing', 'Bao Different', 'Tian Wolzi', 'Empress Dowager', 'Wang Yuyan', 'Boss Wu', 'Wu Ya Zi', 'Cloud Island Master']
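
The loop above appears to assume a particular layout for names.txt: a line that begins with the novel's title, followed by a line of whitespace-separated names. That layout is inferred from the code, not stated in the article. A minimal sketch of the same pattern on in-memory data, with made-up titles and names and a hypothetical helper `read_names`:

```python
import io

# Hypothetical stand-in for names.txt: a title line, then a line of names.
sample = io.StringIO(
    "Novel A\n"
    "Alice Bob\n"
    "Novel B\n"
    "Carol Dave Erin\n"
)

def read_names(f, title):
    """Return the names on the line after the first line starting with `title`."""
    for line in f:
        if line.startswith(title):
            # next(f, "") advances the same iterator; the default avoids
            # StopIteration when the title is the last line of the file.
            return next(f, "").split()
    return []  # title not found

names_b = read_names(sample, "Novel B")
print(names_b)
```

Returning an empty list when the title is missing avoids the undefined-variable error the original loop would hit in that case.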

Register each character name as a word:

import jieba
for word in names:
    jieba.add_word(word)

jieba prints a few log messages as it builds its dictionary:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\Think\AppData\Local\Temp\jieba.cache
Loading model cost 0.759 seconds.
Prefix dict has been built successfully.

Starting the segmentation

Now segment the full text into words:

%time cut_word = jieba.lcut(text)

Wall time: 6 s

Chinese word segmentation takes about 6 seconds.

Loading the stop words

Next, load the stop words:

# Load the stop words
with open("stoplist.txt", encoding="utf-8-sig") as f:
    stop_words = f.read().split()
stop_words.extend(['天龙八部', '\n', '\u3000', '目录', '一一声', '中', '只見'])
print(len(stop_words), stop_words[:10])

Result

5748 ['say', 'person', 'meta', 'hellip', '&', ',', '?', ',', '.', '"']

Performance comparison of several stop-word filtering methods

Direct filtering

%%time
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
print(len(all_words), all_words[:20])

Result

300656 ['release the name', 'green shirt', 'lei liu', 'dangerous peak', 'jade wall', 'moonflower', 'horse fast and fragrant', 'high man far away', 'slight step', 'stripe born', "who's family", 'son', "who's family", 'no regrets', 'much love', 'tiger whistling', 'dragon roar', 'change nest', 'luan and phoenix', 'sword qi']
Wall time: 26.2 s

Direct filtering against the stop-word list took 26.2 seconds.

Filtering stop words with Pandas

import pandas as pd

%%time
words = pd.Series(cut_word)
all_words = words[(words.str.len() > 1) & (~words.isin(stop_words))].tolist()
print(len(all_words), all_words[:20])

Result

300656 ['release the name', 'green shirt', 'lei liu', 'dangerous peak', 'jade wall', 'moonflower', 'horse fast and fragrant', 'high man far away', 'slight step', 'stripe born', "who's family", 'son', "who's family", 'no regrets', 'much love', 'tiger whistling', 'dragon roar', 'change nest', 'luan and phoenix', 'sword qi']
Wall time: 465 ms

Elapsed time: 0.46 seconds

Filtering with a set

%%time
stop_words = set(stop_words)
all_words = [word for word in cut_word if len(word) > 1 and word not in stop_words]
print(len(all_words), all_words[:20])

Result

300656 ['release the name', 'green shirt', 'lei liu', 'dangerous peak', 'jade wall', 'moonflower', 'horse fast and fragrant', 'high man far away', 'slight step', 'stripe born', "who's family", 'son', "who's family", 'no regrets', 'much love', 'tiger whistling', 'dragon roar', 'change nest', 'luan and phoenix', 'sword qi']
Wall time: 104 ms

Elapsed time: 0.1 seconds

The fastest way to filter

Filtering with a set is much faster, but that figure ignores the initial word segmentation, which takes up to 6 seconds. Is there a way to shorten that step?

Let's look at the total time for segmentation plus set-based filtering.

%%time

all_words = [word for word in jieba.cut(text) if len(word) > 1 and word not in stop_words]
print(len(all_words), all_words[:20])

Result

300656 ['release the name', 'green shirt', 'lei liu', 'dangerous peak', 'jade wall', 'moonflower', 'horse fast and fragrant', 'high man far away', 'slight step', 'stripe born', "who's family", 'son', "who's family", 'no regrets', 'much love', 'tiger whistling', 'dragon roar', 'change nest', 'luan and phoenix', 'sword qi']
Wall time: 5.91 s

The whole pipeline took about 5.9 seconds.

But would it be faster to strip the stop words from the raw text before segmenting?

%%time

text_sub = text
for stop_word in stop_words:
    text_sub = text_sub.replace(stop_word, " ")
all_words = [word for word in jieba.cut(text_sub) if len(word) > 1]
print(len(all_words), all_words[:20])

Result

174495 ['Heavenly Dragon', 'Release the Name', 'Green Shirt', 'Leiluo', 'Dangerous Peak', 'Jade Wall', 'Moonflower', 'Fragrant and Secluded Horse', 'Slight Steps', 'Stripe Born', 'Children', 'Family Courtyard', 'Regret', 'Tiger Whistling', 'Dragon Chanting', 'Change of Nest', 'Luan and Phoenix', 'Sword Qi', 'Blue Smoke', 'Water Pavilion']
Wall time: 5.76 s

Total time: 5.7 seconds

The speedup is marginal, and the output differs considerably (174,495 tokens instead of 300,656), so filtering with a set remains the better choice.
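
One likely reason the two approaches disagree is that str.replace works on raw substrings, while the set filter works on whole tokens. The toy example below uses whitespace splitting in place of jieba and invented English words, so it only illustrates the mechanism and does not reproduce the article's numbers:

```python
# Toy stand-ins: whitespace splitting replaces jieba, and the words are invented.
text = "the cat sat in the cathedral"
stop_words = {"cat", "in"}

# Approach 1: tokenize first, then drop tokens that are stop words.
tokens_filtered = [w for w in text.split() if w not in stop_words]

# Approach 2: blank out stop words in the raw text, then tokenize.
text_sub = text
for stop_word in sorted(stop_words):  # sorted only to make the run deterministic
    text_sub = text_sub.replace(stop_word, " ")
tokens_replaced = text_sub.split()

print(tokens_filtered)  # ['the', 'sat', 'the', 'cathedral']
print(tokens_replaced)  # ['the', 'sat', 'the', 'hedral']
```

In the second approach "cathedral" loses its "cat" prefix even though it is not a stop word. With a real segmenter, such damage would also shift the word boundaries around it.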

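Summary

In short, when filtering stop words after Chinese word segmentation, using a set gives the best performance.

Thanks for reading. The next article will cover three approaches to word-frequency statistics and how dictionaries and sets work.

See you next time.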
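
The gap between 26.2 s and 104 ms is consistent with how membership tests work in Python: `in` on a list scans the elements one by one, while `in` on a set is a hash lookup. A small sketch on synthetic data shows the same effect. The word lists and sizes here are made up, not the article's, and absolute timings will vary by machine:

```python
import timeit

# Synthetic data: a few thousand stop words, loosely mirroring the article's
# scale, but the words themselves are invented.
stop_list = [f"w{i}" for i in range(5000)]
stop_set = set(stop_list)
tokens = [f"w{i}" for i in range(0, 20000, 7)]  # mix of hits and misses

def filter_with(container):
    """Keep the tokens that are not in `container`."""
    return [t for t in tokens if t not in container]

# Both containers must give the same result; only the lookup cost differs.
assert filter_with(stop_list) == filter_with(stop_set)

list_time = timeit.timeit(lambda: filter_with(stop_list), number=3)
set_time = timeit.timeit(lambda: filter_with(stop_set), number=3)
print(f"list: {list_time:.4f} s, set: {set_time:.4f} s")
```

Converting the stop-word list to a set once, before the filtering loop, is the only change needed to get this benefit.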