[Solved] Gensim: TypeError: doc2bow expects an array of unicode tokens on input, not a single string

2022-02-09 12:35:17

Question

I'm getting started on a Python task and ran into a problem while using gensim. I'm trying to load files from disk and process them (split each file into words and lowercase them).

This is the code I have:

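import glob
import os

from gensim import corpora

# `path` is the directory that holds the .txt files.
dictionary_arr = []
for file_path in glob.glob(os.path.join(path, '*.txt')):
    with open(file_path, "r") as myfile:
        text = myfile.read()
        for words in text.lower().split():
            dictionary_arr.append(words)  # flat list of words across all files
dictionary = corpora.Dictionary(dictionary_arr)  # raises the TypeError below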

The list (dictionary_arr) contains every word from every file, and I pass that list to gensim's corpora.Dictionary. However, I get this error:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

I don't understand what the problem is, so some guidance would be appreciated.

Solution

In dictionary.py, the initializer is:

def __init__(self, documents=None):
    self.token2id = {} # token -> tokenId
    self.id2token = {} # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {} # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0 # number of documents processed
    self.num_pos = 0 # total number of corpus positions
    self.num_nnz = 0 # total number of non-zeroes in the BOW matrix

    if documents is not None:
        self.add_documents(documents)

The add_documents function builds the dictionary from a collection of documents, where each document is a list of tokens:

def add_documents(self, documents):

    for docno, document in enumerate(documents):
        if docno % 10000 == 0:
            logger.info("adding document #%i to %s" % (docno, self))
        _ = self.doc2bow(document, allow_update=True) # ignore the result, here we only care about updating token ids
    logger.info("built %s from %i documents (total %i corpus positions)" %
                 (self, self.num_docs, self.num_pos))
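
To make the failure concrete, here is a minimal sketch that does not need gensim. The names doc2bow_check and add_documents_sketch are hypothetical stand-ins, not gensim functions. They assume that doc2bow rejects a plain string with the error message quoted in the question, and that add_documents passes each element of its input to doc2bow, as the loop above shows.

```python
def doc2bow_check(document):
    # Hypothetical stand-in for the string check in gensim's doc2bow.
    # The real method goes on to count tokens; this one only validates.
    if isinstance(document, str):
        raise TypeError(
            "doc2bow expects an array of unicode tokens on input, "
            "not a single string"
        )
    return list(document)


def add_documents_sketch(documents):
    # Mirrors the loop in add_documents: every element is one document.
    return [doc2bow_check(document) for document in documents]


flat_words = "the quick brown fox".split()  # a single document's tokens

try:
    # Each element of the flat list is a str, so the check fails.
    add_documents_sketch(flat_words)
    flat_result = "ok"
except TypeError as exc:
    flat_result = str(exc)

# Wrapping the tokens in an outer list gives one document of four tokens.
nested_result = add_documents_sketch([flat_words])

print(flat_result)
print(nested_result)
```

With a flat list of words, each word is treated as a whole document, which is why the error complains about a single string.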

So when you initialize a Dictionary this way, you must pass a collection of documents, not a single document. For example, with a string a holding one document's text,

dic = corpora.Dictionary([a.split()])

is OK, because that document's tokens are wrapped in an outer list.
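
Applied to the code in the question, one possible fix is to keep one token list per file instead of a single flat list. The sketch below is an illustration, not the answerer's code: read_token_lists and build_dictionary are hypothetical helper names, and the UTF-8 encoding is an assumption about the input files.

```python
import glob
import os


def read_token_lists(path):
    """Return one list of lowercase tokens per .txt file under `path`."""
    documents = []
    # sorted() only makes the file order deterministic.
    for file_path in sorted(glob.glob(os.path.join(path, "*.txt"))):
        # encoding="utf-8" is an assumption; adjust it to match the files.
        with open(file_path, "r", encoding="utf-8") as myfile:
            documents.append(myfile.read().lower().split())
    return documents


def build_dictionary(path):
    # Imported here so read_token_lists can be used without gensim installed.
    from gensim import corpora

    # A list of token lists: the shape that add_documents iterates over.
    return corpora.Dictionary(read_token_lists(path))
```

If all files should instead count as one document, wrapping the original flat list also works: corpora.Dictionary([dictionary_arr]). The document frequencies in dfs will then reflect a single document rather than one per file.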
