1. ホーム
  2. パイソン

Pythonワードクラウドの実装

2022-02-26 04:04:43
<パス

Pythonのかなりクールな機能の1つは、ワードクラウドを簡単に実装することができることです。
このプロジェクトのオープンソースコードは、以下のgithubで公開されています。
https://github.com/amueller/word_cloud
ルーチンを実行する際には、wordcloudフォルダを削除する必要があることに注意してください。
ワードクラウドは、一部NLPベース、一部画像ベースになっています。
以下は、例として上記のgithub wordcloudのコードのスニペットです。


from os import path
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt

from wordcloud import WordCloud, STOPWORDS

d = path.dirname(__file__)

# Read the whole text.
text = open(path.join(d, 'alice.txt')).read()

# read the mask image
# taken from
# taken from http://www.stencilry.org/stencils/movies/alice%20in%20wonderland/255fk.jpg
alice_mask = np.array(Image.open(path.join(d, "alice_mask.png")))

stopwords = set(STOPWORDS)
stopwords.add("said")

wc = WordCloud(background_color="white", max_words=2000, mask=alice_mask,
               stopwords=stopwords)
# generate word cloud
wc.generate(text)

# store to file
wc.to_file(path.join(d, "alice.png"))

# show
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.figure()
plt.imshow(alice_mask, cmap=plt.cm.gray, interpolation='bilinear')
plt.axis("off")
plt.show()


オリジナル画像です。

結果

ウサギとアリスの写真
その中で
テキストでドキュメントを開く
alice_maskは描画を配列として読み込んでいます。
stopwords は、表示を停止する単語を設定します。
WordCloudは、ワードクラウドのプロパティを設定します
ワードクラウドを生成する
to_fileに画像を保存する
wordcloud.pyにアクセスすると、WordCloudクラスの関連するプロパティを見ることができます。

 """Word cloud object for generating and drawing.

    Parameters
    ----------
    font_path : string
        Font path to the font that will be used (OTF or TTF).
        Defaults to DroidSansMono path on a Linux machine.
        If you are on another OS or don't have this font, you need to adjust this path.

    width : int (default=400)
        Width of the canvas.

    height : int (default=200)
        Height : int (default=200) Height of the canvas.

    prefer_horizontal : float (default=0.90)
        The ratio of times to try horizontal fitting as opposed to vertical.
        If prefer_horizontal < 1, the algorithm will try rotating the word
        (There is currently no built-in way to get only vertical
        words.)

    mask : nd-array or None (default=None)
        If not None, gives a binary mask on where to draw words.
        None, width and height will be ignored and the shape of mask will be
        All white (#FF or #FFFFFF) entries will be considerd
        "masked out" while other entries will be free to draw on.
        changed in the most recent version!]

    scale : float (default=1)
        Scaling between computation and drawing,
        using scale instead of larger canvas size is significantly faster, but
        might lead to a coarser fit for the words.

    min_font_size : int (default=4)
        Will stop when there is no more room in this
        size.

    font_step : int (default=1)
        Step size for the font. font_step > 1 might speed up computation but
        give a worse fit.

    max_words : number (default=200)
        The maximum number of words.

    stopwords : set of strings or None
        If None, the build-in STOPWORDS
        list will be used.

    background_color : color value (default="black")
        Background color for the word cloud image.

    max_font_size : int or None (default=None)
        If None, height of the image is
        If None, height of the image is used.

    mode : string (default="RGB")
        Transparent background will be generated when mode is "RGBA" and
        background_color is None.

    relative_scaling : float (default=.5)
        Importance of relative word frequencies for font-size.
        relative_scaling=0, only word-ranks are considered.
        relative_scaling=1, a word that is twice as frequent will have twice
        If you want to consider the word frequencies and not only
        If you want to consider the word frequencies and not only their rank, relative_scaling around .5 often looks good.

        ... versionchanged: 2.0
            Default is now 0.5.

    color_func : callable, default=None
        Callable with parameters word, font_size, position, orientation,
        font_path, random_state that returns a PIL color for each word.
        Overwrites "colormap".
        See colormap for specifying a matplotlib colormap instead.

    regexp : string or None (optional)
        Regular expression to split the input text into tokens in process_text.
        If None is specified, ``r"\w[\w']+"`` is used.

    collocations : bool, default=True
        Whether to include collocations (bigrams) of two words.

        ... versionadded: 2.0

    colormap : string or matplotlib colormap, default="viridis"
        Matplotlib colormap to randomly draw colors from for each word.
        Ignored if "color_func" is specified.

        ... versionadded: 2.0

    normalize_plurals : bool, default=True
    

どこで
font_pathはフォントが使用されるパスを示す
width と height は、キャンバスの幅と高さを表します。
prefer_horizontal は、ワードクラウドのフォントの水平方向と垂直方向の量を調整します。
mask はワードクラウドの背景となる領域を作成するマスクです
scale:計算と描画の間の縮尺
min_font_size フォントサイズの最小値を指定します。
max_words はフォントの数を設定します。
stopwords は無効にする単語を設定します。
background_color は、ワードクラウドの背景色を設定します。
max_font_size はフォントの最大サイズを設定します。
modeはフォントの色を設定しますが、RGBAに設定すると背景が透明になります
relative_scaling は、フォントサイズに対する相対的な単語頻度の重要度を設定します。
regexpは正規表現を設定します。
collocations 2つの単語の共起語を含めるかどうか
generate関数でデバッグすると、関数を見ることができます。
words=process_text (text) は、テキストに含まれる単語の頻度を返すことができます。
generate_from_frequencies は、単語と単語の頻度に基づいてワードクラウドを作成します。
generate_from_frequencies関数の実装を順を追って説明すると、以下のようになります。

    def generate_from_frequencies(self, frequencies, max_font_size=None):
        """Create a word_cloud from words and frequencies.

        Parameters
        ----------
        frequencies : dict from string to float
            A contains words and associated frequency.

        max_font_size : int
            Use this font-size instead of self.max_font_size

        Returns
        -------
        self

        """
        # make sure frequencies are sorted and normalized
        frequencies = sorted(frequencies.items(), key=item1, reverse=True)
        if len(frequencies) <= 0:
            raise ValueError("We need at least 1 word to plot a word cloud, "
                             "got %d." % len(frequencies))
        frequencies = frequencies[:self.max_words]

        # largest entry will be 1
        max_frequency = float(frequencies[0][1])

        frequencies = [(word, freq / max_frequency)
                       for word, freq in frequencies]

        if self.random_state is not None:
            random_state = self.random_state
        else:
            random_state = Random()

        if self.mask is not None:
            mask = self.mask
            width = mask.shape[1]
            height = mask.shape[0]
            if mask.dtype.kind == 'f':
                warnings.warn("mask image should be unsigned byte between 0"
                              " and 255. Got a float array")
            if mask.ndim == 2:
                boolean_mask = mask == 255
            elif mask.ndim == 3:
                # if all channels are white, mask out
                boolean_mask = np.all(mask[:, :, :3] == 255, axis=-1)
            else:
                raise ValueError("Got mask of invalid shape: %s"
                                 % str(mask.shape))
        else:
            boolean_mask = None
            height, width = self.height, self.width
        occupancy = IntegralOccupancyMap(height, width, boolean_mask)

        # create image
        img_grey = Image.new("L", (width, height))
        draw = ImageDraw.Draw(img_grey)
        img_array = np.asarray(img_grey)
        font_sizes, positions, orientations, colors = [], [], [], []

        last_freq = 1.

        if max_font_size is None:
            # if not provided use default font_size
            max_font_size = self.max_font_size

        if max_font_size is None:
            # figure out a good font size by trying to draw with
            # just the first two words
            if len(frequencies) == 1:
                # we only have one word. we make it big!
                font_size = self.height
            else:
                self.generate_from_frequencies(dict(frequencies[:2]),
                                               max_font_size=self.height)
                # find font sizes
                sizes = [x[1] for x in self.layout_]
                font_size = int(2 * sizes[0] * sizes[1] / (sizes[0] + sizes[1]))
        else:
            font_size = max_font_size

        # we set self.words_ here because we called generate_from_frequencies
        # above... hurray for good design?
        self.words_ = dict(frequencies)

        # start drawing grey image
        for word, freq in frequencies:
            # select the font size
            rs = self.relative_scaling
            if rs ! = 0:
                font_size = int(round((rs * (freq / float(last_freq))
                                       + (1 - rs)) * font_size))
            if random_state.random() < self.prefer_horizontal:
                orientation = None
            else:
                orientation = Image.ROTATE_90
            tried_other_orientation = False
            while True:
                # try to find a position
                font = ImageFont.truetype(self.font_path, font_size)
                # transpose font optionally
                transposed_font = ImageFont.
                    font, orientation=orientation)
                # get size of resulting text
                box_size = draw.textsize(word, font=transposed_font)
                # find possible places using integral image:
                result = occupancy.sample_position(box_size[1] + self.margin,
                                                   box_size[0] + self.margin,
                                                   random_state)
                if result is not None or font_size < self.min_font_size:
                    # either we found a place or font-size went too small
                    break
                # if we didn't find a place, make font smaller
                # but first try to rotate!
                if not tried_other_orientation and self.prefer_horizontal < 1:
                    orientation = (Image.ROTATE_90 if orientation is None else
                                   Image.ROTATE_90)
                    tried_other_orientation = True
                else:
                    font_size -= self.font_step
                    orientation = None

            if font_size < self.min_font_size:
                # we were unable to draw any more
                break

            x

ワードクラウドは残念ながらopencvライブラリを使用していません。opencvライブラリを使用すれば、もっとクールなことができるはずです。