Behavior of sklearn's CountVectorizer

These are my notes on how sklearn's CountVectorizer behaves.

The sklearn version used here is 0.22.2.post1.

sklearn.feature_extraction.text.CountVectorizer - scikit-learn 0.24.2 documentation

class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)

Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.


CountVectorizer converts a collection of text documents into a matrix of per-document word counts.

>>> from sklearn.feature_extraction.text import CountVectorizer

>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]

>>> vectorizer = CountVectorizer()

>>> X = vectorizer.fit_transform(corpus)

>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

(The code above is the official sample code.)
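One caveat if you run the sample on a newer scikit-learn: `get_feature_names()` was deprecated in 1.0 and removed in 1.2 in favor of `get_feature_names_out()`. The following is a minimal sketch of a version-tolerant wrapper. The helper name `feature_names` is my own and is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]


def feature_names(vectorizer):
    """Hypothetical helper: return the fitted terms in column order as a list."""
    # get_feature_names_out() is available from scikit-learn 1.0 onward.
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Older releases (such as 0.22.x) only provide get_feature_names().
    return vectorizer.get_feature_names()


vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
names = feature_names(vectorizer)
print(names)
```

The rest of this note keeps calling `get_feature_names()` directly, which matches version 0.22.2.post1.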

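This can also be used to expand list-valued data stored in a pandas column into one-hot encoded columns (strictly speaking they are counts, so a value repeated within one list is counted more than once unless binary=True is set).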

Preparing the data

>>> import pandas as pd

>>> df = pd.DataFrame([
...   [1, ["test1", "test2"]],
...   [2, ["test2", "test3"]],
...   [3, ["test4"]]
... ], columns=['user_id', 'multi_select_values'])

>>> df['multi_select_values']
0    [test1, test2]
1    [test2, test3]
2           [test4]
Name: multi_select_values, dtype: object

Fitting

# analyzer defaults to "word", which tries to call .lower() on each document, so list-typed data raises an error. To avoid that, analyzer is given a function that returns the list unchanged.
>>> vectorizer = CountVectorizer(analyzer=lambda x: x)
>>> X = vectorizer.fit_transform(df['multi_select_values'])
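To make the comment above concrete, here is a small sketch that first triggers the failure with the default settings and then applies the identity-analyzer workaround. It uses a plain list of lists instead of the DataFrame column so that it stands alone. The variable names are mine, and I have only reasoned about the exception type, not its exact message.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [["test1", "test2"], ["test2", "test3"], ["test4"]]

# With analyzer="word" and lowercase=True (the defaults), each document is
# lowercased via .lower(), which a list does not have.
try:
    CountVectorizer().fit_transform(docs)
    error = None
except AttributeError as exc:
    error = exc
print(type(error).__name__, error)

# A callable analyzer receives each raw document and must return its tokens.
# Returning the list unchanged treats every list element as one token and
# bypasses the built-in preprocessing and tokenizing steps.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Because the callable replaces the whole analysis pipeline, options such as lowercase, stop_words, and token_pattern have no effect on this vectorizer.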

# The feature names can be retrieved
>>> print(vectorizer.get_feature_names())
['test1', 'test2', 'test3', 'test4']

# toarray() returns the result as a numpy.ndarray
>>> print(X.toarray())
[[1 1 0 0]
 [0 1 1 0]
 [0 0 0 1]]

# To turn this into a DataFrame, something like the following works (note that if the choices are in Japanese, the column names will be in Japanese too)
>>> pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
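As a follow-up, the encoded columns usually need to be attached back to the original rows. Below is a sketch under a few assumptions of mine: `binary=True` is set so that a value repeated within one list still yields 1, the column order is derived from `vocabulary_` (which avoids depending on a particular feature-name method), and the sample data has a deliberately duplicated "test2".

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame(
    [
        [1, ["test1", "test2", "test2"]],  # "test2" repeated on purpose
        [2, ["test2", "test3"]],
        [3, ["test4"]],
    ],
    columns=["user_id", "multi_select_values"],
)

# binary=True sets every nonzero count to 1, giving 0/1 indicator columns.
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens, binary=True)
X = vectorizer.fit_transform(df["multi_select_values"])

# vocabulary_ maps each term to its column index; sorting the terms by that
# index reproduces the column order of X.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Reuse the original index so the rows line up when concatenating.
encoded = pd.DataFrame(X.toarray(), columns=columns, index=df.index)
result = pd.concat([df[["user_id"]], encoded], axis=1)
print(result)
```

Calling `X.toarray()` materializes a dense array, which is fine for a handful of choices but can get large when there are many distinct values.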