These are some notes on the behavior of sklearn's CountVectorizer.
The sklearn version used here is 0.22.2.post1.
CountVectorizer converts a collection of text documents into a matrix of word occurrence counts.
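As a rough illustration of what that conversion means, the counting can be sketched in plain Python. This is only a sketch and not sklearn's implementation: the names count_matrix and TOKEN_PATTERN are hypothetical, and it assumes the default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary).

```python
import re
from collections import Counter

# Assumption: mirrors CountVectorizer's default tokenization, i.e. lowercase
# the text and keep tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def count_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in docs[i]."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # Sort the vocabulary so that the column order is alphabetical.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
vocabulary, rows = count_matrix(corpus)
print(vocabulary)
for row in rows:
    print(row)
```

Each row corresponds to one document and each column to one vocabulary term, which is the same layout CountVectorizer returns (as a sparse matrix).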
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
Using this, list-valued data stored in a pandas column can be expanded into one-hot encoded data.
Preparing the data
>>> import pandas as pd
>>> df = pd.DataFrame([
... [1, ["test1", "test2"]],
... [2, ["test2", "test3"]],
... [3, ["test4"]]
... ], columns=['user_id', 'multi_select_values'])
>>> df['multi_select_values']
0    [test1, test2]
1    [test2, test3]
2           [test4]
Name: multi_select_values, dtype: object
Fitting