DataFrameで列ごとに最頻値を計算する

40万列100行で各列について最頻値を求めたい(3値の多数決をしたい)

この場合df.mode().iloc[0]よりも早く出来る方法ありますかね…
— とくぷ (@tokutokupooh) October 16, 2020

というわけで測ってみました。
計算環境は

  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz

で、Pythonとパッケージのバージョンは

python: 3.8.5
numpy: 1.19.2
pandas: 1.1.2
scipy: 1.5.2

です。

結論

scipy.stats.mode()が早そう。
(2020-11-19追記) @nkayさんから教えていただきました。NumPyのnp.bincount()を使ったやり方がとても速い！
df.mode()はとても遅いので注意。

準備

import numpy as np
import pandas as pd
import statistics
from collections import Counter
import scipy
import scipy.stats as stats

import sys
print(f"python: {sys.version}")
print(f"numpy: {np.__version__}")
print(f"pandas: {pd.__version__}")
print(f"scipy: {scipy.__version__}")

今回使うデータフレーム

nrow = 100
ncol = 400_000
np.random.seed(0)
df = pd.DataFrame(np.random.choice([-1,0,1],(nrow,ncol)))

df.mode()

2分20秒。ここからスタートです。

%time df.mode().iloc[0]

CPU times: user 2min 22s, sys: 1.13 s, total: 2min 23s
Wall time: 2min 24s

0         0.0
1        -1.0
2        -1.0
3         1.0
4        -1.0
         ... 
399995   -1.0
399996    0.0
399997    0.0
399998    0.0
399999    1.0
Name: 0, Length: 400000, dtype: float64

df.mode() (categoricalデータに変換)

Pandasのコードを読む限りcategoricalだと早くなるとのこと。ただし、残念ながら今回のケースでは対して変わりませんでした。

dfcat = df.astype("category")
%time dfcat.mode().iloc[0]

CPU times: user 2min 28s, sys: 1.27 s, total: 2min 29s
Wall time: 2min 30s

0         0
1        -1
2        -1
3         1
4        -1
         ..
399995   -1
399996    0
399997    0
399998    0
399999    1
Name: 0, Length: 400000, dtype: category
Categories (3, int64): [-1, 0, 1]

scipy.stats.mode()

科学技術計算といえばSciPyですが…。10秒。上に比べるとかなり速いです。

%time stats.mode(df.values, axis=0).mode

CPU times: user 10.2 s, sys: 63.4 ms, total: 10.3 s
Wall time: 10.4 s
array([[ 0, -1, -1, ...,  0,  0,  1]])

statistics.mode()

Pythonの標準ライブラリのstatisticsです。17秒。

%time df.apply(statistics.mode, axis=0)

CPU times: user 17.1 s, sys: 128 ms, total: 17.2 s
Wall time: 17.4 s

0         0
1        -1
2        -1
3         1
4        -1
         ..
399995    0
399996    0
399997    0
399998    0
399999    1
Length: 400000, dtype: int64

NumPy

numpyにはmodeがないのでnumpy.unique()とnumpy.argmax()を組み合わせて自分で作ります。
22秒。

def numpymode(arr):
    unq = np.unique(arr, return_counts=True)
    return unq[0][unq[1].argmax()]

%time df.apply(numpymode, axis=0)

CPU times: user 22.9 s, sys: 160 ms, total: 23.1 s
Wall time: 23.3 s

0         0
1        -1
2        -1
3         1
4        -1
         ..
399995   -1
399996    0
399997    0
399998    0
399999    1
Length: 400000, dtype: int64

NumPy (np.bincount()を用いた場合)

（2020-11-19追記）
@nkayさんに教えていただいた方法です。圧倒的に早いです。約0.3秒。
今回のタスクだと400,000列に対して1列ずつ最頻値を計算してしまうのがネックであることがよくわかります。
全体を1つの配列として bincount() で数えあげ、それを reshape() で列ごとの最頻値としてあげるのが肝ですね。

def np_mode(df):
    arr = df.to_numpy()
    max, min, ncol = arr.max(), arr.min(), arr.shape[1]
    offset = np.arange(ncol) * (max-min+1) - min
    return np.bincount((arr+offset).ravel()).reshape(ncol, -1).argmax(1) + min

%time np_mode(df)

CPU times: user 224 ms, sys: 58.7 ms, total: 283 ms
Wall time: 282 ms
array([ 0, -1, -1, ...,  0,  0,  1])

collection.Counter

標準ライブラリcollectionからCounterを使用。14秒。

%time df.apply(lambda sr: Counter(sr).most_common()[0][0], axis=0)

CPU times: user 13.8 s, sys: 82.9 ms, total: 13.9 s
Wall time: 14 s

0         0
1        -1
2        -1
3         1
4        -1
         ..
399995    0
399996    0
399997    0
399998    0
399999    1
Length: 400000, dtype: int64

おまけ

Juliaでもやってみましょう。

StatsBase.mode()を使う場合

mode関数はJuliaの標準ライブラリには無いのでStatsBase.jlを使います。ちなみに下記コード中のsample関数もStatsBase.jlのものです。

using BenchmarkTools
using DataFrames
using StatsBase
nrow = 100
ncol = 400_000
df = DataFrame(sample([-1,0,1], (nrow, ncol)))
first(df, 5)  # データ形式の確認

5×400000 DataFrame. Omitted printing of 399991 columns
│ Row │ x1    │ x2    │ x3    │ x4    │ x5    │ x6    │ x7    │ x8    │ x9    │
│     │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤
│ 1   │ -1    │ 1     │ 1     │ 0     │ 1     │ 0     │ 0     │ -1    │ -1    │
│ 2   │ 1     │ -1    │ 0     │ 0     │ -1    │ 0     │ 0     │ 0     │ -1    │
│ 3   │ -1    │ 0     │ 1     │ -1    │ 1     │ 1     │ 0     │ 0     │ 0     │
│ 4   │ -1    │ -1    │ 0     │ 1     │ -1    │ 0     │ 1     │ -1    │ 1     │
│ 5   │ -1    │ 1     │ -1    │ 0     │ 1     │ -1    │ 0     │ 0     │ 1     │

@btime mapcols(mode, df);

914.653 ms (2399604 allocations: 333.43 MiB)

約1秒。さすがに早いですね。

NumPyの速いやり方をJuliaに移植

上の@nkayさんに教えていただいた関数をJuliaでもやってみます。
Juliaだとインデックスが1から始まる点、また2次元Arrayを1次元に落とす際に列方向で展開される点がNumPyとは異なるので注意です。
NumPyよりわずかに遅い結果に。おそらく convert(Matrix,df) でコピーが発生しているせいですね。

function jl_mode(df)
    arr=convert(Matrix,df)
    maxvalue, minvalue, ncol = maximum(arr), minimum(arr), size(arr,2)
    offset = collect(1:ncol) .* (maxvalue-minvalue+1) .- minvalue
    return argmax.(eachcol(reshape(counts(arr .+ offset'), :, ncol))) .+ (minvalue - 1)
end

@btime jl_mode(df);

342.880 ms (400026 allocations: 653.08 MiB)

Juliaでのより速い計算方法

愚直に書くのが速いかも…
値の範囲によってパフォーマンス変わりそうですが。 pic.twitter.com/cf6KbZIeML
— (｢・ω・)｢ｶﾞｵｰ (@bicycle1885) November 19, 2020

@bicycle1885さんに教えていただきました。Juliaだとfor文で愚直に書いた方が速いんですね…。
計算にかかった時間は0.1秒未満です。恐ろしい速さ。

function simplemode(x)
    minval, maxval = extrema(x)
    counts = zeros(Int, maxval - minval + 1)
    for i in eachindex(x)
        @inbounds counts[x[i]-minval+1] += 1
    end
    return argmax(counts) + minval - 1
end

@btime simplemode.(eachcol(df));

74.020 ms (799498 allocations: 51.87 MiB)