


I set up a simple experiment to check how much a multi-core CPU matters when running sklearn GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder whether I have misunderstood the benefits of multiple cores or done something wrong.

There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU Performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for the last cell. I was expecting it to finish faster. Maybe not linearly faster (i.e. 8 jobs being 2 times faster than 4 jobs), but at least a bit faster.



I am using jupyter-notebook; "cell" below refers to a jupyter-notebook cell.


I loaded MNIST and used a test size of 0.05 to get 3000 digits in X_play.

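from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# fetch_mldata was removed in scikit-learn 0.22; fetch_openml is its replacement.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

X, y = mnist["data"], mnist['target']

# The first 60000 samples are the usual MNIST training split.
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# Keep a stratified 5% sample (3000 digits) to experiment on.
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)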


In the next cell I set up KNN and the parameter grid for GridSearchCV.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]


Then I made 8 cells, one for each n_jobs value from 1 to 8. My CPU is an i7-4770 with 4 cores and 8 threads.

grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)


[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:  2.0min finished
[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=3)]: Done  18 out of  18 | elapsed:  1.3min finished
[Parallel(n_jobs=4)]: Done  18 out of  18 | elapsed:  1.3min finished
[Parallel(n_jobs=5)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=6)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=7)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=8)]: Done  18 out of  18 | elapsed:  1.4min finished
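
The eight cells can also be collapsed into one timing loop. This is only a minimal sketch: time_fits and its make_search argument are hypothetical names introduced here, not part of sklearn, and the helper simply measures wall-clock time around fit() for whatever object the factory returns.

```python
import time


def time_fits(make_search, X, y, n_jobs_values):
    """Fit one freshly built search object per n_jobs value.

    make_search is a hypothetical factory: it takes n_jobs and returns any
    object with a fit(X, y) method (for example a GridSearchCV instance).
    Returns a dict mapping n_jobs to elapsed wall-clock seconds.
    """
    timings = {}
    for n_jobs in n_jobs_values:
        search = make_search(n_jobs)       # new object, so no state is reused
        start = time.perf_counter()
        search.fit(X, y)
        timings[n_jobs] = time.perf_counter() - start
    return timings
```

With the objects defined earlier, a call would look like time_fits(lambda n: GridSearchCV(knn_clf, param_grid, cv=3, n_jobs=n), X_play, y_play, range(1, 9)).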



With a Random Forest Classifier the scaling was much better. Test size was 0.5, i.e. 30000 images.

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished
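
To make the comparison concrete, the two logs can be turned into speedup and parallel efficiency figures. This is a small sketch: the timings are copied from the logs above, and speedup_table is a hypothetical helper name, with efficiency taken as speedup divided by n_jobs.

```python
# Wall-clock minutes copied from the two logs above, keyed by n_jobs.
knn_minutes = {1: 2.0, 2: 1.4, 3: 1.3, 4: 1.3, 5: 1.4, 6: 1.4, 7: 1.4, 8: 1.4}
rf_minutes = {1: 110.9, 2: 56.8, 3: 39.3, 4: 35.3,
              5: 36.0, 6: 34.4, 7: 32.1, 8: 30.1}


def speedup_table(minutes):
    """Return {n_jobs: (speedup, efficiency)} relative to the n_jobs=1 run."""
    base = minutes[1]
    return {n: (base / t, base / t / n) for n, t in minutes.items()}


for name, minutes in (("KNN", knn_minutes), ("RF", rf_minutes)):
    for n, (speedup, efficiency) in speedup_table(minutes).items():
        print(f"{name} n_jobs={n}: speedup {speedup:.2f}x, "
              f"efficiency {efficiency:.0%}")
```

On these numbers KNN tops out at roughly 1.5x, while the random forest reaches about 3.1x at 4 jobs and about 3.7x at 8.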


Here are some reasons that might cause this behaviour:

    • As the number of threads increases, there is an apparent overhead incurred for initializing and releasing each thread. I ran your code on my i7 7700HQ and saw the following behaviour with each increase in n_jobs:
      • with n_jobs=1 and n_jobs=2, the time per thread (the time GridSearchCV takes per model evaluation to fully train the model and test it) was 2.9s (overall time ~2 mins)
      • with n_jobs=3, the time was 3.4s (overall time 1.4 mins)
      • with n_jobs=4, the time was 3.8s (overall time 58 secs)
      • with n_jobs=5, the time was 4.2s (overall time 51 secs)
      • with n_jobs=6, the time was 4.2s (overall time ~49 secs)
      • with n_jobs=7, the time was 4.2s (overall time ~49 secs)
      • with n_jobs=8, the time was 4.2s (overall time ~49 secs)

      Now, as you can see, the time per thread increased but the overall time decreased (although beyond n_jobs=4 the difference was not exactly linear) and remained constant for n_jobs>=6. This is because initializing and releasing threads has a cost. See this github issue and this issue.


      Also, there might be other bottlenecks, such as the data being too large to be broadcast to all threads at the same time, thread pre-emption over RAM (or other resources), how the data is pushed into each thread, etc.

      I suggest you read about Amdahl's Law, which states that there is a theoretical bound on the speedup that can be achieved through parallelization. It is given by the formula S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of workers. Image Source: Amdahl's Law, Wikipedia
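
      As a rough illustration of Amdahl's Law, the bound S(n) = 1 / ((1 - p) + p / n) can be evaluated directly, where p is the parallelizable fraction of the work and n the number of workers. The sketch below is not a measurement: the function names are hypothetical, and the only inputs are the random forest timings quoted in the question (110.9 min at n_jobs=1, 35.3 min at n_jobs=4).

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup with n workers when a fraction p is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(speedup, n):
    """Invert Amdahl's law: the p that would yield `speedup` on n > 1 workers."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Random forest timings from the question: 110.9 min (1 job), 35.3 min (4 jobs).
rf_speedup_4 = 110.9 / 35.3
p = parallel_fraction(rf_speedup_4, 4)

print(f"implied parallel fraction: {p:.3f}")
print(f"Amdahl bound at 8 workers: {amdahl_speedup(p, 8):.2f}x")
print(f"limit with unlimited workers: {1.0 / (1.0 - p):.2f}x")
```

      Under this simple model the random forest run behaves as if about 0.91 of the work were parallel, so even an unlimited number of workers could not push the speedup much past 1 / (1 - p). The model ignores hyper-threading and memory effects, so treat it as a ballpark only.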


      Finally, it might be due to the data size and the complexity of the model you use for training as well.


      Here is a blog post explaining the same issue regarding multithreading.