Increasing n_jobs has no effect on GridSearchCV

Question:

I have set up a simple experiment to check how much a multi-core CPU matters when running sklearn GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder whether I have misunderstood the benefits of multiple cores or have not set things up correctly.

There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU Performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for the last cell. I was expecting it to finish faster. Maybe not linearly faster (i.e. 8 jobs being 2 times faster than 4 jobs), but at least a bit faster.

This is how I set it up:

I am using jupyter-notebook; "cell" refers to a jupyter-notebook cell.

I loaded MNIST and used a test size of 0.05 to get 3000 digits in X_play.

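# Note: fetch_mldata was removed in scikit-learn 0.22; newer versions provide fetch_openml instead
from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split

mnist = fetch_mldata('MNIST original')

X, y = mnist["data"], mnist['target']

# Standard MNIST split: first 60000 samples for training, the rest for testing
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
# Keep a stratified 5% of the training set (3000 digits) as the "play" subset
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)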

In the next cell I set up KNN and a GridSearchCV:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

Then I ran 8 cells, one for each n_jobs value from 1 to 8. My CPU is an i7-4770 with 4 cores and 8 threads.

grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)

Results

[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:  2.0min finished
[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=3)]: Done  18 out of  18 | elapsed:  1.3min finished
[Parallel(n_jobs=4)]: Done  18 out of  18 | elapsed:  1.3min finished
[Parallel(n_jobs=5)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=6)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=7)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=8)]: Done  18 out of  18 | elapsed:  1.4min finished
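
The same sweep could also be run in a single cell rather than eight. The following is only a sketch under stated assumptions: it uses scikit-learn's bundled load_digits data as a small stand-in for X_play and y_play, and time_grid_search is a hypothetical helper name, not part of scikit-learn. The measured time includes the final refit on the full data.

```python
import time

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def time_grid_search(X, y, n_jobs_values, cv=3):
    """Return {n_jobs: wall-clock seconds} for the same KNN grid search.

    Hypothetical helper for illustration only.
    """
    param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]
    timings = {}
    for n_jobs in n_jobs_values:
        grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=cv, n_jobs=n_jobs)
        start = time.perf_counter()
        grid_search.fit(X, y)  # includes the final refit of the best model
        timings[n_jobs] = time.perf_counter() - start
    return timings


# Small stand-in dataset (assumption); swap in X_play, y_play for the real experiment.
X_small, y_small = load_digits(return_X_y=True)
timings = time_grid_search(X_small, y_small, n_jobs_values=[1, 2])
for n_jobs, seconds in timings.items():
    print(f"n_jobs={n_jobs}: {seconds:.2f}s")
```

On a dataset this small the parallel runs may well be slower than the serial one, since worker startup dominates; pass range(1, 9) and the real data to reproduce the experiment above.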

Second test

With a Random Forest Classifier the scaling was much better. The test size was 0.5, i.e. 30000 images.

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished
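
To compare the two experiments on the same scale, the elapsed times can be converted into speedup (relative to n_jobs=1) and parallel efficiency (speedup divided by n_jobs). This is a small sketch using only the numbers reported above; speedup_table is a hypothetical helper name.

```python
# Elapsed minutes copied from the two result listings above, indexed by n_jobs.
knn_minutes = {1: 2.0, 2: 1.4, 3: 1.3, 4: 1.3, 5: 1.4, 6: 1.4, 7: 1.4, 8: 1.4}
rf_minutes = {1: 110.9, 2: 56.8, 3: 39.3, 4: 35.3, 5: 36.0, 6: 34.4, 7: 32.1, 8: 30.1}


def speedup_table(minutes):
    """Return {n_jobs: (speedup, efficiency)} relative to the n_jobs=1 run."""
    baseline = minutes[1]
    table = {}
    for n_jobs, elapsed in sorted(minutes.items()):
        speedup = baseline / elapsed
        table[n_jobs] = (speedup, speedup / n_jobs)
    return table


for name, minutes in [("KNN", knn_minutes), ("Random forest", rf_minutes)]:
    print(name)
    for n_jobs, (speedup, efficiency) in speedup_table(minutes).items():
        print(f"  n_jobs={n_jobs}: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

The elapsed values are rounded to 0.1 minutes in the log output, so the KNN figures in particular are coarse.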

Here are some reasons that might cause this behaviour:

  • With an increasing number of threads, there is an apparent overhead incurred for initializing and releasing each thread. I ran your code on my i7 7700HQ and saw the following behaviour as n_jobs increased:
    • with n_jobs=1 and n_jobs=2, the time per thread (the time GridSearchCV takes per model evaluation to fully train the model and test it) was 2.9s (overall time ~2 mins)
    • with n_jobs=3, the time was 3.4s (overall time 1.4 mins)
    • with n_jobs=4, the time was 3.8s (overall time 58 secs)
    • with n_jobs=5, the time was 4.2s (overall time 51 secs)
    • with n_jobs=6, the time was 4.2s (overall time ~49 secs)
    • with n_jobs=7, the time was 4.2s (overall time ~49 secs)
    • with n_jobs=8, the time was 4.2s (overall time ~49 secs)

      Now as you can see, the time per thread increased while the overall time decreased (although beyond n_jobs=4 the difference was not exactly linear) and remained constant for n_jobs>=6. This is because there is a cost incurred with initializing and releasing threads. See this github issue and this issue.
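
The fixed cost of setting up and tearing down workers can be illustrated with a sketch like the one below. It is an assumption-laden toy, not what GridSearchCV does internally: it uses a standard-library ThreadPoolExecutor and a trivial task, whereas GridSearchCV dispatches through joblib, which by default starts worker processes with a larger startup cost. All function names here are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def tiny_task(x):
    # Deliberately cheap, so dispatch overhead dominates the runtime.
    return x * x


def run_serial(items):
    return [tiny_task(x) for x in items]


def run_pooled(items, n_workers):
    # Creating the pool, dispatching each task, and shutting the pool down
    # all count toward the measured time.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(tiny_task, items))


def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


items = list(range(2000))
serial_result, serial_seconds = timed(run_serial, items)
print(f"serial: {serial_seconds * 1e3:.2f} ms")
for n_workers in (1, 2, 4, 8):
    pooled_result, pooled_seconds = timed(run_pooled, items, n_workers)
    assert pooled_result == serial_result  # same answers, different cost
    print(f"{n_workers} workers: {pooled_seconds * 1e3:.2f} ms")
```

When each unit of work is small relative to the dispatch cost, adding workers does not help; the larger each fit is (as in the random forest test), the less this overhead matters.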

      Also, there might be other bottlenecks, such as the data being too large to be broadcast to all threads at the same time, threads contending for RAM (or other resources), how the data is pushed into each thread, etc.

      I suggest reading about Amdahl's Law, which states that there is a theoretical bound on the speedup that can be achieved through parallelization. With p the fraction of the work that can be parallelized and n the number of workers, the speedup is S(n) = 1 / ((1 - p) + p / n) (source: Amdahl's Law, Wikipedia).
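
As a rough illustration, Amdahl's law can be applied to the timings in the question. This is a sketch under the assumption that the law's idealized model fits (it ignores hyper-threading, memory bandwidth, and scheduling effects); amdahl_speedup and parallel_fraction are hypothetical helper names.

```python
def amdahl_speedup(p, n):
    """Theoretical speedup with n workers when a fraction p of the work is parallelizable.

    Amdahl's law: S(n) = 1 / ((1 - p) + p / n).
    """
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(measured_speedup, n):
    """Invert Amdahl's law: the p that would yield measured_speedup with n workers."""
    return (1.0 - 1.0 / measured_speedup) / (1.0 - 1.0 / n)


# Elapsed minutes for n_jobs=1 and n_jobs=4, taken from the question's listings.
measurements = {"KNN": (2.0, 1.3), "Random forest": (110.9, 35.3)}
for name, (serial_minutes, four_job_minutes) in measurements.items():
    p = parallel_fraction(serial_minutes / four_job_minutes, 4)
    ceiling = 1.0 / (1.0 - p)  # limit of amdahl_speedup(p, n) as n grows
    print(f"{name}: p ~ {p:.2f}, "
          f"predicted speedup at 8 workers {amdahl_speedup(p, 8):.2f}x, "
          f"ceiling {ceiling:.2f}x")
```

The smaller the estimated parallel fraction, the earlier the curve flattens, which is one way to read the difference between the KNN and random forest runs.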

      Finally, it might also be due to the data size and the complexity of the model you use for training.

      Here is a blog post explaining the same issue regarding multithreading.