Increasing n_jobs has no effect on GridSearchCV

Question:

I set up a simple experiment to check how much a multi-core CPU matters when running sklearn GridSearchCV with KNeighborsClassifier. The results surprised me, and I wonder whether I have misunderstood the benefits of multiple cores or simply set things up incorrectly.

There is no difference in time to completion between 2 and 8 jobs. How come? I did notice a difference on the CPU Performance tab: while the first cell was running, CPU usage was ~13%, and it gradually increased to 100% for the last cell. I was expecting it to finish faster. Maybe not linearly faster (i.e. 8 jobs being 2 times faster than 4 jobs), but at least a bit faster.

Here is how I set it up:

I am using jupyter-notebook; "cell" refers to a jupyter-notebook cell.

I loaded MNIST and used a test size of 0.05 to get 3000 digits in X_play.

from sklearn.datasets import fetch_mldata
from sklearn.model_selection import train_test_split

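# Note: fetch_mldata was deprecated in scikit-learn 0.20 and removed in later releases;
# fetch_openml is its replacement there.
mnist = fetch_mldata('MNIST original')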

X, y = mnist["data"], mnist['target']

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
_, X_play, _, y_play = train_test_split(X_train, y_train, test_size=0.05, random_state=42, stratify=y_train, shuffle=True)

In the next cell I set up KNN and a GridSearchCV:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

knn_clf = KNeighborsClassifier()
param_grid = [{'weights': ["uniform", "distance"], 'n_neighbors': [3, 4, 5]}]

Then I ran 8 cells, one for each of 8 n_jobs values. My CPU is an i7-4770 with 4 cores and 8 threads.

grid_search = GridSearchCV(knn_clf, param_grid, cv=3, verbose=3, n_jobs=N_JOB_1_TO_8)
grid_search.fit(X_play, y_play)
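
The eight cells could also be collapsed into a single timing loop. The following is only a minimal sketch, not what was actually run: `time_runs` and `fit_grid` are hypothetical helper names, and the usage part assumes the `knn_clf`, `param_grid`, `X_play` and `y_play` objects defined above.

```python
import time


def time_runs(make_and_fit, n_jobs_values):
    """Call make_and_fit(n_jobs) once per value and return {n_jobs: seconds}."""
    timings = {}
    for n_jobs in n_jobs_values:
        start = time.perf_counter()
        make_and_fit(n_jobs)
        timings[n_jobs] = time.perf_counter() - start
    return timings


# Hypothetical usage with the objects from the cells above (not executed here):
#
# def fit_grid(n_jobs):
#     GridSearchCV(knn_clf, param_grid, cv=3, n_jobs=n_jobs).fit(X_play, y_play)
#
# for n_jobs, seconds in time_runs(fit_grid, range(1, 9)).items():
#     print(f"n_jobs={n_jobs}: {seconds / 60:.1f} min")
```

Measuring wall-clock time around the whole fit keeps the numbers comparable with the "elapsed" figures that the verbose output prints.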

Results

[Parallel(n_jobs=1)]: Done  18 out of  18 | elapsed:  2.0min finished
[Parallel(n_jobs=2)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=3)]: Done  18 out of  18 | elapsed:  1.3min finished
[Parallel(n_jobs=4)]: Done  18 out of  18 | elapsed:  1.3min finished
[Parallel(n_jobs=5)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=6)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=7)]: Done  18 out of  18 | elapsed:  1.4min finished
[Parallel(n_jobs=8)]: Done  18 out of  18 | elapsed:  1.4min finished

Second test

Using the Random Forest Classifier gave much better results. Test size was 0.5, i.e. 30000 images.

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier()
param_grid = [{'n_estimators': [20, 30, 40, 50, 60], 'max_features': [100, 200, 300, 400, 500], 'criterion': ['gini', 'entropy']}]

[Parallel(n_jobs=1)]: Done 150 out of 150 | elapsed: 110.9min finished
[Parallel(n_jobs=2)]: Done 150 out of 150 | elapsed: 56.8min finished
[Parallel(n_jobs=3)]: Done 150 out of 150 | elapsed: 39.3min finished
[Parallel(n_jobs=4)]: Done 150 out of 150 | elapsed: 35.3min finished
[Parallel(n_jobs=5)]: Done 150 out of 150 | elapsed: 36.0min finished
[Parallel(n_jobs=6)]: Done 150 out of 150 | elapsed: 34.4min finished
[Parallel(n_jobs=7)]: Done 150 out of 150 | elapsed: 32.1min finished
[Parallel(n_jobs=8)]: Done 150 out of 150 | elapsed: 30.1min finished

Answer

Here are some reasons that might cause this behaviour:

- As the number of threads increases, there is a noticeable overhead for initializing and releasing each thread. I ran your code on my i7 7700HQ and saw the following behaviour as n_job increased:
  - when n_job=1 and n_job=2, the time per thread (the time GridSearchCV takes per model evaluation to fully train the model and test it) was 2.9s (overall time ~2 mins)
  - when n_job=3, time was 3.4s (overall time 1.4 mins)
  - when n_job=4, time was 3.8s (overall time 58 secs)
  - when n_job=5, time was 4.2s (overall time 51 secs)
  - when n_job=6, time was 4.2s (overall time ~49 secs)
  - when n_job=7, time was 4.2s (overall time ~49 secs)
  - when n_job=8, time was 4.2s (overall time ~49 secs)
  As you can see, the time per thread increased while the overall time decreased (although beyond n_job=4 the difference was not exactly linear) and remained constant for n_jobs >= 6. This is because initializing and releasing threads has a cost. See this github issue and this issue.
  
  Also, there might be other bottlenecks, such as the data being too large to be broadcast to all threads at the same time, thread pre-emption over RAM (or other resources), how data is pushed into each thread, etc.
  
  I suggest reading about Amdahl's law, which states that there is a theoretical bound on the speedup that can be achieved through parallelization. It is given by the formula S(n) = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of parallel workers (formula source: Amdahl's law, Wikipedia).
  
  Finally, it might also be due to the data size and the complexity of the model you use for training.
  
  Here is a blog post explaining the same issue regarding multithreading.
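
To make the theoretical bound mentioned above concrete, here is a rough sketch that applies Amdahl's law to the elapsed times reported in the question for n_jobs=1 and n_jobs=4. The function names are illustrative only, and the "implied parallel fraction" is a back-of-the-envelope figure: it lumps every kind of overhead into the serial part and treats the 4 jobs as 4 ideal workers, which is an assumption rather than a measurement.

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup with n workers when a fraction p of the work is parallel."""
    return 1.0 / ((1.0 - p) + p / n)


def parallel_fraction(speedup, n):
    """Invert Amdahl's law: the fraction p implied by an observed speedup on n workers."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Elapsed minutes reported in the question for n_jobs=1 and n_jobs=4.
runs = {"KNN": (2.0, 1.3), "RandomForest": (110.9, 35.3)}

for name, (t1, t4) in runs.items():
    observed = t1 / t4
    p = parallel_fraction(observed, 4)
    print(
        f"{name}: speedup {observed:.2f}x on 4 jobs, "
        f"implied parallel fraction {p:.2f}, "
        f"bound on 8 workers {amdahl_speedup(p, 8):.2f}x"
    )
```

Under these assumptions the KNN search behaves as if a much smaller share of its work were parallel than the random forest search does, which would be consistent with the flat KNN timings beyond a few jobs. Note also that the i7-4770 has 4 physical cores, so 8 jobs do not correspond to 8 independent workers.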
