Machine Learning : Pre-processing features

From: http://analyticsbot.ml/2016/10/machine-learning-pre-processing-features/

I am participating in this Kaggle competition, the Allstate Claims Severity challenge. It is a prediction (regression) problem. The problem statement is:

How severe is an insurance claim?

When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect.

Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.

You can take a look at the data here. You can easily open the dataset in Excel and look at the variables/features. There are 116 categorical variables and 14 continuous variables in the dataset. Let's start the analysis.

Import all necessary modules.

 
# import required libraries
# pandas for reading data and manipulation
# scikit-learn for one hot encoding and label encoding
# seaborn and matplotlib to visualize
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
import operator

All these modules should be installed on your machine. I am using Python 2.7.11. If you need to install any of these modules, you can simply do:

 
pip install <module name>

# for example:
pip install pandas

Let’s read the datasets using pandas

 
# read data from csv files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Let’s take a look at the data

 
# let's take a look at the train and test data
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)

# the above code won't print all columns.
# to print all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

# let's take a look at the train and test data again
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)
 
[Output omitted: the first print shows only a truncated view of the first five rows of the train and test data; after setting the display options, all columns (id, cat1-cat116, cont1-cont14, and loss for the train set) are printed.]

You might have noticed that we printed the same thing twice. The first time pandas shows only a small number of columns for the first five observations, but the second time it prints all columns for those five observations. This is because of the following options:

 
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

Make sure you keep the 5 in head(), else it will print everything on screen, which will not be pretty. Take a look at the columns present in the train and test sets.

 
print train.columns
print test.columns

There is an ID column in both data sets which we don't need for any analysis. We will also keep a copy of the loss column from the training data set in a separate variable.

 
# remove the ID column. No use.
train = train.drop('id', axis=1)
test = test.drop('id', axis=1)
# keep the target in a separate variable as well
labels = train['loss'].copy()

Let's take a look at the continuous variables and their basic statistics.

 
# high level statistics: mean, median, count and quartiles
# note - this will work only for the continuous variables,
# not for the categorical variables
print train.describe()
print test.describe()
 
[Output omitted: describe() prints count, mean, std, min, quartiles and max for each continuous column, cont1 through cont14, plus loss in the train set.]

In many competitions, you’ll find there are some features that might be present in the training set but not in the test set and vice-versa.

 
# at this point, it is wise to check whether there are any features that
# are there in one of the datasets but not in the other
found = False
missing_in_test = [col for col in train.columns
                   if col != 'loss' and col not in test.columns]
if missing_in_test:
    found = True
    print len(missing_in_test), ' features are present in training set but not in test set'
    print missing_in_test

missing_in_train = [col for col in test.columns
                    if col not in train.columns]
if missing_in_train:
    found = True
    print len(missing_in_train), ' features are present in test set but not in training set'
    print missing_in_train

if not found:
    print 'train and test have the same feature columns'

In this case, we see that there are no columns that differ between the train and test sets (apart from loss, which is only in train).

Let's identify the categorical and continuous variables. For this data set, there are two ways to find them:

  • the variable names contain 'cat' or 'cont', which tells us which group they belong to
  • pandas stores the categorical columns with data type object
 
# find categorical variables
# in this problem, categorical variable names start with 'cat', which is easy
# to identify
# in other problems it might not be like that
# we will see two ways to identify them in this problem
# we will also find the continuous or numerical variables
## 1. by name
cat_features_train = [col for col in train.columns if 'cat' in col]
cat_features_test = [col for col in test.columns if 'cat' in col]

cont_features_train = [col for col in train.columns if 'cont' in col]
cont_features_test = [col for col in test.columns if 'cont' in col]

## 2. by type = object
cat_features_train = train.dtypes[train.dtypes == 'object'].index
cat_features_test = test.dtypes[test.dtypes == 'object'].index

cont_features_train = train.dtypes[train.dtypes != 'object'].index
cont_features_test = test.dtypes[test.dtypes != 'object'].index
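One caveat worth noting with the dtype-based approach (my observation, not something the original post checks): selecting the non-object columns of the train set also picks up the loss target, so it has to be excluded explicitly. A minimal sketch, reusing the train data frame loaded above:

# name-based selection: exactly the cat*/cont* columns
cat_by_name = [col for col in train.columns if 'cat' in col]
cont_by_name = [col for col in train.columns if 'cont' in col]

# dtype-based selection: object columns are categorical, everything else is numeric
cat_by_type = train.dtypes[train.dtypes == 'object'].index.tolist()
cont_by_type = train.dtypes[train.dtypes != 'object'].index.tolist()

print(set(cat_by_name) == set(cat_by_type))   # True: both give the 116 cat columns
print(set(cont_by_type) - set(cont_by_name))  # the target 'loss' sneaks in by dtype
cont_by_type = [col for col in cont_by_type if col != 'loss']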

Correlation between continuous variables

Let's take a look at the correlation between the continuous variables. The idea is that if two variables are highly correlated, we can drop one of them.

 
# let's check for correlation between continuous data
# correlation between numerical variables means something like this:
# if we increase one variable, there is a significant, almost proportional increase/decrease
# in the other variable. it varies from -1 to 1

cont_columns = [col for col in train.columns if 'cont' in col]
correlation_train = train[cont_columns].corr()
correlation_test = test[cont_columns].corr()

# for the purpose of this analysis, we will consider two variables to be
# highly correlated if the correlation is more than 0.6
threshold = 0.6
for i in range(len(correlation_train.columns)):
    for j in range(i):
        if abs(correlation_train.iloc[i, j]) > threshold:
            print correlation_train.columns[i], 'and', correlation_train.columns[j], '=', round(correlation_train.iloc[i, j], 2)

for i in range(len(correlation_test.columns)):
    for j in range(i):
        if abs(correlation_test.iloc[i, j]) > threshold:
            print correlation_test.columns[i], 'and', correlation_test.columns[j], '=', round(correlation_test.iloc[i, j], 2)

# we can remove one of the two highly correlated variables to improve performance
 
The pairs flagged in the train set (the test set gives the same pairs and values):

cont6 and cont1 = 0.76
cont7 and cont6 = 0.66
cont9 and cont1 = 0.93
cont9 and cont6 = 0.80
cont10 and cont1 = 0.81
cont10 and cont6 = 0.88
cont10 and cont9 = 0.79
cont11 and cont6 = 0.77
cont11 and cont7 = 0.75
cont11 and cont9 = 0.61
cont11 and cont10 = 0.70
cont12 and cont1 = 0.61
cont12 and cont6 = 0.79
cont12 and cont7 = 0.74
cont12 and cont9 = 0.63
cont12 and cont10 = 0.71
cont12 and cont11 = 0.99
cont13 and cont6 = 0.82
cont13 and cont9 = 0.64
cont13 and cont10 = 0.71
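The same information is easier to scan visually. Here is a small optional sketch of my own (not in the original post) that renders the train correlation matrix computed above as a heatmap, using the seaborn and matplotlib imports from the top of the script:

# visualize the correlation matrix of the continuous variables
plt.figure(figsize=(12, 9))
sns.heatmap(correlation_train, vmin=-1, vmax=1, cmap='coolwarm', annot=False)
plt.title('Correlation between continuous variables (train)')
plt.show()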

Let's take a look at the labels present in the categorical variables. Although the two data sets have the same columns, it may happen that some labels are present in one data set but not in the other.

 
# let's check for factors in the categorical variables
for col in cat_features_train:
    print col, train[col].unique()

for col in cat_features_test:
    print col, test[col].unique()

# let's take a look at whether the unique values/factors are present in each of the datasets
# for example, cat1 in both the datasets has values only A & B. Sometimes
# it may happen that some new value is present in the test set, which may ruin your model
for col in cat_features_train:
    not_in_test = [val for val in train[col].unique()
                   if val not in test[col].unique()]
    if not_in_test:
        print col, ' ', not_in_test, ' only in the train set'

for col in cat_features_test:
    not_in_train = [val for val in test[col].unique()
                    if val not in train[col].unique()]
    if not_in_train:
        print col, ' ', not_in_train, ' only in the test set'
 
[Output omitted: the unique labels of each categorical variable in train and test, followed by, for each variable, any labels that are present in one data set but missing from the other (for example 'HQ', 'KW', 'BA' among the many-level variables).]

Let's plot the categorical variables to see how their labels are distributed.

 
# let's visualize the values in each of the features
# keep in mind you'll be seeing a lot of plots now
# it is better to use an ipython/jupyter notebook to get inline plots
for col in cat_features_train:
    sns.countplot(x=col, data=train)
    plt.show()

for col in cat_features_test:
    sns.countplot(x=col, data=test)
    plt.show()

[Plots omitted: one count plot per categorical variable, showing the frequency of each label in the train and test sets.]

One Hot Encoding of categorical variables

Encode categorical integer features using a one-hot (also known as one-of-K) scheme. The input to this transformer should be a matrix of integers denoting the values taken on by the categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature.
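To make the "one column per possible value" idea concrete, here is a tiny toy example of my own (not from the original post) using DictVectorizer on a three-row frame:

from sklearn.feature_extraction import DictVectorizer
import pandas as pd

toy = pd.DataFrame({'cat1': ['A', 'B', 'A'], 'cat2': ['C', 'C', 'D']})
vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(toy.to_dict(orient='records'))

print(vec.get_feature_names())  # ['cat1=A', 'cat1=B', 'cat2=C', 'cat2=D']
print(encoded)                  # rows become [1 0 1 0], [0 1 1 0], [1 0 0 1]
# note: newer scikit-learn versions use get_feature_names_out() instead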

  1. The first way is to use DictVectorizer to encode the labels in the features.
 
# cat1 to cat72 have only two labels A and B
# cat73 to cat108 have more than two labels
# cat109 to cat116 have many labels
# moreover, you must have noticed that some labels are missing in some features of the train/test dataset
# this might become a problem when working with multiple datasets
# to avoid this, we will merge the data before doing one hot encoding
train_test = pd.concat((train, test), axis=0, ignore_index=True)
cat_features = train_test.dtypes[train_test.dtypes == 'object'].index
# let's check for factors in the categorical variables
for col in cat_features:
    print col, train_test[col].unique()

# 1. one hot encoding all categorical variables
vectorizer = DictVectorizer(sparse=False)
cat_dicts = train_test[cat_features].to_dict(orient='records')
encoded = vectorizer.fit_transform(cat_dicts)
print vectorizer.vocabulary_
encoded = pd.DataFrame(encoded, columns=vectorizer.get_feature_names())
print encoded.shape

# it can be seen that we are adding too many new variables. This encoding is important
# since machine learning algorithms don't understand strings, and converting string factors
# to numeric factors increases our dimensionality
train_test_onehot = pd.concat([train_test, encoded], axis=1)

# remove the initial categorical variables
train_test_onehot = train_test_onehot.drop(cat_features, axis=1)

# take back the train and test set from the above data
train_df = train_test_onehot.iloc[:train.shape[0]]
test_df = train_test_onehot.iloc[train.shape[0]:]
test_df = test_df.drop('loss', axis=1)
train_labels = train_df.loss

2. The second method is to use pandas get_dummies to create dummy variables.

 
# 2. using get_dummies from pandas
train_test_dummies = train_test
dummies = pd.get_dummies(train_test_dummies[cat_features])

train_test_dummies = pd.concat([train_test_dummies.drop(cat_features, axis=1),
                                dummies], axis=1)

# take back the train and test set from the above data
train_df = train_test_dummies.iloc[:train.shape[0]]
test_df = train_test_dummies.iloc[train.shape[0]:]
test_df = test_df.drop('loss', axis=1)
train_labels = train_df.loss

3. Some of these variables have only two labels and some have more than two. Another way is to use pd.factorize to convert the labels to numeric codes.

 
# 3. pd.factorize
train_test_factorized = train_test.copy()
for col in cat_features:
    train_test_factorized[col] = pd.factorize(train_test_factorized[col], sort=True)[0]

# take back the train and test set from the above data
train_df = train_test_factorized.iloc[:train.shape[0]]
test_df = train_test_factorized.iloc[train.shape[0]:]
test_df = test_df.drop('loss', axis=1)
train_labels = train_df.loss
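For clarity, pd.factorize returns a pair of (integer codes, unique labels); a tiny illustration of my own, not from the original post:

import pandas as pd

codes, uniques = pd.factorize(['B', 'A', 'B', 'D', 'A'], sort=True)
print(codes)    # [1 0 1 2 0]
print(uniques)  # ['A' 'B' 'D']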

4. Another way is to mix dummy variables and factorize.

 
# 4. mixed model
# what we can do is mix these approaches: since cat1 to cat72 have just 2 labels, we can factorize
# these variables
# for the rest we can make dummies
train_test_mixed = train_test.copy()
binary_cols = ['cat' + str(i) for i in range(1, 73)]
multi_cols = ['cat' + str(i) for i in range(73, 117)]

for col in binary_cols:
    train_test_mixed[col] = pd.factorize(train_test_mixed[col], sort=True)[0]

dummies = pd.get_dummies(train_test_mixed[multi_cols])

train_test_mixed = pd.concat([train_test_mixed.drop(multi_cols, axis=1),
                              dummies], axis=1)

# take back the train and test set from the above data
train_df = train_test_mixed.iloc[:train.shape[0]]
test_df = train_test_mixed.iloc[train.shape[0]:]
test_df = test_df.drop('loss', axis=1)
train_labels = train_df.loss

Here's the full code: it is simply the snippets above combined into one script, in the same order, with plt.show() left commented out so the count plots don't block the run. The encoded train_df, test_df and train_labels at the end are what we can use for training and testing a model.

 
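As a rough illustration of that last point (my own sketch, not part of the original post), the encoded frames from any of the four approaches could be fed to a scikit-learn regressor, for example:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# train_df, test_df and train_labels come from one of the encoding approaches above
feature_cols = [col for col in train_df.columns if col != 'loss']
x_tr, x_val, y_tr, y_val = train_test_split(train_df[feature_cols], train_labels,
                                            test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
model.fit(x_tr, y_tr)
print(mean_absolute_error(y_val, model.predict(x_val)))  # the competition is scored on MAE

predictions = model.predict(test_df[feature_cols])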

Happy Python-ing!

Posted in: Machine Learning, Python