Machine Learning : Pre-processing features

From: http://analyticsbot.ml/2016/10/machine-learning-pre-processing-features/

I am participating in this Kaggle competition, the Allstate Claims Severity challenge. It is a prediction (regression) problem. The problem statement is:

How severe is an insurance claim?

When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect.

Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.

You can take a look at the data here. You can easily open the dataset in Excel and look at the variables/features. There are 116 categorical variables and 14 continuous variables in the dataset. Let's start the analysis.

Import all necessary modules.

 
# import required libraries
# pandas for reading data and manipulation
# scikit-learn for one hot encoding and label encoding
# seaborn and matplotlib to visualize
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
import operator

All these modules should be installed on your machine. I am using Python 2.7.11. If you need to install any of these modules, you can simply do:

 
pip install <module name>

# for example:
pip install pandas

Let’s read the datasets using pandas

 
# read data from csv files
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Let’s take a look at the data

 
# let's take a look at the train and test data
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)

# the above code won't print all columns.
# to print all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

# let's take a look at the train and test data again
print '**************************************'
print 'TRAIN DATA'
print '**************************************'
print train.head(5)
print '**************************************'
print 'TEST DATA'
print '**************************************'
print test.head(5)
 
[Output omitted: the first print shows only a truncated view of the first five rows of the train and test data; after setting the display options, all columns (id, cat1-cat116, cont1-cont14, and loss for the train set) are printed.]

You might have noticed that we printed the same thing twice. The first time pandas shows only a small number of columns for the first five observations, but the second time it prints all columns for those five observations. This is because of the following options:

 
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

Make sure you keep the 5 in head(), else it will print everything on screen, which will not be pretty. Take a look at the columns present in the train and test sets.

 
print train.columns
print test.columns

There is an ID column in both data sets which we don't need for any analysis. We will also keep a copy of the loss column from the training data set in a separate variable.

 
# remove the ID column. No use.
train = train.drop('id', axis=1)
test = test.drop('id', axis=1)
# keep the target in a separate variable as well
labels = train['loss'].copy()

Let's take a look at the continuous variables and their basic statistics.

 
# high level statistics: mean, median, count and quartiles
# note - this will work only for the continuous variables,
# not for the categorical variables
print train.describe()
print test.describe()
 
[Output omitted: describe() prints count, mean, std, min, quartiles and max for each continuous column, cont1 through cont14, plus loss in the train set.]

In many competitions, you’ll find there are some features that might be present in the training set but not in the test set and vice-versa.

 
# at this point, it is wise to check whether there are any features that
# are there in one of the datasets but not in the other
found = False
missing_in_test = [col for col in train.columns
                   if col != 'loss' and col not in test.columns]
if missing_in_test:
    found = True
    print len(missing_in_test), ' features are present in training set but not in test set'
    print missing_in_test

missing_in_train = [col for col in test.columns
                    if col not in train.columns]
if missing_in_train:
    found = True
    print len(missing_in_train), ' features are present in test set but not in training set'
    print missing_in_train

if not found:
    print 'train and test have the same feature columns'

In this case, we see that there are no columns that differ between the train and test sets (apart from loss, which is only in train).

Let's identify the categorical and continuous variables. For this data set, there are two ways to find them:

  • the variable names contain 'cat' or 'cont', which tells us which group they belong to
  • pandas stores the categorical columns with data type object
 
# find categorical variables
# in this problem, categorical variable names start with 'cat', which is easy
# to identify
# in other problems it might not be like that
# we will see two ways to identify them in this problem
# we will also find the continuous or numerical variables
## 1. by name
cat_features_train = [col for col in train.columns if 'cat' in col]
cat_features_test = [col for col in test.columns if 'cat' in col]

cont_features_train = [col for col in train.columns if 'cont' in col]
cont_features_test = [col for col in test.columns if 'cont' in col]

## 2. by type = object
cat_features_train = train.dtypes[train.dtypes == 'object'].index
cat_features_test = test.dtypes[test.dtypes == 'object'].index

cont_features_train = train.dtypes[train.dtypes != 'object'].index
cont_features_test = test.dtypes[test.dtypes != 'object'].index
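One caveat worth noting with the dtype-based approach (my observation, not something the original post checks): selecting the non-object columns of the train set also picks up the loss target, so it has to be excluded explicitly. A minimal sketch, reusing the train data frame loaded above:

# name-based selection: exactly the cat*/cont* columns
cat_by_name = [col for col in train.columns if 'cat' in col]
cont_by_name = [col for col in train.columns if 'cont' in col]

# dtype-based selection: object columns are categorical, everything else is numeric
cat_by_type = train.dtypes[train.dtypes == 'object'].index.tolist()
cont_by_type = train.dtypes[train.dtypes != 'object'].index.tolist()

print(set(cat_by_name) == set(cat_by_type))   # True: both give the 116 cat columns
print(set(cont_by_type) - set(cont_by_name))  # the target 'loss' sneaks in by dtype
cont_by_type = [col for col in cont_by_type if col != 'loss']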

Correlation between continuous variables

Let's take a look at the correlation between the continuous variables. The idea is that if two variables are highly correlated, we can drop one of them.

 
# let's check for correlation between continuous data
# correlation between numerical variables means something like this:
# if we increase one variable, there is a significant, almost proportional increase/decrease
# in the other variable. it varies from -1 to 1

cont_columns = [col for col in train.columns if 'cont' in col]
correlation_train = train[cont_columns].corr()
correlation_test = test[cont_columns].corr()

# for the purpose of this analysis, we will consider two variables to be
# highly correlated if the correlation is more than 0.6
threshold = 0.6
for i in range(len(correlation_train.columns)):
    for j in range(i):
        if abs(correlation_train.iloc[i, j]) > threshold:
            print correlation_train.columns[i], 'and', correlation_train.columns[j], '=', round(correlation_train.iloc[i, j], 2)

for i in range(len(correlation_test.columns)):
    for j in range(i):
        if abs(correlation_test.iloc[i, j]) > threshold:
            print correlation_test.columns[i], 'and', correlation_test.columns[j], '=', round(correlation_test.iloc[i, j], 2)

# we can remove one of the two highly correlated variables to improve performance
 
The pairs flagged in the train set (the test set gives the same pairs and values):

cont6 and cont1 = 0.76
cont7 and cont6 = 0.66
cont9 and cont1 = 0.93
cont9 and cont6 = 0.80
cont10 and cont1 = 0.81
cont10 and cont6 = 0.88
cont10 and cont9 = 0.79
cont11 and cont6 = 0.77
cont11 and cont7 = 0.75
cont11 and cont9 = 0.61
cont11 and cont10 = 0.70
cont12 and cont1 = 0.61
cont12 and cont6 = 0.79
cont12 and cont7 = 0.74
cont12 and cont9 = 0.63
cont12 and cont10 = 0.71
cont12 and cont11 = 0.99
cont13 and cont6 = 0.82
cont13 and cont9 = 0.64
cont13 and cont10 = 0.71
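The same information is easier to scan visually. Here is a small optional sketch of my own (not in the original post) that renders the train correlation matrix computed above as a heatmap, using the seaborn and matplotlib imports from the top of the script:

# visualize the correlation matrix of the continuous variables
plt.figure(figsize=(12, 9))
sns.heatmap(correlation_train, vmin=-1, vmax=1, cmap='coolwarm', annot=False)
plt.title('Correlation between continuous variables (train)')
plt.show()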

Let's take a look at the labels present in the categorical variables. Although the two data sets have the same columns, it may happen that some labels are present in one data set but not in the other.

 
# let's check for factors in the categorical variables
for col in cat_features_train:
    print col, train[col].unique()

for col in cat_features_test:
    print col, test[col].unique()

# let's take a look at whether the unique values/factors are present in each of the datasets
# for example, cat1 in both the datasets has values only A & B. Sometimes
# it may happen that some new value is present in the test set, which may ruin your model
for col in cat_features_train:
    not_in_test = [val for val in train[col].unique()
                   if val not in test[col].unique()]
    if not_in_test:
        print col, ' ', not_in_test, ' only in the train set'

for col in cat_features_test:
    not_in_train = [val for val in test[col].unique()
                    if val not in train[col].unique()]
    if not_in_train:
        print col, ' ', not_in_train, ' only in the test set'
 
[Output omitted: the unique labels of each categorical variable in train and test, followed by, for each variable, any labels that are present in one data set but missing from the other (for example 'HQ', 'KW', 'BA' among the many-level variables).]

Let's plot the categorical variables to see how their labels are distributed.

 
# let's visualize the values in each of the features
# keep in mind you'll be seeing a lot of plots now
# it is better to use an ipython/jupyter notebook to get inline plots
for col in cat_features_train:
    sns.countplot(x=col, data=train)
    plt.show()

for col in cat_features_test:
    sns.countplot(x=col, data=test)
    plt.show()

[Plots omitted: one count plot per categorical variable, showing the frequency of each label in the train and test sets.]

One Hot Encoding of categorical variables

Encode categorical integer features using a one-hot (also known as one-of-K) scheme. The input to this transformer should be a matrix of integers denoting the values taken on by the categorical (discrete) features. The output will be a sparse matrix where each column corresponds to one possible value of one feature.
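To make the "one column per possible value" idea concrete, here is a tiny toy example of my own (not from the original post) using DictVectorizer on a three-row frame:

from sklearn.feature_extraction import DictVectorizer
import pandas as pd

toy = pd.DataFrame({'cat1': ['A', 'B', 'A'], 'cat2': ['C', 'C', 'D']})
vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(toy.to_dict(orient='records'))

print(vec.get_feature_names())  # ['cat1=A', 'cat1=B', 'cat2=C', 'cat2=D']
print(encoded)                  # rows become [1 0 1 0], [0 1 1 0], [1 0 0 1]
# note: newer scikit-learn versions use get_feature_names_out() instead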

  1. The first way is to use DictVectorizer to encode the labels in the features.
 
# cat1 to cat72 have only two labels A and B
# cat73 to cat108 have more than two labels
# cat109 to cat116 have many labels
# moreover, you must have noticed that some labels are missing in some features of the train/test dataset
# this might become a problem when working with multiple datasets
# to avoid this, we will merge the data before doing one hot encoding
train_test = pd.concat((train, test), axis=0, ignore_index=True)
cat_features = train_test.dtypes[train_test.dtypes == 'object'].index
# let's check for factors in the categorical variables
for col in cat_features:
    print col, train_test[col].unique()

# 1. one hot encoding all categorical variables
vectorizer = DictVectorizer(sparse=False)
cat_dicts = train_test[cat_features].to_dict(orient='records')
encoded = vectorizer.fit_transform(cat_dicts)
print vectorizer.vocabulary_
encoded = pd.DataFrame(encoded, columns=vectorizer.get_feature_names())
print encoded.shape

# it can be seen that we are adding too many new variables. This encoding is important
# since machine learning algorithms don't understand strings, and converting string factors
# to numeric factors increases our dimensionality
train_test_onehot = pd.concat([train_test, encoded], axis=1)

# remove the initial categorical variables
train_test_onehot = train_test_onehot.drop(cat_features, axis=1)

# take back the train and test set from the above data
train_df = train_test_onehot.iloc[:train.shape[0]]
test_df = train_test_onehot.iloc[train.shape[0]:]
test_df = test_df.drop('loss', axis=1)
train_labels = train_df.loss

2. The second method is to use pandas get_dummies to create dummy variables.

 
# 2. using get_dummies from pandas
train_test_dummies = train_test
dummies = pd.get_dummies(train_test_dummies[cat_features])

train_test_dummies = pd.concat([train_test_dummies.drop(cat_features, axis=1),
                                dummies], axis=1)

# take back the train and test set from the above data
train_df = train_test_dummies.iloc[:train.shape[0]]
test_df = train_test_dummies.iloc[train.shape[0]:]
test_df = test_df.drop('loss', axis=1)
train_labels = train_df.loss

3. Some of these variables have only two labels and some have more than two. Another way is to use pd.factorize to convert the labels to numeric codes.

 
# 3. pd.factorize
train_test_factorized = train_test.copy()
for col in cat_features:
    train_test_factorized[col] = pd.factorize(train_test_factorized[col], sort=True)[0]

# take back the train and test set from the above data
train_df = train_test_factorized.iloc[:train.shape[0]]
test_df = train_test_factorized.iloc[train.shape[0]:]
test_df = test_df.drop('loss', axis=1)
train_labels = train_df.loss
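For clarity, pd.factorize returns a pair of (integer codes, unique labels); a tiny illustration of my own, not from the original post:

import pandas as pd

codes, uniques = pd.factorize(['B', 'A', 'B', 'D', 'A'], sort=True)
print(codes)    # [1 0 1 2 0]
print(uniques)  # ['A' 'B' 'D']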

4. Another way is to mix dummy variables and factorize.

 
# 4. mixed model
# what we can do is mix these approaches: since cat1 to cat72 have just 2 labels, we can factorize
# these variables
# for the rest we can make dummies
train_test_mixed = train_test.copy()
binary_cols = ['cat' + str(i) for i in range(1, 73)]
multi_cols = ['cat' + str(i) for i in range(73, 117)]

for col in binary_cols:
    train_test_mixed[col] = pd.factorize(train_test_mixed[col], sort=True)[0]

dummies = pd.get_dummies(train_test_mixed[multi_cols])

train_test_mixed = pd.concat([train_test_mixed.drop(multi_cols, axis=1),
                              dummies], axis=1)

# take back the train and test set from the above data
train_df = train_test_mixed.iloc[:train.shape[0]]
test_df = train_test_mixed.iloc[train.shape[0]:]
test_df = test_df.drop('loss', axis=1)
train_labels = train_df.loss

Here's the full code: it is simply the snippets above combined into one script, in the same order, with plt.show() left commented out so the count plots don't block the run. The encoded train_df, test_df and train_labels at the end are what we can use for training and testing a model.

 
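As a rough illustration of that last point (my own sketch, not part of the original post), the encoded frames from any of the four approaches could be fed to a scikit-learn regressor, for example:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# train_df, test_df and train_labels come from one of the encoding approaches above
feature_cols = [col for col in train_df.columns if col != 'loss']
x_tr, x_val, y_tr, y_val = train_test_split(train_df[feature_cols], train_labels,
                                            test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)
model.fit(x_tr, y_tr)
print(mean_absolute_error(y_val, model.predict(x_val)))  # the competition is scored on MAE

predictions = model.predict(test_df[feature_cols])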

Happy Python-ing!

Posted in: Machine Learning, Python