A Programmer's Guide to Data Mining

Chapter 5

Chapters
1: Introduction
2: Recommendation systems
3: Item-based filtering
4: Classification
5: More on classification
6: Naïve Bayes
7: Unstructured text
8: Clustering

Further Explorations in Classification

This chapter examines several other classification algorithms, including kNN and naïve Bayes. We also look at the power of adding more data.

Contents

Evaluating classifiers: training sets and test data
10-fold cross validation
Which is better: adding more data or improving the algorithm?
The kNN algorithm
Python implementation of kNN

The PDF of the Chapter

Python code

Page 13: divide data into buckets: divide.py

Page 14: nearestNeighborClassifier.py from last chapter (please modify to implement 10-fold cross validation).

Page 15: one solution to implementing 10-fold cross validation: crossValidation.py

Page 36: one solution to implementing kNN: pimaKNN.py.
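
For readers who want to see how the pieces listed above fit together before opening the files, here is a minimal sketch of stratified bucket division, a kNN vote, and 10-fold cross validation. It is not the code in divide.py, crossValidation.py, or pimaKNN.py. The function names (divide_into_buckets, knn_classify, cross_validate), the (features, label) row layout, and the choice of unnormalized Manhattan distance are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def divide_into_buckets(rows, n_buckets=10, seed=0):
    """Stratified split: deal each class's rows round-robin into the buckets.

    rows is a list of (features, label) pairs (a layout assumed here).
    """
    by_class = defaultdict(list)
    for features, label in rows:
        by_class[label].append((features, label))
    rng = random.Random(seed)
    buckets = [[] for _ in range(n_buckets)]
    i = 0  # running counter across classes keeps bucket sizes balanced
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for row in members:
            buckets[i % n_buckets].append(row)
            i += 1
    return buckets


def manhattan(a, b):
    """Manhattan distance between two equal-length numeric vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_classify(train, features, k=3):
    """Return the majority label among the k training rows nearest to features."""
    nearest = sorted(train, key=lambda row: manhattan(row[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(buckets, k=3):
    """Hold out each bucket once as test data; train on the remaining buckets."""
    correct = total = 0
    for i, test_bucket in enumerate(buckets):
        train = [row for j, b in enumerate(buckets) if j != i for row in b]
        for features, label in test_bucket:
            correct += knn_classify(train, features, k) == label
            total += 1
    return correct / total  # overall accuracy across all folds


if __name__ == "__main__":
    # Synthetic two-class data, used only to exercise the functions.
    rng = random.Random(1)
    data = [((rng.gauss(0, 1), rng.gauss(0, 1)), "a") for _ in range(50)]
    data += [((rng.gauss(10, 1), rng.gauss(10, 1)), "b") for _ in range(50)]
    print(cross_validate(divide_into_buckets(data), k=3))
```

The sketch skips normalization, so on real data such as the Pima set, attributes with large ranges would dominate the distance unless the columns are rescaled first.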

Data

Page 13. Auto MPG Data Set (Quinlan 1993)

Version divided into buckets in the format the book uses: mpgData.zip
Original version from the UCI Machine Learning Repository.

Page 34. Pima Indians Diabetes Data Set (National Institute of Diabetes and Digestive and Kidney Diseases)

pimaSmall.zip (containing 100 instances divided into 10 buckets)
pima.zip (full data set divided into 10 buckets)