Further Explorations in Classification

This chapter examines several other classification algorithms, including kNN and naïve Bayes, and looks at the power of adding more data.

Contents

  • Evaluating classifiers: training sets and test data
  • 10-fold cross validation
  • Which is better: adding more data or improving the algorithm?
  • The kNN algorithm
  • Python implementation of kNN

The PDF of the Chapter

Python code

Page 13: divide data into buckets: divide.py
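
For readers who want to see the idea before opening the file, here is a minimal sketch of dividing labeled data into buckets. It is not the contents of divide.py. The function name `divide_into_buckets`, the `label_of` callback, and the choice to keep class proportions roughly equal across buckets (stratification) are assumptions made for illustration.

```python
import random
from collections import defaultdict


def divide_into_buckets(rows, label_of, n_buckets=10, seed=0):
    """Split rows into n_buckets lists with roughly equal class proportions.

    rows     -- any list of records
    label_of -- hypothetical callback returning the class label of a record
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable

    # Group the records by class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    # Shuffle each class, then deal its members out round-robin.
    buckets = [[] for _ in range(n_buckets)]
    next_bucket = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for row in members:
            buckets[next_bucket].append(row)
            next_bucket = (next_bucket + 1) % n_buckets
    return buckets
```

Dealing the records out one class at a time means a class that makes up 60% of the data also makes up about 60% of each bucket, so no single bucket is unrepresentative.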

Page 14: nearestNeighborClassifier.py from last chapter (please modify to implement 10-fold cross validation).

Page 15: one solution to implementing 10-fold cross validation: crossValidation.py
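
The sketch below shows the general shape of 10-fold cross validation: each bucket takes one turn as the test set while the other nine serve as training data. It is not the contents of crossValidation.py. The names `cross_validate` and `majority_class`, and the assumption that each record is a `(features, label)` pair, are illustrative only.

```python
from collections import Counter


def cross_validate(buckets, classify):
    """Return overall accuracy, using each bucket once as the test set.

    buckets  -- list of lists of (features, label) pairs
    classify -- hypothetical callback: classify(training, features) -> label
    """
    correct = 0
    total = 0
    for i, test_bucket in enumerate(buckets):
        # Training data is every bucket except the one held out.
        training = [row for j, bucket in enumerate(buckets)
                    if j != i for row in bucket]
        for features, label in test_bucket:
            if classify(training, features) == label:
                correct += 1
            total += 1
    return correct / total


def majority_class(training, features):
    """Baseline classifier: always predict the most common training label."""
    return Counter(label for _, label in training).most_common(1)[0][0]
```

Any classifier with the same call signature can be passed in place of `majority_class`; the baseline is included only so the sketch runs on its own.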

Page 36: one solution to implementing kNN: pimaKNN.py.
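
As a rough illustration of the kNN idea, the sketch below classifies a query point by majority vote among its k nearest training points. It is not the contents of pimaKNN.py. The function name `knn_classify`, the `(features, label)` record layout, and the use of Euclidean distance are assumptions, and features are used unnormalized to keep the example short.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Predict the label of query by majority vote of its k nearest neighbors.

    training -- list of (features, label) pairs, features being numeric tuples
    query    -- numeric tuple with the same length as the training features
    """
    # Sort the training records by Euclidean distance to the query
    # (math.dist requires Python 3.8 or later) and keep the closest k.
    nearest = sorted(training,
                     key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With k=1 this reduces to the plain nearest neighbor classifier; larger values of k make the prediction less sensitive to a single unusual training point.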

Data

Page 13. Auto MPG Data Set. (Quinlan 1993)
Page 34. Pima Indians Diabetes Data Set (National Institute of Diabetes and Digestive and Kidney Diseases)
  • pimaSmall.zip (containing 100 instances divided into 10 buckets)
  • pima.zip (full data set divided into 10 buckets)