Chapter 5
Further Explorations in Classification
This chapter examines several other classification algorithms, including kNN and naïve Bayes. It also looks at how much simply adding more data can improve a classifier.
Contents
- Evaluating classifiers: training sets and test data
- 10-fold cross validation
- Which is better: adding more data or improving the algorithm?
- The kNN algorithm
- Python implementation of kNN
The PDF of the Chapter
Python code
Page 13: divide data into buckets: divide.py
Page 14: nearestNeighborClassifier.py from last chapter (please modify to implement 10-fold cross validation).
Page 15: one solution to implementing 10-fold cross validation: crossValidation.py
Page 36: one solution to implementing kNN: pimaKNN.py
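For readers who want a feel for how these pieces fit together before opening the files, here is a minimal sketch of the three ideas: dividing data into buckets, a kNN classifier, and 10-fold cross validation. It is not the code in divide.py, crossValidation.py, or pimaKNN.py. The function names (`divide_into_buckets`, `knn_classify`, `cross_validate`), the `(label, features)` tuple layout, and the use of Manhattan distance are all assumptions made for illustration, and the sketch assumes the features are already on comparable scales.

```python
import random
from collections import Counter, defaultdict


def divide_into_buckets(instances, n_buckets=10, seed=0):
    """Split (label, features) pairs into n_buckets so that each bucket
    holds roughly the same proportion of every class (stratified split)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for instance in instances:
        by_class[instance[0]].append(instance)
    buckets = [[] for _ in range(n_buckets)]
    position = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the shuffled members of this class out round-robin.
        for instance in members:
            buckets[position % n_buckets].append(instance)
            position += 1
    return buckets


def knn_classify(training, features, k=3):
    """Return the majority label among the k training instances nearest
    to `features` (Manhattan distance, an assumption of this sketch)."""
    def distance(other):
        return sum(abs(a - b) for a, b in zip(features, other))

    nearest = sorted(training, key=lambda inst: distance(inst[1]))[:k]
    votes = Counter(label for label, _ in nearest)
    return votes.most_common(1)[0][0]


def cross_validate(buckets, k=3):
    """Hold out each bucket once as test data, train on the remaining
    buckets, and return the overall fraction classified correctly."""
    correct = total = 0
    for i, test_bucket in enumerate(buckets):
        training = [inst for j, bucket in enumerate(buckets)
                    if j != i for inst in bucket]
        for label, features in test_bucket:
            if knn_classify(training, features, k) == label:
                correct += 1
            total += 1
    return correct / total
```

With a list of `(label, features)` tuples called `data`, calling `cross_validate(divide_into_buckets(data), k=3)` returns a single accuracy figure computed over all ten held-out buckets. Dealing each class out separately is one simple way to keep every bucket representative of the whole data set.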
Data
Page 13. Auto MPG Data Set. (Quinlan 1993)
- Version divided into buckets in the format the book uses: mpgData.zip
- Original version from the UCI Machine Learning Repository.
Page 34. Pima Indians Diabetes Data Set (National Institute of Diabetes and Digestive and Kidney Diseases)
- pimaSmall.zip (containing 100 instances divided into 10 buckets)
- pima.zip (full data set divided into 10 buckets)