Chapter 6

Naïve Bayes and unstructured text

This chapter explores how we can use Naïve Bayes to classify unstructured text. For example, can we classify Twitter posts about a movie as positive or negative reviews?

Contents

  • an automatic system for determining positive and negative texts
  • how to train a Naïve Bayes classifier using unstructured text
  • stop words — discarding common words
  • classifying newsgroups
  • Python code for Naïve Bayes
  • Customized News — The Daily Me
  • The Twitter Challenge
  • the httplib2 library and how to install it
  • the json library and how to use it with Twitter
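As a rough illustration of the first three topics above (training on unstructured text, discarding stop words, and labeling a text as positive or negative), here is a minimal sketch of a Naïve Bayes text classifier. It is not the chapter's own code: the function names, the tiny stop-word list, and the tokenizing rule are assumptions made for this example. The word probabilities use the add-one form `(count + 1) / denominator`, with the denominator assumed to be the class's total word count plus the vocabulary size.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "it", "this", "of", "and", "i"}


def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,!?\"'()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]


def train(labeled_texts):
    """Build priors and smoothed word probabilities from (label, text) pairs."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_texts:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)

    total_docs = sum(doc_counts.values())
    priors = {label: n / total_docs for label, n in doc_counts.items()}

    word_probs = {}
    for label in doc_counts:
        # Add-one smoothing: no word in the vocabulary gets probability 0.
        denominator = sum(word_counts[label].values()) + len(vocab)
        word_probs[label] = {
            word: (word_counts[label][word] + 1) / denominator for word in vocab
        }
    return priors, word_probs, vocab


def classify(text, priors, word_probs, vocab):
    """Return the label with the highest log-probability score."""
    scores = {}
    for label in priors:
        score = math.log(priors[label])
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += math.log(word_probs[label][word])
        scores[label] = score
    return max(scores, key=scores.get)
```

Summing logarithms instead of multiplying raw probabilities avoids underflow when a text contains many words. To try the sketch, pass `train` a list such as `[("positive", "great fun"), ("negative", "boring plot")]` and hand its three return values to `classify`.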

PDF of the chapter

Python Code

Data

2 Comments to Chapter 6

  1. by Matt Martin

    On November 18, 2010 at 12:41 am

    In the part about JSON you say that it stands for “JavaScript Object Notion,” but I think you meant “JavaScript Object Notation.”

    Throughout the Twitter section, you use the word “twitter” as referring to a status update, but I think the more accepted term for that is “tweet.” Also, when you refer to Twitter as a web site you might want to capitalize it (it looks like you used both “Twitter” and “twitter” interchangeably).

    Finally, the link to this chapter from the table of contents page (not the home page) seems to be broken.

  2. by Kristine

    On December 9, 2011 at 1:18 am

    Thanks for all the great information, I found it very interesting.

    Quick note: the Python code contains two truncation errors:

    1. In the test function, the division (correct / total) is between two integers and truncates to zero. This can be fixed by initializing correct and total in that function to 0.0 instead of just 0.

    2. Similarly, when computing the probabilities in the class initialization, the arithmetic is all integers and everything gets truncated to 0. Changing the calculation from (count + 1) / denominator to (count + 1.0) / denominator solves this problem.

    Thanks again!
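The two fixes described in the comment above can be sketched as follows. The truncation arises because `/` between two integers performs floor division in Python 2; in Python 3, `/` always returns a float, so the problem does not occur there. The function names below are hypothetical and are not taken from the chapter's code; the sketch only shows a float-safe way to write each calculation that behaves the same under both versions.

```python
def accuracy(correct, total):
    """Fraction of test items classified correctly, always as a float."""
    if total == 0:
        return 0.0
    # float() forces true division even when both arguments are ints (Python 2).
    return float(correct) / total


def smoothed_probability(count, denominator):
    """Add-one smoothed probability; the 1.0 literal keeps the result a float."""
    return (count + 1.0) / denominator


# Floor division, shown explicitly with //, is what Python 2 did for int / int.
truncated = 3 // 4        # 0
fixed = accuracy(3, 4)    # 0.75
```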


By submitting a comment here you grant Ron Zacharski a perpetual license to reproduce your words and name/web site in attribution. Inappropriate or irrelevant comments will be removed at an admin's discretion.