Chapter 2

Getting Started with Recommendation Systems

Contents

  • How a recommendation system works
  • How social filtering works
  • How to find similar items
  • Manhattan distance
  • Euclidean distance
  • Minkowski distance
  • Pearson Correlation Coefficient
  • Cosine similarity
  • Implementing k-nearest neighbors in Python
  • The Book Crossing dataset

The PDF of the chapter

Python code

The code for the initial Python example: filteringdata.py

The code for the Pearson implementation: filteringdataPearson.py

The code for the Python recommender class: recommender.py

Data

The Book Crossing Data: BX-Dump.zip

Movie Ratings (20 movies rated on a scale of 1-5; a blank means that person didn’t see that movie). Data provided by students from the University of Mary Washington data mining class.

9 Comments to Chapter 2

  1. by Erin Wuepper

    On August 30, 2010 at 7:55 pm

    These are just some minor punctuation and clarity corrections. My corrections are in the []s.

    Last line of page two: “A 2D chart of these relationships is shown on the following page[.]”

    Line 7 of page three: “In the 2D case[,] each person is represented by an (x, y) point.”

    Line 7 of page three: “I will [add a] subscript [to] the x and y to refer to different people.”

    Lines 4-7 of page six: “This is going to skew our distance measurement[,] since the Hailey-Veronica distance is in 2 dimensions [while] the Hailey-Jordyn distance is in 5[. Because of this,] it is likely that the Hailey-Veronica distance will be shorter than the Hailey-Jordyn one, creating a false sense that Hailey is more similar to Veronica than she is to Jordyn.”

    Line 2 of page seven: “There are several [ways of] represent[ing] the data in the table above [using] Python.”

    Last three lines on page ten: “We see that Angelica rated every band that Veronica did[. W]e have no new ratings[,] so no recommendations.
    Shortly[,] we will see how to improve the system to avoid [these] case[s].”

  2. by Amy Sams

    On August 30, 2010 at 10:52 pm

    Above the chart on Page 2 and top of page 3: Snow Crash & The Girl with the Dragon Tattoo need to be italicized since they are books.

    Above the formula on Page 3: “calculated by” could be added, so that it reads: “The Manhattan Distance is then calculated by:”

    Top of Page 4: need a colon after “gives us”

    1st line under heading Euclidean Distance on Page 4: capitalize distance (you have distance capitalized in prior sentences)

    5th line under heading Euclidean Distance on Page 4: capitalize distance and put a colon after “is”

    Top of Page 5: capitalize all three occurrences of distance

    Highlighted Portion on Page 5: “the more a large difference” doesn’t make sense

    Last line on Page 6: comma after “To compensate for this”

    Heading “The code to compute…” on Page 8: capitalize distance here as well as the sentence following

    Above the third section of code on Page 9: change the period to a colon; “Here is my function to make recommendations:”

  3. by Dublin

    On September 7, 2010 at 5:38 pm

    Ran into a bug in recommender.py

    # now make list from dictionary and only get the first n items
    recommendations = list(recommendations.items())[:self.n]
    recommendations = [(self.convertProductID2name(k), v) for (k, v) in recommendations]
    # finally sort and return
    recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse = True)
    return recommendations

    Should be:
    # now make list from dictionary and only get the first n items
    recommendations = list(recommendations.items())
    recommendations = [(self.convertProductID2name(k), v) for (k, v) in recommendations]
    # finally sort and return
    recommendations.sort(key=lambda artistTuple: artistTuple[1], reverse = True)
    return recommendations[:self.n]

    That way we return the top ‘n’ recs, rather than pulling out an indeterminate ‘n’ recs and sorting those. The same bug exists in the userRatings function.
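The ordering issue described in this comment can be reproduced without recommender.py. The following is a minimal sketch, not the book's code: the function names `top_n` and `first_n_then_sort` and the band ratings are hypothetical, chosen only to show why the sort has to come before the slice.

```python
def top_n(scores, n):
    """Sort every (name, score) pair by score, then keep the best n."""
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


def first_n_then_sort(scores, n):
    """The order reported as a bug: slice first, then sort the slice.

    The slice keeps whichever n items the dictionary yields first,
    so higher-scoring items later in the dictionary are lost.
    """
    subset = list(scores.items())[:n]
    subset.sort(key=lambda pair: pair[1], reverse=True)
    return subset


# Hypothetical scores; dictionaries preserve insertion order in Python 3.7+.
ratings = {"Band A": 2.0, "Band B": 4.5, "Band C": 1.0, "Band D": 5.0}

print(top_n(ratings, 2))              # the two highest-scoring bands
print(first_n_then_sort(ratings, 2))  # only ever sees Band A and Band B
```

With this data, the slice-first version never considers Band D, even though it has the highest score.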

  4. by Dublin

    On September 7, 2010 at 6:09 pm

    - When using the Minkowski distance, what should I consider when choosing a value for r? Are there any common criteria people use, or is it more trial and error?
    - It would be nice if there was a summary box at the end that lays out the practical differences between manhattan, euclidean, pearson, and cos sim.
    - p2.6, “rated five movies in common,” should be “songs.”
    - p2.6, point out that dividing distance by # of dims is a “rule of thumb” massaging of the data rather than part of some rigorously proven mathematical algorithm. If it is practical and works that’s awesome, but I found myself wondering where it came from and why it was correct.
    - p2.8 – In the manhattan function, “distance = 0” isn’t indented enough.
    - p2.9 – for the recommend function, it would be helpful if the comments pointed out that [0][1] is getting the username of the nearest neighbor, and that recommendations.sort is sorting on rating desc.
    - Charts 1, 2, and 3 should label the y axis, so I can see at a glance who we’re comparing Angelica against without having to refer back to the text.
    - p2.13 – It might not be apparent to those who haven’t taken stats that x-bar and y-bar are sample means.
    - The pearson section’s graphs and corresponding scores made it much easier to understand what was going on with the algorithm. A similar approach would be useful with cosine similarity. It would also be good if you captured in a sentence or two what cos sim is actually doing (the wikipedia intro helped me get a mental image of what was going on).
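Several of the points above (the role of r in the Minkowski distance, x-bar and y-bar as sample means, and what cosine similarity measures) may be easier to see side by side in code. This is a sketch using the textbook definitions, not the chapter's implementation. The function names, the two example users, and the choice to treat a missing rating as 0 in the cosine calculation are all assumptions made for illustration.

```python
from math import sqrt


def minkowski(rating1, rating2, r):
    """Minkowski distance over the items both users rated.

    r = 1 gives the Manhattan distance, r = 2 the Euclidean distance.
    Larger r lets a single large difference dominate the total.
    """
    common = [k for k in rating1 if k in rating2]
    if not common:
        return None  # no shared items, so no distance can be computed
    total = sum(abs(rating1[k] - rating2[k]) ** r for k in common)
    return total ** (1 / r)


def pearson(rating1, rating2):
    """Pearson correlation over shared items (textbook definition)."""
    common = [k for k in rating1 if k in rating2]
    n = len(common)
    if n == 0:
        return 0.0
    # x_bar and y_bar are the sample means of each user's shared ratings.
    x_bar = sum(rating1[k] for k in common) / n
    y_bar = sum(rating2[k] for k in common) / n
    numerator = sum((rating1[k] - x_bar) * (rating2[k] - y_bar) for k in common)
    x_spread = sqrt(sum((rating1[k] - x_bar) ** 2 for k in common))
    y_spread = sqrt(sum((rating2[k] - y_bar) ** 2 for k in common))
    denominator = x_spread * y_spread
    return numerator / denominator if denominator else 0.0


def cosine_similarity(rating1, rating2):
    """Cosine of the angle between two rating vectors.

    Assumption: an item a user has not rated counts as 0.
    """
    keys = set(rating1) | set(rating2)
    dot = sum(rating1.get(k, 0) * rating2.get(k, 0) for k in keys)
    norm1 = sqrt(sum(v * v for v in rating1.values()))
    norm2 = sqrt(sum(v * v for v in rating2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


# Hypothetical users, not taken from the chapter's data.
user_a = {"Band A": 1.0, "Band B": 2.0}
user_b = {"Band A": 4.0, "Band B": 6.0}

print(minkowski(user_a, user_b, 1))       # Manhattan
print(minkowski(user_a, user_b, 2))       # Euclidean
print(pearson(user_a, user_b))            # agreement in rating pattern
print(cosine_similarity(user_a, user_b))  # angle between rating vectors
```

In this example the two users are far apart by either distance, yet their Pearson score is close to 1 because their ratings rise and fall together. That is the practical difference: the distance measures compare rating values directly, while Pearson compares each user's ratings relative to that user's own mean.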

  5. by raz

    On September 8, 2010 at 9:55 pm

    Okay. Thanks. Will work on these this weekend.

  6. by Cardigan

    On September 10, 2010 at 4:57 am

    - p2.6 “movies” should be “bands”
    - p2.23 “CVS” should be “CSV”

  7. by Gary

    On September 11, 2010 at 10:53 pm

    It appears you meant to have a Q & A box on page 2-13 and 2-14. As it is, it’s kind of a Q & QA box. Maybe you meant to repeat the question but have ANSWER: at the top of the box on 2-14?

  8. by Gary

    On September 11, 2010 at 11:10 pm

    Please withdraw my last comment; the pages are actually in ch 3.

  9. by Amy Sams

    On September 19, 2010 at 5:16 pm

    The answer to the Puzzler on Page 18:

    Sally —> Pearson = 0.8; Influence = 0.8/(0.8+0.7) = 0.5333
    Eric —> Pearson = 0.7; Influence = 0.7/(0.8+0.7) = 0.4667

    Projected Rating = (3.5*0.5333) + (5.0*0.4667) = 4.2
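The arithmetic in this answer can be checked with a short script. This is a sketch of the weighting as worked in the comment above: each neighbor's influence is their Pearson score divided by the sum of all neighbors' Pearson scores. The function names are hypothetical; the numbers are the ones given for Sally and Eric.

```python
def influences(neighbors):
    """Map each neighbor to their share of the total Pearson score.

    neighbors: dict of name -> (pearson score, rating given).
    """
    total = sum(score for score, _ in neighbors.values())
    return {name: score / total for name, (score, _) in neighbors.items()}


def projected_rating(neighbors):
    """Weighted average of the neighbors' ratings, weighted by influence."""
    weights = influences(neighbors)
    return sum(weights[name] * rating for name, (_, rating) in neighbors.items())


# Values from the comment: (Pearson score, rating given).
neighbors = {"Sally": (0.8, 3.5), "Eric": (0.7, 5.0)}

for name, weight in influences(neighbors).items():
    print(name, round(weight, 4))
print(round(projected_rating(neighbors), 1))
```

Because the influences sum to 1, the projected rating always falls between the lowest and highest neighbor rating.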


By submitting a comment here you grant Ron Zacharski a perpetual license to reproduce your words and name/web site in attribution. Inappropriate or irrelevant comments will be removed at an admin's discretion.