notebooks/ICML_Code/PathsonRecSys.ipynb

Need to download MovieLens 20m Data Set (190 MB).

Data sets:

  • ratings.csv
    • 3 columns: userId, movieId, rating
  • movies.csv
    • 3 columns: movieId, title, genres (| separated)

We load the ratings.csv and movies.csv data sets and create a subset of movies based on the number of ratings.

  • Create a map movies_map from movieId list of user ids
  • Create a subset slim_movies_map of movies_map by keeping movies with at least 10 ratings
  • Create a subset slimer_movies_map of movies_map randomly including some movies and making sure to include a few select movies. The number of movies goes from 22,884 to ~216.

Question: Is the slimming of movies_map just for shorter computational time? Or is there some statistical advantage of a smaller sample size?

Building 1-skeleton:

The code simply generates a Cech complex but instead of balls we use Jaccard distance.

The covers contain all users who rated movie . Thus, the intersection would give the users who rated a set of movies. Hence, if , it follows that at least two people rated the same movie, that is, the covers intersect.

In terms of the mapper, our filter function is given by We cover the space of movies with singleton sets, resulting in the preimage giving all ratings of a movie.

The movies_map represents all users who rated a particular movie.

  1. Starts with Cover.build:
  • Create 0-simplices, one for each movie.
  • For all -combinations of movies () Jaccard distance of the covers. If it is less than 1, add -simplex.
  • Sample result:
[([3], 0.0), ([100], 0.0), ([253], 0.0), ([447], 0.0), ([574], 0.0), ([688], 0.0), ([702], 0.0), ([753], 0.0), ([1127], 0.0), ([1197], 0.0), ([1269], 0.0), ([1314], 0.0), ([1386], 0.0), ([1459], 0.0), ([1659], 0.0), ([1907], 0.0), ([1909], 0.0),...((117533, 118930), 0.9782608695652174)]
  1. Generate a nx.Graph representing the 1-skeleton