April 30

recommender systems

overview of topics in the textbook :

Given data per user that lists what they like (say movie ratings), how should we pick suggested movies for a given individual?

find their interests, choose the most popular in those categories (easy but not targeted)
user-based collaborative filter
- find similar users and recommend what they like
- "cosine similarity" : do the vectors of interests point in similar directions?
- build matrix of users (rows), movies (columns), values of 1=like, 0=not_like
- find most common interests for most similar people
- ... but problematic if lots of features
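The user-based steps above can be sketched in a few lines of plain Python. This is only an illustration: the users, movies, and like values are made up, and the scoring rule (add up the similarity of every other user who likes a movie you haven't liked) is one simple choice, not necessarily what the textbook does.

```python
import math

# Hypothetical like-matrix: rows are users, columns are movies, 1=like, 0=not_like.
movies = ["Alien", "Brazil", "Casablanca", "Dune"]
likes = {
    "ann": [1, 1, 0, 0],
    "bob": [1, 1, 1, 0],
    "cat": [0, 0, 1, 1],
}

def cosine_similarity(v, w):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

def recommend_for(user):
    """Rank movies the user has not liked by the summed similarity of users who like them."""
    scores = {}
    for other, other_likes in likes.items():
        if other == user:
            continue
        sim = cosine_similarity(likes[user], other_likes)
        for movie, mine, theirs in zip(movies, likes[user], other_likes):
            if theirs and not mine:
                scores[movie] = scores.get(movie, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)

print(recommend_for("ann"))
```

Here ann and bob share two likes, so bob's extra like (Casablanca) ranks first for ann; cat shares nothing with ann, so cat's likes add no weight.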
item-based collaborative filter
- find out which items (movies) have similar likes/dislikes
- transpose that last matrix
- find movies that are similar
- recommend movies similar to those that you like
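The item-based version is the same idea with the matrix turned on its side. A rough sketch, again with made-up data: transpose the like-matrix so each row is a movie, then compare movie rows with cosine similarity.

```python
import math

# Hypothetical like-matrix: one row per user, one column per movie.
movies = ["Alien", "Brazil", "Casablanca", "Dune"]
user_rows = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]

# Transpose: now one row per movie, one column per user.
movie_rows = [list(column) for column in zip(*user_rows)]

def cosine_similarity(v, w):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    if norm_v == 0 or norm_w == 0:
        return 0.0
    return dot / (norm_v * norm_w)

def most_similar_movies(title):
    """Other movies, ordered from most to least similar to `title`."""
    target = movie_rows[movies.index(title)]
    others = [m for m in movies if m != title]
    return sorted(
        others,
        key=lambda m: cosine_similarity(target, movie_rows[movies.index(m)]),
        reverse=True,
    )

print(most_similar_movies("Alien"))
```

To recommend for a person, one would take the movies they like and collect the top entries of `most_similar_movies` for each.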
matrix factorization
- data used : http://files.grouplens.org/datasets/movielens/ml-100k.zip (not simple csv file)
  - u.item : id|movie_name (bars between items)
  - u.data : user_id, movie_id, rating (tabs between items)
  - (We should discuss how these are similar to SQL database notions.)
- trains a neural net to predict rating given user and movie as input
- used "principal component analysis" to understand neural net matrix
- ... which gives as one of its primary directions which movies are best (highest rated overall)
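For a feel of what "matrix factorization" means, here is a much smaller stand-in for the approach above. It is not the textbook's neural net: it is plain stochastic gradient descent on dot products of per-user and per-movie vectors. The ratings are made up (laid out like u.data, tab-separated), and the vector length, learning rate, and epoch count are arbitrary choices.

```python
import random

# Made-up ratings in the u.data layout: user_id, movie_id, rating (tab-separated).
data_text = "1\t1\t5\n1\t2\t4\n2\t1\t4\n2\t3\t1\n3\t2\t2\n3\t3\t5\n"
ratings = [tuple(int(x) for x in line.split("\t")[:3])
           for line in data_text.splitlines()]

n_factors = 2           # length of each latent vector (arbitrary)
learning_rate = 0.05    # step size (arbitrary)
rng = random.Random(0)  # fixed seed so runs are repeatable

# One small random vector per user and per movie.
user_ids = sorted({u for u, _, _ in ratings})
movie_ids = sorted({m for _, m, _ in ratings})
user_vec = {u: [rng.uniform(0, 1) for _ in range(n_factors)] for u in user_ids}
movie_vec = {m: [rng.uniform(0, 1) for _ in range(n_factors)] for m in movie_ids}

def predict(u, m):
    """Predicted rating: dot product of the user vector and the movie vector."""
    return sum(a * b for a, b in zip(user_vec[u], movie_vec[m]))

def mean_squared_error():
    return sum((r - predict(u, m)) ** 2 for u, m, r in ratings) / len(ratings)

error_before = mean_squared_error()
for epoch in range(500):
    for u, m, r in ratings:
        err = r - predict(u, m)
        for k in range(n_factors):
            pu, qm = user_vec[u][k], movie_vec[m][k]
            # Nudge both vectors so the prediction moves toward the rating.
            user_vec[u][k] += learning_rate * err * qm
            movie_vec[m][k] += learning_rate * err * pu
error_after = mean_squared_error()

print(error_before, error_after)
```

The "matrix" being factored is the users-by-movies table of ratings; the two sets of vectors are its factors, and `predict` fills in the entries nobody has rated yet.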
further reading
- surprise python library
- the netflix prize was a "thing" near the start of the ML craze.

SQL

How should complicated data collections be organized in files?

Consider books, their authors, and their publishers.

Putting everything into a single table like this doesn't work well.

book    
----------
id,name,phone,publisher,author1,author2,

problems:

authors with same names
duplication of names
books with 20 authors ... does every row have 20 slots?
information about publishers (address, phone number, ....)

Instead of trying to put an author list into a book as we would do in a language like python

book = {'name': 'This book', 'authors': ['George', 'Alice']}

we use reference pointers (ids) to refer to other "things" (objects) in our database.

a database is made of tables
- each table describes a type of thing (e.g. Person, Book, Publisher)
  - each row is an instance of this type of thing
  - each column describes a property of that instance
  - each row has a unique identifier ("primary key")
- connections between the tables are identifiers for one thing stored in another table ("foreign key")

Here's an example.

Publishers
------------
id, name, address, phone, ...
1, Bob's Books, ...
2, Mary's Magazines, ...

Authors
-------
id, name, address, phone, ...
1, John Smith, ...
2, Jane Doe, ....

Books
-----
id, name, ISBN, ...
1, A Field Guide to Benches, 1112223334
2, Rocket Science Made Easy, 9879879876

So far so good : no replication of information, no ever-expanding columns.
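As a rough sketch, here are those three tables in SQLite, using Python's built-in sqlite3 module. The address and phone columns are left out, and the second "John Smith" is a made-up extra row to show why ids matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# One table per type of thing; "id" is the primary key of each row.
conn.executescript("""
    CREATE TABLE Publishers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Authors    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Books      (id INTEGER PRIMARY KEY, name TEXT, ISBN TEXT);
""")
conn.executemany("INSERT INTO Publishers VALUES (?, ?)",
                 [(1, "Bob's Books"), (2, "Mary's Magazines")])
conn.executemany("INSERT INTO Authors VALUES (?, ?)",
                 [(1, "John Smith"), (2, "Jane Doe")])
conn.executemany("INSERT INTO Books VALUES (?, ?, ?)",
                 [(1, "A Field Guide to Benches", "1112223334"),
                  (2, "Rocket Science Made Easy", "9879879876")])

# Two different authors may share a name; their ids still tell them apart.
conn.execute("INSERT INTO Authors VALUES (?, ?)", (3, "John Smith"))
john_ids = [row[0] for row in conn.execute(
    "SELECT id FROM Authors WHERE name = ? ORDER BY id", ("John Smith",))]
print(john_ids)

# Reusing an existing primary key is refused.
try:
    conn.execute("INSERT INTO Authors VALUES (?, ?)", (1, "Someone Else"))
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)
```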

But now we need to make the relations. There are several types:

one to one
one to many
many to many

Since each book has only a single publisher (while a publisher can have many books : one to many), we can put a single pointer in the Books table to that publisher's id.

Books
-----
id, name, ISBN, publisher_id
1, A Field Guide to Benches, 1112223334, 2     
2, Rocket Science Made Easy, 9879879876, 1
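Here is one way that publisher_id pointer might look in SQLite (via Python's sqlite3), as a sketch rather than the only way to do it. The column is declared as a foreign key, and a JOIN follows the pointer to get each book's publisher. The "No Such Publisher" row is made up to show a bad pointer being rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked

conn.executescript("""
    CREATE TABLE Publishers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Books (
        id           INTEGER PRIMARY KEY,
        name         TEXT,
        ISBN         TEXT,
        publisher_id INTEGER REFERENCES Publishers(id)  -- foreign key
    );
""")
conn.executemany("INSERT INTO Publishers VALUES (?, ?)",
                 [(1, "Bob's Books"), (2, "Mary's Magazines")])
conn.executemany("INSERT INTO Books VALUES (?, ?, ?, ?)",
                 [(1, "A Field Guide to Benches", "1112223334", 2),
                  (2, "Rocket Science Made Easy", "9879879876", 1)])

# Follow each book's publisher_id to the matching row in Publishers.
rows = conn.execute("""
    SELECT Books.name, Publishers.name
    FROM Books JOIN Publishers ON Books.publisher_id = Publishers.id
    ORDER BY Books.id
""").fetchall()
for book_name, publisher_name in rows:
    print(book_name, "is published by", publisher_name)

# A pointer to a publisher that does not exist is rejected.
try:
    conn.execute("INSERT INTO Books VALUES (?, ?, ?, ?)",
                 (3, "No Such Publisher", "0000000000", 99))
    dangling_allowed = True
except sqlite3.IntegrityError:
    dangling_allowed = False
print(dangling_allowed)
```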

Many to many is the tricky one. We make a new table with an entry for each connection. Sometimes there's an obvious name for this table; sometimes not.

AuthorBook
-----------
id, author_id, book_id
1, 1, 1
2, 2, 1
3, 1, 2

The "id" in that last table is the unique identity of the connection ... which we may never need. (Unless we start making changes.)

Quick quiz: who are the authors of which books?
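One way to check an answer to the quiz is to let the database do the lookups. This sketch (Python's sqlite3 again, with only the id and name columns) loads the example rows and joins through AuthorBook to both of the tables it points at.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Books   (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE AuthorBook (
        id        INTEGER PRIMARY KEY,
        author_id INTEGER REFERENCES Authors(id),
        book_id   INTEGER REFERENCES Books(id)
    );
""")
conn.executemany("INSERT INTO Authors VALUES (?, ?)",
                 [(1, "John Smith"), (2, "Jane Doe")])
conn.executemany("INSERT INTO Books VALUES (?, ?)",
                 [(1, "A Field Guide to Benches"),
                  (2, "Rocket Science Made Easy")])
conn.executemany("INSERT INTO AuthorBook VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 1), (3, 1, 2)])

# Each AuthorBook row is one (author, book) connection; join out to both sides.
pairs = conn.execute("""
    SELECT Books.name, Authors.name
    FROM AuthorBook
    JOIN Authors ON AuthorBook.author_id = Authors.id
    JOIN Books   ON AuthorBook.book_id   = Books.id
    ORDER BY Books.id, Authors.id
""").fetchall()
for book_name, author_name in pairs:
    print(book_name, ":", author_name)
```

Neither Authors nor Books needs a list-valued column: a book with 20 authors is just 20 rows in AuthorBook.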

And here's a favorite explanation of this stuff : the gaytabase. (That version is not currently online, only this shorter version.)

https://cs.marlboro.college /cours /spring2020 /data /notes /apr30
last modified Fri January 24 2025 7:36 pm
