Data
Science

Spring 2020
course
site

April 30

recommender systems

overview of topics in the texbook :

Given data per user that lists what they like (say movie ratings), how should we pick suggested movies for a given individual?

SQL

How to organize in files complicated data collections?

Consider books, their authors, and their publishers.

This doesn't work well.

book    
----------
id,name,phone,publisher,auther1,auther2,

problems:

Instead of trying to put an author list into a book as we would do in a language like python

book = {name='This book', authors=['George', 'Alice']}

we user reference pointers (ids) to refer to other "things" (objects) in our database.

Here's an example.

Publishers
------------
id, name, address, phone, ...
1, Bob's Books, ...
2, Mary's Magazines, ...

Authors
-------
id, name, address, phone, ...
1, John Smith, ...
2, Jane Doe, ....

Books
-----
id, name, ISBN, ...
1, A Field Guide to Benches, 1112223334
2, Rocket Science Made Easy, 9879879876

So far so good : no replication of information, no ever-expanding columns.

But now we need to make the relations. There are several types

Since each book only has a single publisher, we can put a single pointer in the book database to that publisher's id.

Books
-----
id, name, ISBN, publisher_id
1, A Field Guide to Benches, 1112223334, 2     
2, Rocket Science Made Easy, 9879879876, 1

Many to many is the tricky one. We make a new table with an entry for each connection. Sometimes there's an obvious name for this table; sometimes not.

AuthorBook
-----------
id, author_id, book_id
1, 1, 1
2, 2, 1
3, 1, 2

The "id" in that last table is the unique identity of the connection ... which we may never need. (Unless we start making changes.)

Quick quiz: who are the authors of which books?

And here's favorite explanation of this stuff : the gaytabase (That version is not currently online, only this shorter version.)

https://cs.marlboro.college /cours /spring2020 /data /notes /apr30
last modified Tue April 16 2024 8:43 pm