Database Design Course by Caleb Curry
Introduction to Database Design
Situation: There are notes from an 8 hour course on YouTube.
TLDR
- What is the relationship between entities (tables)? Physically draw out the lines and relationships (cardinality)
- one-to-one
- one-to-many (‘many’ side is the Foreign Key)
- many-to-many (break into two one-to-many relationships w/ intermediary table)
- Do we need Lookup Tables?
- Design Data Tables for Integrity
- Entity Integrity - ID for uniqueness
- Ensure Atomic Values (Natural Keys, if cannot, then add surrogate keys)
- Referential Integrity - connect tables between Foreign Keys to Primary Keys
- Domain Integrity - identify data types of each variable (i.e., numeric, string, date)
- No repeating data
- Entity Integrity - ID for uniqueness
- Identify which foreign key is NOT NULL (surrogate id will Auto-Increment) (modality)
- Normalize the data
- 1 NF (first normal form) - atomicity
- 2 NF (second normal form) - partial depenency
- 3 NF (third normal form) - transitive dependency (solution: take problematic columns and split into their own tables with foreign key)
- Foreign Key Constraints; SQL statements:
- ON DELETE
- RESTRICT
- CASCADE
- SET NULL
- ON UPDATE
- RESTRICT
- CASCADE
- SET NULL
- ON DELETE
- Deteremine which JOIN is needed to get the best “view”; which table goes after the FROM statement (which table is on the left?)
- INNER JOIN
- LEFT JOIN
- SELF JOIN
What is a Relational Database?
- Entity (Rows)
- Entity = User (a person, an object)
- Row = all attribute values for an entity
- Attribute (Columns)
- Attributes are about an entity (user name, name, password, address etc)
- Mathematical name for a Row is a Tuple
RDBMS
- View Mechanism changes the how data is presented (i.e., we don’t want all rows x columns, just a subset)
- Select only certain columns
- View = Read Only (not everyone has access to Update for security purpose)
- RDBMS allows “transactions”
- Example: MySQL, SQL Server, PostgreSQL (open source)
- Database & Relational Database are not separate things
SQL
- Define structure (DDL, data definition language)
- CREATE,
- structure & connection between tables - JOIN
- Manipulates data (DML, data manipulation language) -
- UPDATE
- data within tables
What is Database Design?
- Separate information into multiple tables, while preventing data integrity issues
- How do you measure whether a database is “good” or “bad”? –> Data Integrity
- Good design prevents “data integrity issues”
- All data up to date
- No repeating data
- No incorrect data
- No broken relationships
- Conceptual Schema: How data is related.
- Logical Schema: Table structures (i.e., X number of columns, data types), number of tables
- Physical Schema: Implementing into database, table types, what server? how will people access?
Data Integrity
- Entity integrity
- ID is used to enforce uniqueness of an entity (user)
- Referential integrity
- Foreign key constrains
- Domain integrity
- Range of what we’re storing (correct Data Types; integers, text or dates)
- Note: Relational Database does not come from the word “relationship”, it comes from Relations which is a mathematical connection between Sets
- When we don’t have data integrity, we have errors
- When we have errors, data integrity allows us to correct those errors
Database Terms (Review)
- Data
- Database
- Relational Database (stores things in tables)
- DBMS (how to control database)
- RDBMS
- Null (when someone doesn’t enter a value)
- Anomolies (errors)
- Integrity (protect against anomolies)
- Entity
- Referential - keep connection through Foreign Keys and Primary Keys
- Domain - correct data types
- Entity - what we store
- Attributes*** - things about an entity
- Relations* - connection between two sets or Tables
- Tuple** or Row (all attributes about an entity)
- Table* - physical representation of
- Rows**
- Columns***
- File* (aka Table)
- Record** (aka Row)
- Field*** (aka Column)
- Value (something in a column)
- Entry (aka a Row)
- DB Design - process of designing table to have integrity
- Schema - structure of tables
- Normalization - steps to get the best data base design
- Naming Convention - consistency in naming
- Keys - something to make things in unique in a table and connect between tables
More Database Terms
- SQL
- DDL data define language
- DML data manipulation language
- SQL Keywords - reserved words (e.g., SELECT)
- Frontend - we program frontends so people can securely access the database (doesn’t allow us to type in SQL)
- Backend - serverside code to communicate with database
- Client side
- Server side - serves instances of the database to the client
- Server side scripting language
- Views - taking data from database and illustrating it in a different from how it’s stored
- Joins - connect data from multiple tables
- ID foreign key connection
Atomic Values
- Everything in a database should be about 1 thing
- Example: “Paul Apivat Hanvongse” – to make it atomic, create 3 separate columns (i.e., nickname, first name, last name)
- Atom - smallest indivisible piece (1 thing) but still makes sense to treat as 1 thing
- example: Address - street, city, state, zip code (3 columns)
Relationships
- Relationship - connects two or more entities
Example: (Entities) Database –> Student - Attribute –> Professor - Attribute –> Class - Attribute –> Major - Attribute
There’s multiple relationships here; Student have a Major, Students are in a Class, Professors are part of a Major, Professors teach a Class.
One-to-One
- One Entity has connection with another Entity (e.g., Husband - Wife)
- Social security number unique to one person
One-to-Many
- Comments under a Youtube video;
User –> Comment 1 –> Comment 2 –> Comment 3
Many-to-Many
- Polygamous marriage (i.e., Multiple husbands have multiple wives)
- College: Class & Students; class can have multiple students & students can take multiple classes
Designing One-to-One Relationships
- Example: Person and Username
- Generally One-to-one relations are stored in the same table
ID | name | user_name |
---|---|---|
1 | Apivat | Paul |
2 | Caleb | Caleb_Curry |
- There are times when one-to-one relations are stored in different tables (when you want to store extra attributes about the attribute)
Cardholder Entity
| ID |
|————|
| first_name |
| last_name |
| card_id |
(card_id <–> ID )
Card | ID | |————- | | card_number | | issue_date | | late_fees | | max |
Summary - One-to-One Relationship
- Store attribute of the entity in the table
- OR use another table and connect with a foreign_key
Designing One-to-Many Relationships
- The “many” side is a foreign key to the “one” side
- User id stays the same.
User |
---|
user_id |
(multiple cards)
Card |
---|
card_id |
user_id |
Card2 |
---|
card2_id |
user_id |
Card3 |
---|
card3_id |
user_id |
Parent Tables and Child Tables
- Tables are either Parent or Child
- Keys keep tables Unique
- Primary key = Parent (User ID)
- Foreign key = Child (user_id as a reference to User ID)
- Child points back to Parents
- In one-to-one, we don’t have to worry about parent or child
- In one-to-many, many Children point to a Parent
- When we have a Child table, we always know the Parent (but not vice versa)
Notation One to One –> 1 : 1 One to Many –> 1 : N Many to Many –> M : N
Designing Many-to-Many Relationships
- M : N
- Classes : Students
- Parent <–> Parent
- Solution: Break it up to TWO One-to-Many Relationships with an INTERMEDIARY or JUNCTION table to connect
ID | Class |
---|---|
63 | math |
75 | science |
89 | english |
ID | Student |
---|---|
8 | John |
17 | Jake |
16 | Sally |
6 | Claire |
Intermediary Table (Child Table for both Parents)
class_id | student_id |
---|---|
75 | 8 |
89 | 8 |
63 | 17 |
75 | 17 |
89 | 17 |
89 | 16 |
63 | 6 |
89 | 6 |
Summary of Relationships
- Now we can design every “binary relationship” - any relationship between two Entities
Introduction to Keys
- Keys should be Unique
- Never Changing
- Never NULL
- What should be unique? (e.g., a user e-mail)
- Could be a Natural Key (already in the table, no need to define a new column)
- Could be user name.
- Key should be Never Changing (otherwise database integrity is compromised)
Primary Key Index
- Index - think Index in a Book;
- Index points you to the data
- Keys are a type of Index
- Indexs are used for
- SELECT * FROM
- WHERE first_name = ‘Caleb’ (need index for this)
Look Up Table
Example look-up table of member status:
- member_status id is the foreign key that can point to a members table
- all connections stay the same even if member_status changes
- can set Foreign Key constraints
id | member_status |
---|---|
1 | gold |
2 | silver |
3 | bronze |
4 | first_quest_complete |
5 | guess_pass |
6 | level_1 |
7 | level_2 |
8 | level_3 |
(where member_status & student_id are One-to-Many)
id | student | member_status_id |
---|---|---|
8 | John | 3 |
17 | Jake | 3 |
16 | Sally | 6 |
6 | Claire | 7 |
(where member_status & student_id are Many-to-Many)
student_id | member_status_id |
---|---|
8 | 3 |
8 | 4 |
17 | 6 |
17 | 7 |
17 | 8 |
16 | 1 |
6 | 4 |
6 | 5 |
Lookup Tables (w Keys) allow for:
- Integrity
- Uniqueness
- Improves functionality (no repeating data)
- Less work
- Allows for added complexity
Superkey and Candidate Key
- Two main kinds of keys:
- Primary Key
- Foreign Key
Superkey
- Any number of columns that forces each row to be unique
- How do we know each row is unique and talks about one entity (user)?
- Superkey = any number column values to force that each row is unique
- Candidate key = the least number of column to force every row to be unique (ie., username)
You’ll never program a superkey
Candidate Key
- Superkey is asking: “Can every row be unique?”
- Then, once we answer yes, Candidate key asks: “How many columns are needed (to force every row to be unique) ?” – what’s the least number of columns
- Then, “How many Candidate Keys do we have?”
- THEN, decide which Candidate Key will be the PRIMARY KEY.
Primary Key & Alternate Keys Primary Key Possibilities
- username*
- full name, last name, middle name, address, birthday
*out of these three, username is the best primary key
What are the criteria for choosing a Primary Key?
- UNIQUE
- NEVER CHANGING
- NEVER NULL
Primary keys can also be an Index (use Select statement and how you connect most of your data).
Keys not chosen to be the Primary Key become the Alternate Keys. Alternate Keys can be useful - you could use “email” (an alternate key) as an Index.
SELECT * FROM table WHERE email = “value@gmail.com”
Surrogate Keys and Natural Keys
- These are “categories” of Primary Keys
- You won’t search for these, but you’ll Design the Database with these in mind
- Natural Keys are NATURALLY in the table; they fit requirements for Primary Keys and already in the table (email, username)
- Surrogate Keys are ADDED to your table (i.e., id); they AUTO-INCREMENT
- user –> user_id
- sale –> sale_id
- comment –> comment_id
Should I use Surrogate Keys or Natural Keys?
- Natural keys = already there, but not always obvious which should e natural key
- Surrogate keys = easy, but you have to add a new column
- Choose one or the other
- minor performance differences
- Caleb personally uses Surrogate Key
- Example: user –> user_id
Foreign Key
- Foreign Key References a Primary Key (either same table or separate table)
- Every table has ONE Primary Key (could be composed of many columns)
- Every table can have MULTIPLE Foreign Keys referencing many other tables
Not NULL Foreign Keys
- Foreign Key constraints
- Every row is required to have a value if the column has “not Null”
- related to Cardinality
- (NOTE: sometimes you don’t want to set “Not NULL” because an id doesn’t currently exist; but sometimes you want to force that relationship to be there between Foreign Key and Primary Key)
- you either want Not NULL or Not Required (depending on the situation)
- Primary Key values should never change, Foreign Key values can change
- We don’t want Primary Key values to change, but we could have Foreign Key references change (?)
- Primary Key and Foreign Key should be the same data type
Foreign Key Constraints
- FK = Referential Integrity
- Make sure if you update Parent, ZChildren will update
- Prevent creating Children if there’s no Parent
SQL Statements that talk about FK constraints and refer to the Parent (Primary Key)
- ON DELETE = when we delete the Parent, we want to delete the Child
- RESTRICT = (no action) throw an error whenever the Parent is deleted
- CASCADE = whatever we do to Parent, we do to Child (delete parent, delete child)
- SET NULL = if delete Parent, sets Child to NULL
- ON UPDATE = when we update the Parent, we want to update the Child
- RESTRICT = (no action) throw an error when try to Update Parent
- CASCADE = If we update Parent, Child updates as well
- SET NULL = if update Parents, thro
- Every Foreign Key column value needs to reference a Primary Key value
Simple Key, Composite Key, Compound Key
- categories of Keys
- Simple Primary Keys - single column (e.g., username)
- Composite Primary Key has multiple columns, as a group as Primary Key; at least one column doesn’t have to be a key
- Compound Primary Key - combination of columns in intermediary tables (in many-to-many relations); Primary Keys are compounded; all columns have to be a key
For Intermediary Tables some people will
- add a surrogate_id to the Compound
REVIEW
- Superkey
- Candidate Key - least number of columns used to enforce uniqueness
- Primary Key** - the candidate key you select as the Main key
- Alternate Keys - the candidate keys you didn’t select as Primary Key
- Foreign Keys** - make connection between tables; references Primary Keys
Primary + Foreign
- Surrogate & Natural Keys - surrogate (user_id) is random with no real value; Natural is already contained in your database
- (don’t switch between these two)
- Rule: You should be able to enforce uniqueness by the columns that are Naturally already there, add a surrogate key if you want
- Rule 2: If you cannot define uniqueness naturally, you’ll need to rely on a Surrogate key (try to avoid)
(can switch between Simple + Composites - you won’t define these explicitly) 6. Simple Key - one column key 7. Composite Key - multiple column keys 8. Compound Key - multiple column keys
Foreign Key Constraints
- ON DELETE
- RESTRICT
- CASCADE
- SET NULL
- ON UPDATE
- RESTRICT
- CASCADE
- SET NULL
Introduction to Entity Relationship Modeling
- A standard for Drawing Databases
- EER Model (Enhanced Entity Relationship Model)
- ERD (Enhanced Relationship Diagram)
- ER Model (Enhanced Relationship)
- DDL: Define Database Structure
{do actual drawing}
Cardinality
- one to one: |——-|
- one to many: |———–E
- one row to many rows
- many to one: E———|
Modality
- Can assign “NOT NULL” to Foreign Key, to say each card must have an owner
card_holder (Primary Key) | card Foreign Key | card_number |
---|---|---|
7 | 7 | 12 |
12 | 7 | 48 |
368 | 7 | 98 |
368 | 112 |
4 Modalities card_owner ——– card
- -|———0–|- one-to-one; one card_owner can have zero or 1 card
- -|———|–|- one-to-one; one card_owner must have 1 and only 1 card
- -|———0–E- one-to-many; one card_owner can have 0 or many cards
- -|———|–E- one-to-many; one card_owner must hv 1 or many cards
Introduction to Database Normalization
- Normalization is a process where we go through our database and correct things that may cause database problems like
- data integrity problems
- repeating data
Third Main Forms (three step-by-step data normalization process)
- 1 NF (first normal form)
- 2 NF (second normal form)
- 3 NF (third normal form)
Systematic way to normalize a good structured database
- everything must be atomic
- think about how data depends on other data (dependencies)
- must go in sequential order 1 NF –> 2 NF –> 3 NF
1 NF (First Normal Form of Database Normalization)
- Atomicity (data must be atomic or about one thing)
Problem: “Address column” is not atomic (street, apt, city, country, zip) Solution: Break this column into multiple columns
Problem: User enters two email(s) as a single value (two values in one cell) Still Problem: Generate same two users for two emails (two of same primary keys)
Solution: Break into two tables - User table & Email table - turn this into a one-to-many (one user to many emails)
User user_id (primary key) first_name last_name
Email email_id email user_id (foreign key)
2 NF (Second Normal Form of Database Normalization)
- Partial Dependency (when a column depends on part of a Primary Key)
- You need ot have a compound or composite key (Primary Key has to be multiple columns)
- Found in Many-to-Many relationships w/ Intermediary Tables
{see notebook for illustration}
3 NF (Third Normal Form of Database Normalization)
- must do 1st and 2nd Normal Form before getting to 3 NF
- Transitive Dependency (when a column depends on a column, which depends on a Primary Key)
- Solution: Take a transitive dependency (problematic column), move them to a new table and reference them with a foreign key
Summary of Normal Forms
- 1 NF = making everything atomic
- 2 NF = removing partial dependencies
- 3 NF = removing transitive dependencies
Indexes (Clustered, Nonclustered, Composite Index)
- See a book’s index (or Phone Book)
- A list of where certain data points are
- Data is sorted in a way that can easily be found
- Nonclustered Index = tells you how to get to the data (book’s index)
- points to the data
- a list of references that point to the data (like back of the book)
- Clustered Index = organizes the actual data in a way that’s easy to use
- organizes the actual data
- faster and better than non-clustered
Databases
- Rather than having to go through “all” the data (i.e., Table Scan), you create an index (index seek, makes queries faster)
- Downside: When you update the data, you have to update the index as well otherwise index becomes outdated and useless
- You only want to create an index for frequently used data
- Primary Key that is indexed makes it faster
- Apply WHERE (sql query) column to an Indexed column
- Index increase speed of JOINS
- Whenever you’re JOINing certain columns, the two columns you’re joining should be Indexed
Data Types
- Date
- date
- time
- datetime
- timestamp (can be millisecond or time, when something was “done” or when something was “created” or “updated” or when a new row is entered)
- Numeric
- integer (only whole numbers)
- decimal (more accurate)
- float / double (unsigned)
- binary
- String
- Char(8),
- varchar(8) 0 up to 8 characters
- text - longer strings (comments, paragraphs)
Everything above here is DDL - data definition language
Below here are DML - data manipulation language**
Introduction to JOINS
- Joins bring multiple tables into a presentable format
Inner Join
- Table A (customer table)
- Table B (card table)
- When there are rows that connect them, a new Table is presented
- Taking only the rows that intersect between two tables
- Eliminate any customers that do not have a card
- Eliminate any cards that do not have a customer
- Exclude rows that are not in both tables
Example:
Customer Table
customer (Primary Key) |
---|
customer_id |
first_name |
last_name |
Card Table
card (Foreign Key) |
---|
card_id |
customer_id |
max_amount |
monthly_bill |
amount_paid |
amount_owed |
Example Inner Join of Customer Table & Card Table
first_name | last_name | amount_paid | amount_owed |
---|---|---|---|
Paul | Apivat | 2200 | 3000 |
Paul | Apivat | 720 | 1000 |
Jimmy | John | 3000 | 5000 |
SELECT first_name, last_name, amount_paid, amount_owed FROM customer cu INNER JOIN card ca ON cu.customer_id = ca.customer_id
INNER JOIN on 3 Tables
See illustration
INNER JOIN on 3 Tables (with Example)
See illustration
Introduction to OUTER JOINS
- INNER JOIN - least rows
- LEFT (Outer) JOIN
- RIGHT (Outer) JOIN
- FULL (Outer) JOIN - most rows
RIGHT OUTER JOIN
-
same as Left Outer Join, except the Right Table keeps all the rows
-
IN PRACTICE, most people don’t use Right Joins; instead they’ll use a LEFT JOIN and just flip the tables
-
How do you know which table is LEFT or RIGHT?
SELECT column1, column2, column3 FROM this_table_is_Left LEFT JOIN….
JOIN with NOT NULL Columns
- Not Null columns can cause some confusion when it comes to Joins
- If you want to return all of a column, don’t worry about NOT NULL, just put that table in the SELECT statement (i.e., make the table on the left side) and use a LEFT JOIN
{see illustrations}
Outer JOIN Across 3 Tables Can combine
- LEFT (outer) JOIN with
- RIGHT (outer) JOIN
Aliases
- use when writing SELECT statements
- use AS
SELF JOIN
- take a Table and JOINING with itself
- How to think about it:
- Duplicate your exact table
- Joining with itself
Illustration
v1 |
---|
user |
referred_by |
v2 |
---|
user_id |
SELECT v1.first_name, v1.last_name, v1.email, v2.email FROM user AS v1 JOIN user AS v2 ON v1.referred_by = v2.user_id
(Self-Join is taking the same table, “user”, and making a duplicate)
For more content on data science, R, and Python find me on Twitter.