Introduction to Databases¶

Database definition and types¶

You have probably heard the term database before, and already have some preconception of what a database is, am I right? Just as computers, databases are a tool that can be used for many purposes, and they are everywhere! Formally, a database is a collection of data that is organized in a way that facilitates access, retrieval, and use of that data. We have already discussed how data is important in the world of computers, and how it is used to make decisions. In fact, many believe that data is the new oil, and that it is the most valuable resource in the world. Why is that? Computer programs and algorithms are used to analyze data, and help us make decisions based on that data, making organisations more efficient and profitable to the point that no organisation today conceives operating without data. In fact, data is so significant that it is the basis of the new economy, the so-called data economy, or digital economy. Back in the 70s, if one looked at the 10 biggest companies in the world by market cap, 7 of them were oil companies. Today, if one looks at the 10 biggest companies in the world, 8 of them are either key technology companies sustaining the digital economy, or companies that are using data as their core business asset.

So, databases are one of the pillars of our economy and our society.

Database types¶

There are many different types of databases, but the main classification normally used is between relational and non-relational databases:

Relational databases are the most common type of database, and are used in many different applications. They are based on the relational model, which is a mathematical model for data, and is the basis for the SQL language. The relational model is based on the concept of data structured on a table, which is a collection of data items (called fields) organized in rows and columns. The rows represent individual records, and the columns represent the attributes of the data. Besides having data organised in different tables, a relational database is going to create relations between the tables, which are going to be used to retrieve data from multiple tables at the same time. The most common relational database is the MySQL database, which is the one we are going to use in this course.
Non-relational databases are also called NoSQL databases, and are also used in many different applications. They are based on the document model, which is another mathematical model for data, and is the basis for the MongoDB language. The document model is based on the concept of data structured on a document, which is a collection of data items (called fields) organized in a JSON format. The most common non-relational database is the MongoDB database, which unfortunately is for now out of the scope of this course.

Key Concepts of database design¶

Ideally a database should minimize:

Data Redundancy: Same information element is duplicated in different records or files
Data Isolation: How often a change in one record triggers a change in other records
Data Inconsistency: When the same information element can be obtained from different equivalent methods from records and these methods do not provide the same information.

Also, ideally, a database should maximize:

Data Security: Protection against cyber-security threats.
Data Integrity: Accuracy (that information has the right level of detail), completeness (that there is no critical information missing) and consistency (opposite to data inconsistency) of data
Data Independence: How good the model scales (applications do not require major changes if new information elements are added to the database)

Relational model¶

Ok, so let´s dive into the relational model, the mathematical model for data mentioned above. The main idea behind the relational model is to group together the data of interest of the same type of object or concept of the real world. For instance, let´s think of an application that tracks payments for products made by customers. In the real world, products, customers, and payments are different types of concepts or objects: Products are goods or services we offer, customers are persons or organisations, and payments are bank transactions. In a relational model, each of these real-world concepts will be modeled by a different entity, which is by definition a type of object or concept described in our database.

Basic relational model

In a relational database, the data of each entity is going to be stored in a different table, organized in columns, and rows. Each column is going to contain the data of a property of interest of the objects modeled by the corresponding entity type. For instance, for the entity Customer, we may be interested in storing basic information for invoicing, like the VAT number, address, and basic contact information, like the phone number, email address and contact mail. Each of these properties of the Customer entity are known as attributes, and are going to be organized in different columns (also known as fields) of the Customer table. Now, each of the customers in the Customer entity database is known as an instance, and is going to compose a row in the customer table in the database.

Basic model entity

Notice that the VAT ID is unique for each customer. Therefore, we can use the VAT number to uniquely identify each customer in the database. The attributes which allow us to uniquely identify instances in a relational model are known as identifiers or key attributes.

Notice also that there are relationships between the real-world concepts that we are trying to model. A payment is made by a customer, for a product, so there is a relationship between a payment and a product and a relationship between a product and a payment. These relationships are for now represented as arrows, although in the sections below we will learn a better way to represent relationships.

Terminology¶

As a summary, let’s establish the key terminology used above:

An entity represents a type of real-world object or concept from the miniworld that is described in the database.
An attribute represents some property of interest that further describes an entity. These properties define the entity’s characteristics.
Identifiers are attributes that uniquely identify entity instances (key attributes).
A relationship among two or more entities represents an association among the entities.

Also, remember that in databases, data is organised in tables and each table contains the data of an entity. Data corresponding to an instance is organised in a row or record, and each record is divided into columns or fields so that the data of any attribute are stored in the same column.

hierarchy of data

Basic types of keys¶

Now, let´s continue with the basic example above, let us assume that for products, we are only interested in storing the unique identifier of the product, the name of the product, the unitary price, and for payments, we are just interested in storing the payment date, and a reference to the product the payment was made for and a reference to the customer that made the payment. How would you effectively think how would we effectively refer to a customer? If you thought of using the VAT ID (primary key), you are already have the right mindset to create data models! Yes, in practice, we will use a field to store the references to the (foreign) customer entity instances. These type of attributes that referenced instances of other entities are known as foreign keys. Our example would look something like this:

entities in basic model

Primary keys: A primary key is a column or a set of columns in a table whose values uniquely identify a row in the table. A relational database is designed to enforce the uniqueness of primary keys by allowing only one row with a given primary key value in a table.
Foreign keys: A foreign key is a column or a set of columns in a table whose values correspond to the values of the primary key in another table. In order to add a row with a given foreign key value, there must exist a row in the related table with the same primary key value.

Properties of Relationships¶

Cardinality Ratio¶

The maximum cardinality (cardinality ratio) is the maximum number of entity instances that can participate in a relationship. For instance, in the example above, one Payment can only be made by exactly one customer, so the maximum number of customers that can participate in the relationship is one. However, a recurring customer can have many payments, so in general, the maximum number of payments can be noted as N (to indicate that is an undetermined number). Therefore, the cardinality ratio of the relationship between customers and payments is one-to-many and is noted as [1:N]. Extending these concepts to any type of relationships yields:

One-to-One [1:1]: A record in Table A relates to at most one record in Table B, and a record in table B relates to at most record in Table B. This type of relationship is represented as A+--+B
One-to-Many [1:N]: a record in Table A can relate to many records in Table B, and a record in table B relates to at most one record in Table A. This type of relationship is represented as A+---<B
Many-to-Many [M:N]: Each record in table A can relate to many records in table B and each record in table B can relate to more than one record in table A. Many-to-Many relationships are represented as: A>---<B.

Using the symbols, above, our model is now:

Participation¶

The minimum cardinality, or participation, is the minimum number of entity instances that must participate in a relationship. The participation can be mandatory or optional. If mandatory, it means that at least 1 entity must participate in the relationship. For instance, the participation of customer in the relationship Customer>---+Payment is mandatory, because 1 payment must be made by one customer (and only one). However, the participaton of Payment may be optional, if, for instance we could also have potential customers in our customer table. We will use symbols under each end of the relationship as:

If 1 (symbol |), participation in the relationship by the entity is mandatory, i.e. at least one entity instance must participate in the relationship.
If 0 (symbol 0), participation in the relationship by the entity is optional, i.e. no entity instance must participate in the relationship.

To make sure you got this right, let us practice with the following relationships!

Practicing

Join Clauses¶

Join clauses are used to combine data from different tables. More specifically, Join clauses combine data from two tables, the left table and the right table, on a condition (a boolean expression that needs to be met). Normally, the condition is based on the values of one or more columns that are common to both tables, and that represent the relationship between the two tables. Also, normally, the columns used in the condition are primary keys in one table and foreign keys in the other table.

There are four different types of join clauses, described below, and depicted in this figure:

Types of join clauses

Basically, we need to bear in mind that not all rows in both tables are going to meet the condition. The database engine is going to evaluate the condition and then return the data in either table that meets the condition depending on the type of join:

Inner join: To get only the data that meets the condition.
Left join: To get all data from the left table and the data of the right table that meets the condition.
Right join: Opposite of the left join. Normally, the use of left join is preferred.
Outer join: Opposite of inner join, gets all data from both tables. It is seldom used in practice.