Database introduction 1 - background info

Content:

  1. Introduction
  2. Database types
  3. Database products
  4. Relational databases
    1. History
    2. Terminology
    3. Normalization
    4. Object-relational impedance mismatch
  5. Database API's
    1. SQL
    2. Database specific API's/Libraries for relational databases
    3. Embedded SQL
    4. Standardized API's
    5. Database specific API's for NoSQL databases
    6. ORM's
  6. Topologies

Introduction:

Most applications utilize some form of database to persist data in.

Knowledge about databases is therefore essential for a software developer.

This article targets beginners and attempt to provide them with a basic understanding of database types, products and terminology with the intent to provide the necessary background information to work with databases.

If you want to skip the background information and go straigth to the "HowTo", then got to this article.

Database types

Databases are often categorized based on their storage model:

Database type Characteristica Usage
Hierarchical database Data are organized in a tree like structure (think org chart) The standard database type from mid 1960's to mid 1980's and still used in the mainframe world
Relational database (RDBMS) Data are organized as sets (see below for details) The standard database type from mid 1980's
Object oriented database (OODBMS) Data are stored as an object model Never really caugth on even though OOP became the standard for programming
Key value store Data are stored as keys and values Widely used on mainframe and mini computers in 60's, 70's and 80's - returned in the late 00's as part of the NoSQL movement
Document store Data are stored as large documents (usually in XML or JSON format) Became popular as part of the NoSQL movement
Column store Data are stored as keys, values and timestamps Became popular as part of the NoSQL movement
NoSQL databases Different definitions exist:
  • Databases that does not use SQL interface
  • Non-relational databases
  • Group of key value store, document store, column store, object oriented database and a few more
Became popular in internet startups and similar in the late 00's
NewSQL databases Relational databases build on NoSQL technology Concept showed up in early 10's but has not gained wide popularity yet
Graph databases Data consist of nodes and connection between nodes Concept showed up in late 00's and has gained some popularity for advanced analytics

What database type is best depends a lot on the problem to solve.

My rules of thumb would be:

if data is large (> 250 TB) then chose a column store
if data are unstructured (different structure for each piece of data) XML or JSON then chose a document store
if data only have one key, need for queries is very limited and performance is critical then chose key value store
for all other cases chose a relational database

Databases can also be categorized based on how the application communicate with it:

Database server
Application communicate with database server process via network protocol. The database server process may run on the same system as the application or on another system.
Embedded database
Application communicate with database software using normal function/method calls. The database code is in the same process as the application code and therefore also on the same system.

The advantages of a database server is scalability:

The advantages of an embedded database is less work:

Database products

Some of the more popular database products are:

Database type Popular products
Hierarchical database Database servers:
  • IBM IMS
Relational database (RDBMS) Database servers:
  • Oracle DB
  • IBM DB2
  • Sybase ASE
  • MS SQLServer
  • IBM Informix
  • MySQL/MariaDB
  • PostgreSQL
Embedded databases:
  • FireBird
  • SQLite
  • MS Access
  • HSQLDB
  • H2
Object oriented database (OODBMS) Database servers:
  • Intersystems Cache
Embedded databases:
  • db4o
Key value store Database servers:
  • Oracle NoSQL
  • Amazon DynamoDB
  • Couchbase
  • Riak
  • Voldemort
Embedded databases:
  • DBM
  • BDB
  • LevelDB
  • RocksDB
Document store Database servers:
  • MongoDB
  • Apache CouchDB
  • OrientDB
Embedded databases:
  • Nitrite
  • LiteDB
Column store Database servers:
  • Google BigTable
  • Apache HBase
  • Cassandra
NoSQL databases See above
NewSQL databases Database servers:
  • Google Spanner
  • SAP HANA
  • VoltDB
Graph databases Database servers:
  • Neo4j
  • OrientDB

I will not try try and evaluate what products are good for what purpose.

You will need to evaluate your needs with what the different databases offers regarding:

Relational databases

Relational databases are still the most widely used databases and furthermore there is a tendency to measure non-relational databases against relational database, so it is worth looking a little bit into relational databases.

History

The relational model was invented by Edgar Codd in 1969/1970.

It was later advocated by Chris Date, whose book "An Introduction to Database Systems" can be considered the bible of relational databases (and note that the term "introduction" in the title should not be take literally - it is a book at a relative high level).

Relational database became popular during the 1980's when products like IBM DB2, Oracle DB, DEC RDB, Sybase and Informix arrived in the market.

Terminology

There are many confusing aspects of relational database theory including the terms used for the most basic concepts:

Relational database theory term Standard database term Application term
Relation Table Record set
Attribute Column Field
Tuple Row Record

The different key types are:

Candidate key(CK)
Column or combination of columns that can uniquely identify a row
Primary key (PK)
The candidate key chosen for uniquely identifing a row
Alternate key (AK)
A candidate key not chosen
Foreign key (FK)
A column in one table referring to the primary key in another table

The different relation types are:

1:1 relation / one-to-one relation
each row in table1 has a relation to only one row in table2 and each row in table2 has a relation to only one row in table1
1:M relation / one-to-many relation
each row in table1 has a relation to many rows in table2 and each row in table2 has a relation to only one row in table1
M:M relation / many-to-many relation
each row in table1 has a relation to many rows in table2 and each row in table2 has a relation to many rows in table1

Joining is combing two tables from a relation. The different join types are:

JOIN or INNER JOIN
only combinations that have rows in both tables
LEFT JOIN or LEFT OUTER JOIN
all rows from left table combined with row from right table if exists otherwise filled with NULL values
RIGHT JOIN or RIGHT OUTER JOIN
all rows from right table combined with row from left table if exists otherwise filled with NULL values

Normalization

Normalization is the process of optimizing database tables to renmove any redundancy in the data.

There are different levels of normalization:

Abbreviation Name Semantics Achieved by
1NF 1st Normal Form All values are atomic - not a list of multiple values Move the list to a separate table with a FK to the original table
2NF 2nd Normal Form 1NF + all fields not part of any CK depends on all fields in any CK - not on some of the fields Move those fields to separate tables
3NF 3rd Normal Form 2NF + no field depends on any field not part of a CK Move those fields to separate tables
BCNF Boyce Codd Normal Form
4NF 4th Normal Form
5NF 5th Normal Form

Object-relational impedance mismatch

It should be obvious by now that a relational database model and an object model used in an application is not a 1:1 match.

This is known as the object-relational impedance mismatch.

Some consider this a major problem for the mix of object oriented programming and relational databases. But I don't see it as a major problem. Due to both OOP and RDBMS being very popular then methodologies and tools has been developed that mitigate most of the problems.

Database API's

SQL

All relational databases today supports SQL (Structured Query Language).

Its history goes back to 1974.

It has been standardized by ANSI and ISO in 1986, 1992, 1999, 2003, 2006, 2008 and 2011.

Note that unlike traditional programming languages then the new features are rarely used. Most SQL usage just require SQL 1992.

Database specific API's/Libraries for relational databases

These API's are typical using SQL.

Common API's/Libraries are:

Database API/Library Language
Sybase ASE DBLIB C
Sybase ASE CTLIB C
MS SQLServer DBLIB C
MySQL MySQL Connector/C aka libmysql C
MySQL MySQL Connector/C++ C++
PostgreSQL PostgresSQL C API aka libpq C
PostgreSQL PostgresSQL C++ API aka libpq++ C++
Oracle DB OCI C
Oracle DB OCCI C++
Oracle DB LIBOCI C and C++
MySQL mysql extension (NOTE: OBSOLETE) PHP
MySQL mysqli extension PHP
PostgreSQL pgsql extension PHP
MS SQLServer mssql extension PHP

I don't see much need to use these API's today. I would use a standardized API instead.

For examples of DBLIB see here.

For examples of MySQL Connector/C see here.

For examples of MySQL Connector/C++ see here.

For examples of PostgreSQL C API see here.

For example of LIBOCI see here.

For examples of mysql extension see here.

For examples of mysqli extension see here.

For examples of pgsql extension see here.

Embedded SQL

Embedded SQL was the primary way to access relational databases for many years and are still used in IBM mainframe, commercial Unix and OpenVMS environments.

The concept is that one write a program with:

and then a preproceser convert the embedded SQL statements to call to database specific API and then it is compiled.

Embedded SQL is not something I would consider if there are any alternatives.

For examples of embedded SQL (using C) see here.

Java has a special flavor of embedded SQL called SQLJ. For example see here.

Standardized API's

These API's are typical using SQL.

Common API's are:

API Language/Platform Driver/provider Notes
JDBC Java JDBC Driver There are four types of drivers - most important are type 2 that uses native libraries and type 4 that is pure Java
ODBC C (common on Windows, available on Unix/Linux, standard for all IBM DB2) ODBC Driver Note that when using ODBC on Windows it is often using OLE DB with the bridge to ODBC
OLE DB / ADO COM (C++, VB6 and VBS on Windows) OLE DB Provider / ADO Provider
ADO.NET .NET (C#, VB.NET) ADO.NET Provider
PDO PHP PDO Driver
Qt SQL C++ Qt SQL driver

For relational databases these should be the preferred API for all usage that does not fit well with ORM. This include operations on record sets instead of individual records - often in a DWH context.

For examples of JDBC see here.

For examples of ODBC see here.

For examples of ADO see here.

For examples of ADO.NET see here.

For examples of PDO see here.

For examples of Qt SQL see here.

Database specific API's for NoSQL databases

These are very different between different NoSQL databases.

The API typical consists of:

  1. Library API implemented in multiple languages
  2. RESTFul web service API

In my opinion these API's are often rather cumbersome to use. But if one need to use a NoSQL database, then one does not have any choice.

ORM's

The concept is to write:

and then the ORM handles the database access without the programmer needing to write any SQL.

ORM's are typical using an ORM specific query lanaguage for advanced queries.

Common ORM's are:

ORM Language/Platform Query language Mapping Notes
EJB Entity Bean CMP Java EE EJBQL XML Obsolete. Replaced by JPA.
Hibernate Java HQL XML
MyBatis Java SQL Convention Fork of iBatis
JPA Java JPQL Class annotations JPA is really an API. Multiple implementations exist: Hibernate, EclipseLink and OpenJPA.
LINQ to SQL .NET (C# and VB.NET) LINQ Class attributes From .NET 3.5. Obsolete. Replaced by EF.
EF .NET (C# and VB.NET) LINQ Class attributes From .NET 4.0.
NHibernate .NET (C# and VB.NET) HQL or LINQ XML Port of Hibernate.
Doctrine PHP DQL or SQL XML or class annotations (doc) Hibernate inspired.
Eloquent PHP SQL XML Part of Laravel.
Propel PHP generated API auto

For relational databases these should be the preferred API for CRUD operating on single records - usually in an OLTP context.

In theory ORM can be used for NoSQL databases as well, but it is very rare.

For examples of using Hibernate see here.

For examples of using MyBatis see here.

For examples of using LINQ to SQL see here.

For examples of using EF see here.

For examples of using NHibernate see here.

For examples of using Doctrine see here.

For examples of using Eloquent see here.

Topologies:

Active/passive cluster with replication:

Applications connect to the active database server.

The active database server replicate all updates to the passive database server.

If active database server goes down then then the passive database server becomes active and applications reconnect to it.

Model:

Replication model

After failure:

Replication model after failure

Active/passive cluster with shared storage:

Applications connect to the active database server.

If active database server goes down then then the passive database server becomes active and applications reconnect to it.

Due to the shared storage the passive database server is always uptodate.

Model:

Shared storage

After failure:

Shared storage after failure

Active/active cluster with shared storage:

Applications connect to a ny database server.

Due to the shared storage all database server are always uptodate.

Model:

Active/Active

After failure:

Active/Active after failure

Active/active cluster with sharding/partioning:

Each database server has a fraction of the data.

Applications access the database server that has the data they needs.

There is no redundancy - unless combined with one of the previous models.

Model:

Sharding

With redundancy added:

Sharding with redundancy

For some reason the term sharding is most common in OLTP context while the term partitioning is most common in DWH context.

Large scale distributed databases:

This is a rather complicated topology and only applicablae when dozens/hundreds/thousands of database servers.

Implementations differ sligthly but typical it works like:

Article history:

Version Date Description
1.0 October 17th 2016 Initial version based on multiple old articles on Eksperten.dk
1.1 December 22nd 2017 Add bunch of links to examples
1.2 March 6th 2022 Add section on topologies

Other articles:

See list of all articles here

Comments:

Please send comments to Arne Vajhøj