Most applications utilize some form of database to persist data in.
Knowledge about databases is therefore essential for a software developer.
This article targets beginners and attempt to provide them with a basic understanding of database types, products and terminology with the intent to provide the necessary background information to work with databases.
If you want to skip the background information and go straigth to the "HowTo", then got to this article.
Databases are often categorized based on their storage model:
Database type | Characteristica | Usage |
---|---|---|
Hierarchical database | Data are organized in a tree like structure (think org chart) | The standard database type from mid 1960's to mid 1980's and still used in the mainframe world |
Relational database (RDBMS) | Data are organized as sets (see below for details) | The standard database type from mid 1980's |
Object oriented database (OODBMS) | Data are stored as an object model | Never really caugth on even though OOP became the standard for programming |
Key value store | Data are stored as keys and values | Widely used on mainframe and mini computers in 60's, 70's and 80's - returned in the late 00's as part of the NoSQL movement |
Document store | Data are stored as large documents (usually in XML or JSON format) | Became popular as part of the NoSQL movement |
Column store | Data are stored as keys, values and timestamps | Became popular as part of the NoSQL movement |
NoSQL databases | Different definitions exist:
|
Became popular in internet startups and similar in the late 00's |
NewSQL databases | Relational databases build on NoSQL technology | Concept showed up in early 10's but has not gained wide popularity yet |
Graph databases | Data consist of nodes and connection between nodes | Concept showed up in late 00's and has gained some popularity for advanced analytics |
Databases can also be categorized based on how the application communicate with it:
The advantages of a database server is scalability:
The advantages of an embedded database is less work:
Some of the more popular database products are:
Database type | Popular products |
---|---|
Hierarchical database |
Database servers:
|
Relational database (RDBMS) |
Database servers:
|
Object oriented database (OODBMS) |
Database servers:
|
Key value store |
Database servers:
|
Document store |
Database servers:
|
Column store |
Database servers:
|
NoSQL databases | See above |
NewSQL databases |
Database servers:
|
Graph databases |
Database servers:
|
I will not try try and evaluate what products are good for what purpose.
You will need to evaluate your needs with what the different databases offers regarding:
Relational databases are still the most widely used databases and furthermore there is a tendency to measure non-relational databases against relational database, so it is worth looking a little bit into relational databases.
The relational model was invented by Edgar Codd in 1969/1970.
It was later advocated by Chris Date, whose book "An Introduction to Database Systems" can be considered the bible of relational databases (and note that the term "introduction" in the title should not be take literally - it is a book at a relative high level).
Relational database became popular during the 1980's when products like IBM DB2, Oracle DB, DEC RDB, Sybase and Informix arrived in the market.
There are many confusing aspects of relational database theory including the terms used for the most basic concepts:
Relational database theory term | Standard database term | Application term |
---|---|---|
Relation | Table | Record set |
Attribute | Column | Field |
Tuple | Row | Record |
The different key types are:
The different relation types are:
Joining is combing two tables from a relation. The different join types are:
Normalization is the process of optimizing database tables to renmove any redundancy in the data.
There are different levels of normalization:
Abbreviation | Name | Semantics | Achieved by |
---|---|---|---|
1NF | 1st Normal Form | All values are atomic - not a list of multiple values | Move the list to a separate table with a FK to the original table |
2NF | 2nd Normal Form | 1NF + all fields not part of any CK depends on all fields in any CK - not on some of the fields | Move those fields to separate tables |
3NF | 3rd Normal Form | 2NF + no field depends on any field not part of a CK | Move those fields to separate tables |
BCNF | Boyce Codd Normal Form | ||
4NF | 4th Normal Form | ||
5NF | 5th Normal Form |
It should be obvious by now that a relational database model and an object model used in an application is not a 1:1 match.
This is known as the object-relational impedance mismatch.
Some consider this a major problem for the mix of object oriented programming and relational databases. But I don't see it as a major problem. Due to both OOP and RDBMS being very popular then methodologies and tools has been developed that mitigate most of the problems.
All relational databases today supports SQL (Structured Query Language).
Its history goes back to 1974.
It has been standardized by ANSI and ISO in 1986, 1992, 1999, 2003, 2006, 2008 and 2011.
Note that unlike traditional programming languages then the new features are rarely used. Most SQL usage just require SQL 1992.
These API's are typical using SQL.
Common API's/Libraries are:
Database | API/Library | Language |
---|---|---|
Sybase ASE | DBLIB | C |
Sybase ASE | CTLIB | C |
MS SQLServer | DBLIB | C |
MySQL | MySQL Connector/C aka libmysql | C |
MySQL | MySQL Connector/C++ | C++ |
PostgreSQL | PostgresSQL C API aka libpq | C |
PostgreSQL | PostgresSQL C++ API aka libpq++ | C++ |
Oracle DB | OCI | C |
Oracle DB | OCCI | C++ |
Oracle DB | LIBOCI | C and C++ |
MySQL | mysql extension (NOTE: OBSOLETE) | PHP |
MySQL | mysqli extension | PHP |
PostgreSQL | pgsql extension | PHP |
MS SQLServer | mssql extension | PHP |
I don't see much need to use these API's today. I would use a standardized API instead.
For examples of DBLIB see here.
For examples of MySQL Connector/C see here.
For examples of MySQL Connector/C++ see here.
For examples of PostgreSQL C API see here.
For example of LIBOCI see here.
For examples of mysql extension see here.
For examples of mysqli extension see here.
For examples of pgsql extension see here.
Embedded SQL was the primary way to access relational databases for many years and are still used in IBM mainframe, commercial Unix and OpenVMS environments.
The concept is that one write a program with:
and then a preproceser convert the embedded SQL statements to call to database specific API and then it is compiled.
Embedded SQL is not something I would consider if there are any alternatives.
For examples of embedded SQL (using C) see here.
Java has a special flavor of embedded SQL called SQLJ. For example see here.
These API's are typical using SQL.
Common API's are:
API | Language/Platform | Driver/provider | Notes |
---|---|---|---|
JDBC | Java | JDBC Driver | There are four types of drivers - most important are type 2 that uses native libraries and type 4 that is pure Java |
ODBC | C (common on Windows, available on Unix/Linux, standard for all IBM DB2) | ODBC Driver | Note that when using ODBC on Windows it is often using OLE DB with the bridge to ODBC |
OLE DB / ADO | COM (C++, VB6 and VBS on Windows) | OLE DB Provider / ADO Provider | |
ADO.NET | .NET (C#, VB.NET) | ADO.NET Provider | |
PDO | PHP | PDO Driver | |
Qt SQL | C++ | Qt SQL driver |
For relational databases these should be the preferred API for all usage that does not fit well with ORM. This include operations on record sets instead of individual records - often in a DWH context.
For examples of JDBC see here.
For examples of ODBC see here.
For examples of ADO see here.
For examples of ADO.NET see here.
For examples of PDO see here.
For examples of Qt SQL see here.
These are very different between different NoSQL databases.
The API typical consists of:
In my opinion these API's are often rather cumbersome to use. But if one need to use a NoSQL database, then one does not have any choice.
The concept is to write:
and then the ORM handles the database access without the programmer needing to write any SQL.
ORM's are typical using an ORM specific query lanaguage for advanced queries.
Common ORM's are:
ORM | Language/Platform | Query language | Mapping | Notes |
---|---|---|---|---|
EJB Entity Bean CMP | Java EE | EJBQL | XML | Obsolete. Replaced by JPA. |
Hibernate | Java | HQL | XML | |
MyBatis | Java | SQL | Convention | Fork of iBatis |
JPA | Java | JPQL | Class annotations | JPA is really an API. Multiple implementations exist: Hibernate, EclipseLink and OpenJPA. |
LINQ to SQL | .NET (C# and VB.NET) | LINQ | Class attributes | From .NET 3.5. Obsolete. Replaced by EF. |
EF | .NET (C# and VB.NET) | LINQ | Class attributes | From .NET 4.0. |
NHibernate | .NET (C# and VB.NET) | HQL or LINQ | XML | Port of Hibernate. |
Doctrine | PHP | DQL or SQL | XML or class annotations (doc) | Hibernate inspired. |
Eloquent | PHP | SQL | XML | Part of Laravel. |
Propel | PHP | generated API | auto |
For relational databases these should be the preferred API for CRUD operating on single records - usually in an OLTP context.
In theory ORM can be used for NoSQL databases as well, but it is very rare.
For examples of using Hibernate see here.
For examples of using MyBatis see here.
For examples of using LINQ to SQL see here.
For examples of using EF see here.
For examples of using NHibernate see here.
For examples of using Doctrine see here.
For examples of using Eloquent see here.
Applications connect to the active database server.
The active database server replicate all updates to the passive database server.
If active database server goes down then then the passive database server becomes active and applications reconnect to it.
Model:
After failure:
Applications connect to the active database server.
If active database server goes down then then the passive database server becomes active and applications reconnect to it.
Due to the shared storage the passive database server is always uptodate.
Model:
After failure:
Applications connect to a ny database server.
Due to the shared storage all database server are always uptodate.
Model:
After failure:
Each database server has a fraction of the data.
Applications access the database server that has the data they needs.
There is no redundancy - unless combined with one of the previous models.
Model:
With redundancy added:
For some reason the term sharding is most common in OLTP context while the term partitioning is most common in DWH context.
This is a rather complicated topology and only applicablae when dozens/hundreds/thousands of database servers.
Implementations differ sligthly but typical it works like:
Version | Date | Description |
---|---|---|
1.0 | October 17th 2016 | Initial version based on multiple old articles on Eksperten.dk |
1.1 | December 22nd 2017 | Add bunch of links to examples |
1.2 | March 6th 2022 | Add section on topologies |
See list of all articles here
Please send comments to Arne Vajhøj
What database type is best depends a lot on the problem to solve.
My rules of thumb would be: