Database introduction 2 - design and programming

Content:

Introduction
SQL
Database design
1. Process
2. Example
Recommendations

Introduction:

Most applications utilize some form of database to persist data in.

Knowledge about databases is therefore essential for a software developer.

This article targets beginners and attempt to provide them with a basic understanding of SQL, database design and database programming.

If you want some background information before starting, then got to this article.

SQL:

SQL is the standard language for accessing relational databases.

SQL can be divided in:

DDL (Data Definition Language) for defining the database structure
DML (Data Manipulation Language) for working with the data in the database

DDL (Data Definition Language)

Syntaxes for common operations are:

create table: CREATE TABLE tablename(fieldname₁ fieldtype₁ NOT NULL,fieldname₂ fieldtype₂,...,fieldname_n fieldtype_n,PRIMARY KEY(fieldname₁))
create table with primary key consisting of two fields: CREATE TABLE tablename(fieldname₁ fieldtype₁ NOT NULL,fieldname₂ fieldtype₂ NOT NULL,fieldname₃ fieldtype₃,...,fieldname_n fieldtype_n,PRIMARY KEY(fieldname₁,fieldname₂))
delete table: DROP TABLE tablename
create index: CREATE INDEX indexname ON tablename(fieldname)
create index requiring unique values: CREATE UNIQUE INDEX indexname ON tablename(fieldname)
delete index: DROP INDEX indexname

These are rather self-explanatory, but there are specific examples later.

Standard data types are:

INTEGER or INT: integer number
CHAR(n): fixed length text with length n in 8 bit character set
VARCHAR(n): variable length text with max length n in 8 bit character set
NCHAR(n): fixed length text with length n in 16 bit character set
NVARCHAR(n): variable length text with max length n in 16 bit character set
NUMERIC(p,s) or DECIMAL(p,s): fixed point number with max p digits and s decimals
REAL: floating point number in single precision
DOUBLE PRECISION: floating point number in double precision
BOOLEAN: true or false
BINARY(n): fixed length binary data with length n
VARBINARY(n): variable length binary data with max length n
TIME: time
DATE: date
TIMEDATE or TIMESTAMP: time and date
BLOB: Binary Large OBject = large binary content
CLOB: Character Large OBject = large text content
XML: XML document
JSON: JSON document

Note that the data types actually supported are database specific.

DML (Data Manipulation Language) - Update

Syntaxes for common operations are:

insert record with values for all fields: INSERT INTO tablename VALUES(value₁,...,value_n)
insert record with values for some fields (and remaining fields get default value): INSERT INTO tablename(fieldname₁,...,fieldname_n) VALUES(value₁,...,value_n)
insert multiple records with data from query: INSERT INTO tablename SELECT ...
update single field in record: UPDATE tablename SET fieldname₁ = value₁ WHERE fieldname₂ = value₂
update single field in all records: UPDATE tablename SET fieldname = value
update three fields in record: UPDATE tablename SET fieldname₁ = value₁,fieldname₂ = value₂,fieldname₃ = value₃ WHERE fieldname₄ = value₄
delete record: DELETE FROM tablename WHERE fieldname=value
delete all records: DELETE FROM tablename

These are rather self-explanatory, but there are specific examples later.

DML (Data Manipulation Language) - Query

SQL is an abbreviation for Structured Query Language, so obviously queries are a big part of SQL.

Generic syntax is:

SELECT fieldlist
FROM table list or table joins
[WHERE condition]
[GROUP BY field list]
[HAVING condition]
[ORDER BY field and direction list]

The field list is a comma separated list of items that will become columns in the query output.

Examples of items are:

fieldname
calculations
aggregatedfunction(fieldname)
(query returning single value)

An asterisk for field list mean all columns from table list or table joins.

The column get the fieldname as columnname for the first case. For the remaining cases one can specify the columnname via an alias:

item AS columnname

Aggregated functions include:

COUNT
MIN
MAX
AVG

COUNT(*) count rows while COUNT(fieldname) count rows where fieldname is not NULL.

A table list consists of one or more tables separated by comma:

tablename₁,...,tablename_n

Note that the output will contain every row combined with other combination of rows, so with 3 tables of 10, 20 and 30 rows the query output will have 6000 rows. To limit it to rows matching one can use WHERE conditions.

Table joins look like:

table₁ JOIN table₂ ON table₁.fieldname = table₂.fieldname

Conditions can be any expression returning true or false using fieldnames, constants, operators and functions.

Operators include:

+
-
*
/
=
<>
<
>
<=
>=
AND
OR
NOT
IS NULL (=NULL does not work!)
LIKE (wildcard compare using % as wildcard)

An order by field list is a comma separated list of items used to sort the query putput.

Items can be:

fieldname
fieldname ASC
fieldname DESC
expression based on one or more fields

This is all a bit abstract, but hopefully it will become more clear via the examples in the following section.

Examples

All the examples will work with the following tables:

employee
id(INTEGER),PK
name(VARCHAR)
category(INTEGER),FK->category.id,IX
pay(NUMERIC)
supervisor(INTEGER)

category
id(INTEGER),PK
title(VARCHAR)
permanent(BOOLEAN)
supervisor(BOOLEAN)

and data:

employee
id	name	Category	pay	supervisor
1	Alan A	1	140000
2	Brian B	2	100000	1
3	Chris C	2	90000	1
4	Dave D	3	75000	1
5	Erin E	3	80000	1
6	Fred F	4	45000	1

category
id	title	permanent	supervisor
1	Manager	True	True
2	Engineer	True	False
3	Sales rep	True	False
4	Handy man	False	False

SQL to create tables:

CREATE TABLE employee (
    id INTEGER NOT NULL,
    name VARCHAR(50) NOT NULL,
    category INTEGER NOT NULL,
    pay DECIMAL NOT NULL,
    supervisor INTEGER,
    PRIMARY KEY(id)
);
CREATE TABLE category (
    id INTEGER NOT NULL,
    title VARCHAR(25) NOT NULL,
    permanent BOOLEAN NOT NULL,
    supervisor BOOLEAN NOT NULL,
    PRIMARY KEY(id)
);

MySQL demo:

mysql> CREATE TABLE employee (
    ->     id INTEGER NOT NULL,
    ->     name VARCHAR(50) NOT NULL,
    ->     category INTEGER NOT NULL,
    ->     pay DECIMAL NOT NULL,
    ->     supervisor INTEGER,
    ->     PRIMARY KEY(id)
    -> );
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE category (
    ->     id INTEGER NOT NULL,
    ->     title VARCHAR(25) NOT NULL,
    ->     permanent BOOLEAN NOT NULL,
    ->     supervisor BOOLEAN NOT NULL,
    ->     PRIMARY KEY(id)
    -> );
Query OK, 0 rows affected (0.00 sec)

SQL to insert data:

INSERT INTO employee VALUES(1,'Alan A',1,140000,NULL);
INSERT INTO employee VALUES(2,'Brian B',2,100000,1);
INSERT INTO employee VALUES(3,'Chris C',2,90000,1);
INSERT INTO employee VALUES(4,'Dave D',3,75000,1);
INSERT INTO employee VALUES(5,'Erin E',3,80000,1);
INSERT INTO employee VALUES(6,'Fred F',4,45000,1);
INSERT INTO category VALUES(1,'Manager',TRUE,TRUE);
INSERT INTO category VALUES(2,'Engineer',TRUE,FALSE);
INSERT INTO category VALUES(3,'Sales rep',TRUE,FALSE);
INSERT INTO category VALUES(4,'Handy man',FALSE,FALSE);

MySQL demo:

mysql> INSERT INTO employee VALUES(1,'Alan A',1,140000,NULL);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES(2,'Brian B',2,100000,1);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES(3,'Chris C',2,90000,1);
Query OK, 1 row affected (0.02 sec)

mysql> INSERT INTO employee VALUES(4,'Dave D',3,75000,1);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO employee VALUES(5,'Erin E',3,80000,1);
Query OK, 1 row affected (0.02 sec)

mysql> INSERT INTO employee VALUES(6,'Fred F',4,45000,1);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO category VALUES(1,'Manager',TRUE,TRUE);
Query OK, 1 row affected (0.01 sec)

mysql> INSERT INTO category VALUES(2,'Engineer',TRUE,FALSE);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO category VALUES(3,'Sales rep',TRUE,FALSE);
Query OK, 1 row affected (0.00 sec)

mysql> INSERT INTO category VALUES(4,'Handy man',FALSE,FALSE);
Query OK, 1 row affected (0.00 sec)

All columns:

SELECT *
FROM employee;

MySQL demo:

mysql> SELECT *
    -> FROM employee;
+----+---------+----------+--------+------------+
| id | name    | category | pay    | supervisor |
+----+---------+----------+--------+------------+
|  1 | Alan A  |        1 | 140000 |       NULL |
|  2 | Brian B |        2 | 100000 |          1 |
|  3 | Chris C |        2 |  90000 |          1 |
|  4 | Dave D  |        3 |  75000 |          1 |
|  5 | Erin E  |        3 |  80000 |          1 |
|  6 | Fred F  |        4 |  45000 |          1 |
+----+---------+----------+--------+------------+
6 rows in set (0.00 sec)

Just some columns:

SELECT name,pay
FROM employee;

MySQL demo:

mysql> SELECT name,pay
    -> FROM employee;
+---------+--------+
| name    | pay    |
+---------+--------+
| Alan A  | 140000 |
| Brian B | 100000 |
| Chris C |  90000 |
| Dave D  |  75000 |
| Erin E  |  80000 |
| Fred F  |  45000 |
+---------+--------+
6 rows in set (0.00 sec)

Calculated column:

SELECT name,pay/12 AS monthlypay
FROM employee;

MySQL demo:

mysql> SELECT name,pay/12 AS monthlypay
    -> FROM employee;
+---------+------------+
| name    | monthlypay |
+---------+------------+
| Alan A  | 11666.6667 |
| Brian B |  8333.3333 |
| Chris C |  7500.0000 |
| Dave D  |  6250.0000 |
| Erin E  |  6666.6667 |
| Fred F  |  3750.0000 |
+---------+------------+
6 rows in set (0.00 sec)

Just some rows:

SELECT name
FROM employee
WHERE category = 2;

MySQL demo:

mysql> SELECT name
    -> FROM employee
    -> WHERE category = 2;
+---------+
| name    |
+---------+
| Brian B |
| Chris C |
+---------+
2 rows in set (0.00 sec)

and:

SELECT name
FROM employee
WHERE pay > 50000;

MySQL demo:

mysql> SELECT name
    -> FROM employee
    -> WHERE pay > 50000;
+---------+
| name    |
+---------+
| Alan A  |
| Brian B |
| Chris C |
| Dave D  |
| Erin E  |
+---------+
5 rows in set (0.00 sec)

and:

SELECT name
FROM employee
WHERE pay > 50000 AND category <> 1;

MySQL demo:

mysql> SELECT name
    -> FROM employee
    -> WHERE pay > 50000 AND category <> 1;
+---------+
| name    |
+---------+
| Brian B |
| Chris C |
| Dave D  |
| Erin E  |
+---------+
4 rows in set (0.01 sec)

and:

SELECT name
FROM employee
WHERE name NOT LIKE 'A%';

MySQL demo:

mysql> SELECT name
    -> FROM employee
    -> WHERE name NOT LIKE 'A%';
+---------+
| name    |
+---------+
| Brian B |
| Chris C |
| Dave D  |
| Erin E  |
| Fred F  |
+---------+
5 rows in set (0.00 sec)

Ordering:

SELECT name,pay
FROM employee
ORDER BY pay;

MySQL demo:

mysql> SELECT name,pay
    -> FROM employee
    -> ORDER BY pay;
+---------+--------+
| name    | pay    |
+---------+--------+
| Fred F  |  45000 |
| Dave D  |  75000 |
| Erin E  |  80000 |
| Chris C |  90000 |
| Brian B | 100000 |
| Alan A  | 140000 |
+---------+--------+
6 rows in set (0.00 sec)

and:

SELECT name,pay
FROM employee
ORDER BY pay DESC;

MySQL demo:

mysql> SELECT name,pay
    -> FROM employee
    -> ORDER BY pay DESC;
+---------+--------+
| name    | pay    |
+---------+--------+
| Alan A  | 140000 |
| Brian B | 100000 |
| Chris C |  90000 |
| Erin E  |  80000 |
| Dave D  |  75000 |
| Fred F  |  45000 |
+---------+--------+
6 rows in set (0.00 sec)

Inner join:

SELECT employee.name,category.title
FROM employee JOIN category ON employee.category = category.id;

MySQL demo:

mysql> SELECT employee.name,category.title
    -> FROM employee JOIN category ON employee.category = category.id;
+---------+-----------+
| name    | title     |
+---------+-----------+
| Alan A  | Manager   |
| Brian B | Engineer  |
| Chris C | Engineer  |
| Dave D  | Sales rep |
| Erin E  | Sales rep |
| Fred F  | Handy man |
+---------+-----------+
6 rows in set (0.00 sec)

Inner join the old way

SELECT employee.name,category.title
FROM employee,category
WHERE employee.category = category.id;

MySQL demo:

mysql> SELECT employee.name,category.title
    -> FROM employee,category
    -> WHERE employee.category = category.id;
+---------+-----------+
| name    | title     |
+---------+-----------+
| Alan A  | Manager   |
| Brian B | Engineer  |
| Chris C | Engineer  |
| Dave D  | Sales rep |
| Erin E  | Sales rep |
| Fred F  | Handy man |
+---------+-----------+
6 rows in set (0.00 sec)

Self join:

SELECT e1.name AS name,e2.name AS bossname
FROM employee e1 JOIN employee e2 ON e1.supervisor = e2.id;

MySQL demo:

mysql> SELECT e1.name AS name,e2.name AS bossname
    -> FROM employee e1 JOIN employee e2 ON e1.supervisor = e2.id;
+---------+----------+
| name    | bossname |
+---------+----------+
| Brian B | Alan A   |
| Chris C | Alan A   |
| Dave D  | Alan A   |
| Erin E  | Alan A   |
| Fred F  | Alan A   |
+---------+----------+
5 rows in set (0.00 sec)

Self join and left join:

SELECT e1.name AS name,e2.name AS bossname
FROM employee e1 LEFT JOIN employee e2 ON e1.supervisor = e2.id;

MySQL demo:

mysql> SELECT e1.name AS name,e2.name AS bossname
    -> FROM employee e1 LEFT JOIN employee e2 ON e1.supervisor = e2.id;
+---------+----------+
| name    | bossname |
+---------+----------+
| Alan A  | NULL     |
| Brian B | Alan A   |
| Chris C | Alan A   |
| Dave D  | Alan A   |
| Erin E  | Alan A   |
| Fred F  | Alan A   |
+---------+----------+
6 rows in set (0.02 sec)

Union:

(SELECT name FROM employee WHERE category=1)
UNION
(SELECT name FROM employee WHERE pay >= 80000);

MySQL demo:

mysql> (SELECT name FROM employee WHERE category=1)
    -> UNION
    -> (SELECT name FROM employee WHERE pay >= 80000);
+---------+
| name    |
+---------+
| Alan A  |
| Brian B |
| Chris C |
| Erin E  |
+---------+
4 rows in set (0.00 sec)

Aggregated function and grouping:

SELECT category.title,COUNT(*) AS noempl,AVG(pay) AS avgpay
FROM employee JOIN category ON employee.category = category.id
GROUP BY category.title;

MySQL demo:

mysql> SELECT category.title,COUNT(*) AS noempl,AVG(pay) AS avgpay
    -> FROM employee JOIN category ON employee.category = category.id
    -> GROUP BY category.title;
+-----------+--------+-------------+
| title     | noempl | avgpay      |
+-----------+--------+-------------+
| Engineer  |      2 |  95000.0000 |
| Handy man |      1 |  45000.0000 |
| Manager   |      1 | 140000.0000 |
| Sales rep |      2 |  77500.0000 |
+-----------+--------+-------------+
4 rows in set (0.01 sec)

Sub queries:

SELECT name
FROM employee
WHERE category IN (SELECT id FROM category WHERE permanent);

MySQL demo:

mysql> SELECT name
    -> FROM employee
    -> WHERE category IN (SELECT id FROM category WHERE permanent);
+---------+
| name    |
+---------+
| Alan A  |
| Brian B |
| Chris C |
| Dave D  |
| Erin E  |
+---------+
5 rows in set (0.00 sec)

and:

SELECT name
FROM employee
WHERE EXISTS (SELECT * FROM category WHERE permanent AND id = employee.category);

MySQL demo:

mysql> SELECT name
    -> FROM employee
    -> WHERE EXISTS (SELECT * FROM category WHERE permanent AND id = employee.category);
+---------+
| name    |
+---------+
| Alan A  |
| Brian B |
| Chris C |
| Dave D  |
| Erin E  |
+---------+
5 rows in set (0.00 sec)

Database design:

First step for a database is obviously to design the database structure.

For those having done it many times then it is not a problem, but for the beginner it can be a challenging task.

The following is a simple process designed to help beginners get started.

It is not the only way of doing it. It may not even be the best way of doing it. But I believe it should help the beginner get started.

Process

Determine what object types (classes) that need to be saved in the database and create a table for each of them.
Determine the attributes of each object type (class) and make those fields in the tables.
For each table look for a good identification of rows - if one exists then make that field primary key - if not add an id field and make that primary key.
Look at relations between tables:
- 1:1 relation => merge tables
- 1:M relation => add a field to the table on the M side pointing to the primary key on the 1 side
- M:M relation => add a new table with two fields one pointing to the primary key of each side
Choose data type for each field:
- Identifying numbers => INTEGER
- Counting numbers => INTEGER
- Text strictly western => VARCHAR
- Text including non-western => NVARCHAR
- Amount => NUMERIC
- True/False => BOOLEAN
- date only => DATE
- date and time => DATETIME
- Measured values => DOUBLE PRECISION
Determine the most natural queries and ensure that all involved fields has an index on it (primary keys automatically get index on them).
Convert model to DDL SQL

Example

Let us walk through an example step by step to better understand what the above process really means.

The example will design a little school database.

Determine what object types (classes) that need to be saved in the database and create a table for each of them.

The obvious objects types are:

student
class
teacher

So with one table for each of them we have:

student

class

teacher

Determine the attributes of each object type (class) and make those fields in the tables.

Student have two obvious attributes:

name
year

Course have one obvious attribute:

name

Teacher have one obvious attribute:

name

Adding those to the tables we have:

student
name
year

class
name

teacher
name

For each table look for a good identification of rows - if one exists then make that field primary key - if not add an id field and make that primary key.

We will assume that class names are unqiue, but that both student and teacher names can have duplicates, so we need to add id fields to the student and teacher tables, and mark fields as primary key:

student
id,PK
name
year

class
name,PK

teacher
id,PK
name

Look at relations between tables

We see that:

class-student is a 1:M relation as one class can have many students but a student belongs to one class, so we add FK to student table
class-teacher is a M:M relation as one class can have many teachers and a teacher can have many classes, so we create a new class_teacher table

So now we have:

student
id,PK
name
year
class,FK->class

class
name,PK

teacher
id,PK
name

class_teacher
class,partPK,FK->class
teacher,partPK,FK->teacher

Choose data type for each field

All id's should be INTEGER.

All names should be VARCHAR.

Year is a number so it should be INTEGER.

So now we have:

student
id(INTEGER),PK
name(VARCHAR)
year(INTEGER)
class(VARCHAR),FK->class

class
name(VARCHAR),PK

teacher
id(INTEGER),PK
name(VARCHAR)

class_teacher
class(VARCHAR),partPK,FK->class
teacher(INTEGER),partPK,FK->teacher

Determine the most natural queries and ensure that all involved fields has an index on it (primary keys automatically get index on them)

Find teachers classes:

SELECT class.name
FROM (class JOIN class_teacher ON class.name=class_teacher.class) JOIN teacher ON class_teacher.teacher=teacher.id
WHERE teacher.name = ?;

Find classes teachers:

SELECT teacher.name
FROM (teacher JOIN class_teacher ON teacher.id=class_teacher.teacher) JOIN class ON class_teacher.class=class.name
WHERE class.name = ?;

Find students teachers:

SELECT teacher.name
FROM ((teacher JOIN class_teacher ON teacher.id=class_teacher.teacher) JOIN class ON class_teacher.class=class.name) JOIN student ON class.name=student.class
WHERE student.name = ?;

We see that the following fields are used for comparison:

student.name
student.class
class.name
teacher.id
teacher.name
class_teacher.class
class_teacher.teacher

class.name and teacher.id are PK so they already have an index. The rest need an index. Resulting in:

student
id(INTEGER),PK
name(VARCHAR),IX
year(INTEGER)
class(VARCHAR),FK->class,IX

class
name(VARCHAR),PK

teacher
id(INTEGER),PK
name(VARCHAR),IX

class_teacher
class(VARCHAR),partPK,FK->class,IX
teacher(INTEGER),partPK,FK->teacher,IX

Convert model to DDL SQL

Simple:

CREATE TABLE student(
    id INTEGER NOT NULL,
    name VARCHAR(50),
    year INTEGER,
    class VARCHAR(20) NOT NULL,
    PRIMARY KEY(id)
);
CREATE TABLE class (
    name VARCHAR(20) NOT NULL,
    PRIMARY KEY(name)
);
CREATE TABLE teacher (
    id INTEGER NOT NULL,
    name VARCHAR(50),
    PRIMARY KEY(id)
);
CREATE TABLE class_teacher (
    class VARCHAR(20) NOT NULL,
    teacher INTEGER NOT NULL,
    PRIMNARY KEY(class,teacher)
);
CREATE INDEX student_name_ix ON student(name);
CREATE INDEX student_class_ix ON student(class);
CREATE INDEX teacher_name_ix ON teacher(name);
CREATE INDEX class_teacher_class_ix ON class_teacher(class);
CREATE INDEX class_teacher_teacher_ix ON class_teacher(teacher);

Recommendations:

Purpose

The purpose of this section is to provide some practical advice about what is good and what is bad when programming against a database, which will hopefully help the beginner to get one the right track.

Note that this section is rules of thumbs that are valid in most but not all cases. It makes sense for a beginner to follow these rules. The experienced developer will know when it may be warranted to break the rules.

Prefer columns over tables

It is almost always better to have a single table with an extra column instead of multiple tables with the same table structure.

So not:

tablex_2005
id(INTEGER),PK
val(INTEGER)

tablex_2006
id(INTEGER),PK
val(INTEGER)

tablex_2007
id(INTEGER),PK
val(INTEGER)

but:

tablex
id(INTEGER),PK
year(INTEGER)
val(INTEGER)

And not:

countries_europe
id(INTEGER),PK
name(VARCHAR)
capital(VARCHAR)

countries_northamerica
id(INTEGER),PK
name(VARCHAR)
capital(VARCHAR)

countries_asia
id(INTEGER),PK
name(VARCHAR)
capital(VARCHAR)

but:

countries
id(INTEGER),PK
name(VARCHAR)
capital(VARCHAR)
continent(VARCHAR)

The combined table:

makes it easier to write queries
will usually have better performance

Note that this is not denormalization. Denormalization would be merging tables to avoid a JOIN. This is merging table to avoid a UNION.

For very large databases tables are actually sometimes split out in multiple tables/servers. This is called partitioning/sharding. Forget about that for now and read more about it when your database exceeds 10 TB.

Use indexes for performance

Correct usage of indexes is essential for good performance.

90% of database performance problems can be solved with correct indexes.

For my rules of thumb see here.

For queries like:

SELECT * FROM t1 JOIN t2 ON t1.f1=t2.f1 WHERE t2.f2=X ORDER BY t1.f3;
SELECT * FROM t1,t2 WHERE t1.f1=t2.f1 AND t2.f2=X ORDER BY t1.f3;

put index on:

t1.f1
t2.f1
t2.f2
t1.f3

There are actually two index types:

tree based
hash based

where tree based indexes are by far the most common.

To illustrate the impact of indexes let us compare how search time changes when the database becomes bigger with an without indexes.

Search for a value without index is an O(n) operation while search for a value with a tree based index is an O(log(n)) operation.

So let us look at some fictive query with WHEN field='xxx' condition:

Number of rows	Time to execute without index	Time to execute with tree index
1000	0.01 seconds	0.01 seconds
1000000	10 seconds	0.02 seconds
1000000000	2 hours 47 minutes	0.03 seconds

That is a big difference. Searching for values in non-indexed fields will kill any query when the database becomes sufficiently big.

Note that sometimes indexes does not help. Specifically:

Condition	tree based index helps	hash based index helps
field = 'xxx'	yes	yes
field < 'xxx'	yes	no
field > 'xxx'	yes	no
field LIKE 'xxx%'	yes	no
field LIKE '%xxx'	no	no
field LIKE '%xxx%'	no	no
ORDER BY field	yes	no

Which is why LIKE a value starting with a wildcard is always a performance killer and should be avoided if possible.

Sometimes full text search is a solution to that problem.

It should also be obvious why tree based indexes are more common than hash based indexes.

Use standard SQL for portability

Different database software uses sligthly different SQL dialects. It is recommended to use standard SQL that work for all database software. That makes it a lot easier to change database software later. Even if database specific SQL can not be avoided, then only use it where absolutely needed and keep the rest standard. That minimizes work to be done when/if porting to another database software.

Ideally one should only need to change a connection string in a config file to change database software. It may be difficult to fully achieve that in practice, but one can do as much as possible. For the remaining be sure to isolate it in the application to make it easier to find and modify.

Do not believe in the "We will never change database software, so it does not matter." philosophy. Many databases exist for 10-20-30-40-50 years and technology changes and the IT org changes (new CIO/CTO) and the business changes (mergers, acquisitions and spin-off's). Nobody knows how the future will look like.

Quoted names

MySQL specific:

SELECT `x y z field` FROM `a b c table`;

MS SQLServer and MS Access specific:

SELECT [x y z field] FROM [a b c table];

The solution is simply to avoid names with spaces, names that are reserved words etc.:

SELECT xyzfield FROM abctable;

Related: also avoid names that are not reserved words in your database software, but may be so in other database software.

Functions

Most database software have a long list of functions for numeric, string and date/time manipulation and unfortunatetly these are not standardized so not very portable.

One should avoid if possible and in some cases the logic can be moved from database SQL to application code.

User defined functions and stored procedures

These can make SQL in application a lot cleaner and also provide certain performance, security and operational benefits. But they are not very portable.

Always analyze carefully whether the benefits outweigh the loss of portability before going this route.

Use proper division of responsibility between database SQL and application code

SQL is great for selction, joining, grouping and sorting.

But SQL is not a general purpose programming language.

Doing complex conversions or calculations in SQL often end up with unreadable SQL and poor performance.

That stuff should be done in a general purpose programming language. So let the SQL fetch the data and let the application code process the data and do conversions and calculations.

This rule also applies to user defined functions and stored procdures using SQL as language.

This logic in user defined functions and stored procdures only works out well, when they can be written in a general purpose language. Some database software do support that:

Oracle DB (Java, C# and VB.NET)
IBM DB2 (Java)
Sybase (Java)
MS SQLServer (C# and VB.NET)
Derby (Java)

This is a possibility that is often missed, because the database person is not aware of it.

For examples of UDF and SP in Java see here.

For examples of UDF and SP in C# see here.

Use correct data type

It is very important to chose the correct data type.

For my rules of thumb see here.

The general rules is to chose the most restrictive data type that will accept all valid values. That minimizes the risk for bad data.

Do not store numbers in a VARCHAR field as that data type will allow entry of non-digits.

Do not store dates/times/datetimes in a VARCHAR field as that data type will allow entry of non-valid data.

When chosing data type then also think about the semantics of various operations:

INTEGER's are sorted 3,22,111 but VARCHAR's are sorted '111','22','3'
INTEGER's are added 11+22=33 but VARCHAR's are concatanated '11'+'22'='1122' (in most databases - MySQL is an exception)

Avoid REAL and DOUBLE precision for amounts as they come with uncertainty in storage and operations. Always use NUMERIC or DECIMAL for amounts.

Protect against SQL injection

SQL injection is one of the most common security vulnerabilities.

For a description of how it works see here.

The solution is to use prepared statement / parameters (the term depends on the technology, but the concept is the same).

This article shows how to do it in Java, .NET, PHP and ASP.

Code to handle concurrency correct

Moving from a single user accessing the database to multiple users accessing the database concurrently creates a potential for problems due to users actions interfering with each other.

This require careful handling in the code.

This article shows both classic concurrency problems and possible solutions.

Do not ignore this aspect just because it is not likely to create problems with few users. Eventually it will become a problem.

Do not rule out puting files in database

There is a widely spread assumption that putting files (like pictures) into the database will result in bad performance and that the correct solution is to put the files in the file system and a just a reference in the database.

The reality is a bit more complicated than that.

The file system solution has several potential problems:

it is practically impossible to ensure transactional integrity (atomic update of database and file system)
it is very difficult to make a consistent backup of database and file system
working with the data requires two toolsets - one for database and one for file system
clustering can be tricky as file systems does not use the same replication mechanism as a database
security on file system requires streaming by application not direct serving by web server, which reduces the performance benefits

But what about the performance itself?

I have no doubt that a Access 2.0 / Windows 3.11 / 486 or Access 97 / Windows 98 / Pentium II would perform very poorly storing lots of files.

But that does not really say anything about a modern database server.

I have tested a bit with ASP.NET + MS SQLServer and ASP.NET + MySQL, and my conclusion was that there were some overhead but that it was not bad at all (double digit percentage but not a large factor).

So if you need to store some files, then I strongly recommend that you make your own tests to see if it will work storing them in the database instead of just believing in the old hearsay.

Use INFORMATION_SCHEMA to get information

Whenever information about database tables and database columns are needed use INFORMATION_SCHEMA (unless database API provide similar capabilities).

INFORMATION_SCHEMA has all the relevant information and is portable among databases.

Example queries:

SELECT TABLE_SCHEMA,TABLE_NAME,TABLE_TYPE
FROM INFORMATION_SCHEMA.TABLES;

SELECT TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,DATA_TYPE,CHARACTER_MAXIMUM_LENGTH,NUMERIC_PRECISION,NUMERIC_SCALE
FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME='tablename';

Article history:

Version	Date	Description
1.0	October 17th 2016	Initial version based on multiple old articles on Eksperten.dk
1.1	March 6th 2022	Add recommendation about INFORMATION_SCHEMA

Comments:

Please send comments to Arne Vajhøj

Database introduction 2 - design and programming

Content:

Introduction:

SQL:

DDL (Data Definition Language)

DML (Data Manipulation Language) - Update

DML (Data Manipulation Language) - Query

Examples

Database design:

Process

Example

Recommendations:

Purpose

Prefer columns over tables

Use indexes for performance

Use standard SQL for portability

Quoted names

Functions

User defined functions and stored procedures

Use proper division of responsibility between database SQL and application code

Use correct data type

Protect against SQL injection

Code to handle concurrency correct

Do not rule out puting files in database

Use INFORMATION_SCHEMA to get information

Article history:

Other articles:

Comments: