Designing the Database Schema: A Comprehensive Guide

Introduction

A well-structured database schema is the foundation of any efficient data management system. Whether you are designing a relational database for an application or a data warehouse for analytical processing, understanding schema design principles is crucial. This blog explores database schemas, key data types, database keys, data warehouse schema designs, and compression techniques in Amazon Redshift.


What is a Database?

A database is an organized collection of data stored electronically, allowing easy access, management, modification, and retrieval of information. It serves as a structured repository where data can be efficiently queried and analyzed.


What is a Schema in a Database?

A database schema defines the logical structure of a database, outlining how data is organized, stored, and related. A schema includes:

  • Tables and Columns: Define the structure of data storage.
  • Data Types: Specify the kind of data stored in each column.
  • Relationships: Determine how tables relate to each other.
  • Constraints: Enforce rules such as primary keys, foreign keys, and unique constraints.
  • Views, Indexes, Stored Procedures, and Functions: Enhance data retrieval efficiency.
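
As a small illustration, the objects above can be created with standard DDL (the schema, table, and view names here are hypothetical):

```sql
-- Group related objects under one schema (illustrative names)
CREATE SCHEMA sales;

-- A table with typed columns and a primary-key constraint
CREATE TABLE sales.orders (
    order_id   INT PRIMARY KEY,
    order_date DATE NOT NULL,
    amount     DECIMAL(10,2)
);

-- A view that packages a common query for reuse
CREATE VIEW sales.recent_orders AS
SELECT order_id, amount
FROM sales.orders
WHERE order_date >= '2024-01-01';
```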

Data Types in a Database

Data types define the nature of data that can be stored in a column. They determine:

  • The type of data (e.g., numbers, text, dates).
  • Storage space required.
  • Permissible operations on the data.

Common Data Types in Amazon Redshift:

  • Numeric Types: SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION
  • Character Types: CHAR, VARCHAR, TEXT
  • Date/Time Types: DATE, TIME, TIMESTAMPTZ
  • Boolean Type: BOOLEAN
  • Spatial Data Types: GEOMETRY, GEOGRAPHY
  • Complex Types: SUPER, VARBYTE
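
A minimal sketch of how several of these types might appear together in one Redshift table definition (the table and column names are hypothetical):

```sql
CREATE TABLE events (
    event_id   BIGINT,        -- numeric
    event_name VARCHAR(100),  -- character
    event_time TIMESTAMPTZ,   -- date/time with time zone
    is_active  BOOLEAN,       -- boolean
    price      DECIMAL(8,2),  -- exact numeric
    payload    SUPER          -- semi-structured (JSON-like) data
);
```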

Types of Database Keys

Keys play a crucial role in ensuring data integrity and relationships in relational databases. Note that Amazon Redshift accepts primary key, unique, and foreign key constraints but treats them as informational only and does not enforce them; they still help the query planner.

1. Primary Key (PK)

  • Uniquely identifies each record in a table.
  • Cannot have duplicate or NULL values.
  • Example:
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(50),
    Email VARCHAR(50)
);

2. Foreign Key (FK)

  • Establishes relationships between tables.
  • References a primary key in another table.
  • Example:
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

3. Candidate Key

  • A minimal set of attributes that uniquely identifies each record.
  • A table can have multiple candidate keys, but only one is chosen as the primary key.
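
For example, a table might have two columns that each uniquely identify a row; one is chosen as the primary key and the other is kept as a unique alternate (the column names here are hypothetical):

```sql
CREATE TABLE Staff (
    EmployeeID INT PRIMARY KEY,     -- chosen candidate key (primary key)
    NationalID VARCHAR(20) UNIQUE,  -- alternate candidate key
    Name       VARCHAR(50)
);
```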

4. Composite Key

  • A primary key consisting of multiple columns.
  • Example:
CREATE TABLE CourseRegistrations (
    StudentID INT,
    CourseID INT,
    PRIMARY KEY (StudentID, CourseID)
);

5. Unique Key

  • Ensures unique values in a column but allows NULL values.
  • Example:
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    Email VARCHAR(100) UNIQUE
);

6. Super Key

  • A set of one or more attributes that uniquely identifies records; unlike a candidate key, it may include extra attributes beyond what is strictly needed. Every candidate key is a super key, but not all super keys are candidate keys.

7. Surrogate Key

  • A system-generated key with no business meaning, usually an auto-incremented number.
  • Example (SERIAL is PostgreSQL syntax; in Amazon Redshift, use an IDENTITY column):
CREATE TABLE Products (
    ProductID INT IDENTITY(1,1) PRIMARY KEY,
    ProductName VARCHAR(50)
);

Data Warehouse Schema Design

Schema design is vital in data warehouses to optimize query performance.

1. Star Schema

A star schema consists of a central fact table connected to multiple dimension tables.


Components:

  • Fact Table: Stores measurable data (e.g., sales, revenue).
  • Dimension Tables: Store descriptive attributes (e.g., customers, products, time).

Example:

Fact Table: sales_fact

  • sale_id, product_id, customer_id, date_id, amount

Dimension Tables:

  • dim_customer (customer details)
  • dim_product (product details)
  • dim_date (date information)
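
The star schema above can be sketched in DDL; the table names mirror the example, while the column details are illustrative assumptions:

```sql
CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    region        VARCHAR(50)
);

CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

CREATE TABLE dim_date (
    date_id   INT PRIMARY KEY,
    full_date DATE
);

-- The fact table references each dimension directly (the "points" of the star)
CREATE TABLE sales_fact (
    sale_id     INT,
    product_id  INT REFERENCES dim_product(product_id),
    customer_id INT REFERENCES dim_customer(customer_id),
    date_id     INT REFERENCES dim_date(date_id),
    amount      DECIMAL(12,2)
);
```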

Advantages:

  • Simple and intuitive design.
  • Faster query performance due to denormalization.

Disadvantages:

  • Data redundancy leads to increased storage requirements.

2. Snowflake Schema

A snowflake schema is a normalized version of the star schema, reducing redundancy by breaking down dimension tables into sub-dimensions.


Example:

Fact Table: sales_fact

  • sale_id, product_id, customer_id, date_id, amount

Dimension Tables:

  • dim_customer → linked to customer_region
  • dim_product → linked to product_category
  • dim_date
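
A sketch of the normalization step, assuming illustrative column names: the region attributes move out of dim_customer into a customer_region sub-dimension that dim_customer references by key:

```sql
CREATE TABLE customer_region (
    region_id   INT PRIMARY KEY,
    region_name VARCHAR(50)
);

-- dim_customer now holds a key to the sub-dimension instead of region details
CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    region_id     INT REFERENCES customer_region(region_id)
);
```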

Advantages:

  • Reduced data redundancy.
  • Better data integrity.

Disadvantages:

  • Complex queries due to multiple joins.
  • Slightly slower query performance.

Star Schema vs. Snowflake Schema:

| Criteria          | Star Schema                | Snowflake Schema            |
|-------------------|----------------------------|-----------------------------|
| Complexity        | Simple and denormalized    | Complex with normalization  |
| Query Performance | Faster due to fewer joins  | Slower due to more joins    |
| Storage           | More storage required      | Optimized storage           |
| Use Case          | Best for OLAP, dashboards  | Best for large data models  |

Columnar Compression in Amazon Redshift

Amazon Redshift uses columnar storage and compression to improve performance and reduce storage costs.

Types of Compression in Redshift:

  1. Automatic Compression
    • Redshift automatically applies the best compression during data loading.
    • Example:
COPY table_name
FROM 's3://bucket/path'
CREDENTIALS 'aws_credentials'
COMPUPDATE ON;

  2. Manual Compression
    • Users can specify compression encodings manually.
    • Example:
CREATE TABLE table_name (
    column1 INT ENCODE AZ64,
    column2 VARCHAR(255) ENCODE BYTEDICT,
    column3 TIMESTAMP ENCODE DELTA
);

Common Compression Encodings:

  • BYTEDICT: Best for low-cardinality columns.
  • DELTA: Suitable for sorted numeric or timestamp columns.
  • ZSTD & AZ64: General-purpose compression for numeric and text data.
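
To see which encodings Redshift would recommend for an existing, populated table, you can run ANALYZE COMPRESSION (the table name is a placeholder):

```sql
-- Reports a suggested encoding and estimated space savings per column
ANALYZE COMPRESSION table_name;
```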

Benefits of Compression:

  • Reduced storage costs.
  • Faster query execution due to reduced I/O.
  • Optimized performance for large datasets.

Conclusion

Database schema design is fundamental for data integrity, efficiency, and performance. Whether working with OLTP or OLAP databases, understanding data types, keys, and schema structures like star and snowflake schemas ensures better data management. In Amazon Redshift, columnar storage and compression further enhance query speeds and storage efficiency, making it a powerful choice for big data processing.


References:

Amazon Redshift Documentation: AWS Redshift Compression
