Designing the Database Schema: A Comprehensive Guide

Introduction

A well-structured database schema is the foundation of any efficient data management system. Whether you are designing a relational database for an application or a data warehouse for analytical processing, understanding schema design principles is crucial. This blog explores database schemas, key data types, database keys, data warehouse schema designs, and compression techniques in Amazon Redshift.


What is a Database?

A database is an organized collection of data stored electronically, allowing easy access, management, modification, and retrieval of information. It serves as a structured repository where data can be efficiently queried and analyzed.


What is a Schema in a Database?

A database schema defines the logical structure of a database, outlining how data is organized, stored, and related. A schema includes:

  • Tables and Columns: Define the structure of data storage.
  • Data Types: Specify the kind of data stored in each column.
  • Relationships: Determine how tables relate to each other.
  • Constraints: Enforce rules such as primary keys, foreign keys, and unique constraints.
  • Views, Indexes, Stored Procedures, and Functions: Enhance data retrieval efficiency.
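
As a small illustration, the objects above can be created with standard DDL (the schema, table, and view names here are hypothetical):

```sql
-- Group related objects under one schema (illustrative names)
CREATE SCHEMA sales;

-- A table with typed columns and a primary-key constraint
CREATE TABLE sales.orders (
    order_id   INT PRIMARY KEY,
    order_date DATE NOT NULL,
    amount     DECIMAL(10,2)
);

-- A view that packages a common query for reuse
CREATE VIEW sales.recent_orders AS
SELECT order_id, amount
FROM sales.orders
WHERE order_date >= '2024-01-01';
```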

Data Types in a Database

Data types define the nature of data that can be stored in a column. They determine:

  • The type of data (e.g., numbers, text, dates).
  • Storage space required.
  • Permissible operations on the data.

Common Data Types in Amazon Redshift:

  • Numeric Types: SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION
  • Character Types: CHAR, VARCHAR, TEXT
  • Date/Time Types: DATE, TIME, TIMESTAMPTZ
  • Boolean Type: BOOLEAN
  • Spatial Data Types: GEOMETRY, GEOGRAPHY
  • Complex Types: SUPER, VARBYTE
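
A minimal sketch of how several of these types might appear together in one Redshift table definition (the table and column names are hypothetical):

```sql
CREATE TABLE events (
    event_id   BIGINT,        -- numeric
    event_name VARCHAR(100),  -- character
    event_time TIMESTAMPTZ,   -- date/time with time zone
    is_active  BOOLEAN,       -- boolean
    price      DECIMAL(8,2),  -- exact numeric
    payload    SUPER          -- semi-structured (JSON-like) data
);
```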

Types of Database Keys

Keys play a crucial role in ensuring data integrity and relationships in relational databases. Note that Amazon Redshift accepts primary key, unique, and foreign key constraints but treats them as informational only and does not enforce them; they still help the query planner.

1. Primary Key (PK)

  • Uniquely identifies each record in a table.
  • Cannot have duplicate or NULL values.
  • Example:
CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    Name VARCHAR(50),
    Email VARCHAR(50)
);

2. Foreign Key (FK)

  • Establishes relationships between tables.
  • References a primary key in another table.
  • Example:
CREATE TABLE Orders (
    OrderID INT PRIMARY KEY,
    CustomerID INT,
    FOREIGN KEY (CustomerID) REFERENCES Customers(CustomerID)
);

3. Candidate Key

  • A minimal set of attributes that uniquely identifies each record.
  • A table can have multiple candidate keys, but only one is chosen as the primary key.
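
For example, a table might have two columns that each uniquely identify a row; one is chosen as the primary key and the other is kept as a unique alternate (the column names here are hypothetical):

```sql
CREATE TABLE Staff (
    EmployeeID INT PRIMARY KEY,     -- chosen candidate key (primary key)
    NationalID VARCHAR(20) UNIQUE,  -- alternate candidate key
    Name       VARCHAR(50)
);
```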

4. Composite Key

  • A primary key consisting of multiple columns.
  • Example:
CREATE TABLE CourseRegistrations (
    StudentID INT,
    CourseID INT,
    PRIMARY KEY (StudentID, CourseID)
);

5. Unique Key

  • Ensures unique values in a column but allows NULL values.
  • Example:
CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    Email VARCHAR(100) UNIQUE
);

6. Super Key

  • A set of one or more attributes that uniquely identifies records; unlike a candidate key, it may include extra attributes beyond what is strictly needed. Every candidate key is a super key, but not all super keys are candidate keys.

7. Surrogate Key

  • A system-generated key with no business meaning, usually an auto-incremented number.
  • Example (SERIAL is PostgreSQL syntax; in Amazon Redshift, use an IDENTITY column):
CREATE TABLE Products (
    ProductID INT IDENTITY(1,1) PRIMARY KEY,
    ProductName VARCHAR(50)
);

Data Warehouse Schema Design

Schema design is vital in data warehouses to optimize query performance.

1. Star Schema

A star schema consists of a central fact table connected to multiple dimension tables.


Components:

  • Fact Table: Stores measurable data (e.g., sales, revenue).
  • Dimension Tables: Store descriptive attributes (e.g., customers, products, time).

Example:

Fact Table: sales_fact

  • sale_id, product_id, customer_id, date_id, amount

Dimension Tables:

  • dim_customer (customer details)
  • dim_product (product details)
  • dim_date (date information)
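
The star schema above can be sketched in DDL; the table names mirror the example, while the column details are illustrative assumptions:

```sql
CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    region        VARCHAR(50)
);

CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100),
    category     VARCHAR(50)
);

CREATE TABLE dim_date (
    date_id   INT PRIMARY KEY,
    full_date DATE
);

-- The fact table references each dimension directly (the "points" of the star)
CREATE TABLE sales_fact (
    sale_id     INT,
    product_id  INT REFERENCES dim_product(product_id),
    customer_id INT REFERENCES dim_customer(customer_id),
    date_id     INT REFERENCES dim_date(date_id),
    amount      DECIMAL(12,2)
);
```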

Advantages:

  • Simple and intuitive design.
  • Faster query performance due to denormalization.

Disadvantages:

  • Data redundancy leads to increased storage requirements.

2. Snowflake Schema

A snowflake schema is a normalized version of the star schema, reducing redundancy by breaking down dimension tables into sub-dimensions.


Example:

Fact Table: sales_fact

  • sale_id, product_id, customer_id, date_id, amount

Dimension Tables:

  • dim_customer → linked to customer_region
  • dim_product → linked to product_category
  • dim_date
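
A sketch of the normalization step, assuming illustrative column names: the region attributes move out of dim_customer into a customer_region sub-dimension that dim_customer references by key:

```sql
CREATE TABLE customer_region (
    region_id   INT PRIMARY KEY,
    region_name VARCHAR(50)
);

-- dim_customer now holds a key to the sub-dimension instead of region details
CREATE TABLE dim_customer (
    customer_id   INT PRIMARY KEY,
    customer_name VARCHAR(100),
    region_id     INT REFERENCES customer_region(region_id)
);
```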

Advantages:

  • Reduced data redundancy.
  • Better data integrity.

Disadvantages:

  • Complex queries due to multiple joins.
  • Slightly slower query performance.

Star Schema vs. Snowflake Schema:

| Criteria          | Star Schema                | Snowflake Schema            |
|-------------------|----------------------------|-----------------------------|
| Complexity        | Simple and denormalized    | Complex with normalization  |
| Query Performance | Faster due to fewer joins  | Slower due to more joins    |
| Storage           | More storage required      | Optimized storage           |
| Use Case          | Best for OLAP, dashboards  | Best for large data models  |

Columnar Compression in Amazon Redshift

Amazon Redshift uses columnar storage and compression to improve performance and reduce storage costs.

Types of Compression in Redshift:

  1. Automatic Compression
    • Redshift automatically applies the best compression during data loading.
    • Example:
COPY table_name
FROM 's3://bucket/path'
CREDENTIALS 'aws_credentials'
COMPUPDATE ON;

  2. Manual Compression
    • Users can specify compression encodings manually.
    • Example:
CREATE TABLE table_name (
    column1 INT ENCODE AZ64,
    column2 VARCHAR(255) ENCODE BYTEDICT,
    column3 TIMESTAMP ENCODE DELTA
);

Common Compression Encodings:

  • BYTEDICT: Best for low-cardinality columns.
  • DELTA: Suitable for sorted numeric or timestamp columns.
  • ZSTD & AZ64: General-purpose compression for numeric and text data.
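
To see which encodings Redshift would recommend for an existing, populated table, you can run ANALYZE COMPRESSION (the table name is a placeholder):

```sql
-- Reports a suggested encoding and estimated space savings per column
ANALYZE COMPRESSION table_name;
```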

Benefits of Compression:

  • Reduced storage costs.
  • Faster query execution due to reduced I/O.
  • Optimized performance for large datasets.

Conclusion

Database schema design is fundamental for data integrity, efficiency, and performance. Whether working with OLTP or OLAP databases, understanding data types, keys, and schema structures like star and snowflake schemas ensures better data management. In Amazon Redshift, columnar storage and compression further enhance query speeds and storage efficiency, making it a powerful choice for big data processing.


References:

Amazon Redshift Documentation: AWS Redshift Compression
