Understanding Surrogate Keys and Slowly Changing Dimensions (SCD)

Introduction

In data warehousing, surrogate keys and slowly changing dimensions (SCD) are fundamental concepts that ensure data consistency and historical tracking. Surrogate keys provide unique, stable identifiers for dimension records, while SCD techniques handle changes in descriptive attributes over time. This blog explains their importance, how they work, and why they are essential for a robust data warehouse.

surrogate keys

What are Surrogate Keys?

Surrogate keys are system-generated identifiers used in dimension tables to maintain unique records. Unlike natural keys (from source systems), surrogate keys are stable and unaffected by backend changes.

Key Advantages of Surrogate Keys:

  • Independence: They don’t depend on source system keys, ensuring stability during source system updates.
  • Conflict Resolution: Eliminate key conflicts when data comes from multiple sources.
  • SCD Compatibility: Support tracking historical changes by allowing duplicate natural keys with distinct surrogate keys.

Example:
Suppose a customer’s ID (CUST1001) from a CRM system is updated to CUST2001. With surrogate keys, the data warehouse can maintain the original key (1) without disruptions.

What are Slowly Changing Dimensions (SCD)?

Slowly Changing Dimensions refer to attributes in dimension tables that change infrequently over time. SCD ensures these changes are recorded to preserve historical accuracy.

Types of SCD:

  1. Type 1 (Overwrite): Updates the existing value, losing historical data.
    • Example: Correcting a spelling error in a customer’s name.
  2. Type 2 (History Tracking): Adds a new row for each change, maintaining history.
    • Example: Tracking a customer’s address changes over time.
  3. Type 3 (Limited History): Adds a new column for the previous value, capturing only the latest change.
    • Example: Storing both the current and previous product categories.

How Surrogate Keys Support SCD:

For Type 2, surrogate keys are critical to differentiate rows with the same natural key (e.g., Customer ID) but different historical records (e.g., Address A vs. Address B).

Examples for Surrogate Keys and SCD

1. Surrogate Keys Example

A dimension table with Customer Information that uses both surrogate and natural keys.

Surrogate KeyCustomer ID (Natural Key)Customer NameAddressPhone Number
1CUST1001John Doe123 Maple St123-456-7890
2CUST1002Jane Smith456 Oak Ave987-654-3210
3CUST1003Alice Johnson789 Pine Blvd555-123-4567

Key Insight:

  • The Surrogate Key is a system-generated unique identifier independent of the Customer ID from the source system.
  • If the Customer ID changes in the source system, it does not impact the surrogate key.

2. Slowly Changing Dimensions (SCD) Example

SCD Type 1 (Overwrite):
Customer IDCustomer NameAddress
CUST1001John Doe456 Oak Ave

Scenario:

  • If John Doe moves to a new address, the previous address is overwritten. No historical data is retained.

SCD Type 2 (History Tracking):
Surrogate KeyCustomer IDCustomer NameAddressEffective DateEnd Date
1CUST1001John Doe123 Maple St2024-01-012024-05-31
2CUST1001John Doe456 Oak Ave2024-06-01NULL

Scenario:

  • When John Doe moves to a new address, a new row is added with updated information and an effective date.
  • The historical record is preserved by updating the end date of the previous record.

SCD Type 3 (Limited History):
Customer IDCustomer NameCurrent AddressPrevious Address
CUST1001John Doe456 Oak Ave123 Maple St

Scenario:

  • Only the latest change is recorded, and the previous address is stored in a separate column. Historical data is limited to the last update.

Combining Surrogate Keys and SCD

When designing a data warehouse:

  • Use surrogate keys as primary keys in dimension tables to avoid reliance on mutable natural keys.
  • Implement SCD strategies to retain accurate historical data for reporting and analysis.
  • Store natural keys as reference columns to map back to the source system if needed.

Customer Dimension Table with Surrogate Keys and SCD Type 2:

Surrogate KeyCustomer IDCustomer NameAddressEffective DateEnd Date
1CUST1001John Doe123 Maple St2024-01-012024-05-31
2CUST1001John Doe456 Oak Ave2024-06-01NULL

Fact Table Example:

Transaction IDSurrogate Key (Customer)Transaction AmountTransaction Date
1011$1002024-03-15
1022$2002024-07-20

Key Insight:

  • The Fact Table uses the Surrogate Key to reference the dimension table.
  • Transactions before June 2024 are linked to the old address (key 1), and later transactions are linked to the updated address (key 2).

Surrogate keys and SCD are essential for building a reliable, scalable, and history-aware data warehouse. Surrogate keys provide stability, while SCD techniques ensure accurate tracking of changes. Together, they enable organizations to make data-driven decisions with confidence.

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *