TCS Data Engineering Interview most asked Questions and Answers

Data engineering has become a crucial business domain in big data and cloud computing. If you are preparing for a TCS Data Engineering interview, you should be ready to tackle questions ranging from SQL queries to cloud services. This blog will guide you through some of the most commonly asked questions and how to answer them effectively.

Table of Contents

1. What is Data Engineering?

Data Engineering is the process of designing, building, and maintaining data pipelines that transform raw data into meaningful insights. It involves data extraction, transformation, and storage for analytical and operational use.

2. What are the different types of Slowly Changing Dimensions (SCD)?

Slowly Changing Dimensions (SCD) refer to managing historical changes in a database.

There are several types:

SCD Type 0 (Fixed Dimensions)

No changes are allowed once the data is inserted.
Example:

CUSTOMER_ID | CUSTOMER_NAME | CUSTOMER_EMAIL
------------------------------------------
1           | John Doe      | john.doe@gmail.com

If CUSTOMER_NAME changes, it is not updated.

SCD Type 1 (Overwrite)

Overwrites old data with new data without maintaining history.
Example:

UPDATE CUSTOMER_MASTER 
SET CUSTOMER_NAME = 'John Smith' 
WHERE CUSTOMER_ID = 1;

The old name ‘John Doe’ is replaced with ‘John Smith’.

SCD Type 2 (Historical Tracking)

Maintains history by adding new records with effective dates.
Example

INSERT INTO CUSTOMER_MASTER (CUSTOMER_ID, CUSTOMER_NAME, START_DATE, END_DATE, IS_CURRENT)
VALUES (1, 'John Doe', '2022-01-01', '2023-06-30', 0);

INSERT INTO CUSTOMER_MASTER (CUSTOMER_ID, CUSTOMER_NAME, START_DATE, END_DATE, IS_CURRENT)
VALUES (1, 'John Smith', '2023-07-01', NULL, 1);

The old record is closed (IS_CURRENT=0), and a new one is inserted.

SCD Type 3 (Limited History)

Stores the current and previous values within the same row.
Example:

ALTER TABLE CUSTOMER_MASTER ADD COLUMN PREVIOUS_NAME VARCHAR(50);

UPDATE CUSTOMER_MASTER
SET PREVIOUS_NAME = CUSTOMER_NAME, CUSTOMER_NAME = 'John Smith'
WHERE CUSTOMER_ID = 1;

The previous value is retained in PREVIOUS_NAME.

SCD Type 4 (History Table)

Uses a separate history table to track changes.
Example:

CREATE TABLE CUSTOMER_HISTORY AS SELECT * FROM CUSTOMER_MASTER WHERE 1=0;

INSERT INTO CUSTOMER_HISTORY SELECT * FROM CUSTOMER_MASTER WHERE CUSTOMER_ID = 1;

UPDATE CUSTOMER_MASTER SET CUSTOMER_NAME = 'John Smith' WHERE CUSTOMER_ID = 1;

The old data is stored in a history table.

SCD Type 6 (Hybrid Approach)

Combines Type 1, Type 2, and Type 3 approaches.
Example:

INSERT INTO CUSTOMER_MASTER (CUSTOMER_ID, CUSTOMER_NAME, PREVIOUS_NAME, START_DATE, END_DATE, IS_CURRENT)
VALUES (1, 'John Smith', 'John Doe', '2023-07-01', NULL, 1);

Maintains current, previous, and historical data.

Which SCD Type Should You Use?

SCD Type	Use Case
SCD 0	Static data (e.g., Birthdate, Country Code)
SCD 1	No history needed (e.g., Email, Phone Number)
SCD 2	Full history needed (e.g., Address, Customer Name)
SCD 3	Limited history required (e.g., Recent Address Change)
SCD 4	Separate history table needed (e.g., Frequent Changes)
SCD 6	Combination of history tracking methods

3. How do you remove duplicate records in SQL?

To remove duplicate records, you can use ROW_NUMBER() or DISTINCT. Example using ROW_NUMBER():

WITH CTE AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY CUSTOMER_ID ORDER BY CUSTOMER_ID) AS row_num
    FROM CUSTOMER_MASTER
)
DELETE FROM CUSTOMER_MASTER
WHERE CUSTOMER_ID IN (
    SELECT CUSTOMER_ID FROM CTE WHERE row_num > 1
);

This ensures that only unique CUSTOMER_ID values are retained.

4. What is the difference between ETL and ELT?

ETL (Extract, Transform, Load): Data is transformed before loading into the destination.
ELT (Extract, Load, Transform): Data is first loaded into the target system (e.g., a data lake) and then transformed as needed.

5. What is Partitioning in Athena and why is it important?

Partitioning improves query performance in AWS Athena by organizing data into segments based on column values (e.g., date, region). It allows faster query execution by scanning only relevant partitions rather than the entire dataset.

6. How do you pass dynamic parameters in AWS Glue?

You can pass parameters using --job-arguments:

aws glue start-job-run --job-name my-glue-job --arguments='{"--start_date":"2024-01-01", "--end_date":"2024-01-31"}'

Then, in your Glue script, access these parameters:

import sys
args = getResolvedOptions(sys.argv, ['start_date', 'end_date'])
start_date = args['start_date']
end_date = args['end_date']

7. How does Terraform help in AWS resource management?

Terraform is an Infrastructure-as-Code (IaC) tool that allows developers to define AWS infrastructure using configuration files. It enables version control, automated provisioning, and scalable deployments.

8. What are the best practices for optimizing queries in Snowflake?

Use clustering on frequently filtered columns.
Use materialized views for repeated queries.
Optimize storage by using columnar compression.
Utilize result caching to speed up query execution.

Conclusion

Being well-versed in SQL, cloud services, and data modeling concepts will help you excel in a TCS Data Engineering interview. Keep practicing and stay updated with the latest advancements in big data technologies!