Thursday, November 16, 2023
SCD2 — Semantics and Styles
SCD2 — Semantics and Styles AI News, AI, AI tools, Chad Isenberg, Innovation, itinai.com, LLM, t.me/itinai, Towards Data Science - Medium 🔹 The Semantics of Differing SCD2 Techniques 🔹 How small differences can have a big impact In dimensional modeling, it's crucial to represent different types of history effectively in your data warehouse or lakehouse. While there are many resources on building SCD2 tables, let's focus on the semantics of SCD2 and how design choices can impact use cases. Choose your row-versioning dates wisely The dates you choose to row-version your dimensions matter significantly. It's important to consider your most common use cases when designing your SCD2 table. Determining access patterns How you row-version records will determine the access patterns against your tables. Ergonomics play a role in data quality, so making it easy for users to do the right thing should be a priority for data modelers. Choosing reference dates When creating an SCD2 table, the most common approach is to use a date or timestamp in your data. You can choose from three options: 1️⃣ Extract timestamps: Focuses on the raw data captured in your warehouse, prioritizing the loading processes. 2️⃣ Source system timestamps: Considers the raw data when the source system created or updated it. 3️⃣ Business timestamps: Looks at the business entity in relation to a business date. Choosing the format of valid_to and valid_from To determine the effective dates of your records, you can use a strategy based on update columns. dbt snapshots offer this functionality with their timestamp strategy. Note that when the valid_to of the "old" record and the valid_from of the replacing record are equal, strict inequality is required in query patterns. Bonus Round: SCD2 vs. dimensional snapshots SCD2 introduces complexity to data models, and it's worth questioning if this modeling exercise is always necessary. Maxime Beauchemin explores this idea in depth in one of data engineering's seminal works. Wrapping Up Dimensional modeling is a powerful tool for data and analytics engineers. Tracking history is essential for certain analytics use cases and provides valuable insights into operational workflows. While there are various approaches to SCD2, it's crucial to make conscious decisions. These seemingly small changes can have a significant impact in actual usage. So, when you have to explain why a "missing" record isn't actually missing but not valid when expected, you'll understand the importance of these choices. 🔗 Useful Links: - AI Lab in Telegram @aiscrumbot – free consultation - SCD2 — Semantics and Styles - Towards Data Science – Medium - Twitter – @itinaicom
Labels:
AI,
AI News,
AI tools,
Chad Isenberg,
Innovation,
itinai.com,
LLM,
t.me/itinai,
Towards Data Science - Medium
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment