Section 1 (Data Warehousing Concepts):-
What is a Data Warehouse:
A data warehouse is typically built on top of some type of database, so you can think of the data warehouse as the usage and the database as the platform.
It's also important to note that the data inside a data warehouse comes from elsewhere (operational systems or external sources).
We don't create data for the first time in the data warehouse; transactions occur and are recorded in various operational systems, and their data is subsequently sent down to the data warehouse.
Data is copied, not moved. There are rules for how we build and organize our data warehouse.
Rule 1: Data comes from a number of different sources.
Rule 2: It should be subject oriented (re-organize the data by subject).
Rule 3: Time variant (store historical data).
Rule 4: Non-volatile (data is periodically loaded into the data warehouse).
Finally, we do all of this to support data-driven decision-making.
Reasons to Build a Data Warehouse:
1. Support making decisions in a data-driven manner.
Data warehouse helps us make decisions based on data.
We need a view into the past, present, and future (or what we predict about the future) in different areas of our organization.
It even allows us to explore the unknown by using powerful analytics to discover interesting and important insights from large amounts of data.
2. One-Stop Shopping:
All necessary data is stored in one location rather than being scattered across various transactional and operational applications.
This simplifies data-driven decision-making, which used to be tedious and problematic due to data being spread out.
When we combine these views of our business data, we have what is known as Business Intelligence (BI).
Before data warehouses, data-driven decision-making involved retrieving data from multiple sources or extract files, which was time-consuming and challenging. With data warehousing, we integrate all data into one place, making it easier to focus on data analysis rather than data gathering and integration.
Compare a Data Warehouse to a Data Lake:
Data Warehouse: Built on a relational database like Microsoft SQL Server, Oracle, or IBM Db2 (sometimes built on a multidimensional database, or cube). Relational databases are versatile and used for transactional systems and applications, not just data warehousing.
Data Lake: Built on a big data environment; supports rapid intake of new and changed data; works with a variety of data types.
Three V's of Big Data:
Volume: Large amounts of data.
Velocity: Rapid intake of data.
Variety: Different types of data.
SQL (the standard relational database language) can be used for both traditional BI (business intelligence) and big data environments. This means traditional BI can be done with either a data warehouse or a data lake.
In summary, data warehouses and data lakes both support data-driven decision making, and their synergies enhance business intelligence capabilities.
Compare a Data Warehouse to Data Virtualization:
Data Virtualization:
Instead of copying data, access it from its original locations as needed for reports and analytics.
Unlike data warehousing, no data is copied into a separate database.
Acts as a read-only distributed database, accessing original data locations when needed.
Data virtualization can complement data warehousing and data lakes to enhance data-driven decision making through business intelligence and analytics.
Simple End-to-End Data Warehousing:
Key Points:
Data Warehouse Construction:
A data warehouse is built by pulling data from other applications and systems. We identify our data sources and the data warehouse itself.
ETL Process:
ETL stands for Extract, Transform, and Load. ETL is a critical aspect of data warehousing, moving data from sources to the warehouse.
Data Marts:
After data is in the warehouse, it can be further copied into smaller environments called data marts. Data marts are subsets of the data, tailored for specific groups of users or business functions.
Analogy:
Think of data sources as suppliers.
The data warehouse acts as a wholesaler that collects data from various suppliers.
Data marts are like retailers, providing specific data subsets to users.
Section 2 (Data Warehousing Architecture):-
Data Warehousing Architecture:
Centralized Data Warehouse:
Centralized Data Warehouse: Uses a single database to support business intelligence and analytics.
Data Marts:
Data Marts: Smaller, more focused versions of a data warehouse.
Component-Based Data Warehousing:
Component-Based Approach: Multiple components (data warehouses and data marts) work together to form an overall data warehousing environment. This emphasizes the word environment, not a single centralized database.
Architectural Options:
Specialized Databases (Cubes): We'll briefly look at using cubes as part of your data warehousing environment.
Operational Data Store: A variation of a data warehouse, useful for certain architectures.
Staging Layer:
Staging Layer: The initial segment of the data warehouse where data is loaded before transformation and integration.
Types of Staging Layers:
Persistent Staging Layer: Keeps data for future use.
Non-Persistent Staging Layer: Temporary storage for immediate processing.
Build a Centralized Data Warehouse:
Single Environment: All data from various sources feeds into a single database.
One-Stop Shopping: All the data needed for reporting, business intelligence, and analytics is in one place, like a data supermarket.
Advantages:
Simplified Access: Easy access to all data in one location.
Challenges:
Technological Hurdles: Early relational databases were new and faced many challenges with large data volumes.
Trial and Error: Early data warehousing was developed through trial and error, leading to mistakes in design and implementation.
Organizational Cooperation: High level of cooperation needed across different parts of the organization, which can be difficult to achieve.
To overcome these challenges, we can turn to data lakes:
Data Lakes:
Built on big data technology, data lakes can be seen as a successor to traditional data warehouses.
While data lakes are highly distributed, they often appear monolithic and centralized to users, enhancing the idea of one-stop shopping for data.
In summary, while a centralized data warehouse offers simplicity and ease of access, it has historically faced technological and organizational challenges.
However, with modern advancements and the emergence of data lakes, achieving a centralized approach to data for business decision-making is becoming more feasible.
Compare a Data Warehouse to Data Marts:
Types of Data Marts:
Dependent Data Marts:
Dependent on a Data Warehouse: They draw data from the data warehouse. Without a data warehouse, they can't function.
Uniform Data: Typically have mostly uniform data across all marts.
Straightforward Architecture: Data flows from the warehouse to the marts in an organized manner.
Independent Data Marts:
Independent from a Data Warehouse: They get data directly from source applications, not from a data warehouse.
No Uniformity: Data is often not uniform across marts, and key data like customers or products might be represented differently in each mart.
Complex Architecture: Results in a "spaghetti architecture" with multiple lines of data feeding into various marts directly from source systems.
Comparing Data Warehouses and Independent Data Marts:
Data Warehouse:
Multiple Sources: Usually have dozens or even hundreds of data sources.
ETL Process: Data is extracted, transformed, and loaded (ETL) directly into the warehouse.
Large Data Volumes: Handles very large data volumes.
Dimensional Organization: Data is organized dimensionally.
Independent Data Mart:
Fewer Sources: Typically between 1 to 6 sources.
Similar to Data Warehouse: Other properties are similar to those of a data warehouse, except for the number of data sources.
The distinction between data warehouses and independent data marts can be blurred. Both serve to organize and provide data for analysis, but their structure and sources differ.
Think of them as different tools in your data management toolkit, each with its own strengths and best use cases.
Decide Which Component-Based Architecture is your best fit:
Centralized vs. Component-Based Data Warehousing:
Centralized Environment:
Advantages: Offers "one-stop shopping" for data, leveraging modern technology for success.
Challenges: Requires high cross-organization cooperation and data governance. Small changes can have widespread effects.
Component-Based Environment:
Advantages: Isolates changes, allows mix-and-match technology, and can overcome organizational challenges.
Challenges: Often leads to inconsistent data across components and difficulty in cross-integrating components.
Choosing Between Centralized and Component-Based Approaches:
Centralized Approach:
Enterprise Data Warehouse (EDW): Satisfies analytical needs of the entire enterprise.
Data Lakes: Use big data technology, appearing centralized to users but distributed underneath.
Component-Based Approach:
Architected Path: Integrates data warehouses and data marts, with options like dependent data marts (e.g., Corporate Information Factory).
Non-Architected Path: Independent data marts without integration, leading to inconsistent data but simpler implementation.
Types of Architected Data Warehousing:
Dependent Data Marts:
Example: Corporate Information Factory (CIF), which follows specific rules for data access and architecture.
Front-End Data Marts:
Configuration: Data marts are in front of the data warehouse, with primary analytics occurring at the data mart level.
Data Warehouse Dimensional Bus:
Concept: Developed by Ralph Kimball, using conformed dimensions to ensure consistency across data marts.
Last Resort: Federated Data Warehouse:
Description: Collection of independent data marts with no integration, leading to different answers for the same queries.
Recommendation: Only use as a last resort due to its limitations and the proliferation of data marts.
Summary:
Default Choice: Aim for a centralized data warehouse for simplicity and consistency.
Component-Based Approach: Useful for isolating changes and mixing technologies but can lead to inconsistency.
Federated Data Warehouse: Considered outdated and should be avoided unless necessary.
Include Cubes in your Data warehouse environment:
Cubes (Multi-Dimensional Databases):
Not a relational database.
Specialized database aware of data dimensions.
Used historically as an alternative to RDBMSs.
Cubes were a leading alternative; products like Express, Essbase, and PowerPlay were built on cubes. They are still used, mainly for smaller-scale data warehouses and data marts.
Cubes have a rigid data structure: structural changes are complex and time-consuming, and there is more variation from product to product than with relational databases.
In summary, while relational databases are the default choice, cubes provide fast query response for smaller data sets and can be a valuable part of a mixed data warehousing environment.
Include Operational Data Stores in your Data Warehousing Environment:
An ODS focuses on current operational data, unlike a data warehouse which focuses on historical data.
Data Feeds: ODS uses real-time data feeds from source applications, while data warehouses use batch-oriented ETL feeds.
Primary Focus: What's happening right now, using current data integrated from various sources.
Business Intelligence: Helps answer real-time questions, unlike historical or predictive analytics.
ODSs were popular for operational decision-making, while data warehouses focused on strategic decisions.
Coexistence:
Option 1: Parallel feeds from source systems to both ODS and data warehouse.
Option 2: Data first sent to ODS, then to the data warehouse.
Faster, more up-to-date data warehouses with less latency have reduced the need for separate ODS.
An ODS is crucial for real-time operational data, unlike data warehouses that focus on historical data.
Although their use has declined with advancements in big data and faster data warehousing, ODSs remain relevant for specific critical applications.
Explore the Role of the Staging Layer Inside a Data Warehouse:
Staging Layer Importance:
Every data warehouse needs a staging layer to temporarily hold incoming data from source applications.
Think of it as a landing zone for data before it's processed and moved into the data warehouse.
Data from source applications is quickly and non-intrusively copied to the staging layer. This process is part of the Extraction (E) in ETL (Extract, Transform, Load).
Two Layers:
Staging Layer: Holds raw data temporarily.
User Access Layer: Where users access data for reports, BI, and analytics. This is what users see as the data warehouse.
Includes star schemas, snowflake schemas, fact tables, and dimension tables.
Focus on extracting data (E), not transforming it. Push transformations into the data warehousing environment, not the source applications.
In short:
The staging layer is essential for handling incoming data from source applications. It temporarily holds raw data, allowing non-intrusive extraction and ensuring smooth data flow into the data warehouse.
This separation of extraction and transformation processes ensures efficient data management and supports robust decision-making capabilities.
Compare the two types of Staging Layers:
Overview of Staging Layers:
Staging Layer: Temporary holding area for incoming data from source applications before it is transformed and moved to the user access layer.
User Access Layer: Where users access the data for reports, BI, and analytics.
Non-Persistent Staging Layer:
Data is not retained after it has been processed. The staging layer is emptied after data is moved to the user access layer.
Advantages:
Less Storage Space: Data is not stored long-term, reducing storage requirements.
Efficiency: Simplifies management since old data doesn't accumulate.
Disadvantages:
Rebuilding: If the user access layer gets corrupted, data must be re-extracted from the source systems.
Data Quality Assurance: Requires access to source systems to verify data quality, since the staging layer doesn't retain data.
Persistent Staging Layer:
Data is retained even after it has been processed and moved to the user access layer. Acts as a historical record of incoming data.
Advantages:
Rebuilding: If the user access layer gets corrupted, it can be rebuilt using data from the staging layer without needing to access source systems.
Data Quality Assurance: Allows for easy comparison between the staging layer and the user access layer.
Disadvantages:
More Storage Space: Requires more storage since data is retained.
Potential Uncontrolled Access: Risk of power users accessing data directly in the staging layer, bypassing governance protocols.
In short:
Both non-persistent and persistent staging layers have their own set of benefits and drawbacks.
The choice between them depends on factors such as storage availability, data integrity requirements, and the need for data quality assurance.
Understanding these differences allows us to design more effective and efficient data warehousing solutions tailored to our specific needs.
Section 3 (Bringing Data into Your Data Warehouse):-
Compare ETL to ELT:
ETL: Extract, Transform, Load
Extract:
Quickly pull data from various source applications.
Traditionally done in batches (e.g., every hour, daily, weekly).
Transform:
Prepare and standardize the data from different sources.
Handle complex transformations to ensure uniformity (e.g., standardizing product data, customer information).
This phase is crucial for ensuring that data from multiple sources can be compared and analyzed together.
Load:
Move the transformed data into the user access layer of the data warehouse or data mart.
The data is now ready for business intelligence (BI) and analytics.
Challenges with ETL:
Requires significant business analysis and data modeling before data storage.
A lot of preparatory work is needed to ensure data integrity and usability.
ELT: Extract, Load, Transform
Extract:
Pull data from various source applications, similar to ETL.
Collect data in its rawest form, regardless of structure (structured, semi-structured, unstructured).
Load:
Load the raw data directly into a big data environment, such as Hadoop Distributed File System (HDFS) or AWS S3 buckets.
This step eliminates the need for upfront data modeling and analysis.
Transform:
Transform data when needed, using the computational power of big data environments.
Allows for on-demand data transformation based on specific BI and analytics needs.
Advantages of ELT:
Flexibility: Defer schema creation and data modeling until data is needed for analysis.
Performance: Utilize the computational power of big data environments for transformation, improving performance with large datasets.
Schema on Write (ETL): Schema is defined when data is written to the storage. Data structure is fixed upfront.
Schema on Read (ELT): Schema is defined when data is read for analysis. Data structure is flexible until needed.
Use Cases:
Traditional Data Warehousing (ETL):
Suitable for environments with relational databases and OLAP cubes.
Ideal for structured data and scenarios requiring predefined schemas.
Big Data and Data Lakes (ELT):
Suitable for environments with unstructured or semi-structured data.
Ideal for scenarios requiring flexibility in data modeling and analysis.
ETL and ELT are both essential data processing techniques with distinct advantages and use cases.
ETL remains prevalent in traditional data warehousing, where structured data and predefined schemas are essential.
ELT is gaining popularity with big data technologies, providing flexibility and improved performance for large volumes of diverse data.
Understanding the differences and applications of ETL and ELT allows organizations to choose the best approach for their data processing needs.
Design the Initial Load ETL:
Initial Loading ETL
Overview:
Initial Loading ETL is typically a one-time process performed before a data warehouse goes live.
The primary goal is to gather all the necessary data to get the data warehouse up and running.
This involves extracting relevant data, transforming it, and loading it into the user access layer of the data warehouse.
Steps Involved:
Extract:
Pull relevant data from various source systems.
Focus on gathering data that will be essential for business intelligence (BI) and analytics.
Transform:
Standardize and clean the extracted data.
Ensure data consistency and quality before loading it into the warehouse.
Load:
Move the transformed data into the staging area, then into the user access layer of the data warehouse.
This prepares the data for use in BI and analytics.
Key Considerations:
Relevance: Only bring in data that is relevant and necessary for BI and analytics.
Absolutely Needed Data: Data required for standard reports and visualizations.
Probably Needed Data: Data that is closely related to the essential data and likely to be used soon.
Historical Data: Include historical data to support historical analysis and trend reporting.
One-Time Process: Initial Loading ETL is typically performed once. However, it might need to be redone in specific scenarios:
Data Warehouse Issues: If the data warehouse is corrupted or re-platformed.
Complete Overhaul: When a significant redesign or overhaul of the data warehouse is required.
Incremental ETL
Overview:
Incremental ETL is used to keep the data warehouse up to date after the initial loading.
This process involves regularly updating the data warehouse with new and updated data from source systems.
Difference from Initial Loading ETL:
Initial Loading ETL is a comprehensive, one-time data load, while Incremental ETL is an ongoing process.
Incremental ETL ensures that the data warehouse remains current by adding new data and updating existing data regularly.
Initial Loading ETL is a critical step in setting up a data warehouse, ensuring that all necessary data is in place before the system goes live.
It focuses on relevance, extracting, transforming, and loading data that is essential for BI and analytics.
While typically a one-time process, it may need to be redone in certain situations.
Following the initial load, Incremental ETL processes are used to keep the data warehouse up to date, ensuring continuous availability of fresh data for analytical purposes.
Compare Different Models for Incremental ETL:
Incremental ETL
Overview:
Incremental ETL is performed regularly to update the data warehouse with new, modified, and deleted data.
This process ensures the data warehouse remains current, reflecting the latest changes from the source systems.
Key Activities:
1.Adding New Data
2.Updating Modified Data
3.Handling Deleted Data
Purpose:
The main goal of incremental ETL is to keep the data warehouse up to date. It balances the need for non-volatility (ensuring static data during strategic planning) with the need for current information.
Incremental ETL Patterns
There are four major patterns used in incremental ETL:
Append Pattern:
New data is added to the existing tables without altering the existing records.
Suitable for scenarios where historical data accumulation is important.
In-Place Update Pattern:
Existing rows are updated with new values, but the number of rows remains unchanged.
Commonly used for type one slowly changing dimensions in dimensional modeling.
Complete Replacement Pattern:
An entire segment of the data warehouse is overwritten, even if most data remains unchanged.
Less common in modern data warehousing but might be used in legacy systems.
Rolling Append Pattern:
Maintains a fixed time window of historical data by appending new data and deleting the oldest equivalent data.
For example, keeping 36 months of sales history and updating it weekly.
Incremental ETL is crucial for keeping a data warehouse updated with the latest data.
Understanding and applying the right ETL patterns ensure that your data warehouse can support timely and accurate business intelligence and analytics.
The choice of pattern depends on the specific requirements and architecture of the data warehouse, with append and in-place updates being the most commonly used in modern systems.
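To make two of these patterns concrete, here is a minimal SQL sketch; the sales_fact and staging table names, the week_key values, and the retention boundary are all hypothetical, and a real implementation would normally run inside an ETL tool rather than as ad hoc SQL.

    -- Append pattern: add the newly arrived rows without touching existing ones
    INSERT INTO sales_fact (product_key, store_key, week_key, sales_amount)
    SELECT product_key, store_key, week_key, sales_amount
    FROM   stg_weekly_sales;

    -- Rolling append pattern: keep a fixed time window by also removing the oldest data
    DELETE FROM sales_fact
    WHERE  week_key < 202152;   -- illustrative boundary for a 36-month rolling window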
Explore the role of Data Transformation:
The transformation step ensures that the data is uniform and well-structured, facilitating effective business intelligence and analytics.
Data transformation is pivotal in creating a cohesive, functional data warehouse.
By unifying data values, types, sizes, and eliminating duplicates, we ensure the data is consistent and reliable for business intelligence and analytics.
Understanding these transformation models and applying them effectively is crucial for building robust data warehousing solutions.
Goals of Data Transformation
1. Uniformity: Ensure that data from various sources looks consistent, allowing for accurate comparisons and analyses.
2. Restructuring: Organize raw data from the staging layer into a well-engineered set of data structures suitable for the data warehouse.
Transformation Models
1. Data Value Unification: Objective: Standardize how data values are represented across different systems.
2. Data Type and Size Unification: Objective: Standardize the data types and sizes for similar data across different systems.
3. Data De-duplication: Objective: Remove duplicate entries to avoid counting the same entity multiple times.
4. Dropping Columns (Vertical Slicing): Objective: Remove unnecessary columns from the data to streamline what gets loaded into the data warehouse.
5. Row Filtering (Horizontal Slicing): Objective: Filter out rows based on specific values to include only relevant data in the data warehouse.
6. Error Correction: Objective: Correct known errors in the source data during the ETL process to maintain data accuracy and uniformity.
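A hedged SQL sketch of several of these transformations applied during the move from the staging layer into a dimension table (all table names, columns, and codes are hypothetical, and surrogate-key generation is omitted for brevity):

    INSERT INTO customer_dim (customer_natural_key, customer_name, gender_code)
    SELECT customer_id,
           customer_name,
           -- Data value unification: standardize gender codes from different source systems
           CASE WHEN UPPER(gender) IN ('F', 'FEMALE') THEN 'F'
                WHEN UPPER(gender) IN ('M', 'MALE')   THEN 'M'
                ELSE 'U'
           END AS gender_code
    FROM (
        SELECT customer_id, customer_name, gender,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY last_updated DESC) AS rn
        FROM   stg_customers
        WHERE  record_status = 'ACTIVE'      -- row filtering (horizontal slicing)
    ) s
    WHERE s.rn = 1;                          -- de-duplication: keep one row per customer
    -- Dropping columns (vertical slicing) happens implicitly: only the listed columns are carried forward.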
Implement Mix-and-Match Incremental ETL:
ETL feeds might seem conceptually similar: copying data from source systems and feeding it into a data warehouse regularly to keep everything updated.
However, upon closer inspection, you'll notice that each ETL feed can have different frequencies and patterns.
ETL Feed Frequencies:
Daily Updates: Some ETL feeds update the data warehouse daily. This is suitable for data that changes frequently but doesn't require real-time updates.
Hourly Updates: For more critical data, hourly updates ensure the data warehouse stays up-to-date with minimal lag.
Weekly Updates: For less volatile data, weekly updates are sufficient, reducing the load on the system and optimizing resources.
ETL Patterns:
Appends: New data is added to existing records without altering previous data.
Rolling Appends: Similar to appends but with a mechanism to handle and remove older data, maintaining a rolling window of recent data.
Complete Replacements: Entire tables or datasets are replaced with new data, ensuring complete accuracy but potentially using more resources.
In-Place Updates: Specific records are updated directly in the data warehouse, which is efficient for small, frequent changes.
Mix-and-Match Philosophy:
Each ETL feed can use a different combination of frequency and pattern based on the specific needs of the data.
This flexibility extends down to the table level within each source system.
Section 4 (Data Warehousing Design: Building Blocks):-
Decide What Your Data Warehouse Will Be Used For:
Purpose of Data Warehousing:
The primary objective of a data warehouse is to enable data-driven decisions. Ideally, these decisions should encompass:
The Past: Historical analysis to understand trends and patterns.
The Present: Real-time or near real-time analysis to monitor current operations.
The Future: Predictive analytics to forecast future outcomes.
The Unknown: Exploratory analytics to discover new insights
Synergy Between BI and Data Warehousing:
Business intelligence (BI) and data warehousing emerged simultaneously, influencing and driving each other's growth.
This synergistic relationship means that as BI evolved, it required more robust and sophisticated data warehousing solutions, and vice versa
Categories of Business Intelligence:
Basic Reporting: Traditional reports generated regularly to provide insights into business operations.
Online Analytical Processing (OLAP): Tools that allow users to interactively analyze multidimensional data from multiple perspectives.
Dimensional Modeling:
Dimensional modeling involves organizing data into fact and dimension tables, optimized for query performance and ease of understanding. This model is well-suited for:
Aggregating data: Summarizing data to support reporting.
Multidimensional analysis: Enabling OLAP functionalities to drill down into data, slice and dice it, and pivot across various dimensions.
Predictive and Exploratory Analytics:
If your data warehouse aims to support predictive or exploratory analytics, you'll need different data models. These analytics require:
Advanced Statistical Models: Often requiring raw, unstructured data.
Machine Learning Models: Needing large datasets that may not fit into traditional dimensional models.
Data Lakes: To store vast amounts of raw data in its native format.
Design Engineering and Dimensional Modeling:
In the context of dimensional modeling, we'll explore:
Fact Tables: Central tables that contain quantitative data for analysis.
Dimension Tables: Surrounding tables that describe the dimensions (attributes) related to the facts.
The Basic principles of Dimensionality:
Focus on Basic Reporting and OLAP
Purpose: Enable data-driven decisions using basic reporting and online analytical processing (OLAP).
Dimensional Structure: Organize data dimensionally to easily slice, dice, and drill down into data for better insights.
Importance of Measurements and Context
Example:
With Context: "What is the average annual faculty salary by rank, by department, by academic year?"
Measurement: Average annual faculty salary.
Context: Rank, department, academic year.
Understanding "By" and "For"
By: Groups data by every possible value in a dimension.
For: Filters data for specific values in a dimension.
Organize data by measurements with clear, dimensional context to derive meaningful insights for decision-making.
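In SQL terms, "by" maps naturally to GROUP BY and "for" maps to a WHERE filter; here is a minimal sketch against a hypothetical faculty_salary_fact table:

    -- "By": average annual faculty salary by rank, by department, by academic year
    SELECT faculty_rank, department, academic_year, AVG(annual_salary) AS avg_salary
    FROM   faculty_salary_fact
    GROUP BY faculty_rank, department, academic_year;

    -- "For": the same measurement, but only for one department and one academic year
    SELECT faculty_rank, AVG(annual_salary) AS avg_salary
    FROM   faculty_salary_fact
    WHERE  department = 'Mathematics'
      AND  academic_year = '2023-2024'
    GROUP BY faculty_rank;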
Compare Facts, Fact Tables, Dimensions and Dimension Tables:
Facts and Fact Tables
Facts: Numeric, quantifiable measurements (e.g., salary, number of credits, dollar amounts).
Fact Tables: Store these measurements in a relational database.
Dimensions and Dimension Tables
Dimensions: Provide context for facts (e.g., academic department, student major, campus building).
Dimension Tables: Store this contextual information.
Key Points
Fact vs. Logical Fact: A data warehousing fact is a measurable metric, not just a true statement.
Storing Facts: Facts go into fact tables; these tables hold our measurable data.
Storing Dimensions: Dimensions go into dimension tables; these tables provide context to our facts.
Star vs. Snowflake Schema
Star Schema: Combines all levels of a dimension hierarchy into a single table.
Snowflake Schema: Splits each level of a dimension hierarchy into separate tables.
Important Notes
Facts are different from fact tables.
We use "dimension" and "dimension table" interchangeably, though they can technically differ.
Understanding schemas (star vs. snowflake) helps clarify how dimensions are structured.
Compare Different forms of additivity in facts:
Types of Facts:
Additive Facts: Refer to the ability to add facts (can be added in all scenarios).
Non-Additive Facts: Cannot be added up (store the components and calculate ratios, averages, or percentages at report time).
Semi-Additive Facts: Sometimes additive, sometimes not (require careful handling; typically used in specific types of fact tables).
Compare a Star Schema to a Snowflake Schema:
Key Points Recap:
Data Warehouse: Drives analytical decision making.
Business Intelligence (BI): A key part of decision making.
OLAP (Online Analytical Processing): Supports BI through dimensional analysis.
Dimensional Data: Involves facts and dimensions, stored in fact tables and dimension tables.
Star Schema:
Structure: All levels of a hierarchy (e.g., product, product family, category) are in a single dimension table.
Appearance: Resembles a star with one fact table at the center and dimension tables around it.
Joins: Fewer database joins due to single dimension tables.
Storage: Requires more storage due to repeated data.
Normalization: Often described as de-normalized, though that's not strictly accurate.
Snowflake Schema:
Structure: Each level of a hierarchy is in its own table (e.g., product in one table, product family in another).
Appearance: Resembles a snowflake with multiple related tables.
Joins: Requires more database joins due to split dimension tables.
Storage: Requires less storage as it avoids data repetition.
Normalization: Often described as normalized, though that's not strictly accurate.
Example Comparison:
Star Schema Example:
Fact Table: Budget Fact
Dimension Tables:
Academic Department DIM
Time DIM
Expense Category DIM
Hierarchy: All levels squished into single dimension tables.
Snowflake Schema Example:
Fact Table: Budget Fact
Dimension Tables:
Academic Department DIM
College DIM (related to Academic Department)
Semester DIM (related to Time DIM)
Academic Year DIM (related to Semester DIM)
Expense Category DIM
OPEX/CAPEX DIM (related to Expense Category)
Summary:
Both schemas contain the same dimensions but differ in table representation.
Star Schema: Simpler structure, fewer joins, more storage.
Snowflake Schema: More complex structure, more joins, less storage.
Database Keys for Data Warehousing:
Fundamentals of Databases and Keys:
Schemas: Whether using a star or snowflake schema in a relational database management system (RDBMS), logical relationships connect data across different tables.
Keys: Officially establish relationships through keys.
Primary Keys: Unique identifiers for each row in a table. They can be single or multiple columns.
Foreign Keys: Primary keys from other tables, used to designate logical relationships, enforce data integrity, and improve query performance.
Natural vs. Surrogate Keys:
Natural Keys: Originated from source systems, can be cryptic or understandable. They travel into the data warehouse from source systems.
Surrogate Keys: Generated within the data warehouse, have no business meaning, and are used to relate data across tables.
Key Design Decisions:
Use of Surrogate or Natural Keys:
Guidance: Use surrogate keys as primary and foreign keys. Avoid using natural keys due to data integrity concerns.
Natural Keys in Dimension Tables:
Recommendation: Keep natural keys in dimension tables as secondary keys, even if not used for primary-foreign key relationships.
Natural Keys in Fact Tables:
Guidance: Generally, do not use natural keys in fact tables. This will be discussed in detail later when setting up fact tables and managing history with slowly changing dimensions.
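As a small illustration of this guidance, here is a hypothetical dimension table in which a warehouse-generated surrogate key serves as the primary key while the natural key from the source system is kept only as a secondary attribute:

    CREATE TABLE student_dim (
        student_key    INTEGER      NOT NULL,   -- surrogate key, generated inside the data warehouse
        student_id     VARCHAR(20)  NOT NULL,   -- natural key from the source system, kept as a secondary key
        student_name   VARCHAR(100),
        home_state     VARCHAR(2),
        PRIMARY KEY (student_key)
    );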
Section 5 (Designing Facts, Fact Tables, Dimensions, and Dimension Tables):-
Design Dimension Tables for Star schemas and Snowflake Schemas:
Foundational Concepts:
Dimensions: Provide context for measurements (facts) in data warehousing terminology.
Dimension Table vs. Dimension: Not interchangeable terms; a dimension refers to the concept, while a dimension table is a specific implementation in a relational database.
Primary Keys: Each row in a relational database needs a unique primary key. In data warehousing, surrogate keys are preferred over natural keys for primary keys.
Foreign Keys: Definition: Links to primary keys of other tables to establish relationships.
Star Schema: Structure: Single table with all necessary fields.
Snowflake Schema: Structure: Normalized into separate tables for each level of hierarchy.
Rules for Snowflake Schema:
Non-terminal Dimensions: Have a primary key (surrogate key) and foreign key (surrogate key from higher level).
Terminal Dimensions: Highest level in hierarchy, has only a primary key (surrogate key).
Understanding these schema types and key principles helps in designing effective dimensional tables for data warehousing.
Each schema type—star and snowflake—offers different advantages depending on the data structure and hierarchical relationships needed.
Foreign keys play a crucial role in maintaining these relationships across tables.
This foundational understanding sets the stage for designing robust data warehouse architectures tailored to specific business needs.
Four Main Types of Data Warehousing Fact Tables:
It's important to understand that a fact (the actual measurement) is not the same as a fact table (the structure where these measurements are stored).
Here are the four types of fact tables we will discuss:
Transaction Fact Table
Periodic Snapshot Fact Table
Accumulating Snapshot Fact Table
Factless Fact Table
Transaction Fact Table:
Used for recording facts or measurements from individual transactions that occur in source systems.
Managed at the appropriate level of detail within the data warehouse.
Periodic Snapshot Fact Table:
Tracks measurements at regular intervals, providing periodic readings of specific metrics.
Not focused on individual transactions but on periodic snapshots.
Accumulating Snapshot Fact Table:
Tracks the progress of a well-defined business process through its various stages.
Provides snapshots at different points in time to show the progression.
Factless Fact Table:
Recording the occurrence of an event or transaction without needing to record any actual measurements.
Documenting coverage or eligibility relationships, even if no event occurred.
Role of Transaction Fact Tables:
It's known as the transaction grained fact table, referring to the level of detail present in any given transaction fact table.
Overview of Transaction Fact Tables
Core of Dimensional Models:
Transaction fact tables are central to our dimensional models.
Most of the fact tables you build will be of this type.
Function:
They store the facts or measurements from transactions occurring in the business world, governmental agencies, schools, or any organization for which the data warehouse is built.
Structure:
You can store more than one fact in a single fact table, but there are rules governing which facts should be stored together and which should be kept separate.
Context for Facts:
Facts alone are insufficient for meaningful data analysis.
Context is necessary to make informed decisions based on data.
Structuring a Fact Table
You might think the fact table should look like a report, but that's not the case. Instead, we use surrogate keys to provide context.
Surrogate Keys: Use surrogate keys, not natural keys, as primary and foreign keys.
Linking Fact and Dimension Tables
Foreign Key Relationships: The surrogate keys in the fact table act as foreign keys, linking to primary keys in the dimension tables.
Transaction fact tables are essential for capturing and storing detailed transaction data.
They use surrogate keys to link to dimension tables, providing the necessary context for the facts they contain.
This structure allows for efficient data management and meaningful analysis in a data warehousing environment.
Summary
Transaction Fact Tables:
Store facts related to transactions occurring in the business.
Must follow rules to determine if multiple facts can be stored together.
Key Rules:
Facts must be at the same grain (level of detail).
Facts must occur simultaneously.
Primary and Foreign keys for Fact tables:
Fact Tables: Primary keys are composed of foreign keys linking to dimension tables.
Dimension Tables: Use surrogate keys as primary keys, and retain natural keys from source systems as secondary keys.
Primary Key: Unique identifier, often a combination of surrogate keys in fact tables.
Foreign Key: Links fact tables to dimension tables.
Natural Key: Unique identifier from source systems, not used as the primary key in fact tables.
Role of periodic snapshot fact tables:
This type of fact table captures regular, periodic measurements of data, providing an aggregated, easy-to-access view of data over time.
There are two main types of periodic snapshot fact tables:
Periodic Snapshot Fact Table: Linked to transactional data but aggregated for periodic analysis.
Periodic Snapshot Fact Table (Non-Transactional): Measures levels that aren't directly linked to transactional data.
Key Characteristics
Periodic Measurement: Captures data at regular intervals (e.g., weekly, monthly).
Simplified Analysis: Facilitates easier and more efficient analysis of periodic trends.
Two Types:
Transactional Periodic Snapshot: Derived from transactional data.
Benefits:
Aggregated Data: Provides a summary view, facilitating easier trend analysis.
Simplified Queries: Reduces complexity in SQL queries for periodic analysis.
Clear Business Insights: Helps track and analyze regular balances or levels efficiently.
Non-Transactional Periodic Snapshot: Measures levels without underlying transactions.
Summary
Periodic Snapshot Fact Tables: Used for regular, periodic measurements.
Two Types: Transactional and Non-Transactional.
Simplified Analysis: Facilitates easier trend analysis and reduces query complexity.
Structure: Both types have similar structures, differing only in the source of their data.
Periodic Snapshots and Semi-Additive Facts:
Semi-Additive Facts: Essential for certain types of analysis in periodic snapshot fact tables.
Operations: Cannot be summed over time but can be averaged or summed within a specific time period.
Use Case: Periodic snapshot fact tables are invaluable for measuring and analyzing regular levels over time,
such as end-of-week balances or periodic measurements in non-transactional scenarios.
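For example, with a hypothetical account_balance_fact periodic snapshot table, a balance can be summed across accounts for a single snapshot date, but should be averaged rather than summed across time:

    -- Valid: sum balances across accounts for one snapshot date
    SELECT snapshot_date_key, SUM(ending_balance) AS total_balance
    FROM   account_balance_fact
    WHERE  snapshot_date_key = 20240628
    GROUP BY snapshot_date_key;

    -- Valid: average a single account's balance across snapshot dates
    SELECT account_key, AVG(ending_balance) AS avg_balance
    FROM   account_balance_fact
    GROUP BY account_key;

    -- Misleading: SUM(ending_balance) across snapshot dates would double-count the same money.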
Role of Accumulating Snapshot Fact Tables:
The accumulating snapshot fact table. This type is used to track the progress of a business process with formally defined stages.
Purpose of Accumulating Snapshot Fact Tables
These fact tables are designed to:
Measure elapsed time spent in each stage of a business process.
Track both completed and in-progress phases.
Provide a snapshot of where one or more business processes are within their life cycles.
Include multiple relationships from the fact table back to a single dimension table, unlike the one-to-one relationships in other fact tables.
Accumulating snapshot fact tables are powerful tools for tracking the progress and elapsed time of business processes with multiple stages.
They help in understanding the life cycle of processes and in making data-driven decisions based on the stages of these processes.
Accumulating Snapshot Fact Table Ex:
Accumulating snapshot fact tables are powerful tools for tracking the progress of complex business processes, allowing for detailed analysis and easy visualization of the data flow.
By examining a single row of data, you can analyze the entire process for one or more students, making it easier to identify patterns, averages, and total amounts at different stages.
This type of fact table simplifies the analysis of business processes with multiple stages and participants.
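A minimal sketch of an accumulating snapshot fact table for a hypothetical student admissions process (the stages, columns, and names are illustrative); note the multiple date keys that all point back to the same date dimension:

    CREATE TABLE admissions_pipeline_fact (
        student_key                     INTEGER NOT NULL,
        application_date_key            INTEGER,      -- stage 1
        acceptance_date_key             INTEGER,      -- stage 2 (NULL until the stage is reached)
        deposit_date_key                INTEGER,      -- stage 3
        enrollment_date_key             INTEGER,      -- stage 4
        days_application_to_acceptance  INTEGER,      -- elapsed-time facts
        days_acceptance_to_deposit      INTEGER,
        PRIMARY KEY (student_key)
    );
    -- Each date key is a foreign key back to the same date dimension, and the row is
    -- updated in place as the student moves through the stages of the process.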
Factless Fact Tables:
Fact tables in a data warehouse typically house the measurements we want to track, but a fact is not synonymous with a fact table.
The factless fact table, however, serves two unique purposes:
Recording the occurrence of a transaction without any measurable metrics.
Documenting a formal record of coverage or eligibility relationships.
Factless fact tables are versatile tools in a data warehouse for tracking the occurrence of events and relationships without measurable metrics.
They simplify the analysis of data involving transactions or associations where the presence of records alone signifies meaningful events or relationships.
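A hedged sketch of a factless fact table recording class attendance events (all names are hypothetical); the existence of a row is itself the record, with no numeric measurement stored:

    CREATE TABLE class_attendance_fact (
        student_key  INTEGER NOT NULL,
        course_key   INTEGER NOT NULL,
        date_key     INTEGER NOT NULL,
        PRIMARY KEY (student_key, course_key, date_key)
    );
    -- No measure columns: a row simply records that a given student attended a given course on a given date.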
Fact Tables Comparison of Star schema vs snowflake Schema:
Hierarchical Data in Fact Tables: In a data warehouse, we often deal with hierarchical data.
Star Schema for Fact Tables: In a star schema, the hierarchical information is consolidated into a single dimension table.
Snowflake Schema for Fact Tables: In a snowflake schema, the hierarchical data is normalized into multiple dimension tables.
Efficient Design for Snowflake Schema Fact Tables
Instead of including multiple foreign keys, we follow a similar approach to the star schema:
Include only the lowest level foreign key (Faculty_Key) in the fact table.
Use the relationships defined in the dimension tables to navigate the hierarchy.
Summary
Star Schema:
All hierarchical data in a single dimension table.
Fact table contains a single foreign key to the dimension table.
Snowflake Schema:
Hierarchical data normalized into multiple dimension tables.
Fact table contains only the lowest level foreign key.
Use joins to navigate the hierarchy in queries.
This approach ensures a streamlined and effective design for fact tables in both star and snowflake schemas.
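As an illustration of navigating a snowflake hierarchy in a query (column names are hypothetical, but the tables follow the budget example used earlier), the fact table carries only the lowest-level key and joins walk up the hierarchy:

    SELECT c.college_name,
           SUM(f.budget_amount) AS total_budget
    FROM   budget_fact f
    JOIN   academic_department_dim d ON f.department_key = d.department_key
    JOIN   college_dim c             ON d.college_key    = c.college_key
    GROUP BY c.college_name;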
SQL for Dimension and Fact Tables:
Star Schema Dimension Table: In a star schema, dimension tables typically contain denormalized data.
Snowflake Schema Non-Terminal Dimension Table: Non-terminal dimension tables in a snowflake schema have foreign keys pointing to higher-level dimension tables.
Snowflake Schema Terminal Dimension Table: Terminal dimension tables in a snowflake schema have no further levels in the hierarchy and thus no foreign keys.
Transaction-Grained Fact Table: Transaction-grained fact tables record individual events or transactions.
Periodic Snapshot Fact Table: Periodic snapshot fact tables capture data at regular intervals.
In summary, the SQL statements for creating tables in a data warehouse follow a consistent pattern:
Define columns with data types and constraints.
Specify primary keys for all tables.
Use foreign keys to establish relationships between fact tables and dimension tables.
Ensure that foreign keys reference the appropriate primary keys in their respective dimension tables.
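A minimal sketch that follows this pattern, reusing the hypothetical budget example (types and names are illustrative, and it assumes college_dim, time_dim, and expense_category_dim are created in the same style):

    -- Snowflake non-terminal dimension: surrogate primary key plus a foreign key to the next level up
    CREATE TABLE academic_department_dim (
        department_key    INTEGER      NOT NULL,
        department_id     VARCHAR(10)  NOT NULL,     -- natural key kept as a secondary key
        department_name   VARCHAR(100),
        college_key       INTEGER      NOT NULL,
        PRIMARY KEY (department_key),
        FOREIGN KEY (college_key) REFERENCES college_dim (college_key)
    );

    -- Fact table: primary key composed of foreign keys to its dimension tables
    CREATE TABLE budget_fact (
        department_key        INTEGER NOT NULL,
        time_key              INTEGER NOT NULL,
        expense_category_key  INTEGER NOT NULL,
        budget_amount         DECIMAL(12,2),
        PRIMARY KEY (department_key, time_key, expense_category_key),
        FOREIGN KEY (department_key)       REFERENCES academic_department_dim (department_key),
        FOREIGN KEY (time_key)             REFERENCES time_dim (time_key),
        FOREIGN KEY (expense_category_key) REFERENCES expense_category_dim (expense_category_key)
    );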
Section 6 (Managing Data Warehouse History through Slowly Changing Dimensions):-
Slowly Changing Dimensions (SCDs) and Data Warehouse History:
SCDs are crucial for managing historical data. They allow us to handle changes in dimension data over time, which is essential for maintaining the time-variant nature of a data warehouse.
Type 1 SCD:
This method overwrites old data with new data. It is the simplest approach and does not retain any history.
This method is typically used when fixing errors or when historical data is not necessary.
Type 2 SCD:
This method maintains unlimited history by creating a new row in the dimension table for each change.
Each row contains a version of the data with effective start and end dates, allowing us to keep track of historical changes.
Type 3 SCD:
This method keeps a limited history by storing previous values in additional columns within the same row.
It allows for easy comparison between the current and previous values but is limited in the amount of history it can store.
Type 0 SCD: The original value is never updated, retaining the initial value permanently.
Type 4 SCD: Uses a "mini-dimension" for frequently changing attributes to avoid excessive versioning in the main dimension table.
Type 5, 6, and 7 SCDs: These are hybrid approaches combining elements of the previous types to suit specific needs.
Design a type 1 SCD:-
Advantages:
Simplicity: Type 1 is the simplest and most straightforward SCD type.
Error Correction: Content errors, such as an erroneous birth date, are purged forever.
It is commonly used for error correction and situations where we decide not to retain historical information for certain columns of data.
Disadvantages:
Loss of History, Inconsistent Reporting, Overuse
Type 1 SCDs are useful for simple error corrections where retaining historical data is unnecessary.
However, be mindful of the potential for inconsistent reporting and the tendency to overuse this method in quick, simple data warehouse solutions.
Design a type 2 SCD:-
With Type 2, the existing row of data that's going to be changed remains exactly as is; you don't do anything to it at all.
Instead, you add a brand new row to the dimension table, and that new row will reflect the current state of all the attributes, including the data that's going to change.
Advantages of Type 2:
Historical Accuracy: Maintains historical data accurately, which is crucial for reporting and analytics.
Complete History: Allows for accurate reporting over time, reflecting the true state of data at any given point in the past.
Disadvantages of Type 2:
Increased Storage: Requires more storage as new rows are added for changes.
Complexity in Reporting: Reporting and analytics can become more complex due to the need to handle multiple versions of the same entity.
Fact Table Complications: Changes in dimension data impact fact tables, potentially leading to more complex queries and data management.
Type 2 SCDs are powerful for maintaining historical accuracy but introduce complexity in data management and reporting.
Properly handling these complexities ensures accurate and meaningful analytics over time.
Maintain Correct Data Order with Type 2 SCDs:
With Type 2 Slowly Changing Dimensions (SCDs) we need to address a specific challenge: maintaining the correct order of our data.
Solution 1: Current Flag
Solution 2: Effective Date and Expiration Date
By implementing a combination of current flag and effective/expiration dates,
you can maintain a clear and accurate history of changes while easily identifying the current state of your data.
Managing Type 2 SCDs requires careful handling of data versioning and sequencing.
The combined use of current flags and effective/expiration dates provides a comprehensive solution to these challenges.
Design a type 3 SCD:-
Type 3 SCDs offer a middle ground between Type 1 (no history) and Type 2 (infinite history).
Type 3 SCDs add new columns, rather than rows, to reflect changes.
This setup includes an "old value" column and a "new value" column, allowing easy back-and-forth switching for flexible reporting and analytics.
This approach is particularly useful for scenarios where all data in a dimension is reorganized.
Use Cases and Limitations:
Reorganization:
All rows change at the same time.
Easily switch between old and new structures for reporting.
Random Changes:
Not suited for frequent, individual changes (e.g., a student's address).
Creates unwieldy sets of columns with lots of empty data.
Limitations:
Not ideal for frequent or random changes.
Adding more than two columns can be unwieldy.
Best for scenarios where all data changes simultaneously.
Summary:
Type 3 SCDs provide a flexible and straightforward solution for specific reorganization scenarios.
They simplify reporting and analysis by adding columns for old and new values, allowing easy comparison and trending.
However, they are not suitable for frequent, individual changes and become cumbersome with multiple columns.
For complex histories, reverting to Type 2 SCDs may be more appropriate.
Section 7 (Designing Your ETL):-
Building your ETL Design from your ETL Architecture:
ETL Architecture vs. Design:
Architecture: Overall structure and framework for ETL processes.
Design: Detailed implementation, including specific dimension and fact tables, and SCD patterns (Type 1, Type 2, Type 3).
Key ETL Design Practices:
Limiting Incoming Data, Processing Dimension Tables First, Parallel Processing.
Dimension Table ETL:
ETL for dimension tables, focusing on a star schema. ETL for snowflake schemas can be more complex due to dependencies among various dimension tables. Therefore, we'll use a star schema example to illustrate ETL processes more clearly.
ETL Process Overview:
Data Preparation:
Filter incoming data to only new and modified records.
Use Change Data Capture (CDC) to detect changes at the source system level.
Change Data Capture (CDC):
Timestamp Comparison: Most modern relational databases use timestamps to track changes. Compare the row's timestamp with the timestamp of the last ETL run.
Database Logs: If timestamps are unavailable, compare data with transaction logs.
Direct Comparison: As a last resort, compare incoming data with the existing data warehouse records.
Transformation:
Apply necessary transformations to standardize and clean the data.
Use the common transformation model discussed in ETL architecture.
Processing Dimension Tables:
New Rows: Insert new rows into dimension tables.
Type 1 Changes: Apply in-place updates for minor changes.
Type 2 Changes: Add new rows for significant changes, maintaining history.
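A rough SQL sketch of the preparation and new-row steps (the tables, the run timestamp, and the surrogate-key sequence are all hypothetical, and sequence syntax varies by database; dedicated ETL tools typically handle this work):

    -- Change data capture: limit incoming data to rows changed since the previous ETL run
    INSERT INTO stg_students (student_id, student_name, home_state, last_updated)
    SELECT student_id, student_name, home_state, last_updated
    FROM   source_students
    WHERE  last_updated > TIMESTAMP '2024-06-30 02:00:00';   -- timestamp of the last run

    -- Brand-new students: insert into the dimension with a freshly generated surrogate key
    INSERT INTO student_dim (student_key, student_id, student_name, home_state)
    SELECT NEXT VALUE FOR student_key_seq, s.student_id, s.student_name, s.home_state
    FROM   stg_students s
    WHERE  NOT EXISTS (SELECT 1 FROM student_dim d WHERE d.student_id = s.student_id);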
Process SCD Type 1 Changes to a Dimension Table:
This process follows a specific sequence, beginning with data preparation, followed by necessary transformations, and then handling new data by adding surrogate keys.
Once new data is processed, we focus on Type 1 changes before moving on to Type 2 changes.
Sequence of Processing:
Data Preparation:
Limit incoming data to only new and modified records.
Perform necessary transformations on the data.
For new data, add surrogate keys alongside natural keys and other relevant information.
Processing Type 1 Changes:
Type 1 changes involve a basic in-place update to correct any data errors, resulting in the loss of previous values.
This is typically done using a simple SQL update statement to overwrite old data with new data.
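A minimal sketch of such a Type 1 in-place correction, reusing the hypothetical student_dim table and natural key:

    -- Type 1 SCD: overwrite the erroneous value in place; the previous value is lost
    UPDATE student_dim
    SET    home_state = 'CA'              -- corrected value arriving from the source system
    WHERE  student_id = 'TYOUN21';        -- natural key identifies the dimension member to fix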
Process SCD Type 2 Changes to a Dimension Table:
Type 2 changes involve capturing historical data by adding new rows instead of updating existing ones. This allows us to track changes over time.
Steps for Type 2 Changes:
Incoming Data:
Identify the natural key (e.g., Student_ID: TYOUN21) from the incoming data.
Generate New Surrogate Key:
Use the natural key to guide the generation of a new surrogate key within the data warehouse.
Insert New Row:
Add the new row with the updated information and the newly generated surrogate key.
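A hedged sketch of these steps, assuming the hypothetical student_dim also carries the current-flag and effective/expiration-date columns described in Section 6 (names and values are illustrative):

    -- Step 1: expire the existing current row for this student (it is otherwise left untouched)
    UPDATE student_dim
    SET    current_flag    = 'N',
           expiration_date = DATE '2024-08-14'
    WHERE  student_id   = 'TYOUN21'
      AND  current_flag = 'Y';

    -- Step 2: insert a brand-new row with a new surrogate key reflecting the current state
    INSERT INTO student_dim
           (student_key, student_id, student_name, home_state,
            current_flag, effective_date, expiration_date)
    VALUES (NEXT VALUE FOR student_key_seq, 'TYOUN21', 'T. Young', 'OR',
            'Y', DATE '2024-08-15', DATE '9999-12-31');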
Design ETL for Fact Tables:
Fact Tables in ETL:
Fact tables store quantitative data for analysis and typically use a basic append model for ETL. However, there are nuances that we need to address.
ETL Process Overview:
Process Dimension Tables First:
Ensure all new and modified data in dimension tables is processed, including handling Type 1 and Type 2 changes.
Verify that any new surrogate keys generated during the dimension processing are available for use in the fact tables.
Fact Table ETL:
Use the basic append model to add new rows to the fact tables.
Address complications such as ensuring correct surrogate key assignment from dimension tables.
Key Takeaways:
Relevant Date Usage:
Use the payment or transaction date to determine the correct surrogate key from the dimension table.
Handling Multiple Versions:
Ensure accurate surrogate key assignment by considering the effective and expiration dates.
ETL Tool Assistance:
Utilize ETL tools to streamline the process, leveraging built-in functions for handling Type 1, Type 2, and other changes.
Processing fact tables involves more than just appending new data;
it requires careful consideration of the corresponding dimension data to ensure accurate historical tracking and data integrity.
By understanding these complexities and using appropriate tools, you can effectively manage ETL processes in a data warehouse environment.
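As a rough illustration of that surrogate-key lookup (the payment staging table and fact table are hypothetical), the join uses the natural key together with the transaction date checked against the dimension row's effective and expiration dates:

    INSERT INTO tuition_payment_fact (student_key, date_key, payment_amount)
    SELECT d.student_key,                          -- surrogate key of the version valid on the payment date
           s.payment_date_key,
           s.payment_amount
    FROM   stg_tuition_payments s
    JOIN   student_dim d
      ON   d.student_id   = s.student_id           -- natural key match
     AND   s.payment_date >= d.effective_date      -- pick the dimension row in effect
     AND   s.payment_date <= d.expiration_date;    -- on the payment date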