Question: You are required to design and develop a Data Warehousing system. Select a particular organization and identify …
You are required to design and develop a Data Warehousing system
- Select a particular organization and identify a Data Warehousing solution
- Select Data Warehousing tools you want to use in your development
- Analyze and design the selected system
- Design the Multidimensional modeling
- Specify the Data Warehousing Architectural Design and Infrastructure
- Create the tables (Facts and Multidimensional tables)
- Integrate your system with Extraction, Transformation, and Load (ETL) system
- Create Interfaces to interact with your system
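One of the tasks above is to create the fact and dimension tables. As a minimal sketch of what that might look like, assuming SQLite as the warehouse engine and a hypothetical retail schema (table and column names are illustrative, not prescribed by the assignment):

```python
import sqlite3

# In-memory database stands in for the warehouse (hypothetical retail schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: the entities the business keeps records about.
cur.execute("""CREATE TABLE dim_item (
    item_key INTEGER PRIMARY KEY,
    item_name TEXT, item_type TEXT, item_brand TEXT)""")
cur.execute("""CREATE TABLE dim_location (
    location_key INTEGER PRIMARY KEY,
    city TEXT, country TEXT)""")

# Fact table: measures (units_sold, revenue) plus foreign keys to dimensions.
cur.execute("""CREATE TABLE fact_sales (
    item_key INTEGER REFERENCES dim_item(item_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    time_key TEXT,          -- e.g. '2024-Q1'
    units_sold INTEGER,
    revenue REAL)""")

cur.execute("INSERT INTO dim_item VALUES (1, 'Keyboard', 'Electronics', 'Acme')")
cur.execute("INSERT INTO dim_location VALUES (1, 'New Delhi', 'India')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, '2024-Q1', 120, 3600.0)")

total = cur.execute("SELECT SUM(revenue) FROM fact_sales").fetchone()[0]
```

The same star-schema layout carries over to any production warehouse engine; only the DDL dialect changes.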
Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources; it supports analytical reporting, structured and/or ad hoc queries, and decision making. Data warehousing involves data cleaning, data integration, and data consolidation.
Using Data Warehouse Information
There are decision support technologies that help utilize the data available in a data warehouse. These technologies help executives use the warehouse quickly and effectively. They can gather data, analyze it, and make decisions based on the information present in the warehouse. The information gathered in a warehouse can be used in any of the following domains −
- Tuning Production Strategies − The product strategies can be well tuned by repositioning the products and managing the product portfolios by comparing the sales quarterly or yearly.
- Customer Analysis − Customer analysis is done by analyzing the customer’s buying preferences, buying time, budget cycles, etc.
- Operations Analysis − Data warehousing also helps in customer relationship management, and making environmental corrections. The information also allows us to analyze business operations.
Integrating Heterogeneous Databases
To integrate heterogeneous databases, we have two approaches −
- Query-driven Approach
- Update-driven Approach
Query-driven Approach
This is the traditional approach to integrating heterogeneous databases. In this approach, wrappers and integrators are built on top of multiple heterogeneous databases. These integrators are also known as mediators.
Process of Query-Driven Approach
- When a query is issued to a client side, a metadata dictionary translates the query into an appropriate form for individual heterogeneous sites involved.
- Now these queries are mapped and sent to the local query processor.
- The results from heterogeneous sites are integrated into a global answer set.
Disadvantages of Query-Driven Approach
- This approach needs complex integration and filtering processes.
- It is very inefficient and very expensive for frequent queries.
- This approach is also very expensive for queries that require aggregations.
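The query-driven process described above can be sketched in a few lines of Python. Here the "metadata dictionary" maps global field names to each site's local names, and the two source sites are hypothetical in-memory tables:

```python
# Two hypothetical heterogeneous sources with different local schemas.
site_a = [{"cust": "alice", "spend": 40}, {"cust": "bob", "spend": 25}]
site_b = [{"customer_name": "alice", "amount": 10}]

# The metadata dictionary: translates global field names into each
# site's local field names.
metadata = {
    "site_a": {"customer": "cust", "spend": "spend"},
    "site_b": {"customer": "customer_name", "spend": "amount"},
}

def query_spend(customer):
    """Translate the global query for each site, run it locally,
    then integrate the partial results into a global answer."""
    total = 0
    for site_name, rows in [("site_a", site_a), ("site_b", site_b)]:
        fields = metadata[site_name]
        # Local query processor: filter and project using local names.
        for row in rows:
            if row[fields["customer"]] == customer:
                total += row[fields["spend"]]
    return total
```

Even in this toy form, the cost of the approach is visible: every query re-reads and re-integrates the sources, which is why it becomes expensive for frequent or aggregating queries.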
Update-driven Approach
This is an alternative to the traditional approach. Today's data warehouse systems follow the update-driven approach rather than the query-driven approach discussed above. In the update-driven approach, the information from multiple heterogeneous sources is integrated in advance and stored in a warehouse. This information is available for direct querying and analysis.
This approach has the following advantages −
- This approach provides high performance.
- The data is copied, processed, integrated, annotated, summarized, and restructured in a semantic data store in advance.
- Query processing does not require an interface to process data at local sources.
Functions of Data Warehouse Tools and Utilities
The following are the functions of data warehouse tools and utilities −
- Data Extraction − Involves gathering data from multiple heterogeneous sources.
- Data Cleaning − Involves finding and correcting the errors in data.
- Data Transformation − Involves converting the data from legacy format to warehouse format.
- Data Loading − Involves sorting, summarizing, consolidating, checking integrity, and building indices and partitions.
- Refreshing − Involves propagating updates from the data sources to the warehouse.
Note − Data cleaning and data transformation are important steps in improving the quality of data and data mining results.
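The extraction, cleaning, transformation, and loading functions listed above can be sketched as one small pipeline. The legacy source format (DD-MM-YYYY dates, string-typed fields) and the field names are hypothetical:

```python
# Hypothetical legacy source: string-typed fields, DD-MM-YYYY dates.
legacy_rows = [
    {"id": "1", "sale_date": "05-03-2024", "amount": "100"},
    {"id": "2", "sale_date": "", "amount": "250"},  # incomplete record
    {"id": "3", "sale_date": "17-03-2024", "amount": "75"},
]

def extract(source):
    """Data extraction: gather rows from the source."""
    return list(source)

def clean(rows):
    """Data cleaning: drop incomplete records."""
    return [r for r in rows if r["sale_date"]]

def transform(rows):
    """Data transformation: convert legacy format to warehouse format."""
    out = []
    for r in rows:
        d, m, y = r["sale_date"].split("-")
        out.append({"id": int(r["id"]),
                    "sale_date": f"{y}-{m}-{d}",  # ISO dates in the warehouse
                    "amount": float(r["amount"])})
    return out

def load(rows, warehouse):
    """Data loading: sort and append into the warehouse store."""
    warehouse.extend(sorted(rows, key=lambda r: r["sale_date"]))

warehouse = []
load(transform(clean(extract(legacy_rows))), warehouse)
```

A real ETL tool adds integrity checks, index building, and partitioning on top of this skeleton, but the stage boundaries are the same.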
Data Warehousing – Terminologies
In this chapter, we will discuss some of the most commonly used terms in data warehousing.
Metadata is simply defined as data about data. Data that is used to describe other data is known as metadata. For example, the index of a book serves as metadata for the contents of the book. In other words, metadata is the summarized data that leads us to the detailed data.
In terms of a data warehouse, we can define metadata as follows −
- Metadata is a road-map to the data warehouse.
- Metadata in a data warehouse defines the warehouse objects.
- Metadata acts as a directory. This directory helps the decision support system to locate the contents of a data warehouse.
Metadata repository is an integral part of a data warehouse system. It contains the following metadata −
- Business metadata − It contains the data ownership information, business definition, and changing policies.
- Operational metadata − It includes currency of data and data lineage. Currency of data refers to the data being active, archived, or purged. Lineage of data means the history of the migrated data and the transformations applied to it.
- Data for mapping from operational environment to data warehouse − This metadata includes the source databases and their contents, data extraction, data partitioning, cleaning, transformation rules, and data refresh and purging rules.
- The algorithms for summarization − It includes dimension algorithms, data on granularity, aggregation, summarizing, etc.
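The four kinds of repository content listed above can be illustrated as a single structure. All names and values here are hypothetical, just to show how the categories fit together:

```python
# Illustrative metadata repository entries for one warehouse table
# (all names and values are hypothetical).
metadata_repository = {
    "business": {                     # business metadata
        "owner": "Sales Operations",
        "definition": "Revenue recognized at point of sale",
    },
    "operational": {                  # operational metadata
        "currency": "active",         # active / archived / purged
        "lineage": ["extracted from pos_db.orders",
                    "currency converted to USD"],
    },
    "mapping": {                      # operational-to-warehouse mapping
        "source": "pos_db.orders",
        "transformation_rules": ["amount_cents / 100 -> revenue"],
        "refresh": "nightly",
    },
    "summarization": {                # summarization algorithms
        "granularity": "daily",
        "aggregations": ["SUM(revenue) BY item, location"],
    },
}

def lineage(repo):
    """A decision support tool uses the repository as a directory."""
    return repo["operational"]["lineage"]
```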
A data cube helps us represent data in multiple dimensions. It is defined by dimensions and facts. The dimensions are the entities with respect to which an enterprise preserves the records.
Illustration of Data Cube
Suppose a company wants to keep track of sales records with the help of a sales data warehouse, with respect to time, item, branch, and location. These dimensions allow the company to keep track of monthly sales and of the branches where items were sold. Each dimension has an associated table, known as a dimension table. For example, the “item” dimension table may have attributes such as item_name, item_type, and item_brand.
The following table represents the 2-D view of Sales Data for a company with respect to time, item, and location dimensions.
But here in this 2-D table, we have records with respect to time and item only. The sales for New Delhi are shown with respect to time, and item dimensions according to type of items sold. If we want to view the sales data with one more dimension, say, the location dimension, then the 3-D view would be useful. The 3-D view of the sales data with respect to time, item, and location is shown in the table below −
The above 3-D table can be represented as 3-D data cube as shown in the following figure −
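The cube described above can also be sketched directly in code: a 3-D cube is a mapping from (time, item, location) to a measure, and the 2-D views are roll-ups along one dimension. The sales figures below are hypothetical:

```python
# A tiny 3-D data cube keyed by (time, item, location); values are
# units sold (hypothetical figures).
cube = {
    ("Q1", "Keyboard", "New Delhi"): 120,
    ("Q1", "Mouse",    "New Delhi"): 200,
    ("Q2", "Keyboard", "New Delhi"): 150,
    ("Q1", "Keyboard", "Mumbai"):    80,
}

def rollup_by_location(cube, location):
    """Collapse the location dimension: the 2-D time x item view
    for one city, like the 2-D sales table for New Delhi."""
    view = {}
    for (time, item, loc), units in cube.items():
        if loc == location:
            view[(time, item)] = view.get((time, item), 0) + units
    return view

delhi = rollup_by_location(cube, "New Delhi")
```

OLAP engines store and index such cubes far more efficiently, but the dimensional structure is exactly this.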
Data marts contain a subset of organization-wide data that is valuable to specific groups of people in an organization. In other words, a data mart contains only the data that is specific to a particular group. For example, the marketing data mart may contain only data related to items, customers, and sales. Data marts are confined to subjects.
Points to Remember About Data Marts
- Windows-based or Unix/Linux-based servers are used to implement data marts. They are implemented on low-cost servers.
- The implementation cycle of a data mart is measured in short periods of time, i.e., in weeks rather than months or years.
- The life cycle of data marts may be complex in the long run, if their planning and design are not organization-wide.
- Data marts are small in size.
- Data marts are customized by department.
- The source of a data mart is a departmentally structured data warehouse.
- Data marts are flexible.
The following figure shows a graphical representation of data marts.
The view over an operational data warehouse is known as virtual warehouse. It is easy to build a virtual warehouse. Building a virtual warehouse requires excess capacity on operational database servers.
Data Warehousing – Delivery Process
A data warehouse is never static; it evolves as the business expands. As the business evolves, its requirements keep changing, and therefore a data warehouse must be designed to keep pace with these changes. Hence a data warehouse system needs to be flexible.
Ideally there should be a delivery process to deliver a data warehouse. However, data warehouse projects normally suffer from various issues that make it difficult to complete tasks and deliverables in the strict and ordered fashion demanded by the waterfall method. Most of the time, the requirements are not understood completely. The architectures, designs, and build components can be completed only after gathering and studying all the requirements.
The delivery method is a variant of the joint application development approach adopted for the delivery of a data warehouse. We have staged the data warehouse delivery process to minimize risks. The approach that we will discuss here does not reduce the overall delivery time-scales but ensures the business benefits are delivered incrementally through the development process.
Note − The delivery process is broken into phases to reduce the project and delivery risk.
The following diagram explains the stages in the delivery process −
IT Strategy
Data warehouses are strategic investments that require a business process to generate benefits. An IT strategy is required to procure and retain funding for the project.
Business Case
The objective of the business case is to estimate the business benefits that should be derived from using a data warehouse. These benefits may not be quantifiable, but the projected benefits need to be clearly stated. If a data warehouse does not have a clear business case, the business tends to suffer from credibility problems at some stage during the delivery process. Therefore, in data warehouse projects, we need to understand the business case for investment.
Education and Prototyping
Organizations experiment with the concept of data analysis and educate themselves on the value of having a data warehouse before settling on a solution. This is addressed by prototyping, which helps in understanding the feasibility and benefits of a data warehouse. Prototyping on a small scale can promote the educational process as long as −
- The prototype addresses a defined technical objective.
- The prototype can be thrown away after the feasibility concept has been shown.
- The activity addresses a small subset of eventual data content of the data warehouse.
- The activity timescale is non-critical.
To produce an early release and deliver business benefits, keep the following points in mind −
- Identify the architecture that is capable of evolving.
- Focus on business requirements and technical blueprint phases.
- Limit the scope of the first build phase to the minimum that delivers business benefits.
- Understand the short-term and medium-term requirements of the data warehouse.
Business Requirements
To provide quality deliverables, we should make sure the overall requirements are understood. If we understand the business requirements for both the short term and the medium term, we can design a solution that fulfils the short-term requirements. The short-term solution can then be grown into a full solution.
The following aspects are determined in this stage −
- The business rules to be applied to the data.
- The logical model for information within the data warehouse.
- The query profiles for the immediate requirement.
- The source systems that provide this data.
Technical Blueprint
This phase needs to deliver an overall architecture satisfying the long-term requirements. It also delivers the components that must be implemented in the short term to derive any business benefit. The blueprint needs to identify the following.
- The overall system architecture.
- The data retention policy.
- The backup and recovery strategy.
- The server and data mart architecture.
- The capacity plan for hardware and infrastructure.
- The components of database design.
Data Warehouse Tools: Why We Need Them?
A data warehouse is a repository that comprises information from one or multiple sources. For example, an e-commerce company can use a data warehouse to integrate and combine diverse customer information, such as email addresses, cash register transactions, and comment cards. The main benefit of a data warehouse is its role in streamlining data for business intelligence (BI). However, the ETL process in the data warehouse is important for the smooth movement of data from one architectural tier to another.
Unlike traditional data warehouses, modern data warehousing solutions automate the repetitive tasks involved in designing, developing, and deploying a data warehouse design to meet fast-changing business requirements. For this reason, many companies leverage data warehousing tools to gather insights.
List of Features that Data Warehouse Tools Should Have
Here are a few use cases and applications that show how data warehouse solutions are helping organizations address data management challenges:
1. Data Cleansing
Many companies use data warehouse tools and techniques for leveraging historical data for critical business decisions. Hence, it is important to ensure that only high-quality data is loaded into a data warehouse. This can be done by making data cleansing a part of the data warehousing process, which can help detect and remove invalid, incomplete, or outdated records from the source datasets.
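A cleansing pass of the kind described above can be sketched as a single filter over the source rows. The field names, validity rules, and cutoff date here are hypothetical:

```python
from datetime import date

# Hypothetical source rows; each problem category from the text appears once.
records = [
    {"email": "a@example.com", "amount": 30.0, "updated": date(2024, 5, 1)},
    {"email": "",              "amount": 10.0, "updated": date(2024, 5, 2)},  # incomplete
    {"email": "b@example.com", "amount": -5.0, "updated": date(2024, 5, 3)},  # invalid
    {"email": "c@example.com", "amount": 20.0, "updated": date(2015, 1, 1)},  # outdated
]

def cleanse(rows, cutoff):
    """Keep only complete, valid, and current records."""
    return [r for r in rows
            if r["email"]                 # complete: required field present
            and r["amount"] >= 0          # valid: no negative amounts
            and r["updated"] >= cutoff]   # current: not outdated

clean_rows = cleanse(records, cutoff=date(2020, 1, 1))
```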
2. Data Transformation and Loading
Data transformation involves modifying data into a format compatible with the target system, such as a database, to simplify data loading.
To streamline the data integration step in a data warehouse, many data warehouse management tools offer built-in transformations, such as aggregate, lookup, join, and filter, making data processing easier.
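The built-in transformations named above (filter, lookup/join, aggregate) map directly onto plain-Python operations. The order and customer rows below are hypothetical:

```python
# Hypothetical source rows for the transformation examples.
orders = [
    {"order_id": 1, "cust_id": 10, "amount": 40.0},
    {"order_id": 2, "cust_id": 10, "amount": 15.0},
    {"order_id": 3, "cust_id": 11, "amount": 99.0},
]
customers = {10: "alice", 11: "bob"}  # lookup table

# filter: keep only rows meeting a condition.
big_orders = [o for o in orders if o["amount"] >= 20.0]

# lookup / join: enrich each order with the customer name.
joined = [{**o, "customer": customers[o["cust_id"]]} for o in orders]

# aggregate: total amount per customer.
totals = {}
for o in joined:
    totals[o["customer"]] = totals.get(o["customer"], 0.0) + o["amount"]
```

In a data warehouse tool these are configured graphically or in SQL, but they compose the same way: filter and join shape the rows, then aggregation summarizes them.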
3. Business Intelligence and Data Analysis
Data warehousing and Business Intelligence (BI) are two distinct but closely interlinked technologies that assist an enterprise in making informed decisions. Organizations have abundant information in raw form in the digital era, which is generally stored in a data warehouse. It is crucial for data warehouse analytics tools to have BI functionality to aid data retrieval as it helps generate business insights.
Data extraction is the process of obtaining data from a database or SaaS platform so that it can be replicated to a destination — such as a data warehouse — designed to support online analytical processing (OLAP).
Data extraction is the first step in a data ingestion process called ETL — extract, transform, and load. The goal of ETL is to prepare data for analysis or business intelligence (BI).
Suppose an organization wants to monitor its reputation in the marketplace. It may have data from many sources, including online reviews, social media mentions, and online transactions. An ETL tool can extract data from these sources and load it into a data warehouse where it can be analyzed and mined for insights into brand perception.
Data extraction does not need to be a painful procedure, for you or for your database.
Types of data extraction
Extraction jobs may be scheduled, or analysts may extract data on demand as dictated by business needs and analysis goals. Data can be extracted in three primary ways:
Update notification
The easiest way to extract data from a source system is to have that system issue a notification when a record has been changed. Most databases provide a mechanism for this so that they can support database replication (change data capture or binary logs), and many SaaS applications provide webhooks, which offer conceptually similar functionality.
Incremental extraction
Some data sources are unable to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of those records. During subsequent ETL steps, the data extraction code needs to identify and propagate changes. One drawback of incremental extraction is that it may not be able to detect deleted records in the source data, because there’s no way to see a record that’s no longer there.
Full extraction
The first time you replicate any source you have to do a full extraction, and some data sources have no way to identify data that has been changed, so reloading a whole table may be the only way to get data from that source. Because full extraction involves high data transfer volumes, which can put a load on the network, it’s not the best option if you can avoid it.
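The contrast between full and incremental extraction can be sketched with a last-modified bookmark. The source table, column names, and timestamps below are hypothetical:

```python
# Hypothetical source table with a last-modified timestamp per row.
source_table = [
    {"id": 1, "name": "alice", "modified": 100},
    {"id": 2, "name": "bob",   "modified": 250},
    {"id": 3, "name": "carol", "modified": 300},
]

def full_extract(table):
    """Reload everything: required the first time, or when the source
    cannot say what changed."""
    return list(table)

def incremental_extract(table, last_seen):
    """Pull only rows modified since the previous run. Deleted rows
    cannot be detected this way: they simply never show up."""
    return [r for r in table if r["modified"] > last_seen]

initial = full_extract(source_table)                        # first replication
changes = incremental_extract(source_table, last_seen=250)  # later run
```

After each incremental run, the pipeline would persist the highest `modified` value it saw and use it as `last_seen` next time.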
The data extraction process
Whether the source is a database or a SaaS platform, the data extraction process involves the following steps:
- Check for changes to the structure of the data, including the addition of new tables and columns. Changed data structures have to be dealt with programmatically.
- Retrieve the target tables and fields from the records specified by the integration’s replication scheme.
- Extract the appropriate data, if any.
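The first step above, detecting structural changes, amounts to comparing the source's current columns against the last known set. The table and column names here are hypothetical:

```python
# Last known column set per source table (hypothetical).
known_columns = {"orders": ["order_id", "cust_id", "amount"]}

def detect_schema_changes(table, current_columns):
    """Report columns added to or removed from a source table since
    the last extraction, so they can be handled programmatically."""
    known = set(known_columns.get(table, []))
    current = set(current_columns)
    return {"added": sorted(current - known),
            "removed": sorted(known - current)}

diff = detect_schema_changes("orders",
                             ["order_id", "cust_id", "amount", "currency"])
```

A real integration would then alter the destination table (or alert an operator) before retrieving and extracting the data.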
Extracted data is loaded into a destination that serves as a platform for BI reporting, such as a cloud data warehouse like Amazon Redshift, Microsoft Azure SQL Data Warehouse, Snowflake, or Google BigQuery. The load process needs to be specific to the destination.