Why stage mainframe data if you don't have to?

Before we get to that, let's cover some basics.

Data Mining

Data Mining is an analytic process designed to explore data for consistent patterns and/or systematic relationships between variables. These findings may be validated by applying the detected patterns to new subsets of data. 

Large amounts of data (also known as "Big Data") are usually involved, typically business- or market-related. The ultimate goal of data mining is to make valid predictions that can be used in direct business applications.

The data mining process consists of three stages: 

(1) The initial exploration, 
(2) model building or pattern identification with validation/verification, and 
(3) deployment - the application of the model to new data, in order to generate predictions. 

Exploration

The exploration stage usually starts with data preparation. This may involve cleaning the data, transforming it, selecting subsets of records, and performing preliminary feature selection to bring the number of variables down to a manageable range.
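
As a rough illustration, here is a minimal data-preparation sketch in Python using pandas; the file name, column names and thresholds are hypothetical, not taken from any particular project.

    import pandas as pd

    # Load the raw extract (file and column names are hypothetical).
    raw = pd.read_csv("customer_extract.csv")

    # Cleaning: drop exact duplicates and rows missing key fields.
    clean = raw.drop_duplicates().dropna(subset=["customer_id", "purchase_amount"])

    # Selecting a subset of records relevant to the analysis.
    clean = clean[clean["region"] == "EMEA"]

    # Preliminary feature selection: drop near-constant numeric columns
    # to bring the number of variables down to a manageable range.
    numeric = clean.select_dtypes(include="number")
    features = numeric.loc[:, numeric.std() > 0.01]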

Model Building and Validation

Here, various models are compared on their predictive performance in order to identify the best one. Several techniques (e.g. Bagging (Voting, Averaging), Boosting, Stacking and Meta-Learning) may be applied in a so-called "competitive evaluation of models".
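
The sketch below illustrates one way such a competitive evaluation might be set up with scikit-learn; the synthetic data, the particular estimators and the five-fold cross-validation are illustrative choices, not a prescription.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                                  StackingClassifier)
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Stand-in for the prepared feature matrix and target from the exploration stage.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)

    candidates = {
        "bagging":  BaggingClassifier(),
        "boosting": GradientBoostingClassifier(),
        "stacking": StackingClassifier(
            estimators=[("bag", BaggingClassifier()),
                        ("boost", GradientBoostingClassifier())],
            final_estimator=LogisticRegression()),
    }

    # Compare predictive performance with five-fold cross-validation
    # and keep the best-scoring model.
    scores = {name: cross_val_score(model, X, y, cv=5).mean()
              for name, model in candidates.items()}
    best_name = max(scores, key=scores.get)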

Deployment

The deployment stage applies the model selected as best to new data, in order to generate predictions or estimates of the expected outcome.
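
Continuing the illustrative setup above, deployment can be as simple as persisting the chosen model and scoring newly arrived records with it; the model, file name and data here are again stand-ins.

    import joblib
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Stand-in for the model selected as best, fitted on all available data.
    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    best_model = GradientBoostingClassifier().fit(X, y)
    joblib.dump(best_model, "best_model.joblib")

    # Later, in production: load the persisted model and generate predictions
    # for new data (here simulated) to estimate the expected outcome.
    X_new, _ = make_classification(n_samples=20, n_features=10, random_state=1)
    model = joblib.load("best_model.joblib")
    predictions = model.predict(X_new)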

As a business information management tool, data mining is used to reveal knowledge structures that can guide decision-making in conditions of limited certainty.  

Problems, in Practice

For business purposes, complex data mining projects may take the combined efforts of various experts, stakeholders, or departments of an entire organization. 

In typical data mining scenarios, large data sets are collected via automated methods (e.g., from the Web), so the data gathered is not tightly controlled. Some of it may contain out-of-range values, impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and so on.
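
A small pandas sketch of the kind of sanity check this implies; the file, column names and the rules are purely illustrative.

    import pandas as pd

    # Hypothetical data set gathered by automated collection.
    df = pd.read_csv("web_collected.csv")

    # Flag out-of-range values.
    bad_age = ~df["age"].between(0, 120)

    # Flag impossible combinations (e.g. Gender: Male, Pregnant: Yes).
    impossible = (df["gender"] == "Male") & (df["pregnant"] == "Yes")

    # Quarantine suspect rows for review rather than silently dropping them.
    suspect = df[bad_age | impossible]
    valid = df[~(bad_age | impossible)]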

And all of that data can be difficult to keep track of.

Data Warehousing

Data warehousing is the process of organizing the storage of large, multi-part data sets so as to facilitate the retrieval of information for analytic purposes.

The information involved will likely derive from several disparate sources. Before you can get that data into the warehouse, you have to extract it, transform it, and load it ("ETL" it) from those sources.
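
In its simplest form, that sequence looks something like the sketch below, which pulls a flat-file export and an operational database table into a single warehouse table; the sources, schemas and SQLite targets are stand-ins for whatever your environment actually uses.

    import sqlite3
    import pandas as pd

    # Extract: two disparate sources.
    orders = pd.read_csv("orders_export.csv")
    src = sqlite3.connect("operational.db")
    customers = pd.read_sql("SELECT customer_id, name, region FROM customers", src)

    # Transform: conform the join key and merge into one analysis-ready shape.
    orders["customer_id"] = orders["customer_id"].astype("int64")
    merged = orders.merge(customers, on="customer_id", how="left")

    # Load: write the result into the warehouse.
    warehouse = sqlite3.connect("warehouse.db")
    merged.to_sql("fact_orders", warehouse, if_exists="replace", index=False)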

What Staging Is

A data staging area, or landing zone, is an intermediate storage area used for data processing during the extract, transform and load (ETL) process. The staging area usually sits between the data source(s) and the data target(s) - which are often data warehouses, data marts or other data repositories.

At a smaller scale, a staging database is a separate storage area created for the purpose of providing continuous access to application data. Using a staging database, you can for example prevent interruption of services to your websites while new business data is updated and tested. Changes may then be approved, before incorporating them into the real-time environment. 
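
As a sketch of that pattern: new data lands in a staging table, is tested, and is only copied into the live table once approved. The table names and the single test shown are placeholders.

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("app.db")

    # Load incoming business data into a staging table; the live "products"
    # table keeps serving the website untouched in the meantime.
    incoming = pd.read_csv("new_products.csv")
    incoming.to_sql("products_staging", conn, if_exists="replace", index=False)

    # Test the staged data before approval (one illustrative check).
    bad = conn.execute(
        "SELECT COUNT(*) FROM products_staging WHERE price <= 0").fetchone()[0]

    # Once approved, promote the staged rows into the real-time table.
    if bad == 0:
        with conn:
            conn.execute("DELETE FROM products")
            conn.execute("INSERT INTO products SELECT * FROM products_staging")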

The Rationale

Overall, staging areas are intended to increase the efficiency of ETL processes, ensure data integrity and support data quality operations. Access to data can continue, even when new data is being imported from various external sources in preparation for staging. 

Using a staging database is beneficial when you have several satellite sites located in different cities and time zones. All sites can be updated from one location at the same time, without the loss of crucial data that might otherwise result in costly downtime.

Another objective of the staging area is to minimize the impact of data extraction on your source system. After you extract data from your source system, a copy of it is stored in your staging database.  Should your ETL fail further down the line, you won't need to impact your source system by extracting the data a second time.   

And if you store the results of each logical step of your data transformation in staging tables, you can restart your ETL from the last successful staging step.
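
A hedged sketch of that restart pattern: each step writes its output to a staging table and records its completion, so a rerun skips anything already done and never touches the source system twice. The step names and control table are invented for illustration.

    import sqlite3
    import pandas as pd

    staging = sqlite3.connect("staging.db")
    staging.execute("CREATE TABLE IF NOT EXISTS etl_progress (step TEXT PRIMARY KEY)")

    def already_done(step):
        return staging.execute(
            "SELECT 1 FROM etl_progress WHERE step = ?", (step,)).fetchone() is not None

    def mark_done(step):
        with staging:
            staging.execute("INSERT OR IGNORE INTO etl_progress VALUES (?)", (step,))

    # Step 1: copy the raw extract into staging once, so a later failure
    # never forces a second extraction from the source system.
    if not already_done("extract"):
        pd.read_csv("source_extract.csv").to_sql(
            "stg_raw", staging, if_exists="replace", index=False)
        mark_done("extract")

    # Step 2: transform from the staged copy, not from the source.
    if not already_done("transform"):
        pd.read_sql("SELECT * FROM stg_raw", staging).dropna().to_sql(
            "stg_clean", staging, if_exists="replace", index=False)
        mark_done("transform")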

Staging Functions

Staging areas can assume the form of tables in relational databases, text-based flat files (or XML files) stored in file systems, or proprietary formatted binary files stored in file systems. Functions of the staging area include:

1) Consolidation of data from multiple source systems.

The staging area acts as a large repository, and it is common to tag data there with additional metadata indicating the system of origin, along with timestamps recording when the data was placed in the staging area.
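
For example, a minimal sketch in pandas; the source name and columns are hypothetical.

    import pandas as pd

    extract = pd.read_csv("crm_export.csv")

    # Tag each staged record with its system of origin and the time it landed.
    extract["source_system"] = "CRM"
    extract["staged_at"] = pd.Timestamp.now(tz="UTC")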

2) Alignment of data.

This is a function closely related to, and acting in support of, master data management capabilities. The alignment of data involves standardization of reference data across multiple source systems, and the validation of relationships between records and data elements from different sources.
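
A minimal sketch of what such standardization can look like in code; the reference mapping and file names are invented for illustration.

    import pandas as pd

    # Each source system encodes country differently; map both onto one standard.
    country_map = {"US": "USA", "U.S.A.": "USA", "United States": "USA",
                   "UK": "GBR", "United Kingdom": "GBR"}

    crm = pd.read_csv("crm_customers.csv")
    billing = pd.read_csv("billing_customers.csv")
    crm["country"] = crm["country"].replace(country_map)
    billing["country"] = billing["country"].replace(country_map)

    # Validate relationships across sources: every billing record
    # must refer to a customer known to the CRM extract.
    unknown = billing[~billing["customer_id"].isin(crm["customer_id"])]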

3) Maintaining consistency of data.

This is achieved by taking advantage of data streaming technologies, by reducing the overhead of repeatedly breaking and re-establishing connections to source systems, and by optimizing concurrency lock management on multi-user source systems.

4) Independent scheduling and management of multiple targets.

In some cases, data may be pulled into the staging area at different times to be held and processed all at once. This might occur when enterprise processing is to be done across multiple time zones each night, for example. In other instances data might be brought into the staging area to be processed at different times. In addition, the staging area may be used to push data to multiple target systems. 

5) Detecting changes.

The staging area supports reliable forms of change detection, like system-enforced timestamping, change tracking or change data capture (CDC). Often, the target data systems do not.
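
In its simplest, timestamp-based form, that might look like the following sketch; the table, column and high-water-mark value are placeholders.

    import sqlite3
    import pandas as pd

    source = sqlite3.connect("source.db")
    staging = sqlite3.connect("staging.db")

    # High-water mark recorded by the previous run, kept in the staging area.
    last_run = "2024-01-01 00:00:00"

    # Pull only rows changed since then, relying on a system-enforced timestamp.
    changed = pd.read_sql(
        "SELECT * FROM orders WHERE last_modified > ?", source, params=(last_run,))
    changed.to_sql("stg_orders_delta", staging, if_exists="append", index=False)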

6) Cleansing the data.

The ETL process using the staging area can be employed to implement business logic which will identify and handle "invalid" data (spelling errors during data entry, missing information, etc.). Data cleansing, also called data cleaning or scrubbing, involves the identification and removal (or update) of this data.

Often, much of the cleaning and transformation work has to be done manually, or by low-level programs that are difficult to write and maintain. Third-party tools are available, and should be used to limit manual inspection and programming effort. 
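
A hedged sketch of that kind of rule-based cleansing; the correction table, defaults and column names are illustrative only.

    import pandas as pd

    staged = pd.read_csv("stg_customers.csv")

    # Fix known spelling errors introduced during data entry.
    corrections = {"Femail": "Female", "Mle": "Male"}
    staged["gender"] = staged["gender"].replace(corrections)

    # Handle missing information: default non-critical fields,
    # and set aside rows missing a key identifier for review.
    staged["region"] = staged["region"].fillna("UNKNOWN")
    rejected = staged[staged["customer_id"].isna()]
    cleansed = staged.dropna(subset=["customer_id"])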

A Better Environment

Ideally, an analytics platform should provide real-time monitoring, reporting, analysis, dashboards, and a robust set of predictive tools that can support the making of smart, proactive business decisions. These solutions should include a flexible data warehousing infrastructure, so that users can access the information they need without worrying about delays, or system disruptions during an upgrade, improvement or disaster.

IBM's zEnterprise uses a hybrid approach to achieve these ends.

Transactional data, data warehouses and analytic tools are located on one platform, which scales easily to create a single copy of data that can be accessed by multiple users and groups. A single analytics tool can then be used across an entire organization, ensuring consistent business rules.

IT organizations often take data that's been captured by the mainframe, and move it to distributed servers for processing. This, despite the fact that the System z architecture can operate at 100% utilization for long, sustained periods, and offers the strongest security in the commercial server marketplace.

Furthermore, spreading analytic components across multiple platforms, and moving the data among them to prepare it for analysis, degrades data quality. That degradation worsens when multiple copies of the data are created to support development, test and production environments.

With fast processors, fast I/O and large memory capacity, the mainframe can in many cases process data more efficiently and more cost-effectively than external distributed systems environments.

With Tools to Match

IBM’s software portfolio includes tools and utilities that can be used to prepare data for analysis. These make it possible to minimize data latency, complexity and costs by bringing data transformation and analytic processes to where the data originates.

Veristorm, an IBM partner, pursues what it calls an "anti-ETL" strategy, in direct contrast to legacy ETL products, which transform data on the mainframe and then stage it in intermediate storage before it is ready to be loaded to a target site.

The Veristorm product provides a GUI in which you can view the data you want to move before streaming it directly from DASD (direct-access storage device; think magnetic disk) into Hadoop/Hive.

The Apache Hive data warehouse software is an open source volunteer project under the Apache Software Foundation. Hive facilitates the querying and managing of large datasets residing in distributed storage. It provides a mechanism to project structure onto this data and query it, using an SQL-like language called HiveQL.

Using Hive, hundreds of users can simultaneously query the data in a language familiar to SQL users, and response times on huge datasets are typically much faster than with comparable query approaches. As data volume increases, more commodity machines can be added to the Hadoop cluster without a corresponding reduction in performance.
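
As an illustration, here is a HiveQL query submitted from Python. The PyHive client is one commonly used option; the host, table and columns shown are assumptions for the sake of the example.

    from pyhive import hive  # assumed client library

    # Connection details are placeholders.
    conn = hive.Connection(host="hadoop-edge-node", port=10000, username="analyst")
    cursor = conn.cursor()

    # An SQL-like HiveQL query over data residing in distributed storage.
    cursor.execute("""
        SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
        FROM mainframe_orders
        GROUP BY region
        ORDER BY revenue DESC
    """)
    for region, order_count, revenue in cursor.fetchall():
        print(region, order_count, revenue)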

And with the Veristorm "anti-ETL" ethos, the raw mainframe data stays connected directly to its source, so data analysts can do whatever they choose with it, right away.

Staging need not come into the equation at all.