Is Sqoop appropriate for all mainframe data?

Is Sqoop appropriate for all mainframe data?

Hadoop is expanding its footprint in large enterprises, and transitioning from trial to production deployments. That’s good for innovative organizations, but it is bringing new challenges for data scientists. Among those challenges is the not-so-simple matter of bringing mainframe data into the Hadoop cluster for processing. Among the tools are open source solutions like Sqoop.

According to Gartner, 70% of current Hadoop projects will involve processing transactions. (55% will involve log files.) Mainframe servers process 70% of all transactions globally. For data scientists, the good news is that for many projects the mainframe server may be the only source they need to access. The bad news is that the mainframe may be unfamiliar territory and the data formats are very different than the XML and JSON (and even ASCII or Unicode) formats they are comfortable with.

For many Hadoop data ingestion projects, data scientists can turn to Sqoop. According to the Apache website: “Apache Sqoop(™) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.” 

Sqoop works by providing a scriptable command-line interface for pulling relational data from a SQL database and loading it into Hadoop HDFS or Hive, often through FTP. It has been proven to be robust in countless RDBMS projects.

Mainframe considerations

Though Sqoop, with some programming effort, works for extracting relational data in most cases it has the implicit assumption that the source and target systems are roughly equivalent. In mainframe environments, the disparity in computing costs, storage costs, and security demand special attention. In all, there are five special considerations when extracting data from a mainframe server:

  • Data sources unsupported by Sqoop
  • Resource impacts on the mainframe
  • Computing costs
  • Storage costs
  • Security

Relational data from DB2 is just one source. The required data may reside in VSAM, QSAM, or binary log files, which are not addressed by Sqoop. With these sources it will be up to the user to write software to interpret the files and manage proprietary mainframe technologies including COBOL Copybooks, EBCDIC character encoding, packed decimal encoding and others.

SQL pulls can be resource intensive, and the resource impact on the mainframe’s real-time transaction processing should be carefully evaluated. Small reductions to the rate of transaction processing time can cost millions in lost business.

Additionally, processing that is performed on the mainframe, including SQL pulls and transformation steps, will incur MIPS charges.

Storing data on the mainframe for an intermediate transformation step will increase storage requirements and costs. Using DASD for staging of processed VSAM/QSAM binary files can be very expensive.

Mainframe transaction data can be extremely sensitive and special care should be taken to keep it secure.

vStorm Enterprise from Veristorm

To address the special requirements of mainframe data extraction, we focused on leveraging the unique characteristics of each platform.

  1. Automate data conversion from mainframe formats
  2. Run on Linux to minimize MIPS costs
  3. Perform in-flight transformation to eliminate storage costs for staging which dramatically cuts MIPS
  4. Provide a GUI for discovery, filtering and definition of jobs for migration of data
  5. Provide scheduling of data movement jobs or integration of the data movement jobs in third party schedulers
  6. Use SSL over the wire for security
  7. No SMP/E install of software on z/OS and No Started task or Servers running on z/OS, again minimizing MIPS

The complexity of unstructured formats like VSAM, QSAM, and SMF/RMF log files is now built into this intelligent copy process. This saves the user from having to write new code, perform new and expensive computation, and prepare, secure and create expensive staging on the mainframe. Data transformation from binary data to the target Hadoop format is performed on Linux, thus minimizing MIPS charges.

In summary, when extracting DB2 data, we use a binary dump instead of SQL, minimizing resource usage and MIPS costs on the mainframe. Data is streamed to Linux and transformed before it is persisted to the target data store.

The benefit of vStorm Enterprise is that the point-and-click interface is all-inclusive. The simplicity and high-degree of automation provides self-service data access for data analysts who want rapid access to enterprise information, with the ability to change queries and refresh data on-demand. 

“Extract-Hadoop-Transform” is an exciting and cost effective model that enables more dynamic analytics and visualizations.