Apache Spark and Mainframe data


Apache Spark is being adopted as a general-purpose engine for big data processing. Speed, ease of use, support for several programming languages, and compatibility with multiple resource managers and distributed databases are the cornerstones on which Spark has been built. Spark already supports a variety of data stores, including HDFS, Cassandra, HBase and Amazon S3.

Integrating mainframe transactional data, or "truth data", is a big data imperative because it is required to obtain the "market of one". We know that 97 of the top 100 global banks use mainframe data for their online transactions, and mainframe databases contain an enterprise’s most valuable data. Integrating this data is a cornerstone of consumer activity analysis.

Integration and conversion of mainframe data is difficult, however, because of the proprietary formats that must be converted to open source formats used in analytics.

So how do you integrate this important mainframe enterprise data?

One option is an open source contribution, Spark Connector, which requires the data to be pre-processed on the mainframe into text files in PDSs (partitioned data sets). The connector then pulls the data into the Spark ecosystem as data frames.

To use the Spark connector, the user must first create these text files by converting the binary data into EBCDIC text stored in a PDS on the mainframe, as follows:

  1. In the case of VSAM/QSAM, a programmer would need to write a COBOL or PL/I program to read each file and convert the data to an EBCDIC delimited data format file (like CSV).

  2. For DB2, the user would need to use a DB2 unload utility to unload the data and then write the data to a delimited format file.

  3. For CA IDMS and CA Datacom/DB, the respective unload utilities would have to be used to convert the data into a delimited format file.
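To make the conversion step concrete, here is a minimal sketch of what such a program has to do for one record: turn binary fields (such as packed decimals) into characters and write the fields out with delimiters. It is written in Python rather than COBOL or PL/I purely for illustration, and it decodes to Unicode for readability, whereas the mainframe step described above would keep the output in EBCDIC. The record layout (a 6-byte account ID, a 10-byte name, and a 5-byte COMP-3 balance with two decimal places), the cp037 code page, and the function names are all assumptions, not part of any real copybook or tool.

```python
import csv
import io
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal (COMP-3) field.

    Each byte holds two decimal digits, except the last byte,
    which holds one digit and a sign nibble (0xD or 0xB = negative).
    """
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F

    value = int("".join(str(d) for d in digits))
    if sign_nibble in (0x0B, 0x0D):
        value = -value
    # Shift the implied decimal point `scale` places to the left.
    return Decimal(value).scaleb(-scale)


def record_to_csv_line(record: bytes) -> str:
    """Convert one fixed-width EBCDIC record (hypothetical layout) to a CSV line."""
    account = record[0:6].decode("cp037").rstrip()    # PIC X(6)
    name = record[6:16].decode("cp037").rstrip()      # PIC X(10)
    balance = unpack_comp3(record[16:21], scale=2)    # PIC S9(7)V99 COMP-3

    out = io.StringIO()
    csv.writer(out, lineterminator="").writerow([account, name, balance])
    return out.getvalue()


# A hand-built sample record: two EBCDIC text fields plus a packed balance.
SAMPLE = (
    "ACC001".encode("cp037")
    + "JANE DOE  ".encode("cp037")
    + bytes.fromhex("000012345C")
)

print(record_to_csv_line(SAMPLE))
```

Every file layout needs its own version of this field-by-field logic, which is where the custom programming effort discussed below comes from.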

The user would then write a program in Java or Scala to work with the connector, and the connector would then make the data available as data frames.

The drawbacks of this approach are increased MIPS cost and low overall efficiency.

MIPS cost increases significantly because of the processing that occurs on the mainframe server. At an average cost of $5K per MIPS, this becomes economically untenable as production volumes grow.
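As a rough illustration of how this cost scales, the sketch below multiplies the $5K-per-MIPS figure quoted above by a few MIPS amounts. The MIPS amounts are hypothetical inputs chosen only to show the arithmetic, not measurements of any real workload.

```python
COST_PER_MIPS_USD = 5_000  # average cost per MIPS quoted in the text


def conversion_cost(mips_consumed: float,
                    cost_per_mips: float = COST_PER_MIPS_USD) -> float:
    """Estimate the cost of MIPS spent on pre-processing data on the mainframe."""
    return mips_consumed * cost_per_mips


# Hypothetical MIPS amounts for the conversion workload.
for mips in (10, 50, 100):
    print(f"{mips:>4} MIPS -> ${conversion_cost(mips):,.0f}")
```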

Storage and maintenance expenses also increase with this method of data integration, because the intermediate mainframe disk storage keeps growing in production.

Cost and agility suffer as well, because of the labor required to code and maintain the custom programs that dump the data.

There is, however, a less expensive way to integrate your mainframe data into Spark.

Instead of all of this, vStorm Connect streams and converts mainframe data and writes it to HDFS without using mainframe MIPS for data conversion and without staging or additional programming.

It also provides enterprise-grade failover, security and load balancing while integrating with mainframe schedulers and security tools to manage data movement.

Spark takes in the mainframe data that is written into HDFS by vStorm Connect and consumes the data as data frames. This simple integration provides an end-to-end solution for enterprises to quickly and easily access their mainframe data in a cost-effective way using Apache Spark.
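On the Spark side, consuming the converted data can be as short as the sketch below. It assumes the data landed in HDFS as delimited text with a header row. The actual output format of vStorm Connect is not described here, so treat the format, the path, and the column name as hypothetical. The helper takes an existing `SparkSession` and uses PySpark's standard `spark.read.csv` reader.

```python
def load_mainframe_csv(spark, path, sep=","):
    """Read delimited text from HDFS into a Spark DataFrame.

    `spark` is an existing pyspark.sql.SparkSession. The header row and
    the delimiter are assumptions about how the data was written.
    """
    return spark.read.csv(path, sep=sep, header=True, inferSchema=True)


# Typical use inside a PySpark job (path and column name are hypothetical):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("mainframe-demo").getOrCreate()
#   accounts = load_mainframe_csv(spark, "hdfs:///data/mainframe/accounts")
#   accounts.groupBy("BRANCH").count().show()
```

If the data is written in another format that Spark can read, such as Parquet or Avro, only the reader call would change; the rest of the job works on data frames either way.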