Jan studied for his BA and MCS at Trinity College Dublin. During his studies he interned at SAP, where he gained valuable experience with in-memory database systems, which sparked his interest in big data technologies. In 2014, he started a PhD at FMG, TCD, focusing on optimising the resource utilisation of big data frameworks, namely MapReduce. In 2015, Jan started working as a big data engineer for Barclays Africa in Prague. He is now in charge of building internal big data engineering expertise and developing new tools and products, including Spline.
Seamless integration of diverse data sources into an enterprise data lake has great value for data-driven companies. In the financial and banking industry, to which our company ABSA belongs, mainframes are among the most common platforms, yet their interoperability with other platforms remains challenging. In this talk, we introduce a new data source for Spark called Cobrix (https://github.com/AbsaOSS/cobrix), which radically simplifies consuming mainframe data from Spark.

Currently, a wide range of approaches is used to integrate mainframe data with analytics platforms: message queues, direct ODBC/JDBC connectors, tools like Sqoop and LegStar, or running Spark directly on mainframes. These approaches have several limitations. For instance, the existing tools primarily target relational data, so the original hierarchical schema is flattened, exploded and/or projected. As a consequence, the resulting table may become extremely wide (~10k columns), which complicates further processing.

Our solution, Cobrix, extends the Spark SQL API with a data source for mainframe data. It allows reading binary files stored in HDFS in their native mainframe format and parsing them into Spark DataFrames, with the schema provided as a COBOL copybook. Spark's native support for nested structures and arrays allows the original schema to be retained. As a result, Cobrix offers a new and convenient way of processing mainframe data.

In this talk we first review the differences in data definition models between mainframes and PCs. Then we explain how Cobrix maps COBOL schemas to Spark schemas. Further, we demonstrate Cobrix usage for reading simple and multi-segment files and present the performance and scalability characteristics of the data source. Finally, we discuss the broader picture of mainframe integration through Cobrix, Spark, Avro, Kafka, etc., using use case examples.
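To give a flavour of why mainframe data needs special handling, here is a minimal Python sketch of two encodings that make mainframe files unreadable by ordinary tools: EBCDIC text and COMP-3 packed-decimal numbers. This is an illustrative simplification, not Cobrix's actual implementation.

```python
# Illustrative decoders for two mainframe encodings (not Cobrix's code):
# EBCDIC character data and COMP-3 packed-decimal numbers.

def decode_comp3(data: bytes, scale: int = 0):
    """Decode a COBOL COMP-3 (packed decimal) field.

    Each byte holds two binary-coded-decimal digits; the low nibble
    of the last byte is the sign (0xC or 0xF positive, 0xD negative).
    """
    nibbles = []
    for b in data:
        nibbles.append((b >> 4) & 0xF)
        nibbles.append(b & 0xF)
    *digits, sign = nibbles
    value = 0
    for d in digits:
        value = value * 10 + d
    if sign == 0xD:
        value = -value
    return value / 10 ** scale if scale else value

# EBCDIC text: Python ships the common EBCDIC code page as 'cp037'
name = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6]).decode("cp037")   # "HELLO"
amount = decode_comp3(bytes([0x12, 0x34, 0x5C]))               # 12345
balance = decode_comp3(bytes([0x01, 0x2D]), scale=1)           # -1.2
```

Cobrix drives all such decoding from the copybook itself: the data source is registered under the short name "cobol", so reading looks roughly like `spark.read.format("cobol").option("copybook", "/path/to/book.cpy").load("/hdfs/path")` (option names as documented in the Cobrix README).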
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline - a data lineage tracking and visualization tool for Apache Spark. Spline captures lineage information from internal Spark execution plans, stores it, and visualizes it in a user-friendly manner.

Session hashtag: #EUent3
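For context, the Spline agent attaches to Spark as a query-execution listener. The configuration sketch below follows Spline's public agent documentation for later versions than the one presented in this talk; the listener class name, the `spark.spline.*` keys, and the producer URL are assumptions taken from those docs, and the agent bundle must additionally be on the Spark classpath.

```python
# Hypothetical wiring: attach the Spline agent to a PySpark session.
# Class name and spark.spline.* keys follow the Spline agent docs for
# recent versions; the agent bundle jar must be supplied separately.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lineage-demo")
    # Spline hooks into Spark via a QueryExecution event listener
    .config("spark.sql.queryExecution.listeners",
            "za.co.absa.spline.harvester.listener."
            "SplineQueryExecutionEventListener")
    # REST endpoint of the Spline server that receives captured lineage
    .config("spark.spline.producer.url", "http://localhost:8080/producer")
    .getOrCreate()
)

# Any write action executed from here on (e.g. df.write.save(...))
# is captured from the execution plan and sent to the Spline server.
```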