Extending Apache Spark’s Ingestion: Building Your Own Java Data Source

Apache Spark is a wonderful platform for running your analytics jobs. It has great built-in ingestion features for CSV, Hive, JDBC, and more. However, you may have your own data sources or formats that you want to use. One solution is to convert your data to a CSV or JSON file and then ask Spark to ingest it through its built-in tools. For better performance, however, we will explore how to build a data source, in Java, that extends Spark’s ingestion capabilities. We will first understand how Spark works for ingestion, then walk through the development of this data source plug-in.

Targeted audience: Software and data engineers who need to expand Spark’s ingestion capability.

Key takeaways: Requirements, needs & architecture – 15%. Build the required tool set in Java – 85%.
Session hashtag: #EUdev6

About Jean Georges Perrin

Jean Georges Perrin, "jgp", is passionate about software engineering and all things data, both small and big. His latest endeavors have brought him into the Apache ecosystem, with a definite penchant for Spark and, increasingly, data science. He is proud to have been the first in France to be recognized as an IBM Champion, and to have been awarded the honor for the ninth consecutive year. Jean Georges shares his more than 20 years of experience in the IT industry as a presenter and participant at conferences and through articles published in print and online media. His blog is at http://jgp.net. When he is not immersed in IT, which he loves, he enjoys exploring his adopted region of North Carolina with his kids. #Knowledge = 𝑓 ( ∑ (#SmallData, #BigData), #DataScience) & #Software. #IBMChampion x9. #KeepLearning.