I work for Takeda Pharmaceuticals on the Enterprise Data Platform and Products team. The goal of our team is to enable global enterprise data capabilities across Data Engineering, Business Analytics, Data Platforms, and Data Science / Machine Learning. My specific role on the team is to own the Databricks platform and the other data science tools that we offer to our enterprise. As the product owner for Databricks at Takeda, I have a deep understanding of the platform and have been working on it for the past 3 years. I previously worked as a data scientist and have developed scalable machine learning solutions using the Databricks platform.
May 27, 2021 11:35 AM PT
Takeda's Plasma Derived Therapies (PDT) business unit has recently embarked on a project to use Spark Streaming on Databricks to improve how it delivers value to its Plasma Donation centers. As patients come in and interface with our clinics, we store and track all of the patient interactions in real time and deliver outputs and results based on those interactions. The problem with our existing architecture is that it is very expensive to maintain and has an unsustainable number of failure points. Spark Streaming is essential for enabling this use case because it allows for a more robust ETL pipeline. With Spark Streaming, we are able to replace our existing ETL processes (which are based on Lambdas, step functions, triggered jobs, etc.) with a purely stream-driven architecture.
Data arrives in our S3 raw layer as a large set of CSV files via AWS DMS and Informatica IICS, which bring data from on-prem systems into our cloud layer. A stream currently picks up these raw files and merges them into Delta tables in the bronze/stage layer. We use AWS Glue as the metadata provider for all of these operations. From the stage layer, another set of streams uses the stage Delta tables as their source; these streams transform the data and perform stream-to-stream lookups before writing the enriched records into RDS (silver/prod layer). Once the data has been merged into RDS, a DMS task lifts it back into S3 as CSV files. A small intermediary stream merges these CSV files into corresponding Delta tables, which feed our gold/analytic streams. The on-prem systems are able to speak to the silver layer, which allows for the near-real-time latency that our patient care centers require.
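The merge step that recurs in this pipeline (raw CSV files into bronze tables, and later RDS exports into Delta tables) can be illustrated with a small plain-Python model. This is only a sketch of the upsert semantics, not the Spark or Delta code used in the project: the `op`, `id`, and `status` columns, the `I`/`U`/`D` operation codes, and the `merge_batch` helper are all hypothetical names chosen for illustration, and the target table is modeled as an in-memory dictionary.

```python
import csv
import io


def merge_batch(table, csv_text, key="id"):
    """Upsert one micro-batch of CSV rows into `table`, a dict keyed by `key`.

    Assumption: each row carries a hypothetical `op` column with
    I (insert), U (update), or D (delete). Rows without it are upserts.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        op = row.pop("op", "U")
        if op == "D":
            # Delete: drop the record if it exists, ignore it otherwise.
            table.pop(row[key], None)
        else:
            # Insert or update: the latest row for a key wins.
            table[row[key]] = row
    return table


# Two micro-batches arriving in order, as two CSV files would.
bronze = {}
merge_batch(bronze, "op,id,status\nI,1,checked_in\nI,2,checked_in\n")
merge_batch(bronze, "op,id,status\nU,1,screened\nD,2,\n")
print(bronze)
```

Applying the batches in arrival order leaves one current row per key, which is the property the downstream streams rely on. In a Structured Streaming job the same idea is usually expressed as a Delta `MERGE` executed once per micro-batch rather than as a Python loop.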