This is a collaborative post from Databricks and Amazon Web Services (AWS). We thank Venkatavaradhan Viswanathan, Senior Partner Solutions Architect at AWS, for his contributions.
Data + AI Summit 2022: Register now to join this in-person and virtual event June 27-30 and learn from the global data community.
Amazon Web Services (AWS) is a Platinum Sponsor of Data + AI Summit 2022, one of the largest events in the industry. Join this event and learn from joint Databricks and AWS customers like Capital One, McAfee, Cigna and Carvana, who have successfully leveraged the Databricks Lakehouse Platform for their business, bringing together data, AI and analytics on one common platform.
At Data + AI Summit, Databricks and AWS customers will take the stage for sessions to help you see how they achieved business results using the Databricks on AWS Lakehouse. Attendees will have the opportunity to hear from data leaders at McAfee and Cigna on Tuesday, June 28, then join Capital One on Wednesday, June 29, and Carvana on Thursday, June 30.
The sessions below are a guide for everyone interested in Databricks on AWS and span a range of topics -- from building recommendation engines to fraud detection to tracking patient interactions. If you have questions about Databricks on AWS or service integrations, connect with Databricks on AWS Solutions Architects at Data + AI Summit.
Databricks on AWS customer breakout sessions
Capital One: Running a Low Cost, Versatile Data Management Ecosystem with Apache Spark at Core
Data is the key component of any analytics, AI, or ML platform. Organizations cannot succeed without a platform that can source, transform, quality-check, and present data in a reportable format that drives actionable insights. This session will focus on how the Capital One HR team built a low-cost data movement ecosystem that sources data, transforms it at scale, and builds data storage (Redshift) that AI/ML programs can easily consume, using AWS services in combination with open source software (Spark) and Enterprise Edition Hydrograph (a UI-based ETL tool with Spark as the backend).
How McAfee Leverages Databricks on AWS at Scale
McAfee, a global leader in online protection, enables home users and businesses to stay ahead of fileless attacks, viruses, malware, and other online threats. Learn how McAfee leverages Databricks on AWS to create a centralized data platform as a single source of truth to power customer insights. We will also describe how McAfee uses additional AWS services, specifically Amazon Kinesis and Amazon CloudWatch, to provide real-time data streaming and to monitor and optimize their Databricks on AWS deployment. Finally, we’ll discuss business benefits and lessons learned during McAfee’s petabyte-scale migration to Databricks on AWS using Databricks Delta clone technology coupled with network, compute, and storage optimizations on AWS.
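The migration mentioned above leans on Delta clone. As a minimal sketch of that primitive (assuming a Databricks notebook where `spark` is already defined; the table names are placeholders, not McAfee’s datasets), a deep clone copies a table’s data and metadata in a single statement:

```python
# A minimal sketch, assuming a Databricks notebook where `spark` is predefined
# and the source is a Delta table; both table names are placeholders.
spark.sql("""
  CREATE TABLE IF NOT EXISTS analytics.events_migrated
  DEEP CLONE legacy.events
""")
```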
Cigna: Journey to Solving Healthcare Price Transparency with Databricks and Delta Lake
The Centers for Medicare &amp; Medicaid Services (CMS) published a Price Transparency mandate requiring healthcare service providers and payers to publish the cost of the services they provide, based on procedure codes, in the public domain. This led us to create a comprehensive solution that processes tens of terabytes of data to create Machine Readable Files (MRFs) in the form of JSON files and hosts them in the public domain. We embarked on a journey that embraces the scalability of the AWS cloud, Apache Spark, Databricks, and Delta Lake to generate and host files ranging from megabytes to hundreds of gigabytes.
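For readers unfamiliar with the pattern, here is a rough sketch of how negotiated-rate records staged in Delta Lake could be written out as machine-readable JSON with Spark. It is an illustration only, not Cigna’s implementation; the table, columns, and S3 path are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table of negotiated rates staged upstream.
rates = spark.table("price_transparency.negotiated_rates")

# Write machine-readable JSON files, partitioned by billing code so the
# output stays split into files of manageable size.
(rates
 .select("billing_code", "billing_code_type", "provider_npi", "negotiated_rate")
 .repartition("billing_code")
 .write
 .mode("overwrite")
 .partitionBy("billing_code")
 .format("json")
 .save("s3://example-bucket/machine-readable-files/"))
```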
Carvana: Near Real-Time Analytics with Event Streaming, Live Tables, and Delta Sharing
Microservices are an increasingly popular architecture, much loved by application teams because they allow services to be developed and scaled independently. Data teams, though, often need a centralized repository where data from different services comes together to be joined and aggregated. The data platform can serve as a single source of company facts, enable near-real-time analytics, and allow secure sharing of massive data sets across clouds. A viable microservices ingestion pattern is Change Data Capture (CDC), using AWS Database Migration Service or Debezium. CDC proves to be a scalable solution ideal for stable platforms, but it poses several challenges for evolving services: frequent schema changes, complex or unsupported DDL during migration, and automated deployments are but a few. An event streaming architecture can address these challenges.
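To make the event-streaming side of that pattern concrete, here is a minimal Spark Structured Streaming sketch that lands service events from Kafka in a Delta table. The broker, topic, schema, and paths are all hypothetical and stand in for whatever a given microservice publishes; this is not Carvana’s actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema of the JSON events a microservice publishes.
event_schema = StructType([
    StructField("vehicle_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_time", TimestampType()),
])

# Read the raw event stream from Kafka (broker address and topic are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "vehicle-events")
       .load())

# Parse the JSON payload and append it to a Delta table for near-real-time analytics.
events = (raw
          .select(from_json(col("value").cast("string"), event_schema).alias("e"))
          .select("e.*"))

(events.writeStream
 .format("delta")
 .option("checkpointLocation", "s3://example-bucket/checkpoints/vehicle-events/")
 .outputMode("append")
 .start("s3://example-bucket/delta/vehicle_events/"))
```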
Amgen: Building Enterprise Scale Data and Analytics Platforms at Amgen
Over the past few years, Amgen has developed a suite of modern enterprise platforms that serve as a core foundational capability for data and analytics transformation across our business functions. We operate in mature agile teams, with a dedicated product team for each of our platforms, building reusable capabilities and integrating with business programs in line with SAFe. Our platforms have created massive business impact, whether for business teams looking to self-serve onboarding data into our data lake or those looking to build advanced analytics applications powered by advanced NLP, knowledge graphs, and more. Our platforms are powered by modern technologies, extensively using Databricks, AWS native services, and several open source technologies.
Amgen: Amgen’s Journey To Building a Global 360 View of its Customers with the Lakehouse
Serving patients in over 100 countries, Amgen is a leading global biotech company focused on developing therapies that have the power to save lives. Delivering on this mission requires our commercial teams to regularly meet with healthcare providers to discuss new treatments that can help patients in need. With the onset of the pandemic, when face-to-face interactions with doctors and other Healthcare Providers (HCPs) were severely impacted, Amgen had to rethink these interactions. With that in mind, the Amgen Commercial Data and Analytics team leveraged a modern data and AI architecture built on the Databricks Lakehouse to help accelerate its digital and data insights capabilities. This foundation enabled Amgen’s teams to develop a comprehensive, customer-centric view to support flexible go-to-market models and provide personalized experiences to our customers. In this presentation, we will share our recent journey of taking an agile approach to bringing together over 2.2 petabytes of internally generated and externally sourced vendor data and onboarding it into our AWS Cloud and Databricks environments, enabling standardized, scalable, and robust capabilities that meet the business requirements of our fast-changing life sciences environment.
Sapient: Turning Big Biology Data into Insights on Disease – The Power of Circulating Biomarkers
Profiling small molecules in human blood across global populations gives rise to a greater understanding of the varied biological pathways and processes that contribute to human health and diseases. Herein, we describe the development of a comprehensive Human Biology Database, derived from non-targeted molecular profiling of over 300,000 human blood samples from individuals across diverse backgrounds, demographics, geographical locations, lifestyles, diseases, and medication regimens, and its applications to inform drug development. Built on a customized AWS and Databricks “infrastructure-as-code” Terraform configuration, we employ streamlined data ETL and machine learning-based approaches for rapid rLC-MS data extraction.
Scribd: Streaming Data into Delta Lake with Rust and Kafka
Scribd's data architecture was originally batch-oriented, but in the last couple years, we introduced streaming data ingestion to provide near-real-time ad hoc query capability, mitigate the need for more batch processing tasks, and set the foundation for building real-time data applications. In this talk I will describe Scribd's unique approach to ingesting messages from Kafka topics into Delta Lake tables. I will describe the architecture, deployment model, and performance of our solution, which leverages the kafka-delta-ingest Rust daemon and the delta-rs crate hosted in auto-scaling Amazon ECS services. I will discuss foundational design aspects for achieving data integrity such as distributed locking with Amazon DynamoDB to overcome S3's lack of "PutIfAbsent" semantics, and avoiding duplicates or data loss when multiple concurrent tasks are handling the same stream. I'll highlight the reliability and performance characteristics we've observed so far. I'll also describe the Terraform deployment model we use to deliver our 70-and-growing production ingestion streams into AWS.
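The locking idea called out above can be sketched conceptually in a few lines. The example below illustrates the conditional-write primitive in Python with boto3; it is not the actual Rust implementation used by kafka-delta-ingest, and the table and attribute names are made up.

```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")


def try_acquire_lock(table: str, lock_key: str, owner: str, ttl_seconds: int = 60) -> bool:
    """Take a lock by writing an item only if it does not already exist.

    DynamoDB's conditional PutItem supplies the "put if absent" primitive that
    plain S3 writes lack, so only one writer at a time commits to the Delta log.
    """
    try:
        dynamodb.put_item(
            TableName=table,
            Item={
                "lockKey": {"S": lock_key},
                "owner": {"S": owner},
                "expiresAt": {"N": str(int(time.time()) + ttl_seconds)},
            },
            ConditionExpression="attribute_not_exists(lockKey)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another writer currently holds the lock
        raise
```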
Scribd: Doubling the Capacity of the Data Platform Without Doubling the Cost
The data and ML platform at Scribd is growing. I am responsible for understanding and managing its cost, while enabling the business to solve new and interesting problems with our data. In this talk we'll discuss each of the following concepts and how they apply at Scribd and more broadly to other Databricks customers. Optimize infrastructure costs: compute is one of the main cost line items for us in the cloud. We are early adopters of Photon and Databricks Serverless SQL, which help us minimize these costs. We combine these technologies with off-the-shelf analysis tools in AWS and some helpful optimizations around Databricks and Delta Lake that we’d like to share.
Huuuge Games: Real-Time Cost Reduction Monitoring and Alerting
Huuuge Games is building a state-of-the-art data and AI platform that serves as a unified data hub for all company needs and for all data and AI business insights. We built an advanced architecture based on Databricks, running on top of AWS. Our unified data infrastructure handles several billion records per day in batch and real-time modes, generating players' behavioral profiles, predicting their future behavior, and recommending the best customization of game content for each of our players.
Databricks on AWS breakout sessions
Secure Data Distribution and Insights with Databricks on AWS
Every industry must meet some form of compliance or data security requirement in order to operate. As data becomes more mission-critical to the organization, so does the need to protect and secure it. Public sector organizations are responsible for securing sensitive data sets and complying with regulatory programs such as HIPAA, FedRAMP, and StateRAMP.
Building a Lakehouse on AWS for Less with AWS Graviton and Photon
AWS Graviton processors are custom-designed by AWS to enable the best price performance for workloads in Amazon EC2. In this session we will review benchmarks that demonstrate how AWS Graviton-based instances run Databricks workloads at a lower price and with better performance than x86-based instances on AWS, and how, when combined with Photon, the new Databricks engine, the price-performance gains are even greater. Learn how you can optimize your Databricks workloads on AWS and save more.
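As a quick illustration of what this looks like in practice, the sketch below requests a cluster with a Graviton instance type and a Photon-enabled runtime through the Databricks Clusters REST API. The workspace URL, token, runtime version, and instance type are placeholders to adjust for your own workspace.

```python
import requests

# Placeholders: substitute your workspace URL and a valid personal access token.
WORKSPACE = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "graviton-photon-demo",
    # A Photon-enabled runtime; check your workspace for currently supported versions.
    "spark_version": "10.4.x-photon-scala2.12",
    # A Graviton-based (ARM) instance type; m6gd is one example family.
    "node_type_id": "m6gd.xlarge",
    "num_workers": 2,
}

resp = requests.post(
    f"{WORKSPACE}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```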
Securing Databricks on AWS Using Private Link
Minimizing data transfers over the public internet is among the top priorities for organizations of any size, both for security and cost reasons. Modern cloud-native data analytics platforms need to support deployment architectures that meet this objective. For Databricks on AWS such an architecture is realized thanks to AWS PrivateLink, which allows computing resources deployed on different virtual private networks and different AWS accounts to communicate securely without ever crossing the public internet.
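One building block of such a deployment is an interface VPC endpoint in the customer's VPC. The sketch below shows that single step with boto3; the service name and resource IDs are placeholders, and the Databricks-side registration of the endpoints is omitted.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholders: the Databricks endpoint service name is region-specific, and the
# VPC, subnet, and security group IDs come from your own deployment.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.vpce.us-east-1.vpce-svc-EXAMPLE",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=False,
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```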
Register now for this free virtual event and join the data and AI community. Learn how companies are successfully building their Lakehouse architecture with Databricks on AWS to create a simple, open and collaborative data platform. Get started using Databricks with $50 in AWS credits and a free trial on AWS Marketplace.