The final wave of Digital Transformation is upon us! Now that we’ve eliminated the paper, cut the cords, and deployed our applications in the cloud, the only thing left is the data. It turns out that this is a lot more difficult than expected because data platforms are heavy and important to the organization, requiring new thinking to modernize. At the core of these modernized platforms is Spark – a platform that fully matured in the Hadoop ecosystem and now finds itself at the center of Data Science and Data Engineering platforms. The trick is to leverage the capabilities of Spark and the cloud principles of modern applications to provide the ideal user experience, and to do so where the enterprise data lives – on-premises!
Speaker 1: Hi, I’m Matt Maccaux, and I’m the field CTO for HPE Ezmeral’s global software business unit. In this presentation, we’re going to talk about how Spark can ignite a revolution throughout the enterprise with multi-cloud Kubernetes. We’re first going to talk about digital transformation and how Spark ties into it, and what organizations have been doing. Then we’re going to tie back to what organizations have already accomplished in terms of application modernization and moving to the cloud, and how that does, and does not, apply to the data and analytics space with Spark being at the core of that. And then we’re going to talk about how best-of-breed organizations are taking an industrialized approach to do data analytics and big data at scale, again, with Spark at its core.
And then I’m going to dig specifically into the data science and data engineering personas. What do those different personas require when they use Spark throughout the enterprise taking this industrialized approach? And then finally, I’m going to close talking about how organizations can actually accomplish all of those goals on-premises with open-source Spark on Kubernetes.
So let’s talk about digital transformation. What do people mean by digital transformation? Well, most organizations think digital transformation means going from the analog world to digital, taking paper and manual processes, digitizing them and automating them. And that’s already been done in most organizations, but digital transformation continues to evolve to mean different things, such as automating the data center, or automating and modernizing applications and moving them to the cloud. This has been an ongoing process that organizations have been taking on and accomplishing for more than a decade now. And we’re now at the point in this transformation effort where organizations are starting to take a look at data platforms. They’re taking a look at legacy relational databases and modernizing them to more open-source components, and taking a look at enterprise data warehouses and thinking about deploying those in a more agile, cost-effective way.
And then lastly, they’re looking at these big data, data lakes and looking at the next generation and transformation for those as well. For organizations that have gone through that process and are looking at these data platforms, many of them are looking to the cloud and specifically the application space to see if there are lessons that can be learned from that. And as we look at that, let’s first take a look at some of the cloud principles that drove the success of those particular transformations. The first is that when modernizing those applications, they took a loosely coupled approach to the architecture, meaning that we don’t have these large deployments, but each of those application components is built and deployed in a loosely coupled, discrete fashion, even separating the storage away from that so that these applications are considered stateless. And then a big part of that, of course, is separating the compute and the storage.
And we saw that with the hyperscalers emerging on the scene. Amazon was the first to come on the scene with their S3 object store and EC2 compute, and everybody sort of modeled that approach. And so organizations looked at that and they said, well, I can have these discrete microservices. I don’t need to have the storage, that data or database, deployed with that. I can separate it out. And actually the object store, the file store, in the cloud is reasonably performant. It meets the needs of these more modern applications. Also with that, because a lot of organizations were actually deploying in the cloud, they were able to do on-demand provisioning and they were able to build scaling mechanisms so that they could scale up and scale back those applications. Retail in the holiday season is a classic example. We would scale up a bunch of compute to be able to handle that transaction load, but even on a smaller level, on a day-to-day basis, we look at financial services with the rush of the market opening, or even retail on a daily basis, or overnight or monthly batch processing.
The examples go on and on. But the key here is that you only really consume the resources as you need them. And importantly, you only pay for those resources as you consume them. You only pay for what you use. And that has extended beyond just the infrastructure itself, the compute and the storage, to the cloud API services and functions, and organizations have then translated that model back on-premises for their application estate. The same thing is not true, though, in the data world. A lot of organizations are stuck because they see all of the success of the application organization and these cloud-like operating models. And even the data centers are now largely automated, but these data-centric systems have a lot of constraints because, as I mentioned before, a lot of these applications are actually monolithic in nature.
We can’t break apart all the discrete components of our enterprise data warehouse or existing Hadoop system, at least not very easily. Related to that, the data and the compute are often tightly coupled and deployed together as a single unit. The reason for that goes back years and years, because 10 years ago, when Hadoop came on the scene, networks weren’t very fast and disks were still reasonably slow. And the performance of these systems is actually tied to the speed and the latency of the network and the disk. So we wanted to co-locate the computational layer and the data layer, but organizations have fast networks now. Disks are really fast. And actually it turned out that the growth of the disks was greatly outpacing the growth and the utilization of compute. And especially as we start talking GPUs, it became even more problematic.
So organizations wanted to be able to break these apart, but most of these platforms were not built that way. And we see organizations that are looking to modernize their deployments, but that sort of relies on the vendor. So a lot of enterprises are scratching their heads now going, do I wait for the software vendor to do that? Or do I go a different path? Plus I can’t really on-demand provision my enterprise data warehouse, or even really my data lake. Maybe I can add hardware nodes, but then it requires a rebalance because it needs to remain tuned up properly. So we would generally add to these environments as we were doing major upgrades or during other maintenance periods, and that happened infrequently because these systems are really important for organizational reporting. We depended upon these systems to give us metrics about what was happening, how many widgets we sold, what our financial reports were to report back to the street or our consumers, our executives.
And so these systems were generally relatively static and slow changing. Plus they were pretty expensive. They usually required capital outlays on infrastructure, software, and people, and that is really kind of the polar opposite of what happened with the cloud. So what are best-of-breed organizations doing? What have they already done, actually, to modernize this? And I promise I’m going to get back to where Spark plays a part, but I wanted to just give you that little bit of background to explain why organizations have not rushed forward with this, especially on-premises. So when I talk to best-of-breed organizations, they are taking more of an industrial approach to solving the modern data and analytics platform. What do I mean by industrial? I want you to have this picture in your head of a modern manufacturing plant, or warehouse, where we’ve got lots of robotics, inventory’s done just-in-time, et cetera.
And so these enterprise organizations require significant automation. And the reason for this is, any organization can move heaven and earth to build one analytical model, one really sophisticated data science predictive model to solve an organizational problem. But can they do that repeatedly? Can they build intelligence and data science and decisioning into every organizational process? Do they have the software? Do they have the infrastructure? Do they have the skills to do that? And then do they have a way to actually effect that change by putting it into production? And again, these best-of-breed organizations sort of follow the lessons of these industrial plants. Think again of your car manufacturing and airplane manufacturing plants. They start out with an R&D center or center of excellence, where different actors or personas in the organization come together and they talk about what tools the data scientists and data engineers are using.
What version of Spark are we on? Can we use Spark on YARN? What about this Kubernetes thing? Where’s data going to come from? What does prod look like? How do we get prod ready? And so all of these components come together and these best-of-breed organizations take that R&D or center of excellence approach, so that as they’re thinking about the problems that these data scientists and data engineering teams are going to solve, the production teams are getting that manufacturing line, that production system, or those production systems ready to go. So that when that data scientist or data engineer wants to use the latest Livy server, well, prod’s ready to do that. And they’ve got enough elastic infrastructure to take those spark-submit jobs and deploy them elastically on demand, but that requires these organizations coming together. And you may be asking yourselves or thinking, man, this is not really a technology problem per se.
And you’re right, it’s actually an organizational function. So this concept of the industrial approach is half organizational process change and half technology. And that leads me to my next point. Again, taking the cues from the manufacturing world for this industrial approach, we want a just-in-time approach, that’s what JIT stands for, so that data scientists and data engineers can provision environments as they need to, just like they would on the cloud. But again, whether they do this on the cloud or on-premises, we need the same experience. They need access to the organizational data as they need it. This is really key for those data scientists that are trying to build those next-generation predictive analytical models. They need access to the right data, but we have to apply the right metrics, quality scores and lineage, and especially security, restricting who has access to what data, while doing all of this seamlessly.
And so what this requires at the end of the day, to really take this industrialized approach, just to sum up this concept, is modern tooling. I need the right infrastructure and software to bring this together. And again, Spark is one of those common tool sets that we see, whether we’re doing data science or data engineering, modernizing, or even running legacy Hadoop platforms. Spark is the workhorse of these platforms. And we need to be able to deploy that in that cloud-like manner to get to that industrial scale.
And of course we need the right organizational structure and skill sets in place. If I don’t have anybody that can run Spark on Kubernetes, well, perhaps I shouldn’t be running that in production then. I shouldn’t even be thinking about that. But let’s bring that back to my R&D center of excellence and build up those skill sets, so that as we want to transition from that Spark on YARN world to the Spark on Kubernetes world, to have a much more open ecosystem and not be tied to any one vendor or set of infrastructure, I can actually do that. But I need the right skill sets, and that takes a while to build up over time.
And then lastly, as I mentioned, part of the battle here is actually having the organizational processes in place. Even if I’ve got the right skills, so I have people that know the latest version of Spark and I can run Kubernetes on elastic infrastructure, I need the organizational processes in place to be able to take advantage of that. So let’s talk about how we can ignite your data platform modernization with Spark. And I want to start by talking about data science. You all know what data science is. I’m not going to explain what that means, but I want to talk about what data scientists require and, more importantly, what they don’t like to do. So data scientists want to work with their own tools, libraries, and workflows. That means they may want to go pull the latest open-source framework.
They perhaps want to use the latest version of PyTorch. What they don’t want to do is muck around with YAML files, Docker containers and Kubernetes deployments. They don’t want to run around and get access to data everywhere; they want to just simply get to work. A lot of these data scientists, especially the code-first data scientists, want to use open-source tools. And all of these open-source tools have plugins to Spark. So most data scientists today are… At least those code-first data scientists are writing in Python or are still using R, or you’ve got those more citizen data scientists that use their sets of tools like H2O or DataRobot. Well, they also depend on Spark. Again, Spark is this common denominator across all of these different data science personas, toolsets, libraries, and frameworks. And what’s key here is to give the organization the most flexibility and the most portability for these applications, and to have consistent use.
Again, getting back to that center of excellence approach, I want to be able to consistently deploy analytical workloads. I want to think about this at an enterprise scale. And what that really means is that I need to use open-source components. So my data scientists want to go from a Jupyter notebook and pick their particular libraries, tools, and workflows. Again, these are the code-first data scientists. And through a few button clicks, they want to be able to spin up a training job or do inferencing with a model with billions of attributes, running behind the scenes on Spark. But what that requires from an enterprise perspective, what I need to set up from an IT perspective, is that I need to be able to spin up a cluster. I probably need to be able to scale that cluster up and down, and Spark does a really good job of auto scaling up and down based on CPU utilization or memory utilization. And then I need to be able to deploy the right workloads or workflow processing, whether it’s Kubeflow or Airflow or MLflow, and all of that needs to be integrated with enterprise security.
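As a minimal sketch of the scaling piece: Spark's built-in mechanism is dynamic allocation, which requests executors when tasks are backlogged and releases executors that sit idle. The property names below are standard Spark settings; the helper functions and the executor bounds are hypothetical and chosen only for illustration.

```python
def dynamic_allocation_conf(min_executors, max_executors):
    """Return Spark properties that let a job grow and shrink its executors."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # On Kubernetes there is usually no external shuffle service, so
        # Spark 3.x can track shuffle data on the executors instead.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


def to_submit_args(conf):
    """Flatten a property dict into repeated '--conf key=value' arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical bounds: idle at 1 executor, burst to 20.
    print(" ".join(to_submit_args(dynamic_allocation_conf(1, 20))))
```

The same properties can be set in a notebook session's configuration instead of on the command line; the point is that the bounds are declared once by IT and the data scientist never has to touch them.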
But again, that needs to be seamless for these data scientists. And so again, behind the scenes that workhorse is Spark, and so we need to be able to deploy Spark and the associated libraries for each of these different users. For data scientists in particular, I probably need to give each of them their own Spark environment. It’s not the same thing as data engineers, but for data scientists, they probably all need their own bespoke environments where I spin up a little bit of compute for them, their own little Spark server with their specific configuration, libraries, and frameworks, and then secure that with connectivity to the underlying data. That’s what’s really key here. And I need to do that in a multi-tenant way so that perhaps my Swiss team doesn’t have access to the same data as my Hong Kong team. I need to be able to isolate those networks.
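One way to picture the multi-tenant setup is a Kubernetes namespace and service account per team, with each data scientist's Spark environment launched into its team's namespace. The sketch below only builds the Spark properties for that; the tenant names, namespaces, service accounts, and image are hypothetical, and the actual isolation (RBAC rules, network policies, data access controls) has to be configured in Kubernetes and the storage layer, which is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tenant:
    name: str             # hypothetical team label, e.g. "swiss"
    namespace: str        # Kubernetes namespace reserved for the team
    service_account: str  # account the Spark driver runs as in that namespace


def tenant_conf(tenant, user, image):
    """Spark-on-Kubernetes properties for one user's environment in one tenant."""
    return {
        "spark.app.name": f"{tenant.name}-{user}-notebook",
        "spark.kubernetes.namespace": tenant.namespace,
        "spark.kubernetes.authenticate.driver.serviceAccountName": tenant.service_account,
        # Each tenant or user can get an image with their own libraries baked in.
        "spark.kubernetes.container.image": image,
    }


if __name__ == "__main__":
    swiss = Tenant("swiss", "ds-swiss", "spark-swiss")
    hong_kong = Tenant("hk", "ds-hk", "spark-hk")
    image = "registry.example.internal/spark-py:3.1.1"  # hypothetical image
    for tenant, user in ((swiss, "anna"), (hong_kong, "wei")):
        print(tenant_conf(tenant, user, image))
```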
And so that’s what best-of-breed organizations are doing to solve for that data science use case. And again, Spark is at the heart of that. Now let’s talk about data engineering. Well, again, Spark is at the heart of all data engineering efforts too, but things are a little different in the data engineering space. Yes, the data engineering teams have their sets of tools and their sets of IDEs. Maybe it’s not Jupyter, maybe it is. People are still using Zeppelin, but they’re also writing in other programming languages, Scala, Java. There are lots of different ways that these data platform engineering teams are writing their code. And the beautiful thing about Spark is it’s got a set of APIs that allows you to write in those programming languages. But from a data engineering perspective, there’s also its own set of ecosystem tools. Where the data scientists use those open-source tools that I mentioned before, the data engineers use different ones. Maybe they’re programming against things like Kafka, or maybe they’re using the enterprise version like Confluent.
There are a number of these tools, and Spark is part of that ecosystem. And there are other tools that surround this ecosystem too, for SQL access to a variety of backend data stores, like Dremio or Presto, to be able to do more of that SQL-based data engineering effort. The key here is that Spark provides a rich set of APIs so that these data platform teams can leverage that elastically scalable compute. But not all engineers need or want that same approach. So, as I mentioned from the data science perspective, each of those data scientists, as they’re doing dev/test and model development, needs their own bespoke environment. And maybe the data engineering team does too. But what they really want to do is a spark-submit. They want to do a spark-submit and sort of act on a serverless basis, where they don’t need to necessarily worry about a server being spun up on the backend and have it always running. Sometimes that makes sense and sometimes it doesn’t.
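To make the spark-submit idea concrete, here is a minimal sketch that assembles a spark-submit command targeting a Kubernetes cluster in cluster deploy mode. The flags and property names are standard Spark options; the API server address, namespace, image, and application path are hypothetical placeholders. The command is only built and printed, not executed.

```python
import shlex


def build_spark_submit(api_server, image, app_path, name,
                       namespace="default", executors=2, app_args=()):
    """Assemble a spark-submit command that runs a job on Kubernetes."""
    return [
        "spark-submit",
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",  # driver runs in a pod, not on the caller's machine
        "--name", name,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_path,
        *app_args,
    ]


if __name__ == "__main__":
    cmd = build_spark_submit(
        api_server="https://k8s.example.internal:6443",    # hypothetical
        image="registry.example.internal/spark-py:3.1.1",  # hypothetical
        app_path="local:///opt/spark/jobs/nightly_etl.py", # file inside the image
        name="nightly-etl",
        namespace="data-eng",
        executors=4,
        app_args=["--date", "2021-05-01"],
    )
    print(shlex.join(cmd))
```

Nothing stays running after the job finishes: Kubernetes creates the driver and executor pods for the run and the executors go away when it completes, which is the serverless feel described above.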
And so for some jobs, we need an always on cluster, perhaps throughout the day, we’re doing either batch or streaming processing. We probably need a fixed, or at least an always on Spark cluster to be able to handle that steady state workload. And perhaps it scales up and it scales down as the workload ebbs and flows throughout the day. And again, Spark is really good at that. Especially the more modern Spark on Kubernetes using that Spark operator. It scales really, really well for that, but for other jobs, for more sort of batch jobs, ad hoc jobs, nightly jobs, monthly jobs, we actually want to be able to spin stuff up, brand new clusters up and spin them down as well.
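For the spin-up-and-spin-down case, the open-source Kubernetes Operator for Apache Spark lets a job be described declaratively as a SparkApplication resource. The sketch below builds such a manifest as a Python dict and prints it as JSON, which kubectl accepts as well as YAML. The field names follow the operator's v1beta2 API as I understand it and should be checked against the operator version you deploy; the job name, namespace, image, service account, and sizing are hypothetical.

```python
import json


def spark_application(name, namespace, image, main_file,
                      executors=2, spark_version="3.1.1"):
    """Build a SparkApplication manifest for an ephemeral batch job."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            # Run once and stop; nothing is left running after the job ends.
            "restartPolicy": {"type": "Never"},
            "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
            "executor": {"instances": executors, "cores": 1, "memory": "2g"},
        },
    }


if __name__ == "__main__":
    manifest = spark_application(
        name="monthly-rollup",                             # hypothetical job
        namespace="data-eng",
        image="registry.example.internal/spark-py:3.1.1",  # hypothetical image
        main_file="local:///opt/spark/jobs/monthly_rollup.py",
        executors=8,
    )
    print(json.dumps(manifest, indent=2))
```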
And so what’s key here is that for the modern organization, we actually need to be able to support all of these different types of workloads. And Spark is an excellent computational framework to allow us to do that. Those data scientists can snap in their favorite libraries and frameworks. Those data engineers can make those API calls. Perhaps we’re deploying a Livy server alongside that Spark on Kubernetes environment, and all those data engineers have to do is hit the REST API provided by Livy. And then Livy does the translating. But every organization’s a little different. The key here is that Spark on Kubernetes is the key for the industrialized approach to data and analytics. Spark is at the core of these transformation efforts for these enterprise data platforms. And as I mentioned before, Spark on YARN is being replaced. As these Hadoop deployments are starting to wind down and organizations are looking to cloud-based data processing or other options internally, Spark on YARN is being replaced with Spark on Kubernetes.
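As a rough illustration of hitting that REST API, Apache Livy accepts batch jobs as a JSON document POSTed to its /batches endpoint. The sketch below builds such a request with the Python standard library and prints it without sending anything. The Livy host, the job file, and the arguments are hypothetical; 8998 is Livy's default port.

```python
import json
from urllib.request import Request


def livy_batch_request(livy_url, file, args=(), conf=None, name=None):
    """Build (but do not send) a POST /batches request for a Livy server."""
    payload = {"file": file, "args": list(args)}
    if conf:
        payload["conf"] = conf  # extra Spark properties for this batch
    if name:
        payload["name"] = name
    return Request(
        url=f"{livy_url.rstrip('/')}/batches",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = livy_batch_request(
        "http://livy.example.internal:8998",          # hypothetical host
        file="local:///opt/spark/jobs/load_orders.py", # hypothetical job
        args=["--date", "2021-05-01"],
        conf={"spark.executor.instances": "4"},
        name="load-orders",
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually submit: urllib.request.urlopen(req), then poll
    # GET /batches/{id} on the same server for the batch state.
```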
And the 3.x line of Spark is now considered production grade, enterprise ready. And so a lot of organizations that I speak to are looking to adopt that and asking, how do I deploy Spark on Kubernetes for these various personas? But then they’re all sort of scratching their heads and going, well, shouldn’t I just do that in the cloud? Each of the cloud providers has their flavor of YARN. There are cloud-based deployments that do data engineering and data science, a data lake or lakehouse sort of deployment. Shouldn’t I just deploy all of my Spark there? Shouldn’t I just abandon my on-premises workloads? We all know that hasn’t happened; it hasn’t happened yet. The mainframe still exists in some businesses. Data warehouses certainly exist, and these data lakes exist everywhere as well. The point I’m trying to make here is that most data still sits on-premises.
Yes. We see campfires and data science workloads for dev/test being done in the public cloud, but the production environments are back on-premises. And so, remember, getting back to that industrialized approach, I need to be able to deploy those dev/test workloads into a production environment, which means I need to have these in sync and in harmony. So if my data, for the most part, sits on-premises, shouldn’t the compute side of it too? Shouldn’t I be able to spin up Spark with Livy for data engineering, or Spark with Jupyter, and connect into my HDFS or NFS or my local object stores? Shouldn’t I be able to do that too?
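A small sketch of what connecting to those on-premises stores can involve: depending on the URI scheme of the data, Spark needs different Hadoop filesystem settings. The property names below are standard Hadoop/Spark settings; the helper function, hostnames, and endpoint are hypothetical. For NFS, the sketch assumes the export is already mounted into the driver and executor pods, so plain file:// paths work and no extra properties are returned.

```python
from urllib.parse import urlparse


def storage_conf(data_uri, s3_endpoint=None):
    """Return the Spark properties needed to read data at the given URI."""
    parsed = urlparse(data_uri)
    if parsed.scheme == "hdfs":
        # Point the default filesystem at the on-premises NameNode.
        return {"spark.hadoop.fs.defaultFS": f"hdfs://{parsed.netloc}"}
    if parsed.scheme == "s3a":
        if s3_endpoint is None:
            raise ValueError("an on-premises object store needs an explicit endpoint")
        return {
            "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
            # Many S3-compatible stores expect path-style bucket addressing.
            "spark.hadoop.fs.s3a.path.style.access": "true",
        }
    if parsed.scheme == "file":
        return {}  # assumes the NFS export is mounted inside the pods
    raise ValueError(f"unsupported storage scheme: {parsed.scheme!r}")


if __name__ == "__main__":
    print(storage_conf("hdfs://namenode.example.internal:8020/data/sales"))
    print(storage_conf("s3a://sales/2021/", "https://objects.example.internal:9000"))
    print(storage_conf("file:///mnt/nfs/sales"))
```

Credentials are deliberately left out; in practice they would come from the platform's secret management rather than from job configuration.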
And if I really want to run this in a more industrial way, how do I actually deploy all of this on Kubernetes? Because if I can do this approach with open-source, that means I can use EC2. I can use Azure. I can use Google. I can use my private data center pool to spin up these workloads for dev/test. Maybe I don’t have enough infrastructure on-premises for dev/test, but maybe I do in the cloud. But if I can use open-source Kubernetes to deploy open-source Spark, and then the variety of tools that run on that, well, who cares where I’m doing my dev/test, because I’m going to have that consistency when I actually need to operationalize that in a production environment.
The trick, though, is that most enterprises won’t move forward with an open-source solution. They want support. And so what I’m proposing to you today, and what HPE is really bringing to market here and supporting, is giving you the confidence to run 100% open-source with break-fix support. That’s the Kubernetes environment. That’s your Spark operator environment. That’s your Livy server. That’s your Kubeflow. That’s your PyTorch. HPE Ezmeral, the software business unit within HPE and the brand Ezmeral, is delivering this capability today. We deploy 100% open-source Kubernetes with 100% open-source Spark operator and the ecosystem of components, and HPE provides 24x7 break-fix support for those components as well.
And so that should give you the confidence that you can deploy and modernize your data platforms to be able to run these workloads on-premises, or wherever the data happens to exist, without locking you into a cloud solution or a proprietary solution. You can bring your own tools. We certify tools like H2O and Splunk on this platform, but it’s all backed with open-source components. So I’d like to thank you for attending this session. If you have any follow-up questions, comments, or thoughts, you can hit us up at hpe.com or you can send me an email or hit me up on LinkedIn. I’m firstname.lastname@example.org. Thank you for attending the session.
Matt Maccaux has been working with clients across many industries for the past 20 years at some of the biggest technology companies in the world. For the past 8 years, Matt has focused on the big data...