Giving Away The Keys To The Kingdom: Using Terraform To Automate Databricks

May 28, 2021 10:30 AM (PT)


The long-term success of any part of Scribd’s data platform relies on Platform Engineering putting tools in the hands of developers and data scientists to “choose their own adventure”.

In this session we’ll learn about Databricks (Labs) Terraform integration and how it can automate literally every aspect required for a production-grade platform: data security, permissions, continuous deployment and so on. We’ll learn how Scribd offers their internal customers flexibility without acting as gatekeepers. Just about anything they might need in Databricks is a pull request away. We’ll also learn about the typical deployment patterns of Databricks with Terraform among other customers and clouds and how the project evolved over time.

In this session watch:
Hamilton Hord, Site Reliability Engineer, Scribd
Serge Smertin, Resident Solutions Architect, Databricks

 

Transcript

Hamilton Hord: Hello and welcome to Giving Away the Keys to the Kingdom: Using Terraform to Automate Databricks. My name is Hamilton Hord and I’m a Site Reliability Engineer at Scribd.

Serge Smertin: And I’m Serge Smertin, Senior Resident Solutions Architect at Databricks.

Hamilton Hord: Today we’ll be talking about a few things. First, we’ll review what a complex infrastructure can look like and some of the issues inherent with it. Next, we’ll review the Terraform Databricks Provider. We should note that we will be discussing using Terraform, but not what it is and how it works. For that you can find other presentations and other guides elsewhere. Next, we will be discussing how this new Databricks Provider has helped us solve a lot of our problems and issues at Scribd. And then finally, we’ll talk about how the Databricks Provider can help in a much more general sense. So again, I am an SRE at Scribd, focused primarily on our data platform. I’m responsible for deploying our Databricks infrastructure to our data scientists, business users, and everything in between. And because of that, I really enjoy working with big data technology, the collection of data, the processing, and all the analytical information that can be gleaned from it. And of course, outside of the office, I am a big fan of playing video games. Well, most notably trying to get them to work on my Linux computer.

Serge Smertin: And I’m Serge. I am a lead maintainer of the Databricks Terraform Provider. I’ve worked in all stages of the data life cycle for the past 14 years: pre-sales, post-sales, implementation, solutioning, support, production. I built a couple of data science platforms literally from scratch. I used to track cybercriminals through massively scaled forensics and I used to build anti-PII analysis measures for the payment industry. And yeah, here I help strategic customers of Databricks, like Scribd, to become more successful and bring them to the next level.

Hamilton Hord: So what does a complex data infrastructure look like? For example, you might have an array of different buckets that have all sorts of individual use cases and types of data all over the company. Next, you might have a few other non-bucket data stores like Kafka or some relational data store. Next, you’ll probably have a few Databricks clusters that you use to access that data and process it to actually glean value. Of course, you’ll have a number of roles and accounts with the special access that those entail, and all of the policy documents and rules and definitions that these roles need in order to have the access they’re supposed to have. And on top of that, most of these are probably going to be in different accounts across the various teams of an organization.
Now, to focus in a little bit on the complexities of running Databricks in this. When making Databricks changes in the web console, when people update various config items, such as worker counts, Spark config or other things, you can never really know what the values were previously; all you know is the current values. You can’t see what it was like maybe a month or a week ago. Additionally, people can just go in and create new clusters that they have access to on a whim. There’s no oversight, there’s no checks or balances or anything. There’s no real check to make sure they’re using the right tags that they need to, or the correct Spark config to get things working. And in large companies, it’s kind of hard to make sure that everyone is adhering to a written, but not enforced, policy that things are being created correctly.
So in order to help maintain that, you have to reduce the number of people who can actually create clusters. So only a few people from a team of 50 can actually make the clusters. Which means that if someone needs changes to a cluster or needs a new cluster with some specific configuration items, an admin has to come in and actually build that for them, from a ticket or some message in Slack or some other way. And in doing so, they have to build it and often duplicate a lot of the config items manually. And then on top of that, with all that manual creation, as admins we need to regularly go in and check to see if everyone is adhering to our written policies or standards: hey, are people creating clusters with the appropriate tags, the appropriate departments, et cetera. Now Serge is going to tell us about what solves a lot of these issues for us.

Serge Smertin: Yes, indeed, Hamilton. So we’ve created the Databricks Labs Terraform Provider to solve all the aforementioned problems that large enterprises have, to help people unleash the power of Databricks: automated jobs, streaming pipelines, workspace security, data security and machine learning, connected seamlessly with cloud storage, cloud access controls, and networking, through tooling that is the de facto industry standard for cloud-native infrastructure. Because if you’re running in the cloud, if you’re running Databricks, then most likely you are using Terraform. Of course, there are a couple of other alternatives to Terraform, like CloudFormation from Amazon Web Services or Azure Resource Manager templates, but they are not multi-cloud. Terraform is the single tool that gives you the same syntax and the same tooling for all of the clouds that you have, maintaining all of these things and bridging the gap.
Basically, the Terraform provider works with almost any public API that Databricks exposes, and there is a resource for it in Terraform. Even though the Terraform provider is currently not officially supported, it’s still pretty actively used. It has approximately half a million downloads and more than 5,000 installs per day. I’ve been working on it for more than a year now, and it has 15 public releases, more than 50,000 lines of code with 83% code coverage, and more than 600 closed issues. Those are pretty impressive results. And many companies are using it for their production deployments. So what cool things does Scribd do with it?
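[Editor’s note: to give a concrete feel for what using the provider looks like, here is a minimal, hypothetical configuration sketch. It is not from the talk’s slides; the workspace URL, cluster name, runtime version and node type are placeholders, and at the time of this talk the provider was published under the databrickslabs registry namespace.]

terraform {
  required_providers {
    databricks = {
      source  = "databrickslabs/databricks"
      version = "~> 0.3"
    }
  }
}

# Hypothetical workspace URL; credentials can also come from environment variables or a CLI profile.
provider "databricks" {
  host = "https://my-workspace.cloud.databricks.com"
}

# One of the many resources the provider exposes: a small auto-terminating, autoscaling cluster.
resource "databricks_cluster" "shared" {
  cluster_name            = "shared-autoscaling"
  spark_version           = "8.2.x-scala2.12" # placeholder runtime version
  node_type_id            = "i3.xlarge"
  autotermination_minutes = 30

  autoscale {
    min_workers = 1
    max_workers = 4
  }
}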

Hamilton Hord: So we’ve actually built out what amounts to an automated, deployable, distributed data mesh. Even before the Databricks provider, we had already been building a lot of our various AWS infrastructure inside of Terraform. With the new Databricks provider, we can actually start building this infrastructure out and then pushing it into Databricks automatically. Then with those new resources, such as IAM profiles and other things, living inside of Databricks, we can start building clusters and other resources on top of these AWS resources, all automatically, with just a few configuration lines whenever we need a new cluster, for example. Additionally, using Terraform, we’re able to modularize our building out of workspaces and the creation of objects within each workspace, keeping them pretty identical between workspaces and only changing what needs to be changed to make them unique.
So I’m going to take you through a little bit of code here, just to give you an idea of what it kind of looks like. Here we have a sample from a large map that we use to define all of the IAM profiles that we use inside of Databricks. So here we say, “Hey, we’re going to create the profile core-platform-dev. It needs to assume these roles in other accounts or within the same account, it should have access to this set of secrets, and it should be placed in this workspace with this group ownership.” So once Terraform is run, our code will go through, create the role, create all the policies that allow that role to get access to those accounts, attach those policies, grant access to the secrets it needs, and import it into the Databricks workspace where it’s required.
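[Editor’s note: the slide itself isn’t reproduced in this transcript. A rough sketch of the pattern described might look something like the following; this is not Scribd’s actual code, and the profile name, account ID, secret scope, workspace and group are hypothetical placeholders. The cross-account assume-role policies and secret ACL grants mentioned above are omitted for brevity.]

# Hypothetical map describing each instance profile, the roles it may assume,
# the secrets it needs, and the workspace/group that owns it.
locals {
  instance_profiles = {
    "core-platform-dev" = {
      assume_role_arns = ["arn:aws:iam::111111111111:role/example-cross-account-role"]
      secret_scopes    = ["example-scope"]
      workspace        = "dev"
      group            = "core-platform"
    }
  }
}

# One IAM role and instance profile per map entry.
resource "aws_iam_role" "databricks" {
  for_each = local.instance_profiles
  name     = each.key
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_instance_profile" "databricks" {
  for_each = local.instance_profiles
  name     = each.key
  role     = aws_iam_role.databricks[each.key].name
}

# Register each instance profile in the Databricks workspace so clusters can use it.
resource "databricks_instance_profile" "this" {
  for_each             = local.instance_profiles
  instance_profile_arn = aws_iam_instance_profile.databricks[each.key].arn
}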
Following that, we have modules for creating the workspaces themselves. In this, we have, not pictured here, another map, similar to the roles, that defines, “Hey, I want this workspace to exist.” Then it comes down here, calls the module for all of the different workspaces and the dev workspaces, iterates over that, and just creates individual, nearly copycat, but uniquely addressable workspaces, taking in any of those resources that were created outside of this. This way, if for some reason down the road we have a new team join that, say, needs super secret access to special data that we don’t want anyone else in the company to have access to, we can just add a quick line and boom, we have a unique workspace already built out for them that we can just have them log into. So with that workspace created, now we have to fill it out with the various Databricks resources.
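[Editor’s note: again as a hedged sketch rather than the actual module, iterating a map of workspaces over a shared module might look like this; the workspace names, module path, and variables are assumptions. Module-level for_each requires Terraform 0.13 or later.]

# Hypothetical map of workspaces; add a line here and a new workspace gets built.
locals {
  workspaces = {
    "data-science-dev" = { environment = "dev" }
    "data-science"     = { environment = "prod" }
  }
}

# Each entry produces a nearly identical, but uniquely addressable, workspace.
module "workspace" {
  source   = "./modules/databricks-workspace" # hypothetical local module
  for_each = local.workspaces

  name        = each.key
  environment = each.value.environment
}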
So once we have the URL, we can go in and build this out. What’s not pictured here is the module that does this. It actually has a bunch of files in it that have lists of clusters and other things. But all it needs to get started is what environment we’re in, what service admin accounts need to exist prior to integration with SSO, and what IAM profiles it needs. With this module, we’ve trimmed down all the extra code for downstream PR users, which we’ll get to in a little bit. So rather than having them worry about the entire code base, like, “Oh, check this specific file in this massive Terraform stack,” we can just say, “Hey, just go into this folder and that’s all the configuration for the workspace that you want.”
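[Editor’s note: a sketch of how the workspace URL and the contents module might be wired together; the provider alias, module path, variables, and the module output referenced are all assumptions for illustration, not Scribd’s actual code.]

# Point a workspace-scoped provider alias at the URL of one of the created workspaces.
provider "databricks" {
  alias = "data_science_dev"
  host  = module.workspace["data-science-dev"].workspace_url # assumed module output
}

# Hypothetical module that fills the workspace with clusters, groups, and permissions.
module "data_science_dev_contents" {
  source = "./modules/workspace-contents"
  providers = {
    databricks = databricks.data_science_dev
  }

  environment       = "dev"
  admin_users       = ["svc-databricks-admin@example.com"] # service admins prior to SSO
  instance_profiles = [databricks_instance_profile.this["core-platform-dev"].id]
}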
And I want to point out that we’re going to have a lot more details about this sort of code and this deployment process and all of what we do with this over at tech.scribd.com. Links will be provided. And as I mentioned with the PRs, we’ll get into the PRs in just a moment. All of this Terraform code allows us to have a single source of truth of what our infrastructure is, which, being placed in GitHub, gets all the benefits of any sort of Git management system for your code base. We can provide an easy portal for teams that want to, say, build a new cluster or make changes to a cluster, rather than the old way where they have to submit a ticket to me, or go into our Slack channel and be like, “Help, help, I need assistance. I need to change this from a five to a 10.”
They can actually just go into the code base and submit a PR. Then we can pick it up in our normal PR review cycles, daily, weekly, whatever, and just get it approved. Additionally, with the Git setup, we’ve segmented some of the teams’ configuration items. Say a team like our applied research team has a dedicated cluster that’s only for them, has their special sauce on it, and we want to lock that down a little bit more than just access. We can make them code owners of the configuration files that configure their clusters, and in doing so, when any PR gets submitted for those files, they’re automatically involved in the PR process as a reviewer and approver. That shifts some of the approval burden from us to the teams themselves.
And a further benefit of being in Git: we can leverage the sheer amount of Git automation that’s out there, such as GitHub Actions, to trigger various checks and balances, linting, spell checks here and there, and checking to see if it would actually run or apply. This way, as the end users submit their PRs, they can get instant feedback on whether or not it actually works before we even have a moment to review it.
And finally, one of the biggest and most beneficial reasons to put all of this infrastructure in Git: we get a history of what’s been happening. We can look through the commit log to see exactly when a cluster got created, when a cluster got edited, who edited it, and why they edited it, because hopefully they left a description or a ticket or some reasoning for changing the cluster sizes or whatnot. And if there’s an issue, we can easily look and see, “Hey, let’s go talk to John Doe about this commit and see what he was thinking at the moment, or what the reason behind this was.” I’m happy to say that since rolling this out, we have actually been able to expand the number of authors from the two to three internal team members that have been working on this code base to a total of 13 across Scribd.
While it’s not many PRs per individual, the sheer number of individuals out there has been fantastic. It shows that there’s not really as much of a bottleneck anymore on who has to do the code changes. So it really relieves the amount of hand-holding I need to do on this.
Now, as with anything, there are a few risks and challenges with this. With any sort of change management system or PR process, you’re going to introduce a little bit of a slowdown in how fast you can make changes and edits to things.
Whereas it would take 10 seconds to up the worker count inside of the console and call it a day, just be like, “Cool, we’re good,” with this you have to go through the full code review process to get that pushed out. In the long run, we feel that that is an acceptable loss of time, because you get to catch all the cowboy-like changes before they actually start running rampant, and you can get a little bit of communication involved in those changes. And it’s not too hard to get the attention of someone who has approval rights: “Okay, we need to get this done quickly. Let’s hop on this and review it very fast.”
Secondly, because we use a lot of private preview features with Databricks, or a lot of brand new features that just came out last week or last month, and we’re trying to do everything in Terraform, we tend to be on the absolute bleeding edge of the Terraform provider. And as anyone who has been on the bleeding edge of any sort of software knows, you tend to run into the edge cases first.
So as you mentioned a couple of times, Serge, we’ve definitely had some bumps here and there, but nothing that was ever game-breaking or took any of our systems down; more just, “Oh, we have to account for this here and there.” Speaking of which, Serge, what’s the current version that people can get right now with the Databricks provider?

Serge Smertin: Right now it’s 0.3.3, its fifteenth release, and we’ve actually introduced a SQL permissions feature that helps to manage table access control across thousands of tables across all of the workspaces you have, from a single code base, which comes for free if you use Terraform. You don’t have to buy any other products. You just use Databricks with the Terraform provider and you have fine-grained access control to your data objects. Isn’t that awesome?
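[Editor’s note: the resource being described here is databricks_sql_permissions. A minimal usage sketch might look roughly like this; the database, table, and group names are placeholders.]

# Grant table access control on a single table to two hypothetical groups.
resource "databricks_sql_permissions" "customers" {
  database = "default"
  table    = "customers"

  privilege_assignments {
    principal  = "data-analysts"
    privileges = ["SELECT", "READ_METADATA"]
  }

  privilege_assignments {
    principal  = "data-engineers"
    privileges = ["SELECT", "MODIFY"]
  }
}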

Hamilton Hord: That’s awesome. And that means I need to really hurry up and get on the sprint to upgrade, because I can’t wait to play with that. And then finally, a big hurdle, but not one we foresee being much of one going forward: we’ve seen a bit of a slow adoption of committing Terraform changes, at least compared to when we first released Databricks to our company, when people went wild with creating clusters, getting super excited and making all sorts of specialty single-use clusters. With this extra hurdle of the change management and needing to learn Terraform, people are a little bit more hesitant to just run rampant with it, but we don’t expect this to be a huge barrier going forward, as we’ve invested a lot of time and effort into internal documentation with very detailed notes of, “Hey, this is what you want. This is what you need to do,” as well as many code examples in the code itself and extensive code review processes on request.
So we’ll actually pretty much pair with them: “Hey, this is exactly what you need to do here, here and here. Oh, this is some funky thing with Terraform. You might want to check this and make sure it works right.” But with all of this going forward, we don’t really perceive this being a huge problem. Now, Serge, what do all these cool features that I’ve been happily playing with recently mean a little bit more generally?

Serge Smertin: Well, Scribd has been playing with the features of the Terraform provider for more than nine months now, from what I see. You’re one of the oldest users of it, and one of the most versatile ones; you’re using almost all of it. Basically, Scribd is doing what the Martin Fowler blog calls a distributed data mesh. A distributed data mesh is inevitable for companies that have more than 50 people in the data team and want to let each team build their own datasets and have their own journey without relying too much on other teams. Of course, they rely on a central team, the data platform engineers, and the data platform engineers are maintaining the whole infrastructure, usually with Databricks and Terraform, because if there is infrastructure in a cloud, there is Terraform.
And this is not something new, so be prepared to hear about this concept of the distributed data mesh more and more often, just because MartinFowler.com has written about it, and once something appears there, more and more people start talking about it. So let’s talk over a couple of real use cases of Terraform at a bit larger scale. Companies usually start with a job, a second job, a third job; they make like 20 different jobs that work with storage mounts. They work with that for a couple of months and slowly bring it to production. And in production, they have to start adding permissions for their on-call teams and their billing teams and their business users to share auto-scaling clusters. And they have to control permissions, otherwise they’re not compliant with the security policies of their organization. And it becomes more fun when they start to add things like monitoring, audit logs, and IP access control lists to this whole thing.
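[Editor’s note: as an illustration of that permissions step, the provider’s databricks_permissions resource lets you express cluster sharing in code. The group names are placeholders, and the cluster reference reuses the hypothetical cluster from the earlier sketch.]

# Let on-call restart the shared cluster and business users attach to it,
# without giving either group full manage rights.
resource "databricks_permissions" "shared_cluster" {
  cluster_id = databricks_cluster.shared.id

  access_control {
    group_name       = "on-call"
    permission_level = "CAN_RESTART"
  }

  access_control {
    group_name       = "business-users"
    permission_level = "CAN_ATTACH_TO"
  }
}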
Then they think, “Oh yeah, we actually should have automated this from the beginning.” And they start automating, and they dig in with Terraform, because they need to scale, because suddenly they have to start thinking about data locality rules. In some cases, the amount of data is so huge that they have to process that data within the region and transfer out only a processed Delta table. So the cost of managing bigger infrastructure across 30 different workspaces is justified by network savings and regulatory costs. This picture is not actually unusual. We see some customers, both internal and external, that have more than 10 different workspaces. Sometimes their workspaces are all the same, 30 of the same kind, or they have different, very thin workspaces. So Terraform is actually enabling rolling these things out into production in an easier way.
And this picture was just a simplification of the real thing that is actually used within Databricks internally for things like billing. And here’s a call to action: please explore the Databricks Terraform provider as a way to automate your entire infrastructure and code review process, to be able to recover from disasters, and to control the cost and security of your cloud spend in a reviewable way. Because if you grow to more than 50 people in your tech team, you definitely need to adopt some production change advisory board type of methodology, and this is the best way to go about it.

Hamilton Hord: Well, thank you very much for that, Serge. And again, my name is Hamilton. I work at Scribd, and I want to call out that we are hiring. So if you want to check us out, please do; links will be in the chat. And Serge, isn’t the Databricks provider completely open source?

Serge Smertin: Oh, yes. It’s been open source from the very beginning, and we are always looking for more contributors.

Hamilton Hord: Well, definitely go out and check it out. And I believe Databricks is hiring as well, if I recall. Thank you very much for coming to our talk and have a great day.

Serge Smertin: Thank you everyone.

Hamilton Hord

Hamilton is a Site Reliability Engineer at Scribd where he has helped deploy Databricks to hundreds of happy internal customers. He enjoys helping data scientists, analysts, and machine learning devel...

Serge Smertin

Serge Smertin is a Resident Solutions Architect at Databricks. In his over 14 years of career, he’s been dealing with data solutions, cybersecurity, and heterogeneous system integration. His track r...