Architecting Agile Data Applications for Scale

May 27, 2021 05:00 PM (PT)


Data analytics and reporting platforms historically have been rigid, monolithic, hard to change, and limited in their ability to scale up or scale down. I can’t tell you how many times I have heard a business user ask for something as simple as an additional column in a report, and IT says it will take 6 months to add that column because it doesn’t exist in the data warehouse. As a former DBA, I can tell you about the countless hours I have spent “tuning” SQL queries to hit pre-established SLAs. This talk will cover how to architect modern data and analytics platforms in the cloud to support agility and scalability. We will include topics like end-to-end data pipeline flow, data mesh and data catalogs, live data and streaming, performing advanced analytics, applying agile software development practices like CI/CD and testability to data applications, and finally taking advantage of the cloud for infinite scalability both up and down.

In this session watch:
Richard Garris, Executive (VP, GM), Databricks

 

Transcript

Speaker 1: Good afternoon everyone, here on the West Coast in the United States, good evening to the East Coast, and good morning to APJ. Welcome to Data + AI Summit, where the theme this year is “the future is open.” My name is Richard Garris. I’m an AVP of Field Engineering, and today I’m going to talk about architecting agile data applications at scale. So just a brief agenda: a little bit about myself, a little bit about the world’s most valuable companies and how those companies became the most valuable. We are going to talk a little about software development and the history of different methodologies from Waterfall to Agile, as well as talk about the difference between traditional data platforms and modern data platforms, and finally, I’m going to do a summary. Also, at the very end, there’s going to be a survey for this talk, and I’m looking forward to your feedback.
A little about me: I’m a proud Buckeye. I graduated from Ohio State. I also got my master’s degree at Carnegie Mellon. My undergraduate degree was in MIS and Accounting, and then I got a software management degree part-time at Carnegie Mellon University. I spent the better part of two decades in the data space, starting all the way back when we had IMS databases, information management system databases, on the IBM mainframe, old [inaudible] systems. I worked through the phase with RDBMSs, with Oracle and SQL Server as our main data platforms. And most recently, I’ve been working in big data; I was working on Hadoop and Spark. I’ve worked for three years as an independent consultant, five years in data management with PricewaterhouseCoopers, which is one of the premier consulting companies today, as well as three and a half years at Google on their data team, and the last six years as an AVP at Databricks, where I’m working with our customers to really transform their businesses around data.
I’m also a certified scrum master. I got certified pretty early, which will be interesting when we get later on into the talk. I’ve done several other talks, at both Data + AI Summit as well as other popular venues, on Agile, Spark, and development with Data Science Central. I did a talk on ETL 2.0 for MSDN Channel 9 with Microsoft. These links will be available to you when I pass on these slides after the talk, if you’d like to take a look at my other talks; they’re also on my LinkedIn page. Just look for Richard Garris.
So if you look at the last 28 years, and it doesn’t even feel that long, at least to me, the top companies in the world have changed. Back in 1993, the top companies of the world were iconic brands like General Electric, ExxonMobil, Walmart, Coca-Cola, Merck, and they really dominated the Fortune 10 companies in the world. But in the last 28 years, you can see the shift. We’ve gone from these very iconic brand-name companies to mostly technology companies. So the top five listed here are Apple, Microsoft, Amazon, Alphabet (or Google), as well as Facebook. You can see the shift over that period of time. And so the question is, though, these top companies, what are they doing differently? If you look at Facebook, Apple, Amazon, Google, Microsoft, often called FAANG, or FAANG minus Netflix, what are they doing differently that really allows them to be as competitive as they are?
A lot of people have talked about this, and there have been many discussions on Medium as well as other publications. Is the reason why these companies are very successful because they have lots and lots of data? I would say, from my view, doing this for two decades, not really. At one point in time, the FAANG companies may have had some unique datasets, whether you’re Facebook having 2.7 billion user profiles, or you’re Google having search results or information from YouTube, but now other Fortune 500 companies, commercial market companies, as well as other digital natives, or public sector, state and local governments, have a lot of data too.
Is it because they have better artificial intelligence, deep learning, ML algorithms? Again, not really. At one point in time, they did. They had a lot of proprietary IP for how they could do more advanced analytics, but today a lot of that is available in open source. Google published TensorFlow not too long ago. Facebook famously has PyTorch. Amazon has MXNet. Microsoft has LightGBM. A lot of that is available in open source, and a number of the other algorithms are also available in research papers today. So that’s also not quite the case.
Or, is it because they have better data processing? Again, not really. I mean, at one point Google, Amazon, Microsoft, Facebook, and Apple all had the best infrastructure in the world: custom-built servers, custom-built hardware. But today, because of the advantages that you get from public cloud, a lot of those technologies are available on demand to anyone that wants access to them. And there’s also a number of open source and commercial software offerings available to anyone who wants to process big data at scale. So again, this is all available to most companies. So why are these companies doing as well as they are?
I’m going to do a quick poll. Should have popped up on your screen. Any guess what these numbers are? 20,000, 36,000, 60,000, 90,000. I’ll wait for a few seconds for people to answer the poll question.
These are the number of engineers that each of these companies employ. I did not use any non-public information for this. It’s already [inaudible]. I only used public information, no proprietary, no confidential data. I derived this from looking at public job postings, Glassdoor, and looking at R&D spend as a percentage of total salary spend on FTEs. These are all best-guess estimates, because these companies keep the number of engineers actually hired pretty close to the vest, but as a percentage, they hire more engineers than your average organization out there. And that is really what distinguishes them from the rest of the Fortune 500 [inaudible] that are out there and [inaudible] something today.
What do all these engineers that these companies have hired bring to the modern enterprise? They bring Agile. One of the things that we’ve seen with software development is that it is a profession that requires a lot of adaptation to change, hence the birth of Agile development over 20 years ago; it’s really the premier way that people manage the life cycle of software. You can see in this quick diagram, it starts with design, develop, test, deploy, operate, support, and it’s a continuously improving cycle for how you develop and maintain software. And I’ll tell you a little more about Agile in the future slides. But a lot of this actually came out of Japan, around Kanban, and specifically Toyota, to improve the manufacturing process so they could actually make more competitive cars. Based on learning and retrospectives, they could improve that process to make the best cars they could, at the highest quality, while also reducing the overall cost.
I follow a number of different venture partners within Silicon Valley. I myself live in San Jose. One of those is Jacques Benkoski. He’s part of US Venture Partners. So, more of a thought-provoking question, it’s rhetorical, but the question is: do start-ups actually change the world?
I’d argue that they do, but also that they don’t, because start-ups don’t change the world; they actually adapt to the world faster than everyone else. What you will find is that everyone has some great ideas, some great theories, and we have hypotheses on what could be a great way to solve a new problem that comes out. But really what distinguishes these top five iconic, really software companies, or others, is that they’re able to adapt to the world faster than everyone else.
There’s a couple of examples. Amazon: in the mid-2000s, they were still relatively small, going from a bookseller to being a general product seller. What really allowed them to take off was the idea of Amazon Prime and eliminating shipping fees. History could have been very different. They learned a lot based on looking at the data, looking at how they did their e-commerce experience for their customers. One of the experiments they wanted to run was: if we get rid of shipping fees, can we really increase the overall adoption of e-commerce? And if Amazon hadn’t done that, if they hadn’t adapted to what the market wanted, history could have been very different. I mean, eBay, Overstock or Walmart, [inaudible] traditional players could have dominated e-commerce. But with that one change, Amazon really transformed their business.
I see the same thing with Google. I used to work there, but this is public information. Google is very relentless at rewriting backend services to adapt to the needs of their users. One of the examples is, they’re a search engine, and the version of search used today is very different than the version of [inaudible]. It’s probably somewhere around the seventh or eighth version of that experience. And as a company, you want to continuously adapt that user experience. You need to be competitive and also adapt to what your customers need.
But what does that have to do with data applications? Really, what these companies have done is bring Agile to data applications. I’m going to cover that in this talk. Some of this is based on the technologies these companies have championed, including things like the MapReduce paper and the Google File System, GFS, from Google, and Amazon, and then before that, Yahoo and Facebook championed a lot of it too. And the reason why they championed these types of technologies is because they do allow for more of an Agile data application, and it follows standard software engineering principles, things like being testable, being open source, not using proprietary software languages, and it allows these companies to actually scale their data pipelines to meet the needs of their businesses in a very agile way.
Rewind back to when I was an undergraduate and look at the nineties. What was common in the nineties, in terms of software development, was really the idea of Waterfall. Waterfall is very much one stage following on the other. It was very hard to accommodate change, and change was very expensive, since everything is very single-threaded and you go from concept to requirements, analysis, design, development and implementation, test and QA, and finally, deploy and maintain. At any point along this step function, if there is a change, or the requirements change, you have to go all the way back to the beginning and start over again. It’s very hard to accommodate change in a Waterfall methodology. And this is how software was really built; it was built with this type of framework.
At the same time, [inaudible], our architectures came out with built-in [inaudible] and the idea of having the enterprise Data Warehouse. It worked really well in the nineties and early 2000s because it worked for the time and the era.
Those operational systems had very little data, mostly structured, small in volume. They gave us GUI tools to do ETL, like Informatica, [Ab Initio], Oracle Data Integrator, or you could use the native capabilities in the Data Warehouse or the database, things like PL/SQL stored procedures, T-SQL, et cetera. You processed data using those types of ETL tools. You stored all your data in the Data Warehouse itself using what are called [inaudible] tables, which are often unmanaged and sometimes would be deleted to save cost. And the single source of truth was the monolithic Data Warehouse, where everything was predefined schemas. It was all very predictable. And most of the EDWs were sold as expensive appliances, so the data was locked into a proprietary format that combined both compute and storage.
And the only way to scale this out was to either buy more appliances, buy more Teradata, buy more Oracle, buy more IBM DB2, or to buy more expensive versions like Oracle RAC and other ways to get better performance. These Data Warehouses also had minimal support for arbitrary files, as well as semi-structured, unstructured, or streaming sources. It was also very inflexible, because to add a single column to, let’s say, a report, or if you’re a data scientist building a model, it could take up to six months because it was built for Waterfall. You’d have to go back to the original source system, pull it in, do ETL in the staging area, put it in the Data Warehouse, and then it would take six months to add it to the report, which doesn’t really work when you’re thinking about adapting to the business’s needs. And it worked well for human intelligence, things like accounting and finance and static reports, as well as dashboards, but it didn’t really work well for AI or ML, because most of those AI and ML tools are really limited to querying the set of datasets available in your warehouse, and you couldn’t get to more data that could actually make your model better.
In general, in this timeframe, [inaudible] I was a former Oracle DBA. Most of the users of this type of tool were DBAs, ETL designers, and BI analysts; there were no real data engineers or data scientists at this time. I can’t tell you how many hours I spent tuning an Oracle database, going back and looking at all the different queries and the query planners. It was a lot of effort for a very hard-coded, hard-to-manage system.
Fast forward a little bit, let’s go back to our software development discussion. So around the mid-2000s, I got my, excuse me, certified scrum master certification, and I heard about the Agile manifesto, which was published in 2001. The Agile manifesto was really a rejection of the Waterfall methodology, and it was a pushback against all the paperwork and bureaucratic processes required to build software. A very early version of Agile was what’s called Scrum, which is depicted on the screen here, as well as XP, Extreme Programming. And it worked really well for small self-managed teams, and it kind of took away a lot of the heavy approvals and the heaviness of the Waterfall methodology, but it didn’t scale well to larger teams and the needs of larger enterprises because it lacked some of the discipline of Waterfall.
The good part about this, though, if you look at the diagram here, is that we created the idea of iterations or sprints, where every two weeks we’re looking at: what are the user requirements? What are the things that customers are asking for in the stories? You do some implementation work, hold the Scrum meeting to see how you did, do retrospectives to evaluate how well you did, and then rinse and repeat.
So with this type of methodology, you can incorporate change as part of the process. It also incorporates the idea that everything can be an experiment. You can form hypotheses, and as you learn more through the process of building whatever project or software you’re doing, you can actually accommodate those changes.
So around the same time, we also did have the open Data Lake, or Hadoop. The open Data Lake was a little bit like pure Agile in that it gave a lot more flexibility to help people manage data as well as data applications. This type of environment, Hadoop, supported new sources like web-scale data, SaaS sources like Salesforce, as well as the old operational systems of structured data, but it also supported semi-structured data. So you could load logs, JSON, and unstructured data like images and audio, because everything is just a file, a file that lands in HDFS.
It could handle high-volume, high-velocity data. It could also handle some level of streaming data. And again, HDFS was great because it was built on commodity servers, so it wasn’t locked into proprietary storage systems like SANs, or Flash storage, et cetera. And because of that, it was beefy servers, but relatively less expensive than the expensive [inaudible] from Data Warehouses.
Applications could be written and deployed inside of Hadoop using the resource negotiator, or YARN. And you wrote them in open-source languages like Java or Scala, or they could be written in Hive, which is the SQL engine for Hadoop, or in more custom domain-specific languages like Pig, and Mahout for machine learning.
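To give a flavor of what hand-rolled Hadoop-era jobs looked like, here is a minimal sketch of a word count written as a Hadoop Streaming style Python script. This is illustrative only: the transcript mentions Java, Scala, Hive, and Pig rather than Streaming specifically, and the input/output paths and submission command in the comments are hypothetical.

```python
#!/usr/bin/env python3
# Minimal Hadoop Streaming style word count (illustrative sketch).
# Mapper emits "word\t1" per token; reducer sums counts per word.
# Hypothetical submission:
#   hadoop jar hadoop-streaming.jar \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce" \
#     -input /logs/raw -output /logs/wordcounts
import sys


def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{total}")
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```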
The commodity servers allowed you to scale this out, so you got better performance, because you could run things at scale across a number of different computers, across different cores and memory and servers. Initially it was cheaper to get better compute, because you were using commodity servers and because you could scale out, but because compute and storage were paired together, you often had to provision enough capacity to handle the max amount of storage and the max amount of compute. And you couldn’t really pay for what you used. Again, if you needed more storage, you had to buy more servers, or if you just needed more compute, because you had some pretty complicated processing to run, you’d also, again, have to buy more servers.
The performance actually ended up being a little more of a mixed bag. It allowed scale-out, but it was very hard to tune Hadoop and YARN. With YARN, it was very much a 1.0-type experience with [inaudible] compute. You had to specify the amount of memory and the number of cores for each job, which is always really hard to estimate. And then you had the various different SQL engines on Hadoop, each of them trying to incrementally improve the performance of the query engine, things like Impala, Hive, and Hive variants like Hive LLAP. And every time you had a new engine, yes, it might have been faster, but it was also hard to tune and hard to scale out.
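To make that tuning pain concrete, here is a minimal sketch of the kind of up-front resource sizing a Spark-on-YARN job required; the job name, queue, and numbers are purely hypothetical, not recommendations.

```python
# A sketch of per-job resource sizing on YARN: executor counts, memory, and
# cores had to be guessed at submit time, and guessing wrong meant either
# wasted cluster capacity or out-of-memory failures.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("nightly-etl")                      # hypothetical job name
    .master("yarn")
    .config("spark.executor.instances", "20")    # fixed for the life of the job
    .config("spark.executor.memory", "8g")       # hard to estimate in advance
    .config("spark.executor.cores", "4")
    .config("spark.yarn.queue", "etl")           # hypothetical YARN queue
    .getOrCreate()
)
```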
Really, the hallmark of Hadoop was the idea of schema on read versus schema on write, which created a ton of agility, but because of the lack of schema enforcement, reliability became an issue. Hence, the Data Lake often became a Data Swamp, because you didn’t have a source of truth in terms of the schema. And with everything being a file, it’s hard to reason about what is actually in your Data Lake.
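A minimal PySpark sketch of that trade-off, with hypothetical paths and column names: schema-on-read happily infers whatever the files contain, while declaring a schema up front surfaces bad data at load time.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema on read: whatever landed in the lake is accepted, and schema drift
# often goes unnoticed until a downstream query breaks.
raw = spark.read.json("/datalake/events/")          # hypothetical path

# Schema on write style: declare the contract and fail fast on malformed rows.
events_schema = StructType([
    StructField("user_id", LongType(), nullable=False),
    StructField("event_type", StringType(), nullable=False),
    StructField("ts", StringType(), nullable=True),
])
validated = (
    spark.read.schema(events_schema)
    .option("mode", "FAILFAST")                     # reject non-conforming records
    .json("/datalake/events/")
)
```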
I would also [inaudible] monolithic attributes, because you actually deployed the applications into Hadoop, and you paired your compute and storage, so everything had to be upgraded at the same time. If you were using a certain version of Python, like going from Python 2 to Python 3, you had to upgrade every application that uses Python. Or, if you were upgrading from Hadoop 2.1 to 2.2, again, you had to upgrade all the applications at the same time. So it’s very hard to create agile teams that are self-managed if everything has to be done in more of a Waterfall method, with all of these different teams working together.
And lastly, and I think this is also one of the hardest parts of Hadoop, it required very specialized people to manage and develop on Hadoop. You needed special admins and trained developers. They had to use different frameworks: MapReduce, Tez, Hive, the different SQL engines on Hadoop, Spark, Flink, Storm. It was very hard to find general software developers or general data practitioners that could actually use Hadoop without all of that specialized training.
And lastly, analysts did use Hadoop, but didn’t concern themselves with infrastructure, because they were shielded away from a lot of this complexity, but ultimately they would fall back to the EDW whenever they had [inaudible] challenges, or data wasn’t available, or they couldn’t find the data. And so ultimately, the goal and promise of Hadoop was to offload or even replace the EDW, but that didn’t actually happen over the course of time.
So again, going back to our software development discussion: what happened from the mid-2000s to just about today is the idea of modern Agile. Modern Agile is pretty much a hybrid. It took advantage of some of what we learned from Waterfall, as well as what we did in Agile, to put together more of a modern way to think about software development. It incorporated a lot of different things from Waterfall, things like how we measure value, how we think about governance, risk management, and security, how we run executive steering committees and make sure our stakeholders are involved, while still keeping the spirit of the idea that you have sprints, iterations, and change as part of the process. So you’re trying to combine the best of both worlds, Waterfall as well as Agile, into a single hybrid that can do both of those things. My favorite is DAD, or Disciplined Agile Delivery. There’s also another one called SAFe, or the Scaled Agile Framework. But these are both now well-defined frameworks for doing modern Agile.
So the next hybrid that is coming, just as we speak, is the idea of a Lakehouse Platform. And again, it is a hybrid: a Lakehouse is part Data Lake and part Data Warehouse. We’ve been seeing this from the late 2018, 2019 timeframe to now, the 2020s and beyond. And the idea here is that you really combine the best of the Data Lake, what we learned from the Hadoop 1.0 era, with the Data Warehouse. So this Lakehouse supports all the new data sources, and supports the scalability of the cloud as well as multicloud, so any cloud you want, as well as private cloud if you choose to do something more on premise. It supports structured tables, so structured data, semi-structured data like logs and JSON, unstructured data like audio and images, and live data through streaming. And all of it is stored in an open storage format.
And one of the key tenets of this platform is that it separates out compute from storage. So you have a place where you can store data in an open format that’s reliable and also infinitely scalable, if you take advantage of object stores like S3, ADLS, or Google Cloud Storage. But most importantly, what distinguishes it from your general Hadoop 1.0 Data Lake is that it has a data management layer for reliability, as well as schema enforcement. The most common of those on the market, in open source, are Delta Lake, Iceberg, Hudi, and to some degree Hive ACID, because they give you transactional guarantees, and also a way to make sure that the data you put into your Lakehouse is good-quality data, so your downstream users can actually take advantage of it for reporting, analytics, data engineering, as well as data science.
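As a minimal sketch of what that data management layer buys you, here is a Delta Lake write in PySpark, assuming the Delta Lake libraries are configured on the cluster; the paths and source data are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-demo").getOrCreate()

orders = spark.read.json("/landing/orders/")         # hypothetical raw source

# Writing through Delta gives ACID transactions plus schema enforcement:
# an append whose columns don't match the existing table schema is rejected
# instead of silently polluting the lake.
(orders.write
    .format("delta")
    .mode("append")
    .save("/lakehouse/bronze/orders"))
```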
It also is more [inaudible] of having multiple layers. This is similar to the staging table architecture you saw in the Data Warehouse: you have a landing zone, or bronze. You have silver for your refined data, where you’ve cleaned up, removed duplicates, and gotten rid of your bad data. And lastly, you have your gold layer, which I’m jokingly calling dogecoin, because right now dogecoin is actually more expensive than gold, for all your business-level aggregates, for your final reporting layer.
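A hedged sketch of that bronze/silver/gold flow in PySpark with Delta tables; the cleaning rules, column names, and aggregate are purely illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

# Bronze: land the raw data as-is so nothing is lost.
bronze = spark.read.json("/landing/orders/")
bronze.write.format("delta").mode("append").save("/lakehouse/bronze/orders")

# Silver: refine the data -- deduplicate and drop obviously bad records.
silver = (
    spark.read.format("delta").load("/lakehouse/bronze/orders")
    .dropDuplicates(["order_id"])
    .filter(F.col("amount") > 0)
)
silver.write.format("delta").mode("overwrite").save("/lakehouse/silver/orders")

# Gold ("dogecoin"): business-level aggregates for the final reporting layer.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
gold.write.format("delta").mode("overwrite").save("/lakehouse/gold/customer_ltv")
```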
Now, this architecture isn’t something that Databricks invented, or that a vendor put together. This is what we see from other tech-forward companies like Netflix, and how they think about operating their Data Lake.
This is where you actually build agile data applications on a platform that separates out the compute and code from the actual storage layer. You can have different data science [inaudible]. You can have different data engineering projects, and use MLflow, for example, to orchestrate and manage the life cycle of your models. But it sits outside of the storage layer. It sits outside of the platform, so you can actually have agility and have each team manage their own data products that don’t collide with each other.
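A minimal MLflow tracking sketch of that model life cycle management; the experiment name, model, and metric are hypothetical, and scikit-learn is used purely as an example model library.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("/teams/churn-model")        # hypothetical experiment path

with mlflow.start_run():
    model = LogisticRegression(max_iter=500).fit(X_train, y_train)
    mlflow.log_param("max_iter", 500)
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
    # The model artifact is tracked and versioned outside the storage layer.
    mlflow.sklearn.log_model(model, "model")
```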
And then you can do all kinds of analytics with this Lakehouse platform, everything from internal applications, dashboards, or reports for internal usage, using whatever [inaudible] you use.
You can do external, customer-facing applications: end-to-end model life cycles, recommendation systems, customer-facing applications. And this is a difference from Hadoop, where you’re really doing internal analytics. A number of our customers at Databricks, as well as other customers that use these Lakehouse platforms, can build the end-to-end life cycle for their customers and actually deliver value directly to them.
And lastly, you can move your data from your Lakehouse to your downstream systems as needed for specialized data storage, whether you’re looking at graph databases like Neo4j, NoSQL databases for real-time access, or EDWs, if you need an EDW for other reporting, using the EDW for just your final reporting layer.
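A hedged sketch of that last step, pushing a gold table into a downstream warehouse over JDBC; the connection details are placeholders, and the appropriate JDBC driver is assumed to be available on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("downstream-sync").getOrCreate()

gold = spark.read.format("delta").load("/lakehouse/gold/customer_ltv")

# Push only the final reporting layer into the specialized downstream store.
(gold.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://edw.example.com:5432/reporting")  # placeholder
    .option("dbtable", "analytics.customer_ltv")
    .option("user", "etl_user")                                         # placeholder
    .option("password", "...")                                          # placeholder
    .mode("overwrite")
    .save())
```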
We’re running a little short on time, but I’ll go through a couple of the other slides. So one, this is great for modern data personas. Data scientists want an environment where they can be very experimental, because data science is a science. It’s going to take constant evolution, experiments, and hypotheses to learn what is the best model to use. It gets data scientists off their laptops and into an environment where they can access data in a secure way.
It’s also great for the engineers. Engineers are developers. They want to write code in standard programming languages like Java, Scala, and Python, and not in proprietary stored procedures or ETL tools. It also gives them the ability to write high-quality production code that is testable, reusable, and modular, and can be continuously integrated and deployed into production, so they can actually evaluate how well the pipeline works in a [inaudible 00:25:21] environment.
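A small sketch of what “testable, reusable, and modular” can look like for a pipeline transformation, written pytest-style; the business rule itself is made up for illustration. A test like this can run in CI on every change, which is the kind of continuous integration being described here.

```python
from pyspark.sql import DataFrame, SparkSession, functions as F


def remove_cancelled_orders(orders: DataFrame) -> DataFrame:
    """Pure transformation: easy to unit test and reuse across pipelines."""
    return orders.filter(F.col("status") != "CANCELLED")


def test_remove_cancelled_orders():
    spark = SparkSession.builder.master("local[1]").appName("tests").getOrCreate()
    orders = spark.createDataFrame(
        [(1, "SHIPPED"), (2, "CANCELLED")], ["order_id", "status"]
    )
    result = remove_cancelled_orders(orders)
    assert result.count() == 1
    assert result.first()["status"] == "SHIPPED"
```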
And lastly, for data analysts, it gives them more access to data, quicker. They can actually get quicker access to data as it’s produced in the transactional systems. And they can even use a little bit of lightweight Python or R for advanced analytics, and get away from downloading it from their warehouse into Excel or [inaudible], using a notebook instead. And they can do the analysis in an environment that gives them that flexibility.
So, a little bit about Data Mesh, and I’m going to go a little bit faster here. Data Mesh is an architectural pattern introduced by Zhamak Dehghani of Thoughtworks. It talks about how you can think about moving beyond a monolithic Data Lake to a series of self-managed teams that actually manage data as a product. Again, data is a business asset, but if data just sits in a lake, or sits in a Data Warehouse, and doesn’t get used, it’s not being monetized. By dividing up data access and data control across different self-managed teams, you can actually be much more agile about how you use data as an asset and produce value out of it. But you also want governance and standards that are centralized. This allows for scalable interoperability and data sharing. And this whole notion of Data Mesh sounds a lot like the hybrid that the Agile and Lakehouse approach provides.
Again, I work at Databricks, but I am also a firm believer that the best technology should win. I have a slide here talking a little bit about the differences. When you think about Lakehouses, you will want to look at what Databricks provides out of the box. I won’t read every bullet point, but it gives you some sense of what Databricks provides for a Lakehouse. You can look at what the cloud providers provide out of the box, whether it’s a commercial offering like EMR, HDInsight, or Dataproc, and then different query engines, or you can do it yourself, and there are advantages and disadvantages to each of these. So I would look at these different options and think about: what is the best technology to support your Lakehouse framework or paradigm?
And so in summary, if you’re thinking about [inaudible] data applications, you really don’t want to put your data into the Data Warehouse or a first-generation Data Lake, because you often end up paying more for storage or compute, rework and change are very expensive, and it’s not really built for agility. A [inaudible] is monolithic and hard to support the notion of a Data Mesh, as well as self-managed data domains. But with an Agile data application and a Lakehouse, you only pay for what you use, and it should lower your overall total cost of ownership. With agility, change is part of the data application life cycle, so it can easily adapt to new use cases, new projects, and new products that you’re trying to launch within your business. Agility supports the idea of one data application per project, team, or domain, and agility supports the Data Mesh paradigm.
Thank you to everyone who is attending Data + AI Summit virtually this year. Your feedback is important to us. Please review the session and let us know any questions you have, as well as any feedback. Thank you very much.

Richard Garris

I am a seasoned data and analytics professional with 18 years of experience both as an architect and team leader. Experienced in data architecture, data management, business transformation, data strat...