EPISODE 6

Journey of Big Data

Jules Damji and Tathagata Das guide us through their journey in big data and the evolution of data architecture over the past 30 years. They discuss some of the biggest changes in the industry they’ve seen, as well as trends to look forward to in the coming years. This is a fun episode connecting all four authors of the Learning Spark, 2nd Edition book.

Jules S. Damji
Jules S. Damji is a Senior Developer Advocate at Databricks and an MLflow contributor. He is a hands-on developer with over 20 years of experience and has worked as a software engineer at leading companies such as Sun Microsystems, Netscape, @Home, LoudCloud/Opsware, VeriSign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a BSc and MSc in computer science and an MA in political advocacy and communication from Oregon State University, Cal State, and Johns Hopkins University, respectively.

Tathagata Das
Tathagata Das is a Staff Software Engineer at Databricks, an Apache Spark committer, and a member of the Apache Spark Project Management Committee (PMC). He is one of the original developers of Apache Spark, the lead developer of Spark Streaming (DStreams), and is currently one of the core developers of Structured Streaming and Delta Lake. Tathagata holds an MS in computer science from UC Berkeley.

Video Transcript

Journey of Big Data

Denny Lee 00:06
Welcome to Data Brew by Databricks with Denny and Brooke. The series allows us to explore various data topics in the data and AI community. Whether we’re talking about data engineering or data science, we’re going to interview subject matter experts to dive deeper into these topics. And while we’re at it, we’re going to enjoy a morning brew. My name is Denny Lee, I’m a developer advocate here at Databricks with a background in data engineering and data science.

Brooke Wenig 00:32
And hi, everyone. My name is Brooke Wenig, machine learning practice lead at Databricks. Today, I have the pleasure of introducing Tathagata Das and Jules Damji to our episode. Both of them are actually co-authors of the Learning Spark book with Denny and me; we’ve all presented on stage together. We’ve been comrades for many years. Tathagata Das, also known as TD, is a staff software engineer at Databricks, a PMC member of the Apache Spark project, and a committer to Delta Lake and countless other projects. And Jules Damji is a senior developer advocate at Databricks; he’s our MLflow advocate and presents on various topics in the open source and AI community. So, that was just a quick intro of the two of them, but I would love for each of them to explain in their own words how they got into the field of big data. Jules, how about we start with you? I know you have decades of experience in this field.

Jules Damji 01:19
Well, thank you. All my white hair speaks for itself, but it’s a pleasure to be here with all my esteemed colleagues. How did I actually get started with big data? I think it was actually quite serendipitous. I was working at a startup company where we were publishing books and digitizing books. And when I got hired, we had a legacy system that we had to somehow scale. The way it worked is that the publishers would give us these books and we would digitize them, and then we would provide library-as-a-service to all these first- and second-tier academic institutions. But the process in which we were doing that was very singular, it was really serial: we would get the PDFs from the FTP site, from our publisher.

Jules Damji 02:10
We had our own internal indexer that would take the PDF and convert it into pages and words. Then we would feed it to Lucene to index the pages, and then we would give it to our back-end machine to publish it. Now, if we got 50 books, it would take us about five days, sometimes maybe a couple of weeks. And the whole idea was that as soon as the books were released, we should be able to give them to the universities as soon as possible.

Jules Damji 02:39
So, at the time this notion of big data and Hadoop was coming into prominence. And the whole idea was that you use MapReduce programs to take the list of things you want to do, parallelize them, bring the results back down into a list, and then hand them to the next task. So, we actually revolutionized the entire process by taking all these PDFs, writing a MapReduce program, and giving the output to Lucene, which would then do the indexing. That was my first introduction to big data, Hadoop, and MapReduce. And then I started working for Hortonworks, and then a light bulb went off in my head about Spark, and the rest is history.

Brooke Wenig 03:25
I actually had no idea you started off in the publishing industry. No wonder you’re such a fantastic first author on our book.

Jules Damji 03:30
Oh, thank you very much. That’s a compliment coming from you.

Brooke Wenig 03:34
All right, TD, how about you start off with a quick introduction of how you got into the field of big data.

Tathagata Das 03:38
So, unlike Jules who has decades of experience, I have exactly one decade of experience. My journey into big data starts with grad school, where I joined this fantastic research group called the AMPLab at UC Berkeley, where Spark started. As soon as I joined, back in 2010, I got involved with the Spark project, worked on the core Spark research project back then, and extended it over the next few years, building Spark Streaming for processing streaming data on top of it. From there I transitioned into Databricks as one of the early engineers and continued building Spark Streaming. Then, after Spark SQL was built, I transitioned into building the next generation of streaming engine beyond Spark Streaming, Structured Streaming, and since then Delta Lake, and the journey continues. So, it’s been a fun ride seeing our grad school research project, Spark, grow into such a phenomenon. It’s been a really fun ride.

Brooke Wenig 04:52
And actually, the AMPLab is also where our CEO, whom we interviewed earlier in the series, worked; he was also part of the Spark project. So, lots of great minds came out of that.

Denny Lee 05:00
Absolutely. So, actually, before we dive into the next set of questions, Brooke, why don’t you tell us a little bit about yourself in terms of how you got into big data as well, since this is less of a guest-host setup today and more of a panel of four really fun people.

Brooke Wenig 05:17
Oh, oh, you just put me on the spot here, Denny.

Denny Lee 05:19
That’s right.

Brooke Wenig 05:21
So, I started getting into big data actually in grad school, actually, technically undergrad. I was interning at Splunk and they asked me, as their intern, to look into this project called Spark. I had no idea what it was; it was going way over my head. Effectively, what they wanted to do is what spark-sklearn turned out to be. We were building this machine learning toolkit, and what they wanted to do was build lots of different anomaly models in parallel. So, that’s why I had to investigate Spark. And back then Databricks had launched their first MOOCs on distributed machine learning and on Apache Spark. So, I attended the MOOC and absolutely fell in love with it. And then I reached out to the professor who had run the MOOC, because I realized he was a professor at UCLA, which I was currently attending.

Brooke Wenig 06:02
So, I reached out and said, “Hey, I would love to follow up, take more classes on Spark and distributed machine learning. Can I continue to do that as an undergrad?” He said, “Yeah, sure.” So, I started taking his grad classes as an undergrad. He’s a fantastic advisor, his name’s Ameet Talwalkar. Then he ended up being my advisor for grad school as well. And so, my first introduction to big data was actually through a MOOC that Databricks used to run. The following year, I was then the TA for that MOOC. And then I started contracting for Databricks on the side throughout grad school, because as a starving college student, Databricks paid very well in comparison to the TA salaries. So, I was a little bit more incentivized to do work for Databricks than I was to do research. But yeah, that was my introduction to big data. Now, I get the pleasure of asking you this question in return Denny, what was your introduction like to the field of big data?

Denny Lee 06:50
Well, I set myself up for that one, so, my bad. In terms of the answer to the question: big data, I guess, was introduced to me when I was doing web analytics. I joined a startup at the time called digiMine after a short foray at Microsoft. We had these awesome one-core, four-gigabyte machines, and we had to figure out how to process about a hundred million rows a day, which at that time was a lot, on those single-processor, four-gigabyte machines. In other words, I’m implying we’re decades apart.

Denny Lee 07:30
And so, based off of that, we realized we had to distribute. And that’s how I got into it. I danced around by being part of SQL Server for a while, but then got re-introduced to it when Hadoop took off. And then I was either fortunate or unfortunate, depending on how you want to phrase it, to help create what is now known as HDInsight at Microsoft: Hadoop on Windows originally, and then Hadoop in the cloud on the Azure platform. And fortunately Databricks was silly enough to hire me, to let me join them when it came to Spark, and here we are. So, fun times.

Jules Damji 08:09
Incidentally, my first interview at Databricks was with Denny.

Denny Lee 08:18
That’s right. I definitely forgot about that, but I don’t think I asked you any coffee questions at the time, did I?

Jules Damji 08:24
No, your JVM questions were fine.

Denny Lee 08:26
Oh, that’s right. Of course, we had to deal with the JVM. Okay. Well, actually, this is a good segue, especially because of your background, Jules, with Java specifically. The next question here is really about data architecture evolution. How have you seen it change over the last 30 years?

Jules Damji 08:46
I think that’s a very interesting question. And I can’t think about anything but Alan Kay’s quote from his Dr. Dobb’s interview, where he said that the computer industry today is like a pop culture. And what he meant by pop culture is that pop culture has a disdain for history; pop culture just requires you to participate and be part of things. And so, the data engineers and data scientists coming in today don’t realize the history of how data, and data architecture, has actually evolved over the past 30 years. And I think it’s important to have that perspective to build the tree of knowledge, right? You go from the roots to the trunk, to the branches, to the leaves. You don’t go straight to the leaves. And that’s the whole evolution that you have to understand.

Jules Damji 09:40
And one of the great accounts of that evolution was the one our CEO and founder gave in his interview with Martin Casado: how data architecture has evolved over the years. And it started with the 80s, right? If you look at the 80s, that’s where the data warehouse concept comes in. People had all these kinds of data and they wanted to put it in one central place, but they had this operational data distributed across all these different kinds of machines. They wanted to put it in a central place where they could attach BI and attach analytics. And that was the trend in the eighties. But what happened over a period of time, as some great sage said, is that “the only constant in the universe is change.” And data changes.

Jules Damji 10:25
And one of the V’s of big data is variety. So, I think what happened over those two decades is that the data changed; the type of data that we used to store in data warehouses changed. We had text, we had unstructured data, we had video and we had audio. And those couldn’t be stored the way we wanted, and we couldn’t do machine learning or advanced analytics on them. So, that trend changed. In the 2000s, this notion of data lakes was introduced, where they said, “We’re going to take all this data and put it in a data lake.” But that also had its own problems, because the data lake becomes a swamp: you try to do schema-on-read and it’s hard to attach a schema to it.

Jules Damji 11:09
So, what people actually did was build a two-tier architecture: you had data warehouses, where they took the data from the data lakes and put it back into a structured form. So now you have data in two places, and that caused a lot of problems. And I think history has told us that innovation evolves over time, right? The paradigm shifts. And I think the new shift today, with the advent of new technologies, is: “Hey, why can’t I do my analytics on the data lake? Why can’t I do my BI on the data lake? Why can’t I use both batch and streaming on the data lake?”

Jules Damji 11:46
And I think the new technologies that have emerged, such as Apache Hudi, Delta Lake, and Hive with ACID in 3.0, give you the ability to have those transactions on top of the data lake. So, today we have this new paradigm called the lakehouse that we have been pushing, which allows you to do your BI, your ML, and your reporting all in one place on structured data. And that has been the evolution of data architecture over the period of 30 years. I’m just paraphrasing what Ali said; I think he was quite succinct, and his technical arguments about the evolution are quite valid.
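To make that concrete, here is a minimal, hypothetical PySpark sketch of the pattern Jules describes: one Delta table on the data lake serving both a SQL/BI query and a DataFrame read for ML. The paths, table name, and column names are made up for illustration, and it assumes the Delta Lake package is configured on the cluster.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakehouse-sketch").getOrCreate()

# Land raw events on the data lake as a Delta table (ACID transactions on object storage).
raw_events = spark.read.json("/data/raw/events/")            # hypothetical raw landing zone
(raw_events.write
           .format("delta")
           .mode("append")
           .save("/data/lakehouse/events"))

# Register the table once; BI and SQL users query the same single copy of the data.
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/data/lakehouse/events'")
daily_counts = spark.sql("SELECT event_date, count(*) AS n FROM events GROUP BY event_date")

# The same table also feeds ML: read it back as a DataFrame for feature preparation.
features = spark.read.format("delta").load("/data/lakehouse/events")
```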

Denny Lee 12:25
Two things. One is that we’ll make sure to include a link to that “Data Alone Is Not Enough” podcast, which is actually what Jules was referring to. The second thing is story time. The story, in terms of exactly what Jules was referring to: I still remember working with an old friend of mine, Dave Mariani. He’s the founder of AtScale. The reason I bring that story up is because at Yahoo, in order to do exactly what Jules was talking about, to try to get fast BI or OLAP-style queries on top of our data lake, what we did was create (well, more like he did and I was helping, so let’s be clear here) a 24-terabyte Analysis Services cube on top of a two-petabyte Hadoop cluster. And that was about 10 years ago.

Denny Lee 13:09
Now, at that time that was pretty huge, right? And it was the best of both worlds, from the standpoint that Hadoop was able to do what it needed to do and BI was able to do what it needed to do. But it was also the worst of both worlds, because we had to maintain a Hadoop cluster, maintain the Oracle staging server, and also maintain Analysis Services: three very different technologies, exactly to your point, Jules. That evolution goes in spurts, it goes in waves. That was some of the pain we were going through back then. And as we progress, and exactly as you specifically called out about Ali’s podcast, we’re hoping to see a new future that is a lot simpler than what we built there. So…

Jules Damji 13:52
And talking about the evolution and the paradigm shift, that’s where I actually want to bring in TD’s perspective, because he was the person who brought us DStreams, and we saw DStreams evolve over time into Structured Streaming and the underlying technology that allows you to put structure on the stream. And that is now playing a significant part in Delta Lake and the lakehouse. So, TD, can you walk us through what the motivations were for DStreams, which was the first, micro-batch version; how it evolved into Structured Streaming; and how Structured Streaming now plays a very pivotal role in being able to do both batch and streaming in this new paradigm of the data lake?

Tathagata Das 14:37
Yeah. So, I think the central theme, the pattern of evolution that we see over and over again in technology, not just in big data but across different kinds of technology, is convergence. Basically, I’m paraphrasing a very good example that my advisor, Ion Stoica, also the former CEO of the company and the current chairman of the board, used to give: even back in the 90s, we used to have a different device for every different kind of thing. For GPS, you had a GPS device; for video recording you used to have a camera; for taking a phone call you used to have a mobile phone; for maintaining your emails you used to have a PalmPilot. So, every different application used to have a dedicated device for itself. But then, over the next 10 years, with the introduction of the Apple iPhone and then Android, 20 years later we have converged to a single device for doing every possible thing.

Tathagata Das 15:51
And that’s exactly the kind of transition that we saw in big data. We started with databases, of course, for building data warehouses for absolutely, perfectly structured data; then came Hadoop and data lakes, et cetera, for unstructured and maybe semi-structured data. But those were two different silos, as you’ve heard already a couple of times. And then slowly things started converging. People started wondering, “Why can’t I get the best of both worlds?” And there was another silo in parallel, which was distributed stream processing. Earlier, in the early 2000s, stream processing was a thing of its own, but it wasn’t distributed; distributed stream processing was only a research concept. Then, in the early 2010s, Apache Storm came out, and distributed stream processing became a reality in the open source world, in the big data world. Much like Hadoop MapReduce revolutionized distributed batch processing, Apache Storm revolutionized distributed stream processing, but they were two different worlds, two different engines.

Tathagata Das 17:01
That’s why, when we were building Spark, we were really making things fast, and we realized that we don’t have to have two different worlds. We can build one engine and make it fast enough to process both batch and streaming data. And that’s where Spark Streaming came into the picture. We took the Spark batch engine and made it 10X faster, with latencies low enough that you could get second-scale latencies when running stream processing on the distributed Spark engine. And second-scale latencies were, and still are, good enough for 99% of real-time stream processing use cases. So, that was the next level of evolution on the stream processing side of the world: from Storm being the first step, where distributed stream processing became a commodity reality for general practitioners, to a single engine doing batch and streaming with a consistent API, DStreams.

Tathagata Das 18:10
RDDs and DStreams are not the same API, but they are consistent with each other, with similar semantics. So, you could write your code once in RDDs for batch and, with very little restructuring, convert it to DStreams and run it on the same engine. So, only one engine to manage. And then, once we realized that and Spark Streaming started becoming the de facto standard of stream processing in the big data community, we started seeing the next barrier, the next wall we needed to break: users should not have to do this transition of writing the code once in RDDs for batch and then rewriting it, even slightly, for DStreams. That’s where the idea of Structured Streaming came into the picture: we already have one engine, why not have one API?

Tathagata Das 19:04
So, the user does not have to think about streaming at all in the first place. They just write: I want to read this data, parse this data, aggregate across it, and get this result. The user thinks in a purely batch-like fashion, and then it’s the engine’s job; the engine is smart enough to take that batch-like query and run it over and over again, continuously, as new data comes in through the input streams. The engine is smart enough to take care of that. And that was the core stage of evolution, where we converged even more: engine, API, and everything. And that is the standard right now with Structured Streaming. Currently, even just within Databricks, our customers are processing more than eight trillion records every day with it. So, that’s the evolution toward convergence.

Jules Damji 20:01
Right. And besides the convergence, I think what you really understated was the simplicity: how the structured APIs hid all the things you had to reason about in DStreams, all the operationalization, all the DStream-specific wiring, all the state management. The ability to write very lucid, readable code through these simple APIs brought developers a lot closer, because before, when they were writing DStreams and managing the state themselves, it was very intimidating. But once they had the DataFrame API that they were already familiar with, they could say, “I can do the same operations on the stream that I perform in batch.” And, quoting you in one of your examples, you can streamify a batch job very easily by just changing the read to a readStream. And that was essentially the pinnacle of simplicity, in my opinion.
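As a rough illustration of that point, assuming a made-up schema and file paths, the batch and streaming versions of the same PySpark query differ only in swapping read for readStream (and write for writeStream, plus a checkpoint location); the DataFrame logic in the middle stays identical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

schema = "device STRING, ts TIMESTAMP, temp DOUBLE"          # hypothetical IoT schema

# Batch version: read a static directory of JSON files and aggregate.
batch_df = spark.read.schema(schema).json("/data/iot/")
(batch_df.groupBy("device").count()
         .write.format("delta").mode("overwrite").save("/data/out/device_counts"))

# Streaming version: the same query, "streamified" by changing read to readStream
# and write to writeStream; the engine runs it incrementally as new files arrive.
stream_df = spark.readStream.schema(schema).json("/data/iot/")
query = (stream_df.groupBy("device").count()
         .writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/data/chk/device_counts")
         .start("/data/out/device_counts_stream"))
```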

Denny Lee 20:58
It’s a good job, kudos to you.

Tathagata Das 21:00
Yeah. It takes a whole village. A lot of engineers put in a lot of work to make these technological advancements over the years.

Denny Lee 21:11
One thing I did want to add, and it’s an important point that was also called out in the podcast Ali did: Structured Streaming isn’t just for real time, right? That’s an important aspect, even for us old fogies who do lots of batch queries. It’s about changing your ability to, for example, reprocess your entire pipeline all over again if you need to, and the ability to reduce the number of batch jobs. We’ve got some customers that literally went from tens or hundreds of batch jobs down to two or three streaming jobs. And so, it’s actually easier to maintain, easier to operate, and actually uses fewer resources. So, it’s not just about streaming, don’t forget. It’s very much about even your standard lakehouse-paradigm types of processing as well.
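A small, hypothetical sketch of what Denny describes, running a Structured Streaming query as an incremental “batch” job, might look like the following. It assumes the SparkSession and Delta paths from the earlier sketches; trigger(once=True) tells the engine to process whatever data has arrived since the last checkpoint and then stop.

```python
# Each scheduled run picks up only the new data since the last checkpoint,
# so one streaming query can replace a chain of hand-managed incremental batch jobs.
incremental = spark.readStream.format("delta").load("/data/lakehouse/events")

(incremental.writeStream
            .format("delta")
            .option("checkpointLocation", "/data/chk/events_rollup")
            .trigger(once=True)            # process available data, then exit
            .start("/data/out/events_rollup"))
```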

Jules Damji 22:07
If you look at the data that AI is using and the data that analytics is using, underneath it’s the same thing, right? It’s the same structured data that you’re going to run SQL queries on. It’s the same structured data that your Python DataFrames are going to query and then feed to the AI or convert into tensors. It’s the same thing. So, with the convergence of having one place to put all the data, we can do analytics and we can do AI on pretty much the same thing.

Denny Lee 22:39
Perfect. Well, this actually makes a good segue. Since we’re talking about these large amounts of data, what’s keeping you all in big data? I mean, you’ve alluded to it, but let’s be very specific now. So, Jules, since you talked about the convergence, let’s start with you. What is keeping you in big data? What’s keeping you excited? What’s keeping you interested and awake at night, or at least awake in the day?

Jules Damji 23:04
I’ll go back to Alan Kay, where he said computing is like a pop culture. And a pop culture is all about identity and all about participating in what’s actually happening around the world. Now, today we are at the zeitgeist of data. Data is the center of everything, right? I won’t use the cliché that data is the new oil, but we are surrounded by data every day and we have to make sense of it. And the new tools and the new technologies that have evolved, which allow us to process, analyze, slice and dice, and view it, are probably the best thing that has happened.

Jules Damji 23:47
So, what really keeps me awake at night is things like these, right? I read a chapter every day before I go to sleep, and I get up in the morning and I read that book. And then I go and start reading this bible that every data engineer should use, by Martin Kleppmann. And when I go to the bathroom, I take this with me, right? So, I’m surrounded by this stuff. And this is what keeps me excited; it never puts me to sleep. And when Brooke asks me, “Why are you sending me Slack messages at 5:00 AM?” I say, “I’m awake.” So, that’s what keeps me awake at night. It is the newness of the data, the recency of the data, the fast pace of the data, and everything that allows us to do something with data. And that’s what keeps me going.

Denny Lee 24:45
Perfect. I don’t think there’s much we can add to that. But you know what, TD, I’m going to ask you that question: what’s keeping you up at night or waking you up early in the morning, outside of copious amounts of coffee, of course?

Tathagata Das 25:00
Of course. By the way, we probably got a little bit more information about Jules’ daily habits than we bargained for.

Denny Lee 25:09
Yes. We might need to edit that out, actually. Yes, that’s true.

Jules Damji 25:12
Did you know the showers in the bathroom are the den of inspiration and imagination? Did you ever think about that?

Denny Lee 25:21
I did. And now I’ve stopped thinking about it.

Tathagata Das 25:25
I would concur with that. They are. I do have breakthroughs in the shower, so I concur with that. Anyways, to get back to your actual [inaudible 00:25:37]. Yes, to get back to your question: let me add another perspective, from the point of view of an engineer who is building these technologies on a day-to-day basis. I think what still keeps me excited about building these technologies is that there is a continuous evolution of challenges in this big data world, of what new capabilities are needed. As the technology and the tooling evolve, we are creating better and better tools to make past problems easier to solve, and hand-in-hand we are also evolving the new set of problems that we need to solve. For example, 40 years ago, there was no AI, there was no machine learning. So, the challenges were just limited to structured data, and that’s why databases were built: just to solve the structured-data problems in their different aspects.

Tathagata Das 26:49
But as we have transitioned into building new tools, machine learning tools, et cetera, to make new things possible, that has led to this new set of challenges: why do we need separate tools to do two different things? Why not have a single tool that can do everything? And so, the benefit of the awesome tooling we are building, the benefit of technology advancement, is that we’re making once-complex things simple while making new, groundbreaking things possible, which are still complex and still need to be solved as the next stage of evolution. That’s fundamentally why, even after 10 years of building such systems, I’m still excited about building them: because we have the next generation of problems to solve. We always have to keep pushing the boundaries of technology to make life simpler for data workers, data scientists, data engineers, et cetera.

Brooke Wenig 28:01
So, TD, following up on what you just said, what do you think are some of the next generation of problems that we need to solve for?

Tathagata Das 28:07
Very good question. I think the next generation of problems that we need to solve for is essentially a higher level of data management. Again, going back to the two worlds, the database world and the data lake world: if you think about just the database world, there is more than half a century of experience sitting there. Within the section of use cases where databases are still good, the technology and the management of that technology have matured. Sysadmins know how to manage databases to get the best performance out of them; people know how to build policies around auditing databases, et cetera, and stuff like that. Whereas that level of maturity hasn’t been reached on the data lake side of things yet, even though the underlying technologies like SQL and distributed processing, et cetera, have matured quite a bit.

Tathagata Das 29:17
I think the higher-level management of those technologies is yet to be perfected. So, we still have a lot of challenges in trying to make these newly designed tools that can do everything simpler than they are now, so that they become a lot more commoditized and more magical underneath, without users having to know exactly how to configure these tools to get the best performance out of them. The tools have to be smarter, to do the work by themselves with as little user intervention as possible, to be a lot more magical than they are. And that’s the next generation of challenges.

Jules Damji 30:05
I think there’s another lens through which you can look at what Brooke asked about, what some of the challenges are that are coming up that we have to worry about. The corollary of good data management is the ability to have good policies, right? And good policies are something that prevent data from being used in nefarious ways: to spread misinformation, to manipulate it, or to control societies.

Jules Damji 30:37
And I think those are the ethical issues that have to be regulated. Today we are grappling with it, right? We’re in the midst of an election, and we keep hearing all these big data companies scrambling to say, “How can I bring this down? How can I change my algorithm to reduce the bias? How can I identify things in a way that distinguishes between what’s true and what’s fake?” So, I think that as part of regulating data and part of managing the data infrastructure, you have the ability to make sure that we don’t use data in a nefarious way. And that’s more of an ideological stance, but I think it’s an important part today.

Denny Lee 31:17
Yeah, absolutely right Jules.

Brooke Wenig 31:19
Timnit Gebru has a fantastic paper called Datasheets for Datasets, which effectively says that we need to document everything about our data: how it was collected, any limitations of the data. There’s this great newspaper story; who was it? I think it was Roosevelt versus Dewey who were running, but effectively the newspapers had polled homeowners, asking, “Who are you planning to vote for?” And it turns out most people they polled said they would vote for Dewey. So the newspaper ran this headline saying, “Dewey defeats…” was it Roosevelt or Truman? But we’ve never heard of President Dewey. What went wrong? It turns out with the polling, they were only polling homeowners, and at that time those tended to be wealthy white Americans.

Brooke Wenig 32:04
And so, the wealthy white Americans tended to prefer Dewey, whereas the other folks preferred to vote for Roosevelt or Truman. And hence, that’s why we never had a President Dewey. So, with a lot of these things, we need to document how we collect the data, how it’s intended to be used, any assumptions, any limitations, and how the data in particular should not be used. I think these are all questions that we need to be solving in the next few years. But right now we just need to document our processes. Technology is moving very quickly, but the documentation and processes are still lagging behind.

Denny Lee 32:37
Absolutely. You’re reminding me of the fact that we often talk about data lineage, and data lineage is not just about understanding how the data was processed within the pipelines that we create ourselves. It’s also about how it was impacted by the sources before it even came into our systems, and what the impact is downstream, right? And exactly to your point, Brooke, the policies themselves need to catch up to what we’re actually doing with the data. It definitely reminds me of the past: even when we were very big into data warehousing, we still had the same problems then.

Denny Lee 33:13
We were able to produce super-fast BI and super-fast “querying,” right? But we ended up getting into this idea of metadata management and master data management, specifically in an attempt, and I want to be very clear, an attempt, to understand that process. And again and again, same problem, right? The policies were lagging behind the data itself. And so, just as TD was calling out: we simplify some processes, but then we find a whole new set of problems that we’re trying to tackle. The reality is that our policies are often catching up with what we’re doing with the data.

Brooke Wenig 33:52
Exactly. And I did look it up. It was “Dewey Defeats Truman.” That was the headline of the newspaper article back in 1948.

Denny Lee 33:58
Yes, actually, now I remember it. Yes, good call out.

Jules Damji 34:02
And wasn’t that poll conducted by telephone? So it only included the people who had phones, and you had a very biased sample of people who were wealthy. If you conduct something by phone, you’re actually leaving out a large segment of society.

Brooke Wenig 34:16
Exactly. And if we do any landline telephone polls now, no one’s going to contact me; I just have a cell phone. So, you’re going to get a very different demographic split as well. Just things to keep in mind. So, Jules, I want to go back to something you mentioned earlier. You showed us the three different books that you read. We know you’re an avid reader, and you’re an author of multiple books, not just the Learning Spark book, second edition. So I wanted to ask you, how did your interest in writing spark? No pun intended on that one, actually. And then, what keeps you motivated to keep writing? Because I know a lot of us often experience writer’s block, where we have a really hard time focusing on the writing at hand. What keeps you motivated?

Jules Damji 34:54
One quote that I remember quite vividly, which is tattooed under my eyelids, and I wake up with this and I go to sleep with this, is that a sentence is the form a thought takes. A sentence is the form a thought takes. And that has been with me from the very early years of my life, and it sort of inspired me to write. So, whenever I say something, I repeat that; when I write something, I repeat that. But to come to your question, what inspired me to write Learning Spark: I think that book was calling to be written because of the fact that, since 1.0 was released, so many things had changed. We had brought in a set of developers who were working at a higher level, and they wanted to come and learn Spark and do big data.

Jules Damji 35:44
And the approach that we took with that particular book was: what if we start thinking about structure, about what Michael Armbrust talked about in adding structure to Spark, and what TD talked about in putting structure into streaming with Structured Streaming? All that thinking about structure somehow compelled me in how to approach the book. And all four of us discussed how we were going to write the book, right? We were going to build the foundation in the first three chapters, advocating and explaining and arguing why structure is so important and why DataFrames are so important. And that led to Denny’s chapters about the ecosystem and how Spark actually works.

Jules Damji 36:31
It led to the machine learning chapter by you, on how you actually use DataFrames to do machine learning using the structured APIs. It led to Structured Streaming, and to how structure evolves into this whole lakehouse, and how the new paradigm that we’re using in Spark 3.0 led to chapter 12. So, I think it was a thing that was begging us: please write something, approach it in a structured manner, and introduce the concepts of Spark operations from a high level, the way the APIs have actually evolved. That’s what allows you to write very simple, very readable, and very fast, efficient code, and let the underlying engine, as TD said, take care of everything.

Brooke Wenig 37:19
I wish I had that tattooed underneath my eyelids; maybe it would have helped with my writer’s block.

Jules Damji 37:25
So, that keeps me going. Yeah.

Brooke Wenig 37:27
Thank you, Jules. Now, I’d like to ask a follow-up question to TD, since I believe this was the first time that you’ve written a book, or that you’ve authored a book. What advice do you have for newcomers, for folks who want to author their very first book? What are some things that you wish you had known before you started writing?

Tathagata Das 37:43
Very good question. I think one of the things that I wish someone had told me before I started writing is what the right process of writing is. My attempt to write started with the thought that I need to put some words on the paper, which is how I started, but I very soon hit writer’s block. And looking back, I realized that the fundamental mistake I made is that I rushed to put words to paper. What I should have done before that is spend the time to think and structure my thought process.

Tathagata Das 38:35
Exactly what Denny was mentioning: putting the time upfront to structure your thoughts, to really understand what it is that you want to write. Only then will you be able to understand which things are really worth putting words to and which things are not essential, where they fit in the structure you want, or whether they can be ignored, because what is the highest-level thing you want to convey to your readers? I think that is something I wish someone had told me. It would have helped me ease into the process much more easily, with less continuous writer’s block after every paragraph.

Denny Lee 39:27
So, thanks, TD. I think the theme of today’s session, unequivocally, is about structuring your thoughts, and that means you also need to structure your data. So, I think that’s the theme for today. Brooke, over to you now.

Brooke Wenig 39:44
Well, I was going to say this was a very enjoyable session. Thank you, Jules and TD, for joining Denny and me, even now that we’ve finished publishing the book and no longer have a weekly sync. I really enjoyed getting to catch up with all of you and getting to ask a few more questions about your history with big data, your thoughts going forward, and the changes and trends you see in technology. So, thank you both for taking time out of your very busy schedules to join us today on Data Brew.

Jules Damji 40:06
Pleasure and honor.

Tathagata Das 40:09
Pleasure and honor.

Brooke Wenig 40:10
And cut. Season finale.