Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with Delta Lake

Download Slides

Columbia is a data-driven enterprise, integrating data from all line-of-business-systems to manage its wholesale and retail businesses. This includes integrating real-time and batch data to better manage purchase orders and generate accurate consumer demand forecasts. It also includes analyzing product reviews to increase customer satisfaction. In this presentation, we’ll walk through how we achieved a 70% reduction in pipeline creation time and reduced ETL workload times from four hours with previous data warehouses to minutes using Azure Databricks, hence enabling near real-time analytics. We migrated from multiple legacy data warehouses, run by individual lines of business, to a single scalable, reliable, performant data lake on top of Azure and Delta Lake.


 

Databricks 無料トライアル

Video Transcript

– Hi, my name is Lara Minor, and I work for Columbia Sportswear. Today I’m gonna be talking to you about how Columbia moved its legacy reporting and analytics platform to Azure, including the use of Databricks, Delta Lake. EIM is Enterprise Information Management and it’s the Division I work for at Columbia Sportswear. Columbia sells multiple brands that include lifestyle, outerwear, footwear, across the globe. EIM, gathers the data from these multiple brands and regions and builds data assets that allows the business to make data driven decisions on such things as sales, purchasing, supply chain, and product optimization. I work for the data delivery section of EIM. And there’s also a data governance team, which I’ll talk a little bit about as we move forward. In data delivery, we’re responsible for building those data assets. And our technology stack today consists of Data Factory, Data Lake, Databricks, and the Synapse Data Warehouse. And we also have SAP BW/Hanar where we create a lot of operational reporting off of our SAP platform.

In our development, we do some integrations at Columbia, we do have an integration team. But in data delivery, we do a lot of more batch oriented, high volume, data integrations with third parties or within Colombia. And then the most work that we get comes through with building our data assets. So we build relational and dimensional models that support that reporting. And we also have just started with a data science team. And we support data analytics teams that are now coming on to the platform. We started here pretty traditional BI type platform where we had specialty ETL tool that we used to bring data in from across the company and put it into our data warehouse. And then we had a special reporting tool that required a skill set in order to develop on there. The problem with this was it was very slow to gather data, it was very slow to build the assets. And then it required special skills across the platform. And we could not keep up with the business needs, we kept running into just they want more than we could do. And about the time that we switched to the Azure platform, a couple of things kind of came together at Columbia. One, for enterprise architecture, we decided to go with the cloud, we wanted to push for more cloud. We have an integration team that was switching from point to point integrations to event architecture. And at the same time, our Hanar database was kind of getting more and more data, and it competing with our data warehouse, we wanted to figure out one platform. With this data warehouse that we had, we were going to have a very pricey, multi million hardware upgrade that was due. And so we were looking for something that was scalable and elastic and a lower cost, would support a self serve reporting model. We did some studies and we settled on Azure.

So where we’ve moved to now is this architecture here, where we have, we used data factory to bring in data from across the company from those brands and regions enterprise wide. And we get that into the Lake as well as our streaming that comes in, and we get that into Delta Lake right away. Once it’s in Delta Lake, we can take all those sources and put them together into our dimensional, our relational and dimensional models that support the business. And we do all of that compute on the Lake using Databricks.

Once we have all that figured out, we take the changed datasets and we push that up into the data warehouse. So our data warehouse is a copy of the assets that we’ve created in the data lake. We don’t put all the raw data in the lake just the assets that are needed for business reporting down the line.

We use Azure Analytics Services which hooks up to as your Synapse, and Power BI. And we facilitate the business to do that self serve reporting across the projects that we have coming through.

So the tech was one thing, when you’re doing a major project like , at least for us, we found that there was a lot of things outside of tech that we needed to consider. And that really became major issues across the company. So previously, our BI platform was pretty self, it was contained. There wasn’t other people coming in, there wasn’t a lot of data accessed. And so security and who was gonna get on the platform wasn’t talked about much. But once we opened it up to the Azure platform, it’s all that data on the data lake and accessible throughout the enterprise, and business being able to do self serve, brought up a lot of conversations about security of data, about who owned the data, who could make the decisions about the data. And that’s where our data governance team really came into play. So our data governance team has identified data owners and data stewards that gets to make decisions about the security classification of data. In my group, we have two security classifications that we mostly deal with. One is Internal, which means it’s open to all of Columbia’s reporting community, and the other is restricted. So an example of restricted data is consumer, right, ’cause of GDPR in California.

So also other data that comes in can be restricted as well. And then what we found is, it’s not just raw data that’s restricted, but it’s also those data assets that can be restricted. So while sales and product and supply chain data coming in on the raw source, might be classified as internal. When we create a data asset that has sales across all regions and brands, including forecasts, gross margin, and cost, suddenly that asset is restricted. And they wanna control who has access to that information. And so a lot of time was spent understanding the executives especially their concerns around why they wanted to restrict data, we would like to keep it as a open as possible. But there are concerns around misuse. And so they want to keep track of that. So we identified, we did was identify different profiles of people who might access our platform, for example, a data analyst, which would be a business analytic team that comes on, or an analytic team that comes on to support a business unit. It can include data scientists, we don’t have very many of those at Columbia, that’s a small group. And they basically have access to all internal and restricted data, we don’t try to lock them down. We have info consumers, in this is a group that we work with all the time at Columbia, because they’re the ones that help us decide how to build those data assets, what they wanna see in there, what metrics, what KPIs, how things should be calculated, how that is the raw data should be transformed. And they also have access directly to the data warehouse because you can’t do everything in an AAS and in a Power BI Report. Those folks need to get in and do some heavy lifting, and they do that directly on the data warehouse. And then the biggest group that we have, and this is where most of the conversation happens is for the info consumer, which is all of Columbia’s reporting community and how they get access to the platform. And how we supply access to the platform in Databricks and data lake, and the data warehouse, is much different to how it’s done in Azure Analytic Services. So our goal was to make it so that when we get to that analysis services and Power BI in place, the access to the data is controlled by the business. It’s determined by the data stewards and the data owners and they also get to figure out how to set that, how to, who gets added. And we do have methods of them, for them to do that.

So our data lake layout insecurity, needed to really understand what the business wanted to do as far as restricting data in order to set things up correctly. We definitely did not want a data swamp, we spent a lot of time here on our data lake layout in our security model. So across the board on all of those, on all of our platform, there’s not one person that ever has access to any of it, it’s all controlled via security groups. And so these restricted domains and internal domains have security groups that are associated with them. And you can add other security groups into that to get folks access as they may be. We basically have these three containers at the high level of our data lake, we have our raw sources, which is all the raw source data that we bring in from all the brands and regions across the company. And then we have curated which is where we create those data assets. So raw data comes in, and we organize that my source system, because my team, when we were bringing in data from a source, that’s how we relate to it, we don’t relate to it as this is product, this is sales, it can be all different kinds of data coming in from one source system. So that’s how things are organized there.

This internal and restricted domain. When we started, we had our access layer, or our security layer, about five layers deep. And that did not work. We had to bring that up, because it’s very complex to control that whether or not somebody has access five layers deep. So we do it at the second level. Once you have access to internal or restricted, you’re on your way, you have access to everything underneath that. Then we use Databricks to move, to create those assets and we move them into this curated area. And over here it’s done by Data Domain. Data Domain is determined by our data governance team. Data Domain is assigned to a data owner or a data steward. Most of our assets cross multiple data domains, for example, in that report I said that we have sales across the company, also includes product and supply chain data. But we do have that in sales. And then the multiple data owners can participate in access requests for that. So we organize it by the domain, which is a schema and a table name. So it might have product, data sales, data, those are the different types of domains that we have. And again with restricted so our consumer domain, some of our forecasting is in restricted, which means it has limited access.

And then we also have computed folders. This is for when we haven’t, we have a couple of analytics teams that are operating in Databricks themselves. And they can read all of the data in the raw and the curated data sets, and they can create their own assets and write them out to their own directories that only they have access to.

I’m going to pass on to Bilal now, who’s going to talk more about the platform? – Hi everyone. My name is Bilal Obeidat, I’m a long term data geek with more than 15 years of experience. I have been working for Databricks as a solution architect for more than three years. As part of our strategic account team, where I get to work with our enterprise, a client like Colombia, and using Databricks to solve data and AI problems. I wanna take a few minutes to talk about the Unified Data Analytics Platform. It’s a common architecture paradigm that’s been adopted by many of our, and actually most of our Databricks customers. As you saw in the last few slides that Lara presented.

Columbia architecture is a customized version of Unified Analytics architecture. In general, this architecture and Delta Lake is common and widely adopted.

統合データ分析プラットフォーム

In this architecture we have real time data and batch data.

The real time data is being streamed through Kafka or Event Hub or any other streaming technology, batch data usually you write pipeline, to ETL this data into the Delta Lake. This is where real time data and batch data come together and land into the Delta Lake in a row data format. It can be cleaned, aggregated to become silver data, it can also be enriched to become a curated data, from there it can be uploaded to a data warehousing technology, if we have a low latency and high concurrency requirements to report on this data.

Just as a quick idea on Delta, Databricks is a Compute Engine. It’s not really a storage engine, so you get to leverage your cloud storage for that purpose.

Delta provide acid capabilities that simplify operating the pipelines to increase the reliability of the data job. At the same time, it enable efficient and performant access to the data because of other feature like caching and auto indexes.

Also, within the data lake, we have common processes like data extract, where ETL pipelines extract the data to downstream system. I noticed with many of our enterprise clients, a lot of this data and extraction and processing, is really moved to the Delta Lake, because they wanted to leverage that capabilities. So overall you have really less movement of the data which gives you a great data governance overall.

A few things I wanna highlight here,

and when you use Delta Lake or Unified Data Analysis format, you are using an open data format it’s parquet, so you are not really locked in. Delta Lake provide you historical insights on the business, but at the same time, provide you real time insight. And that’s really unique. It also enabled you to, to get, to build advanced analytic solutions, so you can predict the future. This is like to know what’s going on and predict the future and build solution like demand forecasting, that distinguished the Delta Lake architecture from traditional data warehouse architecture.

Advanced Analytics solutions are always supported through Databricks.

You can use notebooks, or data scientists use notebooks to collaborate, to build and create models, to test them. And they can also bridge data breaks runtime to the science libraries like scikit-learn it’s G boost. They can also use deep learning libraries like PyTorch, or TensorFlow. We also, they can also leverage a multi flow to manage their experiment and deploy to production for inference.

This is a quick overview of the Unified Analytics Platform, back to you Lara.

– So as a result of switching to our Azure platform, With Databricks, we’ve had so many positive outcomes. One of them is, is speed. What used to take weeks in our ETL tool to bring source data over and now takes a day or two, it’s just really fast. And we’re able to get to work on those assets using Databricks very quickly.

One of the things that we did from the start was when we moved, we knew that it would be really quick if we just took all the source data and moved it into the warehouse into Azure Data Warehouse and went from there. But that also would have put us in the same spot that we were before. So we really stuck to having everything computed on the lake, and available for use for data analytic teams, and data science teams. And that is really playing out now. We’ve got a couple of different data analytic teams that are on Databricks in the lake and doing their own analytics and putting their own assets together, and doing their own research. And we also have a data science team that is just getting started, and they too have access to all of the computed data assets that we’ve created. And that really gets a good start for them not having to go back to the raw data sources. Our data integration team is really getting going, and they’re dropping more and more data real time via the event architecture. And that’s allowing us to get started on real time reports. So we have a couple of those, where we aggregate sales data from the stores every five minutes using Databricks streaming, and all the store managers are accessing that to get up to date sales throughout the day.

We just implemented a thing with this Databricks external meta store, which has been really great. What that does is it allows us to have multiple Databricks workspaces, but declare a master. So the EIM data workspace is the master, and when we spin up one of those analytics, we call that an internal, it hooks up to our external metastore, and what that means is every change that we make they get. So that workspace spins up, it sees everything that EIM has created and any changes that we make. So they can get going without having to go and find out everything that we’ve done and mounted themselves. They just get started right away with the data. They also can create their own add to it their own content, but that external metastore won’t see them.

Previous to this, we were, ETL and data warehouse and we had nothing in CICD. Columbia has gone CICD across the board. So every Azure resource that we have, like the data warehouse, and Databricks, and data lake is all created through a pipeline, as well as all of our code deployment is through CICD.

Can’t say enough about the elastic, the speed, and scalability has enabled growth at a lot cost, this has just really exploded for Columbia, the amount of data that we have coming in now and the things that we’re able to produce for the business. The self serve reporting. The business is just very pleased, they’re able to get to the data that they need with multiple entry points. So those that are on the, on data breaks or data warehouse, or through AAS or Power BI, are able to get them access to the data that we have, and they’re able to use that for their business.

And then all that data, really glad that we stuck with Databricks using it for all of our compute, ’cause everything is available on the lake. So just come in and access that there, and then the data warehouses is used for that reporting layer. In the last year and a half, we grew our team so much the business really threw a lot of money at this, having people come on to gather data from everything from labor management, to our warehouse, so all that data is coming in, and business units are using it across the enterprise for a lot of supply chain forecasting, big data sets that we just were not able to handle bringing all that in previously.

One of the benefits of this platform is, is how fast people can come up to speed on it. Because there’s a lot of people in the data space most people in the data space that know Sequel, so you’re a BA, a business analyst, and you know Sequel, and you need to verify some things and you used to do it in our data warehouse, you can log on to Databricks and do those same verifications, because as far as they’re concerned, it just looks like queries that they’re running. They don’t, they’re not aware that it’s all file based underneath the cover so, just a lot of power positive outcomes that we’ve had, and being able to enable the business with their data.

We have some lessons learned. I’d like to talk about from this project. These top ones, the general and the data lake. That’s a lot about the organization and understanding that making this change for us meant a lot of enterprise involvement, a lot of executive involvement in what we were doing. The auditing, the monitoring, the backups were things that we needed in order to, the data lake wasn’t being backed up, we had to supply that ourselves. We were working through this a lot of the things were on, Gen one, on generation one, so our data lake is still on Gen one, it doesn’t have a backup, we were also working with data factory Gen one. A lot of these things have improved over time. But the organization for the enterprise, understanding how that’s gonna work access was a big learning for us. And then data lake security was another big learning and organization, we spent a lot of time there. Being able to audit it and monitor your costs because things can get out of control very quickly if you’re not paying attention. And Columbia has a big effort going on that right now with COVID, really trying to bring down the costs. The other two, vendor engagement and team, were for me personally. Things that I’ve learned, cannot engage with vendors at the level that I did with this project. Learning how to get in there and really get the people that we needed from Microsoft and Databricks to help us out.

Coming up with agreements where they would fund things and have data architects to help. And then also recognizing when it was time to lay down some money and bring some professional services in, to unblock my team and get them to where they needed to be. For our team, this was a lot of lessons learned here. So I had a team of ETL developers and data analysts that were really good at queries. And transferring those skill sets over into file based, all the data lake with Databricks came with challenges. Of course it did. And so there was a lot of chaos with that. The good thing is, is the way that, that’s enabled is because all those people know sequel, so they can come on to Databricks and use spark sequel to get going right away. Python and everything else that comes that we can do can come later. And you do have to have a good solid leader or two that can figure things out, because there’s a lot to figure out. And so we had people that were doing a lot of investigation. Everybody was learning something new, how to make something work every day, that was a challenge. Allocating sufficient time for discovery. So even though it is quick to come up on these platforms, when you talk about taking everything from the beginning to the end and operationalizing it and making it, so that it runs three times a day or whatever it is, or streaming, that takes time. And you need sufficient time for people to build things correctly so that you can, support things as they come on, and more and more data comes on. And so that comes with managing expectations from senior leadership.

Just to finalize, I can’t say enough about the positive impact that this project has had on Columbia, how many people we have out in Databricks now, doing all kinds of data discovery and analytics, how the business is so pleased to have access to their data and do self serve reporting, and how pleased we are and how fast we can get that to them and how open the platform is.


 
Databricks 無料トライアル
« back
About Bilal Obeidat

Databricks

Bilal Obeidat is a solution architect working for Databricks.

About Lara Minor

Columbia Sportswear

Lara Minor is a Senior Enterprise Data Manager for Columbia Sportswear. She is an inspiring IT leader with 10+ years of experience influencing corporate growth and profitability through innovative technology strategies, dynamic leadership, and an ability to shape high-performing, multicultural teams.

A hands-on coach and mentor, I most enjoy providing software developers and cross-functional teams clear vision, meaningful feedback and motivating them to make big things happen. I'm particularly strong when working in environments that require a strong mix of technical aptitude, business acumen and communication skills in order to achieve major milestones. A collaborative problem solver by nature, I'm comfortable navigating the most challenging projects and enjoy taking on the most complex or problematic initiatives. I'm also skilled in managing within matrixed global organizations and committed to keeping communication and culture integral to the business as I lead.