# Daili Zhang

I am a senior tech adviser at Halliburton with a focus on predictive maintenance and process improvement. Before joining Halliburton, I worked on power plant predictive maintenance and on gas turbine simulation and modeling at GE and Siemens for over 8 years. I graduated from the Georgia Institute of Technology with a PhD in Aerospace Engineering and a master's degree in Statistics.

### Efficiently Building Machine Learning Models for Predictive Maintenance in the Oil & Gas Industry with Databricks

Summit 2020

Each drilling site has thousands of pieces of equipment operating simultaneously, 24/7. In the oil & gas industry, downtime can cost millions of dollars a day. Under current standard practice, most equipment is on scheduled maintenance, with standby units to reduce downtime. Scheduled maintenance treats every piece of equipment alike, using simple metrics such as calendar time or operating time. Machine learning models that accurately predict when equipment will fail can help the business schedule maintenance predictively, reducing both downtime and maintenance cost.

We have huge sets of time series data and maintenance records in the system, but they are inconsistent and of low quality. One particular challenge is that the data is not continuous, so we need to go through the whole data set to find where the data is continuous over some specified window. Transforming the data for different time windows presents another challenge: how can we quickly pick the best window size among the various choices available and perform the transformation in parallel? Data transformations such as Fourier transforms or wavelet transforms are time-consuming, so we have to parallelize the operation. We adopted Spark dataframes on Databricks for our computation.
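To make the continuity problem concrete, here is a minimal plain-Python sketch of how gaps in a timestamp series might be detected. It is not the Spark code from the talk: the function name `find_continuous_segments` and the parameters `max_gap` and `min_length` are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def find_continuous_segments(
    timestamps: List[float], max_gap: float, min_length: float
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs for runs of samples with no gap above max_gap.

    A run is kept only if it spans at least min_length. All names and
    thresholds here are illustrative assumptions, not values from the talk.
    """
    segments: List[Tuple[float, float]] = []
    if not timestamps:
        return segments

    ordered = sorted(timestamps)
    start = prev = ordered[0]
    for t in ordered[1:]:
        if t - prev > max_gap:
            # The gap is too large, so close the current run.
            if prev - start >= min_length:
                segments.append((start, prev))
            start = t
        prev = t

    # Close the final run.
    if prev - start >= min_length:
        segments.append((start, prev))
    return segments
```

With samples expected once per time unit, a call such as `find_continuous_segments(ts, max_gap=1, min_length=3)` keeps only the stretches that are long enough to hold a full window.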

Here are the four major steps we took to distribute our data transformations efficiently:

1. Identify which segments have continuous data by scanning through a sub-sampled data set.
2. Pick different windows and transform data within the window.
3. Transform each window column into one cell as a list.
4. Preserve the order of the data in each cell by collecting the timestamp and the corresponding parameter as a list of dictionaries, and then reorder the list based on the timestamp element in the dictionaries.
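The last two steps can be illustrated with a small plain-Python sketch. It mirrors only the logic (group rows into windows, collect each window as a list of dictionaries, then sort by timestamp); it is not the Spark dataframe code used in the talk, and the name `collect_windows` and the fixed-size windowing by integer division are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def collect_windows(
    rows: Iterable[Tuple[int, float]], window_size: int
) -> Dict[int, List[float]]:
    """Group (timestamp, value) rows into windows and restore time order.

    Rows may arrive in any order, as they can after a distributed shuffle.
    Each window becomes one "cell" holding a list of values sorted by timestamp.
    """
    cells: Dict[int, List[dict]] = defaultdict(list)
    for ts, value in rows:
        # Hypothetical windowing rule: fixed-size, non-overlapping windows.
        cells[ts // window_size].append({"ts": ts, "value": value})

    ordered: Dict[int, List[float]] = {}
    for window, items in cells.items():
        # Reorder the list of dictionaries by its timestamp element.
        items.sort(key=lambda item: item["ts"])
        ordered[window] = [item["value"] for item in items]
    return ordered
```

Carrying the timestamp alongside each value is what makes the reordering possible: once the values of a window are gathered into a single cell, their arrival order can no longer be trusted, which matters for order-sensitive transforms such as Fourier or wavelet transforms.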