Varun Tyagi

Tech Advisor, Halliburton

Varun is a tech advisor at Halliburton. Before joining Halliburton, he worked in Houston as a seismic data processing and imaging geophysicist, at TGS-Nopec Geophysical Company for four years and at CGG for six years. His last role at TGS was as an advising geophysicist and team leader for cleaning, processing, and analyzing large 3D seismic datasets from the Gulf of Mexico, Canada, West Africa, and Brazil. He holds a B.S. in Electrical Engineering and an M.S. in Engineering Science, both from Penn State University.



Efficiently Building Machine Learning Models for Predictive Maintenance in the Oil & Gas Industry with Databricks

Summit 2020

At each drilling site, thousands of different pieces of equipment operate simultaneously, 24/7. In the oil & gas industry, downtime can cost millions of dollars a day. Under current standard practice, most equipment is on scheduled maintenance, with standby units to reduce downtime. Scheduled maintenance treats every piece of equipment alike, using simple metrics such as calendar time or operating time. Machine learning models that accurately predict when equipment will fail can help the business schedule predictive maintenance accordingly, reducing both downtime and maintenance cost.

We have huge sets of time series data and maintenance records in the system, but they are inconsistent and of low quality. One particular challenge is that the data is not continuous, so we need to go through the whole data set to find where the data is continuous over some specified window. Transforming the data for different time windows presents another challenge: how can we quickly pick the optimal window size among the various choices available and perform the transformations in parallel? Data transformations such as Fourier transforms or wavelet transforms are time consuming, so we have to parallelize the operation. We adopted Spark dataframes on Databricks for our computation.

Here are the major steps we took to carry out efficient distributed computing for our data transformations:

  1. Identify which segments have continuous data by scanning through a sub-sampled data set.
  2. Pick different windows and transform data within the window.
  3. Transform each window column into one cell as a list.
  4. Preserve the order of the data in each cell by collecting the timestamp and the corresponding parameter as a list of dictionaries, and then reorder the list based on the timestamp element in the dictionaries.
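
The steps above can be illustrated with a minimal single-machine sketch in plain Python. It is not the Spark implementation used in the talk: the function names (`continuous_segments`, `split_windows`, `to_ordered_cell`), the `ts` and `value` keys, and the assumption of a fixed sampling interval `step` are all hypothetical and chosen only for illustration. On Spark, a comparable ordering is commonly obtained by collecting a struct of timestamp and value into a list and then sorting that list, though the abstract does not say which calls were used.

```python
from typing import Dict, List, Tuple


def continuous_segments(timestamps: List[int], step: int,
                        min_len: int) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for runs of sorted
    timestamps spaced exactly `step` apart and at least `min_len` long."""
    segments: List[Tuple[int, int]] = []
    if not timestamps:
        return segments
    start = 0
    for i in range(1, len(timestamps) + 1):
        at_end = i == len(timestamps)
        # A gap (or the end of the data) closes the current run.
        if at_end or timestamps[i] - timestamps[i - 1] != step:
            if i - start >= min_len:
                segments.append((start, i))
            start = i
    return segments


def split_windows(start: int, end: int, window: int) -> List[Tuple[int, int]]:
    """Cut one continuous segment into non-overlapping windows of equal size.
    Leftover samples that do not fill a whole window are dropped."""
    return [(w, w + window) for w in range(start, end - window + 1, window)]


def to_ordered_cell(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Pack one window into a single 'cell': a list of dictionaries,
    reordered by the timestamp stored in each dictionary."""
    return sorted(rows, key=lambda row: row["ts"])


# Hypothetical readings: (timestamp in seconds, sensor value), with two gaps.
readings = [(t, float(t) * 0.5) for t in
            [0, 1, 2, 3, 10, 11, 20, 21, 22, 23, 24, 25]]
timestamps = [t for t, _ in readings]

cells = []
for seg_start, seg_end in continuous_segments(timestamps, step=1, min_len=3):
    for w_start, w_end in split_windows(seg_start, seg_end, window=3):
        # Reverse the rows to mimic a collection step that loses ordering.
        rows = [{"ts": t, "value": v} for t, v in readings[w_start:w_end]][::-1]
        cells.append(to_ordered_cell(rows))
```

Keeping the timestamp next to each value inside the cell is what makes the reordering possible after a distributed collection step, where arrival order is not guaranteed.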