Machine learning models are only as good as the quality and size of the datasets used to train them. Survey data has shown that data scientists spend around 80% of their time preparing and managing data for analysis, and that 57% of them regard cleaning and organizing data as the least enjoyable part of their work. This reinforces the case for MLOps and for closer collaboration between data scientists and data engineers.
During the crucial phase of data acquisition and preparation, data scientists identify what types of (trusted) datasets are needed to train models and work closely with data engineers to acquire data from viable data sources.
Another important aspect of the ML lifecycle is experimentation, where data scientists take sufficient subsets of (trusted) datasets and create several models in a rapid, iterative manner. Without established industry standards, data scientists have to rely on manual tracking of models, inputs, hyperparameters, outputs, and other artifacts throughout the model experimentation and development process.
In this talk, you will learn how to automate these crucial tasks using StreamSets and MLflow on Databricks.
Dash Desai, Director of Platform and Technical Evangelism at StreamSets, has 18+ years of hands-on software and data engineering experience. With recent experience in Big Data, Data Science, and Machine Learning, Dash applies his technical skills to help build solutions that solve business problems and surface trends that shape markets in new ways.
Dash has worked for global enterprises and tech startups in agile environments as an engineer and a solutions architect. As a Platform and Technical Evangelist, he is passionate about evaluating new ideas to help articulate how technology can address a given business problem. He also enjoys writing technical blog posts and hands-on tutorials and conducting technical workshops.