Dina Suehiro is a Senior Software Engineer at Intel Corporation. She works in the Analytics and Artificial Intelligence Solutions Group as a developer on an open source data analytics library that leverages Apache Spark. Dina has a bachelor's degree from Pacific University, where she majored in Computer Science and Integrated Media and minored in Mathematics.
The majority of a data scientist’s time is spent cleaning and organizing data before insights can be derived. With large datasets, a lack of integration with visualization tools often makes it hard to know what is most interesting in the data and makes it difficult to validate numerical insights from models. Given the vast number of tools available in the ecosystem, it is hard to experiment with different tools and pick the most suitable one, especially given the complexity of integrating each of them with one’s solution. The speakers will present an easy-to-use workflow that solves this integration challenge by combining various open source libraries, databases (e.g. Hive, Postgres, MySQL, HBase) and visualization with distributed analytics. Intel developed a highly scalable library built over Apache Spark with novel graph, statistical and machine learning algorithms; it also enhances the user experience of Apache Spark through easier-to-use APIs. This session will showcase how to address the issues above for a drug similarity use case. We'll go from ETL operations on raw drug data, to deriving relevant features from each drug’s chemical structure using statistical and graph algorithms, to identifying the best model and parameters for deriving insights from this data, and finally to demonstrating the ease of connectivity to different databases and visualization tools. Session hashtag: #SFeco18