Anahita Bhiwandiwalla is a Senior Software Engineer in Intel's Analytics & Artificial Intelligence Solutions Group. She currently works on building analytics solutions for large-scale distributed data and on the challenges that arise as data scales. She holds an M.S. in Computer Science from Columbia University with an emphasis on Machine Learning. Anahita's main interests are Machine Learning, Natural Language Processing, Speech Recognition, and Data Mining. She has presented her work at various meet-ups, webinars, and conferences.
The majority of a data scientist's time is spent cleaning and organizing data before insights can be derived. With large datasets, a lack of integration with visualization tools makes it hard to know what is most interesting in the data and creates challenges for validating numerical insights from models. Given the vast number of tools in the ecosystem, it is difficult to experiment with them to pick the most suitable one, especially given the complexity of integrating them into one's solution. The speakers will present an easy-to-use workflow that solves this integration challenge by combining various open source libraries, databases (e.g., Hive, Postgres, MySQL, HBase), and visualization with distributed analytics. Intel developed a highly scalable library built on Apache Spark with novel graph, statistical, and machine learning algorithms that also improves the Apache Spark user experience through easier-to-use APIs. This session will showcase how to address these issues for a drug-similarity use case: from ETL operations on raw drug data, to deriving relevant features from each drug's chemical structure using statistical and graph algorithms, to identifying the best model and parameters for this data, and finally demonstrating the ease of connectivity to different databases and visualization tools. Session hashtag: #SFeco18
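The abstract does not say which similarity measure the session uses. As a minimal local sketch, assuming the derived features are binary chemical fingerprints (a hypothetical choice, not stated above), a common drug-similarity metric over such fingerprints is the Tanimoto (Jaccard) coefficient; in the described workflow this pairwise step would run distributed on Apache Spark:

```python
# Minimal local sketch of a drug-similarity computation.
# Assumptions (not from the session): features are binary chemical
# fingerprints, and similarity is the Tanimoto (Jaccard) coefficient.

def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints (0/1 sequences)."""
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    union = on_a | on_b
    if not union:
        return 0.0
    return len(on_a & on_b) / len(union)

# Toy fingerprints for two hypothetical drugs.
drug_a = [1, 1, 0, 1, 0, 0, 1, 0]
drug_b = [1, 0, 0, 1, 0, 1, 1, 0]
print(tanimoto(drug_a, drug_b))  # 3 shared bits / 5 set bits = 0.6
```

The coefficient ranges from 0 (no shared structural features) to 1 (identical fingerprints), which makes it convenient for the clustering and model-selection steps the abstract mentions.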
In healthcare, DICOM is an international standard format for storing medical images (MRI/CT representations). Each image carries embedded metadata along with its pixel data. There is currently a tremendous amount of effort in healthcare to incorporate image analytics into clinical data analysis, and Apache Spark is a natural framework for integrating these efforts. This session presents an analytics workflow that uses Apache Spark to perform ETL on DICOM images and then applies eigendecomposition to derive meaningful insights from the pixel data. The workflow integrates DCM4CHE, a Java-based DICOM framework, with Apache Spark to parallelize the big data workload for fast processing. Users can extract features from the metadata and run efficient clean/filter/drill-down operations for preprocessing. See a demonstration of predictive analytics with visualization that uses the metadata to derive insights, such as the likelihood of a condition or the efficacy of medication administered. The speakers will also present performance benchmarks of this workflow on various datasets and cluster configurations to demonstrate the benefits of running this kind of analysis workflow on Apache Spark. Session hashtag: #SFds20
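The session does not publish its pixel-analysis code. As a local sketch of the eigendecomposition step, assuming each image is flattened to a vector and the eigendecomposition is taken over the covariance matrix of the image set (a PCA-style analysis; these assumptions are mine, not the speakers'), the core computation might look like the following in NumPy, with Spark parallelizing the surrounding ETL and this step in the actual workflow:

```python
# Local sketch of the eigendecomposition step on image pixel data.
# Assumptions (not from the session): images are flattened to vectors,
# and we eigendecompose the covariance matrix of the image set.
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for decoded DICOM pixel data: 50 "images", each flattened
# to 16 pixel values.
images = rng.normal(size=(50, 16))

# Covariance across pixel positions, then its eigendecomposition.
# eigh is appropriate because a covariance matrix is symmetric.
cov = np.cov(images, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# The largest eigenvalues mark the directions of greatest variance in
# the pixel data; their eigenvectors are the principal components.
top_three = eigenvalues[::-1][:3]
print(top_three)
```

Keeping only the leading components compresses each image to a handful of coefficients, which can then be joined with the extracted DICOM metadata for the kind of filtering and predictive analysis the abstract describes.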