Data Mining and Prediction Modelling in the Dairy Industry Using Time Series and Sliding Windows with Apache Spark 2

WHY – As a major livestock producer, the European Union is directly affected by the global need for more sustainable food production. Climate change will undoubtedly affect farm animal production, and the health and welfare of livestock are also of increasing public concern. Thanks to the rapid development of precision livestock farming technologies and the availability of high-throughput data from milk sensors, large-scale datasets have become available on research farms. Milk is the preferred matrix for measuring biomarkers, as it is more accessible than blood and allows low-cost, automated repeat sampling using ‘in-line’ sampling and analytical technologies.
WHAT – Certain components of milk, such as N-glycan structures (BM-1), metabolites (BM-2) or mid-infrared spectra (BM-3), can serve as biomarkers to predict production efficiency and disease. Data mining and machine learning can unlock insights around such biomarkers. As more datasets of these types become available in the near future, scalable data mining and prediction pipelines applied to animal science are needed.
TAKEAWAYS – In this session you will learn:
The methodology for ranking multiple biomarkers according to their predictive power;
Data processing and statistical modelling performed using Spark v2.1.1 with the Scala API;
Infrastructure, configuration, and implementation of the data pipeline using sliding windows with Apache Spark’s MLlib;
Visualization of datasets via Elasticsearch and Kibana.

Session hashtag: #EUds14

About Miel Hostens

Doctor in Veterinary Medicine at Ghent University