
With a tsunami of data, an unprecedented scale of computing resources, and the rapid development of easy-to-learn open source machine learning frameworks, data science and machine learning concepts are much easier to learn and implement today than they were a decade ago.

As a result, across all industries, practitioners are using cutting-edge ML algorithms to solve tough data problems and are eager to learn new techniques.

Learning is a life-long journey, but what are the top skills that data scientists need to stay abreast? According to Inside Big Data and KDnuggets, Python and R Programming, Graphs, NLP, Apache Spark and Hadoop, and unbiased modeling are some of the key areas to explore.

Below is a selection of Data + AI Summit sessions from the Data Science and Python & Advanced Analytics tracks that will help you sharpen these skills.

Data Science

Relationships are one of the most predictive indicators of behavior and preferences. If you want to understand the algorithms available for identifying group dynamics and better predicting community behavior, Predicting Influence and Communities Using Graph Algorithms is for you. In this session, Amy Hodler and Mark Needham of Neo4j will explore how to run community detection and centrality algorithms in Apache Spark, including best practices and examples of running graph algorithms in the Neo4j Graph Platform.
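To give a feel for what community detection does, here is a minimal, self-contained sketch of label propagation on a hypothetical toy graph. This is not the Spark or Neo4j code from the session; it is a plain-Python illustration, and it breaks ties deterministically (to the smallest label) where real implementations usually break ties randomly.

```python
from collections import Counter

# Toy undirected graph: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),   # community 1
         ("d", "e"), ("e", "f"), ("d", "f"),   # community 2
         ("c", "d")]                           # bridge
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

def label_propagation(graph, iterations=10):
    """Each node repeatedly adopts the most frequent label among its
    neighbors; ties break to the smallest label to keep this sketch
    deterministic."""
    labels = {node: node for node in graph}   # start with unique labels
    for _ in range(iterations):
        new_labels = {}
        for node in graph:
            counts = Counter(labels[n] for n in graph[node])
            best = max(counts.values())
            new_labels[node] = min(l for l in counts if counts[l] == best)
        labels = new_labels
    return labels

communities = label_propagation(graph)
# Nodes a, b, c converge to one shared label and d, e, f to another.
```

Despite the bridge edge between c and d, the two triangles settle into separate communities, which is exactly the kind of structure these algorithms surface in real social or transaction graphs.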

Related to predicting behaviors is sentiment analysis, a direct application of natural language processing (NLP). NLP is also used for question answering, paraphrasing and summarizing, natural language BI, and more, and it is a key component in many data science systems that must understand or reason about text. In Apache Spark NLP: Extending Spark ML to Deliver Fast, Scalable, and Unified Natural Language Processing, David Talby of Pacific AI and Alexander Thomas of Indeed will demonstrate the NLP library for Apache Spark using PySpark.
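For readers new to sentiment analysis, the core idea can be shown with a toy lexicon-based scorer. This is only an illustration of the concept: Spark NLP ships trained annotator pipelines, not a hand-built word list like this one, and the lexicon below is entirely made up.

```python
# Tiny hand-built polarity lexicon (illustrative only).
LEXICON = {"great": 1, "love": 1, "fast": 1,
           "slow": -1, "terrible": -1, "hate": -1}

def sentiment(text):
    """Sum word-level polarity scores: >0 positive, <0 negative, else neutral."""
    score = sum(LEXICON.get(word.strip(".,!?"), 0)
                for word in text.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love how fast this is!"))  # positive
print(sentiment("Terrible and slow."))        # negative
```

Production systems replace the lexicon lookup with trained models that handle negation, context, and domain vocabulary, which is where libraries like Spark NLP come in.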

Expanding our data science toolkit, in Running R at Scale with Apache Arrow on Spark, Javier Luraschi of RStudio will introduce the Apache Arrow project and the recent developments that enable running R with Apache Arrow on Apache Spark for significant gains in performance and efficiency.

Last but not least, as we continue to implement machine learning systems in virtually all fields and domains, it is absolutely crucial to prevent discrimination, privacy violations, and even accuracy problems in our systems. In Interpretable AI: Not Just For Regulators, Patrick Hall and Mark Chan of H2O.ai will talk about how to train explainable, fair, trustworthy, and accurate predictive modeling systems.
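One common model-agnostic interpretability technique is permutation importance: shuffle one feature and see how much accuracy drops. The sketch below (not from the session; a stdlib-only toy with a synthetic dataset and a stand-in "model") shows the idea.

```python
import random

random.seed(0)
# Synthetic data: the label depends only on x1; x2 is pure noise.
data = [(random.random(), random.random()) for _ in range(200)]
labels = [1 if x1 > 0.5 else 0 for x1, _ in data]

def model(x1, x2):
    # Stand-in for a trained classifier; by construction it only uses x1.
    return 1 if x1 > 0.5 else 0

def accuracy(rows):
    return sum(model(*r) == y for r, y in zip(rows, labels)) / len(labels)

baseline = accuracy(data)  # 1.0 by construction

def permutation_importance(col):
    """Shuffle one feature column and measure the accuracy drop: a large
    drop means the model leans heavily on that feature."""
    shuffled = [row[col] for row in data]
    random.shuffle(shuffled)
    rows = [(s, x2) if col == 0 else (x1, s)
            for (x1, x2), s in zip(data, shuffled)]
    return baseline - accuracy(rows)

print(permutation_importance(0))  # large drop: the model relies on x1
print(permutation_importance(1))  # 0.0: x2 never mattered
```

Checks like this help reveal whether a model depends on a feature it should not, such as a proxy for a protected attribute.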

Python & Advanced Analytics

One of the biggest challenges faced by Python users is moving huge amounts of data around. In Make your PySpark Data Fly with Arrow!, Bryan Cutler of IBM will give an overview of the new Arrow Flight framework and demonstrate how to build high-performance data connections with Arrow.

Managing the data science workflow, from experimentation to production, is another major challenge data scientists and practitioners face every day. In Data Agility—A Journey to Advanced Analytics and Machine Learning at Scale, Hari Subramanian of Uber will demonstrate how Uber's big data platform and Data Science Workbench put the power of Spark in the hands of data scientists and analysts for advanced analytics and ML/DL use cases at scale. Related to this topic, in June 2018 Databricks unveiled MLflow, an open-source framework for managing the complete machine learning lifecycle. Don't miss Matei Zaharia's keynote for the latest updates on this initiative.

Finally, in Automating Predictive Modeling at Zynga with PySpark and Pandas UDFs, Ben Weber will explain how Zynga overcame scale challenges when working on large data sets, leveraging Pandas UDFs to help scale and automate the feature engineering process. As a result, teams can now use hundreds of propensity models in production to help personalize game experiences, and data scientists are now spending more of their time engaging with game teams to help build new features.
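To make the Pandas UDF idea concrete, here is a minimal sketch using a hypothetical per-player event log (the column names and numbers are made up). The feature function is written against plain pandas; in PySpark the same function can be registered as a grouped-map Pandas UDF and applied per group across a cluster.

```python
import pandas as pd

# Hypothetical per-player event log (illustrative data only).
events = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 2],
    "session_minutes": [10, 30, 5, 15, 25],
})

def engineer_features(pdf: pd.DataFrame) -> pd.DataFrame:
    # Feature logic in plain pandas. In PySpark this same function could
    # be applied per player across the cluster, e.g. via a grouped-map
    # Pandas UDF such as
    # df.groupBy("player_id").applyInPandas(engineer_features, schema=...).
    return pd.DataFrame({
        "player_id": [pdf["player_id"].iloc[0]],
        "total_minutes": [pdf["session_minutes"].sum()],
        "avg_minutes": [pdf["session_minutes"].mean()],
    })

# Locally, a pandas groupby mimics what Spark would distribute:
features = pd.concat(
    engineer_features(group) for _, group in events.groupby("player_id")
).reset_index(drop=True)
print(features)
```

The appeal is that the per-group logic stays ordinary pandas code, while Spark handles partitioning the data and running the function in parallel.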

Data Science Classes

If you are someone who learns best by doing, don't miss our tutorial on Managing the Complete Machine Learning Lifecycle with MLflow, an 80-minute session with an expert-led talk introducing MLflow, a new open source framework for managing the ML lifecycle, followed by hands-on exercises.

Next, to further deepen your Apache Spark™ and MLflow knowledge, check out the Data Science with Apache Spark™ and Machine Learning in Production: MLflow and Model Deployment training courses.

And finally, if you’re new to TensorFlow or Keras and want to learn how to use Horovod for distributed deep learning training, you can enroll in a training course: Hands-on Deep Learning with Keras, TensorFlow, and Apache Spark.

What’s Next

You can also browse all the sessions in the schedule.

If you have not registered yet, use discount code JulesPicks to get a 15% discount.
