CUSTOMER STORY

Improving learning outcomes and enhancing student safety with ML

110% faster querying, at 10% of the cost to ingest, than a data warehouse

2 years of continually decreasing total cost of ownership (TCO) with Databricks

100% reduction in data lag from batch processes (24-48 hours) to streaming

INDUSTRY: Education
PARTNERS: dbt
CLOUD: AWS

GoGuardian powers K-12 digital learning environments in which every student can thrive. The company delivers award-winning educational tools for learning engagement, formative assessments, virtual on-demand tutoring and student safety. Their homegrown, bespoke data infrastructure stack met the basic needs of a 300-person company, but rapid growth meant GoGuardian needed a data platform that simplified and standardized development and deployment for their fast-growing data science and engineering teams. GoGuardian now sends all the data they ingest from their business systems to the Databricks Data Intelligence Platform and uses dbt to organize data in Delta Lake. This new data stack has eliminated a 24-hour data lag and powers machine learning models that serve a wide range of key tasks, including suicide and self-harm prevention.

Bespoke data stack becomes costly to maintain

GoGuardian aims to improve the educational experience for all students, and intensive but careful use of data is critical to that mission. To execute on this broad objective, the GoGuardian data and AI team must efficiently and securely bring together data from their products as well as from a wide range of internal systems such as Salesforce and HubSpot.

“Improving outcomes encompasses many activities for our different departments,” explained Manoj Rawat, Director of Data Engineering at GoGuardian. “In Marketing, they’re looking for the best ways to engage customers and increase the ROI on our marketing campaigns. Our product team analyzes product event data to identify the need for new features. Our finance team prepares reports for our board, and our data science team needs to run several customer-facing machine learning (ML) models. Each of these groups needs data to be available in an analytics environment in offline storage so they can do their part to positively impact the future of education.”

Because GoGuardian products process student data, the company treats student data privacy as a top priority. This means ensuring that personally identifiable information (PII) about students is visible only when needed to support customers and that all analysts, scientists and researchers work with anonymized data. To deliver the data their teams needed, GoGuardian Data Engineering initially used an Amazon Web Services (AWS) Redshift data warehouse, AWS batch computing capabilities, AWS Glue to process big data and Airflow for orchestration.

Meanwhile, GoGuardian’s 12 ML models, serving over 4 billion inferences per day, mostly ran on a bespoke solution built around Amazon Elastic Container Service (ECS). Data scientists developed ML models in Python, wrapped them in an API using frameworks such as Flask, then containerized the APIs before passing them to the data infrastructure team, who would deploy the models on ECS.

“Because we had scaled up AWS Redshift tremendously to support all our stakeholders, it was getting expensive to use the service, and AWS Glue was also stretching our budget,” recalled Greg Johnson, Data Engineering Manager at GoGuardian. “On the ML side, our existing solution wasn’t providing the data science team with the model testing, monitoring, tracking or deployment tools they needed. We needed to simplify our infrastructure and make it easier to rapidly iterate on our ML models so we could deliver better experiences to our users.”

Simpler data architecture facilitates enhanced data privacy

Today, GoGuardian runs a simpler, more cost-efficient data stack on the Databricks Data Intelligence Platform. GoGuardian uses Auto Loader and Databricks notebook jobs to extract data from their various business systems. The data engineering team lands the extracted data in Amazon S3 and then uses dbt jobs to transform the data in Delta tables. These Delta tables are arranged into a medallion architecture with Bronze (raw), Silver (analytics-ready) and Gold (business-ready) layers, each of which applies further refinement and transformation for stakeholders.
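The Bronze, Silver and Gold layering can be illustrated with a minimal sketch in plain Python. This is an illustration only: GoGuardian builds these layers with dbt on Delta tables, and the record fields, sample rows and function names below are hypothetical.

```python
from collections import Counter

# Hypothetical raw events as they might land in a Bronze (raw) layer.
BRONZE = [
    {"event_id": "1", "product": "Admin", "action": "page_view"},
    {"event_id": "2", "product": "Beacon", "action": "alert"},
    {"event_id": "2", "product": "Beacon", "action": "alert"},  # duplicate row
    {"event_id": "3", "product": None, "action": "page_view"},  # incomplete row
]


def to_silver(bronze_rows):
    """Silver (analytics-ready): drop incomplete rows, de-duplicate by event_id."""
    seen = set()
    silver = []
    for row in bronze_rows:
        if row["product"] is None or row["event_id"] in seen:
            continue
        seen.add(row["event_id"])
        silver.append(row)
    return silver


def to_gold(silver_rows):
    """Gold (business-ready): aggregate event counts per product."""
    return dict(Counter(row["product"] for row in silver_rows))


silver = to_silver(BRONZE)
gold = to_gold(silver)
```

Each layer reads only from the one before it, so a cleaning rule changed in the Silver step flows through to every Gold aggregate built on top of it.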

“GoGuardian hugely values student privacy,” said Ryan Johnson, Senior Director of Data and AI at GoGuardian. “Therefore, our data and AI organization partnered with our legal and privacy teams to design an overall analytics infrastructure that supports this value. A critical aspect of this architecture is that between the point of ingestion and the medallion layers where our scientists, researchers and analysts work, we transform the data through removal, hashing and/or other anonymization techniques to remove all student PII. Databricks made this design feasible and much easier to execute.”
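As a rough sketch of the removal and hashing steps mentioned in the quote above, the following Python drops some fields outright and replaces others with a keyed hash. The field names, the choice of HMAC-SHA-256 and the key handling are assumptions made for illustration; the source does not describe GoGuardian's actual techniques in this detail.

```python
import hashlib
import hmac

# Hypothetical field names; which fields count as PII is an assumption.
DROP_FIELDS = {"student_name"}   # removed outright
HASH_FIELDS = {"student_email"}  # replaced by a keyed hash


def pseudonymize(value: str, key: bytes) -> str:
    """Return the hex HMAC-SHA-256 digest of value under a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, key: bytes) -> dict:
    """Copy a record, dropping DROP_FIELDS and hashing HASH_FIELDS."""
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue
        if field in HASH_FIELDS and value is not None:
            out[field] = pseudonymize(str(value), key)
        else:
            out[field] = value
    return out


example = anonymize_record(
    {"student_name": "A. Student", "student_email": "a@example.org", "action": "page_view"},
    key=b"hypothetical-secret-key",
)
```

A keyed hash is used here rather than a plain hash because guessable values such as email addresses could otherwise be matched by hashing candidate values. The same input and key always produce the same digest, so hashed columns can still be used for joins.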

“Most of the data in our Databricks environment comes from our application databases,” said Rawat. “But Databricks has made it easy to add streaming ingestion patterns too. Thanks to Databricks, the data in our analytics environment is near-real time wherever possible. We’ve eliminated the 24-hour data lag we used to experience with batch processes.”

The Databricks Data Intelligence Platform is now GoGuardian’s single source of truth for increasing operational efficiency and empowering multiple teams. The company built an ETL process to move data from various source systems into the lakehouse. From there, the business intelligence team can easily build Tableau dashboards on the data in Delta Lake. When application engineers want to work with application data without the risk of hitting production databases, they use the lakehouse. Reverse ETL also lets the company flow non-student data back into systems such as HubSpot, which has been mission-critical for their marketing team.

Rather than spending time and resources maintaining a homegrown infrastructure to support ML activities, GoGuardian’s nine data engineers and seven data scientists now collaborate on one unified platform to do everything from ETL and engineering to analytics and ML. Databricks Feature Store lets the company’s data scientists find and share useful ML features. Databricks Model Serving lets GoGuardian expose machine learning models as scalable REST API endpoints and provides a service for easy deployment.
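To show what calling a model exposed as a REST endpoint might look like, here is a minimal sketch that builds, but does not send, a scoring request using only the Python standard library. The workspace URL, endpoint name, token and feature names are hypothetical, and the request shape (a POST to an invocations path with a "dataframe_records" JSON body) is an assumption that should be checked against the Databricks Model Serving documentation.

```python
import json
import urllib.request


def build_scoring_request(workspace_url, endpoint_name, token, records):
    """Build a POST request for a model serving endpoint (assumed request shape)."""
    url = f"{workspace_url.rstrip('/')}/serving-endpoints/{endpoint_name}/invocations"
    body = json.dumps({"dataframe_records": records}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_scoring_request(
    "https://example.cloud.databricks.com",  # hypothetical workspace URL
    "example-model",                         # hypothetical endpoint name
    "<token>",                               # placeholder, not a real credential
    [{"feature_a": 1.0, "feature_b": 0.5}],  # hypothetical feature names
)

# Sending is left out so the sketch runs offline. With a real endpoint:
# with urllib.request.urlopen(request) as response:
#     predictions = json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect and test without network access.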

GoGuardian has also deployed dbt to increase collaboration among their data engineers and data scientists. Stakeholders across the company are also using dbt to create their own data models.

“dbt has democratized analytics throughout GoGuardian,” Ryan Johnson noted. “People can create dbt data models very easily using a simple SQL file and test them on the infrastructure we’ve created with Delta Lake. From there, they can push them to production with minimal assistance from us. With dbt and Databricks, we’ve empowered many of our stakeholder groups to build analytics data models that support their use cases.”

Powerful lakehouse architecture saves money — and helps schools

With Databricks and dbt, GoGuardian is doing more than improving learning outcomes: the company is also ensuring students get help in times of need. The Databricks Data Intelligence Platform, MLflow, Feature Store and Model Serving power part of GoGuardian’s Beacon product, which generates alerts for schools if a student’s online activity suggests signs of at-risk behavior.

“We have helped numerous districts detect and assist students in need of mental health services through the ML models which power Beacon,” explained Rawat. “It’s impossible to put a monetary value on that.”

ML also powers several other GoGuardian products. GoGuardian Admin allows school administrators to filter violent and sexually inappropriate content on school devices. GoGuardian’s Giant Steps software uses machine learning to provide personalized content recommendations for students and teachers, saving teachers time and letting students learn at their own pace.

Meanwhile, GoGuardian has dramatically accelerated their data analysis. By adopting a lakehouse architecture and decoupling data storage from compute, the company gained the ability to run very large clusters on demand.

“One of our biggest tables is about 300 terabytes,” said Rawat. “Analyzing it was always a challenge on Redshift. We could scale up and down for large clusters, but it took hours. When we moved to the lakehouse, we experienced 110% faster querying in many situations. Overall, it has resulted in a significantly lower cost of ownership. We can now analyze our not-frequently analyzed data very quickly, whenever there’s a need. That gives us a major strategic advantage.”

GoGuardian has also slashed development time for new tables and models by at least 90% with Databricks and dbt. Previously, stakeholders had to ask the data team for a new data model and wait at least two weeks. Today, many can create their own models in just hours.

As GoGuardian finishes replacing their entire legacy architecture with Databricks and dbt, the company continues to identify potential new use cases. GoGuardian hopes to provide tools that help their revenue enablement team set up premium accounts more quickly for customers. The company also plans to use their enhanced data analytics to identify areas in which they can increase operational efficiency.

“With data analytics, we may be able to make significant reductions to our time to hire,” concluded Rawat. “Or we can keep a closer eye on trends in product feature usage to see how many users are still using old features, and then reach out to these customers to train them in new features. With the speed at which we’re ingesting and analyzing data through Databricks and dbt, we suddenly have a whole new view of our business.”