Building an ML Tool to Predict Article Quality Scores Using Delta & MLflow


For Roularta, a news & media publishing company, it is of great importance to understand reader behavior and what content attracts, engages, and converts readers. At Roularta, we have built an AI-driven article quality scoring solution using Spark for parallelized compute, Delta for efficient data lake use, BERT for NLP, and MLflow for model management. The article quality score solution is an NLP-based ML model which, for every published article, gives a calculated and forecasted article quality score based on three dimensions (conversion, traffic, and engagement).

The score helps editorial and data teams make data-driven article decisions, such as launching another social post, placing an article behind the paywall, and/or top-listing the article on the homepage.

The article quality score gives editorial a quantitative base for writing more impactful articles and running a better news desk. In this talk, we will cover how this article quality score tool works, including:
– The role of Delta in accelerating the data ingestion and feature engineering pipelines
– The use of the NLP BERT language model (Dutch-based) to extract features from article text in a Spark environment
– The use of MLflow for experiment tracking and model management
– The use of MLflow to serve the model as a REST endpoint within Databricks in order to score newly published articles

Speaker: Ivana Pejeva


– Hi, everyone, I’m Ivana, and I’m a Data Scientist at element61. In today’s talk, I’m going to talk about how we built a machine learning model to predict article quality scores using Delta and MLflow. So let’s dive into it. First, I want to talk about the challenge we had and the goal of the solution we built. The solution was developed for a media company in Belgium. Roularta is a leading publisher in Belgium. They have more than 40 strong brands, each focusing on different target groups, and they deliver content to a lot of Flemish households. They capture the digital touchpoints of the readers and viewers of their websites, and they have more than 3 million visitors and 35 million views per month. Using the data they collect, they would like to calculate article quality. But there are many different KPIs to calculate, and the number of possible KPIs one could track is practically infinite. So the goal is also to simplify, standardize, and automate an objective measurement. The goal of this solution is twofold. The first part is to calculate the quality score of an article: for past articles, Roularta wants to know and analyze which articles did well in bringing traffic, conversion, and engagement to their websites and their articles. The second part is to predict the quality of an article: for newly published articles, they want to predict whether the article will bring good conversion, traffic, and engagement. So let’s see how we actually do it. What we offer the editorial team is a data tool which calculates article scores on historical articles and predicts article scores on new articles. The data we’re ingesting is real-time data. We have streaming data sources coming from BlueConic, which contains the reading behavior of the users, and data coming from the content management system, which contains more detailed information on the articles.
Because this data comes in in streaming mode, we use Spark Structured Streaming on the Databricks platform to ingest it and put it into the different layers of our data lake. So we go through the flow of having a bronze, a silver, and a gold layer, where the silver and gold layers are in Delta Lake format. Once the data is ingested, cleaned, and transformed into the gold zone, we launch a new data pipeline, again in Databricks, which first calculates the scores for engagement, traffic, and conversion, and at the end calculates the overall quality score of an article. Once we have that, we do feature extraction for the second part of our solution, predicting the scores. Both the results of the score calculation, with all the metrics we compute there, and the features we extract for the predictive models are saved in intermediate Delta tables. From these Delta tables we then take the scores and the extracted features, and we can build the prediction model. The prediction model is created in Databricks, also using MLflow, and is then served to the editorial teams as an API, so they can call it to see, for new articles, which article will get which score in the different engagement, traffic, and conversion areas. Next, I want to show you exactly what we want to predict. We predict three different scores. We want to predict the traffic score, which tells us how much traffic an article is going to bring to the site. Secondly, we want to predict the conversion score, which shows us whether this article is going to bring us good conversions. And lastly, we want to predict the engagement score, which tells us whether an article keeps people engaged on the site. So these are the three scores that we want to predict.
So how does the flow for calculating and predicting the article scores actually look? First, we start by simply gathering all the data. We get the pageviews and the content data: the number of pageviews, the pageviews coming from social media (did people access this article from a Facebook page or from Instagram?), who the author of the article is, when the article was published, et cetera. Once we have gathered all this pageview and content data and extracted all the measures we need, we can calculate the three scores: the traffic, engagement, and conversion score. We calculate each of these as a weighted sum, so each metric gets a different weight for the traffic score, the engagement score, and the conversion score. Once we have the scores, we calculate the quality stars for each article: an article can get from one to five stars on each of these scores, which we determine using the distribution of the scores. Once we have the scores for the three different areas, we calculate an overall quality score based on them. And lastly, once we have all the scores for the historical articles, we use this data to predict the scores for new articles, so we predict the traffic, engagement, and conversion score for new articles. But before we go any further into how we built the solution, it’s important to know what the data is. We have two types of data. First, there is the reading behavior of the users. This comes from a CDP tracker embedded in all the websites, and it gives us a 360-degree overview of the customer: we get all the touchpoints for each pageview for each user.
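The weighted-sum and star logic just described can be sketched in plain Python. The metric names, weights, and cut-offs below are illustrative assumptions, not Roularta's actual values; in practice the cut-offs would come from the distribution of historical scores (e.g. the 20th/40th/60th/80th percentiles).

```python
from bisect import bisect_right

def weighted_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of (normalised) metrics, e.g. for the traffic score."""
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights)

def to_stars(score: float, cutoffs: list) -> int:
    """Map a score to 1-5 stars using distribution-based cut-offs."""
    return 1 + bisect_right(sorted(cutoffs), score)

# Example: a traffic score from pageview-derived metrics (made-up weights).
traffic_weights = {"pageviews": 0.5, "social_pageviews": 0.3, "search_pageviews": 0.2}
metrics = {"pageviews": 0.8, "social_pageviews": 0.4, "search_pageviews": 0.1}

traffic = weighted_score(metrics, traffic_weights)   # 0.54
stars = to_stars(traffic, [0.2, 0.4, 0.6, 0.8])      # 3 stars

# Overall quality score as a weighted combination of the three area scores.
overall = weighted_score(
    {"traffic": traffic, "engagement": 0.6, "conversion": 0.3},
    {"traffic": 1 / 3, "engagement": 1 / 3, "conversion": 1 / 3},
)
```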
Secondly, we have the content data, which comes from the content management system and contains all the details of an article and of all the content they have. Once we have gathered the data, we have to prepare it, separately for our two different goals: the calculation of the article scores and the prediction of the article scores. For the calculation of the article scores, we use both the pageview and the content data. Here we calculate metrics like the number of pageviews an article has, the average time people spent on the article, whether the article was shared on social media like Facebook, Instagram, or Twitter, and whether people registered after reading the article, or maybe even subscribed to the website. We also look at the bounce rate: did they stay and read some extra articles after reading a certain article? Et cetera. So there are many, many different measurements that we calculate before we actually compute the three different scores. For the prediction of the article scores, we do not use the pageview data, so we don’t use any historical data on other articles; here we only use the data from the content management system. We use data such as the article text itself, the time of publication, who the author of the article is, whether there are multiple authors, whether there are any tags and, if so, which tags, how long the article is, what the topic of the article is, et cetera. These things are later used to extract features for predicting the score of new articles. But the calculation of the article scores is a really computationally intensive job. A lot of the metrics we calculate require looking at a specific window of data across millions of rows.
We have around 2 million rows per day, and if you want to calculate over a bigger period of time, this becomes a very computationally intensive job. Take, for example, the calculation of the number of visitors to an article where the user had not been seen on the site for the past 30 days. Calculating this type of measure would require always looking at a specific window of data for each article, a very expensive operation, because we would be looking over millions and millions of rows. Therefore, we keep intermediate Delta tables for specific measurements. In this case, we keep the visitors from the last 30 days in an intermediate Delta table, and from there we can simply check whether the user was seen in the last 30 days or not. This way, we avoid using a windowing function over a huge amount of data, which improved our performance tenfold. So having intermediate Delta tables really helped us with measures that require looking over a big window of data. That brings me to the role of Delta in this solution. Delta really accelerates the data ingestion and feature extraction pipelines, so its role is very important here. First, because we have around 2 million pageviews per day that we need to process in streaming mode. The pageview data also needs to be GDPR compliant, which means that, in order to anonymize it, we have to perform a lot of delete and update operations, which need to happen in real time. So Delta is used a lot in the ingestion part, mostly in the silver and gold layers of the Delta Lake. We have around 250k articles, and we have 50 brands.
So if we combine all the articles and the millions of pageviews to extract features or calculate different metrics, we need to create intermediate Delta tables to get the performance we need. And lastly, Delta was also used for time travel: because in some cases we want to be able to recreate machine learning experiments, Delta was really key in this part. Next, I’d like to share the feature extraction process with you. As I mentioned before, we have the reading behavior data and the content data. We use them together to calculate the different metrics needed for the different scores, and at the end we calculate the traffic, engagement, and conversion score. Next to this, we want to create features for each article that we will use to build the predictive model. But for the predictive model, as I mentioned, we only use the content data. In this case, we calculate features like: is it a short or a long article? Does the teaser have an image? Are there any tags in the article? Who is the author? Are there multiple authors? When was it published? We use the teaser text to extract some text features as well. Then, once we have the features for each article, we join in the scores for traffic, engagement, and conversion, and those become the labels for the articles. Once we have this, the data is ready to be fed to the predictive model. While creating the features for the article text and the teaser text, we first tried using TF-IDF to create the feature representation of our text. But we were convinced that there were better ways to create features from text.
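The content-only features just listed could be assembled along these lines. This is a plain-Python sketch: the field names, the 400-word "long article" cut-off, and the input record shape are hypothetical, not taken from the actual pipeline.

```python
from datetime import datetime

def article_features(article: dict) -> dict:
    """Derive predictive-model features from a CMS article record."""
    published = datetime.fromisoformat(article["published_at"])
    return {
        "is_long": len(article.get("text", "").split()) > 400,  # assumed cut-off
        "has_teaser_image": bool(article.get("teaser_image")),
        "n_tags": len(article.get("tags", [])),
        "multiple_authors": len(article.get("authors", [])) > 1,
        "publish_hour": published.hour,
        "publish_weekday": published.weekday(),
    }

# Example record with made-up values.
article = {
    "text": "word " * 500,
    "teaser_image": "https://example.com/img.jpg",
    "tags": ["politics", "belgium"],
    "authors": ["A. Author"],
    "published_at": "2020-06-24T08:30:00",
}
features = article_features(article)
```

Joining these per-article feature rows with the computed star scores yields the labeled training set.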
Since BERT and transformers have become really popular in the NLP community because of the really good results they give on many different tasks, we wanted to look at how BERT could help us create a feature representation of the text. First, we looked at BERT, which is originally trained on English text. But our articles are mostly in Dutch and French, and we wanted to first tackle the articles written in Dutch, so the original BERT model trained on English text was not really an option for us. There is the multilingual BERT, which is trained on different languages, but it is only trained on Wikipedia articles, which was not really enough for us, because we wanted a model trained on a diverse and sufficiently large dataset. Therefore, we chose to use the BERTje model, a Dutch BERT model published in a paper in 2019. But this model was not available out of the box for Spark, so it was not possible to use it from the Spark NLP library. Therefore, we leveraged pandas UDFs (pandas user-defined functions) to use this model directly from the Transformers Python library with good performance. Using the BERT language model to extract representations from the article text really improved our machine learning performance a lot, so it was a really good choice. Next, we’ll see how we built the predictive model. You can see on this slide that once we had created a dataset containing all the features together with the labels for each article, we were ready to create a pipeline and train a predictive model. In the first step, we created a machine learning pipeline to transform our features.
In this step, we binarized features, encoded string columns into label indices, and used TF-IDF to reflect the importance of words in tags, topics, and collections of articles. We used the mean and standard deviation to scale the data, and at the end we created a feature vector. Once we had that, we were ready to train the machine learning model. In this case, we’re dealing with a multi-class classification problem: we have to predict the number of stars, from one to five, for each article. For this, we chose the random forest classifier. But we had tried many different algorithms, logging the different metrics and checking the performance of each of them using MLflow. We used cross-validation, all the metrics were logged, and we were able to see and compare the performance of the different models. We evaluated the models using the multi-class classification evaluator, and the best model was registered in MLflow. So we also leveraged MLflow in this solution; its role for us is as important as Delta’s. We have to train machine learning models per brand, and we have around 50 different brands. For each of those brands, we have to create a model per conversion, traffic, and engagement score, so that is a lot of models to train. It was really a must to have a detailed and clear overview of all the models together, so that we could see which model was performing well and which model was already in production; we wanted to be able to monitor these models. It was also really important to promote new model versions fast. We have new articles coming in every single day, and we want to quickly retrain our models; if a new model performs better than the previous one, we want to promote that new model version to production very fast.
For all of these things we were using MLflow, which really helped us to track the models, to register the models, and lastly, to serve the models. Serving is the newest addition to MLflow, released not too long ago, and we found it really easy to use: we could easily and quickly create APIs for all the different models, which can later be called to predict the scores for new articles. And what are the results that we got? On this slide we see the predictions for just one brand, with a total of 2368 articles, and the confusion matrix of the predictions against the true labels. For this brand, we’ve seen that in 93% of the cases, a quality article, one with four or five stars, is given four or five stars by the model, which is pretty good for us. We also see that a low-quality article, one with one or two stars, is given four or five stars, so we really miss it, in only 2% of the cases, which was still okay for the business. And if we look at the feature importances, we can see that the teaser text and the topics are the most important features, so using BERT to extract feature representations from the text really helped us achieve better performance. To summarize: in our solution to build a machine learning tool that can predict the quality score of new articles, we used Delta to really accelerate the data ingestion and feature extraction, we used the NLP BERT model for text representation of the articles, which also appeared to be a very good choice and really improved our performance, and we used MLflow for tracking, registering, and serving machine learning models. Thank you all for your attention. I hope everything was clear, and you could see how Delta, NLP BERT models, and MLflow helped us in this journey. If you have any questions, please don’t hesitate to ask them. Thank you.
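Calling a model served by MLflow as a REST endpoint could look like this. The workspace URL, model name, token, and feature columns are illustrative placeholders; the payload follows MLflow's pandas "split" JSON orientation (newer MLflow versions expect it wrapped under a `dataframe_split` key, as here, while older servers accepted the bare split object).

```python
import requests

def split_payload(columns, rows):
    """Pandas 'split'-orientation input for an MLflow model server."""
    return {"dataframe_split": {"columns": list(columns),
                                "data": [list(r) for r in rows]}}

def score_articles(url: str, token: str, columns, rows):
    """POST article features to the served model and return its predictions."""
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {token}"},
        json=split_payload(columns, rows),
    )
    resp.raise_for_status()
    return resp.json()

# Hypothetical usage against a Databricks-hosted endpoint:
# preds = score_articles(
#     url="https://<workspace>/model/brandX_traffic_score/Production/invocations",
#     token="<databricks-token>",
#     columns=["n_tags", "multiple_authors", "publish_hour"],
#     rows=[[2, False, 8]],
# )
```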

About Ivana Pejeva


Ivana is a data scientist, passionate about machine learning and artificial intelligence. As part of the Data Science and Strategy competence center at element61, she helps organizations build and grow business with data. Ivana is an engineering professional with a Master's degree in Artificial Intelligence from KU Leuven.