One of the biggest challenges I encountered in my career as a data scientist was migrating the core algorithms in a mobile AdTech platform from classic machine learning models to deep learning. I worked on a Demand Side Platform (DSP) for user acquisition, where the role of the ML models is to predict whether showing an ad impression to a device will result in the user clicking on the ad and installing a mobile app. For a quick hands-on overview of the click prediction problem, please check out my past post.
While we were able to quickly get the offline metrics for the deep learning models competitive with our logistic regression models, it took a while to get the deep learning models working smoothly in production, and we encountered many incidents along the way. We started with small-scale tests using Keras for model training and Vertex AI for managed TensorFlow serving, and ran experiments to compare iterations of our deep learning models with our champion logistic regression models. We were eventually able to get the deep learning models to outperform the classic ML models in production and modernize our ML platform for user acquisition.
When working with machine learning models at the core of a complex system, there will be situations where things go off the rails, and it’s important to be able to recover quickly and learn from these incidents. During my time at Twitch, we used the Five W’s approach to writing postmortems for incidents. The idea is to identify “what” went wrong, “when” and “where” it occurred, “who” was involved, and “why” the problem resulted. The follow-up is to establish how to avoid this type of incident in the future and to set up guardrails to prevent similar issues. The goal is to build a more and more robust system over time.
In one of my past roles in AdTech, we ran into several issues when migrating from classic ML models to deep learning. We eventually got to a state where we had a robust pipeline for training, validating, and deploying models that improved upon our classic models, but we hit a number of incidents along the way. In this post we’ll cover 8 of these incidents and answer the following questions for each one:
- What was the issue?
- How was it found?
- How was it fixed?
- What did we learn?
We identified a variety of root causes, but often aligned on similar solutions when making our model pipelines more robust. I hope sharing details about these incidents provides some guidance on what can go wrong when using deep learning in production.
Incident 1: Untrained Embeddings
What was the issue?
We found that many of the models that we deployed, such as the click and install conversion models, were poorly calibrated. This meant that the conversion rates predicted by the model were much higher than the actual conversion rates we observed for the impressions we served. After drilling down further, we found that the miscalibration was worse on categorical features where we had sparse training data. Eventually we discovered that we had embedding layers in our install model where no training data was available for some of the vocabulary entries. This meant that when fitting the model, we never updated these entries, and their coefficients remained at their randomly initialized weights. We called this incident “Untrained Embeddings”, because we had embedding layers where some of the layer weights never changed during model training.
How was it found?
We mostly discovered this issue through intuition after reviewing our models and data sets. We used the same vocabularies for categorical features across two models, and the install model data set was smaller than the click model data set. This meant that some vocabulary entries that were fine to use for the click model were problematic for the install model, because they had no training examples in the smaller data set. We confirmed that this was the issue by comparing the weights in the embedding layers before and after training, and finding that a subset of the weights were unchanged after fitting the model. Because we randomly initialized the weights in our Keras models, these untouched entries produced arbitrary outputs and hurt model calibration on live data.
How was it fixed?
We first limited the size of the vocabularies used for categorical features to reduce the likelihood of this issue occurring. The second change we made was setting the weights to 0 for any embedding layer entries whose weights were unchanged during training. Longer term, we moved away from reusing vocabularies across different prediction tasks.
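To make the detect-and-repair step concrete, here is a minimal sketch of what this can look like in Keras, assuming a model whose categorical features go through standard Embedding layers; the helper names are hypothetical and the snapshot/zeroing logic is illustrative rather than our exact production code.

```python
import numpy as np
import tensorflow as tf

def snapshot_embeddings(model):
    """Copy the current embedding weights, keyed by layer name."""
    return {
        layer.name: layer.get_weights()[0].copy()
        for layer in model.layers
        if isinstance(layer, tf.keras.layers.Embedding)
    }

def zero_untrained_rows(model, before):
    """Zero out embedding rows whose weights never moved during training."""
    for layer in model.layers:
        if not isinstance(layer, tf.keras.layers.Embedding):
            continue
        after = layer.get_weights()[0]
        untrained = np.all(np.isclose(before[layer.name], after), axis=1)
        after[untrained] = 0.0
        layer.set_weights([after])
        print(f"{layer.name}: zeroed {untrained.sum()} untrained rows")

# Usage: snapshot before fitting, then repair after fitting.
# before = snapshot_embeddings(model)
# model.fit(train_ds, validation_data=val_ds)
# zero_untrained_rows(model, before)
```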
What did we learn?
We discovered that this was one of the issues that was leading to model instability, where models with similar performance on offline metrics would have noticeably different performance when deployed to production. We ended up building more tooling to compare model weights across training runs as part of our model validation pipeline.
Incident 2: Padding Issue with Batching for TensorFlow Serving
What was the issue?
We migrated from Vertex AI for model serving to an in-house deployment of TensorFlow Serving, to deal with some of the tail latency issues that we were encountering with Vertex at the time. When making this change, we ran into an issue with how to handle sparse tensors when enabling batching for TensorFlow Serving. Our models contained sparse tensors for features, such as the list of known apps installed on a device, that could be empty. When we enabled batching on Vertex AI, we were able to use empty arrays without issue, but our in-house model serving returned error responses when batching was enabled and empty arrays were passed. We ended up passing “[0]” instead of “[]” tensor values to avoid this issue, but this again resulted in poorly calibrated models. The core issue is that “0” referred to a specific app rather than being used for out-of-vocab (OOV). We had introduced a feature parity issue into our models, because we only made this change for model serving and not for model training.
How was it found?
Once we identified the change that had been made, it was straightforward to demonstrate that this padding approach was problematic. We took records with an empty tensor and changed the value from “[]” to “[0]” while keeping all of the other tensor values constant, and showed that this change resulted in different prediction values. This made sense, because we were changing the tensor data to claim that an app was installed on the device when that was not actually the case.
How was it fixed?
Our initial fix was to change the model training pipeline to perform the same logic as model serving, where we replace empty arrays with “[0]”, but this didn’t completely address the issue. We later changed the vocab range from [0, n-1] to [0, n], where 0 had no meaning and was added to every tensor. This meant that every sparse tensor had at least one value, and we were able to use batching with our sparse tensor setup.
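As an illustration of the shifted vocabulary, here is a minimal sketch of a shared encoding helper. It assumes a Python dict vocab mapping raw values to indices in [1, n]; the function name and the out-of-vocab parameter are assumptions for this example, since the incident write-up doesn’t cover OOV handling after the change. The important property is that the same helper runs in both the training and serving pipelines.

```python
PAD_INDEX = 0  # reserved: present in every tensor, carries no meaning

def encode_sparse_feature(values, vocab, oov_index):
    """Encode a possibly-empty list of raw values as vocabulary indices.

    The vocab maps known values to indices in [1, n]. PAD_INDEX (0) is
    added to every tensor so that sparse tensors are never empty, which
    keeps batching in TensorFlow Serving working. Unknown values fall
    back to the caller-supplied OOV index (a placeholder here).
    """
    encoded = [vocab.get(v, oov_index) for v in values]
    return [PAD_INDEX] + encoded

# Usage: an empty feature still produces a non-empty tensor.
# vocab = {"com.example.app": 1, "com.example.game": 2}
# encode_sparse_feature([], vocab, oov_index=3)             # -> [0]
# encode_sparse_feature(["unknown.app"], vocab, oov_index=3) # -> [0, 3]
```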
What did we learn?
This issue mostly came up due to different threads of work on the model training and model serving pipelines and a lack of coordination between them. Once we identified the differences between the training and serving pipelines, it was obvious that this discrepancy could cause issues. We worked to improve on this incident by including data scientists as reviewers on pull requests for the production pipeline, to help identify these types of issues.
Incident 3: Untrained Model Deployment
What was the issue?
Early on in our migration to deep learning models, we didn’t have many guardrails in place for model deployments. For each model variant we were testing, we would retrain and automatically redeploy the model daily, to make sure that the models were trained on recent data. During one of these runs, training produced a model that always predicted a 25% click rate regardless of the feature inputs, and the ROC AUC metric on the validation data set was 0.5. We had essentially deployed a constant-output model to production.
How was it found?
We first identified the issue using our system monitoring metrics in Datadog. We logged our click predictions (p_ctr) as a histogram metric, and Datadog provides p50 and p99 aggregations. When the model was deployed, we saw the p50 and p99 values for the model converge to the same value of ~25%, indicating that something had gone wrong with the click prediction model. We also reviewed the model training logs and saw that the metrics from the validation data set indicated a training error.
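For reference, emitting the prediction as a histogram metric is a one-liner with the DogStatsD client; this sketch assumes a local Datadog agent, and the metric name and tag are hypothetical.

```python
from datadog import initialize, statsd

# Assumes a DogStatsD agent is listening locally.
initialize(statsd_host="localhost", statsd_port=8125)

def log_click_prediction(p_ctr: float, model_version: str) -> None:
    """Emit each click prediction as a histogram metric so Datadog can
    aggregate p50/p99 per model version and alert on sudden convergence."""
    statsd.histogram(
        "dsp.model.p_ctr",  # hypothetical metric name
        p_ctr,
        tags=[f"model_version:{model_version}"],
    )
```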
How was it fixed?
In this case, we were able to roll back to the previous day’s click model to resolve the issue, but it took some time for the incident to be discovered, and our rollback approach at the time was somewhat manual.
What did we learn?
We found that this kind of bad training run occurred around 2% of the time, and we needed to set up guardrails against deploying these models. We added a model validation module to our training pipeline that checked thresholds on the validation metrics and compared the outputs of the new and prior model runs on the same data set. We also set up alerts in Datadog to flag large changes in the p50 p_ctr metric and worked on automating our model rollback process.
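A minimal sketch of such a validation gate is below; the thresholds and function signature are assumptions for illustration, not the exact checks we ran.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical thresholds for the deployment gate.
MIN_AUC = 0.55              # reject near-random models (AUC ~ 0.5)
MAX_MEAN_PRED_SHIFT = 0.02  # reject large shifts vs. the prior model

def validate_candidate(y_val, new_preds, prior_preds) -> bool:
    """Return True only if the newly trained model looks safe to deploy."""
    auc = roc_auc_score(y_val, new_preds)
    if auc < MIN_AUC:
        print(f"Rejecting model: validation AUC {auc:.3f} is below {MIN_AUC}")
        return False
    shift = abs(float(np.mean(new_preds)) - float(np.mean(prior_preds)))
    if shift > MAX_MEAN_PRED_SHIFT:
        print(f"Rejecting model: mean prediction shifted by {shift:.3f}")
        return False
    return True
```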
Incident 4: Bad Warmup Data for TensorFlow Serving
What was the issue?
We used warmup files for TensorFlow serving to improve the rollout time of new model deployments and to help with serving latency. We ran into an issue where the tensors defined in the warmup file did not correspond to the tensors defined in the TensorFlow model, resulting in failed model deployments.
How was it found?
In an early version of our in-house serving, this mismatch between warmup files and model tensor definitions would cause all model serving to come to a halt and require a model rollback to stabilize the system. This is another incident that was initially captured by our operational metrics in Datadog, since we saw a large spike in model serving errors. We then isolated the problem by deploying the new model to Vertex AI, which confirmed that the warmup files were the root cause of the issue.
How was it fixed?
We updated our model deployment module to confirm that the model tensors and warmup files were compatible by launching a local instance of TensorFlow serving in the model training pipeline and sending sample requests using the warmup file data. We also did additional manual testing with Vertex AI when launching new types of models with noticeably different tensor shapes.
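A lighter-weight check along the same lines is to parse the warmup records and compare them against the SavedModel’s serving signature before shipping anything; the sketch below assumes the warmup requests were written as PredictLog entries and is not our exact deployment module.

```python
import tensorflow as tf
from tensorflow_serving.apis import prediction_log_pb2

def check_warmup_compatibility(saved_model_dir: str, warmup_path: str) -> bool:
    """Verify that warmup requests reference tensors that exist in the
    model's serving signature, with matching dtypes."""
    signature = tf.saved_model.load(saved_model_dir).signatures["serving_default"]
    expected = signature.structured_input_signature[1]  # name -> TensorSpec

    compatible = True
    for record in tf.data.TFRecordDataset(warmup_path):
        log = prediction_log_pb2.PredictionLog.FromString(record.numpy())
        for name, tensor_proto in log.predict_log.request.inputs.items():
            if name not in expected:
                print(f"Warmup input '{name}' is not in the model signature")
                compatible = False
            elif tf.dtypes.as_dtype(tensor_proto.dtype) != expected[name].dtype:
                print(f"Warmup input '{name}' has a dtype mismatch")
                compatible = False
    return compatible
```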
What did we learn?
We learned that we needed to have different environments for testing TensorFlow model deployments before pushing them to production. We were able to do some testing with Vertex AI, but eventually set up a staging environment for our in-house version of TensorFlow serving to provide a proper CI/CD environment for model deployment.
Incident 5: Problematic Time-Based Features
What was the issue?
We explored some time-based features in our models, such as weeks_ago, to capture changes in behavior over time. For the training pipeline, this feature was calculated as floor(date_diff(today, day_of_impression)/7). It was a highly ranked feature in some of our models, but it also added unintended bias. During model serving, the value is always 0, since we make model predictions in real time and today is the same as day_of_impression. The key issue is that the training pipeline was finding patterns in the training data that would not hold at serving time, creating bias when applying the model to live data.
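To make the mismatch concrete, here is a small illustrative snippet (the dates are made up):

```python
from datetime import date

def weeks_ago(day_of_impression: date, today: date) -> int:
    """floor(date_diff(today, day_of_impression) / 7)"""
    return (today - day_of_impression).days // 7

# Training pipeline: impressions can be months old, so the model sees
# a wide range of values.
weeks_ago(date(2023, 3, 1), today=date(2023, 6, 1))   # -> 13

# Serving pipeline: predictions are made in real time, so the feature
# is always 0 and the model never sees the values it was trained on.
weeks_ago(date(2023, 6, 1), today=date(2023, 6, 1))   # -> 0
```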
How was it found?
This was another incident that we found mostly through intuition and later confirmed by comparing the implementation logic across the training and serving pipelines. We found that the serving pipeline always set the value to 0, while the training pipeline used a wide range of values, given that we often used months-old examples for training.
How was it fixed?
We created a variant of the model with all of the relative time-based features removed and ran an A/B test to compare the performance of the variants. The model that included the time-based features performed better on the holdout metrics during offline testing, but the model with the features removed worked better in the A/B test, and we ended up removing the features from all of our models.
What did we learn?
We found that we had introduced bias into our models in an unintended way. The features were compelling to explore, because user behavior does change over time, and introducing them did result in better offline metrics. Eventually we decided to categorize these as feature parity issues, where we see differences in values between the model training and serving pipelines.
Incident 6: Feedback Features
What was the issue?
We had a feature called clearing_price that logged how high we were willing to bid on an impression for a device the last time we served an ad impression to that device. This was a useful feature, because it helped us bid on devices with a high bid floor, where the model needs high confidence that a conversion event will occur. On its own this feature generally wasn’t problematic, but it became one during an incident where we introduced bad labels into our training data set. We ran an experiment that resulted in false positives in our training data, and we started to see a feedback loop in which model bias became a problem.
How was it found?
This was a very challenging incident to root-cause, because the experiment that generated the false positive labels was run on a small cohort of traffic, so we did not see a sudden change in our operational metrics in Datadog like we did with some of the other incidents. Once we identified which devices and impressions were impacted by this test, we looked at the feature drift in our data set and found that the average value of the clearing_price feature had been increasing steadily since the rollout of the experiment. The false positives in the label data were the root cause of the incident, and the drift in this feature was a secondary issue that was causing the models to make bad predictions.
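This kind of drift can be surfaced with a simple daily check on the feature distribution; the sketch below assumes a pandas DataFrame of training examples with date and clearing_price columns, and the alert threshold is a made-up value.

```python
import pandas as pd

def flag_feature_drift(df: pd.DataFrame, feature: str,
                       baseline_days: int = 14, threshold: float = 1.2) -> pd.Series:
    """Flag days where the mean feature value exceeds a trailing baseline.

    Returns the ratio of each day's mean to the mean of the prior
    `baseline_days` days, filtered to the days that breach the threshold."""
    daily_mean = df.groupby("date")[feature].mean().sort_index()
    baseline = daily_mean.rolling(baseline_days, min_periods=baseline_days).mean().shift(1)
    drift_ratio = daily_mean / baseline
    return drift_ratio[drift_ratio > threshold]

# Usage: flag_feature_drift(training_df, "clearing_price")
```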
How was it fixed?
The first step was to roll back to the last known-good model from before the problematic experiment was launched. We then cleaned up the data set and removed the false positives that we could identify from the training data. We continued to see issues, and also made the call to remove the problematic feature from our models, similar to the time-based features, to prevent it from creating feedback loops in the future.
What did we learn?
We learned that some features are helpful for making the model more confident in predicting user conversions, but are not worth the risk, because they can introduce a tail-spin effect where model performance quickly deteriorates and creates incidents. To replace the clearing price feature, we introduced new features based on the minimum-bid-to-win values from auction callbacks.
Incident 7: Bad Feature Encoding
What was the issue?
We explored a few numeric features computed as ratios, such as the average click rate of a device, computed as the number of clicks over the number of impressions served to the device. We ran into a feature parity issue where we handled divide-by-zero in different ways in the training and serving pipelines.
How was it found?
We had a feature parity check where we logged the tensors created during model inference for a subset of impressions, ran the training pipeline on those same impressions, and compared the values generated by the training pipeline with the values logged at serving time. We noticed a large discrepancy for the ratio-based features and found that we encoded divide-by-zero as -1 in the training pipeline and as 0 in the serving pipeline.
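A simplified version of that parity check looks like the following; it assumes both pipelines can produce a dict of feature name to numpy array for the same logged impressions, which is a simplification of the real setup.

```python
import numpy as np

def feature_parity_report(serving_tensors: dict, training_tensors: dict,
                          atol: float = 1e-6) -> dict:
    """Compare tensors logged at serving time with tensors recomputed by
    the training pipeline for the same impressions, and report mismatches."""
    mismatches = {}
    for name, served in serving_tensors.items():
        trained = training_tensors.get(name)
        if trained is None:
            mismatches[name] = "missing from training pipeline"
        elif served.shape != trained.shape:
            mismatches[name] = f"shape mismatch: {served.shape} vs {trained.shape}"
        elif not np.allclose(served, trained, atol=atol):
            diff_rate = float(np.mean(~np.isclose(served, trained, atol=atol)))
            mismatches[name] = f"{diff_rate:.1%} of values differ"
    return mismatches
```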
How was it fixed?
We updated the serving pipeline to match the logic in the training pipeline, setting the value to -1 when a divide-by-zero occurs for the ratio-based features.
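The fix amounts to making both pipelines share the same encoding logic; a minimal sketch (with a hypothetical function name):

```python
def encode_ratio(numerator: float, denominator: float) -> float:
    """Shared encoding for ratio features such as clicks / impressions.

    Divide-by-zero is encoded as -1 in both the training and serving
    pipelines so the model always sees a consistent sentinel value."""
    if denominator == 0:
        return -1.0
    return numerator / denominator
```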
What did we learn?
Our pipeline for detecting feature parity issues allowed us to quickly identify the root cause once the model was deployed to production, but this is a situation we would rather catch before a model is deployed. We applied the same learning from incident 2, where we included data scientists on pull request reviews to help identify potential differences between our training and serving pipelines.
Incident 8: String Parsing
What was the issue?
We used a 1-hot encoding approach where we chose the top k values, assigned them indices from 1 to k, and used 0 as an out-of-vocab (OOV) value. We ran into a problem with the encoding from strings to integers when dealing with categorical features such as app bundle, which often has additional characters. For example, the vocabulary may map the bundle com.dreamgames.royalmatch to index 3, but in the training pipeline the bundle is set to com.dreamgames.royalmatch$hl=en_US and the value gets encoded to 0, because it is considered OOV. The core issue we ran into was different logic for sanitizing string values between the training and serving pipelines before applying vocabularies.
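Here is a sketch of the kind of shared sanitization that avoids this, using the example bundle above; the exact rules (stripping everything after “$”, normalizing whitespace and case) are illustrative assumptions. The important part is that the same sanitization runs in both pipelines, or better yet, once at data ingestion.

```python
def sanitize_bundle(bundle: str) -> str:
    """Normalize an app bundle before the vocabulary lookup.

    Illustrative rule: drop anything after a '$' (e.g. '$hl=en_US')
    and normalize whitespace and case."""
    return bundle.split("$", 1)[0].strip().lower()

def encode_bundle(bundle: str, vocab: dict) -> int:
    """Look up the sanitized bundle, falling back to 0 for out-of-vocab."""
    return vocab.get(sanitize_bundle(bundle), 0)

# encode_bundle("com.dreamgames.royalmatch$hl=en_US",
#               {"com.dreamgames.royalmatch": 3})   # -> 3, not OOV
```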
How was it found?
This was another incident that we discovered with our feature parity checker. We found several examples where one pipeline encoded the values as OOV while the other pipeline assigned non-zero values. We then compared the feature values prior to encoding and noticed discrepancies between how we did string parsing in the training and serving pipelines.
How was it fixed?
Our short-term fix was to update the training pipeline to perform the same string parsing logic as the serving pipeline. Longer term, we focused on truncating the app bundle names at the data ingestion step, to reduce the need for manual parsing steps in the different pipelines.
What did we learn?
We learned that dealing with problematic strings at data ingestion provided the most consistent results. We also ran into issues with Unicode characters showing up in app bundle names and worked to parse these correctly during ingestion. Finally, we found it necessary to occasionally inspect the vocabulary entries generated by the system to make sure unexpected characters were not showing up in entries.
Takeaways
While it may be tempting to use deep learning for model serving in production, there are many potential issues that you can encounter with live model serving. It’s important to have robust plans in place for incident management when working with machine learning models, so that you can quickly recover when model performance becomes problematic and learn from these missteps. In this post we covered 8 different incidents I encountered when using deep learning to predict click and install conversion in a mobile AdTech platform. Here are the key takeaways I learned from these machine learning incidents:
- It’s important to log feature values, encoded values, tensor values, and model predictions during model serving, to ensure that you do not have feature parity or model parity issues in your model pipelines.
- Model validation is a necessary step in model deployment and test environments can help reduce incidents.
- Beware of the features that you include in your model; they may introduce bias or cause unintended feedback loops.
- If you have different pipelines for model training and model serving, the team members working on the pipelines should be reviewing each other’s pull requests for ML feature implementations.
Machine learning is a discipline that can learn a lot from DevOps to reduce the occurrence of incidents, and MLOps should include processes for efficiently responding to issues with ML models in production.