Machine Learning models are developed, trained and deployed within more and more intelligent applications. Most of the modern applications are developed as cloud-native applications and deployed on Kubernetes. There are several ways to serve your ML model so that application components can call a prediction REST web service. Seldon Core is a framework that makes it easy to deploy ML models. The Seldon Core metric functionality is an important feature for operational model performance monitoring and can support you observing potential model drift.
In this blog post we are going to explore how to deploy and monitor ML models on OpenShift Kubernetes using Seldon Core, Prometheus and Grafana.
Topics
- Install Seldon Core, Prometheus and Grafana
- Deploy ML models, scrape and graph operational metrics
- Scrape and graph custom metrics
- Troubleshooting
Approach
After installing Seldon Core, Prometheus and Grafana on your OpenShift Kubernetes cluster, we will walk through several basic examples. The machine learning models used here are very simple and just examples for working with metrics.
A basic understanding of OpenShift Kubernetes, Operators, machine learning and git is most likely needed to follow along. Please clone the git repo openshift-tutorials on your computer and switch to the directory ml-monitoring.
Additionally, please ensure that you have privileges to deploy and configure operators on your OpenShift cluster. The examples should work fine on Red Hat CodeReady Containers as well.
Install Seldon Core, Prometheus and Grafana
Below are step-by-step instructions for installing and configuring Seldon Core, Prometheus and Grafana. Alternatively, you can install these components with Open Data Hub.
Login into OpenShift and create a new project:
oc new-project ml-mon
Code language: JavaScript (javascript)
Install the Operators
Install the Operators for Seldon Core, Promentheus and Grafana via the OperatorHub in OpenShift Console or using oc
CLI and the operator subscriptions in the operator
directory. E.g.
oc apply -k operator/
Sample output:
operatorgroup.operators.coreos.com/ml-mon created
subscription.operators.coreos.com/grafana-operator created
subscription.operators.coreos.com/prometheus created
subscription.operators.coreos.com/seldon-operator created
In OpenShift Console under Installed Operators
you should see the following:
Create a Prometheus instance and route
Now, create a Prometheus instance and route with by applying the following manifests:
oc apply -f prometheus-instance.yaml
oc apply -f prometheus-route.yaml
Code language: CSS (css)
Check if your Prometheus instance is running by navigating in the OpenShift Console to `Networking` -> `Routes` and clicking on the Prometheus URL.
Configure Grafana
Create a Grafana instance:
oc apply -f grafana-instance.yaml
Code language: CSS (css)
Your Grafana instance should be connected to you Prometheus instance as datascourde
Therefore, create a Grafana datasource for Prometheus:
oc apply -f grafana-prometheus-datasource.yaml
Code language: CSS (css)
Check the route and get the URL for the Grafana dashboard:
oc get routes -o name
Code language: JavaScript (javascript)
Sample output:
route.route.openshift.io/grafana-route
route.route.openshift.io/prometheus
Get the URL:
echo http://$(oc get route grafana-route -n ml-mon -o jsonpath='{.spec.host}')
Code language: PHP (php)
Sample output:
http://grafana-route-ml-mon.apps-crc.testing
Code language: JavaScript (javascript)
In case everything went fine, you should see the “Welcome to Grafana” page.
Deploy ML models, scrape and graph operational metrics
This this section we will use examples from SeldonIO to deploy ML models, scrape and graph operational metrics.
Ensure you are in the right namespace:
oc project ml-mon
Deploy the Seldon Core Grafana Dashboard
The Grafana Operator exposes an API for Dashboards. Let’s apply the Prediction Analytics dashboard:
oc apply -f prediction-analytics-seldon-core-1.2.2.yaml
Code language: CSS (css)
Open Grafana and have a look the Prediction Analytics dashboard. No data is available yet:
Explore Seldon operational metrics
Few steps are needed to see operational metrics:
- Deploy a ML model using Seldon
- Expose the prediction service
- Deploy a Prometheus Service Monitor
- Generate load for the prediction service and view the dashboard
Deploy a ML model using Seldon
We will use an example from Seldon Core:
oc apply -f https://raw.githubusercontent.com/SeldonIO/seldon-core/release-1.2.2/notebooks/resources/model_seldon_rest.yaml
Code language: JavaScript (javascript)
Wait until the pod is deployed:
oc get pods
Code language: JavaScript (javascript)
Sample output:
NAME READY STATUS RESTARTS AGE
...
rest-seldon-model-0-classifier-5594bd9d49-pld7s 2/2 Running 0 91m
...
Expose and test the prediction service
The deployment of the model created a service too.
Expose the created service so that we can test the prediction:
oc expose service rest-seldon-model
Note, Seldon created two services: rest-seldon-model
and rest-seldon-model-classifier
. We will use here the service rest-seldon-model
, because it points to the seldon engine and we have to use the seldon engine to make metrics available for Prometheus.
Test the prediction service:
curl -H "Content-Type: application/json" -d '{"data": {"ndarray":[[1.0, 2.0, 5.0]]}}' -X POST http://$(oc get route rest-seldon-model -o jsonpath='{.spec.host}')/api/v1.0/predictions
Code language: JavaScript (javascript)
Sample output:
{"data":{"names":["proba"],"ndarray":[[0.43782349911420193]]},"meta":{}}
Code language: JSON / JSON with Comments (json)
The prediction works and the result is 0.437.
Deploy a Prometheus Service Monitor
Next we will instruct Prometheus to gather Seldon-core metrics for the model. This is done with a Prometheus Service Monitor:
oc apply -f rest-seldon-model-servicemonitor.yaml
Code language: CSS (css)
The Service Monitor is going to find the service with the label seldon-app=rest-seldon-model
and scrape metrics from /prometheus
at port http
.
Here a snippet of the servicemonitor:
...
spec:
endpoints:
- interval: 30s
path: /prometheus
port: http
selector:
matchLabels:
seldon-app: rest-seldon-model
Generate load for the prediction service and view the dashboard
Generate some load for the prediction service to have metric data on the dashboard:
while true
do
curl -H "Content-Type: application/json" -d '{"data": {"ndarray":[[1.0, 2.0, 5.0]]}}' -X POST http://$(oc get route rest-seldon-model -o jsonpath='{.spec.host}')/api/v1.0/predictions
sleep 2
done
Code language: JavaScript (javascript)
The Grafana Prediction Analytics dashboard will start showing some data:
Deploy and monitor a Tensorflow model
Let us repeat the lab with a Tensorflow model from Seldon Core:
Deploy a ML model using Seldon
oc apply -f https://raw.githubusercontent.com/SeldonIO/seldon-core/release-1.2.2/notebooks/resources/model_tfserving_rest.yaml
Code language: JavaScript (javascript)
Wait until the pod is deployed:
oc get pods
Code language: JavaScript (javascript)
Sample output:
rest-tfserving-model-0-halfplustwo-7c6c67fcbc-q6rrk 2/2 Running 0 107s
Expose and test the prediction service
Expose the created service so that we can test the prediction.
oc expose service rest-tfserving-model
Note, Seldon created two services: rest-tfserving-model
and rest-tfserving-model-halfplustwo
. We will use here the service rest-tfserving-model
, because it point to the seldon engine so that we see metrics.
Test the prediction service:
curl -H "Content-Type: application/json" -d '{"instances": [1.0, 2.0, 5.0]}' -X POST http://$(oc get route rest-tfserving-model -o jsonpath='{.spec.host}')/v1/models/halfplustwo/:predict
Code language: JavaScript (javascript)
Sample output:
{
"predictions": [2.5, 3.0, 4.5
]
}
Code language: JSON / JSON with Comments (json)
The prediction works fine!
Deploy a Prometheus Service Monitor
Now we will instruct Prometheus to gather Seldon-core metrics for the model. This is done with a Prometheus Service Monitor:
oc apply -f rest-tfserving-model-servicemonitor.yaml
Code language: CSS (css)
The Service Monitor is going to find the service with the label rest-tfserving-model
and scrape metrics from /prometheus
at port http
. Here a snippet of the servicemonitor:
spec:
endpoints:
- interval: 30s
path: /prometheus
port: http
selector:
matchLabels:
seldon-app: rest-tfserving-model
Generate load on the service and view the dashboard
Next, generate some load to see data on the dashboard:
while true
do
curl -H "Content-Type: application/json" -d '{"instances": [1.0, 2.0, 5.0]}' -X POST http://$(oc get route rest-tfserving-model -o jsonpath='{.spec.host}')/v1/models/halfplustwo/:predict
sleep 2
done
Code language: JavaScript (javascript)
The Grafana Prediction Analytics dashboard will start showing the Tensorflow data. You might have to reload the dashboard.
Scrape and graph custom metrics
With custom metrics you can expose any metrics from your model. For example you cloud expose features and predictions for model drift monitoring.
Again, let’s repeat the lab with a model from Seldon Core.
Deploy the model and dashboard
Deploy the example model with custom metrics:
oc apply -f https://raw.githubusercontent.com/SeldonIO/seldon-core/v1.2.2/examples/models/custom_metrics/model_rest.yaml
Code language: JavaScript (javascript)
Deploy a custom dashboard:
oc apply -f custom-metrics-dashboard.yaml
Code language: CSS (css)
Open Grafana and have a look the Custom Metrics dashboard. No data is available yet:
Expose the service and test
oc expose service seldon-model-example
Test the prediction service:
curl -s -d '{"data": {"ndarray":[[1.0, 2.0, 5.0]]}}' -X POST http://$(oc get route seldon-model-example -o jsonpath='{.spec.host}')/api/v1.0/predictions -H "Content-Type: application/json"
Code language: JavaScript (javascript)
Sample output with custom metrics in “meta”:
{"data":{"names":["t:0","t:1","t:2"],"ndarray":[[1.0,2.0,5.0]]},"meta":{"metrics":[{"key":"mycounter","type":"COUNTER","value":1},{"key":"mygauge","type":"GAUGE","value":100},{"key":"mytimer","type":"TIMER","value":20.2}]}}
Code language: JSON / JSON with Comments (json)
Have a look at the meta data above.
Scrape custom metrics
Note, custom metrics are expose by the predictor (not the engine) at port 6000.
Therefore, add a service for port 6000:
oc apply -f seldon-model-example-classifier-metrics-service.yaml
Code language: CSS (css)
And add a service monitor for the custom metrics:
oc apply -f seldon-model-example-classifier-servicemonitor.yaml
Code language: CSS (css)
Create a bit of load:
for i in 1 2 3 4 5
do
curl -s -d '{"data": {"ndarray":[[1.0, 2.0, 5.0]]}}' -X POST http://$(oc get route seldon-model-example -o jsonpath='{.spec.host}')/api/v1.0/predictions -H "Content-Type: application/json"
sleep 1
done
Code language: JavaScript (javascript)
The Custom Metrics dashboard will (hopefully) show the data. If not, set the time range of the Grafana dashboard to Last 15 minutes
.
So, we saw operational and custom metrics in the Grafana Dashboards. You can now apply these concepts to your ML model serving.
Troubleshooting
Missing data?
In case your data is not showing up in Grafana, please check first in Prometheus if the metrics and data exist.
Internal error occurred: failed calling webhook “v1.mseldondeployment.kb.io”
Deinstalling the Seldon Operator leaves webhookconfiguration behind, which cause trouble when you deploy a seldon deployment for a newly deployed operator.
For example:
oc apply -f https://raw.githubusercontent.com/SeldonIO/seldon-core/release-1.2.2/notebooks/resources/model_seldon_rest.yaml
Code language: JavaScript (javascript)
Sample output:
Error from server (InternalError): error when creating "https://raw.githubusercontent.com/SeldonIO/seldon-core/release-1.2.2/notebooks/resources/model_seldon_rest.yaml": Internal error occurred: failed calling webhook "v1.mseldondeployment.kb.io": Post https://seldon-webhook-service.manuela-ml-workspace.svc:443/mutate-machinelearning-seldon-io-v1-seldondeployment?timeout=30s: service "seldon-webhook-service" not found
Code language: JavaScript (javascript)
A previous deployment of the Seldon Operator in manuela-ml-workspace
causes the trouble.
Let’s find and delete the WebhookConfiguration. E.g.,
oc get -o name MutatingWebhookConfiguration,ValidatingWebhookConfiguration -A | grep manuela
Code language: JavaScript (javascript)
Sample output:
mutatingwebhookconfiguration.admissionregistration.k8s.io/seldon-mutating-webhook-configuration-manuela-ml-workspace
validatingwebhookconfiguration.admissionregistration.k8s.io/seldon-validating-webhook-configuration-manuela-ml-workspace
Now delete …
oc delete mutatingwebhookconfiguration.admissionregistration.k8s.io/seldon-mutating-webhook-configuration-manuela-ml-workspace
oc delete validatingwebhookconfiguration.admissionregistration.k8s.io/seldon-validating-webhook-configuration-manuela-ml-workspace
Code language: JavaScript (javascript)