Introduction
Since my last exploration of the topic in mid-2021, cold disaster recovery for applications on Kubernetes has advanced considerably, both in its technical aspects and in its user experience. Notably, the Kubernetes operator that manages the technical infrastructure for cold disaster recovery has progressed from version 0.2.1 to 1.2.0. Among other benefits, this has streamlined the initial setup and reduced the steps necessary to initiate backup and recovery. Additionally, Red Hat customers now have a supported option available, which should give users the confidence to effectively meet the data protection requirements of their critical cloud-native applications.
The Red Hat portfolio covering the whole spectrum of disaster recovery scenarios has evolved as well. While this blog post concentrates on cold disaster recovery, there are also options for warm and hot disaster recovery. Since definitions abound online, Table 1 tries to encapsulate Kubernetes-specific definitions of these terms and how they relate to Red Hat solutions.
Type | Description | Subscriptions required | Further reading |
---|---|---|---|
Cold Disaster Recovery | Typically refers to scenarios where the infrastructure components are in place, but manual steps are required to restore the service, including moving data and deploying applications. | Red Hat OpenShift Kubernetes Engine, Red Hat OpenShift Container Platform, or Red Hat OpenShift Platform Plus, plus some form of compatible storage, for example delivered via Red Hat OpenShift Platform Plus, which includes OpenShift Data Foundation Essentials | https://docs.openshift.com/container-platform/4.12/backup_and_restore/index.html |
Warm Disaster Recovery | Sometimes referred to as Regional Disaster Recovery. Implies that a redundant set of infrastructure components is deployed and that application data is replicated asynchronously. Manual steps are required to trigger a failover. | Red Hat OpenShift Kubernetes Engine, Red Hat OpenShift Container Platform, or Red Hat OpenShift Platform Plus, plus Red Hat OpenShift Data Foundation Advanced; alternatively, Red Hat OpenShift Container Platform with Red Hat Advanced Cluster Management for Kubernetes and Red Hat OpenShift Data Foundation Advanced | https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-multisite-ramen.html https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/rdr-solution |
Hot Disaster Recovery | Sometimes referred to as Metro Disaster Recovery. Effectively means a fully functional, redundant copy of the production environment elsewhere, with synchronous application data replication. | Red Hat OpenShift Platform Plus with OpenShift Data Foundation Advanced; alternatively, Red Hat OpenShift Container Platform with Red Hat Advanced Cluster Management for Kubernetes and Red Hat OpenShift Data Foundation Advanced | https://red-hat-storage.github.io/ocs-training/training/ocs4/odf411-metro-ramen.html https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.12/html/configuring_openshift_data_foundation_disaster_recovery_for_openshift_workloads/metro-dr-solution |
The objective of this blog post is to provide a simple, easy-to-understand starting point for anyone seeking guidance on backing up their Kubernetes applications, while acknowledging that it is not a complete guide to all things data protection.
Scenario
In this blog post, I will reference various Red Hat specific technologies such as OpenShift and the OpenShift API for Data Protection (OADP). Since these technologies are built on open source projects, their applicability extends to other Kubernetes distributions where similar APIs, specifically Velero, are readily available.
The steps that follow assume the presence of certain components. Rather than providing detailed setup instructions, the focus will be on using them:
- Two OpenShift Container Platform 4.12 clusters with some form of local storage solution
- A central and highly available container registry, in this case, Red Hat Quay, which is connected to both OpenShift clusters
- A central MinIO object storage cluster with an endpoint accessible to both OpenShift clusters
- One bucket along with the necessary credentials to read from and write data to the storage
In this scenario, we will use Velero to back up OpenShift and Kubernetes resources. These backups will be stored as archives in the MinIO object storage bucket. Additionally, since we are leveraging NFS for storage in this example, backups of Persistent Volumes (PVs) will also be stored in the object storage using Restic.
Side note: Another approach for backing up PVs is through snapshots, which can be managed using the native snapshot API of a cloud provider (if available) or via Container Storage Interface (CSI) snapshots. With this consideration, the high-level architecture depicted in Figure 1 emerges.
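Staying with that side note for a moment: with the DataProtectionApplication resource introduced later in this post, switching from file system backups to CSI snapshots is largely a configuration change. A hedged excerpt, assuming a CSI-capable storage driver and a VolumeSnapshotClass labeled velero.io/csi-volumesnapshot-class: "true" exist in the cluster:
spec:
  configuration:
    velero:
      defaultPlugins:
      - openshift
      - aws
      - csi # enables the Velero CSI plugin so PVs are backed up via CSI snapshots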
To conveniently install and configure various components in two Kubernetes environments from a client computer, we will utilize kubectl contexts. The primary site will be referred to as “primary” and the failover site as “failover,” although these names are arbitrary.
$ oc login --token=sha256~d2VyIGRhcyBsaWVzdCBpc3QgZG9vZgo --server=https://api.primary.example.com:6443
$ oc config rename-context $(oc config current-context) primary
$ oc --context primary get nodes
NAME STATUS ROLES AGE VERSION
compute-0 Ready worker 2y333d v1.25.8+37a9a08
compute-1 Ready worker 2y333d v1.25.8+37a9a08
compute-2 Ready worker 2y333d v1.25.8+37a9a08
compute-3 Ready worker 2y333d v1.25.8+37a9a08
master-0 Ready master,worker 2y333d v1.25.8+37a9a08
master-1 Ready master,worker 2y333d v1.25.8+37a9a08
master-2 Ready master,worker 2y333d v1.25.8+37a9a08
$ oc login --token=sha256~Zm9sbG93IHRoZSB3aGl0ZSByYWJiaXQK --server=https://api.failover.example.com:6443
$ oc config rename-context $(oc config current-context) failover
$ oc --context failover get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-166-125.eu-west-1.compute.internal Ready control-plane,master,worker 30m v1.25.7+eab9cc9
Installing the OADP Operator
To back up and restore applications running on the OpenShift Container Platform, one option is to use the OADP operator. OADP handles the deployment and management of the necessary components for implementing disaster recovery. To deploy the operator, please refer to the documentation for detailed instructions, and remember to perform these steps on both the primary and the failover site.
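For orientation, here is a minimal OLM-based installation sketch; the channel and package names are assumptions and should be verified against the operator catalog of your cluster:
$ cat oadp-operator.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-adp
---
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: oadp-operator-group
  namespace: openshift-adp
spec:
  targetNamespaces:
  - openshift-adp
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: redhat-oadp-operator
  namespace: openshift-adp
spec:
  channel: stable-1.2 # assumption: verify the available channels in your catalog
  name: redhat-oadp-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
$ for i in primary failover; \
do oc --context ${i} apply -f oadp-operator.yaml; \
done
Finally, obtain the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY that are used to access the MinIO object storage bucket and store them in a Kubernetes secret object: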
$ cat << EOF > ./credentials-velero
[default]
aws_access_key_id=<AWS_ACCESS_KEY_ID>
aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>
EOF
For simplicity’s sake, we will create the same secret in both clusters. However, in a real-world scenario, it would be better to use separate credentials per cluster to access the object storage; a sketch of that variant follows the command below.
$ for i in primary failover; \
do oc --context ${i} create secret generic cloud-credentials \
-n openshift-adp --from-file cloud=credentials-velero; \
done
secret/cloud-credentials created
secret/cloud-credentials created
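With separate credentials, the same loop could reference per-cluster files instead. A sketch, assuming hypothetical per-cluster credential files credentials-velero-primary and credentials-velero-failover have been prepared:
$ for i in primary failover; \
do oc --context ${i} create secret generic cloud-credentials \
-n openshift-adp --from-file cloud=credentials-velero-${i}; \
done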
Configuring the Data Protection Application
The next step is to instruct the OADP operator to deploy the components necessary for backup and restore operations in both sites. This is achieved with the DataProtectionApplication API resource. In this specific example, it deploys a single Velero instance and a DaemonSet running the Velero Node Agent, which hosts the Restic library responsible for file system backups. Note that the following configuration uses MinIO as the backup location; refer to the documentation for instructions on adjusting the DataProtectionApplication resource for other S3-compatible backup storage providers.
$ cat DataProtectionApplication.yaml
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
  name: velero-sample
  namespace: openshift-adp
spec:
  backupLocations:
  - velero:
      config:
        insecureSkipTLSVerify: "true"
        profile: default
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://minio.example.com
      credential:
        key: cloud
        name: cloud-credentials
      default: true
      objectStorage:
        bucket: oadp-backup
        prefix: velero
      provider: aws
  configuration:
    restic:
      enable: true
    velero:
      defaultPlugins:
      - openshift
      - aws
      podConfig:
        resourceAllocations:
          requests:
            cpu: 500m
            memory: 256Mi
As mentioned earlier, OADP is responsible for both backup and restore operations. Therefore, we will deploy the DataProtectionApplication in both environments.
$ for i in primary failover; \
do oc --context ${i} create -f DataProtectionApplication.yaml; \
done
dataprotectionapplication.oadp.openshift.io/velero-sample created
dataprotectionapplication.oadp.openshift.io/velero-sample created
If everything goes smoothly, we should expect to see something similar to the following.
$ for i in primary failover; \
do oc --context ${i} -n openshift-adp get pods; \
echo; done
NAME READY STATUS RESTARTS AGE
node-agent-6kbp7 1/1 Running 0 2m54s
node-agent-bptcc 1/1 Running 0 2m55s
node-agent-cwvgc 1/1 Running 0 2m54s
node-agent-jglq9 1/1 Running 0 2m54s
node-agent-lsr9g 1/1 Running 0 2m54s
node-agent-th4jk 1/1 Running 0 2m54s
node-agent-zcvbc 1/1 Running 0 2m55s
openshift-adp-controller-manager-66cf6958d5-l5s68 1/1 Running 0 169m
velero-647b46bb9b-2c7lb 1/1 Running 0 2m55s
NAME READY STATUS RESTARTS AGE
node-agent-4tj5d 1/1 Running 0 2m55s
openshift-adp-controller-manager-5d6d56f89f-bvqn4 1/1 Running 0 3m38s
velero-647b46bb9b-gkww6 1/1 Running 0 2m55s
To verify that Velero has access to the S3 repository, we check the BackupStorageLocation API resource.
$ for i in primary failover; \
do oc --context ${i} -n openshift-adp get backupstoragelocation; \
echo; done
NAME PHASE LAST VALIDATED AGE DEFAULT
velero-sample-1 Available 14s 4m47s true
NAME PHASE LAST VALIDATED AGE DEFAULT
velero-sample-1 Available 8s 4m47s true
In real-world scenarios, it is common to use multiple backup locations, granular credentials, and different Velero plugins depending on the environment. All of these requirements can be accommodated in a straightforward way via the previously mentioned APIs; a sketch of a second backup location follows.
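For illustration, a second entry could be appended to spec.backupLocations in the DataProtectionApplication; the endpoint, bucket, and credentials secret in this sketch are hypothetical:
- velero:
    config:
      insecureSkipTLSVerify: "true"
      profile: default
      region: minio
      s3ForcePathStyle: "true"
      s3Url: http://minio-dr.example.com # hypothetical second endpoint
    credential:
      key: cloud
      name: cloud-credentials-dr # hypothetical second secret
    objectStorage:
      bucket: oadp-backup-dr
      prefix: velero
    provider: aws
Individual backups can then target the secondary location by name via spec.storageLocation in the Backup resource.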
Application deployment
To demonstrate the disaster recovery capabilities of OADP, we need two essential elements: a stateful application and a catastrophic event that renders the service unrecoverable without a backup.
Here is the most basic example of a stateful application that I can think of: a single pod that writes data to a file residing on a Persistent Volume (PV). In the event of a disaster or data loss, having a backup of the PV becomes crucial for successfully recovering and restoring the service.
$ oc --context primary new-project persistent-database-prod
$ oc --context primary -n persistent-database-prod create deployment \
persistent-database-prod --replicas 1 \
--image=registry.access.redhat.com/ubi8/ubi-minimal:8.8-1014 \
-- /bin/bash -c "sleep infinity"
deployment.apps/persistent-database-prod created
$ oc --context primary -n persistent-database-prod set volumes \
deploy/persistent-database-prod --add -t pvc --claim-size 1G --mount-path=/data
info: Generated volume name: volume-524mq
deployment.apps/persistent-database-prod volume updated
$ oc --context primary -n persistent-database-prod get po,pvc
NAME READY STATUS RESTARTS AGE
pod/persistent-database-prod-6d58c68dbc-pjtn9 1/1 Running 0 36s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/pvc-ctcdk Bound pvc-a8b3c1e3-4bfd-466d-8bca-b2061b49eb52 1G RWO managed-nfs-storage 36s
To make it easy to verify a successful application restore later, writing a timestamp to a file on the Persistent Volume should suffice.
$ oc --context primary -n persistent-database-prod exec -it persistent-database-prod-6d58c68dbc-pjtn9 -- /bin/bash -c "date > /data/criticaldata.txt"
$ oc --context primary -n persistent-database-prod exec -it persistent-database-prod-6d58c68dbc-pjtn9 -- /bin/bash -c "cat /data/criticaldata.txt"
Thu Jul 6 14:41:04 UTC 2023
Application backup
To initiate an application backup using OADP, we utilize the Backup API resource. As the application is currently running on the primary site, the Backup resource will be deployed there.
$ cat backup.yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: backup-persistent-database-prod
  namespace: openshift-adp
spec:
  includedNamespaces:
  - persistent-database-prod
  defaultVolumesToFsBackup: true
  ttl: 720h0m0s
$ oc --context primary create -f backup.yaml
backup.velero.io/backup-persistent-database-prod created
After initially being in an “InProgress” state, the backup should report a status of “Completed” after a short while.
$ oc --context primary -n openshift-adp get backup \
backup-persistent-database-prod -o jsonpath='{.status.phase}'; echo
Completed
Alternatively, one can check the logs of the backup controller to monitor the progress.
$ oc --context primary -n openshift-adp logs \
$(oc -n openshift-adp get pods -l app.kubernetes.io/component=server -o name) \
| grep "Backup completed"
time="2023-07-06T14:41:40Z" level=info msg="Backup completed" backup=openshift-adp/backup-persistent-database-prod logSource="/remote-source/velero/app/pkg/controller/backup_controller.go:780"
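A more condensed view is available through the Velero CLI that ships inside the Velero pod; a hedged example, noting that the binary path inside the container may vary between OADP versions:
$ oc --context primary -n openshift-adp exec deployment/velero -c velero -- \
./velero backup describe backup-persistent-database-prod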
Yet another way to verify a successful backup is to check the object storage itself. It should now contain static files such as API resources and logs, as well as the content of the Persistent Volume.
$ aws --endpoint-url=http://minio.example.com s3 ls \
s3://oadp-backup/velero/backups/backup-persistent-database-prod/
2023-07-06 16:41:40 29 backup-persistent-database-prod-csi-volumesnapshotclasses.json.gz
2023-07-06 16:41:40 29 backup-persistent-database-prod-csi-volumesnapshotcontents.json.gz
2023-07-06 16:41:40 29 backup-persistent-database-prod-csi-volumesnapshots.json.gz
2023-07-06 16:41:40 27 backup-persistent-database-prod-itemoperations.json.gz
2023-07-06 16:41:40 11365 backup-persistent-database-prod-logs.gz
2023-07-06 16:41:40 940 backup-persistent-database-prod-podvolumebackups.json.gz
2023-07-06 16:41:40 604 backup-persistent-database-prod-resource-list.json.gz
2023-07-06 16:41:40 49 backup-persistent-database-prod-results.gz
2023-07-06 16:41:40 29 backup-persistent-database-prod-volumesnapshots.json.gz
2023-07-06 16:41:40 83097 backup-persistent-database-prod.tar.gz
2023-07-06 16:41:40 2707 velero-backup.json
$ aws --endpoint-url=http://minio.example.com s3 ls \
s3://oadp-backup/velero/restic/persistent-database-prod/
PRE data/
PRE index/
PRE keys/
PRE snapshots/
2023-07-06 16:41:37 155 config
Disaster simulation
After confirming that the application is still functioning in the primary site, deliberately deleting the namespace will serve as a simulated disaster. It is worth noting that since the backup is stored externally from the cluster, recovering from a complete cluster outage would also be a feasible scenario.
$ oc --context primary -n persistent-database-prod get po,pvc
NAME READY STATUS RESTARTS AGE
pod/persistent-database-prod-6d58c68dbc-pjtn9 1/1 Running 0 4m55s
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
persistentvolumeclaim/pvc-ctcdk Bound pvc-a8b3c1e3-4bfd-466d-8bca-b2061b49eb52 1G RWO managed-nfs-storage 4m55s
$ oc --context primary delete project persistent-database-prod
project.project.openshift.io "persistent-database-prod" deleted
$ oc --context primary -n persistent-database-prod get all
No resources found in persistent-database-prod namespace.
Application restore
After confirming the outage and deciding to initiate the restore process on the failover site, the Restore API resource is used to instruct OADP to trigger the restoration process.
$ cat restore.yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-persistent-database-prod
  namespace: openshift-adp
spec:
  backupName: backup-persistent-database-prod
  restorePVs: true
$ oc --context failover create -f restore.yaml
restore.velero.io/restore-persistent-database-prod created
Once again, querying the API can help retrieve the status of the restore process.
$ oc --context failover -n openshift-adp get restore \
restore-persistent-database-prod -o jsonpath='{.status.phase}'; echo
Completed
As with the backup procedure, checking the logs of the Velero server pod provides valuable insight into the status of the restore operation.
$ oc --context failover -n openshift-adp logs \
$(oc --context failover -n openshift-adp get pods \
-l app.kubernetes.io/component=server -o name) | grep "restore completed"
time="2023-07-06T14:45:54Z" level=info msg="restore completed" logSource="/remote-source/velero/app/pkg/controller/restore_controller.go:513" restore=openshift-adp/restore-persistent-database-prod
In addition, the log files associated with restore operations are persisted in S3, allowing them to be accessed and consumed even beyond the lifecycle of the Velero pod.
$ aws --endpoint-url=http://minio.example.com s3 ls \
s3://oadp-backup/velero/restores/restore-persistent-database-prod/
2023-07-06 16:45:55 27 restore-restore-persistent-database-prod-itemoperations.json.gz
2023-07-06 16:45:54 11369 restore-restore-persistent-database-prod-logs.gz
2023-07-06 16:45:54 449 restore-restore-persistent-database-prod-resource-list.json.gz
2023-07-06 16:45:54 255 restore-restore-persistent-database-prod-results.gz
Upon reviewing the output, one would expect the application to be operational, and fortunately, that is indeed the case.
$ oc --context failover -n persistent-database-prod get po,pv
NAME READY STATUS RESTARTS AGE
pod/persistent-database-prod-6d58c68dbc-pjtn9 1/1 Running 0 2m26s
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
persistentvolume/pvc-a8b32d75-6c18-4c57-aa34-7f10b64c4fa7 1Gi RWO Delete Bound persistent-database-prod/pvc-ctcdk managed-nfs-storage 2m23s
The same applies to the data that was written to the application at the time of the backup.
$ oc --context failover -n persistent-database-prod exec -it persistent-database-prod-6d58c68dbc-pjtn9 -- /bin/bash -c "cat /data/criticaldata.txt"
Thu Jul 6 14:41:04 UTC 2023
Regular application backups
Manually triggering application backups is susceptible to human error, time-consuming, and prone to inconsistencies, thereby increasing the risk of data loss. This problem becomes more pronounced as the number of applications grows. To tackle this issue, one can leverage the Schedule API resource, which ensures regular, automated creation of backups.
$ cat backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: backup-persistent-database-prod-schedule
  namespace: openshift-adp
spec:
  schedule: '*/10 * * * *'
  template:
    includedNamespaces:
    - persistent-database-prod
    defaultVolumesToFsBackup: true
    ttl: 1h0m0s
$ oc --context primary create -f backup-schedule.yaml
schedule.velero.io/backup-persistent-database-prod-schedule created
$ oc --context primary get schedule -n openshift-adp backup-persistent-database-prod-schedule
NAME STATUS SCHEDULE LASTBACKUP AGE PAUSED
backup-persistent-database-prod-schedule Enabled */10 * * * * 12s
When using schedules, each backup is assigned a unique name, which must be referenced in the Restore API resource when needed, as sketched after the listing below. The time to live (TTL) specified in the Schedule API resource determines how long backups are retained; Velero automatically cleans up expired backups, ensuring efficient management of backup storage.
$ oc --context primary get backup -n openshift-adp
NAME AGE
backup-persistent-database-prod 13m
backup-persistent-database-prod-schedule-20230706145009 4m37s
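To restore from one of these scheduled backups, point a Restore resource at the generated backup name. A minimal sketch based on the listing above:
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-from-schedule
  namespace: openshift-adp
spec:
  backupName: backup-persistent-database-prod-schedule-20230706145009
  restorePVs: true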
Summary and Outlook
This blog post hopefully delivered on its promise by offering a straightforward starting point for cold disaster recovery on Kubernetes. It showcased a backup procedure, a simulated disaster scenario, and a successful recovery.
I acknowledge that this post may not cover all possible scenarios or address every question that could arise. Some open questions include:
- Restoring applications deployed and managed by a Kubernetes Operator which may not be included in the backup itself.
- Integration of the above concepts within the principles of GitOps.
- Handling large-scale application restoration and testing in a real disaster scenario.
- Other aspects of disaster recovery, such as processes, documentation, recovery time objective (RTO) setting, and capacity planning.
Furthermore, while we focused on backing up and restoring applications, OADP can also be used to back up and restore entire clusters. For a comprehensive understanding of backing up and restoring hub clusters, you can refer to the following blog post.
Finally, OADP is not the only option for Kubernetes backups. One of Red Hat’s strengths lies in engaging with partners, and there are multiple alternatives for disaster recovery on OpenShift. Kasten K10 by Veeam, for example, is a leading provider in this field and well integrated with OpenShift. Vendors like Portworx and Trilio offer their own solutions as well, and it’s worth exploring the numerous options available through the Red Hat software catalog.