Introduction
Back in May 2020, Red Hat and Amazon Web Services announced a jointly supported, fully managed Red Hat OpenShift offering that is natively integrated into AWS. Since a follow-up announcement in November 2020, customers have had the opportunity to get their hands on the preview version of Red Hat OpenShift Service on AWS (ROSA). And as of March 24, 2021, the service is Generally Available! If you don’t want to get your hands dirty yourself, keep reading, as I’ll report a few of the findings I gathered while working with ROSA.
First, let’s get the most obvious question out of the way: Why should I care? We already have Red Hat OpenShift Dedicated, which is a fully managed Red Hat OpenShift service on AWS, don’t we? That is true; however, with ROSA:
- You benefit from a native AWS experience, meaning that you can consume OpenShift directly from the AWS console.
- In terms of billing, you do not need to have a business relationship with Red Hat, as you receive a single invoice from AWS for the OpenShift and AWS consumption.
- Both Red Hat and AWS are responsible for managing and evolving the service, which will lead to an exciting roadmap for the offering.
You can find a number of other highlights of the offering in the official announcements. [1, 2] Enough said; let’s get our hands dirty and look at the current state of the product. In particular, we will:
- provision a multi zone ROSA cluster
- test the limits of the managed service
- try the cluster autoscaler and explore different MachineSets
- understand how updates work
- consume an AWS service directly from within ROSA
In this context, it is important to familiarize yourself with the ROSA service definition as well as the responsibility assignment matrix. These documents will help you understand the service and the shared responsibility model, which is absolutely key before thinking about running production workloads on ROSA.
Installing a ROSA cluster
To start the installation process, you are (quite obviously) required to have an AWS account. If you don’t have one, learn how to create one here. Furthermore, an AWS user with programmatic access is required in order to get a working AWS CLI. Finally, you need to download and install the ROSA command line tool, which is used to install, configure and, of course, uninstall ROSA clusters.
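If the AWS CLI is not configured yet, the interactive `aws configure` command is the quickest way to set it up (the values below are placeholders for your own credentials):

$ aws configure
AWS Access Key ID [None]: <your-access-key-id>
AWS Secret Access Key [None]: <your-secret-access-key>
Default region name [None]: us-east-1
Default output format [None]: table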
One thing to keep in mind is the pricing model. There is a base fee for every cluster (roughly $0.03 per cluster per hour) as well as an individual fee based on the number of virtual CPUs available to the worker nodes. An additional cost factor is the AWS infrastructure required to run the actual OpenShift bits (EC2 instances, load balancers, network transfer charges, etc.). The minimum configuration that is currently supported consists of the following:
- Three control plane nodes
- Two infrastructure nodes
- Two compute nodes
The approximate cost of running this configuration is $1.30 (US) per hour, for on-demand EC2 instances. That being said, let’s get started with some hands-on work!
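As a rough back-of-the-envelope estimate, assuming the cluster runs around the clock: $1.30/hour × ~730 hours/month ≈ $950 per month for this minimal footprint, before storage, data transfer, or any scale-out beyond the baseline.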
Verify the prerequisites
Before proceeding with the installation, it is recommended to double-check that the prerequisites mentioned at the beginning of this section have been met. First, let’s go ahead and see whether our AWS CLI is configured correctly.
$ aws ec2 describe-regions
---------------------------------------------------------------------------------
| DescribeRegions |
+-------------------------------------------------------------------------------+
|| Regions ||
|+-----------------------------------+-----------------------+-----------------+|
|| Endpoint | OptInStatus | RegionName ||
|+-----------------------------------+-----------------------+-----------------+|
|| ec2.eu-north-1.amazonaws.com | opt-in-not-required | eu-north-1 ||
|| ec2.ap-south-1.amazonaws.com | opt-in-not-required | ap-south-1 ||
|| ec2.eu-west-3.amazonaws.com | opt-in-not-required | eu-west-3 ||
|| ec2.eu-west-2.amazonaws.com | opt-in-not-required | eu-west-2 ||
|| ec2.eu-west-1.amazonaws.com | opt-in-not-required | eu-west-1 ||
|| ec2.ap-northeast-2.amazonaws.com | opt-in-not-required | ap-northeast-2 ||
|| ec2.ap-northeast-1.amazonaws.com | opt-in-not-required | ap-northeast-1 ||
|| ec2.sa-east-1.amazonaws.com | opt-in-not-required | sa-east-1 ||
|| ec2.ca-central-1.amazonaws.com | opt-in-not-required | ca-central-1 ||
|| ec2.ap-southeast-1.amazonaws.com | opt-in-not-required | ap-southeast-1 ||
|| ec2.ap-southeast-2.amazonaws.com | opt-in-not-required | ap-southeast-2 ||
|| ec2.eu-central-1.amazonaws.com | opt-in-not-required | eu-central-1 ||
|| ec2.us-east-1.amazonaws.com | opt-in-not-required | us-east-1 ||
|| ec2.us-east-2.amazonaws.com | opt-in-not-required | us-east-2 ||
|| ec2.us-west-1.amazonaws.com | opt-in-not-required | us-west-1 ||
|| ec2.us-west-2.amazonaws.com | opt-in-not-required | us-west-2 ||
|+-----------------------------------+-----------------------+-----------------+|
Given the above output, we can safely assume that our AWS CLI is configured correctly and that we are able to query the AWS API. If this were not the case, the command would return one or more error messages indicating missing permissions.
The next step is to do the same with the ROSA CLI. To save some time, let’s verify that the command line tool is installed and that the required AWS permissions and quota settings are sufficient to deploy ROSA:
$ rosa verify permissions
INFO: Validating SCP policies...
INFO: AWS SCP policies ok
$ rosa verify quota
INFO: Validating AWS quota...
INFO: AWS quota ok
If you want to learn more about the required AWS service quotas, please refer to the Red Hat documentation.
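If you prefer to check an individual quota yourself, the AWS Service Quotas API can be queried directly. As a hedged example, the following should return the limit for running on-demand standard EC2 instances (double-check the quota code in the Service Quotas console, as codes may change):

$ aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-1216C47A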
Next, we need to acquire an offline access token for our Red Hat account. This token is used to associate the ROSA cluster with your Red Hat account, thus allowing the joint support model between Red Hat and AWS to work.
$ rosa login
To login to your Red Hat account, get an offline access token at https://cloud.redhat.com/openshift/token/rosa
? Copy the token and paste it here: ****
Lastly, we can use the ROSA CLI to double check that the correct AWS and Red Hat accounts are being used to provision the cluster.
$ rosa whoami
AWS Account ID: 9876543210
AWS Default Region: us-east-1
AWS ARN: arn:aws:iam::9876543210:user/redhat-rosa-poc
OCM API: https://api.openshift.com
OCM Account ID: 03749d0f128a77e8b451490a4a1
OCM Account Name: Christian Koep
OCM Account Username: my-redhat-account
OCM Account Email: [email protected]
OCM Organization ID: beae9abd57d4b4aaf27d9ce093d
OCM Organization Name: Red Hat
OCM Organization External ID: 1234567
Starting a cluster installation
To conduct the actual installation of a ROSA cluster, the ROSA CLI first makes sure that the AWS environment is initialized. Concretely, this means that an AWS CloudFormation stack is created. At the core of this stack is the creation of an IAM user called ‘osdCcsAdmin’:
$ rosa init
INFO: Logged in as 'my-redhat-account' on 'https://api.openshift.com'
INFO: Validating AWS credentials...
INFO: AWS credentials are valid!
INFO: Validating SCP policies...
INFO: AWS SCP policies ok
INFO: Validating AWS quota...
INFO: AWS quota ok
INFO: Ensuring cluster administrator user 'osdCcsAdmin'...
INFO: Admin user 'osdCcsAdmin' created successfully!
INFO: Validating SCP policies for 'osdCcsAdmin'...
INFO: AWS SCP policies ok
INFO: Validating cluster creation...
INFO: Cluster creation valid
INFO: Verifying whether OpenShift command-line tool is available...
INFO: Current OpenShift Client Version: openshift-clients-4.6.0-202006250705.p0-164-gffd683609
This user will be used to provision all the infrastructure bits in AWS that make up the OpenShift cluster. As you can see from the above output, a few additional checks are conducted (e.g., whether the OpenShift command-line tools are installed).
With that out of the way, we can go ahead and spin up our cluster. The goal is to get access to a fully managed, production-ready and highly scalable OpenShift cluster with pay-as-you-go pricing. The following command will do just that:
$ rosa create cluster \
--cluster-name ckoep-prod \
--multi-az \
--region us-east-1 \
--version 4.6.17 \
--channel-group stable \
--enable-autoscaling \
--min-replicas 3 \
--max-replicas 6 \
--watch
Note that the default installation uses neither multiple availability zones nor cluster autoscaling. Those features have to be explicitly enabled via the respective command line flags (`--multi-az` and `--enable-autoscaling`).
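Also note that the version passed via `--version` has to be one that ROSA currently offers for installation. If in doubt, the CLI can list the available candidates:

$ rosa list versions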
After about 25 minutes, the command line interface reports that the installation has been completed successfully. It also shares some information on how to actually access the cluster:
time="2021-02-23T14:01:01Z" level=info msg="Install complete!"
time="2021-02-23T14:01:01Z" level=info msg="To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/output/auth/kubeconfig'"
time="2021-02-23T14:01:01Z" level=info msg="Access the OpenShift web-console here: https://console-openshift-console.apps.ckoep-prod.mxd9.p1.openshiftapps.com"
time="2021-02-23T14:01:01Z" level=info msg="Time elapsed: 26m41s"
INFO: Cluster 'ckoep-prod' is now ready
time="2021-02-23T14:01:02Z" level=info msg="install completed successfully" installID=qb2dgff4
Name: ckoep-prod
OpenShift Version: 4.6.17
DNS: ckoep-prod.mxd9.p1.openshiftapps.com
ID: 1j1bbj9csic0c67utdjmdhrm54kp4655
External ID: a79267cc-59eb-4ddc-8a70-9bc6cb091efb
AWS Account: 9876543210
API URL: https://api.ckoep-prod.mxd9.p1.openshiftapps.com:6443
Console URL: https://console-openshift-console.apps.ckoep-prod.mxd9.p1.openshiftapps.com
Nodes: Master: 3, Infra: 3, Compute (Autoscaled): 3-6
Region: us-east-1
Multi-AZ: true
State: ready
Channel Group: stable
Private: No
Created: Feb 23 2021 13:31:58 UTC
Details Page: https://cloud.redhat.com/openshift/details/1j1bbj9csic0c67utdjmdhrm54kp4655
By the way, the same information, including installation logs, links to the OpenShift console, configuration options, etc., is available in the Red Hat OpenShift Cluster Manager.
As mentioned before, the goal is to have a production-grade cluster available. In this context, we need to connect a reliable and secure identity provider to OpenShift in order to grant users access to the environment. In this case, we will use Google’s OpenID Connect integration.
$ rosa create idp \
--cluster=ckoep-prod \
--type=google \
--name=redhatsso \
--mapping-method=claim \
--client-id='foo-bar.apps.googleusercontent.com' \
--client-secret='nono' \
--hosted-domain='redhat.com'
INFO: Configuring IDP for cluster 'ckoep-prod'
INFO: Identity Provider 'redhatsso' has been created.
It will take up to 1 minute for this configuration to be enabled.
To add cluster administrators, see 'rosa create user --help'.
To login into the console, open https://console-openshift-console.apps.ckoep-prod.mxd9.p1.openshiftapps.com and click on redhatsso.
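With the identity provider in place, individual users can be granted elevated privileges directly from the CLI; a quick sketch (the user name is illustrative):

$ rosa grant user dedicated-admin \
    --user=alice@redhat.com \
    --cluster=ckoep-prod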
Again, the above (and much more) could have been achieved via the Cluster Manager as well. And that’s it! We can now start deploying actual workloads.
Testing the limits of the managed service
One of the first things you might notice is the fact that you may assign cluster users to two administrative groups:
- dedicated-admins: Grants standard administrative privileges for OpenShift Dedicated. Users can perform administrative actions listed in the documentation.
- cluster-admins: Gives users full administrative access to the cluster. This is the highest level of privilege available to users. It should be granted with extreme care, because it is possible with this level of access to get the cluster into an unsupportable state.
Wait, so you’re telling me that I can get administrative access, even to the control plane, and still operate in the realm of a fully managed service? Let’s try it and see what happens.
$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-130-97.ec2.internal Ready master 40m v1.19.0+e405995 10.0.130.97 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
ip-10-0-149-115.ec2.internal Ready worker 32m v1.19.0+e405995 10.0.149.115 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
ip-10-0-155-231.ec2.internal Ready infra,worker 18m v1.19.0+e405995 10.0.155.231 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
ip-10-0-162-163.ec2.internal Ready infra,worker 18m v1.19.0+e405995 10.0.162.163 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
ip-10-0-170-93.ec2.internal Ready worker 31m v1.19.0+e405995 10.0.170.93 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
ip-10-0-182-6.ec2.internal Ready master 38m v1.19.0+e405995 10.0.182.6 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
ip-10-0-198-101.ec2.internal Ready worker 32m v1.19.0+e405995 10.0.198.101 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
ip-10-0-211-176.ec2.internal Ready master 38m v1.19.0+e405995 10.0.211.176 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
ip-10-0-217-158.ec2.internal Ready infra,worker 18m v1.19.0+e405995 10.0.217.158 <none> Red Hat Enterprise Linux CoreOS 46.82.202102051640-0 (Ootpa) 4.18.0-193.41.1.el8_2.x86_64 cri-o://1.19.1-7.rhaos4.6.git6377f68.el8
Now, you might ask: what happens if someone were to “accidentally” remove, say, a control plane node?
$ oc delete node ip-10-0-182-6.ec2.internal
Error from server (Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support): admission webhook "regular-user-validation.managed.openshift.io" denied the request: Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support
Thankfully, it fails, regardless of the user’s OpenShift permissions. The reason is that the Red Hat Site Reliability Engineering (SRE) team built a set of admission webhooks that are designed to prevent harmful actions against OpenShift clusters.
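You can inspect these guardrails yourself: the webhook named in the error message above is registered among the cluster’s validating webhook configurations (exact resource names may differ between versions):

$ oc get validatingwebhookconfigurations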
That being said, please don’t forget that you also have full access to every single AWS resource that makes up the cluster (it is your AWS account after all). This means that you can still get in trouble by, for example, manually removing every EC2 instance (and, as a result, the SLA of 99.95% uptime no longer applies).
Autoscaling
One of the benefits of ROSA is its cost model. In short, you only pay for what you actually use. To maximize the value of this concept, it makes sense to leverage autoscaling. In ROSA, autoscaling is available both at the Pod level, via the horizontal pod autoscaler (HPA), and at the cluster (or node) level, via the cluster autoscaler.
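Incidentally, the node-level autoscaling bounds we chose at install time are not set in stone. Assuming the default machine pool, a hedged sketch for adjusting them later via the ROSA CLI looks like this:

$ rosa edit machinepool default \
    --cluster=ckoep-prod \
    --min-replicas=3 \
    --max-replicas=6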
To verify that this works as expected, we are going to deploy a sample application that consumes a fixed amount of CPU and memory resources. For this, we are going to use a project that a colleague of mine maintains, `docker-stress`.
$ oc new-project ckoep-autoscaling
$ oc new-app https://github.com/iboernig/docker-stress.git
--> Found container image 33c4a62 (5 days old) from Docker Hub for "fedora:latest"
* An image stream tag will be created as "fedora:latest" that will track the source image
* A Docker build using source code from https://github.com/iboernig/docker-stress.git will be created
* The resulting image will be pushed to image stream tag "docker-stress:latest"
* Every time "fedora:latest" changes a new build will be triggered
--> Creating resources ...
imagestream.image.openshift.io "fedora" created
imagestream.image.openshift.io "docker-stress" created
buildconfig.build.openshift.io "docker-stress" created
deployment.apps "docker-stress" created
--> Success
Build scheduled, use 'oc logs -f buildconfig/docker-stress' to track its progress.
Run 'oc status' to view your app.
$ oc get pods
NAME READY STATUS RESTARTS AGE
docker-stress-1-build 0/1 Completed 0 81s
docker-stress-7fc545cd48-mds7m 1/1 Running 0 12s
With the help of the OpenShift Build concept, we got that up and running in just a few seconds. To enable the Kubernetes scheduler as well as the HPA to make decisions about Pod placement and autoscaling, we specify CPU resource requests and limits:
$ oc set resources deployment/docker-stress --requests cpu=1 --limits cpu=1
deployment.apps/docker-stress resource requirements updated
$ oc get deployment/docker-stress -o yaml
...
spec:
  containers:
  - name: docker-stress
    resources:
      requests:
        cpu: "1"
      limits:
        cpu: "1"
...
The question to ask now is: How many pods of the above specification fit into my cluster? To answer this question, we can either do some manual calculations, or we can use a neat tool called cluster-capacity. To make use of it, we’ll create a pod specification and let the tool do the calculations:
$ cat podspec.yaml
apiVersion: v1
kind: Pod
metadata:
  name: server-hostname
spec:
  containers:
  - name: server-hostname
    image: k8s.gcr.io/serve_hostname
    imagePullPolicy: Always
    resources:
      requests:
        cpu: 1
Thankfully, Red Hat ships the cluster-capacity tool in a container image, so we can use Podman (or any other container engine for that matter) to run it:
$ sudo podman run \
--rm \
-it \
-v /home/ckoep/.kube/config:/root/kubeconfig:Z \
-v /home/ckoep/pod.yaml:/root/podspec.yaml \
registry.redhat.io/openshift4/ose-cluster-capacity:v4.6 cluster-capacity --kubeconfig /root/kubeconfig --podspec /root/podspec.yaml --verbose
Note that we mount both the previously created pod definition and our local kubeconfig file into the container. The latter is required to tell the cluster-capacity tool which cluster to connect to. Looking at the output, we learn that our cluster can currently serve five copies of the application.
I0224 10:53:14.972477 1 registry.go:173] Registering SelectorSpread plugin
I0224 10:53:14.972586 1 registry.go:173] Registering SelectorSpread plugin
server-hostname pod requirements:
- CPU: 1
- Memory: 0
The cluster can schedule 5 instance(s) of the pod server-hostname.
Termination reason: Unschedulable: 0/9 nodes are available: 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 6 Insufficient cpu.
Pod distribution among nodes:
server-hostname
- ip-10-0-170-93.ec2.internal: 2 instance(s)
- ip-10-0-198-101.ec2.internal: 2 instance(s)
- ip-10-0-149-115.ec2.internal: 1 instance(s)
Now, what happens when we configure our application to scale up to seven replicas under load? Remember, we enabled the cluster autoscaling mechanism when we installed our ROSA cluster, so we expect nodes to be added automatically to serve the growing compute demand of the application. Let’s go ahead and verify that:
$ oc autoscale deployment/docker-stress \
--max 7 \
--cpu-percent=50
horizontalpodautoscaler.autoscaling/docker-stress autoscaled
$ oc set env deployment/docker-stress CPU_LOAD=1
deployment.apps/docker-stress updated
After a few minutes, we can observe that the load increases. As a result, additional copies of our application are being spawned.
$ oc get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
docker-stress Deployment/docker-stress 99%/50% 1 7 1 2m3s
$ oc describe hpa
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulRescale 22s horizontal-pod-autoscaler New size: 2; reason: cpu resource utilization (percentage of request) above target
Running the cluster capacity tool again, we can see that the cluster resources are starting to become more and more scarce:
The cluster can schedule 2 instance(s) of the pod server-hostname.
Eventually, new pods can no longer be scheduled because the resource requirements cannot be met. As a result, the Pods remain in the `Pending` state until new capacity becomes available.
$ oc get pods
NAME READY STATUS RESTARTS AGE
docker-stress-1-build 0/1 Completed 0 20m
docker-stress-7c4cf9fdc4-5l42g 1/1 Running 0 5m33s
docker-stress-7c4cf9fdc4-8tfqp 1/1 Running 0 10m
docker-stress-7c4cf9fdc4-cswf6 1/1 Running 0 15m
docker-stress-7c4cf9fdc4-jt9kh 1/1 Running 0 5m33s
docker-stress-7c4cf9fdc4-wjjrd 0/1 Pending 0 32s
docker-stress-7c4cf9fdc4-wvcm2 1/1 Running 0 32s
docker-stress-7c4cf9fdc4-z46wx 1/1 Running 0 32s
And, as expected, a scale-up, this time on the machine (and eventually node) level, is triggered.
$ oc describe pod docker-stress-7c4cf9fdc4-wjjrd
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m15s default-scheduler 0/9 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 Insufficient cpu.
Warning FailedScheduling 3m15s default-scheduler 0/9 nodes are available: 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 3 node(s) had taint {node-role.kubernetes.io/infra: }, that the pod didn't tolerate, 4 Insufficient cpu.
Normal NotTriggerScaleUp 46s (x15 over 3m6s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added):
Normal TriggeredScaleUp 25s cluster-autoscaler pod triggered scale-up: [{openshift-machine-api/ckoep-prod-xz987-worker-us-east-1a 1->2 (max: 2)}]
This results in the provisioning of a Machine …
$ oc get machine -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
...
ckoep-prod-xz987-worker-us-east-1a-j9h4v Provisioning m5.xlarge us-east-1 us-east-1a 9s
...
… which eventually turns into a fully provisioned node …
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-130.ec2.internal Ready worker 47s v1.19.0+e405995
ip-10-0-130-97.ec2.internal Ready master 117m v1.19.0+e405995
ip-10-0-149-115.ec2.internal Ready worker 109m v1.19.0+e405995
ip-10-0-155-231.ec2.internal Ready infra,worker 95m v1.19.0+e405995
ip-10-0-162-163.ec2.internal Ready infra,worker 95m v1.19.0+e405995
ip-10-0-170-93.ec2.internal Ready worker 108m v1.19.0+e405995
ip-10-0-182-6.ec2.internal Ready master 115m v1.19.0+e405995
ip-10-0-198-101.ec2.internal Ready worker 109m v1.19.0+e405995
ip-10-0-211-176.ec2.internal Ready master 115m v1.19.0+e405995
ip-10-0-217-158.ec2.internal Ready infra,worker 95m v1.19.0+e405995
… that is able to serve our workload’s demands!
$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
docker-stress-1-build 0/1 Completed 0 27m 10.128.2.44 ip-10-0-149-115.ec2.internal <none> <none>
docker-stress-7c4cf9fdc4-5l42g 1/1 Running 0 12m 10.131.0.31 ip-10-0-198-101.ec2.internal <none> <none>
docker-stress-7c4cf9fdc4-8tfqp 1/1 Running 0 17m 10.129.2.33 ip-10-0-170-93.ec2.internal <none> <none>
docker-stress-7c4cf9fdc4-cswf6 1/1 Running 0 22m 10.128.2.48 ip-10-0-149-115.ec2.internal <none> <none>
docker-stress-7c4cf9fdc4-jt9kh 1/1 Running 0 12m 10.129.2.34 ip-10-0-170-93.ec2.internal <none> <none>
docker-stress-7c4cf9fdc4-wjjrd 1/1 Running 0 7m25s 10.129.4.6 ip-10-0-128-130.ec2.internal <none> <none>
docker-stress-7c4cf9fdc4-wvcm2 1/1 Running 0 7m25s 10.128.2.53 ip-10-0-149-115.ec2.internal <none> <none>
docker-stress-7c4cf9fdc4-z46wx 1/1 Running 0 7m25s 10.131.0.32 ip-10-0-198-101.ec2.internal <none> <none>
Lastly, let’s make sure that, for our wallet’s sake, the reverse direction works as well, shall we?
$ oc set env deployment/docker-stress CPU_LOAD=100m
deployment.apps/docker-stress updated
After a few minutes, the cluster is indeed scaled back to its original capacity.
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-130-97.ec2.internal Ready master 158m v1.19.0+e405995
ip-10-0-149-115.ec2.internal Ready worker 151m v1.19.0+e405995
ip-10-0-155-231.ec2.internal Ready infra,worker 137m v1.19.0+e405995
ip-10-0-162-163.ec2.internal Ready infra,worker 137m v1.19.0+e405995
ip-10-0-170-93.ec2.internal Ready worker 150m v1.19.0+e405995
ip-10-0-182-6.ec2.internal Ready master 157m v1.19.0+e405995
ip-10-0-198-101.ec2.internal Ready worker 151m v1.19.0+e405995
ip-10-0-211-176.ec2.internal Ready master 157m v1.19.0+e405995
ip-10-0-217-158.ec2.internal Ready infra,worker 137m v1.19.0+e405995
To be fair, it takes some time to detect autoscaling events, provision machines, install nodes, and scale out workloads; in some cases, an argument could be made that it takes too much time. As a result, a few colleagues of mine, members of the Red Hat Community of Practice, are actively maintaining a community operator that aims to enable a more proactive workflow: the Proactive Node Scaling Operator.
On another note, I’d like to highlight that ROSA supports a multitude of different AWS instance types; please refer to the ROSA service definition for the list supported at the time of this writing.
Update strategy
Roughly every three months, Red Hat OpenShift customers have the option to upgrade their environments to the latest minor version in order to leverage a bunch of new features and functionalities. On top of that, additional security, bug fix, and enhancement updates are shipped even more frequently.
Luckily, even with ROSA you retain a certain level of control over when upgrades happen. Specifically, there are two update strategies to choose from in the Red Hat OpenShift Cluster Manager:
- Automatic: Upgrades are applied automatically as soon as a new version becomes available. In addition, the exact day and start time for the operation can be specified.
- Manual: In short, you trigger the (still fully automated) upgrade procedure manually.
Note that if your cluster version falls too far behind, it will be updated automatically; see the version support information. Also note that High and Critical security concerns (CVEs) will be patched automatically within 48 hours, regardless of the chosen update strategy.
Similar functionality is available in the AWS Management Console and will soon come to the Red Hat Advanced Cluster Management for Kubernetes product.
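With the manual strategy, the (still fully automated) upgrade can also be kicked off from the CLI; a hedged sketch, with an illustrative target version:

$ rosa upgrade cluster \
    --cluster=ckoep-prod \
    --version 4.6.18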
Integration with AWS services
One of the great things about AWS is the vast number of cloud services covering a broad range of areas, for example storage, databases and analytics, to name a few. Wouldn’t it be great to be able to consume those services directly within OpenShift, perhaps even via the Kubernetes API?
As a matter of fact, this is possible today via the use of a project called AWS Controllers for Kubernetes (ACK), which is currently available as a developer preview. With ACK, you have the ability to deploy and manage AWS services directly from OpenShift.
Now, assuming we’d like to deploy an S3 bucket, we first need to install the ACK S3 controller. Because Helm charts are available, the installation is straightforward:
$ export HELM_EXPERIMENTAL_OCI=1
$ export SERVICE=s3
$ export RELEASE_VERSION=v0.0.1
$ export CHART_EXPORT_PATH=/tmp/chart
$ export CHART_REPO=public.ecr.aws/aws-controllers-k8s/chart
$ export CHART_REF=$CHART_REPO:$SERVICE-$RELEASE_VERSION
$ mkdir -p $CHART_EXPORT_PATH
$ helm chart pull $CHART_REF
$ helm chart export $CHART_REF --destination $CHART_EXPORT_PATH
$ export ACK_K8S_NAMESPACE=ack-system
$ oc adm new-project $ACK_K8S_NAMESPACE
$ helm install -n $ACK_K8S_NAMESPACE \
ack-$SERVICE-controller \
$CHART_EXPORT_PATH/ack-$SERVICE-controller
$ oc set env deployment/ack-s3-controller \
AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text) \
AWS_REGION=us-east-2 \
AWS_ACCESS_KEY_ID=foo \
AWS_SECRET_ACCESS_KEY=bar
And that’s it! The previously installed controller will now watch for instances of the “Bucket” custom resource and take care of the heavy lifting when it comes to managing S3 buckets. A simple example would be the following:
$ cat <<EOF | oc apply -f -
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: my-amazing-bucket
  namespace: my-awesome-project
spec:
  name: my-amazing-bucket
EOF
This results in the creation of a basic bucket:
$ oc -n ack-system logs -l k8s-app=ack-s3-controller
2021-03-02T14:35:44.630Z INFO ackrt created new resource {"kind": "Bucket", "namespace": "my-awesome-project", "name": "my-amazing-bucket", "generation": 2, "is_adopted": false, "arn": null}
2021-03-02T14:36:15.279Z INFO ackrt deleted resource {"kind": "Bucket", "namespace": "my-awesome-project", "name": "my-amazing-bucket", "generation": 3}
$ aws s3 ls
2021-03-02 16:01:06 my-amazing-bucket
In a real-world scenario, the last step would be to actually make a given application use the freshly provisioned bucket. It is worth mentioning that the workflow for other services looks similar. If you are interested, a list of released and planned AWS services can be found on the ACK community website.
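As a minimal sketch of that final step (the deployment name and environment variable names are hypothetical; use whatever your application expects), the bucket coordinates could simply be injected into the workload:

$ oc -n my-awesome-project set env deployment/my-app \
    S3_BUCKET=my-amazing-bucket \
    AWS_REGION=us-east-2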
Summary
This blog post discussed the basics of installing and configuring ROSA. However, there are many more interesting nuances that we didn’t cover. Just to name two:
- ROSA comes with a built-in logging add-on service that allows you to forward your application, infrastructure and audit logs to AWS CloudWatch.
- All managed components of ROSA clusters are backed up regularly. It is important to think about backing up your workloads, as those are not covered.
With that said, I encourage you to try the service out yourself.