Why test automation won't work without infrastructure automation

Background

Over the last few months I had the pleasure of working with some fine engineers from the Accenture Cloud First testing practice on the bigger picture of automated testing. The idea was to create a complete picture of an automated testing journey.

The result is a powerful LinkedIn article, and I want to use the opportunity to highlight some of its statements (or best practices, if you like) here and put them into the Red Hat context.

From this article I take four central pillars that support your automated testing journey:

Infrastructure automation is a prerequisite for test automation
Scalability is key, your developers need to receive critical feedback in max 2 hours
Security testing is not optional and needs to be in your pipeline
Shift left⁽¹⁾ of test cases requires to resolve service dependencies

Automated testing saves 500k$ a month!!!

The article “Taking a new approach to reducing software testing costs” from itproportal.com⁽²⁾ shows how relevant automated testing is for every transformation that aims at efficiency gains.
It states that 80% of testing is still done manually today and highlights the case of Worldpay (a global payment processor), which significantly increased its automated testing and is saving 500k$ a month as a result!!

So, this topic is still hot and we are only at the beginning. AI-driven automated testing is still more or less seen as a myth. On the other hand, there is enough material out there to design software testing systems that are capable of self-testing and self-healing.

But one step at a time: let’s have a look together at the four pillars mentioned above, and later we will show you some code examples around them.

So, let’s get started …

Infrastructure automation is a prerequisite for test automation

Why is infrastructure automation required? Let’s start with a view on requirements and business case aspects coming from the users:

Main business case driver is automated testing

Take a random business case for an efficiency-gain transformation and you may well end up with test automation as the biggest driver. But the required effort and manpower are not the only problem. The duration of manual testing has a high impact on the time to market, too.

Different use cases require different infrastructures

There are

different types of test runs, for example: incremental, nightly, load or destruction.
different types of test data, for example: synthetic, anonymized or productive.

So, it is quite logical that there can be different types of infrastructures where you want to run the tests.

A load test consumes a lot of compute, so you want to run it on the cheapest available infrastructure with synthetic data, whereas tests with near-productive data should run on-prem. Imagine having to spend significant effort on deploying these different infrastructures with every baseline delivery. That won’t fly.
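The mapping from test run and test data to an infrastructure can itself be written down as code. The following is only a minimal sketch: the target names and the rule that anonymized data is treated like productive data are assumptions made for illustration, not something the article prescribes.

```python
from dataclasses import dataclass

# Hypothetical target names; the text only speaks of the
# "cheapest available infrastructure" and of "on-prem".
CHEAPEST_CLOUD = "cheapest-available-cloud"
ON_PREM = "on-prem"


@dataclass(frozen=True)
class RunProfile:
    run_type: str   # "incremental", "nightly", "load" or "destruction"
    data_type: str  # "synthetic", "anonymized" or "productive"


def choose_infrastructure(profile: RunProfile) -> str:
    """Pick a deployment target for a test run."""
    # Assumption: anything derived from productive data stays on-prem,
    # whatever the run type is.
    if profile.data_type in ("productive", "anonymized"):
        return ON_PREM
    # Synthetic data can go wherever compute is cheapest.
    return CHEAPEST_CLOUD


print(choose_infrastructure(RunProfile("load", "synthetic")))
print(choose_infrastructure(RunProfile("nightly", "productive")))
```

Once such a rule lives in code, the baseline delivery can call it instead of relying on somebody remembering where each kind of test belongs.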

Scalability, also for your test cluster

Another thing is the test cluster itself. If you want to avoid a statically defined test cluster, it needs to be defined as code as well. In other words, with every baseline you deploy not only your application and the corresponding test infrastructure (middleware, etc.) but also the automated testing infrastructure.⁽³⁾
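To make "defined as code" a bit more tangible, here is a minimal sketch that generates a Kubernetes Deployment manifest for a set of test nodes. The deployment name, the replica count and the choice of the public selenium/node-chrome image are assumptions for illustration, and the manifest is deliberately incomplete (a real Selenium node also needs its hub connection settings).

```python
import json


def selenium_node_deployment(name: str, image: str, replicas: int) -> dict:
    """Build a bare-bones Deployment manifest for test nodes."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the labels of the pod template.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Hypothetical values; in a pipeline they would come from the baseline.
manifest = selenium_node_deployment(
    "selenium-node-chrome", "selenium/node-chrome", replicas=3
)
print(json.dumps(manifest, indent=2))
```

Because the manifest is generated, the size of the test cluster becomes a parameter of the delivery rather than a static property of some server.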

Loose coupling within your application infrastructure

Let’s not forget about the communication between the different applications. Any API configuration needed to integrate with external systems that is not included in the automated baseline delivery process could get lost, be delivered late or arrive in the wrong configuration.

Another key asset is scalability, where a lack of automation is again a showstopper. We are going to tackle this in the next chapter.

Scalability is key, your developers need to receive critical feedback in max 2 hours

Why do I want to go multi-cloud? Since everybody writes about it? Maybe 😉 … but actually, from a CFO point of view, the one and only goal is maximum cost savings. And we are not talking about cutting IT budgets. We are talking about the same book of work and the same delivery, but with more efficient processes and therefore lower costs.

In the past you calculated how many tests you had for a regression run on your CI and wanted one incremental cycle to complete within an hour. But you had to keep and pay for the underlying compute 24×7.
If you are already using a container platform like OpenShift, fixing this cost driver is low-hanging fruit.

You keep your application, the relevant supporting applications and your test nodes in containers. The whole setup is fully scalable. If a developer changes a piece of code, the test cluster gets the signal to run and the test nodes scale up using the compute power of the OpenShift platform. So, with this approach you are scaling not only your test application, but the middleware and test infrastructure as well.

The list of possible test frameworks is long and not really important. What is important:

I don’t hold compute for my test infrastructure
I can perform even intense, realistic end-to-end tests in minutes instead of hours
I get feedback about critical regressions and the underlying test cases within 2 hours at most
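The 2-hour target translates directly into a number of parallel test nodes. A minimal sketch of that arithmetic follows; the suite size and the average test duration are made-up example values, and the calculation assumes that tests are independent and can be spread evenly across nodes.

```python
import math


def nodes_needed(test_count: int, avg_test_seconds: int, budget_seconds: int) -> int:
    """Smallest number of parallel nodes that finishes the suite in time."""
    if test_count <= 0:
        return 0
    if avg_test_seconds <= 0:
        raise ValueError("average test duration must be positive")
    # How many tests one node can run back to back within the budget.
    tests_per_node = budget_seconds // avg_test_seconds
    if tests_per_node == 0:
        raise ValueError("a single test already exceeds the time budget")
    return math.ceil(test_count / tests_per_node)


# Hypothetical suite: 6000 tests of 30 seconds each, 2-hour budget.
print(nodes_needed(6000, 30, 2 * 60 * 60))  # prints 25
```

On a container platform these 25 nodes only exist while the regression runs, which is exactly the difference from paying for the compute 24×7.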

So, the key is to keep as many artefacts as possible in my continuous delivery pipeline. This starts with the code, includes all kinds of relevant middleware items and doesn’t stop at areas such as security testing, which is the topic of the next chapter.

Security testing is not optional and needs to be in your pipeline

What happens if a business user finds a critical defect in production that requires a fix within the same day?
Half the regression tests, and for sure all the stability, destruction and security tests, will be bypassed. Does this sound familiar to you?
In general, many test areas still don’t seem to be considered candidates for pipeline integration. This can work for a couple of years. But it takes only one mistake in the wrong area and you are facing irreparable reputational damage.

Hackers are not sleeping either. The number of attacks is increasing exponentially. Check out the latest State of the Software Supply Chain (SSSC) report.⁽⁴⁾ There you can find the following graphic showing next generation software supply chain attacks.

Next generation software supply chain attacks

A 650% year-over-year increase makes it very clear: either you act now or you will have to react to attacks in the future.

There are tons of possibilities out there to have automated security tests of all kinds running in your pipeline. A good example is Red Hat Advanced Cluster Security for Kubernetes (ACS).⁽⁵⁾ Among other reasons, it was surely a main driver for Red Hat recently being named a leader in the container security space by KuppingerCole Analysts.⁽⁶⁾

It comes with a lot of benefits:

The analysis of container images and Kubernetes resources for vulnerabilities and misconfiguration issues.
The analysis of the Kubernetes cluster for configuration issues and permission exposures to ensure the platform has an acceptable security posture.
The identification and prioritization of risk to enable teams to focus on the most important issues first.
The management of policies based on industry standard compliance requirements including CIS Benchmarks, PCI, HIPAA, and NIST SP 800-190 and SP 800-53.
The management of network communication within and outside the project or namespace.
The identification of baseline application behaviour which enables anomalous behaviours to be identified.

In the end, an automated testing approach has to be assessed on a holistic level. Everything that leaves a significant footprint in the cost analysis of your software delivery lifecycle (SDLC) has to be investigated, since it could be a potential automation candidate.

Just one example when it comes to SDLC cost analysis:

Fixing a defect on stage DEV or CI will cost you maybe 30-60 minutes. But this looks different for a defect that affects application performance or memory management. If such an issue is only detected on stage User Acceptance, you can easily end up with 3-6 weeks of one full-time equivalent (FTE).
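To get a feeling for the size of this gap, the two ranges can be put side by side. The sketch below assumes a 40-hour working week for one FTE, which is an assumption and not a figure from the article.

```python
HOURS_PER_WEEK = 40  # assumption: one FTE week equals 40 working hours

early_fix_hours = (0.5, 1.0)  # 30-60 minutes on stage DEV or CI
late_fix_weeks = (3, 6)       # 3-6 FTE weeks on stage User Acceptance

# Convert the late fix to hours: 3 weeks -> 120 h, 6 weeks -> 240 h.
late_fix_hours = tuple(weeks * HOURS_PER_WEEK for weeks in late_fix_weeks)

# Best case: cheapest late fix compared with the slowest early fix.
lowest_ratio = late_fix_hours[0] / early_fix_hours[1]
# Worst case: most expensive late fix compared with the fastest early fix.
highest_ratio = late_fix_hours[1] / early_fix_hours[0]

print(f"A late fix costs {lowest_ratio:.0f}x to {highest_ratio:.0f}x more effort")
```

Under this assumption the same defect costs between 120 and 480 times more effort when it is found late.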

This is the reason why a shift left of critical test cases makes absolute sense. We will look at this more closely in the next chapter.

Shift left of test cases requires to resolve service dependencies

Just imagine you have all of the above in place:

Some of your functional tests run after every code change, so they work as a kind of advanced smoke test of the new code version
You have advanced tests, e.g. security tests, in your pipeline
Everything is nicely scalable and a developer gets immediate feedback after every code change

What could possibly still be an issue?

Exactly: you may require external services to make a meaningful statement about your code quality. This means you have a strong dependency on such a service, and it needs to be managed. Otherwise your test case will fail every time the service is down.

Just to give one example of how to tackle this:

Ever heard of Istio Service Mesh?⁽⁷⁾

It works like a kind of glue between your services. It is capable of A/B testing and can, for example, route a certain share of traffic to a newer version of a service.

But it can also do auto failover.

But how does this help in this case here?

Ideally you want to call the external service directly. But if it is not responding, you want to reroute the call to a mock-up service. This keeps your CI pipeline up and running. There are tons of examples⁽⁸⁾ on the web of how to do this.
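The routing decision itself is simple. The sketch below shows it in plain Python rather than as Istio configuration: in a service mesh the rerouting happens in the mesh and the test code stays untouched. The two service functions are hypothetical stand-ins.

```python
from typing import Callable


def call_with_fallback(primary: Callable[[], dict], mock: Callable[[], dict]) -> dict:
    """Call the real service and fall back to the mock if it is unreachable."""
    try:
        return primary()
    except (ConnectionError, TimeoutError):
        # The external service is down; keep the pipeline running.
        return mock()


def external_service() -> dict:
    # Hypothetical stand-in for a service that is currently down.
    raise ConnectionError("external service not reachable")


def mock_service() -> dict:
    # Hypothetical canned answer of the mock-up service.
    return {"status": "ok", "source": "mock"}


print(call_with_fallback(external_service, mock_service))
```

Marking the answer with its source, as the mock does here, is a design choice that makes it visible in the test report whether a run was verified against the real service or against the mock.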

In the end this is not just a hint, it is a prerequisite for achieving a shift left of functional test cases. It would be illogical to assume that by shifting tests you also shift the availability and quality of the surrounding services your application depends on to these early test stages (e.g. stage CI / Dev).

So, enough theory, let’s have a look at some test infrastructure architectures in the next chapter.

Some technical examples

According to a ranking done by Katalon,⁽⁹⁾ our good old friend Selenium is still in 2nd place. So, let’s have a look at how to run it on OpenShift, how a pipeline can look with and around it and of course how to scale it.

So, let’s get it started:

How to run an automated test framework on OpenShift

As described above, we want not only our application infrastructure to be fully scalable but also the middleware and test infrastructure.

On top of that, the whole setup should be very transparent: we want to see capacity bottlenecks and also profile our code. This way we can spot performance issues in the code, whether they are emerging or already exist. And if our infrastructure is heading for a resource shortage at some point, we will see it before it happens.

Example test infrastructure

Extracted from the article “Container-native integration testing”⁽¹⁰⁾

Test Process Diagram

The process the diagram shows is the following:

When the Cucumber tests are started, Cucumber reads the test definitions from the Gherkin file.
Then it starts calling the test case implementation code.
The test case implementation code uses NightwatchJS to perform actions on the web page.
When that happens, NightwatchJS connects to the Selenium server and issues commands via the Selenium API.
Selenium executes these commands in a browser instance.
The browser connects to the web server(s) as needed. In our case, because we are using an SPA, the application is loaded as the first page load from the web server and then no more communication is needed.

Setting up this stack in a non-container based infrastructure is not simple, not only because of the number of processes and frameworks needed, but also because starting browsers in headless servers has been historically difficult. Fortunately for us, in a container-native world, we can easily automate all of this.
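The first steps of the process above, reading the test definition and calling the matching implementation code, can be sketched in a few lines. This is not Cucumber or NightwatchJS, just a plain Python illustration of the dispatch idea; the step texts are hypothetical, and the recorded actions stand in for the commands that would be sent to the Selenium server.

```python
import re
from typing import Callable

# Registry of (pattern, implementation) pairs, filled by the decorator.
STEPS: list[tuple[re.Pattern, Callable]] = []


def step(pattern: str):
    """Register a function as the implementation of a step text."""
    def register(func: Callable) -> Callable:
        STEPS.append((re.compile(pattern), func))
        return func
    return register


@step(r'I open the page "(.+)"')
def open_page(context: dict, url: str) -> None:
    context["actions"].append(("open", url))


@step(r'I click "(.+)"')
def click(context: dict, label: str) -> None:
    context["actions"].append(("click", label))


def run_scenario(lines: list[str], context: dict) -> None:
    """Match every scenario line against the registry and call the code."""
    for line in lines:
        # Drop the leading Gherkin keyword before matching.
        text = re.sub(r"^(Given|When|Then|And)\s+", "", line.strip())
        for pattern, func in STEPS:
            match = pattern.fullmatch(text)
            if match:
                func(context, *match.groups())
                break
        else:
            raise LookupError(f"no step definition for: {line!r}")


context = {"actions": []}
run_scenario(['Given I open the page "/login"', 'When I click "Sign in"'], context)
print(context["actions"])
```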

Integration test farm

Enterprises need to test their web applications with different combinations of browsers and operating systems. Usually, application owners will prioritize testing those combinations that are prevalent in the application user population. Normally, at least about half a dozen combinations are needed for each application.

Setting up different stacks and executing each of the test suite(s) sequentially on each stack is expensive in terms of resources and time.

To achieve the required feedback time of our regression testing we need to run the tests in parallel.

To help solve this problem, we can use Selenium-Grid. Selenium-Grid is a solution comprising Selenium Hub, which is a request broker, and one or more nodes that can be used to execute requests.

Selenium Hub

Each Selenium node, which is usually running on a different server, can be set up with different combinations of browsers and OSs (these and other characteristics are called capabilities in Selenium). The Hub is smart enough to send requests that require certain capabilities to the node, which can meet them.
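The matching the Hub performs can be pictured as follows. This is a simplified sketch and not the Hub's actual algorithm: a real Hub also tracks free slots and queues requests. The node names are hypothetical, while browserName and platformName are standard WebDriver capability names.

```python
from typing import Optional


def find_node(nodes: dict[str, dict[str, str]], requested: dict[str, str]) -> Optional[str]:
    """Return the first node that offers every requested capability."""
    for name, capabilities in nodes.items():
        if all(capabilities.get(key) == value for key, value in requested.items()):
            return name
    return None


# Hypothetical grid with two nodes.
grid = {
    "node-a": {"browserName": "chrome", "platformName": "linux"},
    "node-b": {"browserName": "firefox", "platformName": "linux"},
}

print(find_node(grid, {"browserName": "firefox"}))  # prints node-b
print(find_node(grid, {"browserName": "safari"}))   # prints None
```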

When it comes to installing and managing Selenium-Grid clusters, you may want to use the help of either Sauce Labs or BrowserStack to save some time.

Container-native integration tests

Ideally, we would like to be able to create a Selenium-Grid cluster with nodes that offer the right capabilities for our tests and run the tests with a high degree of parallelism. Then, once the tests are done, we’d destroy all of this infrastructure. This basically means re-creating on premises some of the services that are offered by integration test farm service providers.
A solid open source project in this area is Zalenium.⁽¹¹⁾

Zalenium runs a modified Hub that is able to create nodes on demand and destroy them when they are not needed anymore. With the advent of Windows nodes for Kubernetes, it is possible to enhance it to also support Internet Explorer and Edge on Windows.

If we put it all together, it looks as follows:

Container-native integration testing diagram

Each of the ovals in this diagram is going to be a different pod in OpenShift. The test player pods and the emulator pods are ephemeral and will be destroyed at the end of the test.

Observability

Extracted from the article “Leveraging Kubernetes and OpenShift for automated performance tests (part 1)”⁽¹²⁾

Running automated tests is a great thing, but for them to bring their full value we need to understand the details of the application behaviour. This is absolutely crucial when it comes to performance tests, but security tests and code quality tests rely on this information as well.

Leveraging the observability features built for production readiness is a straightforward way of getting this insight: identifying bottlenecks, error states, resource consumption under load, etc. Three pillars can be used for that:

Application metrics, which can be collected through JMX/Jolokia or Prometheus endpoints, for instance
Application traces/performance, which can be captured thanks to OpenTracing and Jaeger
Logs, which OpenShift automatically aggregates into Elasticsearch and makes available for querying and reporting in Kibana when the application writes them to the standard output
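For the third pillar, the only thing the test code has to do is write its results to the standard output in a form that is easy to query later. A minimal sketch follows, assuming one JSON object per line; the field names are hypothetical and not a format required by OpenShift.

```python
import json
import sys
import time


def log_test_result(name: str, status: str, duration_seconds: float, stream=None) -> str:
    """Write one test result as a single JSON line and return that line."""
    record = {
        "timestamp": time.time(),
        "test": name,
        "status": status,
        "duration_seconds": duration_seconds,
    }
    line = json.dumps(record, sort_keys=True)
    # Default to the standard output so the platform can collect the line.
    print(line, file=stream if stream is not None else sys.stdout)
    return line


log_test_result("login_smoke", "passed", 1.25)
```

With one structured line per result, questions like "which tests got slower this week" become a query in the log store instead of a search through console output.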

We will leave ACS out of scope for this chapter; it would provide a further analysis focused on security-relevant topics.

Observability and tools in a diagram

Here it definitely makes sense to use open source and drive this without any license constraints.

It gives you the freedom to hook literally everything into your logging strategy without a cost constraint. Only with full transparency and adequate reporting within your CI/CD pipeline can you unleash the full power of your automated testing setup.

Outlook

As said before, there is tons of material out there which simply doesn’t fit into one blog.

Dominique and I are planning a follow-up blog. There we will show examples of how infrastructure automation around test automation works, with concrete coding examples such as an Ansible module for Selenium⁽¹⁴⁾ or Molecule, a nice way to test Ansible scripts.⁽¹⁵⁾

So stay tuned! 😉

Key-takeaways

Automated testing still has the reputation of being one of the boring side topics, but it has never been more important than today.

Many people still get the power of automated testing wrong when it comes to

Different types of automated tests
Cost savings
Time to market

There is no way around automated testing and if you want to drive an efficient automated testing path, there is no way around infrastructure automation.

To unleash the full power of your test automation, your automated test infrastructure needs to be fully scalable, fail-safe and, in the best case, even self-healing, with fully transparent reporting in place.

What a wonderful world this would be, wouldn’t it 😉

Let us close this blog with a famous quote from Bruce Lee:⁽¹³⁾

Sources

(1) https://smartbear.com/learn/automated-testing/shifting-left-in-testing/

(2) https://www.itproportal.com/features/taking-a-new-approach-to-reducing-software-testing-costs/

(3) I can also recommend this article: https://www.ansible.com/blog/on-demand-execution-with-red-hat-openshift

(4) https://www.sonatype.com/hubfs/Q3%202021-State%20of%20the%20Software%20Supply%20Chain-Report/SSSC-Report-2021_0913_PM_2.pdf

(5) https://cloud.redhat.com/blog/using-openshift-pipelines-to-automate-red-hat-advanced-cluster-security-for-kubernetes

(6) https://www.kuppingercole.com/research/lc80207/container-security#heading1.1

(7) https://www.solo.io/blog/istio-multi-cluster-on-red-hat-openshift-with-gloo-mesh/

(8) https://istio.io/latest/blog/2021/external-locality-failover/

(9) https://katalon.com/resources-center/blog/continuous-testing-tools

(10) https://developers.redhat.com/blog/2018/08/02/container-native-integration-testing (see also this video: https://www.youtube.com/watch?v=KbNBTDO2XSM&ab_channel=DevConf)

(11) https://opensource.zalando.com/zalenium/

(12) https://developers.redhat.com/blog/2018/11/22/automated-performance-testing-kubernetes-openshift

(13) https://www.reddit.com/r/GetMotivated/comments/9ykgrs/image_a_goal_is_not_always_meant_to_be_reached/

(14) https://github.com/SeleniumHQ/ansible-selenium

(15) https://redhatnordicssa.github.io/test-ansible-role-molecule-podman

Authors

Main author

Michael Siebert

Co-author

Dominique Hofstetter, Senior Enterprise Account Solution Architect – Pharma Switzerland

I also want to use the opportunity to mention the authors behind the Accenture LinkedIn article and say thank you for the nice teamwork. It was a pleasure!!