How to make your bucket data accessible in multiple locations via replication

January 31, 2022

Introduction

Data is one of the most significant assets in today’s businesses, and data services must address both infrastructure and application needs.

For this reason, I would like to talk about the current possibilities regarding data distribution from an object storage bucket in OpenShift Data Foundation (ODF).

ODF is not a new product from Red Hat; in mid-2021 we rebranded OpenShift Container Storage (OCS) to OpenShift Data Foundation (ODF). Integrating persistent storage into OpenShift is multi-layered. ODF provides a data layer for applications where interaction is implemented in a simplified, consistent and scalable way. The associated operator guides you through deployment in just a few steps and also provides lifecycle management for the complete backend over the long term. The backend includes components such as Ceph, Rook and the Multicloud Object Gateway (based on NooBaa.io). The ODF Multicloud Object Gateway provides a consistent S3 endpoint across different multi-cloud infrastructures: AWS, Azure, GCP, bare metal, VMware and OpenStack.

Use Case: Data Sharing in a hybrid and multi cloud deployment

Data is generated in the cloud and on-prem and must in part be mirrored, replicated or distributed through replication mechanisms. This can serve for further processing or as redundancy, but by no means as a long-term backup. Long-term backups are usually defined by fixed retention times, auditing, WORM (i.e. object lock) and kept at remote locations.

A suitable protocol, and possibly also a file system, should be considered for data storage. Since S3 is the most widely used protocol for object storage and applications, I use it in this example as well.

To provide these data services, I use the ODF Multicloud Object Gateway (MCG), which manages objects and can distribute objects in buckets across different cloud providers / endpoints.

First, let me explain what a bucket is. A bucket is a logical data space that can scale endlessly. As soon as data is uploaded (PUT) to this bucket, ODF usually reduces its storage footprint through deduplication and compression, unless the data was already encrypted or compressed beforehand. (Take a look at the picture below.) S3 also provides APIs for access control lists and other functions that can be defined on a bucket, e.g. object lock for data retention and backup solutions.
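
As a minimal illustration of these basic interactions (shown with s3cmd, which is also used later in this post; the bucket and file names are just examples):

s3cmd put ./image.jpg s3://my-data-bucket/image.jpg      # upload (PUT) an object into the bucket
s3cmd setacl --acl-public s3://my-data-bucket/image.jpg  # attach an access control list to the object
s3cmd ls s3://my-data-bucket                             # list the objects in the bucket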

There are two types of buckets: data buckets came first, and namespace buckets were added later as a new feature. The data itself can be mirrored, replicated or outsourced.

What exactly is the difference between a data bucket and namespace bucket?

Data buckets

  • Data stored on local persistent volumes, S3-compatible storage or in the cloud
  • Management of the data itself using metadata stored locally
  • Deduplicated, compressed and encrypted chunks
  • Only accessible with the correct encryption key to decrypt the data
    • Only the internal database includes the keys, with an option to keep them separately in a Key Management System
  • Highly secure

Namespace bucket

  • Highly flexible configuration sets
    • Mix of different buckets from possibly different locations
    • E.g. served from multiple MCGs and different ODF clusters
  • Contains only pointers to the buckets
    • New or existing buckets
  • Targets can be defined as Read and Read-Write
    • Only one writer can be defined
  • Data will be stored as is
    • Not deduplicated, compressed or encrypted
  • Hyperscaler / cloud targets can be, for now: Azure, AWS and S3-compatible storage (e.g. IBM Cloud)
  • Unidirectional or bidirectional asynchronous replication

Because this concept works with data via an external compatible object storage or cloud provider, I want to be clearer about the multi-access possibilities in the picture below. A namespace bucket abstracts one or more namespace resources in the NooBaa namespace. When you create a namespace bucket, you can specify read or write policies on the namespace resources that you have configured in your Multicloud Object Gateway. For example, you can read from two buckets across different cloud providers and write to a third bucket in another, separate cloud environment.
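
For the example above (read from two buckets, write to a third), the corresponding namespace bucket class could be wired up with the noobaa CLI roughly as follows. This is only a sketch: the namespace store names (ns-aws, ns-azure, ns-rgw) are hypothetical and the flags should be checked against your installed CLI version.

noobaa bucketclass create namespace-bucketclass multi multi-cloud-bc --read-resources ns-aws,ns-azure,ns-rgw --write-resource ns-rgw
noobaa obc create multi-cloud-bucket --bucketclass multi-cloud-bc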

To work with multiple cloud providers, local caching is a configuration option (Technology Preview) which helps to reduce network bandwidth and egress costs.

Demonstration: How it works for one use case via replication and namespace buckets

As you can see in the following picture, I configured a solution where data/objects are automatically replicated between a locally running data space (data bucket) and a logical data space (namespace bucket) which stores the data in an S3-compatible object storage. I upload the data with s3cmd, one of many S3-compatible applications.

Prepare the lab environment

I’m using a small lab which is fully automated by Ansible playbooks and deployed in about 45 minutes with SSO integration.

Project Stormshift: https://github.com/stormshift/automation

Based on that, I deployed an OpenShift cluster with three control plane and three compute nodes (up to 8 cores and 32 GB of memory per node) and installed ODF via OperatorHub in a few steps as a minimal setup.

Multicloud Object Gateway (MCG) command-line interface

RHEL:

# subscription-manager repos --enable=rh-odf-4-for-rhel-8-x86_64-rpms

# yum install mcg

macOS:

Download: https://access.redhat.com/downloads/content/547/ver=4/rhel---8/4.9.0/x86_64/product-software

Upstream versions can be installed via brew.

Now log in with “oc login” and run “noobaa status” to get an overview of the current configuration.
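
For example, assuming the operator’s default openshift-storage namespace (the API URL and user are placeholders for your own cluster):

oc login https://api.<cluster domain>:6443 -u <user>
noobaa status -n openshift-storage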

S3cmd

S3cmd can be used for uploading (put), downloading (get) or listing (ls) files in the buckets.

Configuration example of S3cmd:

~/.s3cfg-OCP2rgw:

[default]
access_key = C0XYLACTI8EF3SU32XL3
secret_key = FKYYARBsK56NOTQFfeIG7mEGPB1OXieBbxH9MyXl
host_base = s3-rgw-openshift-storage.apps.ocp2.stormshift.coe.muc.redhat.com
host_bucket = s3-rgw-openshift-storage.apps.ocp2.stormshift.coe.muc.redhat.com
check_ssl_certificate = False
check_ssl_hostname = False
use_http_expect = False
use_https = False
signature_v2 = True
signurl_use_https = False
… additional lines left at their defaults
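
With this configuration file in place, the basic operations look like the following (bucket and file names are examples):

s3cmd -c ~/.s3cfg-OCP2rgw ls                                          # list all buckets on the endpoint
s3cmd -c ~/.s3cfg-OCP2rgw put ./photo.jpg s3://<bucket name>/         # upload an object
s3cmd -c ~/.s3cfg-OCP2rgw get s3://<bucket name>/photo.jpg ./copy.jpg # download an object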

Configure namespace and bucket replication

  1. Create buckets to serve data for the namespace store

I used the RGW from Ceph in ODF as S3-compatible storage, which in my example plays the role of an S3-compatible cloud provider.
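
If the target bucket does not exist yet, it can be created up front against the RGW endpoint with s3cmd, for example (the bucket name is shortened here for readability):

s3cmd -c ~/.s3cfg-OCP2rgw mb s3://s3-rgw-1
s3cmd -c ~/.s3cfg-OCP2rgw ls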

  2. Create a namespace store to define the target via UI

The configuration form is simple, and you can switch between selecting existing secrets or manually adding access and secret keys. The target bucket should be empty and reachable.

The first namespace store is created and shows the state Ready, which is great because the connection to the bucket on the target object storage was successful. If it fails, check routing/ports, certificates, the endpoint and DNS resolution.
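
The state can also be verified from the command line; a quick sketch (NamespaceStore is a noobaa.io custom resource in the openshift-storage namespace; verify the status subcommand with noobaa namespacestore --help on your version):

oc get namespacestore -n openshift-storage
noobaa namespacestore status namespace-1 -n openshift-storage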

If you want to create it via the noobaa CLI, use the CLI wizard:

  ~ noobaa namespacestore create s3-compatible namespace-1
INFO[0000] ✅ Exists: NooBaa "noobaa"
Enter endpoint: http://s3-rgw-openshift-storage.apps.ocp2.stormshift.coe.muc.redhat.com
Enter target-bucket: s3-rgw-1-84d6c330-d269-4773-8d58-08545b740bfa
Enter access-key: [got 20 characters]
Enter secret-key: [got 40 characters]
INFO[0042] ✅ Created: NamespaceStore "namespace-1"
INFO[0042] ✅ Created: Secret "namespace-store-s3-compatible-namespace-1"

Output while creating namespacestore via noobaa cli
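
The same namespace store can also be created non-interactively by passing the values as flags instead of answering the wizard prompts. This sketch follows the ODF 4.9 documentation; the keys are replaced by placeholders:

noobaa namespacestore create s3-compatible namespace-1 \
  --endpoint http://s3-rgw-openshift-storage.apps.ocp2.stormshift.coe.muc.redhat.com \
  --target-bucket s3-rgw-1-84d6c330-d269-4773-8d58-08545b740bfa \
  --access-key <ACCESS_KEY> --secret-key <SECRET_KEY>
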
  3. Create a namespace bucket class

Creating a namespace bucket class defines a namespace policy for the namespace buckets. The namespace policy requires a type of either single, multi or cache.

Note:

Single NamespaceStore

The namespace bucket will read and write its data to a selected namespace store

Multi NamespaceStores

The namespace bucket will serve reads from several selected backing stores, creating a virtual namespace on top of them and will write to one of those as its chosen write target

Cache NamespaceStore

The caching bucket will serve data from a large raw data source through a local caching tier.
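
For reference, the single type used in this demonstration (and a cache variant for comparison) could be created with the noobaa CLI roughly like this. The bucket class names are examples, and the cache flags in particular should be double-checked against your CLI version:

noobaa bucketclass create namespace-bucketclass single namespace-bc --resource namespace-1
noobaa bucketclass create namespace-bucketclass cache cache-bc --hub-resource namespace-1 --ttl 3600000 --backingstores noobaa-default-backing-store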

  4. Create a bucket with the namespace-related bucket class

OBC (Object Bucket Claim) is a unique feature in ODF because it creates a new bucket and provides all the information needed to connect the bucket with an application, seamlessly and fully automated. For my demonstration I only need the access and secret key to proceed (see the example below the screenshot).

Configuration of the second bucket with the new BucketClass from the namespace
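
Once the OBC is bound, MCG creates a Secret and a ConfigMap with the same name as the claim in the application’s namespace. The credentials and bucket coordinates can be read out like this (claim name and namespace are placeholders):

oc get secret <obc name> -n <namespace> -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d
oc get secret <obc name> -n <namespace> -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d
oc get configmap <obc name> -n <namespace> -o jsonpath='{.data.BUCKET_NAME}{"\n"}{.data.BUCKET_HOST}{"\n"}{.data.BUCKET_PORT}{"\n"}'
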
  5. Create a bucket with a replication policy

This task creates a bucket with an attached replication policy. The policy defines a filter/prefix on object names, so either all new content in the bucket is replicated or only objects whose names match the prefix.

Create bucket with replication policy

After creation, I changed the current YAML of the Bucket Claim object to adopt the replication policy.

Alternatively, this is also possible with the noobaa CLI and a predefined JSON file.

Content of the pol.json file:

[{ "rule_id": "rule-2", "destination_bucket": "s3-mcg-2-mirror-to-rgw-987d4f7e-de9b-4d0c-ad47-e26feb82db58", "filter": {"prefix": ""}}]


Note: A prefix can also be used so that not all data is replicated. For example, with object names repl_data1 and repl_data2 and the filter/prefix set to “repl”, those objects are replicated, while objects with other names such as abc_data1 are not.
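
Such a prefix-filtered policy file could be written and applied like this (rule ID, OBC name and destination bucket are placeholders for your own values):

cat > pol-prefix.json <<'EOF'
[{ "rule_id": "rule-prefix", "destination_bucket": "<destination bucket>", "filter": {"prefix": "repl"}}]
EOF
noobaa obc create obc-prefix --replication-policy=./pol-prefix.json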

➜  ~ noobaa obc create obc.name --replication-policy=<path>/pol.json

INFO[0001] ✅ Exists: StorageClass "openshift-storage.noobaa.io"
INFO[0001] loading bucket replication /Users/mschindl/pol.json
INFO[0001] ✅ Successfully loaded bucket replication [{ "rule_id": "rule-2", "destination_bucket": "s3-mcg-2-mirror-to-rgw-987d4f7e-de9b-4d0c-ad47-e26feb82db58", "filter": {"prefix": ""}}]
INFO[0001] ✅ Created: ObjectBucketClaim "obc.name"

Output snippet from the command above

  6. Fill the object storage / data lake with data

Now we put some data into the object storage and check whether the replication and namespace mechanisms are working.

  7. Put objects into the data bucket s3-mcg-1 with replication policy to s3-mcg-2-mirror-to-rgw

s3cmd -c .s3cfg-OCP2mcg-1 put -r /Users/mschindl/Documents/Backgrounds/Library.jpg  s3://s3-mcg-1-f2f4d3a4-9138-4022-856b-86d9892d7da8

s3cmd -c .s3cfg-OCP2X ls s3://<bucket name>

Put an object to the bucket and list all three existing buckets

Result 1:

The data was automatically replicated to the namespace bucket and is also visible and accessible at the target, the S3-compatible cloud storage.
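
To double-check, the same object can be listed through all three access paths, assuming the s3cmd configuration files point at the two MCG endpoints and the RGW endpoint as set up earlier:

s3cmd -c .s3cfg-OCP2mcg-1 ls s3://s3-mcg-1-f2f4d3a4-9138-4022-856b-86d9892d7da8
s3cmd -c .s3cfg-OCP2mcg-2 ls s3://s3-mcg-2-mirror-to-rgw-987d4f7e-de9b-4d0c-ad47-e26feb82db58
s3cmd -c .s3cfg-OCP2rgw ls s3://s3-rgw-1-84d6c330-d269-4773-8d58-08545b740bfa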

  8. Put objects into the namespace bucket s3-mcg-2-mirror-to-rgw with namespace store namespace-1

s3cmd -c .s3cfg-OCP2mcg-2 put -r /Users/mschindl/Documents/Backgrounds/Birdie.jpg s3://s3-mcg-2-mirror-to-rgw-987d4f7e-de9b-4d0c-ad47-e26feb82db58/

(as you can see, the namespace is consistent)

Result 2:

The namespace bucket works transparently without any replication policy and only holds the pointer information to the S3-compatible cloud storage. The data itself is located in the target, the S3-compatible cloud storage called “s3-rgw-1”.

Additional information

…about configuration parameters can be found in the documentation:

https://access.redhat.com/documentation/en-us/red_hat_openshift_data_foundation/4.9/html/managing_hybrid_and_multicloud_resources/multicloud_object_gateway_bucket_replication

https://access.redhat.com/documentation/zh-cn/red_hat_openshift_data_foundation/4.9/pdf/managing_hybrid_and_multicloud_resources/red_hat_openshift_data_foundation-4.9-managing_hybrid_and_multicloud_resources-en-us.pdf

https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.8/html/managing_hybrid_and_multicloud_resources/managing-namespace-buckets_rhocs

Conclusion

As you have seen, bucket replication opens up new possibilities for building a data lake, with automatic and event-related data distribution (e.g. triggered by object prefixes). In addition, namespace buckets allow scaling for hybrid cloud and cross-departmental use cases. Finally, take a look at the following picture of the Jupyter use case: a single namespace bucket endpoint gives end users and applications a simple way to collaborate across a hybrid cloud.

Use Case: Jupyter Notebook read & writes to a Namespace Bucket with multiple sources

Red Hat, the world’s leading provider of open source software solutions with enterprise support, helps to implement solution stacks from middleware to infrastructure by providing a variety of services, consulting, and training.

Just leave a comment if you would like more information about this topic or the products mentioned in this blog. Any feedback or collaboration is highly appreciated! Our code is open …