Best Practices for Cloud Resource Management
Introduction
NOTE: This article was written by David Barranco, member of technical staff at VMware and SRE at Bitnami.
Cloud computing can bring numerous benefits to businesses, including scalability, more efficient collaboration among teams, reduced costs, and flexibility. Most companies have already started their journey to the cloud, more now than ever as remote work has increased due to COVID-19, so it is important to stress the need for effective cloud governance to manage:
- Budget overruns
- Cloud resource underutilization
- Compliance requirements
- Security remediation
Bitnami runs a tremendous number of automated workloads across multiple cloud environments. Through our experience, we have learned that enforcing a well-defined set of policies to keep those environments - and our budget - under control is critical.
Let’s look at some tools and internal processes that we think every organization needs in order to make the most of its cloud computing platforms.
Key considerations
If you are thinking about adding a set of policies to keep workloads manageable across multiple cloud environments, you should be sure to:
- Create processes that fit your company’s culture - Changes in infrastructure management may affect different development teams in your organization. If you don’t implement an internal process to back up these changes, you might end up with process integration issues.
- Involve the affected teams, and let them have a real say - Implementing new cloud management policies will affect all of the development teams that work with cloud environments. As such, it is very important to have periodic meetings with each of them so they can provide feedback and input about the policies that will be enforced in different cloud environments to ensure they don’t negatively impact productivity.
- Review and update your policies frequently - Cloud environments change quickly and continuously, so your policies should be reviewed at a similar pace.
- Monitor the results - Successful policy enforcement requires ongoing monitoring of the results those policies produce. This lets you continuously improve them based on real numbers.
Failing to consider these points could slow your teams' progress.
Identify your cloud use cases
The first step towards better cloud management is to identify your current use cases so you can choose the best tools and processes for enforcing your cloud environment policies. In order to assess your specific requirements, try answering the following list of questions:
- What does the infrastructure of the cloud-based development environment look like?
- Which cloud resources will the development team use on a daily basis?
- Which accounts/projects/subscriptions are used to do manual or automated tests? How many do you have?
- Are the production workloads correctly configured and secured?
- Which cloud resources have the greatest impact on your budget?
- Are the production workloads correctly tagged?
Your answers should give you a rough idea of the number of resources that may need managing. This process will also affect legacy infrastructure whose ownership is not clear: it will kickstart the discussion about why those infrastructure pieces exist and help identify their stakeholders.
Reviewing these questions with the various development teams in your organization will give everyone a better understanding of their cloud usage and requirements, and will highlight areas where better management policies can help them.
Define your cloud resource management policies
Once you understand how different cloud resources are used and how they affect your budget, you can begin defining which sets of policies you will enforce in different cloud environments.
Cloud management policies can be classified into three different categories:
Cost control policies
Policies that belong to this group are meant to reduce your spend on cloud environments as much as possible. These policies cover:
- Off-hours work - Instances that are used for development, testing, or demo purposes should be automatically stopped during non-working hours.
- Deletion of testing resources - Testing resources that are not used within a certain amount of time should be terminated. This applies not only to instances, but also to external IP allocations, volumes, DNS records, and any other resources left behind by the CI/CD pipelines.
- Control of any other resource that impacts your budget - This includes stale backups of your instances or volumes.
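The off-hours rule can be sketched as a small decision function. This is only an illustration of the logic, not how any particular tool implements it: the working window (Monday to Friday, 08:00 to 19:00), the `env` tag key, and its values are all assumptions made up for the example.

```python
from datetime import datetime

# Hypothetical working window: Monday-Friday, 08:00-19:00 local time.
WORK_START_HOUR = 8
WORK_END_HOUR = 19
# Hypothetical values of an "env" tag marking instances that may be stopped.
STOPPABLE_ENVS = {"development", "testing", "demo"}


def is_off_hours(now: datetime) -> bool:
    """Return True on weekends or outside the working window."""
    if now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    return not (WORK_START_HOUR <= now.hour < WORK_END_HOUR)


def should_stop(instance_tags: dict, now: datetime) -> bool:
    """Stop only non-production instances, and only during off-hours."""
    env = instance_tags.get("env", "")
    return env in STOPPABLE_ENVS and is_off_hours(now)
```

Instances without an `env` tag are left running here, which is a deliberately conservative choice; a stricter policy could treat untagged instances as stoppable.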
A good starting point for identifying the resources that need to be controlled is to review your cloud spend. Following the Pareto principle, take a closer look at the few resource groups that account for roughly 80 percent of your budget.
The remaining resource groups are typically considered tail spend. They are usually harder to control, so enforcing policies on them can easily affect developer productivity.
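As a rough illustration of the Pareto review, the sketch below ranks resource groups by cost and returns the largest ones that together reach 80 percent of total spend. The function name and the shape of the input (a mapping from group name to cost) are assumptions for the example; real billing exports will need some preprocessing first.

```python
def top_spenders(spend_by_group: dict, threshold: float = 0.8) -> list:
    """Return the largest groups whose cumulative spend first reaches
    `threshold` (a fraction between 0 and 1) of the total spend."""
    total = sum(spend_by_group.values())
    if total <= 0:
        return []
    selected = []
    running = 0.0
    # Walk the groups from most to least expensive.
    ranked = sorted(spend_by_group.items(), key=lambda kv: kv[1], reverse=True)
    for group, cost in ranked:
        selected.append(group)
        running += cost
        if running / total >= threshold:
            break
    return selected


# Made-up monthly costs, for illustration only.
spend = {"compute": 600, "storage": 250, "network": 100, "dns": 50}
focus_groups = top_spenders(spend)  # ["compute", "storage"]
```

Everything not returned by the function is the tail spend.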
Hygiene policies
Apart from controlling cloud spend, cleaning up stale resources in order to maintain a certain level of hygiene in the environment is also important.
Hygiene policies should cover the wiping of old, unused resources that may contribute to quota issues. In some environments, controlling the list of active users could also be helpful.
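A hygiene rule of this kind usually boils down to an age check. The following is a minimal sketch under assumed inputs: each resource is a dictionary with hypothetical `id`, `last_used` (a timezone-aware datetime), and `in_use` fields, which you would have to populate from your provider's inventory.

```python
from datetime import datetime, timedelta, timezone


def find_stale(resources: list, max_age_days: int, now: datetime = None) -> list:
    """Return the ids of resources that are not in use and were last
    used more than `max_age_days` days ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        r["id"]
        for r in resources
        if not r.get("in_use", False) and r["last_used"] < cutoff
    ]
```

Passing `now` explicitly keeps the function deterministic, which makes the rule easy to test before it is allowed to delete anything.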
Compliance policies
To effectively govern different cloud environments, compliance and security controls should be added as part of the infrastructure.
These compliance policies are usually related to the configuration - or misconfiguration - of the cloud resources. Examples include:
- Security groups that allow SSH connections from the Internet in production services
- Storage bucket configurations that may lead to public information leaks
- Insecure network configurations
- Lack of cloud tags (e.g., cost center, production, or development tags)
- Deviations from CIS benchmarks
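Two of the checks above, missing tags and SSH open to the Internet, can be sketched as simple predicates. The required tag keys and the simplified rule shape (`from_port`, `to_port`, `cidr`) are assumptions for illustration and do not match any provider's API response exactly.

```python
# Hypothetical tag keys that every resource is expected to carry.
REQUIRED_TAGS = {"cost-center", "environment"}

# CIDR blocks that mean "anyone on the Internet" (IPv4 and IPv6).
PUBLIC_CIDRS = {"0.0.0.0/0", "::/0"}


def missing_tags(tags: dict) -> list:
    """Return the required tag keys that are absent, sorted by name."""
    return sorted(REQUIRED_TAGS - set(tags))


def allows_public_ssh(ingress_rules: list) -> bool:
    """Return True if any ingress rule opens port 22 to a public CIDR."""
    return any(
        rule["from_port"] <= 22 <= rule["to_port"] and rule["cidr"] in PUBLIC_CIDRS
        for rule in ingress_rules
    )
```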
Enforcing cloud policies
After reaching an agreement with the teams about the kind of policies that will be enforced in different environments, it’s time to find a tool that will help you enforce them.
Requirements
A tool that will help you enforce policies in cloud environments should:
- Have multi-cloud support
- Have a declarative approach that aims to improve policy readability
- Produce cloud provider-native metrics on resources that match particular policies
- Support multi-account setups within each cloud service provider
- Be easy to maintain, with policies written in a declarative language
- Be easy to integrate with GitOps workflows
In light of those requirements, and after spending some time researching various options, the Bitnami SRE team decided to use Cloud Custodian, an open source rules engine for enforcing cloud management policies across several public cloud providers.
What is Cloud Custodian?
As a rule engine for managing public cloud accounts and resources, Cloud Custodian allows users to define policies for secure and cost-optimized cloud infrastructure management. Many organizations use it to turn their ad hoc scripts into lightweight, flexible tools, with unified metrics and reporting.
Cloud Custodian can be used to manage AWS, Azure, and GCP environments by ensuring real-time compliance with security policies (like encryption and access requirements), tag policies, and cost management via garbage collection of unused resources and off-hours resource management.
Cloud Custodian policies are written in simple YAML configuration files that enable users to specify policies on a resource type (EC2, ASG, Redshift, CosmosDB, PubSub Topic) and are constructed from a vocabulary of filters and actions. The tool also integrates with the cloud native, serverless capabilities of each provider to enable the real-time enforcement of policies with built-in provisioning. Alternatively, it can be run as a simple cron job on a server to execute against large existing fleets.
What does a policy look like?
The following example illustrates what a Cloud Custodian policy looks like:
- name: ec2-require-non-public-and-encrypted-volumes
  resource: aws.ec2
  description: |
    Provision a lambda and cloud watch event target
    that looks at all new instances and terminates those with
    unencrypted volumes.
  mode:
    type: cloudtrail
    role: CloudCustodian-QuickStart
    events:
      - RunInstances
  filters:
    - type: ebs
      key: Encrypted
      value: false
  actions:
    - terminate
Look deeper at the policy and you will be able to identify four different sections:
- The cloud resource type that the policy will run on (e.g., EC2 instance, IAM role)
- A filter that controls which resources will be affected by the policy (e.g., untagged EC2 instances, IAM roles with admin privileges)
- The actions that the policy will take on the matched resources (e.g., tag the instance, send a notification to the security team about the role, etc.)
- A mode in which to run the policy (e.g., deployed as a Lambda, or from the command line on your machine or an EC2 instance)
You can find all the information available related to filters, actions, and cloud specifics in the Cloud Custodian documentation. The policy schema, which defines the actions and filters available for a particular cloud resource, can be found there as well.
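To make the four sections concrete, the sketch below mirrors the example policy as a Python dictionary and pulls each section out. The `describe` helper is hypothetical and not part of Cloud Custodian; the fallback to `pull` when no mode is given is an assumption about the tool's default that you should confirm in its documentation.

```python
# The example policy, expressed as the dictionary a YAML parser would produce.
policy = {
    "name": "ec2-require-non-public-and-encrypted-volumes",
    "resource": "aws.ec2",
    "mode": {
        "type": "cloudtrail",
        "role": "CloudCustodian-QuickStart",
        "events": ["RunInstances"],
    },
    "filters": [{"type": "ebs", "key": "Encrypted", "value": False}],
    "actions": ["terminate"],
}


def describe(policy: dict) -> dict:
    """Summarize the four sections of a policy (hypothetical helper)."""
    actions = policy.get("actions", [])
    return {
        "resource": policy["resource"],
        "filter_count": len(policy.get("filters", [])),
        # Actions may be plain strings or mappings with a "type" key.
        "actions": [a if isinstance(a, str) else a["type"] for a in actions],
        # Assumption: policies without a mode run in "pull" mode.
        "mode": policy.get("mode", {}).get("type", "pull"),
    }
```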
Continuously enforcing policies in public cloud accounts
When operating Cloud Custodian (or any similar tool), it is highly recommended to treat the policy files as code, much as you would Terraform or CloudFormation files. Cloud Custodian has a built-in dry-run mode and policy syntax validation, both of which can help when you’re crafting new policies for your infrastructure.
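One way to use both features as a pre-merge check is sketched below. It assumes the `custodian` CLI is installed and on the PATH; the wrapper functions and the injectable `runner` are illustrative conveniences, not part of Cloud Custodian.

```python
import subprocess


def validate_cmd(policy_file: str) -> list:
    """Command that checks the policy file's syntax."""
    return ["custodian", "validate", policy_file]


def dry_run_cmd(policy_file: str, output_dir: str) -> list:
    """Command that evaluates filters without executing any actions."""
    return ["custodian", "run", "--dryrun", "-s", output_dir, policy_file]


def check_policy(policy_file: str, output_dir: str, runner=subprocess.run) -> bool:
    """Validate first, then dry-run; stop at the first failing command.

    `runner` must accept a command list and return an object with a
    `returncode` attribute, as subprocess.run does.
    """
    for cmd in (validate_cmd(policy_file), dry_run_cmd(policy_file, output_dir)):
        if runner(cmd).returncode != 0:
            return False
    return True
```

Accepting a `runner` argument lets you exercise the check in unit tests without touching a real cloud account.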
When enforcing policies, running Cloud Custodian in a continuous deployment tool, such as Jenkins, is highly recommended. Given that the Cloud Custodian team provides a Dockerfile to run this tool, you can configure a pipeline to:
- Build a local copy of this container (or store it in a private image registry)
- Clone your internal repository with your policies as code
- Execute those policies on your various public cloud providers
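The three steps can be written down as the commands a pipeline job would run. This is a sketch under several assumptions: the image tag, repository URL, and paths are placeholders, the container's entrypoint is assumed to be the `custodian` CLI, and passing cloud credentials into the container is left out entirely.

```python
def pipeline_commands(image_tag, dockerfile_dir, policy_repo, checkout_dir, policy_file):
    """Return the commands for the three pipeline steps, in order.

    `checkout_dir` should be an absolute path so it can be bind-mounted.
    """
    return [
        # 1. Build a local copy of the Cloud Custodian container.
        ["docker", "build", "-t", image_tag, dockerfile_dir],
        # 2. Clone the internal repository that holds the policies as code.
        ["git", "clone", policy_repo, checkout_dir],
        # 3. Execute a policy file from inside the container.
        #    Assumes the image entrypoint is the custodian CLI.
        [
            "docker", "run", "--rm",
            "-v", f"{checkout_dir}:/policies",
            image_tag,
            "run", "-s", "/tmp/custodian-output", f"/policies/{policy_file}",
        ],
    ]
```

A Jenkins job (or any other scheduler) would run these in sequence and fail the build if any command exits with a non-zero status.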
At Bitnami, we have several Jenkins jobs with different periodic configurations. Depending on the policy and the environment it targets, these jobs run on an hourly or daily basis.
Monitor cloud policies
Every policy that you execute should be traceable, especially if your policies are going to be applied to resources owned by other teams, such as instances.
We wanted to configure an “audit” trail for the execution of our cloud policies and make it easy for any of our engineering teams to check. So we configured a simple integration with an SNS topic and a Lambda function that logs the execution of our Cloud Custodian policies to a specific Slack channel.
With this approach, any team member can quickly and easily check which resources were deleted by Cloud Custodian, understand the reason for the deletion, and learn more through the link to our internal engineering handbook, which has a detailed explanation of each policy. Meanwhile, because this channel acts purely as an audit trail for our Cloud Custodian policies, it doesn’t cause alert fatigue among SRE team members.
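The core of such an integration is a function that turns an SNS-delivered message into a Slack message. The sketch below is not Bitnami's implementation: the message fields (`policy`, `action`, `resources`) and the handbook URL scheme are hypothetical, and the message format your own setup emits will likely differ and may need decoding first. Only the outer SNS event envelope and the `{"text": ...}` payload shape follow the standard Lambda and Slack webhook formats.

```python
import json


def build_slack_payload(sns_event: dict, handbook_url: str) -> dict:
    """Build a Slack webhook payload from an SNS-triggered Lambda event.

    Assumes the SNS message body is plain JSON with hypothetical
    "policy", "action", and "resources" fields.
    """
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    policy_name = message["policy"]
    action = message["action"]
    resource_ids = ", ".join(r["id"] for r in message["resources"])
    text = (
        f"Policy `{policy_name}` ran `{action}` on: {resource_ids}\n"
        f"Why: {handbook_url}#{policy_name}"
    )
    return {"text": text}
```

A Lambda handler would call this function and POST the returned dictionary as JSON to the channel's incoming webhook URL.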
Consider your options
Compliance as code should be a shared responsibility among the engineering teams inside an organization, not just the infrastructure teams. Internal tools, docs, and processes that enable efficient cooperation among those teams play a key role.
While we’ve discussed Cloud Custodian and the requirements we had in mind when looking for a flexible rules engine, you may have different environments and requirements. Indeed, it is worth spending some time evaluating the various options out there in order to find the tool that best suits your needs.