Cloud computing provides a huge number of benefits, from rapid access to a global network of data centres to a vast range of technological services: cloud computing has been designed for rapid scale. But personally, I find the biggest game-changing feature is the ability to reduce manual, error-prone human actions.
Don’t get me wrong, humans in the system are critically important. We have the knowledge and understanding of what needs to be solved, but I recognise that systems are much better at executing the plan.
All modern solutions are a complex mix of parts that should be moving and parts that shouldn’t. These solutions regularly consist of hundreds, if not thousands, of individual configuration items that must all be aligned and working for the system to do its job well. Systems are only going to increase in complexity, so we need to invest in approaches that support this growing complexity and ensure the configuration is applied correctly.
Automation for the win
If you’ve been working with automation to manage your configuration, you’ll recognise the relief of having confidence that what you have specified has been deployed in a consistent and repeatable manner. Right up until it isn’t any more.
We’ve all had that moment where something wasn’t working and we go and make just one change by hand. One change can’t hurt, right? And that’s probably true. But it opens the gates to more, and at some point in the future more things are being managed by hand, and the confidence in our automated configuration management is eroding.
This is something I have experienced multiple times, especially in shared cloud computing environments. While I and others work through our agreed GitOps workflow, reviewing changes and allowing the automation to handle deployment, it’s possible that not everyone operates the same way. After all, it’s only one little change. That can’t be bad, can it?
There are two ways we can stop this environment from forming: restrict people from making manual changes, and detect when they have.
It might be enticing to restrict access until people have passed a course or agreed to a set of standards, and even then only grant them limited access so they can’t change “important” services. But this can be a long and complex path. It causes a huge amount of friction both for the consumers of these policies, who are constantly banging into things they can’t do, and for the people developing the policies, who must keep abreast of the work being done and design appropriate controls.
Detection over prevention
I’m not saying prevention doesn’t have a place. For key critical infrastructure, like network configuration, or security tooling, it is good practice to restrict people from making changes where they have no need. But beyond the clear shared services, it becomes more complex. We WANT people to make changes to these other services, but we don’t want those changes done by hand.
This is the problem we’ve started exploring.
If everything in AWS is audited, we SHOULD be able to review all of the configuration changes, determine which were made manually via the console, and create alerts and reports to review why those changes were needed.
We need this solution to be near real-time, as a report from last week has already lost the context of why the action was performed. An alert within five minutes of the activity, clearly identifying who made the change, gives us a great entry point into understanding why. Is it that the change can’t be automated? Or is it just a lack of understanding and awareness on the part of the operator? We are not stopping the activity, but using the collaborative culture of the team to encourage a reduction in undesired activity.
Let’s start with the detection. Most API calls in AWS are audited through CloudTrail, and we classically think about accessing CloudTrail logs via the service directly, or via the trail exports stored in S3. While there is a pattern for reading those files via Athena, we are going to take a more cloud-native, event-driven approach.
Detecting changes via the Console
All CloudTrail API events are also delivered to EventBridge; we simply need to create a rule to filter for the events we are interested in.
Each service has a subtly different payload for its events, as parameters and return values differ, but there is a standard set of top-level fields that allows us to start narrowing down the actions we want to review further.
The two fields we are interested in are readOnly, which is false when the call changed something, and sessionCredentialFromConsole, which indicates the credentials came from an interactive console session.
This event rule filters only CloudTrail events where the call caused a change AND the credentials were from the console. While there are still some AWS services that don’t support CloudTrail, all of the core services, like EC2, S3, RDS, DynamoDB and Lambda, are audited, and these are the services that most solutions are composed of.
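As a sketch, an EventBridge event pattern implementing this filter could look something like the following. The field names come from the CloudTrail record format; the exact shape may need tuning against your own events:

```json
{
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "readOnly": [false],
    "sessionCredentialFromConsole": ["true"]
  }
}
```

Matching readOnly to false keeps only mutating calls, while sessionCredentialFromConsole narrows those down to calls made with console-sourced credentials.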
Storing and presenting the data
Now that we have the events, what do we do with them? Advanced storage and reporting solutions are a little beyond this post, so we’ll stick with something quick and simple to gain insight, and we can build upon this in a future iteration.
A pattern I have been using for a while to collect and present events such as those raised by CloudTrail is a combination of CloudWatch Logs, Logs Insights queries and dashboards.
By developing a simple Lambda function that logs the incoming event to CloudWatch Logs, we can then write a CloudWatch Logs Insights query to extract the fields from the JSON-formatted events and display the resulting information on a CloudWatch dashboard.
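A minimal sketch of such a function might look like this (the handler name and return value are my assumptions, not the repo’s actual code):

```python
import json


def handler(event, context):
    """Log the incoming EventBridge event verbatim.

    Lambda sends anything printed to stdout to CloudWatch Logs, so a
    CloudWatch Logs Insights query can later parse the JSON fields
    straight out of the log message.
    """
    message = json.dumps(event)
    print(message)
    return message
```

Because the event is logged as raw JSON, no parsing logic lives in the function itself; all of the field extraction happens later in the Insights query.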
The creation of the dashboard and insight query can be combined as follows:
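As an illustrative sketch (the resource name, log group path and dashboard name below are my own assumptions, not taken from the linked repo), a CloudFormation snippet embedding the Insights query in a dashboard log widget might look like:

```yaml
Resources:
  ManualChangeDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: manual-console-changes
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "log",
              "x": 0, "y": 0, "width": 24, "height": 12,
              "properties": {
                "region": "${AWS::Region}",
                "title": "Manual changes via the console",
                "view": "table",
                "query": "SOURCE '/aws/lambda/console-change-logger' | fields @timestamp, detail.userIdentity.arn as who, detail.eventSource as service, detail.eventName as action | sort @timestamp desc | limit 100"
              }
            }
          ]
        }
```

The query pulls the who, service and action fields out of the JSON the Lambda function logged, giving a sortable table of recent manual changes on the dashboard.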
Now anytime anyone undertakes a create, update or delete action manually via the console, it will be detected by our EventBridge rule, logged by Lambda and visible via a CloudWatch dashboard.
From this simple pattern you could easily add email alerting via SNS, add further filtering to the messages in the Lambda function or via EventBridge Pipes, or even manage dynamic configuration per service or action with a simple DynamoDB table; the sky’s the limit.
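For instance, the SNS alerting extension can be as small as a topic, a subscription and a policy letting EventBridge publish to it; the topic is then added as a second target on the existing rule. The resource names and email address here are hypothetical:

```yaml
Resources:
  ManualChangeAlerts:
    Type: AWS::SNS::Topic
  AlertSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref ManualChangeAlerts
      Protocol: email
      Endpoint: ops-team@example.com  # hypothetical address
  AlertsTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref ManualChangeAlerts
      PolicyDocument:
        Statement:
          # Allow EventBridge to publish matched events to the topic.
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sns:Publish
            Resource: !Ref ManualChangeAlerts
```

With the topic ARN appended to the rule’s Targets list, every matched console change also lands in the team’s inbox within minutes.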
One extension coming to this pattern is a multi-account deployment: utilising CloudFormation StackSets to deploy the EventBridge rule across the entire organisation, then relaying the events across accounts to a central audit location where they are logged, alerted on and visualised.
But all of that is worth nothing if you don’t get started. Attached here is a public GitHub repo containing a full CloudFormation template that you should be able to deploy to get started. With this starting point, you don’t need to spend thousands of dollars on third-party solutions to detect actions quickly and provide near real-time feedback, elevating the understanding and use of cloud automation.