To explore this, I recently started my AI journey by earning the AWS Certified AI Practitioner certification. This blog marks the beginning of a multi-part series where I’ll share my understanding and insights about AI in the context of DevOps.
In this first post, I’ll focus on the fundamentals of AI and Generative AI, explain what large language models (LLMs) are, and share some early examples of how these technologies can enhance our work as DevOps engineers.
What is Machine Learning?
AI isn’t a brand-new concept; the field dates back to the mid-20th century. Machine Learning (ML) systems are designed to analyse historical data and make predictions based on patterns. A simple example is weather forecasting: apps use past temperature trends and atmospheric data to predict tomorrow’s weather.
Now let’s look at some of the real-world use cases where ML is already making an impact in the DevSecOps space. While the AI hype has recently surged due to advancements in Generative AI, Machine Learning has quietly been powering several core DevSecOps functions for years. These use cases revolve around identifying trends, predicting issues and optimising systems – all based on historical and real-time data.
In the world of DevSecOps, Machine Learning has long been used for:
- Event prediction and management
Machine Learning models can analyse past incidents and performance metrics to forecast upcoming issues before they impact users.
- Example:
Using Amazon DevOps Guru, an e-commerce company might receive proactive alerts during seasonal traffic surges (e.g., Black Friday), warning of increased latency risks in the checkout microservice. Teams can scale infrastructure in advance or optimise DB queries to prevent downtime.
- Anomaly detection
Detecting irregular spikes in traffic, memory leaks, or CPU usage that deviate from normal patterns – often flagging potential problems before humans even notice.
- Example:
CloudWatch Anomaly Detection can detect unusual spikes in memory usage on an EC2 instance running a Python application. Instead of manually setting thresholds, it learns the pattern and triggers alerts only when real anomalies occur, reducing noise.
- Log analysis
Automating the parsing of thousands of logs to highlight critical errors, patterns, or trends, saving engineers hours of manual effort.
- Example:
With OpenSearch Service and ML plugins, AI can analyse application logs to detect recurring patterns like ConnectionTimeout errors following specific API calls, helping engineers narrow down the problem in seconds.
- Incident pattern recognition
Identifying recurring incidents across environments and suggesting automated responses or remediation plans.
- Example:
PagerDuty Event Intelligence may identify that a spike in database IOPS always precedes a memory bottleneck in a dependent service. It can then automatically escalate or trigger a Lambda function to scale out the affected service.
- Resource optimisation
Services like AWS Predictive Auto Scaling use AI to analyse workload patterns and proactively scale infrastructure to meet demand.
- Example:
AWS Predictive Auto Scaling reviews CloudWatch metrics from the past 14 days to forecast future usage and proactively increases EC2 capacity before the Monday morning traffic spike hits a marketing website.
This type of traditional AI usage is often referred to as AIOps, a powerful approach to automating and enhancing IT operations through data-driven insights.
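To make the anomaly-detection idea above concrete, here is a toy sketch in Python. It is not the algorithm CloudWatch Anomaly Detection actually uses; it simply flags points that stray far from a trailing baseline, which is the same basic idea of learning "normal" instead of hand-setting a fixed threshold:

```python
from statistics import mean, stdev

def detect_anomalies(series, window=5, k=3.0):
    """Flag points deviating more than k standard deviations from the
    trailing window's mean -- a toy stand-in for a learned baseline."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        # Skip flat history (sigma == 0) to avoid division-style noise
        if sigma and abs(series[i] - mu) > k * sigma:
            anomalies.append(i)
    return anomalies

cpu = [41, 40, 42, 39, 41, 40, 95, 41, 40]  # one obvious spike at index 6
print(detect_anomalies(cpu))  # → [6]
```

A real managed service learns seasonality and trends over weeks of metrics, but the payoff is the same: alerts fire on genuine deviations rather than on every crossing of a static threshold.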
What is Generative AI?
While traditional AI focuses on predictions, classification and pattern detection, Generative AI takes things a step further – it creates entirely new content such as text, images, code, audio or even music.
To illustrate this, a few days ago I asked ChatGPT to generate an image of a Labrador dog drinking coffee, essentially combining two of my favourite things into one visual. Within seconds, it produced a unique and realistic image of exactly that.
What’s remarkable is that this image wasn’t retrieved from a database or copied from somewhere online. Instead, it was generated entirely from scratch by the model, drawing on patterns and knowledge it has been exposed to across vast datasets. This is the power of Generative AI, the ability to create brand new, contextually relevant outputs based on natural language prompts.
In the DevOps world, this can be a game-changer.
Imagine you’re starting a new infrastructure project and need boilerplate IaC templates, CloudFormation scripts, or Terraform modules. Instead of searching Stack Overflow or official docs, you can simply ask an LLM (like ChatGPT or Claude) to generate secure, best-practice-based code in seconds.
Examples include:
- Creating a CloudFormation template for a private, encrypted S3 bucket with lifecycle policies
- Generating a Kubernetes deployment YAML for a containerised Node.js application
- Writing a Lambda function stub with logging and error handling
- Drafting a GitHub Actions pipeline to build and deploy a microservice
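As a small illustration of the Lambda-stub item above, here is the kind of Python handler such a prompt might return. The handler name and event fields are made-up placeholders, not anything from a specific project:

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    """Hypothetical Lambda stub with structured logging and error handling."""
    try:
        name = event.get("name", "world")
        logger.info("Processing event: %s", json.dumps(event))
        return {
            "statusCode": 200,
            "body": json.dumps({"message": f"Hello, {name}!"}),
        }
    except Exception:
        # Log the full traceback, but return a generic error to the caller
        logger.exception("Unhandled error while processing event")
        return {
            "statusCode": 500,
            "body": json.dumps({"message": "Internal error"}),
        }
```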
With the right prompt, Generative AI tools can save significant time. And beyond code generation, they can assist with:
- Creating runbooks or documentation
- Writing status page updates in human-friendly language
- Generating test cases or mock data
- Summarising incident reports from logs and metrics
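For the mock-data item in particular, the generated code is usually a small deterministic generator along these lines (the field names here are invented for illustration):

```python
import json
import random

def mock_orders(n, seed=42):
    """Generate deterministic mock order records for testing.
    A fixed seed keeps test runs reproducible."""
    rng = random.Random(seed)
    statuses = ["PENDING", "SHIPPED", "DELIVERED"]
    return [
        {
            "order_id": f"ORD-{i:04d}",
            "amount": round(rng.uniform(5, 500), 2),
            "status": rng.choice(statuses),
        }
        for i in range(n)
    ]

print(json.dumps(mock_orders(2), indent=2))
```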
In essence, it acts like a virtual DevOps assistant, available 24/7 to help you design, troubleshoot, automate and document.
Enter Large Language Models
At the heart of Generative AI are Large Language Models (LLMs) like GPT-4, LLaMA 3, and DeepSeek. These models contain billions of parameters and use neural network architectures to generate human-like content.
How do they work?
- LLMs are trained on massive datasets using supercomputers that contain thousands of GPUs or TPUs. This training process consumes a significant amount of energy – often compared to the electricity usage of hundreds or even thousands of homes. However, it’s important to note that this is a one-time cost per model. Once trained, the model can be used repeatedly with relatively low energy consumption per query.
- Once trained, the models don’t go online to search for answers – they generate responses based on what they’ve learned, much like how our brains recall knowledge.
For example, if someone asks you, “What is AWS?”, your brain doesn’t search Google – it pulls from your memory and experience. LLMs work the same way, using their trained parameters to respond to prompts.
Beyond static answers: Agentic AI and RAG
While base LLMs rely solely on their training, modern systems are evolving to become more dynamic and capable. Agentic AI refers to models that can autonomously plan, make decisions and execute tasks in a goal-driven way – for instance, conducting multi-step reasoning, calling APIs, or interacting with tools to complete complex workflows.
Meanwhile, Retrieval-Augmented Generation (RAG) enhances LLMs by connecting them to external knowledge sources, such as private documents or up-to-date databases. Instead of relying solely on internal memory, the model retrieves relevant information in real time and integrates it into its responses, providing more accurate, current and context-specific outputs.
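A toy sketch of the retrieval step can make RAG less abstract. Real pipelines use vector embeddings and similarity search; this minimal version ranks documents by simple word overlap and stuffs the best match into the prompt (the documents and query are invented examples):

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by word overlap with the query -- a toy
    stand-in for the vector search a real RAG pipeline would use."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query, documents):
    """Augment the prompt with retrieved context before sending it to the LLM."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Our checkout service runs on EKS in eu-west-1.",
    "The marketing site is a static S3 website behind CloudFront.",
]
print(build_prompt("Where does the checkout service run?", docs))
```

The key design point is that the model's answer is grounded in content fetched at query time, so it can reflect private or freshly updated information the model was never trained on.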
Together, these innovations are making Generative AI not just smarter, but more useful in real-world applications.
What is Prompt Engineering?
If LLMs are the engine, then prompt engineering is the steering wheel.
It’s the art of communicating effectively with LLMs to get high-quality, relevant results. Much like how you frame a question in a team meeting can affect the outcome, the way you ask LLMs for help determines the usefulness of the response.
For example, when I gave the following prompt to ChatGPT:
“Create an AWS CloudFormation template to create an S3 bucket.”
I got the following response:
</p>
<pre>AWSTemplateFormatVersion: '2010-09-09'
Description: Create an S3 Bucket
Resources:
  MyBasicS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-basic-s3-bucket-example
Outputs:
  BucketName:
    Description: Name of the S3 bucket
    Value: !Ref MyBasicS3Bucket
</pre>
<p>
But when I tried:
“Create a CloudFormation template to create an S3 bucket that is private, encrypted with SSE-S3, has a lifecycle policy to move objects to Glacier after 30 days, and blocks all public access.”
Then I got the following result:
</p>
<pre>AWSTemplateFormatVersion: '2010-09-09'
Description: Create a private S3 bucket with SSE-S3 encryption, versioning, a Glacier lifecycle rule and blocked public access
Resources:
  MyS3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-secure-s3-bucket-example
      VersioningConfiguration:
        Status: Enabled
      BucketEncryption:
        ServerSideEncryptionConfiguration:
          - ServerSideEncryptionByDefault:
              SSEAlgorithm: AES256
      LifecycleConfiguration:
        Rules:
          - Id: MoveToGlacierAfter30Days
            Status: Enabled
            Transitions:
              - StorageClass: GLACIER
                TransitionInDays: 30
      PublicAccessBlockConfiguration:
        BlockPublicAcls: true
        BlockPublicPolicy: true
        IgnorePublicAcls: true
        RestrictPublicBuckets: true
  BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref MyS3Bucket
      PolicyDocument:
        Statement:
          - Sid: AllowSSLRequestsOnly
            Effect: Deny
            Principal: "*"
            Action: "s3:*"
            Resource:
              - !GetAtt MyS3Bucket.Arn
              - !Sub "${MyS3Bucket.Arn}/*"
            Condition:
              Bool:
                "aws:SecureTransport": false
Outputs:
  BucketName:
    Description: Name of the created S3 bucket
    Value: !Ref MyS3Bucket</pre>
<p>
In the second prompt, I was being specific, and the result was much more aligned to what I actually needed.
Prompt engineering techniques include:
- Zero-shot prompting: Asking a question directly
- Few-shot prompting: Providing examples first
- Chain-of-thought prompting: Asking the model to explain its steps
- Keyword emphasis: Highlighting important words to guide the model
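To show how few-shot prompting works in practice, here is a minimal sketch of assembling such a prompt programmatically; the classification task and example pairs are invented for illustration:

```python
def few_shot_prompt(task, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input."""
    lines = [task, ""]
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
        lines.append("")
    # End with the new input and an empty Output: for the model to complete
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Classify the log line as ERROR or OK.",
    [("Connection refused to db-01", "ERROR"),
     ("Health check passed in 12ms", "OK")],
    "Timeout while calling payments API",
)
print(prompt)
```

The worked examples anchor the output format, so the model is far more likely to answer with a bare label instead of a paragraph of explanation.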
We’ll dive deeper into this in Part 2 of this blog series.
A Word of Caution: AI Can Hallucinate
While AI, especially Generative AI, can significantly accelerate tasks like infrastructure scaffolding, documentation and incident analysis, it’s important to understand that these models can hallucinate. This means they may generate responses that sound correct but are factually inaccurate, incomplete or insecure.
For example, an LLM might:
- Suggest deprecated or non-existent AWS features
- Generate insecure IAM policies
- Omit required configuration details
- Present syntax that looks valid but won’t pass validation
As DevOps engineers, we must treat verification as non-negotiable. Always review and validate:
- Infrastructure-as-Code templates
- Generated scripts or pipelines
- Architectural advice or configurations
Use tools like linters, validators (e.g., cfn-lint, tflint) and manual review before promoting AI-generated code into CI/CD or production environments.
Remember: AI can assist, but you remain the final decision-maker.
Final Thoughts
AI is not here to replace DevSecOps engineers. It’s here to augment our capabilities – to help us think faster, build better, and reduce toil. In this blog, we’ve explored the foundational concepts of AI and Generative AI, and how they apply to the DevOps world. As we move forward, I’ll explore hands-on tools, real-world use cases, and how to practically implement AI to level up your DevOps game.
Stay tuned for Part 2: Prompt Engineering for DevOps Engineers.