Many businesses are moving to AWS for its flexibility, speed, and cost benefits. But with so many options, it’s easy to feel overwhelmed or miss important details.
That’s where the AWS Well-Architected Framework (WAFR) helps. It gives a clear set of best practices to build secure, reliable, and cost-effective systems. We use it to help our clients build smarter, not just bigger.
Before vs After: Why the Well-Architected Framework (WAFR) Mattered
Let’s imagine you run a fast-growing SaaS startup called FBasket, a grocery delivery platform operating across India.
Your dev team is shipping features fast, and your AWS bill hit $35,000 last month. Suddenly, your CTO is being grilled in a board meeting:
"Why are we spending so much?"
"Are we secure enough to onboard enterprise clients?"
"What’s our backup plan if the Mumbai region goes down?"
You realize your cloud infrastructure is a bit like a messy kitchen: everything works, but no one knows where the ingredients are, some appliances are always on (even when unused), and you're winging it every night.
When Reality Hits Hard: What We Found When We Audited the Current Architecture
Problem | Real-World Analogy | What It Looked Like |
---|---|---|
No clear tagging policy | Sticky notes missing from food jars | Developers couldn't tell who launched which EC2 instance, leading to orphaned resources |
Overly permissive IAM roles | Everyone has keys to the safe | Interns had access to production S3 buckets (scary) |
No cost governance | Grocery bills piling up | Idle RDS instances, underutilized EC2 instances, and high data transfer charges (the backbone of the huge bills) |
No resilience (HA) plan | One kitchen, no backup stove | If the Mumbai region went down, so did the whole app (Humpty Dumpty had a great fall) |
Manual provisioning | Chopping veggies from scratch every night | Every infra change was made by hand, leading to errors and no audit trail (worrisome) |
But we had not lost hope. They had put their faith in us, and we were determined to get them out of this mess.
Over about four weeks, we used the AWS Well-Architected Framework to help them identify the mistakes that were inflating the bill.
Solution | What We Did | Outcome |
---|---|---|
Implemented tagging enforcement | AWS Config + Terraform to enforce | Clean billing reports, ownership clarity |
IAM least privilege model | Scoped IAM roles, MFA, and audit logging | No more over-permissioned users, improved security posture |
Introduced auto-scheduling | Scheduled dev EC2 and RDS to stop after 10 PM | Possible saving of ₹35,000/month |
Set up cross-region backup | Deployed cross-region S3 replication + RDS snapshots | Reduced RTO to <15 mins |
IaC with Terraform | Replaced manual setup with Terraform modules and GitOps via Atlantis | Reliable, repeatable infra with audit trails |
Resizing instances | Moved EC2 workloads from Intel-based instances to Graviton-based instances | Cut EC2 costs by at least 18% compared with x86-based instances |
That calmed the chaos and left them with a cost-effective, maintainable setup, which also settled the nerves of senior management.
The Pillars of the WAFR and Why They Matter
Operational Excellence
Organize small teams around modules of the business, each owning its own part, so that their work adds up to a more sustainable and efficient business outcome.
Use observability to get a top-level view of what's going on, so you can make informed decisions and act promptly when business outcomes are at risk.
Reduce the operational burden by using AWS managed services wherever possible.
Security
It covers the ability to protect data, systems, and assets while taking advantage of AWS to improve security.
Follow the principle of least privilege for all users, so that nobody has access to what they don't need. The classic counterexample: interns with access to prod S3 buckets.
Tag all resources properly so you can track who created each one and who maintains it.
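As a minimal sketch of least privilege in Terraform, the policy below grants read-only access to a single development bucket and nothing else. The bucket name `fbasket-dev-assets` and the policy name are hypothetical placeholders, not taken from the audited setup.

```hcl
# Hypothetical example: read-only access to one dev bucket (all names are placeholders)
data "aws_iam_policy_document" "dev_bucket_read_only" {
  # Listing is granted on the bucket itself
  statement {
    sid       = "ListDevBucket"
    effect    = "Allow"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::fbasket-dev-assets"]
  }

  # Reading is granted on the objects inside it
  statement {
    sid       = "ReadDevObjects"
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["arn:aws:s3:::fbasket-dev-assets/*"]
  }
}

resource "aws_iam_policy" "dev_bucket_read_only" {
  name   = "dev-bucket-read-only"
  policy = data.aws_iam_policy_document.dev_bucket_read_only.json
}
```

Attaching a policy like this to an interns group, instead of a broad S3 policy, keeps production buckets out of reach by default.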
Reliability
It deals with the ability of a workload to perform its intended function correctly and consistently when it is expected to.
Prefer multiple smaller resources over one large resource to reduce the impact of a single point of failure (SPOF).
Test your disaster recovery strategies by simulating failover scenarios, both ones that have happened before and ones that could happen in the worst case. This exposes the weak spots in the DR plan so they can be fixed before a real outage.
Cost Optimization
It deals with the ability to run systems that deliver business value at the lowest price point.
Adopt a consumption model that fits your usage. If a workload is going to run for the long term, commit to a Savings Plan to cut costs substantially.
Schedule resources to go offline on weekends and during off-hours on weekdays to avoid paying for capacity nobody is using.
Sustainability
It focuses on the environmental impact of running workloads, especially energy consumption and efficiency.
Right-sizing instances plays a critical role in meeting sustainability goals: minimizing idle resources, processing, and storage reduces the total energy required to power your workload.
Use managed services to minimize that impact.
For example, use S3 lifecycle rules to move data to an Infrequent Access tier, if that fits within your budget constraints.
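Moving data to an infrequent access tier can be sketched in Terraform roughly as follows. The bucket, its name, and the 30-day threshold are assumptions chosen for illustration; pick values that match your own access patterns.

```hcl
# Hypothetical bucket used only for this example
resource "aws_s3_bucket" "app_data" {
  bucket = "fbasket-app-data"
}

resource "aws_s3_bucket_lifecycle_configuration" "app_data" {
  bucket = aws_s3_bucket.app_data.id

  rule {
    id     = "move-to-infrequent-access"
    status = "Enabled"

    # Empty filter: the rule applies to every object in the bucket
    filter {}

    # Transition objects to Standard-IA 30 days after creation
    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }
  }
}
```

Standard-IA charges for retrieval, so this pays off only for data that really is read rarely.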
Performance Efficiency
It is the ability to use cloud resources efficiently to meet performance requirements.
Use serverless architectures for short-lived processes, which removes the need to maintain servers for them.
Go global with your workloads by using multiple AWS Regions, giving customers lower latency and a better experience at minimal cost.
Terraform to the Rescue: Snippets We Used
Prerequisites:
Basic understanding of Terraform
An AWS provider already configured in Terraform
Terraform to Configure AWS to Periodically Start an EC2 Instance
```hcl
resource "aws_scheduler_schedule" "my_scheduler" {
  name       = "my-scheduler"
  group_name = "default"

  flexible_time_window {
    mode = "OFF"
  }

  # Run at 09:00 every Monday (UTC unless schedule_expression_timezone is set)
  schedule_expression = "cron(0 9 ? * MON *)"

  target {
    # Send the event to the EC2 API and trigger the startInstances action
    arn      = "arn:aws:scheduler:::aws-sdk:ec2:startInstances"
    role_arn = aws_iam_role.my_role.arn

    # This payload is passed to the startInstances API
    input = jsonencode({
      InstanceIds = [
        aws_instance.my_ec2.id
      ]
    })
  }
}
```
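The schedule above references `aws_iam_role.my_role` without defining it. A minimal sketch of that role, plus a matching schedule that stops the instance at 10 PM, might look like the following. The role and schedule names and the `Asia/Kolkata` timezone are assumptions for illustration; `aws_instance.my_ec2` is the same instance referenced above.

```hcl
# Trust policy: allow EventBridge Scheduler to assume the role
data "aws_iam_policy_document" "scheduler_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["scheduler.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "my_role" {
  name               = "ec2-scheduler-role" # hypothetical name
  assume_role_policy = data.aws_iam_policy_document.scheduler_assume.json
}

# Permissions: start and stop only the one instance being scheduled
data "aws_iam_policy_document" "scheduler_ec2" {
  statement {
    effect    = "Allow"
    actions   = ["ec2:StartInstances", "ec2:StopInstances"]
    resources = [aws_instance.my_ec2.arn]
  }
}

resource "aws_iam_role_policy" "scheduler_ec2" {
  name   = "start-stop-dev-instance"
  role   = aws_iam_role.my_role.id
  policy = data.aws_iam_policy_document.scheduler_ec2.json
}

# Counterpart schedule: stop the instance every day at 22:00 local time
resource "aws_scheduler_schedule" "stop_dev_at_night" {
  name       = "stop-dev-at-night"
  group_name = "default"

  flexible_time_window {
    mode = "OFF"
  }

  schedule_expression          = "cron(0 22 * * ? *)"
  schedule_expression_timezone = "Asia/Kolkata" # assumption: team works in IST

  target {
    arn      = "arn:aws:scheduler:::aws-sdk:ec2:stopInstances"
    role_arn = aws_iam_role.my_role.arn

    input = jsonencode({
      InstanceIds = [aws_instance.my_ec2.id]
    })
  }
}
```

Scoping the role to a single instance ARN keeps the scheduler itself in line with least privilege.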
Terraform to Configure and Enforce the Tags on the Created Resources
Create the SNS Topic:
After applying, confirm the subscription from the email AWS sends to the given address; notifications are not delivered until you do.
```hcl
resource "aws_sns_topic" "config_alerts" {
  name = "config-noncompliance-alerts"
}

resource "aws_sns_topic_subscription" "email_alert" {
  topic_arn = aws_sns_topic.config_alerts.arn
  protocol  = "email"
  endpoint  = "[email protected]" # Replace with your email
}
```
Add the AWS Config Rule
```hcl
# Note: AWS Config rules need an AWS Config configuration recorder enabled in the account
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  # Parameter names for the managed rule start with a lowercase "t"
  input_parameters = jsonencode({
    tag1Key = "Owner"
    tag2Key = "Project"
    tag3Key = "Environment"
  })

  scope {
    compliance_resource_types = ["AWS::EC2::Instance"]
  }

  depends_on = [aws_sns_topic_subscription.email_alert]
}
```
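AWS Config flags missing tags after a resource already exists. As a complementary sketch, the AWS provider's `default_tags` block applies the same three tag keys to most taggable resources Terraform creates, so they start out compliant. The region and tag values here are hypothetical.

```hcl
provider "aws" {
  region = "ap-south-1" # Mumbai; adjust to your region

  # Applied automatically to resources that support provider-level default tags
  default_tags {
    tags = {
      Owner       = "platform-team" # hypothetical values
      Project     = "fbasket"
      Environment = "dev"
    }
  }
}
```

Tags set on an individual resource override the defaults with the same key.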
CloudWatch Event Rule to Detect Non-Compliance
```hcl
resource "aws_cloudwatch_event_rule" "config_noncompliant" {
  name        = "config-rule-noncompliance"
  description = "Trigger when AWS Config rule becomes NON_COMPLIANT"

  event_pattern = jsonencode({
    source        = ["aws.config"]
    "detail-type" = ["Config Rules Compliance Change"]
    detail = {
      newEvaluationResult = {
        complianceType = ["NON_COMPLIANT"]
      }
      configRuleName = [aws_config_config_rule.required_tags.name]
    }
  })
}
```
Add the CloudWatch Event Target
```hcl
resource "aws_cloudwatch_event_target" "sns_target" {
  rule      = aws_cloudwatch_event_rule.config_noncompliant.name
  target_id = "send-to-sns"
  arn       = aws_sns_topic.config_alerts.arn
}
```
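For the event target to deliver anything, the SNS topic also has to allow EventBridge to publish to it. The snippets above do not show that step, so the following is a sketch of one way to grant it; the data source and resource names are hypothetical.

```hcl
# Allow the EventBridge service to publish to the alerts topic
data "aws_iam_policy_document" "allow_eventbridge_publish" {
  statement {
    sid       = "AllowEventBridgePublish"
    effect    = "Allow"
    actions   = ["sns:Publish"]
    resources = [aws_sns_topic.config_alerts.arn]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}

resource "aws_sns_topic_policy" "config_alerts" {
  arn    = aws_sns_topic.config_alerts.arn
  policy = data.aws_iam_policy_document.allow_eventbridge_publish.json
}
```

Setting a topic policy replaces the topic's default policy, so add any other statements you rely on to the same document.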
Terraform Script for AWS Monthly Budgets
```hcl
# Define a monthly cost budget using AWS Budgets
resource "aws_budgets_budget" "monthly_budget" {
  name         = "monthly-cost-budget"
  budget_type  = "COST"
  limit_amount = "2000"
  limit_unit   = "INR"
  time_unit    = "MONTHLY"

  # Cost types that count toward the budget
  cost_types {
    include_credit             = true
    include_discount           = true
    include_other_subscription = true
    include_recurring          = true
    include_refund             = true
    include_subscription       = true
    include_support            = true
    include_tax                = true
    include_upfront            = true
  }

  # Notify by email when actual spend exceeds 90% of the limit
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 90
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["[email protected]"] # must be a list; replace with your email
  }
}
```
Conclusion
The AWS Well-Architected Framework gave us a clear path to build smarter, run safer, and spend better. It’s like having a blueprint that grows with your business.
If you're building on AWS, don’t just build; build it well.
EzyInfra.dev – Expert DevOps & Infrastructure consulting! We help you set up, optimize, and manage cloud (AWS, GCP) and Kubernetes infrastructure—efficiently and cost-effectively. Need a strategy? Get a free consultation now!