visit
This blog post is a written transcript of the FOSDEM Talk: “Infrastructure drifts aren’t like Pokemon, you can’t catch ’em all”, by Stephane Jourdan – CTO and founder
This talk covers three major topics:
Infrastructure as Code: all the good intentions and the ideal world each of us expected when we started using it, and how it’s actually going in everyday’s Ops life. We will see that how it started is probably different from how it is going and from what we expected.We will then “drift” together, using Terraform and AWS and share some war stories that we heard from infrastructure teams, and how things sometimes went really wrong for them. We will finally present driftctl, our open source answer to infrastructure drift problems.How it all started
When it all started, we had this vision of an ideal world, and a bright future that Infrastructure as Code was to offer us, like Plato’s Republic was for society a while ago.How it is actually going
Marcus Aurelius, the roman emperor a few centuries after Plato wrote to himself: “Do not expect Plato’s ideal Republic”. As it happens we are not Roman emperors but our ideal Republic is not to be expected either. How it started is not exactly how we see it going.
In our team, many of us were quite experienced but we wanted more feedback on what we were building. So we just asked hundreds of infrastructure teams, SREs… etc around the world where they were standing in their Infrastructure as Code journey. Basically what they told us was that changes were still happening outside of their Infrastructure as Code: mainly manual changes and scripts that are changing tags, or lambdas updating things after deployment, etc… A lot of changes are still happening outside of their Infrastructure as Code scope, be it Terraform or Pulumi or whatever infrastructure tool they use, it doesn’t matter.Do not expect Plato's ideal Republic - Marcus Aurelius
A bunch of other things go wrong for them as well.
The scope of their changes is not very often fully committed because of broken processes. People are still changing things manually on their AWS web console and then backporting the changes in Terraform until Terraform starts “screaming”, which means that people are still running Terraform locally with the right access keys on their own laptop. Terrible processes… and everybody knows that once you click everywhere on your console, a lot of other things are deployed on the background and so your Terraform code doesn’t reflect exactly what’s actually deployed online.Many people we talked to were aware of the good practices about testing requirements. They knew frameworks like Terratest, etc… but most of them never had the time to either learn or write those tests. Sometimes they were just too costly to run for them. The lack of communication obviously makes things worse: even in these Covid19 times, “Async” communication is still a major issue between teams.Immutable infrastructure also is still a great struggle to achieve. A lot of the new projects are running more or less around this principle, but people also still have legacy to deal with, like thousands of VMs running every day.Lots of people are still gaining console access. And guess what ? People with AWS console access tend to change things unless you lock them down completely, which is sometimes just not possible because they are your boss or your customer, or because they are “this very special co-worker” who needs to do “that special thing” manually on the web console… That’s where problems happen.
Many people are also facing skill issues. For example, if a full team very skilled on GCP has a new project that for some reason needs to run on Azure, as all the specifics of this provider are not immediately mastered, they often end up with bad practices in their Infrastructure as Code deployments.And basically for lots of them, everything that has an API is still not under infrastructure code: we heard terrible stories about GitHub accesses, or DNS records or anything that can be used with Terraform that wasn’t yet by far deployed with it.To deal with it, most of the people that we met put something in place like
terraform plan
and apply
in a cron tab or in CI/CD somewhere before and after some commits, but were still missing things.$ terraform apply
Apply complete! Resources: 0 added, 0 changed, 0 destroyed
Many of us expected
terraform plan
in cron tabs or in CI/CD to report changes because some of the resources were managed by it, but unfortunately as we are going to see, it is not always the case.How an intern with read-only access ended up with rogue administrative access and keys (without anyone noticing)
The first story is about an intern with read-only access, ending up with access keys controlled by no one. Those keys were linked to an administrative policy, which nobody noticed. To gain a better understanding of this story, let’s look at this very simple Terraform configuration file:We have a user with a key and a policy attachment for read-only.
Let’s apply this configuration:
$ terraform apply
Apply complete! Resources: 9 added, 0 changed, 0 destroyed
What do you think happened? Let’s check with a
terraform apply
on this user. $ terraform apply
Apply complete! Resources: 0 added, 0 changed, 0 destroyed
What do you think Terraform found in this case with a
terraform apply
, as we have a policy attached to the IAM user and everything is managed by terraform? This team was sure that this was to be reported in
terraform plan
or terraform apply.
But it wasn’t the case because basically it’s out of the control of Terraform. It’s just an attached resource, it doesn’t work like that.$ terraform apply
Apply complete! Resources: 0 added, 0 changed, 0 destroyed
How an intern with rogue Administrative access opened everything to anyone on IPv4 and IPv6 (still without anyone noticing)
The next story starts with a process: as you know, you can have lambdas, scripts etcetera that have administrative access to some resources on a very limited scope. In this case someone with an administrative access just opened everything to anyone on IPv4 and IPv6 and no one noticed anything. So how did it go? Let’s go into the console > VPC > us-east-1 > Security groupsFor the sake of simplicity we are doing this demo manually here, but someone or something could change this specific super secure security group and could allow anything anywhere on any kind of networks on IPv4 etc… and name it very, very carefully “temp” (you know, like those temporary rules that we’ll remember soon to change…)
This team expected any kind of rules that was added or modified to be reported on terraform and while it is true for a rule it is not true for a new rule and basically the support team went weeks or months with such a rule live on their system without any notification.
$ terraform apply
Apply complete! Resources: 9 added, 0 changed, 0 destroyed
How a scripting issue created a billing nightmare
Our last story is about a team that was building a lot of stuff on S3. They were putting a lot of their binaries there and removing them. So obviously they removed the versioning.Let’s check with a
terraform apply
what was reported: $ terraform apply
Apply complete! Resources: 9 added, 0 changed, 0 destroyed
Let’s just use the same terminal and the same session that we had previously. drift control will basically report you all the changes that you could not see with
terraform plan
or terraform apply
. $ driftctl scan --from tfstate://./s3/terraform.tfstate --from tfstate://./vpc/terraform.tfstate --from tfstate://./iam/terraform.tfstate
Terraform provider initialized (name=aws, alias=us-east-1)
Terraform provider initialized (name=aws, alias=eu-west-3)
Terraform provider initialized (name=aws, alias=eu-north-1)
Found deleted resources:
- Table: rtb-0c5a9bb483ee6eaaf, Destination: 10.0.1.0/24
Found unmanaged resources:
aws_security_group_rule:
- Type: ingress, SecurityGroup: sg-0746ca93f95fcc338, Protocol: All, Ports: All, Source: ::/0
- Type: ingress, SecurityGroup: sg-0746ca93f95fcc338, Protocol: All, Ports: All, Source: 0.0.0.0/0
aws_iam_access_key:
- AKIASBXWQ3AYRSNQN5EY (User: INTERN-8mszmf)
Found drifted resources:
- AKIASBXWQ3AYWWYX72GX (User: INTERN-8mszmf) (aws_iam_access_key):
~ Status: "Active" => "Inactive" (computed)
Found 26 resource(s)
- 84% coverage
- 22 covered by IaC
- 3 not covered by IaC
- 1 deleted on cloud provider
- 1/22 drifted from IaC
Now, we can see the rules that allowed “anything to anywhere” on IPv4 and IPv6 and we know that there are rules that were not expected.
We also have the policy attachment of that intern we talked about.
$ driftctl scan --from tfstate://./s3/terraform.tfstate --from tfstate://./vpc/terraform.tfstate --from tfstate://./iam/terraform.tfstate
Terraform provider initialized (name=aws, alias=us-east-1)
Terraform provider initialized (name=aws, alias=eu-west-3)
Terraform provider initialized (name=aws, alias=eu-north-1)
Found deleted resources:
aws_route:
- Table: rtb-0c5a9bb483ee6eaaf, Destination: 10.0.1.0/24
Found unmanaged resources:
aws_s3_bucket:
- fosdem2031
aws_security_group_rule:
- Type: ingress, SecurityGroup: sg-0746ca93f95fcc338, Protocol: All, Ports: All, Source: ::/0
- Type: ingress, SecurityGroup: sg-0746ca93f95fcc338, Protocol: All, Ports: All, Source: 0.0.0.0/0
aws_iam_access_key:
- AKIASBXWQ3AYRSNQN5EY (User: INTERN-8mszmf)
Found drifted resources:
- AKIASBXWQ3AYWWYX72GX (User: INTERN-8mszmf) (aws_iam_access_key):
~ Status: "Active" => "Inactive" (computed)
Found 27 resource(s)
- 81% coverage
- 22 covered by IaC
- 4 not covered by IaC
- 1 deleted on cloud provider
- 1/22 drifted from IaC
$ driftctl scan --from tfstate://./s3/terraform.tfstate --from tfstate://./vpc/terraform.tfstate --from tfstate://./iam/terraform.tfstate
Terraform provider initialized (name=aws, alias=us-east-1)
Terraform provider initialized (name=aws, alias=eu-west-3)
Terraform provider initialized (name=aws, alias=eu-north-1)
Found deleted resources:
aws_route:
- Table: rtb-0c5a9bb483ee6eaaf, Destination: 10.0.1.0/24
Found unmanaged resources:
aws_s3_bucket:
- fosdem2031
aws_security_group_rule:
- Type: ingress, SecurityGroup: sg-0746ca93f95fcc338, Protocol: All, Ports: All, Source: ::/0
- Type: ingress, SecurityGroup: sg-0746ca93f95fcc338, Protocol: All, Ports: All, Source: 0.0.0.0/0
aws_iam_access_key:
- AKIASBXWQ3AYRSNQN5EY (User: INTERN-8mszmf)
Found drifted resources:
- AKIASBXWQ3AYWWYX72GX (User: INTERN-8mszmf) (aws_iam_access_key):
~ Status: "Active" => "Inactive" (computed)
aws_S3_bucket:
-fosdem2031
Found 14 resource(s)
- 50% coverage
- 7 covered by IaC
- 4 not covered by IaC
- 1 deleted on cloud provider
- 1/14 drifted from IaC
$ driftctl scan --filter "Type=='aws_s3_bucket'"
$ driftctl scan --from tfstate://./s3/terraform.tfstate --from tfstate://./vpc/terraform.tfstate --from tfstate://./iam/terraform.tfstate --filter "Type=='aws_s3_bucket'"
Terraform provider initialized (name=aws, alias=us-east-1)
Terraform provider initialized (name=aws, alias=eu-west-3)
Terraform provider initialized (name=aws, alias=eu-north-1)
Found unmanaged resources:
aws_s3_bucket:
– fosdem2031
Found 2 resource(s)
– 50% coverage
– 1 covered by IaC
– 1 not covered by IaC
– 0 deleted on cloud provider
– 0/1 drifted from IaC
Thank you very much for reading through all this talk transcript.
If you wish to gain a better understanding of driftctl, please visit our or our . We also have a series of short that you can watch to get started in minutes.