What does a Platform (DevOps) Engineer do? 💻🤔
The following is a technical review of a series of issues that have taken over my life for the last week or two. Please feel free to ask questions.
Terraform Provider Lock
Seemingly out of nowhere, our Azure pipelines stopped working. They would make it all the way to the apply stage, but then a Terraform provider lock resulted in the following error:
│ Error: Inconsistent dependency lock file
│ The given plan file was created with a different set of external dependency
│ selections than the current configuration. A saved plan can be applied only
│ to the same configuration it was created from.
│
│ Create a new plan from the updated configuration.
This took all day to figure out. No changes were made to the pipeline, and I am still unsure how this started. The issue was a block of code in the azure-pipeline.yaml file that locks the Terraform provider (azurerm, azuread) versions in place. Somehow those versions were showing up differently between the plan stage and the apply stage. We only figured this out because someone put the Azure pipeline and the error message into Sage AI, and it pointed to that specific block of code.
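For anyone who hits the same wall, here is a rough sketch (not our exact pipeline) of how you can regenerate the dependency lock file locally so the plan and apply stages see identical provider selections; the platform names are just examples:
terraform providers lock -platform=linux_amd64 -platform=windows_amd64   # refresh .terraform.lock.hcl for the agent platforms
git add .terraform.lock.hcl
git commit -m "Refresh Terraform provider lock file"
terraform plan -out=tfplan      # create a fresh plan from the updated configuration
terraform apply tfplan          # apply the exact plan that was just created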
Version Upgrades
One of the potential fixes was a version upgrade of Terraform itself as well as the two main provider plugins, azurerm and azuread. In our system, this can be a lengthy process:
  1. Find every external module and create a feature branch for it.
  2. Set the new provider versions in that module and in any of its sub-modules.
  3. Back in the main repo, point the source of the call to that module at the feature branch.
  4. Still in the main repo, update the version numbers there and in any sub-modules housed inside it. (Note: in Terraform, a module can live inside the same repo or in an external one.)
  5. Run a terraform plan. This checks your code against the current infrastructure and tells you whether anything needs to change according to your state file and the code.
  6. If that all checks out, run it through the pipeline and, hopefully, all goes well.
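As a rough illustration of the per-repo check in that list, assuming you run it locally against the same backend the pipeline uses, the loop looks something like this:
terraform init -upgrade                 # pull the newly allowed azurerm/azuread versions
terraform providers                     # confirm which versions each module actually resolved
terraform plan -out=upgrade.tfplan      # diff the upgraded code against the current state
terraform show upgrade.tfplan           # review what would change before it goes near the pipeline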
(Another note: through this process I also found out that when AzureRM hits 5.0, certain resource attributes will be fully deprecated. From experience, this is bad if you don't catch it early, because it is the kind of thing that can "corrupt" your state file, meaning the newer version of Terraform would no longer be able to read the old state file. You then have to manually download the state file, update it, and push it back.)
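If you do get caught by it, the manual state fix I'm describing looks roughly like this (a sketch only, and keep a backup before touching anything):
terraform state pull > current.tfstate            # download the live state file
cp current.tfstate current.tfstate.backup         # keep an untouched copy
# hand-edit current.tfstate to fix the deprecated attributes
terraform state push current.tfstate              # push the corrected state back up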
Password Updates
Things went well, but the version update seemed to change the way a resource block read the expiration date. In Terraform, if you don't have certain attributes ignored or otherwise accounted for, Terraform may decide the resource needs to be recreated. In this scenario, because the expiration date changed, Terraform decided to regenerate password values. I spoke with someone on the application team who confirmed my suspicion that this should be okay, since the applications should be able to re-read the password. The password changes were approved, and in the site I was testing in, everything went well. When this series of fixes got rolled out to the rest of prod, however, k8s did not react the same way. Pods went into CrashLoopBackOff (CLBO), which should have caused them to restart and re-read the password from App Config (an application that runs and houses all the configs for the other applications).
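One way to nudge things along in that situation is to force the affected workloads to restart so they re-read the rotated password from App Config. A minimal sketch, with hypothetical namespace and deployment names:
kubectl -n payments rollout restart deployment/payments-api    # hypothetical names; new pods come up and re-read config
kubectl -n payments rollout status deployment/payments-api     # wait for the rollout to settle
kubectl -n payments get pods -w                                # watch for the CrashLoopBackOff to clear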
Migration from ClusterSecretStore to External Secrets
There was a ticket to migrate our Dev environment k8s setup to use External Secrets instead of a ClusterSecretStore. I was not involved in whatever meeting led to this decision. The work to do that was performed, but in doing so, two new issues were created:
  1. More managed identities were created, which extended the runtime of the Dev pipeline past an hour. In all fairness, it was already close; this just pushed it over.
  2. The developer environments that are set up to read from Dev broke, because developer environments were not accounted for in the migration.
To get past issue 1, I had to manually update the Azure pipeline YAML and add a timeout to allow the pipeline to run longer than an hour. That makes the pipeline runnable for now, but I already have plans to extract the Dev portion of the repo and have it run separately. I'll probably need to write a little Bash to remove resources from one state and add them to another.
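The Bash I have in mind is essentially a wrapper around terraform state mv with two local state files; the module addresses below are hypothetical, just to show the shape of it:
terraform state pull > main.tfstate                                               # grab the current combined state
for addr in module.dev_cluster module.dev_identities; do                          # hypothetical module addresses
  terraform state mv -state=main.tfstate -state-out=dev.tfstate "$addr" "$addr"   # move each one into the new Dev state
done
# each file would then be pushed back to its own backend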
As for issue 2, the decision was made to roll back the migration. I believe I had the Argo update needed to make the developer environments work, but the call was to just roll back for now. To do that, I had to track down each PR that was made in relation to the migration and combine them all into one git revert. I pushed that through as a PR, and that fixed the Dev environment.
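For anyone curious, combining several merged PRs into a single revert looks roughly like this (the commit hashes are placeholders for the real merge commits):
git checkout -b revert-external-secrets-migration
git revert --no-commit -m 1 <merge-commit-1>        # -m 1 because these are PR merge commits
git revert --no-commit -m 1 <merge-commit-2>
git revert --no-commit -m 1 <merge-commit-3>
git commit -m "Revert External Secrets migration"   # one combined revert commit
git push origin revert-external-secrets-migration   # then open it as a single PR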
RCAs
And all that lands us here: the dreaded RCA (Root Cause Analysis). An RCA is simply a documented analysis of exactly what happened to cause the series of events. It is used to develop better policies and procedures to help keep nightmares like whatever just happened from happening again. On the surface, everyone likes to pretend the purpose has nothing to do with finger-pointing.
Then the "post mortem" call happens. This is where the indirect lashing happens. This is where people ask WHY wrong decisions were made and highlight anything else that is obvious and not helpful in the moment. So now, for the next week or so, a lot of my day will be investigations, gathering PR links and messages together to form a timeline.