morrigan technologies

What We Talk About When We Talk About Code

DevOps programming with Go and LLMs.

Problem definition and backstory

Long ago, I joined what was the Android Apps (later Google Play) team as a software release engineer. The crux of this position was deploying new versions of the store and its billing infrastructure, along with automating those deployments and making them self-service features for software developers.

When I joined, the builds were automated, but deployments still involved manual steps for communicating current status. There was a Google Sheets spreadsheet that stored a list of deployments; updating it by hand was perhaps step 12 of 15 in a runbook document. This would not be the last time I saw a process that relied on manual deployment notifications or logs.

Manual engineering, and particularly heroic manual engineering, can be a heady thing. If a system is complex and difficult, clearly the person running it is both critical and necessary. If not for them, wouldn’t the system crash? Wouldn’t the deployments fail? Wouldn’t the alerts fire and scream to the heavens?

For the last decade and a half I’ve tried to move away from heroic engineering and toward things that just work. If you want to gussy it up with motivational slogans, call it “The Silent Service.” Maybe a tattoo-style T-shirt logo? A bomber jacket?1

One of the more effective things in early Facebook release engineering, where I had just come from, was ch1rpbot. This was a chatbot on the (pre-Slack) IRC channel that let people know if their changes were in the build, let developers claim ownership of their changes, and notified everyone of the current stage of deployment. It handled the routine communication and nagging of developers, a position which had previously required all of the sweat and many of the gray hairs of at least one release engineer. ch1rp was my second favorite bit of internal software at Facebook.2

A single production change log, along with automated email or chat messages, can both provide visibility and aid incident triage. In a software system, the question after “what’s broken?” is usually “what just changed?” Deployments, flags, and changes to infrastructure configs are the top suspects, and a quick dashboard can speed such diagnoses, or at least rule out contributing factors. This idea perhaps first appeared as “sparklines”3 in John Allspaw’s metrics at Flickr and Etsy circa 2010, inspired by Edward Tufte’s work, and it baffles me that it’s not standard in most observability products.
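To make the idea concrete, here is a minimal sketch of what a single change-log record might look like in Go. The `ChangeEvent` type and its field names are illustrative assumptions, not the schema of any system mentioned here; the point is that deployments, flag flips, and infra config changes can all share one append-only schema that a dashboard queries during triage.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// ChangeEvent is a hypothetical record for a production change log.
// Deployments, flag flips, and infra config changes share one schema,
// so a dashboard can answer "what just changed?" with a single query.
type ChangeEvent struct {
	Timestamp time.Time `json:"timestamp"`
	Type      string    `json:"type"` // e.g. "deploy", "flag", "config"
	Service   string    `json:"service"`
	Actor     string    `json:"actor"` // human or automation that made the change
	Summary   string    `json:"summary"`
}

// marshalEvent renders an event as the JSON a chat or email notifier
// (or a central ingestion API) would consume.
func marshalEvent(ev ChangeEvent) (string, error) {
	b, err := json.Marshal(ev)
	return string(b), err
}

func main() {
	s, _ := marshalEvent(ChangeEvent{
		Timestamp: time.Date(2024, 1, 2, 15, 4, 5, 0, time.UTC),
		Type:      "deploy",
		Service:   "storefront",
		Actor:     "release-bot",
		Summary:   "rolled out build 1234 to canary",
	})
	fmt.Println(s)
}
```

One flat record per change keeps ingestion trivial and leaves the “sparkline” view as a simple time-bucketed query over `type` and `service`.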

At Google Play, I solved this using an internal logging system called Herodotus, which, if memory serves, was a BigTable-backed database with an ingestion API and a lightweight front end. It was perfect for storing a relatively low volume of messages per day in a central location, and just right for deprecating the Google Sheet. It took about a week to locate this internal service (so I didn’t have to build it from scratch), get user access, negotiate for resources, and wire it up to the build automation.

In 2020, I was invited to write some chapters for Go for DevOps by John Doak and David Silver. Unfortunately, 2020-2021 were a bit rough, so I was unable to complete my chapters and could not participate. The book, however, remains a tome without which no DevOps bookshelf is complete.4

One of my chapters was, surprise!, how to create a production change log. So, I decided to revisit that using LLM code generation and Anthropic’s Claude CLI, with the Opus 4.6 and Sonnet 4.6 models. My circa-2021 estimate was that it was going to take me weeks to write and document the code. Two days ago I tried to build a more robust version from scratch.

Humility and why these tools?

Like many DevOps and site reliability engineers, I do not come from a standard computer science background. When I’m not working on the Internet’s plumbing, I’m figuring out how to fight fires or avoid problems. These kinds of roles seldom offer the dedicated project focus time that product SWEs take for granted. Sometimes, the “pitch it over the wall” strategy of the pre-DevOps days was an attempt to preserve that time and pass the messy, interrupt-driven work to the ill-fated Morlocks in Ops.5

Thus, much of my software development experience has been … archeological in nature. Looking at hundreds of thousands of lines of existing (“brownfield”) code or infrastructure configs, and determining which two to twenty lines need to be modified or replaced. Alas, as Paul Graham quoted, “where there’s muck, there’s brass.”6

While my SWE skills are not as formidable as I would like, I have a lot of experience with the operation and development cycles of large software systems. So I’m well matched to a toolset that can go from design discussions to code, particularly when guided by a skeptical hand. And, as mentioned above, I don’t value the myth of the heroic engineer.7

Build it, ship it

I’ll let interested parties take a look at the GitHub repo. I specified that all the code (100% generated by Claude) be in Go wherever possible, and used Go skills in Claude to enforce (more like “increase the chances” of) style guidelines, security audits, and idiomatic Go.

The concept of adversarial review was most useful. Claude can spin off agents which, when instructed, will skeptically evaluate your code and architecture.

Here is an example of the output of part of that analysis.


  Both reviews are complete. Here's a consolidated summary:

  Code Review Findings

  ┌──────────┬───────────────────────────────────────────────────────────────────┬─────────────┐
  │ Severity │                               Issue                               │  Category   │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Critical │ Internal error messages leaked to API clients (SQL, paths)        │ Security    │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Critical │ Token in URL query strings — logged, cached, leaked via Referer   │ Security    │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ High     │ Update doesn’t check RowsAffected — nil pointer on deleted event  │ Correctness │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ High     │ ToggleStar has read-modify-write race condition                   │ Correctness │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ High     │ Delete returns 500 instead of 404 (sentinel error not propagated) │ Correctness │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ High     │ No request body size limit — trivial DoS vector                   │ Security    │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Medium   │ Migration "duplicate column" string-matching hack                 │ Tech debt   │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Medium   │ Test schema diverges from production (missing indexes)            │ Tech debt   │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Medium   │ No EventType validation                                           │ Correctness │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Medium   │ Inconsistent error handling between dashboard handlers            │ Consistency │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Medium   │ GetDB() breaks store encapsulation                                │ Design      │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Low      │ EffectiveLimit() duplicated between model and service             │ Tech debt   │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Low      │ alerted filter not exposed in API handler (only dashboard)        │ Feature gap │
  ├──────────┼───────────────────────────────────────────────────────────────────┼─────────────┤
  │ Low      │ Duplicate mockStore across test files                             │ Maintenance │
  └──────────┴───────────────────────────────────────────────────────────────────┴─────────────┘

From here, I could work with the tool to refactor this code and avoid the so-called “AI slop” and tech debt that so concerns detractors of this type of software development. Of course, I chose Go for historical reasons, but the choice of a compiled language with mandatory types and robust error handling and tests is a no-brainer for these exercises.

I originally estimated a small version of this project would take me weeks of part-time work to plan and develop; Claude’s naive guess at normal SWE time for the current project was 3-5 weeks. The actual time was about 5 interactive, token-consuming hours over 11 clock hours.8 Those five hours included the time to substantially revise the architecture, fix critical bugs (in fact, everything above low priority), create tests, perform manual testing, and sketch out a Docker and local Kubernetes setup.

This was a proof of concept that can of course be improved, but my goal was not to create a new product; it was to produce a solid MVP9 / prototype that was not bonkers dumb or destructive. If you assume that software rolls out of large companies complete and bug-free, you are either exceedingly optimistic, disingenuous, or deluded.10 Software development is iterative development. Code review is an iterative, adversarial (and hopefully not antagonistic) process. And the more of that we can do through tools, the better: from books to linters to IDEs to Google searches of Stack Overflow to software reuse, and now LLMs.

From my informal industry discussions: teams are moving to 80-100% generated code, and everyone is trying to figure out how to handle code reviews, a flood of pull requests, non-traditional software development, and other new challenges. In a time when the work of weeks can be done in as many hours, it’s not going away.11

Silicon Valley was built on a traveling show of magic tricks.12 You can outlaw magic when it’s fraud. It’s a bit harder when it works.

  1. No kidding, the Google SRE bomber jacket was one of the most prestigious pieces of swag at Google. As of this writing, one is available on eBay for $80: “Look on my Works, ye Mighty, and despair!”

  2. I will speak later of the change review and Karma system, which had the only thumbs-down button present in a Facebook UI.

  3. Colored lines on a time-based graph that show you when deployments were active. Examples of other versions

  4. Go for DevOps

  5. Or, at best, to Scotty or Geordi in engineering. Also, Eloi: it’s what’s for dinner.

  6. “Why Smart People Have Bad Ideas”, April 2005

  7. Sorry, Andy Weir.

  8. LOCs don’t count, but: around 2,000 lines of generated code, not counting tests or docs. Bug repair also included creating and revising unit tests.

  9. Minimum viable product

  10. In the 2010 film The Social Network, Mark Zuckerberg (played by Jesse Eisenberg) shouts, “we don’t crash ever!” YT video The entire audience of Facebook employees laughed during our showing, then we went out drinking in Palo Alto.

  11. I’m afraid the code generator will be quite operational when your craft guild arrives.

  12. Folks may remember this building a blog with Ruby on Rails in 15 minutes roadshow!