Friday, March 20, 2026

A couple of thoughts on using LLMs

Recently I started using Claude Opus 4.6 and burned through over $10 after just a couple of queries. In one of those queries, I asked Opus to generate a WireGuard config for NixOS, and saw that it generated this:

    networking.wireguard.interfaces.wg0 = ...

This immediately rang alarm bells, because I had previously read the official NixOS wiki page on WireGuard and knew that there are at least four ways to configure WireGuard on NixOS:

  • NetworkManager
  • wg-quick
  • networking.wireguard
  • systemd.network

Which one to use? Claude just picked networking.wireguard, with no explanation. But according to the official NixOS wiki, systemd.network is the recommended approach, while networking.wireguard is said to have "issues with routing". So at least in this instance, Opus did not follow the best practices documented in the official NixOS wiki. This was disappointing: Opus 4.6 is, at the time of this writing, considered the best model for code generation, it's expensive, and my earlier experiences with it had been good, so my expectations were high. After seeing this result, I realized that Opus is just as fallible as the other LLMs, and that my trust had been misplaced.
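For reference, a minimal sketch of the wiki-recommended systemd.network approach might look like the following. The interface name, addresses, key path, and peer endpoint are all placeholders I made up for illustration, and exact option names can differ between NixOS releases, so check the module options for your version:

```nix
{
  # Let systemd-networkd manage the interface.
  networking.useNetworkd = true;

  systemd.network = {
    enable = true;

    # Define the WireGuard interface itself.
    netdevs."50-wg0" = {
      netdevConfig = {
        Kind = "wireguard";
        Name = "wg0";
      };
      wireguardConfig = {
        # Keep the private key out of the world-readable Nix store.
        PrivateKeyFile = "/etc/wireguard/wg0.key";
      };
      wireguardPeers = [{
        PublicKey = "peer-public-key-goes-here";
        AllowedIPs = [ "10.100.0.0/24" ];
        Endpoint = "vpn.example.com:51820";
      }];
    };

    # Assign an address and routing to the interface.
    networks."50-wg0" = {
      matchConfig.Name = "wg0";
      address = [ "10.100.0.2/24" ];
    };
  };
}
```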

In another query, I asked Opus to help me build a system, and it wrote the whole thing in Python, without any type hints and of course no tests. I get that it may be trying to conserve tokens, but writing a Python app with no type hints feels like something that should be illegal.
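For what it's worth, hints cost almost nothing even on trivial code. A small, hypothetical example (not from the generated app) of the baseline style I'd expect:

```python
def mean(values: list[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


print(mean([1.0, 2.0, 6.0]))  # → 3.0
```

The annotations document intent for free, and tools like mypy can then catch a whole class of bugs before the code ever runs.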

That inspired me to finally write this blog post. I feel like I've had enough experience with LLMs now to write down a couple of thoughts on using them:

  1. LLMs can be very useful for debugging and troubleshooting software and hardware problems. They read code very fast and can sometimes find a bug nearly instantly on their own, potentially saving you hours of debugging time.
  2. LLM output is unreliable, always and in every case. I have found no use case where LLM output is trustworthy 100% of the time. My absolute rule is that all LLM output must be inspected and checked (e.g. tested) before it is trusted. For example, I would never accept code changes from an LLM without first inspecting all of the code to make sure it looks okay, and then testing it to make sure it works.

Bottom line: I think it's best to treat LLMs as unreliable idea generators. They generate ideas, some good and some bad. When the human has run out of ideas, an LLM can supply more to try and test. In situations where ideas can be tested quickly and cheaply, LLMs are awesome. In situations where doing the wrong thing has irreversible consequences, one should naturally be more cautious about trusting LLM output.

Additionally, I think there are a couple of tendencies to be wary of when using LLMs:
  1. The tendency to treat LLM output as authoritative. You see this when two people are having a debate, one side pulls out an LLM, asks it the question, and treats the answer as authoritative. I must admit the same habit irked me when people used to do it with the first result on Google, or by treating Wikipedia as authoritative.
  2. The tendency of LLMs to act as a "confirmation bias" amplifier. This is when a human has an idea, and asks the LLM about it, and the LLM basically goes "yes, your idea is very valid", and the human goes from treating the idea as "possible" to being almost certain. This is related to point 1.
