Old Blog 2: June 2026

So, I've been using Opus 4.7 (with effort set to max) lately because Opus 4.8 often gives me the "retry ..." error. I'm not sure if I've hit some sort of rate limit. Probably, since I keep getting "API error : Overloaded". I can send short messages like "hi" to 4.8, but with long messages it often asks me to retry.

So overall Opus 4.7 doesn't feel noticeably different compared to 4.6. Here are some issues I encountered while using 4.7:

Opus 4.7 still hallucinates. One recent example was when I asked it to kick off a test run and check the results. I looked at the test run and saw there were 5 failed tests. I asked Claude which tests failed, and it named 3 failed tests that were completely different from the 5 that actually failed. I asked it to go check the actual test run result; it did, and then told me it had gotten the failed tests wrong and had just hallucinated the 3 it listed.

Another hallucination I saw recently was when it told me that I could enable a feature using a config setting in the UI. It said: "go into Settings -> General -> ..." except there was no General section in the Settings. I asked it to check, so it went and read the code, and then it admitted that it had hallucinated it and that in fact the setting didn't even exist in the UI.

Opus 4.6 wrote a huge spaghetti mess of flags that was causing bugs all the time. Basically, whenever I asked it to add a feature it would need to update existing flags, and quite often it would forget to update some of them, which caused bugs. It got to the point where pretty much every time I asked it to add a new feature it would break an existing one. And when I asked it to fix a bug, it would quite often tell me that it had fixed the bug, only for me to find that it had not in fact fixed it, and this would happen over and over. Eventually I asked it why the code was so buggy, and it told me it was because the code was a huge spaghetti mess of flags. So I told it to rework the code into an FSM design, and it did, and that significantly reduced the number of bugs. But I had to spend a lot of time walking it through the FSM design, making sure the states made sense and were coherent etc.
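To show why the FSM rework helps, here's a minimal sketch. The state and event names are made up for illustration and are not taken from the actual project. The idea is that there is one state variable and one explicit transition table, so an unexpected event fails loudly instead of leaving a pile of flags half-updated.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    SENDING = auto()
    WAITING = auto()


# Hypothetical transition table: (current state, event) -> next state.
TRANSITIONS = {
    (State.IDLE, "send"): State.SENDING,
    (State.SENDING, "sent"): State.WAITING,
    (State.WAITING, "reply"): State.IDLE,
    (State.WAITING, "timeout"): State.IDLE,
}


class Machine:
    def __init__(self):
        self.state = State.IDLE

    def handle(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            # Anything not listed in the table is rejected outright.
            raise ValueError(f"illegal event {event!r} in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state
```

Adding a feature then means adding rows to the table, and every legal transition is visible in one place.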

One of the most glaring examples of this was a particular piece of functionality that I call edit-vs-new, where I want it to edit a message if that message is the last one in a conversation, or else post a new message if it's not. This feature got broken probably at least 6 times. This was despite there being HUNDREDS of tests. Claude writes TONS AND TONS of unit tests, and I'm not sure they do anything useful except increase my token spend.
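For reference, the edit-vs-new rule itself is tiny. Here's a minimal sketch of it; the `Conversation` class is a hypothetical in-memory stand-in for the real messaging backend, and all the names are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    # Hypothetical stand-in for the real backend: a list of (message_id, text).
    messages: list = field(default_factory=list)
    next_id: int = 1

    def post(self, text):
        msg_id = self.next_id
        self.next_id += 1
        self.messages.append((msg_id, text))
        return msg_id

    def edit(self, msg_id, text):
        for i, (mid, _) in enumerate(self.messages):
            if mid == msg_id:
                self.messages[i] = (mid, text)
                return
        raise KeyError(msg_id)


def edit_or_post(conv, my_msg_id, text):
    """Edit my_msg_id if it is still the last message, otherwise post a new one."""
    if conv.messages and conv.messages[-1][0] == my_msg_id:
        conv.edit(my_msg_id, text)
        return my_msg_id
    return conv.post(text)
```

A regression test for this only has to cover two cases: the message is still last (expect an edit), and something else was posted after it (expect a new message).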

The end-to-end tests that Claude writes are useless. They don't catch regressions, and they don't test end-to-end functionality. There have been many, MANY regressions. I've seen the e2e tests pass while the most basic functionality, the happy path, is broken in the most obvious way possible. I told Claude many times that the tests it wrote are useless and don't catch regressions. Claude rewrote the tests multiple times and they still don't catch regressions. Not even bugs that are incredibly obvious and easily verifiable for Claude, like bugs it would notice if it just made some API calls or read some logs. I'm not even talking about UI or visual bugs. I don't think Claude ever wrote a single regression test that actually worked.

When I asked Claude to fix the broken feature, it would break it in a different way. Like, when it was editing too much instead of posting new, I told Claude to fix that, and after Claude "fixed" it, it would always post new and never edit. This feature continued to break all the time, even after I told it to refactor it and walked through a complete redesign with it. It still got broken after some new features were added, and eventually I just gave up trying to get it to work.

Opus 4.7 is still not really much better at designing code, nor at writing code. With the plugin code, it was generating the same bugs over and over; it hit the exact same 3-4 bugs every time it added a new window to the plugin. The plugin consisted of multiple windows, and there were bugs with things like the scrollbar, or content not displaying in the window, or the borders missing etc. The same bugs every time. I would see the bugs in one window and ask it to fix them, and it would fix only the bugs in that one window. When I tried another window I'd see the same bugs and ask it to fix those, and it would just fix them there without fixing the same bugs in the other windows. So I kept encountering the same bugs over and over again. Eventually I asked it why the same bugs kept happening, and it said it had just copied and pasted the same buggy code for all of the windows. All of the copy-pasted code was doing the same thing, so I told it to pull out the common code and extract it into a common library or something, and it did that. But this is, like, pretty common sense stuff, right? Things like rendering content, drawing borders, scrollbars etc are common tasks that every window needs to do, so it's obvious common sense to extract the code into a library instead of copying and pasting the same code everywhere. This isn't some super secret special senior developer wisdom that takes 20 years to obtain, it's literally basic common sense.
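The "extract the common code" fix might look roughly like this. It's a text-mode sketch with invented class names, not the actual plugin API: borders, scrolling and clipping live in one base class, and each window only supplies its content, so a border or scroll bug gets fixed once for every window.

```python
class Window:
    """Shared rendering code: borders, scrolling and clipping live here once."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.scroll = 0  # Index of the first visible content line.

    def content_lines(self):
        # Subclasses only supply their content.
        raise NotImplementedError

    def render(self):
        inner_w = self.width - 2
        inner_h = self.height - 2
        lines = list(self.content_lines())
        visible = lines[self.scroll:self.scroll + inner_h]
        # Pad with blank rows so the frame always has the same height.
        visible += [""] * (inner_h - len(visible))
        edge = "+" + "-" * inner_w + "+"
        body = ["|" + line[:inner_w].ljust(inner_w) + "|" for line in visible]
        return [edge] + body + [edge]


class LogWindow(Window):
    """Hypothetical example window: it adds no drawing code of its own."""

    def __init__(self, width, height, entries):
        super().__init__(width, height)
        self.entries = entries

    def content_lines(self):
        return list(self.entries)
```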

Opus 4.7 still forgets things. Quite often I would ask it to help me plan a major feature, and it would ask me questions to clarify the feature. Then I would ask it to solve a bug urgently, so it would go and do that. Then, once the bug was fixed, I'd ask it if there were any outstanding tasks left and it would say no. This is especially bad when there's a compaction; that's when it will just straight up deny that any feature had even been planned, and I would have to tell it to go read the past session logs. There is an easy way to work around this: every time you give it a new task, you write it down in an external file, and later on you can tell Claude to check against the external file.

Opus 4.6 (with effort set to max) was very lazy. Even when I told it not to be lazy and to fully investigate things, it would just say "oh, I don't know why this happens, I can investigate it another time if the user wants", even when I explicitly told it not to be lazy and to investigate all unexplained phenomena. Opus 4.7 (with effort set to max) is similarly lazy and will just give up a lot of the time. Like, I asked it to build a plugin, and it built a plugin that didn't even display. I asked it to fix it, and it tried a few times and then just gave up, saying it was going in circles and not getting anywhere, and honestly, was it even worth it? I just told it "look, getting it to display should not be that hard" and it went and fixed it.

Opus 4.7 fails to read docs when reading them should be a no-brainer. Like, I asked it to write some code that uses a plugin API whose docs are easily Googleable. Instead of reading the documentation it will just make guesses about what API calls do, often guessing wrongly. And even when I tell it there's a bug, and it goes and figures out that the root cause is that the API call it thinks does X does not in fact do X, it will then throw up its hands and say "oh well, looks like it's impossible to do X". This is even when X is a common task that any API user would obviously want to do, and in fact it's difficult to see how the program would even be able to function without that feature. By X I'm talking about functionality like "get the currently focused window" - something that should be extremely commonly used, and yet Claude told me that it was impossible. I said "surely there must be an easy way to do this" a few times, and eventually it decided to FINALLY go read the API docs and discovered that there is, in fact, an extremely easy way to do it (there was an API call that does exactly that).

This happened for multiple things. Even for VERY commonly used messaging APIs, Claude will just not read the docs. For a messaging API, the docs will say what the rate limits are, what the maximum message size is, and what happens if you send a message that exceeds the maximum size. This is all stuff that isn't easily guessable, and yet Claude decides to guess instead of going and reading the docs. Claude will write code that fails to respect the API; for example, it would write code that has no safeguards against sending messages that exceed the maximum size.
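A safeguard against oversized messages doesn't need to be complicated. Here's a minimal sketch, assuming a hypothetical limit of 2000 characters. The real limit, and whether it is counted in characters or bytes, has to come from the messaging API's docs.

```python
MAX_MESSAGE_CHARS = 2000  # Assumed limit; take the real value from the API docs.


def split_message(text, limit=MAX_MESSAGE_CHARS):
    """Split text into chunks of at most `limit` characters, preferring newline boundaries."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    chunks = []
    while len(text) > limit:
        # Look for the last newline that keeps the chunk within the limit.
        cut = text.rfind("\n", 0, limit + 1)
        if cut <= 0:
            # No usable newline: fall back to a hard split at the limit.
            chunks.append(text[:limit])
            text = text[limit:]
        else:
            chunks.append(text[:cut])
            text = text[cut + 1:]  # Drop the newline that served as the split point.
    if text:
        chunks.append(text)
    return chunks
```

The caller then sends each chunk as its own message, so the splitting happens in your code on your terms rather than inside the messaging service.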

Seriously, reading the docs is not some super secret senior developer wisdom that takes 20 years of development experience to obtain. This is literally basic common sense. If you grabbed some random Joe off the street, or even a fresh new grad, and asked him to write a program that uses a messaging API, he would know "oh, I don't know how to use the API, I should probably read the API docs". Like, "RTFM" is literally the first thing any new developer gets told. This is the most obvious basic common sense that anyone with 2 brain cells will think of.

Opus 4.7 still goes around in circles when fixing bugs. There was a bug that was caused by the messaging service splitting messages that exceeded the maximum size. Claude spent something like 2 hours just going around in circles checking the same code over and over again. Like, if you've read the same code repeatedly and concluded there's no way it can be causing the bug, it should be VERY OBVIOUS that the cause is most likely something external. Yet it just went around in circles over and over and eventually just gave up. I walked it through the possible causes and it didn't figure it out after I literally told it what the source of the bug was. Yes, it was caused by the messaging service splitting the message due to the message exceeding the maximum size. Something that Claude would have learned if it had bothered reading the messaging API docs, which aren't even that long.

Opus 4.7 makes some pretty elementary programming mistakes, such as writing just a single bare recv(4096) call instead of calling recv in a loop. This was the cause of a major bug that took a while to track down. Honestly, calling recv in a loop is something you learn as a student. It's even advised in the docs. Again, this isn't some super secret senior developer wisdom that takes 20 years to obtain. It's just basic defensive programming. I don't think it's ever a good idea to write a bare recv. Even when you "know" that the message is a certain size when you write the code initially, what if you decide to change the message size later on? It just creates footguns for no good reason.

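For anyone who hasn't seen it, the recv-in-a-loop pattern looks roughly like this. It's a minimal sketch; the 4-byte length-prefix framing is an assumption chosen for illustration, not the protocol from the project above. The point is that `recv(n)` may return fewer than `n` bytes, so you keep reading until you have everything.

```python
import socket
import struct


def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # An empty result means the peer closed the connection.
            raise ConnectionError(f"peer closed after {len(buf)} of {n} bytes")
        buf.extend(chunk)
    return bytes(buf)


def send_message(sock, payload):
    """Send one message framed as a 4-byte big-endian length plus the payload."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_message(sock):
    """Read one length-prefixed message, however many recv calls it takes."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

With this in place, changing the message size later doesn't silently truncate anything, because the reader never assumes a single recv call returns a whole message.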
Opus 4.7 writes some REALLY CRAZY INSANE code. Oh, you thought the recv(4096) was bad? Well, this one was on another level. I asked Opus 4.7 to write me a watchdog. I gave it a rough outline and some requirements and let it write the code itself. It wrote something like 800 lines of code. One of the functions it wrote was basically a process_messages function with a loop in it that goes like this:

    for msg in received_msgs:
        assert self.flag is not None
        ... [ some code ] ...
        self.flag = None

So basically this loop was GUARANTEED to crash if received_msgs ever had more than one item: the first iteration sets the flag to None, and the assert at the top of the second iteration fails. This caused the watchdog (which is supposed to be a really small, simple and reliable piece of software) to go into a crash loop where it crashes and dies, systemd restarts it, and it dies again. I can't fathom the level of confusion that created this code.

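To make the failure concrete, here's a stripped-down repro of the pattern. The class, the flag's initial value, and the stand-in for the elided code are all hypothetical; only the assert-then-reset shape comes from the loop above.

```python
class Watchdog:
    """Hypothetical reduction of the pattern, not the original 800-line code."""

    def __init__(self):
        self.flag = "armed"  # Placeholder value; the flag's real meaning isn't shown.
        self.handled = []

    def process_messages(self, received_msgs):
        for msg in received_msgs:
            # Passes for the first message, fails for the second one.
            assert self.flag is not None
            self.handled.append(msg)  # Stand-in for the elided code.
            self.flag = None


if __name__ == "__main__":
    dog = Watchdog()
    dog.process_messages(["heartbeat-1"])  # One message: fine.
    dog = Watchdog()
    dog.process_messages(["heartbeat-1", "heartbeat-2"])  # Raises AssertionError.
```

Note that the assert only fires when Python runs with assertions enabled, which is the default.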
In fact the code was so unreadable and incomprehensible that I just decided to rewrite it. Why on Earth was it 800 lines for such a simple task as "send heartbeat, restart program"? I ended up rewriting it in about 100 lines of code that do basically the same thing, except my version is probably 100x more readable and auditable.

I feel like in the future people will face a choice between "expensive premium hand-written code" vs "cheap AI generated bug-ridden garbage".


Tuesday, June 2, 2026

Opus 4.7 Leaves Much To Be Desired