Saturday, March 30, 2024

Firefox has lag issues when 6 windows with over 12,000 tabs are open

Currently I have 6 Firefox windows open:

  1. Window 1 has 12,142 tabs open
  2. Window 2 has 129 tabs open
  3. Window 3 has 1 tab open
  4. Window 4 has 42 tabs open
  5. Window 5 has 64 tabs open
  6. Window 6 has 92 tabs open

The majority of the tabs are in the "unloaded" state so they should not be taking up any CPU time.  

RAM usage is less than 50% and CPU usage is on average less than 50% across all cores. 

I am running both Firefox and Chrome at the same time and Chrome is significantly more responsive.

Even something as simple as opening a new tab in Firefox takes a few seconds, whereas opening a new tab in Chrome is instant. To be fair, I only have 300 tabs open in Chrome right now, but this suggests the performance issues are not due to my hardware.

When I create a new tab in Firefox, enter a URL, and hit enter, the tab takes a while to even begin loading the page.

Even while writing this blog post, every few seconds while typing, the text just freezes for a couple of seconds before my keystrokes make their way onto the screen. I notice this when typing on monkeytype and typeracer too - sometimes the text just freezes for a few seconds. This happens almost every other sentence and it only happens on Firefox and not on Chrome.

When watching a YouTube video, every few seconds (say 5-20 seconds), the video just freezes for a couple of seconds while the audio continues to play. Again, this ONLY happens on Firefox and does not happen on Chrome.

I think this issue started a while ago, but recently the lag got bad enough that it is actually impacting my monkeytype performance, so I decided that I had to close some tabs.

But before closing my precious tabs, I have to save them somehow. Here is where I ran into problems. 

It turns out that Tab Session Manager no longer works when you have 12,000 tabs open.

Tab Stash also doesn't work when you have so many tabs.

Firefox's built-in "save all open tabs as bookmarks" works, but does not save tab state and history, which I want to keep.

So I'm going to try to find a way to just save my Firefox "profile" somewhere so that I can restore it later along with all of the tab state and history.
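A minimal sketch of the kind of thing I mean (assuming Linux and the default profile location - the profile folder name varies per machine, so check yours):

# Quit Firefox first so the session files aren't mid-write, then:
cd ~/.mozilla/firefox
tar czf ~/firefox-profile-backup.tar.gz profiles.ini *.default-release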

Tuesday, March 26, 2024

How to give your Scaleway Stardust VPS a custom* IPv6 address

(*by "custom" I mean any address you want within the /64 block that Scaleway gives you.)

So I got myself a €0.43/month Scaleway Stardust IPv6-only instance, and I wanted to attach it to a permanent IPv6 address.

Scaleway generously gives you 40 free flexible IPv6 addresses. Each of these is a /64 block! So you can actually add any of the IPs in the /64 block and attach them to your VM. And you can have multiple IPv6 addresses from multiple of those blocks attached to your VPS at the same time, so you can access your VPS from all those IPs simultaneously! (I tried this, it works - pretty cool!)

Anyway, so my Stardust instance is running Debian 12, and I initially thought that to add my own custom IPv6 address I just had to edit /etc/network/interfaces, because that is what the Debian manual says: https://wiki.debian.org/NetworkConfiguration

# systemctl status networking
# systemctl restart networking

However, when I ran the status command, I got this result:

# systemctl status networking
Unit networking.service could not be found.

So then I listed all of the running systemd services, and from the list it looked like my VPS was using systemd-networkd for network configuration.

Doing systemctl status systemd-networkd gave this kind of message:

if1: Configuring with /run/systemd/network/if1.network.

So I thought I just needed to edit that file. I went ahead and edited it, but the changes did not persist across reboots.

It turns out the /run/systemd/network files are volatile files as explained in the Arch wiki:

The global configuration file in /etc/systemd/networkd.conf may be used to override some defaults only. The main configuration is performed per network device. Configuration files are located in /usr/lib/systemd/network/, the volatile runtime network directory /run/systemd/network/ and the local administration network directory /etc/systemd/network/. Files in /etc/systemd/network/ have the highest priority.

So I created a file with the same name in /etc/systemd/network/, and now the IP address is restored on reboot.
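For reference, the file ends up looking roughly like this (a sketch only - the interface name and address here are placeholders; copy the real contents from the /run/systemd/network file and just add your chosen Address= line):

# /etc/systemd/network/if1.network
[Match]
Name=if1

[Network]
# ...keep whatever settings the runtime file had, then add:
Address=2001:db8:1234:5678::42/64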


A list of S3-compatible providers

In the previous post I compared B2 and R2. Then I realized that there are a whole bunch of other S3-compatible providers so here is a list - I evaluated them based on my own use case, yours may vary:

NOTE: I haven't used any of the services listed below, so I cannot comment as to their quality or reliability.

  • Backblaze B2 - PUT requests are free, no minimum spend.
  • Oracle Cloud - Data storage: $0.0255/GB/month. $0.34/million requests. 10TB free egress per month.
  • Microsoft Azure - $7.70/million write operations
  • IBM Cloud - $5.20/million class A (write) operations
  • AWS S3 - $5/million PUT calls
  • Google Cloud - $5/million class A (write) operations
  • Fuga - €5/million PUT calls
  • Cloudflare R2 - $4.50/million class A (write) operations
  • Clever cloud - €0.09 / GB egress 
  • Terrahost - minimum spend is $11.5 per month
  • Wasabi - minimum spend is $6.99 / month
  • Vultr - minimum spend is $6 / month 
  • Upcloud - minimum spend is €5 / month
  • Digital Ocean spaces - minimum spend is $5 / month
  • Linode - minimum spend is $5 / month
  • iDrive e2 - minimum spend is $4 / month
  • Contabo - minimum spend is $3 / month
  • Bunny - minimum spend is $1 / month (which gives you 25GB with replication, or 50GB with no replication), no API fees, no API egress fees, S3 coming soon (TM)
  • Synology C2 - €11.99/ year for 100GB, no API fees, no egress fees (???), no upload fees, no deletion fees
  • Serverius - Data storage: €0.009/GB/month. Every month, your first million HTTP requests are free. Each GET and PUT request type has its own limit of 1 million free requests. For example, if you've had 0.8 million GET and 0.7 million PUT requests, you're still within your free limit. If you exceed 1 million requests, the extra requests are charged at only €0.0003 per 1,000 HTTP requests (€0.30/million). The first 200GB of data egress per month is free.
  • Scaleway - Data storage: €0.012/GB/month for single-zone, €0.0146 for multi-zone. Ingress is free. Requests are free. Egress - 75GB free per month, after that charged at €0.01/GB.
  • OVH - Data storage: 0.012/GB/month. Ingress is free. API requests are free. Egress is charged at 0.012/GB.
  • tebi.io - Data storage: PAYG plan includes a Free Tier which gives you 25GB of free storage replicated in two locations. Additional storage is charged at $0.02/GB/month. API calls are FREE. Unlimited uploads (free, I guess). 250GB of free egress per month, additional egress is charged at $0.01/GB.
  • Dreamhost - Data storage: $0.025/GB/month. Ingress is FREE. API calls are FREE. Egress is charged at $0.05/GB.
  • Exoscale - Data storage: €0.02/GB/month. Egress is charged at $0.02/GB. There is no other charge - ingress is free.
  • Ionos S3 - Data storage: €0.015/GB/month. Ingress is FREE. API requests are FREE. Outgoing data traffic: €0.03/GB.
  • Storj - Data storage: $0.004/GB/Month. Segments are billed at $0.00000001222 per Segment Hour. Every file smaller than 64MB takes up 1 segment (unless you split them). Egress is charged at $0.007/GB.
  • Telnyx - Data storage: $0.006/GB/month. State-change operations: $0.5 per million. State-read operations: $0.04 per million. Egress is free (???). But see the LET thread for more details: https://lowendtalk.com/discussion/187546/telnyx-s3-compatible-object-storage-4-tb-mo-and-free-egress

Please note that I do not know which of the above listed services have hidden charges or minimum spend limits or some crazy terms/conditions like "once you upload a file you must not delete it for at least 6 months otherwise we will suspend your account" etc.

Caveat emptor, I guess.

 

Btw, the cheapest Scaleway instance - the IPv6-only Stardust - only costs around $0.50 per month if you use the 10GB local storage. Pretty cheap! And you get 1GB RAM and "unlimited" bandwidth too. You need to disable IPv4 in order to get that price, though. So pester your ISP until they give you IPv6!!

Monday, March 25, 2024

Backblaze B2 vs Cloudflare R2 pricing

NOTE: I did not include Wasabi because their minimum price is $6.99 / month. I did not include Digital Ocean spaces because their minimum price is $5 / month. In contrast, it seems Backblaze does not have a minimum price (https://www.reddit.com/r/backblaze/comments/yv55eu/backblaze_b2_is_there_a_minimum_monthly_amount/) so if you store only a few GB then you only pay a few cents per month, which is perfect for my use case.

So I noticed some interesting differences between B2 and R2 pricing:

Backblaze B2

  • Ingress is free, egress is free up to 3x your monthly average storage, with any additional egress priced at $0.01/GB. You also get 1GB free egress per day.
  • Class A operations (PutObject, DeleteObject) are FREE
  • Class B operations (GetObject) - you get 2,500 free operations per day (= 75k/month), then $0.004 per 10,000 ($0.40 / million)

Cloudflare R2

  • Ingress and egress are both free.
  • Class A Operations (PutObject) - you get 1 million free requests / month, then $4.50 / million 
  • Class B Operations (GetObject) - you get 10 million free requests / month, then $0.36 / million requests
  • DeleteObject is free.

Summary

  • DeleteObject is FREE on both B2 and R2
  • GetObject is cheaper on R2: R2 gives you 10 million/month allowance and then charges you $0.36/million thereafter, whereas B2 gives you 2.5k/day allowance and then charges you $0.4/million thereafter.
  • PutObject is cheaper on B2: FREE on B2, whereas R2 gives you 1 million allowance and then charges you $4.50/million thereafter.

Based on the pricing info alone, it looks like if you are going to be doing millions of calls to PutObject per month and less than 2500 calls to GetObject per day, then B2 will be a lot cheaper for you. But if you are going to be doing millions of GetObject calls and less than 1 million PutObject calls per month, then R2 will be cheaper.

Of course we have to take the B2 egress costs into account too. If you are egressing less than 3x your storage, then egress is free, otherwise it costs $10 per TB, so I don't think B2 is suitable for file sharing - the B2 pricing structure makes it only really suitable for file backups. 

Having said that, apparently B2 egress is free through Cloudflare. Though I'm not sure exactly how to take advantage of it. Something to investigate if I end up actually using more than the free B2 egress, I guess.

If I do 4 million PutObject calls per month (e.g. 1-2 calls per second), that is going to cost me $13.50 per month on R2 (4 million minus the 1 million free allowance leaves 3 million billable calls at $4.50/million), whereas it would be free on B2. So I think, if I use R2, I would have to carefully think about how to reduce the number of PutObject calls.




No longer able to reproduce Cloudflare DNS flapping

UPDATE: I tried this with some of the $0.99/year 1.111B class .xyz domains that I registered using a different registrar (you can't register .xyz domains on Cloudflare for some reason). I simply set the nameservers for my 1.111B domain to Cloudflare (add it to Cloudflare first, of course) and it works just as well! The change takes effect instantaneously. As soon as the HTTP PUT request returns, if you run the host command again, you will immediately see the new, updated IP address for that domain. Very cool!!!!
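For reference, the PUT request in question is Cloudflare's standard v4 API call for updating a DNS record (the IDs, token, and domain below are placeholders):

curl -X PUT "https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RECORD_ID" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"type":"A","name":"mydomain.xyz","content":"192.0.2.1","ttl":60,"proxied":false}'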

I wonder why more people don't use those $0.99/year 1.111B domains. They're so cheap.

Last post I mentioned that I saw DNS flapping with Cloudflare. 

I wondered if it was because the TTLs on some nameservers had not yet expired. Since the updates presumably take a while to propagate across all nameservers, maybe the TTLs on some nameservers start counting down before others. So maybe the issue was that I was updating the DNS too quickly - if I waited a few minutes between updates, then maybe the updates would become instantaneous and reliable with no flapping.

So I tried what I did in the last post again, this time waiting a few minutes before updating the DNS to a new value, and this time I saw some more interesting behavior.

First, I set the IP to 1.0.0.1 at 10:14:52: Instantaneous and no flapping.

Then I set the IP to 8.8.8.8 at 10:21:02:

Request issued at 10:21:02

First change seen: 10:23:15

Wow! This time it took over 2 minutes to update and there was flapping too!

Then I changed it to 192.168.0.1 and the change was instantaneous once again, and no flapping.

This makes me wonder if either the 1.0.0.1 or the 8.8.8.8 IP address is special - maybe Cloudflare doesn't want to change from 1.0.0.1 or maybe it doesn't want to change to 8.8.8.8. I'll try some more tests to distinguish between the two hypotheses.

Or maybe there is another DNS cache timeout somewhere that is longer than 1 minute?

Then I waited a few minutes and updated the IP to 192.168.0.123, and this time again, the change was instantaneous and there was no flapping.

Then I waited a few minutes and updated the IP to 192.168.0.42, and this time again, the change was instantaneous and there was no flapping.

So it would seem that at least for the IP range 192.168.0.x, as long as you wait a few minutes between each change, the update is instantaneous and reliable with no flapping.

Then I waited a few minutes and updated the IP to 8.8.8.8, and this time again, the change was instantaneous and there was no flapping.

Then I waited a few minutes and updated the IP to 1.0.0.1, and this time again, the change was instantaneous and there was no flapping.

Then I waited 2 minutes and updated the IP to 192.168.0.1, and this time again, the change was instantaneous and there was no flapping.

So it seems that most of the time, if you wait a few minutes before changing the IP, the change is indeed instantaneous with no flapping.

This makes me feel more confident using Cloudflare for instantaneous DDNS updates.

Cloudflare DNS flapping

I saw something interesting with DNS today.

I updated my DNS record, then immediately queried Cloudflare DNS (1.1.1.1) and it would switch between the old and new IPs for a while before settling on the new IP. 

[linux 2024-Mar-25 09:56:43]$ host -v mydomain.com
Trying "mydomain.com"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 56089
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;mydomain.com.                    IN      A

;; ANSWER SECTION:
mydomain.com.             60      IN      A       127.0.0.1

Received 44 bytes from 1.1.1.1#53 in 5 ms

[linux 2024-Mar-25 09:56:46]$ host -v mydomain.com
Trying "mydomain.com"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63287
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;mydomain.com.                    IN      A

;; ANSWER SECTION:
mydomain.com.             60      IN      A       10.0.0.1

Received 44 bytes from 1.1.1.1#53 in 5 ms

Pretty interesting behavior.

My best guess is that different DNS servers are answering my query each time. Some nameservers get updated faster than others, and sometimes my query is answered by one nameserver and sometimes by another. Some of the nameservers have the old IP while others have the new IP, hence the flapping behavior you see here.

I don't know where exactly the flapping is taking place. Maybe Cloudflare internally uses some kind of load balancing mechanism that distributes DNS queries to different machines each time (or randomly)? Don't know.

In any case, this dashes my dreams of using 1.1.1.1 for instantaneous reliable DDNS, because it seems that sometimes the DNS change is not instantaneously reflected in the host/dig output and sometimes it flaps between the old and the new IP. Sadge.

Sunday, March 24, 2024

Hmmm...DNS cache expiry patterns ...

So I need a really fast DDNS because ... reasons ... so I tried out Cloudflare DDNS.

Basically what I did is I sent the update query and then I kept running the host command over and over again.
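i.e. a loop along these lines (the domain is a placeholder):

while true; do
    printf '%s ' "$(date +%T)"
    host mydomain.com 1.1.1.1 | grep 'has address'
    sleep 1
done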

So I sent the update query and saw that it took around 44 seconds for the new value to show up in host.

I tried Dynu, which advertises a 30 second TTL (WOW!), and saw that the DNS update took around 48 seconds.

But then I tried Dynu again and saw that this time the DNS update took only 4 seconds.

I investigated further and saw this pattern:

21:24:36 Request sent → 21:25:24 DNS updated (48s)
21:26:21 Request sent → 21:26:25 DNS updated (4s)
23:45:02 Request sent → 23:46:01 DNS updated (59s)
23:46:22 Request sent → 23:47:03 DNS updated (41s)
23:47:19 Request sent → 23:48:03 DNS updated (44s)
23:48:57 Request sent → 23:49:03 DNS updated (6s)

 
I find it interesting that in the last 4 cases, the DNS update happened near the start of the minute, but in the first two cases, the DNS update happened near second 24-25. 

It could just be a coincidence, or this could indicate that DNS cache timeouts are happening roughly in 1 minute intervals, but with some drift.

I tried again with Cloudflare:

23:58:11 Request sent → 23:58:38 DNS updated (27s)
00:00:08 Request sent → 00:00:52 DNS updated (44s)
00:01:17 Request sent → 00:01:52 DNS updated (35s)
00:02:44 Request sent → 00:02:53 DNS updated (9s)


Here again we see the familiar pattern of the DNS updating around the same second for multiple minutes consecutively, yet from 23:58 to 00:00 it changed from second 38 to second 52-53. 

It seems to me that there is some kind of pattern that occurs regardless of which DDNS service you use.

DNS updates happen via cache expiry, and it seems that the cache can expire around the same time every minute?

Also it seems that the expiry time also changes?

Not really sure what's going on.

In any case, my takeaway from all this is that you cannot count on a TTL of less than 60 seconds. The Dynu TTL of 30 seconds does not seem to guarantee that you will see the DNS updated within 30 seconds of a change - sometimes it takes more than 30 seconds, sometimes less. It should be under 60 seconds though.

I suppose if you want really fast DDNS, you could host your own special "DNS" server and send a packet there every second so that it will know immediately when your IP changes...
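A sketch of what I mean in Go - the server IP and port are made up, and a real version would need some authentication so random packets can't spoof your address:

package main

import (
	"log"
	"net"
	"os"
	"time"
)

// server records the source address of each heartbeat packet -- whoever
// pinged most recently is at that IP.
func server() {
	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 9999})
	if err != nil {
		log.Fatal(err)
	}
	buf := make([]byte, 64)
	for {
		_, addr, err := conn.ReadFromUDP(buf)
		if err != nil {
			continue
		}
		log.Printf("client is currently at %s", addr.IP) // update your records here
	}
}

// client sends a heartbeat every second to a hardcoded IP - no DNS involved.
func client() {
	for {
		conn, err := net.Dial("udp", "203.0.113.1:9999") // placeholder IP
		if err == nil {
			conn.Write([]byte("ping"))
			conn.Close()
		}
		time.Sleep(time.Second)
	}
}

func main() {
	if len(os.Args) > 1 && os.Args[1] == "server" {
		server()
	} else {
		client()
	}
}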

GOD DAMMIT my ISP doesn't support IPv6 😡😡😡

UPDATE: I contacted my ISP and managed to get them to give me IPv6. I can only pray that the IPv6 will continue to work in the future. It is actually kind of crazy that I had to contact them and go through the silly dance of "restart your router" "okay I did that, I still don't have IPv6" in order for them to actually fix their network so that IPv6 works for me. I think every ISP should provide working IPv6 out of the box.

 

 

I can't ping ipv6.google.com and my score on https://test-ipv6.com/ is 0/10

All the VPS vendors support IPV4 now...why doesn't my ISP support IPv6...it's unacceptable...There really ought to be some kind of government mandate that requires all ISPs to provide full IPv6 support.

I want to use one of those cheap IPv6-only VPSes, dammit! 😡😡😡

EDIT: Okay, I guess I'll use an HE (Hurricane Electric) tunnel

EDIT: My router blocks ping so I can't even create a tunnel (HE requires the tunnel endpoint to respond to ping), god dammit. I can't use my VPS as the endpoint either, it says "This network is restricted"

EDIT: Okay, so I set up the HE tunnel on one of my VPSes; now I can finally ping ipv6.google.com from that VPS, but I get 2% packet loss when doing so. Note that I get 0% packet loss when I ping -4 google.com from that VPS, so I'm pretty sure the loss is caused by the HE tunnel...😭😭😭😭😭😭

Ping stats:

--- ipv6.google.com ping statistics ---
5600 packets transmitted, 5506 received, 1.67857% packet loss, time 5608718ms
rtt min/avg/max/mdev = 216.176/256.840/278.688/11.551 ms

---  ping statistics ---
5613 packets transmitted, 5611 received, 0.0356316% packet loss, time 5623708ms
rtt min/avg/max/mdev = 10.223/10.337/25.901/0.226 ms

Maybe I am using the wrong HE tunnel server.

What would be a really nice way to charge for bandwidth?

According to the Cloudflare blog post AWS's Egregious Egress, it costs AWS around $1.20 per TB of traffic transferred: https://blog.cloudflare.com/aws-egregious-egress

Basically, Cloudflare says: if you run a 3 Mbps link at 100% utilization for a month, you'll have transferred around 1TB of data. If it costs you $1.20 to run that 3 Mbps link for a month, then you're effectively paying $1.20 per TB.

I think this would be a really nice way to charge for bandwidth. 

Instead of charging a monthly price for bandwidth, it would be nicer if customers could simply just buy a certain amount of bandwidth that never expires.

It would be nice if I could just pay $50 up front for 50TB of bandwidth, and that bandwidth would never expire, so that I can use it whenever I want.

I think that would be a really nice pricing model. I wonder why VPS providers don't use it.

Actually this is how a lot of prepaid SIM cards work - you can get IoT SIM cards with data that won't expire for 10 years. Pretty neat concept.

How AWS Lightsail bandwidth pricing works

So on the AWS Lightsail page there is a part which says only outbound transfer in excess of allowance is charged.

But in another part of the Lightsail pricing page it says that both inbound AS WELL AS outbound use up your transfer allowance.

Putting the two pieces of information together, it means that both inbound as well as outbound will use up your allowance. But once you have used up your allowance, you will only get charged for the outbound traffic.

So for example, if you pay $3.50 per month then you'll get the 1TB allowance.

So if you do 1TB of ingress followed by 1TB of egress, then the first 1TB of ingress will use up all of your bandwidth allowance, and then the 1TB of egress will be charged at the standard AWS rate of $0.09 per GB which comes out to $90.

Just thought I'd explain this for anyone else who was confused about the Lightsail pricing like I was.

By contrast, many big VPS providers such as Digital Ocean, Linode, Vultr, Contabo, and so on do not charge for ingress at all. Digital Ocean explicitly says "Any inbound transfers don't count against your bandwidth usage."

HOWEVER, I signed up for Vultr wanting to buy their $3.50 plan and later found out that it's only available in one location in the US. I thought their $3.50 plan was available in all locations. Keep this in mind because when you sign up for Vultr you have to buy some credit. I would suggest not putting in any credit until you're sure that the plan you want to buy is actually available.

AWS IPv4 pricing changes everything

UPDATE: AWS has now updated their pricing for the Lightsail to reflect their IPv4 charges. Now the cheapest IPv4 Lightsail plan will cost $5.

So my previous calculations regarding monthly AWS costs are now incorrect. 

Originally IP addresses were free as long as they were attached to your EC2 instance.

Now you will get charged around $3.60 per month for the IPv4 address alone! And then you have to pay for the compute.

Given that the compute itself only costs around $1.40 per month on a 3 year reserved plan, this means the IP address costs more than double the instance itself.

This is especially ridiculous given that a $3.50 Lightsail instance gives you both the compute as well as a static IP address.

So if you use EC2 you are paying $3.60 just for the IP address, not including the compute. When you could be paying $3.50 for Lightsail which includes free IP plus compute.

IMO the AWS IPv4 pricing is overpriced compared to some other places, e.g. Hetzner charges only $0.65 for an IPv4 address per month.

Saturday, March 23, 2024

More proof that I'm an idiot

Context: I wanted to block all non-Cloudflare IPs from accessing my server since I don't want people to be able to query my server and figure out what domains it hosts (yes, this is quite easy to do - a simple curl -k https://aaa.bbb.ccc.ddd:port -v will tell you).

So I wrote a bunch of rules into /etc/nftables.conf thinking that that's where nftables looks for the config file.

Nope, it actually turns out the real config is in /etc/sysconfig/nftables.conf

So I googled and even asked GPT4 and Gemini where to find the real config, and couldn't find the answer. GPT4 and Gemini were totally useless.

In the end, I had to think for myself, so I thought "well, nftables is a service, so systemd will tell me what command it was started with and maybe that command will contain the location of the config file" and lo and behold:

$ systemctl status nftables
nftables.service - Netfilter Tables
    Loaded: loaded (/usr/lib/systemd/system/nftables.service; enabled; preset: disabled)
    Active: active (exited) since Sat 2024-03-23 14:29:13 PDT; 10min ago
      Docs: man:nft(8)
   Process: 3934471 ExecStart=/sbin/nft -f /etc/sysconfig/nftables.conf (code=exited, status=0/SUCCESS)
  Main PID: 3934471 (code=exited, status=0/SUCCESS)
       CPU: 21ms

Anyways, this just goes to prove what an idiot I am, that I had to Google for something so obvious and couldn't find it. I guess this was so trivial and obvious common sense that nobody bothered writing it down. 

GPT4 and Gemini were completely useless in this case.

Also, did you know that you can use named sets in nftables? Pretty useful feature: https://wiki.nftables.org/wiki-nftables/index.php/Sets
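For example, a Cloudflare-only input policy with a named set might look something like this (a sketch only - the two ranges shown are just a sample of Cloudflare's published ranges, grab the full current list from https://www.cloudflare.com/ips/):

table inet filter {
    set cloudflare {
        type ipv4_addr
        flags interval
        elements = { 173.245.48.0/20, 103.21.244.0/22 }
    }
    chain input {
        type filter hook input priority 0; policy drop;
        iif lo accept
        ct state established,related accept
        tcp dport 22 accept    # don't lock yourself out of SSH
        ip saddr @cloudflare tcp dport { 80, 443 } accept
    }
}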


Golang gripes: net/http doesn't log certain errors

I just spent like 40 minutes trying to fix an issue where Cloudflare TLS proxying was working for all TLS ports (e.g. port 2087, 2083 and so on) EXCEPT for port 443. That was driving me nuts.

Context: So I had this Origin Rule which says that when request hostname is a certain value, change destination port to 12345.

Of course, since my server serves TLS on that port, this means normal HTTP traffic to that port won't work. So if you tried visiting that site on plain HTTP, you will get "Client sent an HTTP request to an HTTPS server." which makes sense and is fine.

But here's the problem: If you tried accessing https://mywebsite.com:2083 from a web browser, it would work just fine, but if you tried visiting https://mywebsite.com:443 from a web browser, then you would see error 400.

So, port 443 was special, somehow. But where was the special-case handling for port 443? Was it in Cloudflare or was it in my server? I had a separate process running on my server that received traffic on port 443, but in theory it shouldn't have mattered because the Origin Rule should have been rewriting the destination port to 12345, so none of the traffic would ever even hit port 443 on my server. 

Anyway, I killed that process (the one listening on port 443) and it made no difference.

I also killed my process that was listening on port 12345, and that DID make a difference - instead of returning 400, Cloudflare began returning the "server is down" error as soon as I killed the process listening on port 12345. Thus, I know the Origin Rule is working and that all traffic - including traffic to port 443 - was being redirected to port 12345.

So then I thought: Okay, maybe there was some kind of TLS handshake error on my server that only shows up when users connect to the Cloudflare proxy via port 443.

But I was literally not seeing any TLS handshake errors on my server process. Yet if I killed my server process then Cloudflare would return the "server is down" error message, which means that Cloudflare MUST HAVE BEEN GETTING SOME KIND OF RESPONSE FROM my server process, which resulted in a 400. Later on, when I restarted the server, the error message changed to some bad SSL encryption error - the fact that I couldn't get a useful or even consistent error message drove me crazy. I began Googling for this: I searched for "Cloudflare origin rule fails error 400 but only on port 443" - no useful results.

But then for some reason, I thought of using curl instead of my web browser. And hey, whaddayaknow? Instead of returning error 400, curl actually returned a useful error message: "Client sent an HTTP request to an HTTPS server."

This error message shows up when I try to connect to https://mywebsite.com:443 but NOT when I try to connect to https://mywebsite.com:2083

This immediately gave me the hint that Cloudflare was decrypting the traffic. When TLS traffic goes to a Cloudflare proxy on port 443, Cloudflare decrypts it and forwards it to my server IN PLAINTEXT HTTP, BUT ONLY WHEN THE CLIENT SENT IT TO PORT 443 ON THE PROXY.

Anyway, so I simply switched my TLS setting on Cloudflare from Flexible to Full. And that made the error go away - now port 443 works just the same as port 2083.

Thinking about it, it kinda makes sense. Cloudflare does explicitly say that they decrypt TLS traffic and send it to your server via plain HTTP on the Flexible setting. But the fact that this DOESN'T happen for port 2083 is what threw me - Cloudflare didn't explicitly say that their TLS decryption ONLY happens for port 443 and not for the other TLS ports.

Anyway, I'm not sure what I learned from this, but I guess I understand how the Cloudflare Flexible vs Full encryption works a little bit better now.




EDIT: It now strikes me that the REAL problem was the lack of debugging error messages from the ListenAndServeTLS function.

It seems that by default, it only prints some TLS handshake errors. 

Not sure why it doesn't print anything when it responds with that "Client sent an HTTP request to an HTTPS server." error. 

I need to figure out how to make it log those errors.

I added logging in the handler function but the handler clearly wasn't getting called.

EDIT: It seems that there is no way to intercept those errors at present: https://stackoverflow.com/questions/45802492/how-can-i-customize-http-400-responses-for-parse-errors/45802962#45802962

See:

https://github.com/golang/go/blob/c2c4a32f9e57ac9f7102deeba8273bcd2b205d3c/src/net/http/server.go#L1927

 

I'm surprised that it still isn't possible to log such errors, even despite issues being raised about this from as far back as 2016:

https://github.com/golang/go/issues/12745 

 

I guess this is one of my gripes about Go's net/http - that it doesn't log some 400 errors and there is no way for the user to add logging for those errors.
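In the meantime, the best workaround I can think of is to do the detection myself: wrap the net.Listener so I can peek at the first byte of every connection (a TLS handshake always begins with record type 0x16) and log anything else. A sketch, untested:

package main

import (
	"bufio"
	"log"
	"net"
	"net/http"
)

// bufferedConn makes the peeked bytes readable again by the TLS layer.
type bufferedConn struct {
	net.Conn
	r *bufio.Reader
}

func (c bufferedConn) Read(p []byte) (int, error) { return c.r.Read(p) }

type sniffListener struct{ net.Listener }

func (l sniffListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	// NOTE: peeking here blocks the accept loop on slow clients; a real
	// version would do this in a goroutine per connection.
	br := bufio.NewReader(c)
	if b, err := br.Peek(1); err == nil && b[0] != 0x16 {
		// Anything other than 0x16 is (probably) a client speaking
		// plaintext HTTP to our HTTPS port.
		log.Printf("non-TLS client from %s (first byte %#x)", c.RemoteAddr(), b[0])
	}
	return bufferedConn{c, br}, nil
}

func main() {
	ln, err := net.Listen("tcp", ":12345")
	if err != nil {
		log.Fatal(err)
	}
	srv := &http.Server{} // nil Handler means http.DefaultServeMux
	log.Fatal(srv.ServeTLS(sniffListener{ln}, "cert.pem", "key.pem"))
}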

 

EDIT: Actually, fuck it. I'll just make a PR for this and see what they say.

How to have multiple TLS certificates on the same IP?

UPDATE: It turns out that Cloudflare actually allows you 10 Origin Rules which allow you to rewrite the destination port to whatever you want! So you can host a service on your web server on port (say) 8081. Now, if you tried to connect to Cloudflare proxy on port 8081, your traffic would just get dropped. But, if you created a custom rule that said that all traffic destined for a certain hostname should have the destination port redirected to port 8081, then you can connect to the Cloudflare proxy on any proxied port and it will rewrite the destination port to whatever you set it to! Pretty cool, right?

UPDATE: Apparently having x (repeated 3 times) dot com in your blog post automatically gets it marked as an adult blog post by Blogger. Pretty interesting. I didn't know that. Changed it to aaa.com, now it seems fine.

[This blog post is written for myself only]

So here is my problem:

  1. I want to host multiple domains (e.g. aaa.com and bbb.com)
  2. I want to host them on the same IP address. (IP addresses are very limited, so it's really really important for servers to be able to serve multiple domains from one IP address)
  3. I want to serve them over TLS.
  4. I want to use one TLS certificate for some domains, and another TLS certificate for other domains (yes, I do have one TLS certificate that is valid for some of my domains, but I want to use another TLS certificate for some of my other domains).
  5. I want to proxy my traffic through Cloudflare.

 Anyway, as far as I know there are only 2 solutions to this problem:

  1. Use SNI
  2. Use different ports

If you're proxying your traffic through Cloudflare (the cloud icon on the DNS page in Cloudflare) then ALL traffic will first go through the Cloudflare proxy servers before ending up at your server.

This means that if you're hosting a service on a non-proxied port, like port 8081, and then try to access that port through your domain, your traffic will simply get dropped by Cloudflare - the packets simply won't arrive at your server!

Unfortunately, the number of ports proxied by Cloudflare is quite small - only a dozen or so - and only 2 or 3 are actually cached: port 80, port 443, and I think 8080 (haven't tried).

So if you want Cloudflare proxying, you can only choose one out of a dozen or so ports. And if you want Cloudflare caching then your options are basically limited to port 80 or 443.

But let's take a step back. Why are we limited to these 2 options? Why can't we just build a reverse proxy like we can with plain old HTTP traffic?

The reason you can't reverse proxy TLS traffic the same way you reverse proxy plain old HTTP traffic is because during the initial TLS handshake (prior to SNI), the server has to send over the certificate before the client indicates which domain it's trying to connect to. When the server has multiple certificates, it doesn't know which certificate to send over. If it sends over the wrong certificate then the handshake simply fails.

But now there is this cool TLS extension called SNI - Server Name Indication (it's badly named - it should really be called DNI - Domain Name Indication, because the domain name is what is being indicated).

Without SNI, you couldn't have a TLS reverse proxy. Why? Because you want your TLS reverse proxy to direct packets to the service based on the domain name. But the initial TLS handshake packets don't contain the domain name, so you don't know which service to direct the packets to. All you can see is just the IP and port, which are the same regardless of which domain the client is requesting.

So without SNI, it would be impossible to do even something as simple as hosting multiple domains on the same IP over TLS on the same port - something that is trivial to do with HTTP, because HTTP is not encrypted so the reverse proxy can see which domain the client is requesting and just direct the traffic to the appropriate service. You can't do that with TLS. If SNI didn't exist, this blog post would be titled "Why TLS Is Annoying". 



Anyway, using different ports to serve different websites is clearly not a very scalable solution (since Cloudflare only proxies a dozen or so ports); it also lacks caching, and just generally feels pretty hacky.

So I think SNI is the right way to go here.

EDIT: Found this link about writing a reverse proxy that does SNI in Go: https://www.agwa.name/blog/post/writing_an_sni_proxy_in_go

See also: https://www.gilesthomas.com/2013/07/sni-based-reverse-proxying-with-golang



I guess a further question to ask is whether or not the reverse proxy should decrypt the TLS traffic.

I think it should not, because it would be simpler to have each separate service managing its own TLS certificates.
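Side note: if you do end up terminating TLS in a single Go process after all, crypto/tls can already pick a certificate per SNI name via GetCertificate. A sketch (the filenames and hostnames are placeholders):

package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	certA, err := tls.LoadX509KeyPair("aaa.com.crt", "aaa.com.key")
	if err != nil {
		log.Fatal(err)
	}
	certB, err := tls.LoadX509KeyPair("bbb.com.crt", "bbb.com.key")
	if err != nil {
		log.Fatal(err)
	}
	cfg := &tls.Config{
		// Called during the handshake with the SNI name the client sent.
		GetCertificate: func(hello *tls.ClientHelloInfo) (*tls.Certificate, error) {
			if hello.ServerName == "bbb.com" {
				return &certB, nil
			}
			return &certA, nil
		},
	}
	srv := &http.Server{Addr: ":443", TLSConfig: cfg}
	// Empty cert/key paths: the certificates come from TLSConfig instead.
	log.Fatal(srv.ListenAndServeTLS("", ""))
}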



Friday, March 22, 2024

How to do 2-way bidirectional communication between Raspberry Pi and Pico over USB serial

Original Source: https://forums.raspberrypi.com/viewtopic.php?t=300474

 

Spent some time looking for this really basic trivial thing that I thought would be easy to find online. 

So I want my Pico to constantly send sensor readings to my Pi, and my Pi to react in real time to changes in the sensor readings. In other words, I wanted a Python program running in the background on my Pi that constantly receives data from my Pico and reacts to it in real time.

Anyway, here is my fully tested and fully working code (yes I tested it, yes it works):

Code that runs on the Pico:

import select
import sys
import time
from machine import Pin

led = Pin(25, Pin.OUT)  # the onboard LED

count = 0
while True:
    count += 1
    time.sleep(0.5)
    led.toggle()  # blink so you can tell the program is alive
    # select() with a timeout of 0 polls stdin without blocking
    if select.select([sys.stdin], [], [], 0)[0]:
        line = sys.stdin.readline()
        print("You said:", line, count)
    else:
        print("..", count)

The LED toggle is there to tell you that the program is running - if the LED is blinking, then it means the program is running.

Code that runs on the Raspberry Pi:

#!/usr/bin/env python3
import os
import time

import serial  # pyserial: pip install pyserial

if os.path.exists('/dev/ttyACM0'):
    # Set timeout=0 for nonblocking read
    # Set timeout=None for blocking read
    ser = serial.Serial('/dev/ttyACM0', 115200, timeout=None)
    time.sleep(1)  # give the port a moment to settle
else:
    print("ttyACM0 not detected")
    exit()

last_time = time.time()
while True:
    # VERY IMPORTANT: Input MUST be newline-terminated!!!!!
    # Send "hello" to the Pico roughly once a second.
    if time.time() - last_time > 1:
        last_time = time.time()
        ser.write("hello\n".encode('ascii'))
    print("Waiting for readline to return...")
    pico_data = ser.readline()  # blocks until the Pico sends a full line
    pico_data = pico_data.decode("utf-8", "ignore")
    print(pico_data)



Thursday, March 21, 2024

Why can't you hardcode NTP IP???

UPDATE (21 March 2024): Some relevant text from RFC8633:

https://www.rfc-editor.org/rfc/rfc8633.html#section-7

   Note well that using a single anycast address for NTP presents its
   own potential issues.  It means each client will likely use a single
   time server source.  A key element of a robust NTP deployment is each
   client using multiple sources of time.  With multiple time sources, a
   client will analyze the various time sources, select good ones, and
   disregard poor ones.  If a single anycast address is used, this
   analysis will not happen.  This can be mitigated by creating
   multiple, separate anycast pools so clients can have multiple sources
   of time while still gaining the configuration benefits of the anycast
   pools.

   If clients are connected to an NTP server via anycast, the client
   does not know which particular server they are connected to.  As
   anycast servers enter and leave the network or the network topology
   changes, the server to which a particular client is connected may
   change.  This may cause a small shift in time from the perspective of
   the client when the server to which it is connected changes.  Extreme
   cases where the network topology changes rapidly could cause the
   server seen by a client to rapidly change as well, which can lead to
   larger time inaccuracies.  It is RECOMMENDED that network operators
   only deploy anycast NTP in environments where operators know these
   small shifts can be tolerated by the applications running on the
   clients being synchronized in this manner.

 

UPDATE (21 March 2024): Some hacky workarounds: You can probably hardcode these IPs, though there is absolutely no guarantee that they will continue to work:

miyuru on Dec 30, 2022 | prev | next [–]

> It would be great to see Google or Cloudflare use their infrastructure to provide anycasted NTP IP addresses.

Google, Cloudflare and Facebook has vanity IPv6 address, pretty sure they are all static anycast IPs.

time.google.com - 2001:4860:4806::

time.cloudflare.com - 2606:4700:f1::123

time.facebook.com - 2a03:2880:ff0c::123 


jedisct1 on Dec 30, 2022 | parent | prev | next [–]

As for IPv4, time.google.com has been 216.239.35.0 since 2016, so it's unlikely to change anytime soon either.


I can confirm that time.google.com still resolves to that IP address. I also ran these commands today (21 March 2024) for recordkeeping purposes:

$ host time.facebook.com
time.facebook.com has address 129.134.29.123


$ host time.cloudflare.com
time.cloudflare.com has address 162.159.200.123
time.cloudflare.com has address 162.159.200.1
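So if you wanted to hardcode these (again, with the caveat that they could change at any time), you could pin them in your NTP daemon's config, e.g. for chrony:

# /etc/chrony/chrony.conf -- hardcoded servers using the IPs recorded above
# (note: Google smears leap seconds, so mixing it with non-smearing servers
# is not ideal)
server 216.239.35.0 iburst     # time.google.com
server 162.159.200.123 iburst  # time.cloudflare.com
server 129.134.29.123 iburst   # time.facebook.com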


UPDATE (21 March 2024): Still no viable solutions, see below.

UPDATE: I see that there are already-existing solutions for the problem I described:

  • tlsdate - https://github.com/ioerror/tlsdate       (but see below)
  • roughtime proposal - https://datatracker.ietf.org/doc/html/draft-ietf-ntp-roughtime

UPDATE: Here's a relevant blog post by Hanno Bock: https://blog.hboeck.de/plugin/tag/tlsdate

tlsdate is a hack abusing the timestamp of the TLS protocol. The TLS timestamp of a server can be used to set the system time. This doesn't provide high accuracy, as the timestamp is only given in seconds, but it's good enough.

I've used and advocated tlsdate for a while, but it has some problems. The timestamp in the TLS handshake doesn't really have any meaning within the protocol, so several implementers decided to replace it with a random value. Unfortunately that is also true for the default server hardcoded into tlsdate.

Some Linux distributions still ship a package with a default server that will send random timestamps. The result is that your system time is set to a random value. I reported this to Ubuntu a while ago. It never got fixed, however the latest Ubuntu version Zesty Zapis (17.04) doesn't ship tlsdate any more.

Given that Google has shipped tlsdate for some time in ChromeOS it seems unlikely that Google will send randomized timestamps any time soon. Thus if you use tlsdate with www.google.com it should work for now. But it's no future-proof solution.

TLS 1.3 removes the TLS timestamp, so this whole concept isn't future-proof. Alternatively it supports using an HTTPS timestamp. The development of tlsdate has stalled; it hasn't seen any updates lately. It doesn't build with the latest version of OpenSSL (1.1), so it will likely become unusable soon.

Roughtime

Roughtime is a Google project. It fetches the time from multiple servers and uses some fancy cryptography to make sure that malicious servers get detected. If a roughtime server sends a bad time then the client gets a cryptographic proof of the malicious behavior, making it possible to blame and shame rogue servers. Roughtime doesn't provide the high accuracy that NTP provides.

From a security perspective it's the nicest of all solutions. However it fails the availability test. Google provides two reference implementations in C++ and in Go, but it's not packaged for any major Linux distribution. Google has an unfortunate tendency to use unusual dependencies and arcane build systems nobody else uses, so packaging it comes with some challenges.

But wait, it looks like roughtime also requires DNS? At least I haven't been able to find any roughtime IPs that I can hardcode. 


Original post:

People online say that you shouldn't hardcode NTP IPs, but I don't see why this has to be the case. 

You can hardcode 1.1.1.1 for DNS, so why can't you hardcode an IP for NTP? 

People online say that the NTP server might go down, but that shouldn't be an issue because IP anycast will automatically route the traffic to the nearest available server.

People online say that you might overload the server, but you can do load balancing internally within your datacenter in any number of ways, so that shouldn't be an issue either.

You can argue that IP anycast won't work because the packets might get redirected to another server, but this happens so rarely in practice that it shouldn't be a problem, and you can just try again if it fails.

I don't see what's so special about NTP that you can't have an anycast IP for it like 1.1.1.1

I am writing this blog post because TLS won't work if your clock is wrong. If you force your machine to only use DNS-over-HTTPS, then you can't resolve any domains if your clock is wrong.

So this leads to a catch-22 situation: Your DNS doesn't work because your clock is wrong, and you can't fix your clock because you can't resolve NTP domain names to IP addresses because your DNS doesn't work.

This problem would be solved if we could hardcode an IP address for NTP just like we can do with DNS (1.1.1.1)

EDIT: I see that someone has already made a blog post on this: https://news.ycombinator.com/item?id=34177331


> Alternatively it would be good to use an anycast IP for NTP. This is normally a bad idea because it makes calculating skew hard/unreliable, but that really should just mean a poorly sync'ed clock. So set the Anycast clock to be an intentionally high/poor Stratum score, list this along with a DNS based address so it's used until the encrypted DNS can be resolved with a better Stratum score. -- Bob H

Yes, so I suppose anycast might cause poor skew, though that isn't a problem for this use case because TLS will work even if your clock is a few minutes wrong. 

But I suppose we could create a simpler version of NTP whose purpose is to just set your clock to some good-enough-for-TLS time, and then switch to actual NTP once your DNS works.
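Here's a sketch of that idea in Go: fetch the Date header over plain HTTP from a hardcoded IP, so neither DNS nor certificate validation (which itself needs a correct clock) is involved. The timestamp is unauthenticated and only has 1-second resolution, but that's good enough for TLS. Actually applying it to the system clock (settimeofday) is left out:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Don't follow redirects - the headers of the first response are enough.
	client := &http.Client{
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	// Hardcoded IP, so no DNS needed; virtually every HTTP server sends a
	// Date header, even on a redirect response.
	resp, err := client.Get("http://1.1.1.1/")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	t, err := http.ParseTime(resp.Header.Get("Date")) // 1-second resolution
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("server time:", t.UTC())
}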


Sunday, March 10, 2024

Protip: Write your email in a separate text editor then copy it into Gmail

Today I fucked up: while writing an email I accidentally pressed ctrl+enter (I meant to type shift+enter), which sent it. I looked around for the Undo Send button and couldn't find it, so I clicked on my Sent box and right-clicked on my email there, and couldn't see an Undo Send option either. In the end I could not undo the accidental send.

So, 3 lessons learned:

  1. Write your emails in a separate text editor, then copy-paste it into your browser email editor once you're done.
  2. Disable the Gmail keyboard shortcuts in the Gmail settings.
  3. Remember that the Undo Send button is in the tiny little popup on the bottom left hand side of the screen. If you click on anything in Gmail then the popup goes away and you can't undo your send anymore.

Tbh I think the Undo Send should be in the right-click menu in the Sent box. It's really bugging me - I think this is a serious usability issue. Also the Undo Send time period should be customizable up to 1 minute so that I have time to go to my Sent box and manually Unsend the email.

But anyway, writing your email in an external text editor is foolproof and will work regardless of your email provider and completely mitigates all of the above mentioned problems, so as long as you do that you don't need to worry about any of what I just said.

Saturday, March 9, 2024

What is the cheapest VPS? AWS vs GCP vs Azure

UPDATE: I got the AWS pricing wrong. You actually need to pay an additional ~$3.60 per month for the IPv4 address, even if the IP is always attached to your instance. This is a recent pricing update and completely changes the cost calculations. This means that the minimum possible AWS EC2 instance cost is now something like $5.60 a month if you include the IPv4.

UPDATE: I got the Azure pricing wrong. If you select any Linux OS image Azure will force you to get a 30GB OS disk which costs $2.40 if you're using standard SSD (more if you're using premium). This brings the Azure pricing to be more than double the AWS price for the t4g.nano ($1.90 per month for t4g.nano including the mandatory 8GB EBS, compared to $3.82 per month for b1ls including the mandatory OS disk). See below for original blog post.

UPDATE: I tested the AWS t4g.nano disk performance and measured 131MB/s write speed for my 9GB disk which uses gp3 storage (which is the default). See below for more details.

 

So I wanted a very small VPS that I can run a lightweight Linux instance on. I will only be using it for personal uptime monitoring so very little egress (I know AWS, GCP, and Azure all give 100GB free monthly egress and that should be more than enough) which means I don't have to worry about bandwidth costs. One of the great attractions of these big cloud vendors is that they offer unlimited free ingress traffic, which few VPS vendors provide.

UPDATE: If you really just want uptime monitoring, fly.io gives you 3x free 256mb "VM" with 160GB monthly egress and free ingress, which is probably enough - but note that fly.io is not a VPS unlike the other services mentioned in this post.

So I looked at AWS, GCP and Azure and found that the cheapest instances are as follows:

  • AWS: t4g.nano (ARM64) - 0.5G RAM - 2x ARM vCPU - both CPU and disk are burstable
  • Azure: b1ls  - 0.5G RAM - 1x x64 vCPU - both CPU and disk are burstable
  • GCP: e2.micro  - 1G RAM

The t4g.nano and b1ls come out to around the same price for similar configurations. AWS requires you to add a certain amount of EBS to match the snapshot image. Azure only gives you 4GB ephemeral disk for free, so if you want persistence you need to pay more. UPDATE: When trying to create a Linux b1ls instance Azure will automatically add a 30GB OS disk which costs $2.40 if you're using standard SSD.

3 year reserved b1ls: $1.42 per month
E4 SSD 32 GiB: $2.40 per month
Total cost for cheapest possible b1ls: $3.82 per month

With 8GB persistent storage you are looking at around $1.90 per month for t4g.nano in the US (and only slightly more expensive outside of the US) vs $3.82 per month for the b1ls in West US (not $2.02 as I originally calculated) and $4.13 (not $2.30) in Central US.

GCP e2.micro comes out to be more expensive at $2.75 including the smallest possible boot disk even with the 3 year committed use discount, but that's only in the cheapest US regions. In other regions it is much much more expensive e.g. in Los Angeles (us-west2) it is $3.79 / month, and outside of the US it is even more expensive. The f1.micro would have been cheaper than the e2.micro except for the fact that the f1.micro is not eligible for the committed use discount, only for the sustained use discount which is only like 30%.

It should be noted that Azure offers price matching with AWS for equivalent services, which I thought might explain why the AWS and Azure prices looked so similar at first. But it's not even close lol, Azure is WAY more expensive than AWS: AWS only costs $1.90 per month while Azure costs $3.82 per month - more than twice the cost, and even more than GCP in the US.

Of course, this says nothing about how the CPU/disk performance compares for the t4g.nano vs the b1ls vs the e2.micro.

Tbh I can see why the e2.micro is more expensive than the t4g.nano since the e2.micro has 1G of RAM compared to the half gig in the t4g.nano ... but I can't see how the Azure price is even remotely justifiable. Azure says they price match AWS but with the 30GB OS disk I don't see how they could do that unless they make the OS disk free (or just fucking downgrade it to 8GB - why the fuck does a Linux image require 30GB??????? AWS only requires an 8GB boot disk for Debian and GCP only requires 10GB, so it really is outrageous that Azure requires a 30GB OS disk).


EDIT:

A few years ago (in 2019) Rasmus Lerdorf wrote this blog post comparing different cheap VPS providers: https://toys.lerdorf.com/low-cost-vps-testing

He obtained the following numbers for AWS Lightsail disk performance:

Disk IO 65 MB/s write, 65 MB/s read

However, that was back in 2019. In 2020 AWS introduced gp3 disks which are newer and more performant than the old SSDs:

In December 2020, AWS announced general availability of a new Amazon EBS General Purpose SSD volume type, gp3. AWS designed gp3 to provide predictable 3,000 IOPS baseline performance and 125 MiB/s, regardless of volume size. With gp3 volumes, you can provision IOPS and throughput independently, without increasing storage size, at costs up to 20% lower per GB compared to gp2 volumes. 

Unlike gp2, where performance is tied to disk size, with gp3 you always get the same performance regardless of disk size, which is really good if you want a really small disk with decent performance (which is exactly what I want). And gp3 is also 20% cheaper than gp2.

To test this, I spun up a t4g.nano with 9GB of gp3 and ran fio and got these results:

Run status group 0 (all jobs):
 WRITE: bw=131MiB/s (137MB/s), 131MiB/s-131MiB/s (137MB/s-137MB/s), io=7996MiB (8384MB), run=61240-61240msec

Disk stats (read/write):
 nvme0n1: ios=2347/32762, merge=28/179, ticks=11093/3568868, in_queue=3579960, util=99.43%
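(For reference, a sequential-write fio test of this sort looks something like the following - not necessarily my exact invocation:)

fio --name=writetest --filename=/fiotest.tmp --size=8G \
    --rw=write --bs=1M --ioengine=libaio --direct=1 --numjobs=1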

When I saw this I was shocked. I had misread the 125MiB/s as MEGABITs per second, but actually it's MEBIBYTEs per second, which is over 8 times larger! So 125 megabits per second is only around 15.6 megaBYTES per second (which is pretty slow, even for spinning rust) but actually AWS gp3 gives 125 MEBIBYTES per second which is around 131 megaBYTES per second, which is pretty good!


There is also Oracle Cloud which gives you 200GB of Always Free storage. If you select the highest performance disk then you can get around 100MB/s throughput at around 60GB of disk storage, which I think is within the Always Free tier usage limit but I'm not sure, will have to wait and see if Oracle charges me for it.


Anyway, I didn't measure Azure disk performance.



Thursday, March 7, 2024

PSA: VA monitors can get burn-in

So I got a VA monitor a few months ago, and today I noticed that the taskbar has "burned in" - i.e., the ghost image doesn't go away even when I turn the monitor off for a while or play some full-screen video.

So just a PSA: VA monitors can get burn in. 

Some more details: The burn-in isn't noticeable when the full screen is white, but it becomes noticeable when the full screen is a grey-blue color - in that case the task bar area is noticeably darker than the rest of the screen.

Monday, March 4, 2024

😡😡😡 Incorrect Stack Exchange Answers Make Me Angry 😡😡😡

Came across this incorrect yet highly upvoted answer today: https://unix.stackexchange.com/questions/121654/convenient-way-to-check-if-system-is-using-systemd-or-sysvinit-in-bash

The answer says:

Systemd and init have pid = 1

pidof /sbin/init && echo "sysvinit" || echo "other"

Check for systemd

pidof systemd && echo "systemd" || echo "other"

But that's fucking wrong. On modern Debian /sbin/init is a symlink to /lib/systemd/systemd.

So if you tried running the commands in the answer on a modern Debian system it will tell you that you're using sysvinit when in fact you're using systemd.
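For what it's worth, a check that does work is the one systemd's own sd_booted() performs - test for a directory that only systemd creates:

if [ -d /run/systemd/system ]; then
    echo "systemd"
else
    echo "other"
fi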

This is fucking infuriating and it makes me angry that I can't post another answer on that question to debunk the existing incorrect answer because that question is closed as Duplicate. This is one of the many things that enrage me about StackExchange.

Notice how there is a comment on that answer saying that it's wrong but that comment has only 38 upvotes whereas the answer has 56 upvotes. This is an intrinsic design flaw in Stack Exchange: old (outdated and incorrect) answers will tend to have more upvotes simply because they've been around for longer and thus had more time to collect upvotes than newer answers and especially comments (because who reads the comments?)

It's even worse because the answer is very popular, for example it's referenced by this answer here: https://askubuntu.com/a/1246465



Saturday, March 2, 2024

Intel NUC is not compatible with SSK SD300

So I tried 3 USB drives - an SSK SD300, a Sandisk Extreme 32GB, and a Kanguru SS3 - and I wrote the exact same ISO image (Debian 12.5) to each USB stick using the exact same method (dd). Then I plugged each USB into the Intel NUC and saw that the BIOS would only recognize the Sandisk Extreme and the Kanguru SS3 but not the SSK SD300.

This means that it's not the file system that's the issue here, since I formatted the Sandisk Extreme in the exact same way and it works.

I was going crazy thinking "where did I mess up??". I switched from cp to dd thinking that was the problem but no. 

Tried different USB ports, didn't make a difference.

I checked and verified that the SSK SD300 boots up perfectly fine on my desktop PC.

So it really seems like the Intel NUC just doesn't like the SSK SD300 for some reason.

UPDATE: The Kingston DataTraveler Max doesn't work with my NUC either. What these two drives have in common is that they're both USB 3.2. Maybe the NUC BIOS can't recognize USB 3.2 drives.

UPDATE: I tried connecting the SD300 via USB-A to USB-C adaptor, didn't work.

I also tried connecting it via a USB 3.0 hub, also didn't work. 

UPDATE: The Samsung BAR USB works (that one is USB 3.1, which supports my hypothesis that the problem is with USB 3.2)

Thursday, February 22, 2024

Don't run RHEL on e2-micro

So I wanted to install nano on my e2-micro instance, but every time I tried to do so it just hung the entire system.

Turns out dnf uses so much memory that it crashes the system.

I created a 2GB swapfile and then tried dnf install nano again and it seemed like it got stuck again...but this time I waited a few minutes and then nano was installed.
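(For reference, the standard way to create and enable a 2GB swapfile looks like this - not necessarily exactly what I ran:)

sudo dd if=/dev/zero of=/swapfile bs=1M count=2048
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile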

Running dnf install was fast for the first 20 or so items, then it got stuck on google-cloud-cli for like 20 minutes, at which point I got impatient and just killed it.

Seems like running RHEL-based systems on a GCP e2-micro is just not practical.

EDIT: I used a "balanced persistent" disk for this, which cost me money. To qualify for the free tier you need a "standard persistent" disk, which might have worse performance.

Don't close resp.Body on error

So my server finally crashed for the first time today. Luckily I was running it through systemd, which captured the stderr (yay!), so I saw what the problem was. Turns out it panicked in the function where I ping the health check URL.

So, for context, my web service has a goroutine which continually pings a health check URL every few seconds.

And today, that http Get request failed, returning an error. 

So here's the code that failed:

        resp, err := http.Get(url)
        if err != nil {
            log.Println(err)
        }
        resp.Body.Close()

See if you can spot the bug here. Yes, the problem is that when err is non-nil, resp can be nil. So calling resp.Body.Close() dereferences a nil pointer, which causes a panic. Which is exactly what happened.

The correct thing is to return when err != nil. Do not call resp.Body.Close() when err is non-nil.


I asked GPT4 this question and it actually spotted the bug immediately:

This code appears incomplete, and there is a critical bug that could cause a program execution to panic. The issue is that http.Get(url) could fail and return a nil response. If that occurs, calling resp.Body.Close() will trigger a runtime panic because you're trying to access a property on a nil value.

You should always check whether err is nil before working with the resp object to avoid such issues. Here's a corrected version:

        resp, err := http.Get(url)
        if err != nil {
            log.Println(err)
            return
        }
        defer resp.Body.Close()

By returning from the function after logging the error, you're ensuring resp.Body.Close() isn't called when resp is nil.

Also, it's good practice to use the defer statement to close resources like this, which ensures resp.Body.Close() is called before the function returns, in any case. It helps prevent leaks of resources.


Maybe I should ask GPT4 to check over all of my code...

Monday, February 19, 2024

Wrote a tic tac toe AI

I've wanted to write a tic tac toe AI for years and I finally decided to get off my ass and do it: https://github.com/1f604/tictactoe-ai

The actual core algorithm took me only a few minutes to code up, but the scoring function was the hard part. Given perfect play, every first move leads to a draw, so there's no reason to prefer any particular move. But if the opponent is imperfect, then some moves are better than others, since some moves will get you into a state where you can still win if the opponent makes a mistake, while other moves will get you into a state where you cannot win no matter what. So my scoring function (win_probability) calculates this by assuming the opponent picks moves randomly (with a heuristic).
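
Here's a rough sketch of the idea (to be clear, this is not the actual code from the repo - it's a minimal version that assumes a uniformly random opponent and leaves out the heuristic):

#include <algorithm>
#include <array>
#include <iostream>

using Board = std::array<char, 9>; // each cell is ' ', 'X' or 'O'

char winner(const Board& b) {
    static const int lines[8][3] = {{0,1,2},{3,4,5},{6,7,8},{0,3,6},
                                    {1,4,7},{2,5,8},{0,4,8},{2,4,6}};
    for (const auto& l : lines)
        if (b[l[0]] != ' ' && b[l[0]] == b[l[1]] && b[l[1]] == b[l[2]])
            return b[l[0]];
    return ' ';
}

bool board_full(const Board& b) {
    for (char c : b) if (c == ' ') return false;
    return true;
}

// Probability that 'me' eventually wins, assuming 'me' always picks the move
// that maximizes this probability while the opponent picks uniformly at
// random among the legal moves. 'turn' is whose move it is.
double win_probability(Board& b, char me, char turn) {
    char w = winner(b);
    if (w == me) return 1.0;
    if (w != ' ' || board_full(b)) return 0.0; // loss or draw counts as not winning
    char next = (turn == 'X') ? 'O' : 'X';
    double best = 0.0, sum = 0.0;
    int moves = 0;
    for (int i = 0; i < 9; i++) {
        if (b[i] != ' ') continue;
        b[i] = turn;                              // try the move...
        double p = win_probability(b, me, next);
        b[i] = ' ';                               // ...then undo it
        best = std::max(best, p);
        sum += p;
        moves++;
    }
    return (turn == me) ? best : sum / moves;
}

int main() {
    Board b; b.fill(' ');
    b[4] = 'X'; // X opens in the center
    std::cout << "center: " << win_probability(b, 'X', 'O') << "\n";
    b[4] = ' '; b[1] = 'X'; // X opens on an edge instead
    std::cout << "edge:   " << win_probability(b, 'X', 'O') << "\n";
}

Against a random opponent, the center opening should score higher than an edge opening - which is exactly the kind of distinction that plain minimax can't make, since under perfect play both lead to a draw.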

Thursday, February 15, 2024

Wrote a Theseus solver

So I've been playing this iOS game called Theseus and there's a level I couldn't solve (level 80) and so I decided to write a Theseus solver to solve it for me lol. Anyway it took me less than a day to write the whole thing. Here it is:

https://github.com/1f604/theseus-solver

Tuesday, January 16, 2024

why gcc and clang sometimes emit an extra mov instruction for std::clamp on x86

UPDATE: This blog post has been updated based on feedback from Reddit and Hacker News.

HN discussion: https://news.ycombinator.com/item?id=39011850

Reddit discussion: https://www.reddit.com/r/cpp/comments/1980q8l/stdclamp_still_generates_less_efficient_assembly/

How do you correctly implement std::clamp?

[Credit for this section goes to Reddit user F54280]

To make sense of the rest of this blog post, we have to first discuss the correctness requirements of std::clamp in order to understand why std::clamp is implemented the way it is in libstdc++.

Here is an incorrect implementation of std::clamp:

#include <algorithm>
double clamp(double v, double min, double max){
    return std::min(max, std::max(min, v));
}

The above implementation will return the correct answer most of the time but will return an incorrect result when dealing with positive/negative zeros, because, according to cppreference, clamp should:

If v compares less than lo, returns lo; otherwise if hi compares less than v, returns hi; otherwise returns v.

So if I call std::clamp(-0.0, +0.0, +0.0) it should return -0.0. Why? Because according to the IEEE standard, positive and negative zero compare equal. Positive zero is not greater than negative zero. Therefore, since v does not compare less than lo, and hi does not compare less than v, the call to std::clamp must return v, which is -0.0 in this case.
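
You can check this for yourself - the catch is that -0.0 == +0.0 is true, so you need std::signbit to see which zero you actually got back. A quick test (requires C++17 for std::clamp):

#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
    double r = std::clamp(-0.0, +0.0, +0.0);
    // signbit distinguishes -0.0 from +0.0, which operator== cannot
    std::cout << std::signbit(r) << "\n"; // prints 1: std::clamp returned -0.0
}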

The incorrect implementation above does not return -0.0; instead it returns +0.0. Why? Because std::min and std::max return the first parameter when the two parameters compare equal. Since negative zero and positive zero compare equal, they return the first parameter. The implementation above therefore returns max when max is equal to v, and min when min is equal to v - so it is really doubly wrong.

A correct implementation of clamp must return v when v is equal to both min and max. So if v is -0.0 and min and max are both +0.0, then clamp must return -0.0. With that in mind, let's look at some correct implementations.

Here is the (correct) implementation that libstdc++ uses:

double clamp(double v, double lo, double hi){
    return std::min(std::max(v, lo), hi);
}

This implementation is correct because std::max(v, lo) will return v when v is equal to lo, and std::min(std::max(v, lo), hi) will return std::max(v, lo) when it's equal to hi.

And here is an alternative correct implementation:

double clamp(double v, double min, double max){
    return std::max(std::min(v, max), min);
}

This implementation is correct because std::min(v, max) will return v when v is equal to max, and std::max(std::min(v, max), min) will return std::min(v, max) when it's equal to min. 

These two are probably the only correct implementations of std::clamp using std::min and std::max. You cannot change the order of the parameters in the calls to std::min or std::max, because that would cause v to not be returned when it's equal to min and max. The semantics of std::clamp requires that v be returned when it's equal to min and max.
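
You can verify the tie-breaking behavior of std::min and std::max directly (again using std::signbit to tell the two zeros apart):

#include <algorithm>
#include <cmath>
#include <iostream>

int main() {
    // std::max returns its first argument when the two arguments compare equal:
    std::cout << std::signbit(std::max(-0.0, +0.0)) << "\n"; // prints 1: returned -0.0
    std::cout << std::signbit(std::max(+0.0, -0.0)) << "\n"; // prints 0: returned +0.0
}

This is exactly why swapping the argument order inside the std::min/std::max calls breaks the implementation.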

Why does the standard library (libstdc++) implementation of std::clamp sometimes generate an extra mov instruction? 

[Credit for this section goes to Reddit user F54280 as well as HN commenters jeffbee and vitorsr]

This is the main focus of this blog post and it refers to the following observation:

https://godbolt.org/z/rq9dsGxh5

#include <algorithm>

double incorrect_clamp(double v, double lo, double hi){
    return std::min(hi, std::max(lo, v));
}

double official_clamp(double v, double lo, double hi){
    return std::clamp(v, lo, hi);
}

double official_clamp_reordered(double hi, double lo, double v){
    return std::clamp(v, lo, hi);
}

double correct_clamp(double v, double lo, double hi){
    return std::max(std::min(v, hi), lo);
}

double correct_clamp_reordered(double lo, double hi, double v){
    return std::max(std::min(v, hi), lo);
}

Generated assembly

incorrect_clamp(double, double, double):
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret
official_clamp(double, double, double):
        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2
        ret
official_clamp_reordered(double, double, double):
        maxsd   xmm1, xmm2
        minsd   xmm0, xmm1
        ret
correct_clamp(double, double, double):
        minsd   xmm2, xmm0
        maxsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret
correct_clamp_reordered(double, double, double):
        minsd   xmm1, xmm2
        maxsd   xmm0, xmm1
        ret

So the interesting observation here is that by reordering the function parameters with respect to the arguments passed to either std::clamp or std::min and std::max, we can achieve shorter assembly.

Now the question is why does reordering the function parameters with respect to the arguments to std::clamp or std::min and std::max affect the number of assembly instructions generated? The short answer is that it's because of 3 things:

  1. The requirement imposed by the C++ standard
  2. The limitations on what assembly instructions can be used due to architecture
  3. The requirements imposed by the ABI (the calling convention)

I have already explained the requirement imposed by the standard which is that v must be returned when it is equal to min and max. So now let's discuss the relevant assembly instructions.

First, let's make sure we understand what the minsd, maxsd, and movapd instructions do and why they are generated. Long story short, they are generated because we didn't specify a microarchitecture to the C++ compiler, which means the compiler must assume x86-64-v1 (the baseline), which only supports SSE/SSE2. If we had specified a microarch level of x86-64-v3 or greater (e.g. -march=x86-64-v3), then the compiler would generate AVX instructions instead. But since we didn't, the compiler can't generate AVX instructions, because it has to assume that the target CPU only supports the x86-64-v1 instruction set, which doesn't include AVX.

The SSE minsd and maxsd instructions each have two operands. The first operand is the destination operand and the second operand is the source operand. This is the norm in x86 Intel syntax, where the mov instruction can be thought of as an assignment, so when you see:

mov eax, ebx

You should think of that as:

eax = ebx

Because it simply copies the value held in the register ebx into eax. 

So when we see:

minsd xmm0, xmm1

We should think of that as:

xmm0 = std::min(xmm1, xmm0)

The most important thing to note here is this:

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned

This means that if both the first and second registers contain 0, then the 0 in the second register will overwrite the 0 in the first register. So in the above example:

minsd xmm0, xmm1

If prior to running the above instruction, xmm0 contained negative zero and xmm1 contained positive zero, then the instruction will overwrite xmm0 with positive zero.
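
You don't even have to read generated assembly to see this tie-breaking rule in action - you can poke at the instruction from C++ via the SSE2 intrinsics (a quick sketch, x86-64 only):

#include <emmintrin.h>
#include <cmath>
#include <iostream>

int main() {
    __m128d dst = _mm_set_sd(-0.0); // plays the role of the first (destination) operand
    __m128d src = _mm_set_sd(+0.0); // plays the role of the second (source) operand
    // minsd semantics: both values are zero, so the source operand is returned
    double r = _mm_cvtsd_f64(_mm_min_sd(dst, src));
    std::cout << std::signbit(r) << "\n"; // prints 0: the -0.0 was replaced by +0.0
}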

Now that we know what the assembly instructions do, let's talk about the ABI limitation.

The ABI limitation - in this case, the calling convention (the x86-64 System V ABI) - is that the first floating-point parameter of a function is passed in xmm0, and the return value is also returned in xmm0.

This means that if you have a call like this:

std::clamp(v, lo, hi)

Then v will be passed in xmm0, and the return value will also be stored in xmm0.

What does this mean? It means that we can't start with this:

maxsd xmm0, xmm1

Why not? Because that would overwrite xmm0 with the value in xmm1 if both xmm0 and xmm1 contained zero. That means we lose the value stored in v (which is stored in xmm0), which automatically leads to the incorrect result in the case of std::clamp(-0.0, +0.0, +0.0).

It also means that we can't start with this:

minsd xmm0, xmm2

For the same reason as explained above - because it would cause us to lose the value stored in v, which we have to return in the case where v, lo, and hi are all zero.

But we can do this:

minsd xmm2, xmm0

So this will overwrite xmm2 with std::min(v, hi), and then we can do 

maxsd xmm1, xmm2

Which will overwrite xmm1 with std::max(std::min(v, hi), lo).

You could also start with

maxsd xmm1, xmm0

But no matter what you start with, the register that contains the result of the first instruction must be the second operand of the next minsd/maxsd instruction because the result might contain v, and we must not overwrite v. Therefore the register containing the result, which might be v, must be the second operand of the next instruction.

If you think about it, whatever assembly we generate has to contain at least two instructions: one minsd and one maxsd, because each instruction can only look at 2 variables and there are 3 variables that we have to look at. So there must be at least two instructions. And I just showed that regardless of whether you start with minsd or maxsd, xmm0 has to be the second operand of the first instruction. And the register that contains the result of the first operation has to be the second operand of the second instruction in order to avoid losing v. Which means the first operand of the second instruction HAS TO BE the register that wasn't involved in the first instruction (either xmm1 or xmm2).

So the result of the two comparison instructions is stored in that other register, which is not xmm0. But the ABI requires the result to be stored in xmm0. So an extra move instruction is generated in order to move the result into xmm0. 

And that's why the assembly generated for std::clamp is not the shortest. It's basically the combination of the quirks of the minsd and maxsd instructions and the ABI, which requires the first parameter (v) to be passed in xmm0 and the result to be stored in xmm0 as well.

And the 3 reasons why the compiler won't generate the shortest assembly for std::clamp naturally suggest 3 ways to get the compiler to generate shorter assembly:

  1. Relax the C++ standard
  2. Don't make v the first parameter in std::clamp
  3. Allow the compiler to use different assembly instructions (AVX)

We've already covered option 1: the incorrect version of std::clamp that I showed you. If you don't care about getting the C++ standard-mandated behavior for edge cases like positive/negative zeros, NaN values, etc., then you can just use my incorrect implementation.
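
For example, the two differ on NaN inputs, not just signed zeros. Since std::clamp returns v when neither comparison fires, a NaN v comes straight back out, whereas the incorrect version quietly turns it into lo (a small demo; incorrect_clamp here is the same incorrect implementation from above):

#include <algorithm>
#include <cmath>
#include <iostream>

double incorrect_clamp(double v, double lo, double hi){
    return std::min(hi, std::max(lo, v));
}

int main() {
    double v = std::nan("");
    std::cout << std::clamp(v, 1.0, 2.0) << "\n";      // prints nan: v is returned
    std::cout << incorrect_clamp(v, 1.0, 2.0) << "\n"; // prints 1: the NaN disappears
}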

Now let's try option 2: don't make v the first parameter in std::clamp. Does it matter whether we make v the second or third parameter? As it turns out, it doesn't matter (with some caveats: see below) - as long as v is not the first parameter, we can get the compiler to generate the shortest assembly. See https://godbolt.org/z/xYs47s8nx:

#include <algorithm>

double correct_clamp_reordered1(double lo, double hi, double v)
{
    return std::max(std::min(v, hi), lo);
}
double correct_clamp_reordered2(double hi, double lo, double v)
{
    return std::max(std::min(v, hi), lo);
}
double correct_clamp_reordered3(double hi, double v, double lo)
{
    return std::max(std::min(v, hi), lo);
}
double correct_clamp_reordered4(double lo, double v, double hi)
{
    return std::max(std::min(v, hi), lo);
}

Generated assembly

correct_clamp_reordered1(double, double, double): # @correct_clamp_reordered1(double, double, double)
        minsd   xmm1, xmm2
        maxsd   xmm0, xmm1
        ret
correct_clamp_reordered2(double, double, double): # @correct_clamp_reordered2(double, double, double)
        minsd   xmm0, xmm2
        maxsd   xmm1, xmm0
        movapd  xmm0, xmm1
        ret
correct_clamp_reordered3(double, double, double): # @correct_clamp_reordered3(double, double, double)
        minsd   xmm0, xmm1
        maxsd   xmm2, xmm0
        movapd  xmm0, xmm2
        ret
correct_clamp_reordered4(double, double, double): # @correct_clamp_reordered4(double, double, double)
        minsd   xmm2, xmm1
        maxsd   xmm0, xmm2
        ret

The above shows something very interesting. It shows that the compiler will only generate the shortest assembly when the parameters are in a very specific order with respect to the arguments of std::min and std::max.

I believe that this is due to the requirement for std::min and std::max to return the first parameter when the two parameters are equal. So the compiler is not allowed to convert between the max(min()) and min(max()) forms, because they actually are not identical. For example: https://godbolt.org/z/fhaP6dee5

#include <algorithm>
#include <math.h>
#include <iostream>

double minmax_clamp(double v, double lo, double hi)
{
    return std::min(std::max(v, lo), hi);
}

double maxmin_clamp(double v, double lo, double hi)
{
    return std::max(std::min(v, hi), lo);
}

int main() {
    std::cout << "custom clamp1:\t\t" << minmax_clamp( 0.0f, +1.0f, 0.0f ) << "\n";
    std::cout << "custom clamp2:\t\t" << maxmin_clamp( 0.0f, +1.0f, 0.0f ) << "\n";
    std::cout << "official clamp: \t" << std::clamp( 0.0f, +1.0f, 0.0f ) << "\n\n\n";
}

Output

custom clamp1: 0
custom clamp2: 1
official clamp: 0

So you can see, the minmax clamp and the maxmin clamp are actually not semantically equivalent when lo is greater than hi (which is undefined behavior, btw):

The behavior is undefined if the value of lo is greater than hi.

But the C++ compiler doesn't know that we're trying to implement std::clamp, so it cannot freely convert between the maxmin and the minmax clamp implementations, because they are not semantically equivalent.

What this means is that if you do max(min()) then the compiler has to actually generate assembly that is equivalent to that. It can't convert it to min(max()) because that's not semantically equivalent.

So that's why in the generated assembly for:

std::max(std::min(v, hi), lo); 

You always see minsd first. That's because the compiler has to calculate std::min(v, hi) first; only after that can it use the result as the argument to std::max.

This has serious ramifications for the generated assembly. Why? Because in order for the shortest assembly to be generated, we must use one of the minsd or maxsd instructions to store directly into xmm0 and that's the result that we return. This can only happen on the second line because we can't know the return value on the first line since we haven't looked at all the operands yet.

So, to generate the shortest assembly, we need xmm0 as the first operand on the second line. But we also need minsd of v and hi on the first line. 

This means that if hi is stored in xmm0 then we can't have the shortest assembly. Because that means xmm0 has to be the first operand on the first line. Which then means xmm0 must be the second operand on the second line. Which means it can't be the first operand on the second line. So the first operand must be some other register, and then we'll have to copy the value from that register into xmm0 before returning. And that's exactly what you see in the generated assembly shown above.

In conclusion, then, changing the ordering of parameters in a function can change the number of assembly instructions generated because of the ABI which mandates that the first parameter AS WELL AS the return value must both be held in the same register. 

When the first parameter in the function must be preserved as in the case of std::clamp, this can cause the compiler to emit an extra move instruction to store the result at the end into xmm0. If you reorder the parameters so that the first parameter doesn't need to be preserved, then the compiler can omit the extra move instruction by storing the result of a computation directly in xmm0 instead of moving it into xmm0 from some other register, thus resulting in shorter generated assembly code.

But as I mentioned, there is a third way to reduce the length of the generated assembly, and that is to target a newer microarchitecture, e.g. x86-64-v3, which supports AVX (by passing -march=x86-64-v3 to GCC or Clang). See: https://godbolt.org/z/dM3f8v4nM

#include <algorithm>

double standard_clamp(double v, double lo, double hi)
{
    return std::clamp(v, lo, hi);
}

Generated assembly for -march=x86-64-v3 (same in Clang and GCC)

standard_clamp(double, double, double):
        vmaxsd  xmm1, xmm1, xmm0
        vminsd  xmm0, xmm2, xmm1
        ret

Now we get the shortest possible assembly output - only two instructions emitted. This is because it's using the AVX instructions vmaxsd and vminsd, which have 3 operands each. As usual, the first operand is the destination operand, into which the result will be stored. The other two operands are dealt with in the following way:

If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned

As before, v is stored in xmm0. So here xmm0 is the third operand, i.e. the second source operand, which means its zero value will be returned as the result and written into the destination operand, which is xmm1.

So now max(xmm0, xmm1) is stored in xmm1, so xmm1 naturally should be the second source operand of the next instruction, and indeed on line 2 we see that xmm1 is the second source operand. Of course we want the result to be stored in xmm0, and here we see that xmm0 is indeed used as the destination operand, while the first source operand is the remaining variable that we haven't looked at yet, which is stored in xmm2.

So from this we can see that by targeting a more recent microarchitecture, we allow the compiler to emit more recent instructions (AVX in this case), which allows for fewer assembly instructions - in this case because the AVX instructions take 3 operands and let you specify which register to store the result into, as opposed to the SSE instructions, which take only 2 operands and simply store the result into the first operand.

Does any of this actually matter?

Some commenters have noted that std::clamp will be inlined in real code, and that real code is likely to target x86-64-v3. This is true, but even in "realistic code", the compiler still seems to generate shorter assembly for the incorrect (non-standards-compliant) clamp compared to the correct clamp:

https://godbolt.org/z/hd44KjMMn (see also https://godbolt.org/z/eq8q1jnss)

#include <algorithm>
#include <cstdio>

double incorrect_clamp(double v, double lo, double hi){
    return std::min(hi, std::max(lo, v));
}

double official_clamp(double v, double lo, double hi){
    return std::clamp(v, lo, hi);
}

double official_clamp_reordered(double hi, double lo, double v){
    return std::clamp(v, lo, hi);
}

double correct_clamp(double v, double lo, double hi){
    return std::max(std::min(v, hi), lo);
}

double correct_clamp_reordered(double lo, double hi, double v){
    return std::max(std::min(v, hi), lo);
}

int main_official(int argc, char** argv){
    double v, lo, hi;
    sscanf(argv[1], "%lf", &v);
    sscanf(argv[2], "%lf", &lo);
    sscanf(argv[3], "%lf", &hi);

    return std::clamp(v, lo, hi);
}

int main_official_reordered(int argc, char** argv){
    double v, lo, hi;
    sscanf(argv[1], "%lf", &v);
    sscanf(argv[2], "%lf", &lo);
    sscanf(argv[3], "%lf", &hi);

    return official_clamp_reordered(hi, lo, v);
}

int main_incorrect(int argc, char** argv){
    double v, lo, hi;
    sscanf(argv[1], "%lf", &v);
    sscanf(argv[2], "%lf", &lo);
    sscanf(argv[3], "%lf", &hi);

    return incorrect_clamp(v, lo, hi);
}

int main_correct_reordered(int argc, char** argv){
    double v, lo, hi;
    sscanf(argv[1], "%lf", &v);
    sscanf(argv[2], "%lf", &lo);
    sscanf(argv[3], "%lf", &hi);

    return correct_clamp_reordered(lo, hi, v);
}

Generated assembly

main_official(int, char**):
        ...repeated instructions omitted...
        call    __isoc99_sscanf
        vmovsd  xmm1, QWORD PTR [rsp+16]
        vmaxsd  xmm1, xmm1, QWORD PTR [rsp+8]
        vmovsd  xmm0, QWORD PTR [rsp+24]
        add     rsp, 32
        pop     rbx
        vminsd  xmm0, xmm0, xmm1
        vcvttsd2si      eax, xmm0
        ret
main_official_reordered(int, char**):
        ...repeated instructions omitted...
        call    __isoc99_sscanf
        vmovsd  xmm1, QWORD PTR [rsp+16]
        vmaxsd  xmm1, xmm1, QWORD PTR [rsp+8]
        vmovsd  xmm0, QWORD PTR [rsp+24]
        add     rsp, 32
        pop     rbx
        vminsd  xmm0, xmm0, xmm1
        vcvttsd2si      eax, xmm0
        ret
main_incorrect(int, char**):
        ...repeated instructions omitted...
        call    __isoc99_sscanf
        vmovsd  xmm0, QWORD PTR [rsp+8]
        vmaxsd  xmm0, xmm0, QWORD PTR [rsp+16]
        vminsd  xmm0, xmm0, QWORD PTR [rsp+24]
        add     rsp, 32
        pop     rbx
        vcvttsd2si      eax, xmm0
        ret
main_correct_reordered(int, char**):
        ...repeated instructions omitted...
        call    __isoc99_sscanf
        vmovsd  xmm1, QWORD PTR [rsp+24]
        vminsd  xmm1, xmm1, QWORD PTR [rsp+8]
        vmovsd  xmm0, QWORD PTR [rsp+16]
        add     rsp, 32
        pop     rbx
        vmaxsd  xmm0, xmm0, xmm1
        vcvttsd2si      eax, xmm0
        ret

As you can see from the extra vmovsd instructions, the compiler does seem to generate an extra assembly instruction for the correct clamp implementations regardless of parameter ordering.

So it looks like even in "realistic" code (I'm not sure how realistic my example is, but it does seem to resemble a real program in that it takes input from argv and returns an output based on the input), the compiler will still generate more assembly for the correct clamp compared to the incorrect clamp.

But why is this? Reddit user Sporule explained it:

After sscanf(), all three values are located in memory. The operand src1 in the instruction vminsd dst, src1, src2 can only be a register. You cannot specify the memory location there.
Also both lo and hi values need to be passed via src1 operand. It is incorrect to pass them via src2. Therefore, it is absolutely necessary to execute a load instruction to transfer the value from memory to a register so that the order of operands is correct. Thus, at least two vmovsd instructions are required. And this is exactly that both compilers emitted.

Let's unpack this. The vminsd instruction has the following very important characteristics:

  1. If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, then SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).
  2. The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.

The first point says that when the two source operands are equal, then the second source operand is what will be written into the destination operand. Since we want v to be returned when it's equal to whatever it's being compared to (as explained earlier, the C++ standard requires std::clamp to return v when v is equal to lo and hi), this means that v must be the second source operand. Which means that whatever v is being compared to must be the first source operand. 

The second point says that the first source operand and the destination operand must both be registers. Combining this with the above, we have that since v must be the second source operand, this means that the value that v is being compared to must be placed into a register first, before we can compare v to it.

Okay, so we have this:

vmovsd xmm_a, MEMORY_LOCATION_OF_HI_OR_LO

Here, xmm_a can be any xmm register and MEMORY_LOCATION_OF_HI_OR_LO can be either the memory location of hi or the memory location of lo - it doesn't matter. All that matters is that we load some non-v value into some register so that we can compare it to v.

Now that we've loaded it into a register, we can use it for the comparison:

vmin_or_max_sd xmm_b, xmm_a, MEMORY_LOCATION_OF_V

Here, vmin_or_max_sd could be vminsd or vmaxsd - doesn't matter. And xmm_b could be the same as or different from xmm_a - doesn't matter. Note here that we are using the memory location of v directly as the second source operand in order to avoid an extra mov instruction. Now we have already used up 2 instructions, so if we want to match the 3-instruction incorrect version, there is only one instruction left.

The important thing is that the return value of this instruction could be v. This means that the return value must be the second source operand of the next comparison instruction. Because if it's the first source operand then v won't be returned when it's equal to hi and lo. So whatever xmm_b is, it has to be the second operand of the next comparison.

And lastly, we have this:

vmin_or_max_sd xmm_?, xmm_?, xmm_b

Here we run into a problem. We know that xmm_b must be the second source operand, but what should be the first source operand? We know that it has to be a register. But it also must contain the value that we haven't looked at yet. But that's still in memory - we've only loaded one value from memory into a register using the mov from before, and we already used that value in a comparison. So we need the other value now, but that's still in memory, and we need it in a register - which means we have to use an extra mov. 

And that's why the generated assembly for the correct clamp contains an additional mov instruction compared to the incorrect clamp implementation, which is not restricted by the rule that v must be returned when it's equal to both hi and lo.

So the reason why the correct clamp implementation generates more assembly instructions than the incorrect clamp implementation in this case is different to the previous case. In the previous case, it was because of the ABI requirement that both v as well as the return value must be stored in xmm0. But in this case it's because the AVX instruction requires the first source operand to be a register.

Here's the original blog post (with corrections):

I originally wrote this blog post in 2019 (or maybe 2018 - my timestamps say that it was written before 1 May 2019).

Here's my old blog post:

Let's say you want to clamp a value v between 2 values, min and max. If v is greater than max, return max. If v is smaller than min, return min. Otherwise return v.

Ternary

Implementing it directly as per the description:

double clamp(double v, double min, double max){
    return v < min ? min : v > max ? max : v;
}

gcc 8.2:

clamp(double, double, double):
        comisd  xmm1, xmm0
        ja      .L2
        minsd   xmm2, xmm0
        movapd  xmm1, xmm2
.L2:
        movapd  xmm0, xmm1
        ret

One branch instruction.

clang 7.0:

clamp(double, double, double):                            # @clamp(double, double, double)
        minsd   xmm2, xmm0
        cmpltsd xmm0, xmm1
        movapd  xmm3, xmm0
        andnpd  xmm3, xmm2
        andpd   xmm0, xmm1
        orpd    xmm0, xmm3
        ret

Branchless.

Using intermediate values

From this stackoverflow answer: https://stackoverflow.com/questions/427477/fastest-way-to-clamp-a-real-fixed-floating-point-value

double clamp(double v, double min, double max){
    double out = v > max ? max : v;
    return out < min ? min : out;
}

gcc 8.2:

clamp(double, double, double):
        minsd   xmm2, xmm0
        maxsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret

clang 7.0:

clamp(double, double, double):                            # @clamp(double, double, double)
        minsd   xmm2, xmm0
        maxsd   xmm1, xmm2
        movapd  xmm0, xmm1
        ret

Identical output. Much better than before. Can we do better?

using std::min and std::max

Here's the incorrect implementation that I mentioned at the start of this post:

#include <algorithm>
double clamp(double v, double min, double max){
    return std::min(max, std::max(min, v));
}

gcc 8.2:

clamp(double, double, double):
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret

clang 7.0:

clamp(double, double, double):                            # @clamp(double, double, double)
        maxsd   xmm0, xmm1
        minsd   xmm0, xmm2
        ret

Yes, it generates the fewest assembly instructions but is incorrect as explained at the beginning of this post.

Let's go through the assembly to show why it's wrong.

Firstly, xmm0 holds v, xmm1 holds min, and xmm2 holds max.

Now, what does maxsd do?

If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned.

So maxsd xmm0, xmm1 compares xmm1 to xmm0, and if xmm1 is greater than xmm0 it overwrites xmm0 (which holds v). But also, if xmm1 and xmm0 are both zero, then the value in xmm0 is overwritten with the value in xmm1. Meaning in the case where xmm0 contains negative zero and xmm1 contains positive zero, the value in xmm0 will be overwritten with positive zero, thus destroying v. So we've already messed up on the first line!

The next line, minsd xmm0, xmm2, has the same problem. If they are both 0, then xmm0 will be overwritten with the value in xmm2, which is not what we want since v is stored in xmm0.

Using std::clamp

#include <algorithm>
double clamp(double v, double min, double max){
    return std::clamp(v, min, max);
}

gcc 8.2:

clamp(double, double, double):
        comisd  xmm1, xmm0
        ja      .L2
        minsd   xmm2, xmm0
        movapd  xmm1, xmm2
.L2:
        movapd  xmm0, xmm1
        ret

clang 7.0:

clamp(double, double, double):                            # @clamp(double, double, double)
        minsd   xmm2, xmm0
        cmpltsd xmm0, xmm1
        movapd  xmm3, xmm0
        andnpd  xmm3, xmm2
        andpd   xmm0, xmm1
        orpd    xmm0, xmm3
        ret

Not very efficient.

EDIT: It's been almost 5 years since I originally wrote this article, so I decided to try again using the latest versions of GCC and Clang:

gcc 13.2:

clamp(double, double, double):
        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2
        ret

clang 17.0.1:

clamp(double, double, double):                            # @clamp(double, double, double)
        maxsd   xmm1, xmm0
        minsd   xmm2, xmm1
        movapd  xmm0, xmm2
        ret

Still not the shortest - it uses one more instruction than the std::min(max, std::max(min, v)) implementation. But see the top of this post for an explanation of why it must necessarily use one more assembly instruction than the incorrect implementation in certain situations.