Friday, December 22, 2023

Sleep doesn't work when booting Windows from USB, but works when booting Linux from USB

  • I tried booting Windows from an SSD I pulled from my desktop PC. I plugged it into my mini PC via USB and booted from it.
  • Surprisingly, it just worked. Even though the desktop had a different motherboard and CPU model from the mini PC, Windows didn't complain and worked fine. I suppose this might be because I used a Microsoft account rather than an OEM license.
  • Unfortunately, the Power menu only had 2 buttons: Restart and Shutdown. 
  • I enabled Hibernate manually in the power settings, and then the Hibernate button appeared.
  • Unfortunately, pressing the Hibernate button caused the system to shut down.
  • I installed the graphics drivers manually and then the Sleep button appeared.
  • Unfortunately, pressing the Sleep button caused the system to shut down.
  • I wanted to know if the problem was with my motherboard or with the USB boot, so I installed Windows on the internal NVMe drive.
  • After installing Windows on the internal NVMe drive, I again saw only 2 buttons: Restart and Shutdown.
  • I installed the graphics driver manually and the Sleep button appeared.
  • Now the Sleep button actually sleeps the computer. When I click the mouse button it wakes the computer back up. 
I don't know why Sleep causes the computer to shut down when booting from USB.

I also booted Fedora from USB and Suspend worked fine. So I think this issue is specific to booting Windows from USB. I'm not sure why Windows decides to shut off the PC when doing Sleep, since it doesn't do that when booted from the internal drive.




Anyway, this is the first time I've seen a case where Linux is more compatible with the hardware than Windows is.

Sunday, December 17, 2023

fucking Linux!!

After waking from suspend, my mouse and keyboard stopped working. I tried plugging in another USB mouse and that didn't work either. I short-pressed the power button and there was a reaction on the screen, which means the computer wasn't frozen - the USB peripherals had just somehow stopped working.

So then I tried every USB port on my computer until I found two that actually work. Somehow, after waking from suspend, 6 out of the 8 USB ports on my computer have stopped working.

Fucking Linux.

Simple hello world Go web server memory usage

Once again I am benchmarking the hello world Go webserver:

//go:build ignore

package main

import (
    "fmt"
    "log"
    "net/http"
)

func handler(w http.ResponseWriter, r *http.Request) {
    fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
}

func main() {
    http.HandleFunc("/", handler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

 

   o: VIRT  --  Virtual Image (kb)
      The  total  amount  of  virtual  memory  used  by the task.  It
      includes all code, data and shared libraries  plus  pages  that
      have  been  swapped out and pages that have been mapped but not
      used.

   q: RES  --  Resident size (kb)
      The non-swapped physical memory a task has used.

   t: SHR  --  Shared Mem size (kb)
      The amount of shared memory used by a task.  It simply reflects
      memory that could be potentially shared with other processes.

Readings from top at various points (values in kb):

Phase                                  VIRT     RES    SHR
Startup                                1600620  5348   2992
While handling heavy load              3079356  44024  5312
After handling heavy load (resting)    3079356  55372  5312
A new heavy load                       3153088  51312  5312
After handling heavy load (resting)    3300552  49660  5312
A new heavy load                       3300552  48488  5312
After handling heavy load (resting)    3300552  48560  5312
A new heavy load                       3300552  53152  5312

Overall, it seems like VIRT grows and then plateaus at a certain point, while RES fluctuates but also seems to top out.

https://go.dev/doc/faq#Why_does_my_Go_process_use_so_much_virtual_memory

Why does my Go process use so much virtual memory?

The Go memory allocator reserves a large region of virtual memory as an arena for allocations. This virtual memory is local to the specific Go process; the reservation does not deprive other processes of memory.

To find the amount of actual memory allocated to a Go process, use the Unix top command and consult the RES (Linux) or RSIZE (macOS) columns.
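
As a complement to top's numbers, the process can also report on itself via runtime.ReadMemStats. Here's a minimal sketch - the /mem path and the fields I print are my own choices, not part of the original server:

//go:build ignore

package main

import (
	"fmt"
	"log"
	"net/http"
	"runtime"
)

// memHandler reports the Go runtime's own view of memory usage.
func memHandler(w http.ResponseWriter, r *http.Request) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// HeapAlloc is live heap memory; Sys is total memory obtained from the OS.
	fmt.Fprintf(w, "HeapAlloc=%d KiB Sys=%d KiB NumGC=%d\n",
		m.HeapAlloc/1024, m.Sys/1024, m.NumGC)
}

func main() {
	http.HandleFunc("/mem", memHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}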

 

Saturday, December 16, 2023

Progress Update: Benchmarks for writes and reads

I ran wrk against the submit endpoint on my server (POST requests submitting new temporary URLs to be shortened) and these are the results:

# wrk -t12 -c400 -d30s -spost.lua http://system/submit
Running 30s test @ http://system/submit
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    58.91ms   48.48ms 313.79ms   58.52%
    Req/Sec   610.99    112.33     0.98k    73.37%
  220305 requests in 29.68s, 31.45MB read
  Socket errors: connect 0, read 0, write 0, timeout 396
Requests/sec:   7422.65
Transfer/sec:      1.06MB

So I am getting over 7,400 requests per second that are actually hitting disk. Memory usage went up by around 400 MB with around 300K short URLs.

By my calculations, each short URL should take around 40 bytes, so 300K would only be about 12 MB, but I guess there's a lot of overhead, so maybe it's actually more like 10x that much? Seems like a lot though. Maybe there is a memory leak somewhere.

Number of temporary URLs: 484937
Number of permanent URLs: 0
Memory usage (from systemctl status): 419.8M 

The number of file descriptors went up to 408 at one point, then went back down to 12, which suggests that there is no file descriptor leak.

I'm consistently getting about 6.6K requests per second for the submit POST requests.

Number of temporary URLs: 966384
Number of permanent URLs: 0
Memory usage (from systemctl status): 917.1M
Number of open file descriptors: 12

It used around 177M with no URLs, 420M with 485K URLs, then 622M for 600K URLs. So it seems like the growth is somehow not linear. At 966K URLs it only used 917M, which suggests that it requires less than 1G of RAM to store 1M URLs.

But that is still very disappointing, as it suggests it's taking somewhere between 700-1000 bytes of RAM to store each URL, even though the URL itself is really small (less than 10 bytes). That means each URL is taking up something like 50 times more space than it should.

Well, I have an idea for how to deal with this - store all the URLs as uint64s. But there's probably some low-hanging fruit somewhere.
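
Roughly what I have in mind - the encoding details here (a 64-character alphabet and codes of at most 10 characters, packed 6 bits per character) are illustrative assumptions, not the final design:

// A sketch of storing short URL codes as uint64 map keys instead of strings.
package main

import (
	"errors"
	"fmt"
	"strings"
)

const alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"

// encodeCode packs a short code into a uint64, 6 bits per character.
func encodeCode(code string) (uint64, error) {
	if len(code) > 10 {
		return 0, errors.New("code too long to pack into uint64")
	}
	var packed uint64
	for i := 0; i < len(code); i++ {
		idx := strings.IndexByte(alphabet, code[i])
		if idx < 0 {
			return 0, errors.New("character not in alphabet")
		}
		packed = packed<<6 | uint64(idx)
	}
	// Store the length in the top 4 bits so "A" and "AA" don't collide.
	packed |= uint64(len(code)) << 60
	return packed, nil
}

func main() {
	urls := make(map[uint64]string) // uint64 keys instead of string keys
	key, err := encodeCode("abc123")
	if err != nil {
		panic(err)
	}
	urls[key] = "https://example.com/some/long/url"
	fmt.Printf("key=%d -> %s\n", key, urls[key])
}

With uint64 keys the map no longer stores a string header plus separate backing bytes for every key, which should cut the per-entry overhead noticeably.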

# wrk -t12 -c400 -d30s http://system/ei
Running 30s test @ http://system/ei
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    19.57ms   18.56ms 205.39ms   61.47%
    Req/Sec     2.02k   701.97    15.04k    89.27%
  715628 requests in 29.66s, 444.97MB read
  Socket errors: connect 0, read 0, write 0, timeout 396
Requests/sec:  24129.53
Transfer/sec:     15.00MB

Unfortunately, the number of reads is less than expected. I think it's probably because of all the logging that I'm doing. I think if I turn off logging then it will be faster.

Turning off logging increased the reqs/sec to 31k, not as much as I had hoped.

I changed the handler so that all it does is return the string "hi" and that increased the reqs/sec to 34k/s.

So I think it's probably not the handler itself that is slow but something else.

Replacing the muxer's ServeHTTP function with one that just returns "hi" increased the reqs/sec to 67k/s.

At this point I'm not sure how I'm going to reach 100k reqs/sec. Well, I'm running all this in a virtual machine so maybe that is affecting the performance somewhat. But a VPS is running in a virtual machine too so this is arguably realistic...

But anyway, I'm going to call it done - I think 24k reads per second is good enough...if it's not fast enough I'll spend more effort optimizing it.



Detected another race condition

This time it was around PutEntry_Common - I always felt something was wrong with this function since it had no locks around it, but I thought it was okay since the cem had a lock.

But actually the bug was caused by the RandomBag64 not being locked.

I think it could probably be solved by adding a lock to that RandomBag64.

But I decided to do it the stupid way and just add a lock to the concurrent persistent map. This will surely fix it...but might hurt performance.
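
For reference, a minimal sketch of what locking the bag would look like. randomBag64 below is a stand-in written for illustration - the real RandomBag64's API isn't shown here:

package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// randomBag64 is a minimal stand-in for RandomBag64; Put/TakeRandom are
// illustrative guesses at its shape, not the real API.
type randomBag64 struct {
	items []uint64
}

func (b *randomBag64) Put(v uint64) { b.items = append(b.items, v) }

func (b *randomBag64) TakeRandom() (uint64, bool) {
	if len(b.items) == 0 {
		return 0, false
	}
	i := rand.Intn(len(b.items))
	v := b.items[i]
	b.items[i] = b.items[len(b.items)-1]
	b.items = b.items[:len(b.items)-1]
	return v, true
}

// lockedBag wraps the bag so every operation holds the same mutex,
// which is what removes the race.
type lockedBag struct {
	mu  sync.Mutex
	bag randomBag64
}

func (b *lockedBag) Put(v uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.bag.Put(v)
}

func (b *lockedBag) TakeRandom() (uint64, bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.bag.TakeRandom()
}

func main() {
	var b lockedBag
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n uint64) {
			defer wg.Done()
			b.Put(n)
		}(uint64(i))
	}
	wg.Wait()
	fmt.Println(b.TakeRandom())
}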

So, I tested the code again by submitting lots of expiring URLs again and this time no race was detected.


Go race detector detected race condition

So I was using wrk to benchmark my server and decided to build my binary with -race to detect race conditions, and lo and behold, there was a race condition.

It was in my stats function. I was updating the stats (the number of expired and permanent URLs) while simultaneously displaying them on the home page.

That makes sense, since adding the stats to the front page was a relatively late addition and I didn't think about what could happen if the stats were updated at the same time as they were being displayed on the front page.
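
A minimal sketch of the fix, guarding the stats with a sync.RWMutex - the type and field names here are illustrative, not the project's actual code:

package main

import (
	"fmt"
	"sync"
)

// stats is updated by a background goroutine and read by the home-page
// handler; every access goes through the same RWMutex.
type stats struct {
	mu        sync.RWMutex
	expired   int
	permanent int
}

func (s *stats) Update(expired, permanent int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.expired = expired
	s.permanent = permanent
}

func (s *stats) Snapshot() (int, int) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.expired, s.permanent
}

func main() {
	var s stats
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // e.g. the background job that recounts URLs
		defer wg.Done()
		s.Update(42, 7)
	}()
	wg.Wait()
	expired, permanent := s.Snapshot() // e.g. the home-page handler
	fmt.Println(expired, permanent)
}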

Friday, December 15, 2023

len() is NOT thread-safe in Go!!!!!

https://stackoverflow.com/questions/52710882/is-len-thread-safe-in-golang

So I have a map containing all the URLs, and of course I guard the Put and Get functions with a mutex, but I was wondering if I could get away with not guarding len() with the mutex. Turns out, it's actually a race!!!

It is a race condition. The results are undefined. For example,

racer.go:

package main

func main() {
    m := make(map[int]int)
    l := 0
    go func() {
        for {
            l = len(m)
        }
    }()
    for i := 0; i < 10000; i++ {
        m[i] = i
    }
}

Output:

$ go run -race racer.go
==================
WARNING: DATA RACE
Read at 0x00c00008e000 by goroutine 5:
  main.main.func1()
      /home/peter/gopath/src/racer.go:8 +0x5f

Previous write at 0x00c00008e000 by main goroutine:
  runtime.mapassign_fast64()
      /home/peter/go/src/runtime/map_fast64.go:92 +0x0
  main.main()
      /home/peter/gopath/src/racer.go:12 +0xba

Goroutine 5 (running) created at:
  main.main()
      /home/peter/gopath/src/racer.go:6 +0x92
==================
Found 1 data race(s)
exit status 66
$
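
Back to my URL map: the length check needs the same mutex as Put and Get. A minimal sketch of what I mean (the names are illustrative):

package main

import (
	"fmt"
	"sync"
)

// urlStore wraps the map so every access, including len(), goes through the
// same mutex.
type urlStore struct {
	mu sync.RWMutex
	m  map[string]string
}

func newURLStore() *urlStore {
	return &urlStore{m: make(map[string]string)}
}

func (s *urlStore) Put(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

func (s *urlStore) Get(k string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

// Len takes the read lock too; an unguarded len(s.m) would race with Put.
func (s *urlStore) Len() int {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return len(s.m)
}

func main() {
	s := newURLStore()
	s.Put("abc", "https://example.com")
	fmt.Println(s.Len())
}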

Why you cannot set SMTP Timeout in Golang...

So, I was trying to send email using Go's smtp package when it just got stuck forever... Then I looked online for a timeout option and found that there is none, and it will never be added because Go's smtp package is frozen...

https://github.com/golang/go/issues/16436

Sad times. I really want to use the smtp package since it's in the standard library, but this is such a bummer...
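
A common workaround is to dial the connection yourself with a timeout (and a deadline) and hand it to smtp.NewClient. A rough sketch, with the server address and durations as placeholders:

package main

import (
	"log"
	"net"
	"net/smtp"
	"time"
)

func main() {
	const addr = "mail.example.com:587" // placeholder server

	// Dial with a timeout instead of letting the SMTP code block forever.
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	// A deadline on the connection also bounds the SMTP conversation itself.
	if err := conn.SetDeadline(time.Now().Add(30 * time.Second)); err != nil {
		log.Fatal(err)
	}

	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		log.Fatal(err)
	}
	c, err := smtp.NewClient(conn, host)
	if err != nil {
		log.Fatal(err)
	}
	// From here, c.Auth, c.Mail, c.Rcpt and c.Data work as usual, but every
	// network operation is bounded by the deadline set above.
	if err := c.Quit(); err != nil {
		log.Fatal(err)
	}
}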

Tuesday, December 12, 2023

Benchmarks of cryptographic hashes in Golang

Ran some benchmarks:

 


BenchmarkChecksums/cryptographic/md5-1KB-4                 574143          2116 ns/op         112 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha1-1KB-4                662881          1781 ns/op         136 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha256-1KB-4              328556          3601 ns/op         160 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-1KB-4              358324          3391 ns/op         288 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha3-256-1KB-4            257420          5343 ns/op         928 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha3-512-1KB-4            151782          9194 ns/op         960 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha512-224-1KB-4          293875          3604 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-256-1KB-4          386319          3151 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-256-1KB-4         727593          1629 ns/op         416 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-512-1KB-4         706335          1452 ns/op         448 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/md5-1MB-4                    662       1804930 ns/op         112 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha1-1MB-4                   924       1334859 ns/op         136 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha256-1MB-4                 374       3156370 ns/op         160 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-1MB-4                 450       2627154 ns/op         288 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha3-256-1MB-4               286       3646231 ns/op         928 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha3-512-1MB-4               175       6781699 ns/op         960 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha512-224-1MB-4             459       2730398 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-256-1MB-4             459       2693629 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-256-1MB-4            925       1267007 ns/op         416 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-512-1MB-4            949       1263266 ns/op         448 B/op           2 allocs/op


BenchmarkMD5_1k-4               608145          1949 ns/op          16 B/op           1 allocs/op
BenchmarkMD5_10k-4               67117         17903 ns/op          16 B/op           1 allocs/op
BenchmarkMD5_100k-4               6465        177057 ns/op          16 B/op           1 allocs/op
BenchmarkMD5_250k-4               2490        448508 ns/op          16 B/op           1 allocs/op
BenchmarkMD5_500k-4               1332        923916 ns/op          16 B/op           1 allocs/op
BenchmarkSHA11_1k-4             731042          1703 ns/op          24 B/op           1 allocs/op
BenchmarkSha1_10k-4              85749         13196 ns/op          24 B/op           1 allocs/op
BenchmarkSha1_100k-4              8848        132344 ns/op          24 B/op           1 allocs/op
BenchmarkSha1_250k-4              3760        330325 ns/op          24 B/op           1 allocs/op
BenchmarkSha1_500k-4              1802        660172 ns/op          24 B/op           1 allocs/op
BenchmarkSha256_1k-4            327596          3397 ns/op          32 B/op           1 allocs/op
BenchmarkSha256_10k-4            39212         31752 ns/op          32 B/op           1 allocs/op
BenchmarkSha256_100k-4            3829        315654 ns/op          32 B/op           1 allocs/op
BenchmarkSha256_250k-4            1580        756969 ns/op          32 B/op           1 allocs/op
BenchmarkSha256_500k-4             790       1525592 ns/op          32 B/op           1 allocs/op
BenchmarkSHA3_1k-4              251572          4436 ns/op         480 B/op           2 allocs/op
BenchmarkSHA3_10k-4              34284         38750 ns/op         480 B/op           2 allocs/op
BenchmarkSHA3_100k-4              3399        349182 ns/op         480 B/op           2 allocs/op
BenchmarkBLAKE2b_1k-4           711776          1485 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2b_10k-4           90138         12623 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2b_100k-4           9363        126833 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2s_1k-4           485341          2124 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2s_10k-4           62546         19029 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2s_100k-4           6295        200617 ns/op          32 B/op           1 allocs/op


So it seems BLAKE2b is the fastest, though for large inputs it has no big advantage over SHA1.

I don't care about large sizes though; I only want to hash small strings (16-140 bytes), so let's try these benchmarks:


BenchmarkMD5_16b-4             4692264           239.8 ns/op          16 B/op           1 allocs/op
BenchmarkMD5_32b-4             4809571           257.1 ns/op          16 B/op           1 allocs/op
BenchmarkMD5_64b-4             4207168           306.2 ns/op          16 B/op           1 allocs/op
BenchmarkMD5_140b-4            2625067           455.8 ns/op          16 B/op           1 allocs/op
BenchmarkSHA1_16b-4            3867692           271.5 ns/op          24 B/op           1 allocs/op
BenchmarkSha1_32b-4            3473290           318.1 ns/op          24 B/op           1 allocs/op
BenchmarkSha1_64b-4            3473750           347.8 ns/op          24 B/op           1 allocs/op
BenchmarkSha1_140b-4           1894322           562.6 ns/op          24 B/op           1 allocs/op
BenchmarkSha256_16b-4          2944030           388.2 ns/op          32 B/op           1 allocs/op
BenchmarkSha256_32b-4          2613794           432.2 ns/op          32 B/op           1 allocs/op
BenchmarkSha256_64b-4          2158723           527.7 ns/op          32 B/op           1 allocs/op
BenchmarkSha256_140b-4         1482595           815.1 ns/op          32 B/op           1 allocs/op
BenchmarkSHA3_16b-4            1000000          1139 ns/op         480 B/op           2 allocs/op
BenchmarkSHA3_32b-4            1212406           923.6 ns/op         480 B/op           2 allocs/op
BenchmarkSHA3_64b-4             972903          1185 ns/op         480 B/op           2 allocs/op
BenchmarkSHA3_140b-4            699087          1446 ns/op         480 B/op           2 allocs/op
BenchmarkBLAKE2b_16b-4         3745459           281.5 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2b_32b-4         4042144           303.4 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2b_64b-4         3342735           365.6 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2b_140b-4        2417270           505.2 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2s_16b-4         4889670           244.9 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2s_32b-4         4168473           295.4 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2s_64b-4         2688381           413.2 ns/op          32 B/op           1 allocs/op
BenchmarkBLAKE2s_140b-4        2381414           532.7 ns/op          32 B/op           1 allocs/op


BenchmarkChecksums/cryptographic/md5-16B-4                  4678737           293.6 ns/op         112 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha1-16B-4              2631940           450.8 ns/op         136 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha256-16B-4            2967570           379.0 ns/op         160 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-16B-4            2227252           578.1 ns/op         288 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha3-256-16B-4           984817          1111 ns/op         928 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha3-512-16B-4          1164912          1096 ns/op         960 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha512-224-16B-4        2320543           526.7 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-256-16B-4        1949170           574.8 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-256-16B-4       2809782           388.8 ns/op         416 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-512-16B-4       2623162           507.2 ns/op         448 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/md5-32B-4               4980589           287.0 ns/op         112 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha1-32B-4              4412821           298.3 ns/op         136 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha256-32B-4            2146078           555.0 ns/op         160 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-32B-4            1801621           587.0 ns/op         288 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha3-256-32B-4          1000000          1384 ns/op         928 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha3-512-32B-4          1000000          1748 ns/op         960 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha512-224-32B-4        1970791           668.3 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-256-32B-4        2006991           638.5 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-256-32B-4       2571462           394.1 ns/op         416 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-512-32B-4       2711781           409.9 ns/op         448 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/md5-64B-4               5008778           296.6 ns/op         112 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha1-64B-4              4353504           322.3 ns/op         136 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha256-64B-4            3221083           375.5 ns/op         160 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-64B-4            1955646           675.7 ns/op         288 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha3-256-64B-4           967701          1242 ns/op         928 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha3-512-64B-4           906475          1242 ns/op         960 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha512-224-64B-4        1935334           705.1 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-256-64B-4        1834999           709.6 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-256-64B-4       2797434           443.7 ns/op         416 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-512-64B-4       1700326           673.0 ns/op         448 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/md5-140B-4              3182262           354.9 ns/op         112 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha1-140B-4             3237883           379.6 ns/op         136 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha256-140B-4           2638554           576.8 ns/op         160 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-140B-4           1915440           744.3 ns/op         288 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha3-256-140B-4          959647          1516 ns/op         928 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha3-512-140B-4          982670          1392 ns/op         960 B/op           3 allocs/op
BenchmarkChecksums/cryptographic/sha512-224-140B-4       1903096           728.1 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/sha512-256-140B-4       1789345           929.8 ns/op         256 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-256-140B-4      2770437           405.9 ns/op         416 B/op           2 allocs/op
BenchmarkChecksums/cryptographic/blake2b-512-140B-4      2895793           498.4 ns/op         448 B/op           2 allocs/op


/// using .Sum directly:


BenchmarkMD5_16b-4            8172746           152.8 ns/op           0 B/op           0 allocs/op
BenchmarkMD5_32b-4            7950465           152.1 ns/op           0 B/op           0 allocs/op
BenchmarkMD5_64b-4            4700047           250.5 ns/op           0 B/op           0 allocs/op
BenchmarkMD5_140b-4           3222742           366.1 ns/op           0 B/op           0 allocs/op
BenchmarkSHA1_16b-4            6652365           185.4 ns/op           0 B/op           0 allocs/op
BenchmarkSha1_32b-4            6371542           183.7 ns/op           0 B/op           0 allocs/op
BenchmarkSha1_64b-4            3977347           302.7 ns/op           0 B/op           0 allocs/op
BenchmarkSha1_140b-4           2527082           430.7 ns/op           0 B/op           0 allocs/op
BenchmarkSha256_16b-4          4675767           264.4 ns/op           0 B/op           0 allocs/op
BenchmarkSha256_32b-4          3929925           266.2 ns/op           0 B/op           0 allocs/op
BenchmarkSha256_64b-4          2627029           472.8 ns/op           0 B/op           0 allocs/op
BenchmarkSha256_140b-4         1861273           688.7 ns/op           0 B/op           0 allocs/op
BenchmarkBLAKE2b_16b-4         6273417           200.6 ns/op           0 B/op           0 allocs/op
BenchmarkBLAKE2b_32b-4         5075683           208.8 ns/op           0 B/op           0 allocs/op
BenchmarkBLAKE2b_64b-4         6119851           202.1 ns/op           0 B/op           0 allocs/op
BenchmarkBLAKE2b_140b-4        2974110           354.4 ns/op           0 B/op           0 allocs/op


So it looks like a tie between MD5 and BLAKE2b. SHA1 looks strictly worse than MD5 at every length, while BLAKE2b is faster than MD5 at 64 bytes and slightly faster at 140 bytes.
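
For reference, here's the difference between the package-level one-shot functions (md5.Sum, blake2b.Sum256) and going through a hash.Hash with Sum(nil) - the one-shot versions return a fixed-size array and don't allocate, which matches the 0 B/op above. A rough sketch, not my original benchmark harness:

// Save as e.g. sum_test.go and run with: go test -bench=. -benchmem
package bench

import (
	"crypto/md5"
	"testing"

	"golang.org/x/crypto/blake2b"
)

var data = make([]byte, 140)

// One-shot package-level functions return a fixed-size array, so no heap
// allocation is needed for the digest.
func BenchmarkMD5Sum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = md5.Sum(data)
	}
}

func BenchmarkBLAKE2b256Sum(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = blake2b.Sum256(data)
	}
}

// Going through hash.Hash and Sum(nil) allocates the digest slice each call.
func BenchmarkMD5New(b *testing.B) {
	for i := 0; i < b.N; i++ {
		h := md5.New()
		h.Write(data)
		_ = h.Sum(nil)
	}
}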

 


Monday, December 11, 2023

Golang os.Stat performance

So I'm calling os.Stat on every write to a log file, and I was wondering how efficient that is. I ran it 10,000 times in a row and timed it, and the result was 10-20ms on average (though sometimes it was higher, e.g. 45ms).

1000 times - 1-2ms on average

100 times - 100-200 microseconds on average

So it seems pretty consistent that a call to os.Stat() takes around 1-2 microseconds on average - but note here that I'm calling it in a loop.
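
A minimal sketch of the kind of timing loop I mean (the file path and iteration count are placeholders):

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	const path = "/tmp/test.log" // placeholder path to an existing file
	const n = 10000

	start := time.Now()
	for i := 0; i < n; i++ {
		if _, err := os.Stat(path); err != nil {
			panic(err)
		}
	}
	elapsed := time.Since(start)
	fmt.Printf("%d calls in %v (%v per call)\n", n, elapsed, elapsed/n)
}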

EDIT: I ran strace -T on it and saw this:


fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000013>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000013>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000007>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000007>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000007>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000012>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000014>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000006>
fstat(3, {st_mode=S_IFREG|0644, st_size=3, ...}) = 0 <0.000012>

So the fstat calls were taking 6-12 microseconds, which is more than the benchmark suggested.

So if you make 1000 calls to stat per second at 12 microseconds each, that's 12,000 microseconds, or 12 milliseconds, out of each second spent doing stat - around 1% of the total time budget. Not a big deal.

So if you're calling stat on every write and you're only doing a few hundred writes per second, then the overhead of stat is really not that big.

Sunday, December 10, 2023

How long does it take to list all files in a directory in Linux using Golang?

I came up with a scheme for storing expiring URLs in log files. I will have a separate log file for each expiry date (or hour, or whatever). When I get a URL that expires on a certain date, I write it into the corresponding file, and when that date passes I just delete that file - easy.

But in order to do this, I need to list all of the files in a directory so that I can parse their names (which contain the expiry date/time) and know which ones to delete.

Alternatively, if I already know the naming scheme, I can just try to delete it directly without listing all of the files.

But anyway, here's the benchmarks:

100 files: 5ms, 2ms, 1.8ms

1000 files: 4ms, 6ms

10,000 files: 33ms, 32ms

100,000 files: 143ms, 109ms, 194ms, 456ms

Benchmarking code:

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	start := time.Now()
	entries, err := os.ReadDir("/tmp/test/")
	if err != nil {
		panic(err)
	}

	for _, e := range entries {
		if e.IsDir() { // ignore directories
			continue
		}
		// Fetch the FileInfo for each entry too, so the timing covers the
		// per-file stat and not just reading the directory.
		file_info, err1 := e.Info()
		if err1 != nil {
			panic(err1)
		}
		_ = file_info
	}
	fmt.Println("Time elapsed:", time.Since(start))
}

So the os.ReadDir performance for 100k files is not that bad - mostly 100-200ms. Still, this provides some motivation to keep the number of files low. But I think it shows that this is the kind of function that can run every few seconds without really impacting performance much.
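
And the alternative above - deleting the file for an expired date directly instead of listing - is only a few lines. A sketch, assuming files named by date like 2023-12-10.log (the naming scheme here is illustrative):

package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"time"
)

// deleteExpired removes the log file for the given expiry date.
func deleteExpired(dir string, day time.Time) error {
	name := filepath.Join(dir, day.Format("2006-01-02")+".log")
	err := os.Remove(name)
	if errors.Is(err, fs.ErrNotExist) {
		return nil // nothing was written for that day; that's fine
	}
	return err
}

func main() {
	yesterday := time.Now().AddDate(0, 0, -1)
	if err := deleteExpired("/tmp/test", yesterday); err != nil {
		fmt.Println("delete failed:", err)
	}
}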


Friday, December 8, 2023

Got bitten by the appending-to-a-slice-copied-by-value again...

I made that same mistake again, this time with this code:

internals := map_hostname_to_internals[hostname] 

internals.prefix_handlers = append(internals.prefix_handlers, &HandlerPair{handler: entries[i].handler, prefix: entries[i].prefix})

In the above code, the variable internals is a struct that contains a slice. That's right - not a pointer to a slice, but a slice. 

So the internals variable is a copy of the contents of the map, and therefore contains COPIES of the slice headers - not pointers to the slice. 

Which means that when I appended to the slice, the append updated the slice header in my copy of internals rather than in the struct stored in the original map.

I guess the simplest solution is to have the map contain pointers to internals instead of containing the actual values.

That's a convenient solution because in Go the dot operator automatically dereferences a pointer to a struct, so if you change from a struct to a pointer-to-struct you don't have to change any of the code that accesses the fields.
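
Here's a minimal reproduction of the pitfall and the pointer fix (the struct and field names are simplified stand-ins for the real code):

package main

import "fmt"

type internals struct {
	handlers []string // a slice value inside the struct
}

func main() {
	// Map of struct VALUES: indexing copies the struct, including the slice header.
	byValue := map[string]internals{"example.com": {}}
	in := byValue["example.com"]            // in is a copy
	in.handlers = append(in.handlers, "h1") // modifies the copy's slice header only
	fmt.Println(len(byValue["example.com"].handlers)) // 0 - the map never sees the append

	// Map of POINTERS: the append goes through to the shared struct.
	byPtr := map[string]*internals{"example.com": {}}
	p := byPtr["example.com"]
	p.handlers = append(p.handlers, "h1")
	fmt.Println(len(byPtr["example.com"].handlers)) // 1
}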

Thursday, December 7, 2023

Figured out cause of memory "leak"

So I found out where the "memory leak" was coming from. It turns out it's not really a memory leak: it just causes Go to grab more system memory, but if you run the function again it doesn't grab more, which indicates that it's not actually leaking.

Anyway the function is this:

// This function seems to create a memory leak, but it doesn't.
func Power(a, b, m int) int {
	result := new(big.Int).Exp(
		big.NewInt(int64(a)),
		big.NewInt(int64(b)),
		big.NewInt(int64(m)),
	)
	return int(result.Int64())
}

And this is how I was calling it:

for i := 0; i < util.Power(53, n-1, 0); i++ {
	results[i] = 0
}

So as you can see, it is calling the Power function around 8 million times, and each time it gets called it allocates new big.Ints, which adds up to something like 60MB. But this memory isn't actually leaked, because if you run the whole thing multiple times the memory usage doesn't keep going up.

Anyway, it's obviously a bad idea to call Power on every iteration of the loop; I should just compute it once and use the result as the loop bound.
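
i.e. something like this, with the expensive call hoisted out of the loop condition (util.Power, n, and results are from the snippets above):

// Compute the bound once instead of redoing the big.Int exponentiation
// (and its allocations) on every loop iteration.
bound := util.Power(53, n-1, 0)
for i := 0; i < bound; i++ {
	results[i] = 0
}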

Wednesday, December 6, 2023

How to speed up maps in golang, part 2

So I came across this page: https://www.komu.engineer/blogs/01/go-gc-maps

And I tested it.

It turns out map[int64]string is indeed much faster than map[string]string

inserting 1 million entries: 250ms (int64) vs 380ms (string) average

inserting 10 million entries (with GC after each 1 million inserts): 6s (int64) vs 5.2s (string)

Those results seemed anomalous, so I ran the 10-million-entry test with string again and got 4.9s; then I ran int64 again and got 3.1s, 3.2s, and 3.4s; then running string again I got 4.8s, 4.96s, and 4.7s.

So overall, inserting into map[int64]string is indeed much faster than inserting into map[string]string.

With map pre-allocation I got these results:

map[string]string

Time elapsed: 2.093162196s
Time elapsed: 2.043661164s
Time elapsed: 2.348678727s

map[int64]string

Time elapsed: 2.036045741s
Time elapsed: 2.019765882s
Time elapsed: 1.733399438s
Time elapsed: 1.982084492s
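
A minimal sketch of the kind of insertion test I'm describing - the sizes and key generation here are illustrative, not the exact test code:

package main

import (
	"fmt"
	"strconv"
	"time"
)

const n = 1_000_000

func insertInt64Keys() time.Duration {
	start := time.Now()
	m := make(map[int64]string, n) // pre-allocated; drop the hint to test growth
	for i := 0; i < n; i++ {
		m[int64(i)] = "value"
	}
	return time.Since(start)
}

func insertStringKeys(keys []string) time.Duration {
	start := time.Now()
	m := make(map[string]string, len(keys))
	for _, k := range keys {
		m[k] = "value"
	}
	return time.Since(start)
}

func main() {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = strconv.Itoa(i) // build keys outside the timed section
	}
	fmt.Println("int64 keys: ", insertInt64Keys())
	fmt.Println("string keys:", insertStringKeys(keys))
}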

Replacing a character in a string by converting to []byte is much faster than converting to []rune

This stackoverflow answer: https://stackoverflow.com/questions/24893624/how-to-replace-a-letter-at-a-specific-index-in-a-string-in-go says:

It turns out that using str[:index] + string(replacement) + str[index+1:] is way faster.

But this is not true for short strings if you convert to []byte instead of []rune.

Here are the performance benchmark results for []rune:

$ go test -bench=.
goos: linux
goarch: amd64
BenchmarkSmall1-4       11415662               102.5 ns/op
BenchmarkSmall2-4       36775035                33.35 ns/op
BenchmarkMedium1-4        830799              1246 ns/op
BenchmarkMedium2-4      12631957               109.9 ns/op
BenchmarkBig1-4            10000            115676 ns/op
BenchmarkBig2-4           580592              2275 ns/op
PASS

And here are the results for []byte:

$ go test -bench=.
goos: linux
goarch: amd64
BenchmarkSmall1-4       95241897                12.33 ns/op
BenchmarkSmall2-4       36181358                35.20 ns/op
BenchmarkMedium1-4       8377953               126.7 ns/op
BenchmarkMedium2-4      13281609                91.47 ns/op
BenchmarkBig1-4           244942              5473 ns/op
BenchmarkBig2-4           450855              2967 ns/op
PASS

As you can see, the convert-to-slice-then-back-to-string method is faster for 10-byte strings but slower for 100-byte strings.

If you know that the string is ASCII then you are incurring an unnecessary performance penalty by converting it to []rune.
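
For reference, a sketch of the two approaches being compared, for ASCII strings (the function names are mine):

package main

import "fmt"

// replaceViaByteSlice converts to []byte, mutates in place, and converts back.
func replaceViaByteSlice(s string, i int, c byte) string {
	b := []byte(s)
	b[i] = c
	return string(b)
}

// replaceViaConcat slices and concatenates, as the Stack Overflow answer suggests.
func replaceViaConcat(s string, i int, c byte) string {
	return s[:i] + string(c) + s[i+1:]
}

func main() {
	s := "hello world"
	fmt.Println(replaceViaByteSlice(s, 0, 'H'))
	fmt.Println(replaceViaConcat(s, 0, 'H'))
}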

Friday, December 1, 2023

How to speed up map insert in golang

Inserting 10 million items into a map in Go.

Begin with empty map of interface to interface. Takes 6.8s.

Now we pre-allocate map with 20 million elements. Takes 3.4s.

Now we pre-allocate map with 9 million elements. Takes 2.9s.

Now we change type to map[string]string. Takes 2.3s.

Now we change type to map[string]*[150]byte. Takes 2.9s.

Surprising that changing from interface{} to string made it faster, but changing from string to a pointer to a fixed length array made it slower. I have no idea why string is faster than pointer, but it apparently is.

Anyway, the biggest improvement came from pre-allocation. This makes sense, since pre-allocating means the map doesn't have to keep growing and rehashing its contents as it fills up (Go maps grow by doubling their bucket count, moving entries into the new buckets as they grow).

Next biggest improvement came from using string instead of interface{}. Not sure why this is the case but I'll take it I guess.

So yeah, now I can insert 10 million items in 3 seconds. Of course generating the array itself takes around 5 seconds, so in total 8 seconds, but still, it's a nice improvement - from 700ns to 300ns per item.

Next step is to make bulk heap insertion faster. 

EDIT: I guess this might explain why map[string]string is faster than map[interface{}]interface{}:

One important thing to note is that maps keep track of keys and values that can hold pointers. If the entries in a bucket can’t hold any pointers, the bucket is marked as containing no pointers, and maps just create overflow buckets (which means more memory allocations). This avoids unnecessary overhead with GC. See this comment in the mapextra struct (map.go, line 132) and this post for reference.