UPDATE: This blog post has been updated based on feedback from Reddit and Hacker News.
HN discussion: https://news.ycombinator.com/item?id=39011850
Reddit discussion: https://www.reddit.com/r/cpp/comments/1980q8l/stdclamp_still_generates_less_efficient_assembly/
How do you correctly implement std::clamp?
[Credit for this section goes to Reddit user F54280]
To make sense of the rest of this blog post, we have to first discuss the correctness requirements of std::clamp in order to understand why std::clamp is implemented the way it is in libstdc++.
Here is an incorrect implementation of std::clamp:
#include <algorithm>
double clamp(double v, double min, double max){
return std::min(max, std::max(min, v));
}
The above implementation will return the correct answer most of the
time but will return an incorrect result when dealing with
positive/negative zeros, because, according
to cppreference, clamp should:
If v compares less than lo, returns lo; otherwise if hi compares less than v, returns hi; otherwise returns v.
So if I call std::clamp(-0, +0, +0) it should return -0. Why? Because according to the IEEE standard, positive and negative zero are equal. Positive zero is not greater than negative zero. Therefore, since v does not compare less than lo, and hi does not compare less than v, the call to std::clamp must return v, which is -0 in this case.
The incorrect implementation above does not return -0, instead it returns +0. Why? Because std::min and std::max returns the first parameter in the case the two parameters are equal. Since negative zero and positive zero are equal, it will return the first parameter. The implementation therefore would return max when max is equal to v, or min when min is equal to v - so it is really doubly wrong.
A correct implementation of clamp must
return v when v is equal to both min and max. So if v is equal to -0 and
min and max are both +0 then clamp must return -0. With that in mind,
let's look at some correct implementations.
Here is the (correct) implementation that libstdc++ uses:
double clamp(double v, double min, double max){
return std::min(std::max(v, lo), hi);
}
This
implementation is correct because std::max(v, lo) will return v when v
is equal to lo, and it will return std::max(v, lo) when std::max(v, lo)
is equal to hi.
And here is an alternative correct implementation:
double clamp(double v, double min, double max){
return std::max(std::min(v, max), min);
}
This implementation is correct because std::min(v, max) will return v when v is equal to max, and std::max(std::min(v, max), min) will return std::min(v, max) when it's equal to min.
These two are probably the only correct implementations of std::clamp using std::min and std::max. You cannot change the order of the parameters in the calls to std::min or std::max, because that would cause v to not be returned when it's equal to min and max. The semantics of std::clamp requires that v be returned when it's equal to min and max.
Why does the standard library (libstdc++) implementation of std::clamp sometimes generate an extra mov instruction?
[Credit for this section goes to Reddit user F54280 as well as HN commenters jeffbee and vitorsr]
This is the main focus of this blog post and it refers to the following observation:
https://godbolt.org/z/rq9dsGxh5
Generated assembly
So the interesting observation here is that by reordering the function parameters with respect to the arguments passed to either std::clamp or std::min and std::max, we can achieve shorter assembly.
Now the question is why does reordering the function parameters with respect to the arguments to std::clamp or std::min and std::max affect the number of assembly instructions generated? The short answer is that it's because of 3 things:
- The requirement imposed by the C++ standard
- The limitations on what assembly instructions can be used due to architecture
- The requirements imposed by the ABI (the calling convention)
I have already explained the requirement imposed by the standard which is that v must be returned when it is equal to min and max. So now let's discuss the relevant assembly instructions.
First, let's make sure we understand what the minsd, maxsd, and movapd instructions do and why they are generated. Long story short, they are generated because you didn't specify a microarchitecture to the C++ compiler, which means that the compiler must assume x86-64-v1, which only supports SSE. If we had specified a microarch equal to or greater than x86-64-v3, then the compiler would generate AVX instructions instead. But since we didn't, the compiler can't generate AVX instructions because it has to assume that the target CPU only supports the x86-64-v1 instruction set which doesn't contain AVX.
The SSE minsd and maxsd functions each have two operands. The first operand is the destination operand and the second operand is the source operand. This is the norm in x86 Intel Syntax where the mov instruction can be thought of as an assignment, so when you see:
mov eax, ebx
You should think of that as:
eax = ebx
Because it simply copies the value held in the register ebx into eax.
So when we see:
minsd xmm0 xmm1
We should think of that as:
xmm0 = std::min(xmm1, xmm0)
The most important thing to note here is this:
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned
This means that if both the first and second registers contain 0, then the 0 in the second register will overwrite the 0 in the first register. So in the above example:
minsd xmm0 xmm1
If prior to running the above instruction, xmm0 contained negative zero and xmm1 contained positive zero, then the instruction will overwrite xmm0 with positive zero.
Now that we know what the assembly instructions do, let's talk about the ABI limitation.
The ABI limitation in this case the calling convention is that the first parameter of a function is stored in xmm0, and the return value is also stored in xmm0.
This means that if you have the following function signature:
std::clamp(v, lo, hi)
Then v will be stored in xmm0, and also the return value will also be stored in xmm0.
What does this mean? It means that we can't start with this:
maxsd xmm0 xmm1
Why not? Because that would overwrite xmm0 with the value in xmm1 if both xmm0 and xmm1 contained zero. That means we lose the value stored in v (which is stored in xmm0), which automatically leads to the incorrect result in the case of std::clamp(-0.0, +0.0, +0.0).
It also means that we can't start with this:
minsd xmm0 xmm2
For the same reason as explained above - because it would cause us to lose the value stored in v, which we have to return in the case v and lo and hi are all zero.
But we can do this:
minsd xmm2 xmm0
So this will overwrite xmm2 with std::min(v, hi), and then we can do
maxsd xmm1 xmm2
Which will overwrite xmm1 with std::max(std::min(v, hi), lo).
You could also start with
maxsd xmm1 xmm0
But no matter what you start with, the register that contains the result of the first instruction must be the second operand of the next minsd/maxsd instruction because the result might contain v, and we must not overwrite v. Therefore the register containing the result, which might be v, must be the second operand of the next instruction.
If you think about it, whatever assembly we generate has to contain at least two instructions: one minsd and one maxsd, because each instruction can only look at 2 variables and there are 3 variables that we have to look at. So there must be at least two instructions. And I just showed that regardless of whether you start with minsd or maxsd, xmm0 has to be the second operand of the first instruction. And the register that contains the result of the first operation has to be the second operand of the second instruction in order to avoid losing v. Which means the first operand of the second instruction HAS TO BE the register that wasn't involved in the first instruction (either xmm1 or xmm2).
So the result of the two comparison instructions is stored in that other register, which is not xmm0. But the ABI requires the result to be stored in xmm0. So an extra move instruction is generated in order to move the result into xmm0.
And that's why the assembly generated for std::clamp is not the shortest. It's basically because of the combination of the quirks of the assembly instruction minsd and maxsd and because of the ABI which requires the first operand (v) to be stored in xmm0 and the result to be stored in xmm0 as well.
And the 3 reasons why the compiler won't generate the shortest assembly for std::clamp naturally suggests 3 ways to get the compiler to generate shorter assembly:
- Relax the C++ standard
- Don't make v the first parameter in std::clamp
- Allow the compiler to use different assembly instructions (AVX)
We've already covered option 1: the incorrect version of std::clamp that I showed you. If you don't care about getting C++ standard-mandated behavior for edge cases like positive/negative zeroes, NaN values etc, then you can just use my incorrect implementation.
Now let's try option 2: don't make v the first parameter in std::clamp. Does it matter whether we make v the second or third parameter? As it turns out, it doesn't matter (with some caveats: see below) - as long as v is not the first parameter, we can get the compiler to generate the shortest assembly. See https://godbolt.org/z/xYs47s8nx:
Generated assembly
The above shows something very interesting. It shows that the compiler will only generate the shortest assembly when the parameters are in a very specific order with respect to the min and max instructions.
I believe that this is due to the requirement for std::min and std::max to return the first parameter when the two parameters are equal. So the compiler is not allowed to convert between the max(min()) and min(max)) forms, because they actually are not identical. For example: https://godbolt.org/z/fhaP6dee5
Output
custom clamp1: 0
custom clamp2: 1
official clamp: 0
So you can see, the minmax clamp and the maxmin clamp are actually not semantically equivalent when the lo is greater than hi (which is undefined behavior btw) :
The behavior is undefined if the value of lo is greater than hi.
But the C++ compiler doesn't know that we're trying to implement std::clamp, so it cannot freely convert between the maxmin and the minmax clamp implementations, because they are not semantically equivalent.
What this means is that if you do max(min()) then the compiler has to actually generate assembly that is equivalent to that. It can't convert it to min(max()) because that's not semantically equivalent.
So that's why in the generated assembly for:
std::max(std::min(v, hi), lo);
You always see minsd first. Because the compiler has to first calculate std::min(v,hi) first, only after that can it use the result of that as the argument to std::max.
This has serious ramifications for the generated assembly. Why? Because in order for the shortest assembly to be generated, we must use one of the minsd or maxsd instructions to store directly into xmm0 and that's the result that we return. This can only happen on the second line because we can't know the return value on the first line since we haven't looked at all the operands yet.
So, to generate the shortest assembly, we need xmm0 as the first operand on the second line. But we also need minsd of v and hi on the first line.
This means that if hi is stored in xmm0 then we can't have the shortest assembly. Because that means xmm0 has to be the first operand on the first line. Which then means xmm0 must be the second operand on the second line. Which means it can't be the first operand on the second line. So the first operand must be some other register, and then we'll have to copy the value from that register into xmm0 before returning. And that's exactly what you see in the generated assembly shown above.
In conclusion, then, changing the ordering of parameters in a function can change the number of assembly instructions generated because of the ABI which mandates that the first parameter AS WELL AS the return value must both be held in the same register.
When the first parameter in the function must be preserved as in the case of std::clamp, this can cause the compiler to emit an extra move instruction to store the result at the end into xmm0. If you reorder the parameters so that the first parameter doesn't need to be preserved, then the compiler can omit the extra move instruction by storing the result of a computation directly in xmm0 instead of moving it into xmm0 from some other register, thus resulting in shorter generated assembly code.
But as I mentioned, there is a third way to reduce the length of the generated assembly and that is to target a newer microarchitecture e.g. x86-64-v3 which supports AVX. See: https://godbolt.org/z/dM3f8v4nM
Generated assembly for march=x86-64-v3 (same in Clang and GCC)
Now we get the shortest possible assembly output - only two instructions emitted. This is because it's using the AVX instruction vminsd which has 3 operands. As usual the first operand is the destination operand into which the result will be stored. The other two operands are dealt with in the following way:
If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned
As before, v is stored in xmm0. So here xmm0 is the third operand i.e. the second source operand which means its zero value will be returned as the result and written into the destination operand which is xmm1.
So now max(xmm0, xmm1) is stored in xmm1, so now xmm1 naturally should be the second source operand and indeed on line 2 we see that xmm1 is the second source operand. Of course we want the result to be stored in xmm0 and here we see that xmm0 is indeed used as the destination operand while the first source operand is the other variable that we haven't looked at yet which is stored in xmm2.
So from this we can see that by targeting a more recent microarchitecture we allow the compiler to emit more recent instructions (AVX in this case) which allow for fewer assembly instructions, in this case because the AVX instruction takes 3 operands and allows the user to specify which register to store the result into, as opposed to the SSE instruction which only takes 2 operands and simply stores the result into the first source operand.
Does any of this actually matter?
Some commenters have noted that std::clamp will be inlined in real code, and that real code is likely to target x86-64-v3. This is true, but even in "realistic code", the compiler still seems to generate shorter assembly for the incorrect (non-standards-compliant) clamp compared to the correct clamp:
https://godbolt.org/z/hd44KjMMn (see also https://godbolt.org/z/eq8q1jnss)
Generated assembly
As
you can see in the lines highlighted in yellow, the compiler does seem
to generate an extra assembly instruction for the correct clamp
implementations regardless of parameter ordering.
So it looks like even in "realistic" code (I'm not sure how realistic my example is, but it does seem to resemble a real program in that it takes input from argv and returns an output based on the input), the compiler will still generate more assembly for the correct clamp compared to the incorrect clamp.
But why is this? Reddit user Sporule explained it:
After sscanf()
, all three values are located in memory. The operand src1
in the instruction vminsd dst, src1, src2
can only be a register. You cannot specify the memory location there.
Also both lo
and hi
values need to be passed via src1
operand. It is incorrect to pass them via src2
.
Therefore, it is absolutely necessary to execute a load instruction to
transfer the value from memory to a register so that the order of
operands is correct. Thus, at least two vmovsd
instructions are required. And this is exactly that both compilers emitted.
Let's unpack this. The vminsd instruction has the following very important characteristics:
- If the values being compared are both 0.0s (of either sign), the value in the second source operand is returned. If a value in the second source operand is an SNaN, then SNaN is returned unchanged to the destination (that is, a QNaN version of the SNaN is not returned).
- The second source operand can be an XMM register or a 64-bit memory location. The first source and destination operands are XMM registers.
The first point says that when the two source operands are equal, then the second source operand is what will be written into the destination operand. Since we want v to be returned when it's equal to whatever it's being compared to (as explained earlier, the C++ standard requires std::clamp to return v when v is equal to lo and hi), this means that v must be the second source operand. Which means that whatever v is being compared to must be the first source operand.
The second point says that the first source operand and the destination operand must both be registers. Combining this with the above, we have that since v must be the second source operand, this means that the value that v is being compared to must be placed into a register first, before we can compare v to it.
Okay, so we have this:
vmovsd xmm_a MEMORY_LOCATION_OF_HI_OR_LO
Here, xmm_x can be any xmm register and MEMORY_LOCATION_OF_HI_OR_LO can be either the memory location of hi or lo - it doesn't matter. All that matters is that we load some non-v value into some register so that we can compare it to v.
Now that we've loaded it into a register, we can use it for the comparison:
vmin_or_max_sd xmm_b xmm_a MEMORY_LOCATION_OF_V
Here, vmin_or_max_sd could be vminsd or vmaxsd, doesn't matter. And xmm_b could be the same as or different from xmm_a, doesn't matter. Note here that we are using the memory location of v in order to avoid having an extra mov instruction. Now we have already used up 2 instructions so there is only one instruction left.
The important thing is that the return value of this instruction could be v. This means that the return value must be the second source operand of the next comparison instruction. Because if it's the first source operand then v won't be returned when it's equal to hi and lo. So whatever xmm_b is, it has to be the second operand of the next comparison.
And lastly, we have this:
vmin_or_max_sd xmm_? xmm_? xmm_b
Here we run into a problem. We know that xmm_b must be the second source operand, but what should be the first source operand? We know that it has to be a register. But it also must contain the value that we haven't looked at yet. But that's still in memory - we've only loaded one value from memory into a register using the mov from before, and we already used that value in a comparison. So we need the other value now, but that's still in memory, and we need it in a register - which means we have to use an extra mov.
And that's why the generated assembly for the correct clamp contains an additional mov instruction compared to the incorrect clamp implementation, which is not restricted by the rule that v must be returned when it's equal to both hi and lo.
So the reason why the correct clamp implementation generates more assembly instructions than the incorrect clamp implementation in this case is different to the previous case. In the previous case, it was because of the ABI requirement that both v as well as the return value must be stored in xmm0. But in this case it's because the AVX instruction requires the first source operand to be a register.
Here's the original blog post (with corrections):
I originally wrote this blog post in 2019 (or maybe 2018 - my timestamps say that it was written before 1 May 2019).
Here’s my old blog post:
Contents:
Let’s say you want to clamp a value v between 2 values, min and max. If v is greater than max, return max. If v is smaller than min, return min. Otherwise return v.
Ternary
Implementing it directly as per the description:
double clamp(double v, double min, double max){
return v < min? min : v > max? max : v;
}
gcc 8.2:
clamp(double, double, double):
comisd xmm1, xmm0
ja .L2
minsd xmm2, xmm0
movapd xmm1, xmm2
.L2:
movapd xmm0, xmm1
ret
One branch instruction.
clang 7.0:
clamp(double, double, double): # @clamp(double, double, double)
minsd xmm2, xmm0
cmpltsd xmm0, xmm1
movapd xmm3, xmm0
andnpd xmm3, xmm2
andpd xmm0, xmm1
orpd xmm0, xmm3
ret
Branchless.
Using intermediate values
From this stackoverflow answer: https://stackoverflow.com/questions/427477/fastest-way-to-clamp-a-real-fixed-floating-point-value
double clamp(double v, double min, double max){
double out = v > max ? max : v;
return out < min ? min : out;
}
gcc 8.2:
clamp(double, double, double):
minsd xmm2, xmm0
maxsd xmm1, xmm2
movapd xmm0, xmm1
ret
clang 7.0:
clamp(double, double, double): # @clamp(double, double, double)
minsd xmm2, xmm0
maxsd xmm1, xmm2
movapd xmm0, xmm1
ret
Identical output. Much better than before. Can we do better?
using std::min and std::max
Here's the incorrect implementation that I mentioned at the start of this post:#include <algorithm>
double clamp(double v, double min, double max){
return std::min(max, std::max(min, v));
}
gcc 8.2:
clamp(double, double, double):
maxsd xmm0, xmm1
minsd xmm0, xmm2
ret
clang 7.0:
clamp(double, double, double): # @clamp(double, double, double)
maxsd xmm0, xmm1
minsd xmm0, xmm2
ret
Yes, it generates the fewest assembly instructions but is incorrect as explained at the beginning of this post.
Let's go through the assembly to show why it's wrong.
Firstly, xmm0 holds v, xmm1 holds min, and xmm2 holds max.
Now, what does maxsd do?
If the values being compared are both 0.0s (of either sign), the value in the second operand (source operand) is returned.
So maxsd xmm0, xmm1
compares xmm1 to xmm0 and if xmm1 is greater than xmm0 then it overwrites v, but also if xmm1 and xmm0 are both zero, then the value in xmm0 is overwritten with the value in xmm1. Meaning in the case where xmm0 contains negative zero and xmm1 contains positive zero, the value in xmm0 will be overwritten with positive zero, thus destroying v. So we've already messed up in the first line!
The next line minsd xmm0, xmm2
has the same problem. If they are both 0, then xmm0 will be overwritten with the value in xmm2, which is not what we want since v is stored in xmm0.
Using std::clamp
#include <algorithm>
double clamp(double v, double min, double max){
return std::clamp(v, min, max);
}
gcc 8.2:
clamp(double, double, double):
comisd xmm1, xmm0
ja .L2
minsd xmm2, xmm0
movapd xmm1, xmm2
.L2:
movapd xmm0, xmm1
ret
clang 7.0:
clamp(double, double, double): # @clamp(double, double, double)
minsd xmm2, xmm0
cmpltsd xmm0, xmm1
movapd xmm3, xmm0
andnpd xmm3, xmm2
andpd xmm0, xmm1
orpd xmm0, xmm3
ret
Not very efficient.
EDIT: It’s been almost 5 years since I originally wrote this article, so I decided to try again using the latest versions of GCC and Clang:
gcc 13.2:
clamp(double, double, double):
maxsd xmm1, xmm0
minsd xmm2, xmm1
movapd xmm0, xmm2
ret
clang 17.0.1:
clamp(double, double, double): # @clamp(double, double, double)
maxsd xmm1, xmm0
minsd xmm2, xmm1
movapd xmm0, xmm2
ret
Still not the shortest - it uses one more instruction than the std::min(max, std::max(min, v))
implementation. But see the top of this post for an explanation of why it must necessarily use one more assembly instruction than the incorrect implementation in certain situations.
I did some investigation. It turns out `std::clamp` and your `std::min(std::max)` are not the same when `nan` is involving. See this link: https://godbolt.org/z/jaE7EdjTo
ReplyDeleteHi StevenSun, thank you for the comment. Indeed, the NaN behavior is different. But what is the correct behavior? Does the C++ Standard mention it?
ReplyDelete