Hi all,
Please bear with me a little. I mean it: don't take my name-calling of the Delphi compiler as an offense toward anyone. I simply know it well. I have earned, and still earn, my bread from optimizing old and new code, and I know the compiler intimately enough to call it a centuries-old brick.
I will explain just this piece of code, and you can expand on it if you want. But first and foremost, please at least consider that your knowledge of the code generated by Delphi may be wrong (outdated or naively trusting), or that the generated code is unoptimized (inefficient), and we will go from there to a proof by contradiction. How about that? As a mathematician by education, it is my best method.
Many agreed that div 16 is always shr 4 or, to be more precise, should be sar 4. I agree, and I know that the compiler wrongly does it for unsigned integers, but that is not the case here, as I was talking about the specific case at hand. Maybe it is my mistake for not writing an essay for each line I wrote.
So here is a proof that the compiler doesn't use sar 4 or shr 4 for div 16: just look at the x64 version of it!
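To make the div 16 versus shift point concrete, here is a minimal sketch of why a signed div 16 cannot be compiled to a bare shift. It says nothing about which instructions Delphi actually emits; it only compares results. Sar4 is a hypothetical helper written for this illustration (Delphi has no sar operator), and the program name is made up.

```pascal
program DivVsShift;
{$APPTYPE CONSOLE}

// Hypothetical helper (not from the original code): emulates an
// arithmetic shift right by 4, i.e. what a SAR 4 instruction computes.
// For negative X it rounds toward minus infinity.
function Sar4(X: Integer): Integer;
begin
  if X >= 0 then
    Result := X shr 4
  else
    Result := not ((not X) shr 4);
end;

var
  X: Integer;
begin
  X := 100;
  // Non-negative input: all three agree.
  Writeln(X div 16, ' ', X shr 4, ' ', Sar4(X)); // 6 6 6

  X := -17;
  // div truncates toward zero, an arithmetic shift rounds down.
  Writeln(X div 16, ' ', Sar4(X));               // -1 -2
  // shr on a signed Integer is a logical shift: the sign bit is
  // shifted out and the result is a large positive value.
  Writeln(X shr 4);
end.
```

For non-negative values all three agree, which is why replacing div 16 with shr 4 is only safe when the operand is known to be non-negative.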
Try this function in the optimized version above:
Code:
procedure SetPixel(XOffset, YOffset, Factor: NativeInt);
var
  AP: TPBGR;
begin
  // XOffset = horizontal offset in pixels
  // YOffset = vertical offset in bytes
  AP := P;
  Inc(AP, XOffset);
  Inc(NativeInt(AP), YOffset);
  with AP^, Delta do
  begin
    Blue := EnsureRange(Blue + B * Factor shr 4, 0, 255);
    Green := EnsureRange(Green + G * Factor shr 4, 0, 255);
    Red := EnsureRange(Red + R * Factor shr 4, 0, 255);
  end;
end;
My timings show that it is faster by 18% on Win32 and 29% on Win64! Please try it. It is not slower by 200%, and these values are positive, so shr is no problem here.
Also, if you look at the generated assembly code, this is it.
Also try this
Code:
procedure SetPixel(XOffset, YOffset, Factor: NativeInt);
var
  AP: TPBGR;
  v: NativeInt;
begin
  AP := P;
  Inc(AP, XOffset);
  Inc(NativeInt(AP), YOffset);
  with AP^, Delta do
  begin
    v := Blue + B * Factor shr 4;
    if v < 0 then
      Blue := 0
    else if v > 255 then
      Blue := 255
    else
      Blue := v;
    v := Green + G * Factor shr 4;
    if v < 0 then
      Green := 0
    else if v > 255 then
      Green := 255
    else
      Green := v;
    v := Red + R * Factor shr 4;
    if v < 0 then
      Red := 0
    else if v > 255 then
      Red := 255
    else
      Red := v;
  end;
end;
The above function makes the whole process about twice as fast on both platforms.
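If repeating the comparisons three times bothers you, the same clamp can be sketched as a small helper. ClampToByte is a hypothetical name, not part of the original code. Whether the compiler really inlines it, and whether it keeps the speed of the hand-expanded version, has not been measured here, so treat it as an illustration only.

```pascal
program ClampSketch;
{$APPTYPE CONSOLE}

// Hypothetical helper (not from the original post): clamps V to 0..255
// with two plain comparisons, like the hand-expanded code above.
// "inline" is only a request; the compiler may ignore it.
function ClampToByte(V: NativeInt): Byte; inline;
begin
  if V < 0 then
    Result := 0
  else if V > 255 then
    Result := 255
  else
    Result := Byte(V);
end;

begin
  Writeln(ClampToByte(-40)); // 0
  Writeln(ClampToByte(128)); // 128
  Writeln(ClampToByte(300)); // 255
end.
```

Inside SetPixel this would read, for example, Blue := ClampToByte(Blue + B * Factor shr 4); the timing should be checked against the expanded version before relying on it.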
And again, I am not saying that I know it all. I do make mistakes, but not in this case. I would love to be proven wrong, but with factual code done right, not with your assumptions based on something you have not seen for sure.
My test is attached here: FloydSteinberg.zip. I hope it works, unlike the project attached above, which is empty.
As for "with": the compiler might fail to generate nice assembly and fall back to shuffling the data around and accessing it continuously on the stack, introducing unneeded memory access. This happens with complex loops too. With "with", in many cases it will resolve the pointer once and reuse it from a register. Alas, there seems to be no gain in the function mentioned above, but once the function has a few more local variables it will go into 90s Turbo Pascal mode, especially on x64 platforms. I am not in the mood to sit and tweak such a case for you now, but the effect is there.
PS: re-reading this before posting, I sound rude and offended, and I am sorry for that. I don't mean to offend anyone and never meant to. I just had a very bad experience on a neighboring forum and am trying not to be triggered by personal remarks.