Registered since: 17 Nov 2005
Location: Hamburg
1,062 posts
Delphi XE2 Professional
Re: Floyd-Steinberg Dithering
7 Nov 2023, 12:14
You are really working hard on this!
Are you familiar with the CMOV instruction?
You can read about it here: https://www.felixcloutier.com/x86/cmovcc
Also look up some examples of how it is used. Trust me, you will feel really good afterwards, and your code above will go brrrrrrr faster.
E.g. the slow clamping part, which replaces values with
Code:
if v < 0 then
Blue := 0
else if v > 255 then
Blue := 255
else
Blue := v;
This could be two CMP and two CMOV with no branching or jumping at all. That should boost the speed nicely: removing the jumps relieves the branch predictor and lets out-of-order execution proceed unhindered.
instead of
Code:
           jle     @BlueZero        // value <= 0
           cmp     eax,255
           jbe     @BlueSet         // value <= 255
           mov     byte[edx],255    // set blue component to 255
           jmp     @Green           // go on to compute the green component
@BlueZero: xor     eax,eax          // blue = 0
@BlueSet:  mov     [edx],al         // store blue component
Alternatively, you could ditch all that and try SSE or MMX. Both have long been supported by CPUs (MMX for almost three decades), so there is no need to check for CPU compatibility. MMX will perform all these operations on one pixel (4 color channels) in parallel, so it should be around 4 times as fast as plain linear assembly.
SSE, MMX
Unfortunately, pixels are 3 bytes, not 4 bytes, in pf24bit bitmaps.
Furthermore, loading the R, G, B values from memory into SSE/MMX registers and storing them from SSE/MMX registers back to memory is (in my opinion) slower than my code.
Feel free to prove me wrong.
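To illustrate the layout issue, here is a minimal sketch in plain Pascal. The record `TBGRPixel` and the function `ScanlineStride` are hypothetical names introduced only for this example. The sketch assumes the usual Windows DIB layout: bytes stored as blue, green, red, and each scanline padded to a multiple of 4 bytes.

```pascal
program Pixel24Layout;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical 3-byte pixel record, mirroring the B,G,R byte order
  // assumed for pf24bit scanlines
  TBGRPixel = packed record
    Blue, Green, Red: Byte;
  end;

// Bytes per scanline: 3 bytes per pixel, rounded up to a multiple of 4
// (assumption: DIB-style row padding)
function ScanlineStride(Width: Integer): Integer;
begin
  Result := ((Width * 3 + 3) div 4) * 4;
end;

begin
  WriteLn('Bytes per pixel: ', SizeOf(TBGRPixel));
  WriteLn('Stride for a width of 5 pixels: ', ScanlineStride(5));
end.
```

With 3-byte pixels, a 4-byte or 8-byte load that starts at a pixel also picks up bytes of the following pixel, so an MMX/SSE version would probably have to unpack the channels before the arithmetic and repack them afterwards.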
CMOV
I am fully aware of that instruction.
However:
1) It would need 2 additional registers for the 0 and the 255 (CMOV with immediate operands is not supported). Alternatively, I could push a 0 and a 255 onto the stack and use CMOV from memory.
2) Both CMOV from registers and CMOV from memory are significantly slower than my code.
1139 CMOV from registers
1123 CMOV from memory
842 my code
Times are in ms.
Maybe my code contains errors (I did not spend too much time on it).
PS:
I use the "Intel® 64 and IA-32 Architectures Software Developer’s Manual" to get information about instructions.
(See Attachments)
Delphi source code:
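const Count=1000000000;
PROCEDURE TestCMov1;
asm
         push    edi               // edi and esi must be preserved
         push    esi
         push    0                 // dummy destination byte at [esp]
         mov     edi,0             // lower bound for CMOV
         mov     esi,255           // upper bound for CMOV
         mov     ecx,Count
@1:      mov     edx,-255
@2:      mov     eax,edx
         cmp     edx,0
         cmovl   eax,edi           // v < 0   -> 0
         cmp     edx,255
         cmovg   eax,esi           // v > 255 -> 255 (signed; cmova would also fire for negative v)
         mov     [esp],al
         add     edx,1
         cmp     edx,255
         jbe     @2                // unsigned compare: -254 is a huge unsigned value, so this
                                   // falls through after one pass; jle would sweep -255..255
         sub     ecx,1
         jne     @1
@End:    pop     ecx               // discard the dummy byte
         pop     esi
         pop     edi
end;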
Delphi source code:
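PROCEDURE TestCMov2;
asm
         push    0                 // [esp+8]: lower bound for CMOV
         push    255               // [esp+4]: upper bound for CMOV
         push    0                 // [esp]:   dummy destination byte
         mov     ecx,Count         // edi/esi are not needed here, so they are left untouched
@1:      mov     edx,-255
@2:      mov     eax,edx
         cmp     edx,0
         cmovl   eax,[esp+8]       // v < 0   -> 0
         cmp     edx,255
         cmovg   eax,[esp+4]       // v > 255 -> 255 (signed; cmova would also fire for negative v)
         mov     [esp],al
         add     edx,1
         cmp     edx,255
         jbe     @2                // unsigned compare: -254 is a huge unsigned value, so this
                                   // falls through after one pass; jle would sweep -255..255
         sub     ecx,1
         jne     @1
@End:    add     esp,12            // release the three stack slots
end;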
Delphi source code:
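PROCEDURE TestMov;
asm
         push    0                 // dummy destination byte at [esp]
         mov     ecx,Count         // edi/esi are not needed here, so they are left untouched
@1:      mov     edx,-255
@2:      mov     eax,edx
         cmp     eax,0
         jle     @Z                // v <= 0
         cmp     eax,255
         jbe     @S                // v <= 255 (v is known to be positive here)
         mov     byte[esp],255     // v > 255 -> 255
         jmp     @N
@Z:      xor     eax,eax           // v <= 0 -> 0
@S:      mov     [esp],al
@N:      add     edx,1
         cmp     edx,255
         jbe     @2                // unsigned compare: -254 is a huge unsigned value, so this
                                   // falls through after one pass; jle would sweep -255..255
         sub     ecx,1
         jne     @1
@End:    pop     ecx               // discard the dummy byte
end;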
Delphi source code:
PROCEDURE Test;  // needs Windows (GetTickCount), SysUtils (Format), Dialogs (ShowMessage)
var T0,T1,T2,T3:Cardinal;
begin
   T0:=GetTickCount;
   TestCMov1;
   T1:=GetTickCount;
   TestCMov2;
   T2:=GetTickCount;
   TestMov;
   T3:=GetTickCount;
   // turn the four timestamps into three elapsed times (ms)
   Dec(T3,T2);
   Dec(T2,T1);
   Dec(T1,T0);
   ShowMessage(Format(' %D CMOV from registers'#13' %D CMOV from memory'#13+
                      ' %D my code',[T1,T2,T3]));
end;
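Since the test routines were written quickly, one way to catch mistakes is to compare a clamp variant against a plain Pascal reference over the whole input range. The following is only a sketch: `ClampBranch`, `ClampMinMax` and the tested range of -255..510 are assumptions chosen for illustration and are not part of the code above.

```pascal
program ClampCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  Math;

// Reference: the if/else clamp to 0..255
function ClampBranch(V: Integer): Byte;
begin
  if V < 0 then
    Result := 0
  else if V > 255 then
    Result := 255
  else
    Result := Byte(V);
end;

// Variant under test (hypothetical): the same clamp written with Min/Max
function ClampMinMax(V: Integer): Byte;
begin
  Result := Byte(Max(0, Min(255, V)));
end;

var
  V, Mismatches: Integer;
begin
  Mismatches := 0;
  // Sweep values below, inside and above the valid byte range
  for V := -255 to 510 do
    if ClampBranch(V) <> ClampMinMax(V) then
      Inc(Mismatches);
  WriteLn('Mismatches: ', Mismatches);
end.
```

An assembler variant could be checked the same way by wrapping it in a function with the same signature and substituting it for `ClampMinMax`.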
Regards, Klaus
The Titanic was built by professionals,
Noah's Ark by an amateur.
... And this post by Amateurprofi....