Amateurprofi

Registered since: 17 Nov 2005
Location: Hamburg
1,085 posts
 
Delphi XE2 Professional
 
#26

Re: Floyd-Steinberg Dithering

  7 Nov 2023, 12:14
You are really working hard on this!

Are you familiar with the CMOV instruction?
Read about it here: https://www.felixcloutier.com/x86/cmovcc
Also look up examples of it and its usage; trust me, you will feel really good after that, and your code above will go brrrrrrr faster.

E.g. the clamping part, the slow one, which replaces values like this:
Code:
      if v < 0 then
        Blue := 0
      else if v > 255 then
        Blue := 255
      else
        Blue := v;
This could be two CMPs and two CMOVs with zero branching/jumping, which will boost the speed nicely by relieving branch prediction (removing the jumps) and letting out-of-order execution kick in unhindered,
instead of
Code:
               jle     @BlueZero           // is <= 0
               cmp     eax,255
               jbe     @BlueSet            // is <= 255
               mov     byte[edx],255       // set blue component to 255
               jmp     @Green              // compute green component
@BlueZero:    xor     eax,eax             // blue = 0
@BlueSet:     mov     [edx],al            // store blue component
You could also ditch all that and try SSE or MMX. Both have been supported by CPUs for a long time (MMX for almost three decades), so there is no need to check for CPU compatibility. MMX will perform all these operations on one pixel (4 colors) in parallel, so the speed should be around 4 times that of simple, plain linear assembly.
SSE, MMX
Unfortunately, pixels are 3 bytes, not 4 bytes, in pf24bit bitmaps.
Furthermore, loading the R, G, B values from memory into SSE/MMX registers and storing them from SSE/MMX registers back to memory is (in my opinion) slower than my code.
Feel free to prove me wrong.
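
To make the 3-byte layout concrete, here is a minimal sketch of how a pf24bit scan line sits in memory. The names `TBgrPixel` and `BytesPerScanLine24` are made up for illustration; the rounding of each row to a multiple of 4 bytes is the usual Windows DIB rule.

```pascal
type
  // One pf24bit pixel: three bytes stored blue, green, red, no padding byte.
  // (Hypothetical name; Windows.pas declares the same layout as TRGBTriple.)
  TBgrPixel = packed record
    Blue, Green, Red: Byte;
  end;

// Bytes occupied by one scan line of a 24-bit DIB:
// 3 bytes per pixel, rounded up to the next multiple of 4.
function BytesPerScanLine24(Width: Integer): Integer;
begin
  Result := ((Width * 3) + 3) and not 3;
end;
```

A 4-byte load at a pixel address therefore also picks up the blue byte of the following pixel, so an SSE/MMX version would have to unpack and repack three bytes per pixel instead of working on aligned 4-byte lanes.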

CMOV
I am fully aware of that instruction.
However:
1) It would need 2 additional registers for the 0 and the 255 (CMOV with immediate operands is not supported); alternatively, I could push a 0 and a 255 onto the stack, i.e. CMOV from memory.
2) Both CMOV from registers and CMOV from memory are significantly slower than my code.
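
For reference, the register variant from point 1 can be sketched as a standalone routine. This assumes the 32-bit Delphi compiler and its default `register` calling convention (argument and result in EAX, with ECX and EDX free to use); `ClampCMov` is a made-up name and not part of the dithering code.

```pascal
// Clamps V to 0..255 without branches (32-bit Delphi, register convention).
function ClampCMov(V: Integer): Integer;
asm
        xor     ecx, ecx        // constant 0 (CMOV takes no immediate operand)
        mov     edx, 255        // constant 255
        cmp     eax, ecx
        cmovl   eax, ecx        // V < 0   -> 0   (signed compare)
        cmp     eax, edx
        cmovg   eax, edx        // V > 255 -> 255 (signed compare)
end;
```

Both conditions are signed here. An unsigned condition such as CMOVA would treat a negative value as a very large number and clamp it to 255 instead of 0.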

1139 CMOV from registers
1123 CMOV from memory
842 my code
Times are in ms.
My code may contain errors (I did not spend too much time on it).
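
One way to rule out errors in a clamping routine is to compare it against a plain Pascal reference over the whole input range before timing it. The following is only a sketch; `TClampFunc`, `ClampReference` and `FindMismatch` are hypothetical names.

```pascal
type
  TClampFunc = function(V: Integer): Integer;

// Straightforward reference: clamp V to 0..255.
function ClampReference(V: Integer): Integer;
begin
  if V < 0 then
    Result := 0
  else if V > 255 then
    Result := 255
  else
    Result := V;
end;

// Returns True and the offending input if Candidate disagrees with the
// reference anywhere in Lo..Hi; returns False if all results match.
function FindMismatch(Candidate: TClampFunc; Lo, Hi: Integer;
  out BadValue: Integer): Boolean;
var
  V: Integer;
begin
  Result := False;
  BadValue := 0;
  for V := Lo to Hi do
    if Candidate(V) <> ClampReference(V) then
    begin
      BadValue := V;
      Result := True;
      Exit;
    end;
end;
```

A range such as -255..510 covers every sum of a byte value and a diffused error that fits in that interval, including both clamping cases.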

PS:
I use the "Intel® 64 and IA-32 Architectures Software Developer’s Manual" to get information about instructions.
(See Attachments)

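Delphi source code:
const Count=1000000000;
PROCEDURE TestCMov1;
const S:String=' ';
asm
      push edi             // save callee-saved registers
      push esi
      push 0               // scratch slot on the stack for the byte store
      mov edi,0            // constant 0 for CMOV
      mov esi,255          // constant 255 for CMOV
      mov ecx,Count
@1: mov edx,-255
@2: mov eax,edx
      cmp edx,0
      cmovl eax,edi        // < 0: clamp to 0
      cmp edx,255
      cmovg eax,esi        // > 255 (signed): clamp to 255
      mov [esp],al
      add edx,1
      cmp edx,255
      jbe @2               // unsigned compare: edx = -254 counts as above 255, so this falls through
      sub ecx,1
      jne @1
@End: pop ecx              // discard scratch slot
      pop esi
      pop edi
end;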
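Delphi source code:
PROCEDURE TestCMov2;
const S:String=' ';
asm
      push 0               // [esp+8]: constant 0 for CMOV
      push 255             // [esp+4]: constant 255 for CMOV
      push 0               // [esp]:   scratch slot for the byte store
      mov ecx,Count
@1: mov edx,-255
@2: mov eax,edx
      cmp edx,0
      cmovl eax,[esp+8]    // < 0: clamp to 0
      cmp edx,255
      cmovg eax,[esp+4]    // > 255 (signed): clamp to 255
      mov [esp],al
      add edx,1
      cmp edx,255
      jbe @2               // unsigned compare: edx = -254 counts as above 255, so this falls through
      sub ecx,1
      jne @1
@End: add esp,12           // release the three stack slots
end;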
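Delphi source code:
PROCEDURE TestMov;
const S:String=' ';
asm
      push 0               // scratch slot on the stack for the byte store
      mov ecx,Count
@1: mov edx,-255
@2: mov eax,edx
      cmp eax,0
      jle @Z               // <= 0
      cmp eax,255
      jbe @S               // <= 255
      mov byte[esp],255    // > 255: store 255
      jmp @N
@Z: xor eax,eax            // clamp to 0
@S: mov [esp],al           // store value
@N: add edx,1
      cmp edx,255
      jbe @2               // unsigned compare: edx = -254 counts as above 255, so this falls through
      sub ecx,1
      jne @1
@End: pop ecx              // discard scratch slot
end;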
Delphi source code:
PROCEDURE Test;
var T0,T1,T2,T3:Cardinal;
begin
   T0:=GetTickCount;
   TestCMov1;
   T1:=GetTickCount;
   TestCMov2;
   T2:=GetTickCount;
   TestMov;
   T3:=GetTickCount;
   Dec(T3,T2);
   Dec(T2,T1);
   Dec(T1,T0);
   ShowMessage(Format('%D CMOV from registers'#13'%D CMOV from memory'#13+
                      '%D my code',[T1,T2,T3]));
end;
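
GetTickCount typically advances in steps of roughly 10 to 16 ms, which is small against runs of about a second but still visible in the results. A sketch of the same kind of measurement with `TStopwatch` from `System.Diagnostics` follows; `TimeMs` and `EmptyWorkload` are made-up names for illustration.

```pascal
uses
  System.Diagnostics;

// Placeholder workload; substitute TestCMov1, TestCMov2 or TestMov.
procedure EmptyWorkload;
begin
end;

// Runs Proc once and returns the elapsed wall-clock time in milliseconds.
function TimeMs(Proc: TProcedure): Int64;
var
  SW: TStopwatch;
begin
  SW := TStopwatch.StartNew;
  Proc;
  Result := SW.ElapsedMilliseconds;
end;
```

With this helper, a call such as `TimeMs(TestMov)` would replace one pair of GetTickCount readings.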
Attached files
File type: pdf BasicArchitecture_25366521.pdf (3.02 MB, viewed 2 times)
File type: pdf InstructionSet_A-M_25366621.pdf (1.93 MB, viewed 2 times)
File type: pdf InstructionSet_N-Z_25366721.pdf (1.51 MB, viewed 1 time)
File type: pdf Optimization_24896613.pdf (2.43 MB, viewed 2 times)
Regards, Klaus
The Titanic was built by professionals,
Noah's Ark by an amateur.
... And this post by Amateurprofi....