You are really working hard on this!
Are you familiar with the CMOV instruction?
Read about it from here
https://www.felixcloutier.com/x86/cmovcc
Also look up some examples of how it is used. Trust me, you will feel really good after that, and your code above should go brrrrrrr faster.
E.g. the clamping part, the slow one that replaces values with
Code:
if v < 0 then
Blue := 0
else if v > 255 then
Blue := 255
else
Blue := v;
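This could be two CMP and two CMOV with zero branching/jumping (CMOV takes no immediate operand, so 0 and 255 have to be in registers first). That should boost the speed nicely when the branches are hard to predict, by relieving the branch predictor (the jumps are gone) and letting out-of-order execution kick in unhindered,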
instead of
Code:
jle @BlueZero // value <= 0
cmp eax,255
jbe @BlueSet // value <= 255
mov byte[edx],255 // set blue channel to 255
jmp @Green // compute green channel
@BlueZero: xor eax,eax // blue = 0
@BlueSet: mov [edx],al // store blue channel
Also you could ditch all that and try SSE or MMX. MMX has been on CPUs for almost 3 decades, so there is no real need to check for CPU compatibility. MMX can perform all these operations on one pixel (4 color channels) in parallel, so the speed could be up to around 4 times that of plain scalar assembly.
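To give an idea of the MMX route: the pack instructions saturate for free, so the clamp itself needs no compare at all. The sketch below assumes 32-bit Delphi, and that the four channel values of one pixel are already computed as signed 32-bit integers. The type and procedure names are made up for illustration.

```pascal
type
  TIntQuad  = array[0..3] of Integer; // four computed channel values (hypothetical layout)
  TByteQuad = array[0..3] of Byte;    // four clamped channel bytes

// Hypothetical helper, Win32 only: EAX = address of Src, EDX = address of Dst.
procedure ClampPixelMMX(const Src: TIntQuad; var Dst: TByteQuad);
asm
  movq     mm0, [eax]     // channels 0 and 1 as signed 32-bit values
  movq     mm1, [eax+8]   // channels 2 and 3
  packssdw mm0, mm1       // 4 x 32-bit -> 4 x signed 16-bit, saturated
  packuswb mm0, mm0       // 4 x 16-bit -> unsigned bytes, saturated to 0..255
  movd     [edx], mm0     // store the four clamped bytes
  emms                    // leave MMX state so later FPU code works
end;
```

In a real loop you would keep the values in MMX registers across the whole per-pixel calculation and execute EMMS once after the loop, not once per pixel.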