Rawhed's Tutorial #5:


MMX Coding

[Introduction]
[MMX Registers]
[MMX Instructions]
[Saturated vs Wraparound]
[MMX Instructions - EMMS]
[MMX Instructions - Moving 32/64bits]
[MMX Instructions - Addition & Subtraction]
[MMX Instructions - Shifting]
[MMX Instructions - Logical Instructions]
[MMX Instructions - Multiply]
[MMX Instructions - Comparing]
[MMX Instructions - Packing/Unpacking]
[Little Tips]
[Some Implementation Ideas - Vector Rotation]
[Some Implementation Ideas - ARGB pixels]
[Some Implementation Ideas - Byte Shifting]
[Some Implementation Ideas - Crossfading]
[Some Implementation Ideas - Blurring]
[Some Implementation Ideas - Complex Multiplications]
[Closing Words]

---==[Introduction]==---------------------------------------------------------

When Intel released MMX I thought that it sucked! I thought this until I tried it out a few months ago, and it's actually VERY cool. I think the reason no one took it seriously was Intel's marketing. I mean look at the P3 - it's supposed to enhance your internet experience. "What crap!" everyone thinks. hehe. The P3 has very cool SSE instructions, which are basically Intel's reply to AMD's 3DNow! technology - but more about that in another tutorial.

I've seen a few demos starting to use MMX, which is very cool. My current demo engine checks if the machine has MMX/3DNow! and if so, uses the appropriate functions optimised with those instructions. I think I've started off badly with all my rambling, so let me give you a bit of background about MMX.

Basically MMX is a set of instructions for the Pentium range of machines and is Intel's first big change to their x86 instruction set since the 386 (1985). There are 57 new instructions in all. The instructions are very good for multimedia type processing - things like audio, video, imagery etc. MMX also comes with 8 new 64bit registers (well, sort of) and uses SIMD (Single Instruction Multiple Data), which basically means that one instruction can handle multiple pieces of data in parallel. No one is totally sure what MMX stands for because Intel have never said, however it seems most people agree it's MultiMedia eXtensions.

Most machines these days support MMX. I'm not sure if modern compilers optimise using MMX, or if big commercial programs use MMX much - but they should! It's high time everyone accepts MMX as standard. Even Cyrix and AMD machines support it. I hope to see more MMX demo stuff too :) Maybe it will help demos have more particles/polygons/whatever than ever before.

Anyways, enough gibberish - on with the tut.

---==[MMX Registers]==--------------------------------------------------------

There are 8 new MMX registers. They are called MM0, MM1, MM2...MM7. They are not really new, because physically they are not separate registers on the chip - instead they use the floating point registers. As you know, the registers in the FP unit are 80bits wide. When an MMX instruction writes a register, the sign bit (bit 79) and the exponent part of that FP register are filled with 1's, and the remaining 64bits of the FP register is where the MMX register lives.

So basically the MMX registers are aliased onto the floating point registers. What this means is that while using MMX you can't use FP instructions. You have to call EMMS when you have finished a block of MMX code, before any FP code runs.

MMX instructions work with the 64bit registers in various ways - depending on the instruction. There are 4 new ways the instructions can look at the 64bit data:

        Packed Bytes   - 64bits divided into 8 bytes.

        Packed Words   - 64bits divided into 4 words.

        Packed Dwords  - 64bits divided into 2 dwords.

        Quadword       - 64bits undivided.

So if you are using an instruction that works on packed bytes, it will perform 8 operations - one on each byte. Each byte is treated as an independent entity and the operation will not touch any of the other bytes. The same goes for the packed words and dwords.
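To make the four views a bit more concrete, here is a minimal sketch in portable C (not real MMX code) that pulls individual elements out of a 64bit value. The helper names get_byte, get_word and get_dword are made up for this illustration, and element 0 is assumed to be the least significant one.

```c
#include <stdint.h>

/* Extract element i of a 64-bit value; element 0 is the least significant. */
static uint8_t get_byte(uint64_t q, int i)
{
    return (uint8_t)(q >> (8 * i));     /* one of 8 packed bytes */
}

static uint16_t get_word(uint64_t q, int i)
{
    return (uint16_t)(q >> (16 * i));   /* one of 4 packed words */
}

static uint32_t get_dword(uint64_t q, int i)
{
    return (uint32_t)(q >> (32 * i));   /* one of 2 packed dwords */
}
```

The same 64 bits are read three different ways; which view applies is decided by the instruction you use, not by the register.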

---==[MMX Instructions]==-----------------------------------------------------

MMX instructions are all pretty much formatted the same way:

    instruction dest,source

The mmx instructions often have suffixes which describe

    a)signed/unsigned operation

    b)saturated/wraparound operation

    c)whether the instruction works on packed bytes,words,dwords or qword.

For example the instruction padd can be used as:


    paddusw MM2,mem1 (add unsigned, saturated operation using packed words)

    paddb MM2,mem1 (add using wraparound, on packed bytes)

Some mmx instructions only work on certain datatypes, so I've indicated this when I describe the instruction.

---==[Saturated vs Wraparound]==----------------------------------------------

You will see that some of the instructions support something called "saturation". This is a very cool new thing in mmx that stops wraparounds(overflows) from happening when you exceed the datarange limits. For example:

mov al,250
add al,10 ;al is now equal to 4 (260-256). This is wraparound/overflow.

mov eax,250
mov ebx,10
movd MM0,eax
movd MM1,ebx
paddusb MM0,MM1
movd eax,MM0 ;eax = 255. This is saturation.

Similarly if we'd been dealing with unsigned 16bit words then they would saturate at 65535 and zero. If it's signed saturation, then the clipping values will be the signed limits of that datatype, eg for bytes: -128..127. In the above example of saturation, the paddusb is actually doing 8 additions with saturation, all at the same time!
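The difference between the three behaviours can be sketched for a single byte in portable C. This is only a model of what one element of PADDB, PADDUSB and PADDSB does, and the function names are made up for this illustration.

```c
#include <stdint.h>

/* Wraparound add: the result is taken modulo 256, like "add al,10". */
static uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturating add: clamps to 0..255, like one element of PADDUSB. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Signed saturating add: clamps to -128..127, like one element of PADDSB. */
static int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = (int)a + b;
    if (sum > 127)  sum = 127;
    if (sum < -128) sum = -128;
    return (int8_t)sum;
}
```

Note that the byte 250 read as a signed number is -6, so a signed saturating add of 250 and 10 gives 4 and never clips. That is why the unsigned form is the one to use for pixel data.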

---==[MMX Instructions - EMMS]==----------------------------------------------

Since the MMX registers use the same space that the FPU uses, they can't be used simultaneously. EMMS must always be called after a block of MMX code, otherwise when FP code is executed after it, stack overflows and wrong answers will arise as the FPU trips over the residual MMX state. The only problem is that EMMS is very slow (50 cycles). AMD fixed this problem with their FEMMS instruction which does the same thing, except it's a lot faster (5 cycles).

---==[MMX Instructions - Moving 32/64bits]==----------------------------------

These instructions are very important because they are how you move data into/around/from the MMX registers. There are 2 MMX data moving instructions, MOVD and MOVQ. As I'm sure you've already worked out, MOVD moves 32bits of data and MOVQ moves 64bits. Here are their operands:


        MOVD dest,src

        MOVD MMXreg,x86reg/Mem

        MOVD x86reg/Mem,MMXreg

        MOVQ MMXreg,MMXreg/Mem

        MOVQ MMXreg/Mem,MMXreg

However you cannot have both the destination and source operands as memory addresses! This is a major bummer because otherwise nice fast 64bit memory copies could be done.

MOVD can be used to load data into an MMX register from a normal x86 register. So if you want to move the 32bit value of EAX into MM3 you would use MOVD MM3,eax. This fills the lower 32bits of MM3 with the value of eax and fills the upper 32bits with zeros. This is what is often used to get data into the MMX registers for them to play with. After your MMX routine is done and you want to get the result, you can use MOVD eax,MM3. Note that one of MOVD's operands must always be an x86 register or a memory address, so to copy one MMX register to another you use MOVQ.

MOVQ can't access normal 32bit registers so you have to use MOVD to load/unload x86 register data. MOVQ is used to load/unload data to/from memory. For example if you have an ARGB memory buffer you can use MOVQ MM0,mem1 to load 2 pixels(64bits) into register MM0. To put the data back once it's been through your MMX routine just use something like MOVQ mem2,MM0.

---==[MMX Instructions - Addition & Subtraction]==----------------------------

PADD and PSUB are the base mmx addition and subtraction instructions. Applying suffixes to them allows you to specify whether you want a signed/unsigned and wraparound/saturated instruction. These instructions can accept MMX registers or memory addresses as source operands, but only MMX registers as destination operands:


        PADDx dest,src

        PADDx MMXreg,MMXreg/Mem

        PSUBx MMXreg,MMXreg/Mem

Here are the add/sub instructions and on what datatypes they work:


    PADD   (packed wraparound add)          - byte - word - dword

    PADDS  (packed signed saturated add)    - byte - word

    PADDUS (packed unsigned saturated add)  - byte - word

    PSUB   (packed wraparound sub)          - byte - word - dword

    PSUBS  (packed signed saturated sub)    - byte - word

    PSUBUS (packed unsigned saturated sub)  - byte - word

Now that you know the datatypes that they work on you can just add the (b, w or d) suffix to the instructions, eg:


    PADDB, PADDW, PADDD    - each one for a different data type

    PADDSB, PADDSW         - each one for a different data type

    PADDUSB, PADDUSW       - each one for a different data type

So PADDB works on packed bytes and PADDW works on packed words - but how? This is the beauty of MMX - it does things in parallel. A PADDB will do 8 additions. Here is how:


    MM0 - |008|000|005|000|255|000|001|045|       8 bytes(64bits)

    MM1 - |000|057|005|000|005|000|001|002|       8 bytes(64bits)


    PADDB MM0,MM1


    result(mm1 unchanged):

    MM0 - |008|057|010|000|004|000|002|047|       8 bytes(64bits)

This adds each of the 8 byte elements and puts the resulting 8 bytes into the destination operand(MM0). PADDB is a wraparound instruction of course, so 255+5=4.

If we were using the PADDUSB instruction it would have worked the same, except for the 255+5:


    MM0 - |008|000|005|000|255|000|001|045|       8 bytes(64bits)

    MM1 - |000|057|005|000|005|000|001|002|       8 bytes(64bits)


    PADDUSB MM0,MM1


    result(mm1 unchanged):

    MM0 - |008|057|010|000|255|000|002|047|       8 bytes(64bits)

One more example now, except using packed words and unsigned saturated subtraction:


    MM0 - |001234|000010|000005|008516|       4 words(64bits)

    MM1 - |000001|000020|000001|009343|       4 words(64bits)


    PSUBUSW MM0,MM1


    result(mm1 unchanged):

    MM0 - |001233|000000|000004|000000|       4 words(64bits)
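The three diagrams above can be reproduced with a small model in portable C. This is a sketch of what the instructions compute, not real MMX code. The functions borrow the instruction names purely for readability, and pack_bytes/pack_words are made-up helpers that take their arguments high to low, in the same order as the diagrams.

```c
#include <stdint.h>

/* Build a 64-bit value from 8 bytes, written high to low like the diagrams. */
static uint64_t pack_bytes(uint8_t b7, uint8_t b6, uint8_t b5, uint8_t b4,
                           uint8_t b3, uint8_t b2, uint8_t b1, uint8_t b0)
{
    const uint8_t b[8] = { b7, b6, b5, b4, b3, b2, b1, b0 };
    uint64_t q = 0;
    for (int i = 0; i < 8; i++)
        q = (q << 8) | b[i];
    return q;
}

/* Build a 64-bit value from 4 words, written high to low. */
static uint64_t pack_words(uint16_t w3, uint16_t w2, uint16_t w1, uint16_t w0)
{
    return ((uint64_t)w3 << 48) | ((uint64_t)w2 << 32) |
           ((uint64_t)w1 << 16) | w0;
}

/* PADDB model: 8 independent adds, each wrapping modulo 256. */
static uint64_t paddb(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned sum = (unsigned)((dst >> (8 * i)) & 0xFF) +
                       (unsigned)((src >> (8 * i)) & 0xFF);
        r |= (uint64_t)(sum & 0xFF) << (8 * i);
    }
    return r;
}

/* PADDUSB model: 8 independent adds, each clamped to 255. */
static uint64_t paddusb(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        unsigned sum = (unsigned)((dst >> (8 * i)) & 0xFF) +
                       (unsigned)((src >> (8 * i)) & 0xFF);
        if (sum > 255)
            sum = 255;
        r |= (uint64_t)sum << (8 * i);
    }
    return r;
}

/* PSUBUSW model: 4 independent subtractions, each clamped to 0. */
static uint64_t psubusw(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        unsigned a = (unsigned)((dst >> (16 * i)) & 0xFFFF);
        unsigned b = (unsigned)((src >> (16 * i)) & 0xFFFF);
        r |= (uint64_t)(a > b ? a - b : 0u) << (16 * i);
    }
    return r;
}
```

A scalar model like this is handy as a reference when you want to check the output of a real MMX routine on a few test values.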

---==[MMX Instructions - Shifting]==-----------------------------------------

These instructions are very similar to the old x86 SHL and SHR instructions, only they are very cool because they work on the different packed formats so they can shift multiple values in one instruction. Here are the base shifting instructions and what datatypes they act on:


    PSLL (Packed Shift Left Logical)      - word - dword - qword

    PSRA (Packed Shift Right Arithmetic)  - word - dword

    PSRL (Packed Shift Right Logical)     - word - dword - qword

So once again (just like the padd & psub instructions) just add the suffixes to the base instruction name to get the instruction name that works on a certain datatype. Eg:


    PSLLW - does a left logical shift on the packed word datatype

    PSRAD - does a right arithmetic shift on the packed dword datatype

These shifting instructions are all formatted the same way as the SHL and SHR instructions:


    instruction dest, shiftamount

        PSLLW MMXreg, MMXreg/Mem/Immed

    eg: PSLLW MM1, 3

The PSLLx and PSRLx instructions are all basically the same. They shift the bits to the left/right and fill the low/high order bits with zeros. Here is an example of the PSRLW instruction:


    MM4(64bits, 4words):

    |0000100001001100|0000000000011111|0000011000001100|1111110000000000|


    PSRLW MM4,5 (packed logical shift to the right by 5)


    MM4(64bits, 4words):

    |0000000001000010|0000000000000000|0000000000110000|0000011111100000|

As you can see from this example, zeros fill the high order bits and the low order bits that shift too far right are killed. The PSLLx instruction works just like this except that it shifts to the left and the low order bits are filled with zeros.

Both of those sets of instructions are called "logical" while the PSRAx instructions are "arithmetic". This is basically calling them unsigned and signed instructions. The arithmetic instruction takes into account whether the data is positive or negative. PSRAx shifts data to the right. If the data element is positive then it fills the high order bits of the destination with zeros. If the data element is negative then it fills the high order bits of the destination with ones. Remember that a data element is negative if its highest bit is set(1). Here is an example of how it works:


    MM4(64bits, 4words):

    |0000100001001100|0000000000011111|1000011000001100|1111110000000000|


    PSRAW MM4,5 (packed arithmetic shift to the right by 5)


    MM4(64bits, 4words):

    |0000000001000010|0000000000000000|1111110000110000|1111111111100000|
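Both shift examples can be checked with a small model in portable C. It is a sketch of what PSRLW and PSRAW compute, and it assumes a shift count of 0..15. The helper pack_words is made up for this illustration and takes its words high to low, like the diagrams.

```c
#include <stdint.h>

/* Build a 64-bit value from 4 words, written high to low like the diagrams. */
static uint64_t pack_words(uint16_t w3, uint16_t w2, uint16_t w1, uint16_t w0)
{
    return ((uint64_t)w3 << 48) | ((uint64_t)w2 << 32) |
           ((uint64_t)w1 << 16) | w0;
}

/* PSRLW model: logical right shift of each word, zeros come in at the top. */
static uint64_t psrlw(uint64_t v, unsigned n)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t w = (uint16_t)(v >> (16 * i));
        r |= (uint64_t)(uint16_t)(w >> n) << (16 * i);
    }
    return r;
}

/* PSRAW model: arithmetic right shift, copies of the sign bit come in. */
static uint64_t psraw(uint64_t v, unsigned n)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t w = (uint16_t)(v >> (16 * i));
        uint16_t s = (uint16_t)(w >> n);
        if (w & 0x8000u)
            s |= (uint16_t)~(0xFFFFu >> n);   /* fill the top n bits with 1s */
        r |= (uint64_t)s << (16 * i);
    }
    return r;
}
```

In hex the PSRLW example is 084C,001F,060C,FC00 shifted to 0042,0000,0030,07E0, and the PSRAW example is 084C,001F,860C,FC00 shifted to 0042,0000,FC30,FFE0.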

It's a GREAT pity that the shifting instructions don't work on the packed byte datatype, otherwise we could shift 8 bytes at a time and if using ARGB data this would be invaluable! Oh well.. We can get around this by doing the hack mentioned later on in this tut, in the section called "some implementation ideas".

---==[MMX Instructions - Logical Instructions]==------------------------------

These are your MMX equivalents of bitwise instructions like AND, OR and XOR. They only work on 64bits(qword) so the instruction is formatted:


    instruction dest,src

    instruction MMXreg,MMXreg/Mem

There are 4 MMX bitwise instructions: pand, pandn, por and pxor.

PAND works just like normal ANDing except that it's being applied to 64bits. As a refresher:


    0 AND 0 = 0 

    1 AND 1 = 1

    1 AND 0 = 0

    0 AND 1 = 0

PANDN(Not AND) first inverts the bits of the destination then applies the logical AND.


    0(1) ANDN 0 = 0

    1(0) ANDN 1 = 0

    1(0) ANDN 0 = 0

    0(1) ANDN 1 = 1

POR:


    0 OR 0 = 0

    1 OR 1 = 1

    1 OR 0 = 1

    0 OR 1 = 1

PXOR(exclusive OR):


    0 XOR 0 = 0

    1 XOR 1 = 0

    1 XOR 0 = 1

    0 XOR 1 = 1

I often PXOR MM7,MM7 to make my MM7 register==0. This is very useful when doing packing/unpacking as you will see later.

---==[MMX Instructions - Multiply]==------------------------------------------

The 3 MMX multiplication instructions all operate on packed 16bit words, and each multiplication produces a 32bit product. The 3 instructions are:


    PMADD (Packed Multiply Add)     - word-->dword

    PMULH (Packed Multiply High)    - word

    PMULL (Packed Multiply Low)     - word

All 3 work as:


    instruction dest,src

    instruction MMXreg,MMXreg/Mem

PMADDWD multiplies each of the 4 words in the source operand with the corresponding word in the destination operand - producing 4 dwords. The lower two dwords are added together and stored as 1 dword in the lower 32bits of the destination register. The same is done for the higher 2 dwords, except that the sum is stored in the highest 32bits of the destination register. You can see why the suffix of the instruction is "WD", because it takes words as input but the output is in packed dwords. This instruction can be very useful for a variety of things. Complex number multiplication benefits from it immensely as it requires 4 multiplications and two additions. Also imagine how easily it could do 2 lots of (x*x)+(y*y) in parallel!

PMULHW multiplies each of the 4 signed words in the source operand with the corresponding word in the destination operand. This again produces four 32bit numbers, so it discards the lower 16bits of each result and stores the higher 16bits in the corresponding word of the destination operand.

PMULLW does the same as PMULHW except it discards the higher 16bits and stores the lower 16bits of the multiplication result.
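Here is a sketch of the three multiplies as a model in portable C, treating the words as signed. It is not real MMX code. The functions borrow the instruction names for readability, and pack_words and lane16 are made-up helpers.

```c
#include <stdint.h>

/* Build a 64-bit value from 4 words, written high to low. */
static uint64_t pack_words(uint16_t w3, uint16_t w2, uint16_t w1, uint16_t w0)
{
    return ((uint64_t)w3 << 48) | ((uint64_t)w2 << 32) |
           ((uint64_t)w1 << 16) | w0;
}

/* Signed value of word i (word 0 is the lowest). */
static int32_t lane16(uint64_t v, int i)
{
    int32_t w = (int32_t)((v >> (16 * i)) & 0xFFFF);
    return w >= 0x8000 ? w - 0x10000 : w;
}

/* PMULLW model: keep the low 16 bits of each 32-bit product. */
static uint64_t pmullw(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t p = (uint32_t)(lane16(dst, i) * lane16(src, i));
        r |= (uint64_t)(p & 0xFFFF) << (16 * i);
    }
    return r;
}

/* PMULHW model: keep the high 16 bits of each 32-bit product. */
static uint64_t pmulhw(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t p = (uint32_t)(lane16(dst, i) * lane16(src, i));
        r |= (uint64_t)(p >> 16) << (16 * i);
    }
    return r;
}

/* PMADDWD model: multiply word pairs, then add neighbouring products. */
static uint64_t pmaddwd(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int j = 0; j < 2; j++) {
        int64_t sum = (int64_t)lane16(dst, 2 * j) * lane16(src, 2 * j) +
                      (int64_t)lane16(dst, 2 * j + 1) * lane16(src, 2 * j + 1);
        r |= (uint64_t)(uint32_t)sum << (32 * j);
    }
    return r;
}
```

Feeding the same register in as both operands gives the "2 lots of (x*x)+(y*y)" case: with the words 5,12,3,4 the two dwords come out as 169 and 25. Running PMULLW and PMULHW on the same inputs gives you the two halves of the full 32bit products.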

---==[MMX Instructions - Comparing]==------------------------------------------

Yes, there are even some new additions to the CMP family! :) They are quite weird, let me introduce them:


    PCMPEQ (Packed Compare for Equality)     - byte - word - dword

    PCMPGT (Packed Compare for Greater Than) - byte - word - dword

From that you can work out all the actual instructions:


    PCMPEQB, PCMPEQW, PCMPEQD

    PCMPGTB, PCMPGTW, PCMPGTD

PCMPEQx compares the data elements (whatever their size) in the source operand to those in the destination operand. If they are equal, 1's are written to every bit of that element of the destination operand, if not then 0's are written. So you end up with a destination operand comprised of zero and FF(PCMPEQB)/FFFF(PCMPEQW)/FFFFFFFF(PCMPEQD) data elements.

PCMPGTx does the same as PCMPEQx, except that 1's are written to the destination data element if it is greater than the source data element, otherwise 0's are written. The comparison treats the elements as signed numbers.
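The all-1's/all-0's results are meant to be used as masks with the logical instructions. As a sketch (a model in portable C, not real MMX code), here are the word compares plus one common use of the mask: picking the larger of two words with AND, ANDN and OR. The name max_words and the helpers pack_words and lane16 are made up for this illustration.

```c
#include <stdint.h>

/* Build a 64-bit value from 4 words, written high to low. */
static uint64_t pack_words(uint16_t w3, uint16_t w2, uint16_t w1, uint16_t w0)
{
    return ((uint64_t)w3 << 48) | ((uint64_t)w2 << 32) |
           ((uint64_t)w1 << 16) | w0;
}

/* Signed value of word i (word 0 is the lowest). */
static int32_t lane16(uint64_t v, int i)
{
    int32_t w = (int32_t)((v >> (16 * i)) & 0xFFFF);
    return w >= 0x8000 ? w - 0x10000 : w;
}

/* PCMPEQW model: a word becomes FFFF where dst == src, otherwise 0000. */
static uint64_t pcmpeqw(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        if (lane16(dst, i) == lane16(src, i))
            r |= (uint64_t)0xFFFF << (16 * i);
    return r;
}

/* PCMPGTW model: a word becomes FFFF where dst > src (signed), else 0000. */
static uint64_t pcmpgtw(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++)
        if (lane16(dst, i) > lane16(src, i))
            r |= (uint64_t)0xFFFF << (16 * i);
    return r;
}

/* Signed maximum of each word pair, built from the compare mask. */
static uint64_t max_words(uint64_t a, uint64_t b)
{
    uint64_t mask = pcmpgtw(a, b);      /* all 1s where a > b            */
    return (a & mask) | (~mask & b);    /* PAND, then PANDN, then POR    */
}
```

Because there are no flags and no jumps involved, this kind of masking is how you do "if" style decisions on several elements at once.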

---==[MMX Instructions - Packing/Unpacking]==--------------------------------

I've left a very important set of instructions till last. I'm not sure why.. perhaps it's all about saving the best till last. These aren't a magical set of instructions which will make your code amazingly fast. They are very important because they allow you to control the format of data going into the MMX registers so that you can use its parallelism. They are also pretty cool because some of them perform saturation too. Often your data won't be in the nice format needed for parallel number crunching, and these instructions can convert it into that format. You then use your cool MMX function on it and pack the numbers back into their original format.


    PACKSS  (Pack Signed Saturated)    - byte<--word - word<--dword

    PACKUS  (Pack Unsigned Saturated)  - byte<--word

    PUNPCKH (Unpack High Data)         - byte-->word - word-->dword - dword-->qword

    PUNPCKL (Unpack Low Data)          - byte-->word - word-->dword - dword-->qword

This whole "-->" thing might seem confusing, but it's the same as it is for the PMADDWD instruction. Looking at PACKSS and PUNPCKL, it means that the instructions are:


    PACKSSWB, PACKSSDW

    PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ

The packing instructions take larger data elements and convert them to smaller data elements(eg word<--dword). The unpacking instructions take smaller data elements and convert them to larger data elements.

PACKUSWB: First off a saturation check is performed on the data elements in the source and destination operands. If a word is negative it is made 0, and if the word is greater than 255(the maximum value of a byte) it is clipped to 255. Now in the source and destination operands you have 16bit words with values between 0 and 255. All it does now is collect these 8 bytes and put them into the destination operand. First the 4 words of the destination are put into the lower 4 bytes of the destination operand, and then the 4 words of the source are put into the upper 4 bytes of the destination operand. Here is how:


    MM0 - |008000|000005|000230|001045|       4 words(64bits)

    MM1 - |-00005|024525|002345|000112|       4 words(64bits)


    PACKUSWB MM0,MM1(Pack Unsigned Saturated Words to Bytes)


    result of saturation:

    (done in processor so MM1 doesn't actually change)

    MM0 - |000255|000005|000230|000255|       4 words(64bits)

    MM1 - |000000|000255|000255|000112|       4 words(64bits)

    final result:

    MM0 - |000|255|255|112|255|005|230|255|      8 bytes(64bits)

PACKSSx does the same as PACKUSWB except that because it's signed it saturates if the number is bigger than 127 or less than -128. There is also a dword to word version (PACKSSDW), which saturates to -32768..32767.
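The PACKUSWB example above can be checked with a small model in portable C. This is a sketch of what the instruction computes, not real MMX code. The helpers pack_bytes, pack_words, lane16 and sat_u8 are made up for this illustration, and -5 is passed as its 16bit bit pattern 0xFFFB.

```c
#include <stdint.h>

/* Build a 64-bit value from 8 bytes, written high to low like the diagrams. */
static uint64_t pack_bytes(uint8_t b7, uint8_t b6, uint8_t b5, uint8_t b4,
                           uint8_t b3, uint8_t b2, uint8_t b1, uint8_t b0)
{
    const uint8_t b[8] = { b7, b6, b5, b4, b3, b2, b1, b0 };
    uint64_t q = 0;
    for (int i = 0; i < 8; i++)
        q = (q << 8) | b[i];
    return q;
}

/* Build a 64-bit value from 4 words, written high to low. */
static uint64_t pack_words(uint16_t w3, uint16_t w2, uint16_t w1, uint16_t w0)
{
    return ((uint64_t)w3 << 48) | ((uint64_t)w2 << 32) |
           ((uint64_t)w1 << 16) | w0;
}

/* Signed value of word i (word 0 is the lowest). */
static int32_t lane16(uint64_t v, int i)
{
    int32_t w = (int32_t)((v >> (16 * i)) & 0xFFFF);
    return w >= 0x8000 ? w - 0x10000 : w;
}

/* Clamp a signed word to the unsigned byte range 0..255. */
static uint8_t sat_u8(int32_t w)
{
    return (uint8_t)(w < 0 ? 0 : (w > 255 ? 255 : w));
}

/* PACKUSWB model: dst words fill the low 4 bytes, src words the high 4. */
static uint64_t packuswb(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        r |= (uint64_t)sat_u8(lane16(dst, i)) << (8 * i);
        r |= (uint64_t)sat_u8(lane16(src, i)) << (8 * (i + 4));
    }
    return r;
}
```

Packing a register against a zeroed register (as in PACKUSWB MM0,MM7 with MM7==0) simply leaves the upper 4 bytes of the result at zero.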

PUNPCKHx only works on the higher 32bits of the destination and source operands. It takes data elements from each and interleaves them into the destination operand. Here is an example(in hex for convenience):


    MM0 |AF|45|0E|8A|12|67|FF|00|	8 bytes(64bits)

    MM1 |11|91|AB|5C|93|B8|0F|09|	8 bytes(64bits)


    PUNPCKHBW MM0,MM1(Unpack High Data from Bytes to Words)


    result:

    MM0 |11AF|9145|AB0E|5C8A|	4 words(64bits)

PUNPCKLx does the same except that it takes data from the lower 32bits of the source and destination operand. eg:


    MM0 |AF|45|0E|8A|12|67|FF|00|	8 bytes(64bits)

    MM1 |11|91|AB|5C|93|B8|0F|09|	8 bytes(64bits)


    PUNPCKLBW MM0,MM1(Unpack Low Data from Bytes to Words)


    result:

    MM0 |9312|B867|0FFF|0900|	4 words(64bits)
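The interleaving in both examples can be reproduced with a short model in portable C. This is only a sketch of what PUNPCKLBW and PUNPCKHBW compute. The functions borrow the instruction names for readability.

```c
#include <stdint.h>

/* PUNPCKLBW model: interleave the low 4 bytes of dst and src.
   Each result word has the dst byte in its low half, the src byte on top. */
static uint64_t punpcklbw(uint64_t dst, uint64_t src)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t d = (dst >> (8 * i)) & 0xFF;
        uint64_t s = (src >> (8 * i)) & 0xFF;
        r |= (d | (s << 8)) << (16 * i);
    }
    return r;
}

/* PUNPCKHBW model: the same interleave, applied to the high 4 bytes. */
static uint64_t punpckhbw(uint64_t dst, uint64_t src)
{
    return punpcklbw(dst >> 32, src >> 32);
}
```

When the source is zero, every byte of the destination simply gets a zero byte put on top of it, which turns 4 unsigned bytes into 4 words. This is the reason for keeping a zeroed register around.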

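One might wonder why on earth you need such WEIRD instructions? Well let me give you an example. Let's say we want to add and average 2 ARGB pixels from different memory locations. If we put the 2 pixels into different MMX registers and tried to add them as bytes (without saturation) we would get overflows and all sorts of weird things happening, and with saturation the sum would clip at 255 before we could halve it. So instead we unpack each byte into a word, do the maths in 16bits where there is room to spare, and pack the result back into bytes. Look:

    pxor MM7,MM7        ;MM7=0, used to zero extend the bytes
    movd MM0,[edi]      ;load pixel 1
    movd MM1,[esi]      ;load pixel 2
    punpcklbw MM0,MM7   ;unpack the 4 bytes of pixel 1 into 4 words
    punpcklbw MM1,MM7   ;unpack the 4 bytes of pixel 2 into 4 words
    paddusw MM0,MM1     ;add the 4 channels
    psrlw MM0,1         ;/2
    packuswb MM0,MM7    ;pack the 4 words back into 4 bytes
    movd [esi],MM0      ;store the averaged pixel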
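The same unpack, add, shift and pack steps can be written as a scalar reference in portable C. This is a sketch for checking results, not a replacement for the MMX version, and the name average_argb is made up for this illustration.

```c
#include <stdint.h>

/* Average two 32-bit ARGB pixels channel by channel. */
static uint32_t average_argb(uint32_t p1, uint32_t p2)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        /* widen each 8-bit channel to 16 bits (the PUNPCKLBW step) */
        uint16_t c1 = (uint16_t)((p1 >> (8 * i)) & 0xFF);
        uint16_t c2 = (uint16_t)((p2 >> (8 * i)) & 0xFF);
        uint16_t sum = (uint16_t)(c1 + c2);   /* at most 510, no overflow  */
        uint16_t avg = (uint16_t)(sum >> 1);  /* the PSRLW by 1 step       */
        out |= (uint32_t)avg << (8 * i);      /* back to bytes (PACKUSWB)  */
    }
    return out;
}
```

For example the pixels FF804000 and 01802001 (hex) average to 80803000. The alpha channel FF+01 would have wrapped to 00 in a plain byte add.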

---==[Little Tips]==---------------------------------------------------------

These are little tips/tricks - a few I've made up myself and most I've collected:

Making an MMX register=0

PXOR MM0, MM0

Filling all 64bits of a MMX register with 1s.

PCMPEQB MM1, MM1

Compute the absolute difference of 2 unsigned numbers.
(assuming packed-byte or packed-words)

Input: MM0: source operand
MM1: source operand

Output: MM0: The absolute difference of the unsigned operands

MOVQ MM2, MM0 ; make a copy of MM0
PSUBUSB MM0, MM1 ; compute difference one way
PSUBUSB MM1, MM2 ; compute difference the other way
POR MM0, MM1 ; OR them together
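Why the absolute difference trick works is easy to see for a single byte: one of the two saturated subtractions always clips to zero, so ORing them leaves the other one. A minimal sketch in portable C, with made-up function names:

```c
#include <stdint.h>

/* Unsigned saturating subtract: clamps at 0, like one element of PSUBUSB. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* |a - b| for unsigned bytes: one side is always 0, so OR picks the other. */
static uint8_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(sub_sat_u8(a, b) | sub_sat_u8(b, a));
}
```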

---==[Some Implementation Ideas - Vector Rotation]==---------------------------

LSD/Meltdown actually gave me this idea after implementing it in his 3D engine. Whether you're using a 12 or 9 multiplication rotation formula, you can execute those multiplications in parallel - making it a lot faster.

---==[Some Implementation Ideas - ARGB pixels]==--------------------------------

The Alpha-Red-Green-Blue pixel format is perfect for fast manipulation with MMX. Doing a 64bit read you can load 2 of these pixels into each register. From there on you are free to use MMX's parallelism to the max. You can now process 2 32bit pixels (8 colour components) per instruction - adding them, multiplying them, subtracting - and all with or without automatic saturation.

Here is an example of a 320x200x32bpp loop which additively copies a buffer onto another buffer, and saturates the RGB at 255:

    ;ASM 32bpp MMX adding
    mov edi,[dest]
    mov esi,[src]
    mov ecx,32000           ;320*200*4 bytes / 8 bytes per loop
    @MMX_layeraddloop:
        movq MM0,[edi]      ;load 2 destination pixels (64bits)
        movq MM1,[esi]      ;load 2 source pixels (64bits)
        paddusb MM0,MM1     ;saturated add on all 8 bytes
        movq [edi],MM0      ;store the result in the destination buffer
        add esi,8
        add edi,8
        dec ecx
    jnz @MMX_layeraddloop
    EMMS                    ;must always do this after a bout of
                            ;MMX instructions

It's fast.
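For comparison, and for checking the MMX loop's output, here is what the same additive copy looks like as a plain byte-at-a-time loop in portable C. This is a sketch; the name add_saturate_bytes and its parameters are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* dest[i] = min(dest[i] + src[i], 255) for every byte of the buffers. */
static void add_saturate_bytes(uint8_t *dest, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        unsigned sum = (unsigned)dest[i] + src[i];
        dest[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}
```

For the 320x200x32bpp buffers above n would be 256000. The scalar loop needs a compare per byte, where the MMX loop handles 8 bytes per PADDUSB with no compares at all.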

---==[Some Implementation Ideas - Byte Shifting]==------------------------------

One of the weird things with MMX shifting is that it doesn't do shifting for the byte data element. This would be very handy for things like ARGB pixel manipulation. What this forces you to do is load the 2 pixels into 2 registers, unpack them to words, manipulate them, then repack them. A very long process. I've got a method which is a bit of a hack(as usual), but is faster.

Look at this simple example below of shifting:


        source data:   |pppaaa|hhhrrr|  (1 word)

        word shift 2:  |00pppa|aahhhr|  (1 word)

        byte shift 2:  |00pppa|00hhhr|  (1 word)

You can see that the only difference between word and byte shifting is that there are zeros in the byte shift where the overflows occur from the word shift. This can easily be eliminated by masking those bits off. So depending on the amount we shift by, and the direction of the shift, a different mask will have to be used:


    shr1mask = 0111111101111111011111110111111101111111011111110111111101111111b

    shr2mask = 0011111100111111001111110011111100111111001111110011111100111111b

    shr3mask = 0001111100011111000111110001111100011111000111110001111100011111b

    shl1mask = 1111111011111110111111101111111011111110111111101111111011111110b

    shl2mask = 1111110011111100111111001111110011111100111111001111110011111100b

    shl3mask = 1111100011111000111110001111100011111000111110001111100011111000b

In most tight loops the shift amount and direction is fixed, so you can use this method, however where the shift isn't constant it won't be so good. Here is an example of loading 2 32bit pixels and byte shifting it using this method:

    MOVQ MM7,[shr3mask]	;loads 64bit mask
    MOVQ MM0,[edi]		;loads 2 32bit pixels
    PSRLW MM0,3	     	;shifts word elements 3 to the right.
    PAND MM0,MM7		;mask off irrelevant bits
    MOVQ [edi],MM0		;put modified pixels back.
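The shift-then-mask hack can be checked against a plain per-byte shift with a small model in portable C. This is a sketch under one stated difference: it uses a single 64bit shift where the listing above uses PSRLW, which gives the same answer once the mask has been applied. The function names are made up for this illustration.

```c
#include <stdint.h>

/* Shift all 8 bytes right by n (0..7) with one wide shift plus a mask. */
static uint64_t shift_bytes_right(uint64_t v, unsigned n)
{
    /* e.g. n = 3 gives 0x1F in every byte, the same value as shr3mask */
    uint64_t mask = (uint64_t)(0xFFu >> n) * 0x0101010101010101ULL;
    return (v >> n) & mask;
}

/* Reference version: shift each byte on its own. */
static uint64_t shift_bytes_right_slow(uint64_t v, unsigned n)
{
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t b = (uint8_t)(v >> (8 * i));
        r |= (uint64_t)(uint8_t)(b >> n) << (8 * i);
    }
    return r;
}
```

The mask throws away exactly the bits that leaked in from the neighbouring byte, which is why the two functions agree.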

---==[Some Implementation Ideas - Crossfading]==----------------------------

On the sademoscene mailing list Jacques posted an interesting challenge. He wanted to find a fast implementation of the alpha blend function - basically a crossfader. The function's formula is:


	a=ARGB pixel1

	b=ARGB pixel2

	alpha=(0..1 value of the percentage of each image to blend)

	finalpixel=[alpha*(a-b)]+b

You can see that if alpha==0 then 100% of image b will be shown and 0 percent of image a will be shown. Also if alpha==1, 100% of image a will be shown and 0 percent of image b will be shown. Now how to speed up this very useful algorithm using MMX? First of all, lets rewrite the algorithm to remove the floating point alpha value:


    alpha=alpha*256;		//scale it up by 256 and keep it as an integer.

    so now:

    finalpixel=b+([alpha*(a-b)]>>8);

There are 4 main parts to the formula:


    1 : (a-b)

    2 : *alpha

    3 : >>8

    4 : +b

Try to make your own implementation of it using MMX before looking how I did it. Here is how I did it:

    ;assume alpha is a value 0..255
    ;assume edi points to buffer1
    ;assume esi points to buffer2
    ;assume edx points to destination buffer
    pxor MM6,MM6        ;make MM6==0 (it must stay 0 for the unpacking)
    mov eax,[alpha]
    mov ebx,eax
    shl ebx,16
    add eax,ebx         ;eax=alpha in both 16bit halves
    movd MM7,eax
    punpckldq MM7,MM7   ;MM7=alpha in all 4 words

    ;//////////inner LOOP/////////////
    movd MM0,[esi]      ;pixel a
    movd MM1,[edi]      ;pixel b
    punpcklbw MM0,MM6   ;byte-->word(pixel a)
    punpcklbw MM1,MM6   ;byte-->word(pixel b)
    psubw MM0,MM1       ;a-b
    pmullw MM0,MM7      ;(a-b) * alpha, keeping the low 16bits
    psrlw MM0,8         ;shifts word elements 8 to the right.
    paddb MM0,MM1       ;add (b) to result, wrapping within each byte
    packuswb MM0,MM6    ;convert back into byte form
    movd [edx],MM0      ;store the blended pixel
    ;//////////inner LOOP/////////////

It's a bit weird, I know :) If you know a better way - and I've no doubt there is one - please let me know.
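The weird part is that (a-b) can be negative and the multiply only keeps 16 bits, yet the answer still comes out right because the final add wraps within a byte. Here is a sketch in portable C that follows the same steps for one channel, so this can be checked on paper. The names blend_channel and blend_argb are made up for this illustration, and alpha is assumed to be 0..255.

```c
#include <stdint.h>

/* Blend one 8-bit channel: b + ((alpha * (a - b)) >> 8), alpha in 0..255.
   Mirrors the MMX steps: 16-bit wraparound multiply (PMULLW), logical
   shift (PSRLW) and a byte-wide wraparound add (PADDB). */
static uint8_t blend_channel(uint8_t a, uint8_t b, uint8_t alpha)
{
    int diff = (int)a - (int)b;                          /* PSUBW          */
    uint16_t prod = (uint16_t)(unsigned)(diff * alpha);  /* low 16 bits    */
    return (uint8_t)(b + (prod >> 8));                   /* shift, add, wrap */
}

/* Blend all four channels of a 32-bit ARGB pixel. */
static uint32_t blend_argb(uint32_t a, uint32_t b, uint8_t alpha)
{
    uint32_t out = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t ca = (uint8_t)(a >> (8 * i));
        uint8_t cb = (uint8_t)(b >> (8 * i));
        out |= (uint32_t)blend_channel(ca, cb, alpha) << (8 * i);
    }
    return out;
}
```

With a=100, b=200 and alpha=128 the product is -12800, whose low 16 bits are 52736. Shifted right by 8 that is 206, and 200+206 wraps to 150, which is exactly halfway between the two channels.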

---==[Some Implementation Ideas - Blurring]==---------------------------------

Here is my MMX blurring routine:

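_MMX_blur_:
    push edi
    mov edi,[destaddr]
    mov ecx,256000      ;320x200x4 bytes

    sub ecx,2564        ;skip the first and last row, plus one pixel
    add edi,1284        ;start at row 1, pixel 1

    pxor MM7,MM7        ;=0
    movd MM0,[edi-4]    ;left neighbour of the first pixel
    @blur_more:
        movd MM1,[edi+4]        ;right neighbour
        movd MM2,[edi-1280]     ;pixel above
        movd MM3,[edi+1280]     ;pixel below

        punpcklbw MM0,MM7       ;byte-->word for all 4 neighbours
        punpcklbw MM1,MM7
        punpcklbw MM2,MM7
        punpcklbw MM3,MM7

        paddusw MM0,MM1         ;sum the 4 neighbours
        paddusw MM0,MM2
        paddusw MM0,MM3

        psrlw MM0,2             ;/4

        packuswb MM0,MM7        ;word-->byte

        movd eax,MM0
        stosd                   ;write the pixel, edi+=4
                                ;MM0 still holds it and is reused as the
                                ;next pixel's left neighbour
    sub ecx,4
    jnz near @blur_more

    EMMS
    pop edi
    ret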

I think it can be optimised a lot, especially since it doesn't operate on pixels in parallel.
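As a starting point for experiments, here is the same 4-neighbour average as a scalar reference in portable C. It is a sketch and differs from the routine above in two stated ways: it writes to a separate output buffer instead of blurring in place, and it skips the border pixels instead of running straight through the buffer. The name blur4 and its parameters are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Average the left, right, upper and lower neighbours of every interior
   pixel of a width x height ARGB image. Border pixels of dst are untouched. */
static void blur4(const uint32_t *src, uint32_t *dst,
                  size_t width, size_t height)
{
    for (size_t y = 1; y + 1 < height; y++) {
        for (size_t x = 1; x + 1 < width; x++) {
            const uint32_t n[4] = {
                src[y * width + x - 1],   src[y * width + x + 1],
                src[(y - 1) * width + x], src[(y + 1) * width + x]
            };
            uint32_t out = 0;
            for (int c = 0; c < 4; c++) {        /* each 8-bit channel */
                unsigned sum = 0;
                for (int k = 0; k < 4; k++)
                    sum += (n[k] >> (8 * c)) & 0xFF;
                out |= (uint32_t)(sum >> 2) << (8 * c);   /* /4 */
            }
            dst[y * width + x] = out;
        }
    }
}
```

Comparing an MMX version against a reference like this on a small test image is an easy way to catch mistakes while optimising.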

---==[Some Implementation Ideas - Complex Multiplications]==-------------------

MMX can be VERY useful for doing complex multiplications - which is useful in things like fractals. Now I'm no expert on imaginary number planes or anything like that so this is straight out of an Intel document:

Let the input data be Dr and Di where
Dr = real component of the data
Di = imaginary component of the data

Format the constant complex coefficients in memory as four 16-bit values [Cr -Ci Ci Cr]. Remember to load the values into the MMX technology register using a MOVQ instruction.

Input:  MM0 : a complex number Dr, Di 

        MM1 : constant complex coefficient in the form [Cr -Ci Ci Cr]

Output: MM0 : two 32-bit dwords containing [ Pr Pi ]

The real component of the complex product is Pr = Dr*Cr - Di*Ci, and the imaginary component of the complex product is Pi = Dr*Ci + Di*Cr


        PUNPCKLDQ MM0,MM0       ; This makes [Dr Di Dr Di]

        PMADDWD   MM0, MM1      ; and you're done, the result is

                                ; [(Dr*Cr-Di*Ci)(Dr*Ci+Di*Cr)]

Note that the output is a pair of packed dwords. If needed, a pack instruction can be used to convert the result to 16-bit (thereby matching the format of the input).
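To see why that coefficient layout works, here is the same arithmetic written out as a sketch in portable C. The type complex_result and the function complex_mul16 are made up for this illustration. The sums are kept in 64bit integers here so the sketch can never overflow, whereas PMADDWD itself stores 32bit results.

```c
#include <stdint.h>

typedef struct {
    int64_t re;   /* Pr */
    int64_t im;   /* Pi */
} complex_result;

/* Multiply D = dr + di*i by C = cr + ci*i the way the PMADDWD layout does:
   each output is the sum of two word-by-word products. */
static complex_result complex_mul16(int16_t dr, int16_t di,
                                    int16_t cr, int16_t ci)
{
    /* data duplicated as [Dr Di Dr Di], coefficients as [Cr -Ci Ci Cr] */
    const int64_t data[4] = { dr, di, dr, di };
    const int64_t coef[4] = { cr, -(int64_t)ci, ci, cr };
    complex_result p;
    p.re = data[0] * coef[0] + data[1] * coef[1];   /* Dr*Cr - Di*Ci */
    p.im = data[2] * coef[2] + data[3] * coef[3];   /* Dr*Ci + Di*Cr */
    return p;
}
```

For example (1+2i)*(3+4i) gives Pr = 3-8 = -5 and Pi = 4+6 = 10.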

---==[Closing Words]==--------------------------------------------------------

Well I really hope that people start using MMX more - because it REALLY is very cool. If you'd like to comment to me about anything in this document, please don't hesitate! I think next time I'll look into 3DNow! which is AMD's new set of floating point instructions. Just to give you a taste - 3DNow!'s registers are also MM0-MM7, except that the 64bits are divided into 2 single precision floating point numbers. What this means is that you can do floating point functions in parallel - much like MMX. There are also a few new, much needed MMX instructions included in AMD's 3DNow! and in the new PIII range, which are just an extension to the integer MMX instructions mentioned in this doc.

Greets: Demoscene, All the people at Optimise'99, Everyone at #programming, ColdBlood, Cyberphreak, Deadpoet, LSD, Maverick, Neuron, NiMH, Saurax, Viper, and everybody that I know!

-Rawhed/Sensory Overload
-Mailto:sfeist@netactive.co.za
-HTTP://www.surf.to/demos/
-Andrew Griffiths
-South Africa
-01-12-1999