Optimising your assembly code with RDTSC
by maverick

Back in the days when a single misplaced semi-colon or careless ASM instruction could mean wiping out days of work, the Zen timer was a cool thing to have. It was a tool written by Michael Abrash that let you time just how long your ASM loops were taking. If you knew a few optimising rules, you could take the results and figure out what was taking the most time in your loops.

These days, under multi-tasking OSs, the Zen timer is no longer practical. But Intel have come up with something that's even handier than the Zen timer: RDTSC. It's an originally undocumented instruction that works on the Pentium or better (Pentium MMX, Pentium Pro, Pentium II or Pentium III). It reads the chip's internal time stamp counter, which counts clock cycles, and writes the value to a pair of registers. In short, it's a really, really handy way of measuring *exactly* how many clock cycles your routine is taking. It doesn't replace global optimisation and profiling of course - those are things that must happen first - but when you have an ASM routine that has to run faster, RDTSC is handy for doing exact cycle counts.

ReaDTimeStampCounter
RDTSC reads the value of the Pentium's internal cycle counter and writes the 64-bit result into EDX:EAX. Most of the time 2^32 clock cycles is more than enough to measure your code, so we can ignore EDX and just subtract two readings of EAX. If you need to time intervals longer than that, it's easy to modify the code to use EDX as well.
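To make the EDX:EAX arithmetic concrete, here is a minimal C sketch. The function names tsc_combine and tsc_delta32 are made up for illustration and the register values are passed in as plain numbers, so nothing here actually executes RDTSC.

```c
#include <stdint.h>

/* Join the two halves RDTSC leaves in EDX (high) and EAX (low)
   into one 64-bit cycle count. */
static uint64_t tsc_combine(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* Elapsed cycles from two EAX-only readings. Unsigned subtraction
   works modulo 2^32, so the result is still right if EAX wrapped
   around once between the two readings. */
static uint32_t tsc_delta32(uint32_t start_eax, uint32_t end_eax)
{
    return (uint32_t)(end_eax - start_eax);
}
```

For instance, tsc_delta32(0xFFFFFFF0, 0x10) gives 0x20, even though the counter wrapped in between.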

Here's the basic code for timing a loop at the assembly level. It uses NASM (which understands RDTSC as an opcode) and is C-callable from Linux (and probably from DJGPP under DOS as well). Check out rdtsc.zip, which should be provided with this diskmag and should have full working source in both C and ASM for a Linux code timer.
Anyway here it is:

BITS 32
SECTION .data
time dd   0

SECTION .text

proc TestRDTSC

%define   count     [ebp+8]

     xor  eax, eax            ; return 0 if count is zero
     mov  ecx, count          ; count already expands to [ebp+8]
     test ecx, ecx
     jz   .bailout

     rdtsc
     mov  [time], eax

.timerloop:

; insert your test code in here (it must preserve ecx, the loop counter)
     
     dec  ecx
     jnz  .timerloop
     rdtsc
     sub  eax, [time]
.bailout:
endproc

(BTW, proc and endproc are helper macros which do all the grunt work of pushing ebp and mov'ing ebp, esp and so on, but that is all in the source.) When called from C, this function takes one argument - how many times you want to run your code snippet - and returns an unsigned int: the total number of cycles taken by all the runs together, i.e. the cycles for one run multiplied by the number of runs.
So:

unsigned int cycle_count = TestRDTSC(1000);

will run your test routine 1000 times and return the total cycle count in cycle_count.
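If you'd rather experiment from C first, the same idea can be sketched with inline assembly. This is not the code from rdtsc.zip: it assumes GCC or Clang on an x86 target, and read_tsc_low, sample_snippet and time_snippet are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the low 32 bits of the time stamp counter.
   Assumption: GCC/Clang extended asm on x86; RDTSC leaves its
   result in EDX:EAX. */
static uint32_t read_tsc_low(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    (void)hi;               /* high half ignored, as in the ASM version */
    return lo;
}

/* Hypothetical stand-in for the code you want to time. */
static void sample_snippet(void)
{
    static volatile unsigned sink;
    sink++;
}

/* Run snippet count times and return the total elapsed cycles
   (0 if count is 0). The function-pointer call adds its own
   overhead to every iteration. */
static uint32_t time_snippet(void (*snippet)(void), unsigned count)
{
    uint32_t start;

    assert(snippet != NULL);
    if (count == 0)
        return 0;

    start = read_tsc_low();
    while (count--)
        snippet();
    return (uint32_t)(read_tsc_low() - start);
}
```

Calling time_snippet(sample_snippet, 1000) plays the same role as TestRDTSC(1000) above, though the call through a function pointer makes it coarser than pasting the test code straight into the assembly loop.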

I tried this out under Linux to see how good it was. It's very accurate indeed. I had a resampling loop for a mixer which was running in about 12 clock ticks per iteration of the loop - reasonably quick I thought. After revising some of the instruction pairing rules on the Pentium, I re-arranged some instructions, ran the timer and found it was down to 3 clock ticks - a fourfold speed improvement. So what, you might say. But this routine is called fifty times per second and the loop can execute on data sizes up to 8k. Four times the speed should be noticeable - and it was. Combined with a couple of other speedups, the mixer went from 25% CPU usage down to about 8%.
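To see why a few cycles per iteration add up, here is the arithmetic as a small C sketch. It assumes "8k" means 8192 iterations and that every call processes the full size (the worst case); loop_cycles_per_second is a made-up helper name.

```c
#include <stdint.h>

/* Cycles per second spent in a loop:
   calls per second x iterations per call x cycles per iteration. */
static uint64_t loop_cycles_per_second(uint32_t calls_per_second,
                                       uint32_t iterations_per_call,
                                       uint32_t cycles_per_iteration)
{
    return (uint64_t)calls_per_second * iterations_per_call
           * cycles_per_iteration;
}
```

Under those assumptions, loop_cycles_per_second(50, 8192, 12) is 4,915,200 cycles per second, while loop_cycles_per_second(50, 8192, 3) is 1,228,800.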

Another good example was a code snippet from a text on Pentium optimisation. It's a complex floating point calculation which takes twelve clock cycles and illustrates some overlapping properties of the FPU. I never knew about these overlapping properties so I cut and pasted the code into the timer and ran it 1000 times (not manually of course :). Bingo - the result was 13000 and something. It wasn't exactly 12000 because each pass spends an extra cycle counting the loop, and the "and something" is a little overhead (plus the fact that the OS might have interrupted my program occasionally to do other things), but the average was near as dammit 12 clock ticks for the code I was testing.
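Turning the raw total back into a per-iteration figure is simple arithmetic, sketched here in C. The helper name cycles_per_iteration is invented for the example, and the 13042 used below is a made-up stand-in for "13000 and something".

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per iteration of the code under test, after
   subtracting the cycles the loop itself costs per pass.
   Integer division drops the small fixed overhead left over. */
static uint32_t cycles_per_iteration(uint32_t total_cycles,
                                     uint32_t count,
                                     uint32_t loop_overhead)
{
    uint32_t per_pass;

    assert(count != 0);
    per_pass = total_cycles / count;
    return per_pass > loop_overhead ? per_pass - loop_overhead : 0;
}
```

With a total of 13042 cycles over 1000 runs and one cycle of loop overhead, cycles_per_iteration(13042, 1000, 1) comes out at 12.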

Conclusion: if you have ASM code and you think it could be performing better, then try the following steps:

1. look up all the cycle counts for the instructions you're using
2. try and see how they will pair in the U and V pipe of the Pentium
3. add up how long you think the loop should take
4. using RDTSC, time exactly how long it takes
5. if the results aren't what you expected, then try and see what's taking the time.
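Steps 3 to 5 boil down to comparing a prediction with a measurement. Here's a rough C sketch of that bookkeeping; it doesn't model the pairing rules themselves, you still have to work out the cycles for each issue slot (a paired U/V couple or a lone instruction) by hand. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Step 3: add up the cycles you expect each issue slot to take. */
static uint32_t predicted_cycles(const uint32_t *slot_cycles, size_t slots)
{
    uint32_t sum = 0;
    size_t i;

    assert(slot_cycles != NULL || slots == 0);
    for (i = 0; i < slots; i++)
        sum += slot_cycles[i];
    return sum;
}

/* Step 5: cycles per iteration that the prediction doesn't explain
   (0 if the loop runs at least as fast as predicted). */
static uint32_t unexplained_cycles(uint32_t predicted, uint32_t measured)
{
    return measured > predicted ? measured - predicted : 0;
}
```

If you predicted three one-cycle slots but the timer says 12 cycles per iteration, unexplained_cycles(3, 12) reports 9 cycles to go hunting for.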

The cool thing is that an exact timer takes the guesswork out of assembly optimisation. There's no need to be a guru or have a gut feel about your code - see what's slow, find out why it's slow and make it go faster :)

Have fun.