@Markster: Not Sure.
------------------------------------------------------
I've switched to speed optimization/debugging for a bit. The cache effect is helping speed tremendously in DELTA, but there remain some extraordinary speed-hog bugs to obliterate.
Drawing ONLY the BG color for the entire screen takes a measly 2% usage, which, against the 12.5% usage it should take, is 6.25 TIMES faster than usual. But there are several processor-hogging bugs along the way, collectively eating 500%. They have absolutely nothing to do with the graphics blending. In fact, it lags even if that code is still disabled. My thoughts are that it's some kind of bad loop increment somewhere, or something of that nature. Or, if I'm very unlucky, it could be the dynamic jump operation being extraordinarily slow.
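For anyone checking the 6.25 figure, it is just the ratio of the two usage numbers quoted above. A quick sketch (the variable names are made up for illustration, they are not from DELTA):

```python
# Figures quoted in the post; names are hypothetical.
expected_fill_usage = 12.5  # percent CPU a full-screen BG fill "should" take
measured_fill_usage = 2.0   # percent CPU it actually takes with the cache effect

# Speedup is expected cost divided by measured cost.
speedup = expected_fill_usage / measured_fill_usage
print(f"BG fill speedup: {speedup:.2f}x")
```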
------------------------------------------------------
A fill rate of 2% for the whole screen means that, at the rate it's going, it will take about 7% for a full game to be shown in 1280x1024.
Provided your machine is new enough, add another 20% for the data reads plus a bit of extra overhead, plus 25% for video output.
This means roughly 50% usage will cover HD graphics regardless of which inexpensive blenders you use, on a Core 2 Duo machine. Make DELTA use multithreaded mode, and you cut all of this in half, making the final usage about 25%. Absolutely stunningly impressive numbers, I know, but don't quote this as being exact just yet. I'm not to the point of being able to conduct such extreme levels of testing. I have to program and debug the rest of it first.
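The estimate above is simple addition, sketched here so the rounding is visible. The three percentages come from the post. The function name and the assumption that two threads split the work perfectly evenly are mine, and real scaling will likely be a bit worse.

```python
# Rough usage budget from the post, in percent of one core.
fill = 7.0         # full game drawn at 1280x1024
data_reads = 20.0  # data reads plus a bit of extra overhead
video_out = 25.0   # video output

single_threaded = fill + data_reads + video_out  # 52.0, "roughly 50%"

def usage_per_core(total, threads):
    """Hypothetical helper: assumes the work divides evenly across threads."""
    return total / threads

two_threads = usage_per_core(single_threaded, 2)  # 26.0, "about 25%"
print(single_threaded, two_threads)
```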
--------------------------------------
I'm not able to get a really precise reading due to all the lag, but, in general, ....
According to my current WIP statistics, the blender you are using is almost insignificant, for the vast majority of blenders. The reasoning behind this is that the cached pixel data can be read in as fast as CPU registers, meaning that complicated blending operations will still go fast -- the data can arrive in the main area of the CPU, ready to be put in registers, before a pixel even finishes processing.
Divide-based blenders are in general not really that much more expensive than multiply-based ones. The really heavy and slow blenders are the LOG/EXP-based blenders and the POW- and RPW-based blenders; the two YRX/RFL combo variants take twice as long as their slightly slow components. The MGRFL blender takes 4 times as long, and is extremely slow (best not to use that one for too much stuff).
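To show what "multiply" versus "divide" means per pixel, here is a rough sketch using the textbook multiply and divide blend formulas on one 8-bit channel. This is NOT DELTA's code, and the function names are hypothetical; DELTA's actual blenders may be defined differently. The point is only that both are one arithmetic operation per channel, unlike the LOG/EXP or POW style blenders.

```python
def blend_multiply(src, dst):
    """Textbook multiply blend on one 8-bit channel (0..255)."""
    return (src * dst) // 255

def blend_divide(src, dst):
    """Textbook divide blend on one 8-bit channel, clamped to 255."""
    if src == 0:
        return 255  # avoid division by zero; saturate to white
    return min(255, (dst * 255) // src)

print(blend_multiply(255, 128), blend_divide(255, 128))
```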