Delete comment from: Ken Shirriff's blog

Snial said...

It's great to see a quote from Steve Furber! I hate to question part of the premise at the beginning of the blog post, but my understanding is that the prefetch queue on the 8086 and 8088 exists because the CPU is often slower than RAM, and in this way it differs from a cache (which exists because RAM is slower than the CPU).

Firstly, when I think about DRAM accesses, as Steve Furber says, even 1977's 250ns RAM would match quite well with the 8086/8088, because it took 4 cycles to fetch a byte or word. I understand that the address is valid at T1.5 and data is read at T(3+n).5, where n is the number of wait states. With a 200ns clock period at 5MHz and no wait states, that leaves two clock periods for the access, which means an original 5MHz 8086 needs 400ns or faster RAM, doesn't it?
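
As a rough check on that arithmetic, here is a minimal Python sketch. It assumes the timing described above (address valid at T1.5, data sampled at T(3+n).5), so the memory gets 2+n clock periods to respond. The function name is my own, and the model ignores setup/hold margins and buffer delays, so treat the results as upper bounds on the allowed access time rather than datasheet figures.

```python
def access_window_ns(clock_mhz, wait_states=0):
    """Time from address valid (T1.5) to data sampled (T(3+n).5), in ns.

    Assumption: the window is exactly (2 + n) clock periods, with no
    allowance for setup/hold times or address/data buffer delays.
    """
    period_ns = 1000.0 / clock_mhz
    return (2 + wait_states) * period_ns


if __name__ == "__main__":
    for mhz, ws in [(5, 0), (5, 1), (8, 0)]:
        print(f"{mhz} MHz, {ws} wait state(s): {access_window_ns(mhz, ws):.0f} ns")
```

On this simple model, 250ns RAM fits comfortably inside the window of a 5MHz part with no wait states.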

Secondly, the execution unit can of course sometimes process instructions faster than the bus can fetch them: at up to 2 or 3 cycles per instruction vs 4 cycles per bus fetch. But when the EU is doing that (for example, in a long sequence of 2 or 3 cycle ALU operations), it becomes limited by bus and memory speed, so the prefetch queue doesn't help. Instead, the execution rate slows down to the bus bandwidth. On an 8088 that's 8 cycles per 2-byte ALU operation (pretty much everything apart from inc/dec).
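
To make the bandwidth limit concrete, here is a small Python sketch. It assumes straight-line code with no data accesses, 4 clocks per bus cycle, and a bus that delivers 2 bytes per cycle on the 8086 and 1 byte on the 8088. The function and constant names are illustrative only.

```python
BUS_CYCLE_CLOCKS = 4  # clocks per bus cycle with no wait states


def steady_state_clocks(instr_bytes, eu_clocks, bus_bytes):
    """Clocks per instruction once the queue has drained.

    Assumption: a long run of identical instructions with no data
    accesses, so the slower of the EU and the instruction fetch sets
    the pace.
    """
    fetch_clocks = BUS_CYCLE_CLOCKS * instr_bytes / bus_bytes
    return max(eu_clocks, fetch_clocks)


if __name__ == "__main__":
    # 2-byte ALU instruction that the EU could finish in 3 clocks
    print("8086:", steady_state_clocks(2, 3, bus_bytes=2))
    print("8088:", steady_state_clocks(2, 3, bus_bytes=1))
```

In this model the queue only changes how long it takes to reach the steady state, not the steady-state rate itself.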

Thirdly, a cache primarily exploits the principle of locality, but the BIU doesn't. For example, a loop small enough to fit in the 8086's queue, e.g.
lp: add al,[si]
    inc si
    loop lp

still forces a prefetch queue flush on every iteration, because the taken jump discards the queued bytes. Even rep movsw seems to force prefetch queue reloads, judging by the 20-odd cycles needed per iteration (and the fact that it can be interrupted). That's different from the 68010, whose similar-length cache of 3 x 16-bit words is specifically designed to speed up move.w (as)+,(ad)+ / dbra dn,lp type loops.

Surely what the prefetch queue helps with is the case where the CPU is slower than the memory and the bus timing, i.e. when it has internal cycles, e.g. MUL/DIV, shifts by CL, or an EA calculation (BX+SI). The bus can then fetch future instructions during the calculation, and even a fairly simple EA is enough to fetch a couple of 16-bit words before the CPU is ready to perform the memory fetch/store.
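
A minimal Python sketch of that idea, assuming the bus is otherwise idle during the internal cycles and each bus cycle takes 4 clocks. The clock counts in the example are illustrative placeholders, not datasheet timings, and the function name is my own.

```python
def prefetched_bytes(internal_clocks, bus_bytes, queue_free_bytes):
    """Bytes the BIU can prefetch while the EU is busy internally.

    Assumptions: 4 clocks per bus cycle, no competing data accesses,
    and prefetching stops once the free queue space is used up.
    """
    bus_cycles = internal_clocks // 4
    return min(bus_cycles * bus_bytes, queue_free_bytes)


if __name__ == "__main__":
    # Hypothetical 8-clock EA calculation, empty queue
    print("8086:", prefetched_bytes(8, bus_bytes=2, queue_free_bytes=6))
    print("8088:", prefetched_bytes(8, bus_bytes=1, queue_free_bytes=4))
    # Hypothetical 100-clock multiply: the queue size becomes the limit
    print("8086, long op:", prefetched_bytes(100, bus_bytes=2, queue_free_bytes=6))
```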

The real question for me is why the 8088 is faster with a 4-byte queue than with a 6-byte queue like the one on the earlier 8086. As you say, it's because the prefetch queue soaks up bandwidth, so when there's a change in the control flow, the BIU could have wasted bus cycles on prefetches and delayed data memory requests that were made after an instruction fetch had started.

But isn't another reason instruction alignment? The 8086 has a 16-bit bus, so sometimes a 6-byte queue is only as good as a 4-byte queue: if you load a pair of misaligned instructions, the first and last bytes get discarded, and most instructions are about 2 bytes (reg ALU, simple load/stores or short jumps). That will be the case 50% of the time. Couldn't it be argued that the 8086 needs a 6-byte queue to achieve what the 8088 can do with 4 bytes, and that those 6 bytes are the limit of the BIU's liability? That could explain why, even though bus cycles are faster, the 80188 queue remains at 4 bytes while the 80186 and 80286 queues remain at 6 bytes, and why the '386 queue is 12 bytes... after the bug fix. This makes sense if we assume the RAM could already keep up, and the prefetch queue exists to soak up internal cycles.
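
Here is a small Python sketch of the alignment argument. It is a deliberately simplified model of my own, not a description of the real BIU: the queue is treated as a window of aligned bus-width fetches, every instruction is 2 bytes long, and only instructions lying wholly inside the window count.

```python
def whole_instructions(queue_bytes, bus_bytes, start_addr, instr_len=2):
    """Complete instructions available from a full queue.

    Simplified model (an assumption, not real BIU behaviour): the queue
    holds queue_bytes of memory starting at start_addr rounded down to
    a bus-width boundary, and instructions run back to back from
    start_addr.
    """
    base = start_addr - (start_addr % bus_bytes)
    end = base + queue_bytes
    return (end - start_addr) // instr_len


if __name__ == "__main__":
    print("8086, even start:", whole_instructions(6, 2, start_addr=0))
    print("8086, odd start: ", whole_instructions(6, 2, start_addr=1))
    print("8088, odd start: ", whole_instructions(4, 1, start_addr=1))
```

Under these assumptions, the misaligned 8086 case holds the same number of whole instructions as the 8088's 4-byte queue.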

Mar 31, 2024, 6:18:05 AM


Posted to The Intel 8088 processor's instruction prefetch circuitry: a look inside