
Post a Comment On: Ken Shirriff's blog

"Improvements to the Xerox Alto Mandelbrot drop runtime from 1 hour to 9 minutes"

15 Comments

Blogger colinstu said...

Nice work!

Now... someone who's really feeling up to it needs to program this in microcode! Curious to see how fast it would be.

June 26, 2017 at 9:19 AM

Blogger Franc said...

Awesome. Fun fact: the Droste effect is named after the tins of Dutch Droste cacao, featuring a nurse carrying a tray with a tin of cacao featuring a nurse .....
It comes in boxes these days, but it's still good stuff for a cold winter :)

June 26, 2017 at 10:40 AM

Anonymous Anthill Beetle said...

If you're using symmetry, you can just as well use the fact that the set is connected and only trace points along the boundary. It needs some math to be done properly, but it should improve the speed dramatically.

(Although even flood fill was a nontrivial problem on old machines.)
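To make the boundary idea concrete, here is a sketch (mine, not from the thread) of Mariani-Silver style rectangle subdivision: because the Mandelbrot set and its escape-time bands are connected, a rectangle whose entire border shows a single iteration count can, heuristically, be filled with that count without computing any interior point.

```python
def mandel(cx, cy, max_iter=20):
    """Escape-time count for c = cx + cy*i."""
    x = y = 0.0
    for n in range(max_iter):
        if x * x + y * y > 4.0:
            return n
        x, y = x * x - y * y + cx, 2 * x * y + cy
    return max_iter

def render(x0, y0, x1, y1, w, h, max_iter=20):
    img = [[None] * w for _ in range(h)]
    px = lambda i: x0 + (x1 - x0) * i / (w - 1)
    py = lambda j: y0 + (y1 - y0) * j / (h - 1)

    def point(i, j):
        if img[j][i] is None:              # cache: shared edges aren't recomputed
            img[j][i] = mandel(px(i), py(j), max_iter)
        return img[j][i]

    def solve(i0, j0, i1, j1):             # inclusive pixel rectangle
        border = {point(i, j0) for i in range(i0, i1 + 1)}
        border |= {point(i, j1) for i in range(i0, i1 + 1)}
        border |= {point(i0, j) for j in range(j0, j1 + 1)}
        border |= {point(i1, j) for j in range(j0, j1 + 1)}
        if len(border) == 1:               # uniform border: fill the interior
            v = border.pop()
            for j in range(j0 + 1, j1):
                for i in range(i0 + 1, i1):
                    img[j][i] = v
        elif i1 - i0 > 2 and j1 - j0 > 2:  # split into four quadrants
            im, jm = (i0 + i1) // 2, (j0 + j1) // 2
            solve(i0, j0, im, jm)
            solve(im, j0, i1, jm)
            solve(i0, jm, im, j1)
            solve(im, jm, i1, j1)
        else:                              # too small: brute-force the interior
            for j in range(j0 + 1, j1):
                for i in range(i0 + 1, i1):
                    point(i, j)

    solve(0, 0, w - 1, h - 1)
    return img
```

Large uniform regions (deep interior, fast-escaping exterior) collapse into single fills, which is where the dramatic speed-up comes from.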

June 26, 2017 at 12:04 PM

Blogger Ed said...

There's a boundary-tracing Mandelbrot for the 6502-based BBC Micro here.

June 26, 2017 at 12:36 PM

Anonymous Renato Romano said...

Wow! Good job!
Shutting off the display reminds me of the ZX81 days... :-)

June 26, 2017 at 4:42 PM

Blogger Larry Stewart said...

Actually, the Alto did do floating point. I know this because I wrote the microcode for it :)
That was for the Mesa language environment. For BCPL, there was a subroutine library for floating point and possibly the microcode was backported.

The canonical way to do this sort of thing as of about 1981 was to farm out the computation to idle Altos on the Ethernet. See the paper by Jon Hupp and John Shoch. Or, if you were lucky enough to have one, run it on a Dorado...

June 27, 2017 at 1:52 PM

Blogger Ken Shirriff said...

Hi Larry! Thanks for your comments. Did you write the FLOAT library? We don't have Mesa running yet - I heard that it needs two drives. As for the Hupp and Shoch worm, do you know if the source code is available anywhere? It would be cool to get that running on a few Altos.

June 27, 2017 at 3:42 PM

Blogger Larry Stewart said...

I didn't write the BCPL float library. Possibly Ed Fiala? I wrote the IEEE single-precision microcode and the Mesa trap code to handle the hard cases (gradual underflow, etc.). I remember Ed took over the code I wrote and added underflow-to-0 because he thought gradual underflow was too slow. Regarding the worm code: I don't know of any, but Paul McJones' archive at CHM might have some, or John Shoch is at Alloy Ventures, I think.

June 28, 2017 at 6:03 PM

Blogger Snial said...

Hello Ken,

Thanks for publishing my suggested improvements to the code - it's really exciting to think that I actually have some code that's been run on a real Alto!

I wrote a Mac Plus version in Think C and added a poor man's Floyd-Steinberg dithering algorithm. It's equivalent to adding a dither declaration just inside the for h loop:

for h = 0 to 37 do // horizontal word count
  [
  let dither=0

and replace the "if n eq 0 then [" sequence with:

dither=dither+(20-n)
if dither ge 20 then [
  dither=dither-20
  v!0=v!0 + b // set the pixel
  v2!0 = v!0
]
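For readers who don't speak BCPL, the trick is one-dimensional error diffusion: each pixel's shortfall from full black is carried along the row, and a pixel is set whenever the accumulator crosses the iteration limit. A rough Python equivalent (function and variable names are my own):

```python
def diffuse_row(counts, max_iter=20):
    """Poor man's 1-D error diffusion over one row of iteration counts:
    accumulate (max_iter - n) and set a pixel each time the running
    error crosses max_iter, mirroring the BCPL dither variable above."""
    out = []
    dither = 0
    for n in counts:
        dither += max_iter - n   # darker points accumulate error faster
        if dither >= max_iter:
            dither -= max_iter
            out.append(1)        # set the pixel
        else:
            out.append(0)
    return out
```

A row of mid-grey counts (n = 10 with max_iter = 20) then comes out as alternating set and clear pixels.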

-cheers from Julz

June 29, 2017 at 10:44 PM

Blogger Snial said...

Hi colinstu,

I'd be amazed if someone actually tried to convert the algorithm into microcode, but it's too tempting not to think about it. I don't know how much you know about the Xerox Alto microarchitecture; Ken's posts are really helpful, and the Alto Hardware Manual contains the complete details.

It's important to remember, of course, that the BCPL compiler has its own overhead, and typically an assembly programmer would be able to improve the performance by a factor of 3 (if we trade size for speed). So this would get us down from 48 minutes for a conventional render to 16 minutes.

The microcode version could be much faster if we assume that all the key variables can be held in fast microcode RAM. We can't avoid the cost of multiplication, which is going to be about 10 to 14µs (= 64c to 84c based on my reading of the existing microcode), but in microcode we can go back to using shifts for the fixed-point math correction, and 16 shifts would take at least 16c × 0.17µs = 2.72µs (though I think 32c is more likely).

The microcode architecture can't perform register to register calculations in one cycle, because it can only really transfer registers to one ALU input per cycle and it can't read and write to a RAM register in the same cycle. So, a calculation like:

C=A+B

would take 3 cycles: load A into T_, then B+T into L_, then C_ = L. However, a more complex expression such as C=A+B+D could probably be done in just 5 cycles: A->T, B+T->L, L->T, D+T->L, L->C.

With this and the knowledge that branches are delayed, we can estimate the basic loop time:

A line like this: let x2 = (x ls 0) ? -x, x would be: X->L; L<<1, ALUCY; -X->L, :NONEGX (NEGX would be the true branch, which would then jump to NONEGX anyway). That's 3 or 4 cycles.

Then a multiplication and shift would be about 84+32. So that's 116c in total.

The y2 multiplication is then another 116c.

"if x2+ y2 ge 16384" is pretty tricky. I would do x2->T; y2+T->L; then shift and test for carry twice. That's about 5 cycles including the delayed branching.

Computing the xor for the sign test is 3 cycles (the ALU supports xor directly).
The abs functions will take another 4 cycles each => 8c.
The multiplication and shift is another 112c.
The sign correction is probably another 4 cycles.
The final calculations for x and y are 5 cycles each = 10.
n=n-1 is 2 cycles, because I think decrement is possible directly via the ALU; the microcode can test the result at the same time and perform the jump so that completes it.

This makes the total = 376c per iteration (about 64µs).

My Mac Plus version requires 420259 iterations; scaled up to the Alto screen size that's 760469 iterations, and at 64µs each that gives about 48.6s.

The display takes 66% of the Alto's CPU, so that's really 146s, or nearly 2.5 minutes. But I've only optimised the inner loop, because that's what makes most of the difference. The screen update in assembler or BCPL would still involve maybe 10 Nova instructions per pixel, plus maybe another 10 per black pixel, and at 7.5µs per instruction that's another 27.36s. So the total is 173s, or nearly 3 minutes (without tricks like mirroring).

So, the upshot is that microcode makes a difference, but it's not magic; the program would still be measured in minutes!
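Snial's cycle arithmetic checks out; here is a quick tally script (mine, not part of the comment) that re-adds the per-step estimates:

```python
# Per-iteration cycle estimates from the comment above, at ~0.17µs per
# microinstruction. The labels are my own shorthand.
cycles = {
    "x2 multiply and shift": 116,
    "y2 multiply and shift": 116,
    "x2 + y2 >= 16384 test": 5,
    "sign xor": 3,
    "two abs calls": 8,
    "xy multiply and shift": 112,
    "sign correction": 4,
    "new x and new y": 10,
    "n decrement and branch": 2,
}
total = sum(cycles.values())              # 376 cycles per iteration
per_iter_us = total * 0.17                # ~63.9µs, i.e. "about 64µs"
compute_s = 760469 * per_iter_us / 1e6    # ~48.6s of pure compute
wall_s = compute_s / (1 - 0.66)           # display steals ~66% of cycles
print(total, per_iter_us, compute_s, wall_s)
```

That gives roughly 143s of wall time; the comment's 146s comes from treating the remaining CPU share as an even 1/3.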

-cheers from Julz

June 29, 2017 at 10:53 PM

Anonymous Anonymous said...

I am sure you know that x*x - y*y can be computed as (x-y)*(x+y)?
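The identity is easy to verify, and on the Alto it would save one of the multiplies in z² at the cost of an add and a subtract; note that the escape test above still uses x² + y² themselves, so the saving applies to the real part only. A tiny check (mine):

```python
def real_part(x, y):
    # New real part of z*z via the difference-of-squares identity:
    # one multiply instead of two.
    return (x - y) * (x + y)

# spot-check against the two-multiply form
for x in range(-5, 6):
    for y in range(-5, 6):
        assert real_part(x, y) == x * x - y * y
```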

July 10, 2017 at 8:17 PM

Anonymous Mark Jeronimus said...

For the large Mandelbrot image you can precompute the cardioid and first bulb for another significant speed-up. Here's some BASIC code:

A =
B =
ITER = 0
U = 4 * (A * A + B * B): V = U - 2 * A + 1 / 4
IF U + 8 * A + 15 / 4 < 0 THEN GOTO REPEAT
IF V - SQR(V) + 2 * A - 1 / 2 < 0 THEN GOTO REPEAT



REPEAT:

August 25, 2017 at 1:02 PM

Anonymous Mark Jeronimus said...

The site ate the meta-code inside angle brackets :( (A, B) is the complex number of the pixel. REPEAT is where you draw the pixel; before REPEAT is where you calculate the orbit.
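For reference, here is the widely used form of the same interior test, with (A, B) the pixel's complex number as Mark describes; this is my Python transcription of the standard check, not his exact BASIC:

```python
def skip_orbit(a, b):
    """True if c = a + b*i lies inside the main cardioid or the
    period-2 bulb, so the escape-time loop can be skipped and the
    pixel drawn as interior immediately."""
    if (a + 1.0) ** 2 + b * b < 1.0 / 16.0:  # period-2 bulb
        return True
    q = (a - 0.25) ** 2 + b * b              # main cardioid
    return q * (q + (a - 0.25)) < 0.25 * b * b
```

These two regions cover most of the set's interior area, which is why the precomputation pays off so well on a full-set render.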

August 25, 2017 at 1:05 PM

Anonymous Anonymous said...

I'm sort of surprised that they didn't look at the Nova instruction set and try to improve on it a bit, as it seems really poorly suited to a machine that has pushing pixels as one of its major responsibilities.

April 10, 2020 at 7:06 AM

Blogger Alan Kay said...

@anonymous Basically, no one at Parc wanted to use either the Nova emulator or BCPL -- the whole idea and plan was to bootstrap very high level languages "of the future" in the present with the help of the Alto microcode to emulate future HW. Several strategies/paths were adopted.

For example, the Learning Research Group opted to gut it out writing in machine (Nova) code and then did a Smalltalk VM in microcode.

CSL had a similar route for a variety of planned languages, but also opted to do considerable utility work in BCPL, and this included some practical apps such as the BRAVO word processor.

We expected a new *personal supercomputer* to appear every few years, but this got delayed, which led to optimizations and some kludgery instead of cleaner speed. For example, Smalltalk-76 is slow only because it swaps objects from the slow disk instead of having them in RAM.

September 25, 2022 at 12:43 PM
