The lesser known CPUs: Lattice Mico8

Well, aren't you a funny little one...

I like strange machines, especially the ones that are quirky. The CoolRISC 816 was kinda cute. The dsPIC33F was amazing. This one... Well, it isn't that powerful of a thing, but I like what it stands for.

The Lattice Mico8 (officially written "LatticeMico8", as one word) is a curious thing, mainly due to what kind of circumstances you may encounter it in. Even more curiouser is the fact that the Mico8 is open source, under a decent license. Unlike many other 8-bit soft processors, this one is yours to mangle at will. You can obtain and operate it for free, if you have a gate array handy.

As usual, you can read along, though I don't have a convenient link for you. The main document that I'm using is the "LatticeMico8 Processor Reference Manual 3.8", labelled inside as being from October 2014 and as being from 9/7/2016 on the download page at Lattice Semiconductor. I will have more to say about the quality of the documentation later, the date mismatch being just the start of documentation issues. Oh boy, do I have words to say...

Before we truly begin, however, I need to tell you a little bit about the places in which the Mico8 operates. If you are familiar with FPGAs/gate arrays and soft processors, you should skip the next section or you risk death by boredom.

Of gate arrays and soft processors

A gate array (or an FPGA) is an absolutely beautiful piece of machinery. Imagine what you would do if you could fab chips in your own bedroom. Well, gate arrays are not quite that powerful, but get you pretty close to that dream. You can create any digital circuit you want, provided it fits inside the gate array chip.

Since you are building everything from raw logic, you can accomplish some very interesting things that are out of reach if you are using off-the-shelf hardware. You can run data in parallel to squeeze more performance out of the system. Through pipelining you get massive performance increases. Really low latencies are possible too, since you have no overheads caused by software. No matter how exotic your needs, if your data processing is not a good match for a processor or a GPU, then you can always conjure up a custom circuit that can do the work for you.

Thing being, you only have so much space in the gate array fabric. (The fabric is the part that you can use to create your own circuits.) All this pipelining and parallelism don't come free. It may be worth it to process some data in a more serial, slower, fashion to save the space, spending it where it is really needed. (Here is a really neat white paper on this idea) This is where you might wish you had a processor. Many gate arrays provide built-in hardware processors that take on those kinds of tasks. This is nice, but what if you don't have one built-in? Gee, too bad you don't have a device which can implement any circuit inside itself. Oh, wait...

While the exact vocabulary seems a bit fuzzy, I'm going to stick with the definition of a soft processor being a processor that is going to be put inside the gate array fabric. This is opposite from a hard processor, which will be standalone, put into an ASIC or crammed into the gate array alongside the fabric.

The Lattice Mico8 is a decent representative of soft processors, but there are many others, such as the Zet. You wouldn't want to use Zet as a soft processor, however. You see, it wouldn't be very good at coping with this environment. Processor architectures that we would use for everyday processing just don't yield great performance on a gate array, while taking up too much room. Zet implements the x86, not an easy architecture to deal with in this setting. The Mico8 is built for gate arrays and should behave better at a smaller size.

Enter the Mico

The Mico8 is an 8-bit machine with either 16 or 32 registers, depending on how you choose to compile it. Yup, you get to decide. I wasn't kidding about things being insanely customizable. The registers are yours to use, with up to three being special-cased to provide assistance to memory access instructions. The AVR has this many registers, but usually 8-bit machines don't have that many.

Once again like the AVR, this is a true RISC machine, being created around a load-store approach for memory access. You need to use a dedicated load instruction to bring data from memory before you can operate on it. This simplifies things a lot, allowing us to save space.

The instructions are 18 bits in width. This is actually standard, believe it or not. Gate arrays often provide 18-bit memories. The first 16 bits are the data proper and the other two bits are usually reserved for the use of any error-correcting circuits/codes. There is nothing inherently "ECC" about those two bits, however, and they can be used for regular storage. This is what happens here. The two extra memory bits are already there, might as well use them for something!

The Mico8 doesn't have an interesting pipeline. The design is meant to be small, not fast. We, after all, are spending space to save space. Accordingly, you eat two cycles for just about any operation, including branches. Memory loads take... three? two? one? cycles. Yeah, the documentation isn't clear and I have seen all three given as an answer at one point or another. You at least don't have to worry about pipeline stalls or delay slots.

The instruction encoding is the worst of both worlds between a RISC and a CISC. You have at most two operands, so you don't have the three-address code that can sometimes come in handy when working with multiple registers. There are only three instruction formats: Register-register operations, register-immediate operations and just immediate instructions, usually branches.

Another limitation of Mico8 is the instruction space. The previous two machines I talked about in this series of posts had fairly generous theoretical limits for the count of instructions that you could store in the instruction memory. The Mico8 maxes out at 4 thousand instructions. This is a limit dictated by the gate array architecture. The FPGA memory blocks have 1024 spots for 18-bit data. You can combine up to four to work as a larger memory.

Now, you might wonder why not use eight memory blocks and buy us even more instruction space? Combining the memory blocks isn't that easy. Outside of a gate array you could hook up your memories to a common bus and float out all of them except the one that has the instructions that you currently need. However, you cannot just tri-state anything inside of the fabric, because all memory outputs are always driven. Tri-stating exists, but it is actually simulated by using up some of the fabric, and that is counter-productive. You'd not only pay for the memory blocks themselves, you'd also have to pay for the circuits that decide which of the memories is currently feeding the processor. The more blocks you want, the larger the circuit that does this deciding. Above four memory blocks it all starts to be big and run slow.
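To make the cost of that deciding circuit a bit more concrete, here is a small Python model of four 1024-word blocks glued into one 4096-word instruction memory. This is only my sketch of the idea, not how the Mico8 sources do it: the names (fetch, BLOCK_WORDS and friends) are made up, and I am assuming that the top two address bits pick the block while the low ten bits go to every block at once.

```python
BLOCK_WORDS = 1024          # 18-bit words per memory block (from the text)
BLOCK_COUNT = 4             # blocks ganged together (from the text)
WORD_MASK = (1 << 18) - 1   # instructions are 18 bits wide


def make_instruction_memory():
    """Four independent blocks; in the fabric each one always drives its output."""
    return [[0] * BLOCK_WORDS for _ in range(BLOCK_COUNT)]


def fetch(blocks, pc):
    """Read every block at the same offset, then select one output.

    Assumption: the low 10 bits of pc address inside a block and the
    top 2 bits steer the output multiplexer.
    """
    offset = pc % BLOCK_WORDS
    select = (pc // BLOCK_WORDS) % BLOCK_COUNT
    outputs = [block[offset] for block in blocks]  # nothing can be floated
    return outputs[select] & WORD_MASK             # the mux costs fabric


blocks = make_instruction_memory()
blocks[2][5] = 0x2ABCD
print(hex(fetch(blocks, 2 * BLOCK_WORDS + 5)))  # prints 0x2abcd
```

Every extra block adds one more input to that final selection for each of the 18 bits, which is the part that eats fabric and slows things down.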

The instructions themselves!

The instructions are pretty basic, although this is pretty common to all small processors designed to run within gate arrays. The count seems impressive at first glance: 51. Sadly, a whole bunch of the count comes from register-immediate and register-register instructions being treated separately. A whole dozen instructions are just the register-immediate versions of register-register instructions. There are even more instructions that could be combined into one. Maybe someone wanted to have an easier job writing the assembler or something? Not that this is that unusual, the AVR does it too.

As far as what the instructions do, pretty standard fare with a couple of holes.

You have adds with and without carry. You have regular subtraction, but not reverse subtraction. CMP provides you with non-destructive subtracting. You have the option of subtracting with or without borrow. There is no multiplier, no divider, no decimal adjusts. Bare-bones, but functional.

You have rotates by one, but no native shifts. You can still get shifts by combining rotates through the carry flag and clearing/setting of the said flag. A hassle, but at least it can be done.
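If the rotate-plus-carry trick sounds abstract, here is how I picture it, modelled in Python rather than in Mico8 assembly. The function names are mine and purely illustrative, and I am assuming the usual 9-bit rotate where the carry flag acts as the ninth bit.

```python
def rotate_left_through_carry(value, carry):
    """The old carry enters bit 0, the old bit 7 becomes the new carry."""
    new_carry = (value >> 7) & 1
    return ((value << 1) & 0xFF) | (carry & 1), new_carry


def rotate_right_through_carry(value, carry):
    """The old carry enters bit 7, the old bit 0 becomes the new carry."""
    new_carry = value & 1
    return ((value & 0xFF) >> 1) | ((carry & 1) << 7), new_carry


def shift_left(value):
    """Logical shift left: clear the carry first, then rotate."""
    return rotate_left_through_carry(value, 0)


def shift_right(value):
    """Logical shift right: clear the carry first, then rotate."""
    return rotate_right_through_carry(value, 0)


print(shift_left(0x81))   # prints (2, 1)
print(shift_right(0x81))  # prints (64, 1)
```

Setting the carry instead of clearing it shifts in a one, which is handy every now and then.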

Standard logic operations are available too: AND OR XOR. The common TEST instruction is provided for non-destructive AND. You don't have a NOT instruction, but you can always XOR with 255 to get the same effect. You can also use XOR with constants to flip bits, AND with constants to clear bits and OR with constants to set bits.

You have two status flags: Zero and Carry. This is not great, but not that unusual for a gate array CPU. Adds and subtractions manipulate both flags. Logic instructions manipulate only the zero flag. As mentioned before, you can also clear and set the flags with dedicated flag setting/clearing instructions.
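The carry flag is what lets an 8-bit machine pretend to be wider. Below is a rough Python model of a 16-bit addition built from a plain add followed by an add with carry. The add8 and add16 helpers are hypothetical names of mine, and the flag behaviour is modelled only as far as the description above goes.

```python
def add8(a, b, carry_in=0):
    """8-bit add returning (result, carry flag, zero flag)."""
    total = (a & 0xFF) + (b & 0xFF) + (carry_in & 1)
    result = total & 0xFF
    return result, total >> 8, int(result == 0)


def add16(a, b):
    """16-bit add done one byte at a time, low byte first."""
    low, carry, _ = add8(a & 0xFF, b & 0xFF)                       # add
    high, carry, _ = add8((a >> 8) & 0xFF, (b >> 8) & 0xFF, carry)  # add with carry
    return (high << 8) | low, carry


print(add16(0x12FF, 0x0001))  # prints (4864, 0), and 4864 is 0x1300
print(add16(0xFFFF, 0x0001))  # prints (0, 1)
```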

The branches cover jumping on either flag, with the option to jump when the flag is set as well as when it is clear. Calls can be conditional upon the flags as well. Unconditional branching is there too. There is a limit on how far you can jump, about 2 thousand instructions either forward or backward from the current position in the instruction stream. Not that big of a deal, but just something to pay attention to, if your code gets big.

To handle calls, interrupts and returns you have a hardware stack that is between 8 and 32 entries deep, another fairly natural fit for the resources in the FPGA fabric. You cannot get any less than a few levels of stack, so might as well use everything that you are given.

And then there are the scratchpad access instructions.

Remember me?

The memory on the Mico8, even if external to the CPU, is called the "scratchpad" for some reason. Calling memory a scratchpad isn't that uncommon, many systems which don't have a cache call their local memory a scratchpad. Calling external memory a scratchpad seems an odd decision, when you could just call it memory. I suppose they wanted to be consistent, considering you only use one set of instructions for both internal and external memories.

And memories you can access. This little thing can, when compiled for the "large" memory model, address up to four gigabytes of memory. Nice, no? The ability to do so does not mean that the process is pretty. The word "painful" comes to mind. Still, not like you can have a nice solution to this problem with an 8-bit processor.

To reach four gigabytes, you'd need 32 bits worth of address information. The lower eight bits of this information come from a CPU register used as a "page index" or directly from the instruction itself, leaving us with a shortfall of 24 bits. The missing bits come from the grouping of registers R13, R14 and R15, which are called the "page pointer" registers.

And if you decide that you don't need this much memory, then you can either settle for 64 kilobyte reach with one page pointer register or 256 bytes without the use of any page pointer registers. Those are the "medium" and "small" memory models, respectively.
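Here is how I would sketch the address assembly for the three memory models in Python. Take it with a grain of salt: the function names are mine, and the order in which the page pointer bytes are stacked (most significant first) is an assumption, since nothing above says which of R13, R14 and R15 holds the top bits.

```python
def effective_address(page_pointer_bytes, low_byte):
    """Glue the page pointer bytes on top of the low 8 address bits.

    page_pointer_bytes holds 0 values (small model), 1 value (medium)
    or 3 values (large), most significant first (an assumption).
    """
    address = 0
    for byte in page_pointer_bytes:
        address = (address << 8) | (byte & 0xFF)
    return (address << 8) | (low_byte & 0xFF)


def reach(page_pointer_count):
    """Number of addressable bytes for a given count of page pointer registers."""
    return 1 << (8 * (page_pointer_count + 1))


print(hex(effective_address((), 0x12)))                  # prints 0x12
print(hex(effective_address((0xAB,), 0xCD)))             # prints 0xabcd
print(hex(effective_address((0x01, 0x02, 0x03), 0x04)))  # prints 0x1020304
print(reach(0), reach(1), reach(3))                      # prints 256 65536 4294967296
```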

The configurability does not end there. If you manage to do with little enough memory, you can configure the scratchpad as an internal memory, using some dedicated memory components of the fabric and allowing you to quickly perform the accesses. If you cannot do with less, then you need to configure the processor for work with external memories, which will eat at least an extra clock cycle per access.

In addition to memory locations that hold data, the Mico8 can also work with memory-mapped peripherals. This uses a different instruction than the scratchpad load/store, but otherwise appears to be pretty much the same thing.

As I briefly mentioned, there are two main memory addressing modes. Both use the page pointer registers to decide the top bits of an address, which select a single page of 256 bytes. The direct mode provides an address in the first 32 bytes of the selected page, the five address bits needed for this being encoded in the instruction itself and extended to 8 bits for use. Indirect access allows you access to the whole page by using a regular register to provide the remaining 8 bits of the address into the page.
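The difference between the two modes boils down to where the low eight bits come from. A small Python sketch, assuming the five-bit field is zero-extended (that is the only way it can land in the first 32 bytes of the page); the helper names are invented for this example.

```python
def direct_offset(field):
    """Five bits from the instruction, zero-extended to 8 bits: 0 to 31."""
    return field & 0x1F


def indirect_offset(register_value):
    """A full register supplies all 8 bits: 0 to 255."""
    return register_value & 0xFF


def address_in_page(page, offset):
    """The page number (from the page pointers) selects a 256-byte page."""
    return (page << 8) | (offset & 0xFF)


page = 0x12
print(hex(address_in_page(page, direct_offset(0x1F))))    # prints 0x121f
print(hex(address_in_page(page, indirect_offset(0xFE))))  # prints 0x12fe
```

Anything past byte 31 of a page needs a spare register to hold the offset, so frequently used variables are best parked at the start of the page.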

I'm open! ...for interpretation?

As I mentioned, the whole thing is open source. Think you'd need 256 registers? You can do that. Sure, you would need to change the instruction encoding to use a few extra bits. And yes, you'd have to resize the register files in the code. And yes, you'd get some really wide multiplexers that would not go very fast. And why do you even need that many registers? But! If having that many pleases you, you have the option. No shame. Mico8 caters to all tastes given some convincing.

A final nice thing is that you are not stuck with just one brand of gate arrays here. If you'd rather take the design and move it over to Xilinx or Altera gate arrays, no problem. There are no technical or legal barriers. You could bake this into an ASIC, if you wanted to.

Hmm... Actually, hold that thought. There is a funny thing going on with the license, you see. When you go to download the thing, you get to check off a permissive license.

3. The Provider grants to You a personal, non-exclusive right to use object code created from the Software or a Derivative Work to physically implement the design in devices such as a [sic] programmable logic devices or application specific integrated circuits. You may distribute these devices without accompanying them with a copy of this license or source code.

Not bad, right? They even mention ASICs. Typo aside, looking good.

This is not the story that the license inside the source code tells you.

Permission: Lattice Semiconductor grants permission to use this code for use in synthesis for any Lattice programmable logic product. Other use of this code, including the selling or duplication of any portion is strictly prohibited.

Umm... Lattice, darling, could you make up your mind? You were doing so well and now it looks like the toppings are sliding off your pizza. What's next? Ketchup and tuna on your chocolate pudding?

Oh, here you go. Another item from the "download" license.

10. Any conflict between the terms of this Agreement and the licensing terms included in the header files provided with the Software will be resolved in favour of this Agreement.

Well, good to know. Still, how hard could it be to change the headers of a few files? This inconsistency brings me to...

The Devil's Doomcumentation

Here's a lesson for you: Blindly trusting beautiful documents is a bad idea. You see, Lattice is a bit absent-minded. Sure, the typography is pretty good, but the content lacks polish.

It starts with the download page for the documentation. Lattice kinda gives you a whole lot of documents, with some really outdated versions happily sitting next to the current versions. My copy of the "LatticeMico8 Architecture Manual" is... amazing.

First of all, there is no such thing as a "LatticeMico8 Architecture Manual". If you look inside, you will see that this is just an outdated "LatticeMico8 Processor Reference Manual". The Architecture Manual exists in filename alone. You want the Processor Reference Manual.

Right off the bat, in the list of features, there is "Minimum Two Cycles per Instruction". Guys, I hate to break it to you, but "We do at least this badly, or even worse!" is not good marketing. I wonder about the capitalization too. I mean, I often use Capitalization for a Funny Effect, but I'm not a Company and I'm not Trying to be Serious.

The instruction descriptions give information about "CY Flag Updated" and "Zero Flat Updated". Was spelling out "Carry" so hard, if they spelled out Zero in full? And then there is the Zero Flat. This typo was fixed in a later version of the Processor Reference Manual.

I'm still looking for information on how to encode the rcsr and wcsr instructions. Apparently we don't need to know. Throughout all the revisions someone updated the manual to include hyperlinks within the file. Thanks to this, you should be able to quickly get from the table of instructions to the details of the instruction you want to look at. wcsr and rcsr were left without such links. I'm guessing they were noticed as missing in the detailed descriptions and... nothing was done? Hmm.

The manual at one point talks of how one instruction can be fetched per clock cycle. Well, not quite. Instruction fetch does take one cycle, but it occurs at most every other cycle. A bit confusing, at least. Reads and writes to the internal scratchpad being described as one cycle also feel a bit strange, since this machine just does not go faster than 2 cycles to an instruction, if the comment in the source code is to be trusted. Sure, technically this is not wrong, but usually when I hear that a memory load completes in a single cycle I don't think that they are talking about only the "memory access" part of a multi-cycle load instruction.

Then there is this oddity. I have no experience of it myself, so I'll just leave it up to you to investigate. How much memory is addressable in the "medium" memory model? We have 16 bits of addressing power. Common sense dictates 2^16 locations, right? The manual claims 16 kilobytes on page 9, 64 kilobytes on pages 31 and 32. Which is it?
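For the record, the arithmetic behind my suspicion is short enough to check in Python. This is plain counting of address bits, nothing taken from the Lattice sources, and address_bits_needed is just a name I picked.

```python
def address_bits_needed(size_in_bytes):
    """Smallest number of address bits that can reach every byte."""
    return (size_in_bytes - 1).bit_length()


print(2 ** 16)                          # prints 65536
print(2 ** 16 // 1024)                  # prints 64
print(address_bits_needed(64 * 1024))   # prints 16
print(address_bits_needed(16 * 1024))   # prints 14
```

So 16 bits of address match 64 kilobytes exactly, while 16 kilobytes would leave two address bits doing nothing.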

Speaking of those two pages, the scratchpad load and store instructions talk about accessing peripherals, something that is the domain of other instructions. I'm guessing a copy-paste mistake that never got noticed.

Sigh. I could go on... Makes me wonder if Lattice does drug testing of their employees. At the very least, they don't do proofing of their texts.

Final thoughts

I might be wrong somewhere! If so, corrections are most welcome, my email's in the footer.

For all my complaining, you must remember that the alternative is a large mess of a CPU that runs glacially. Mico8 isn't much, but it is the best we have for the task.

Proofread your documentation, people! Argh! I wish Lattice would hire me as a proofreader or something.

There is a GCC port, in case assembly isn't your cup of tea. Very nice and helpful.

Did someone forget the R in micRo and the error was not caught until after press time? After that everyone maybe just decided to pretend they meant to do that all along?

Four gigs? Really? I'm sure you could rig up any 8-bit processor to deal with that much memory, but I cannot help but wonder why would you want to.

I wonder if Lattice will keep on going. The open source processor thing was neat. Their Versa was a really well-priced evaluation board when it went on sale several years ago. Their 32-bit machine was also undergoing some kind of open-sourcing. They called it Lattice Mico32. Friendly chaps, but not very creative with names.

I wish I could get a hardware implementation of the Mill to play around with.

And this donation button below is exactly the thing that you have been waiting your whole life for. It is your destiny. Fate has brought you here for this very reason: For the donation button on my website. Rejoice, for your most secret wishes have been answered.

