The RISC Deprogrammer

84 points by g0xA52A2A 2 years ago | 106 comments
  • ithkuil 2 years ago
    > Back in the 1980s, most of the major CPUs in the world were big-endian, while Intel bucked the trend being little-endian. The reason is that some engineer made a simple optimization back when the 8008 processor...

    The article started talking about the VAX and how it was the gold standard everybody competed against.

    The VAX is little endian.

    Little endian is not a hack. It's a natural way to represent numbers. It's just that most languages on earth write words left to right while writing numbers right to left.

    • kens 2 years ago
      The history of why the 8008 was little-endian is interesting and predates the 8008. In 1970, the mostly forgotten company CTC released the Datapoint 2200, a desktop computer built from TTL chips (not a microprocessor) and sold as a programmable terminal. It had a serial processor using shift-register memory chips. It was an 8-bit processor but since it operated on one bit at a time, it had to start with the lowest bit to make addition work. As a result, it was little-endian.
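
      To make the ordering concrete: in a bit-serial adder the carry ripples from the least significant bit upward, so streaming the low-order bits first lets each result bit be produced as soon as its operand bits arrive. A minimal C sketch of the principle (just an illustration, not Datapoint hardware):

          #include <stdint.h>
          #include <stdio.h>

          /* Add two bytes one bit at a time, LSB first, the way a bit-serial
           * ALU does: each result bit depends only on bits already seen plus
           * the running carry, so low-order-first (little-endian) storage lets
           * the machine start adding before the whole word has streamed past. */
          static uint8_t serial_add(uint8_t a, uint8_t b)
          {
              uint8_t sum = 0, carry = 0;
              for (int i = 0; i < 8; i++) {
                  uint8_t abit = (a >> i) & 1, bbit = (b >> i) & 1;
                  sum |= (uint8_t)((abit ^ bbit ^ carry) << i);
                  carry = (abit & bbit) | (carry & (abit | bbit));
              }
              return sum;
          }

          int main(void)
          {
              printf("%u\n", serial_add(200, 55)); /* prints 255 */
              return 0;
          }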

      CTC talked to Intel and Texas Instruments to see if the processor could be put onto VLSI chips to replace the board of TTL chips. Texas Instruments produced the TMX 1795 processor, shortly followed by the Intel 8008, both processors cloning the Datapoint 2200's instruction set and architecture including little-endian. CTC rejected both processors and stuck with TTL. TI couldn't find another customer for the TMX 1795 and it vanished from history. Intel successfully marketed the 8008 as a general-purpose microprocessor. Its architecture was copied for the 8080 and then modified for the 16-bit 8086, leading to the x86 architecture that rules the desktop and server market. As a result, x86 has the little-endian architecture and other features of the Datapoint 2200. I consider the Datapoint 2200 to be one of the most influential processors ever, even though it's almost completely forgotten.

      • kragen 2 years ago
        Hmm, a minor quibble: "VLSI" would be normally more than 10k gates or 100k transistors on a single chip, wouldn't it? But the 8008 had only about 3500 transistors (I don't see an exact count in http://www.righto.com/2016/12/die-photos-and-analysis-of_24....) and so probably about 1000–1500 gates, so I think it should be called "LSI" rather than "VLSI".

        A funny thing about the 8008 is that Intel's manual for its instruction set is unnecessarily shitty — even if you didn't know the history with Datapoint, the Intel manual is obviously not by the people who designed the instruction set because it's in hexadecimal, a tradition sadly followed by the 8080 and 8086 manuals. The Datapoint manuals, by contrast, are all in octal, making the machine code enormously easier to understand. (The H8 I grew up with used an Intel chip, but the front panel monitor program used octal.)

        • kens 2 years ago
          Yes, it's kind of amazing how the 8080/Z80/8086 instruction sets make much more sense in octal, but are always displayed in hexadecimal. In hex, you can kind of see some patterns, but everything is obvious in octal. The 6502 is also based on bit triples, but they grouped the bits from the top, so octal doesn't make things any better.

          The Datapoint 2200, by the way, used decimal decoder chips to decode the octal parts of the instruction set and simply ignored the 8 and 9 outputs.
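
          For anyone who hasn't seen it, the octal structure is easy to show in code: 8080 MOV is 01 DDD SSS and the ALU ops are 10 OOO SSS, so splitting an opcode into three octal digits exposes the fields directly, whereas a hex digit straddles the field boundaries. A small sketch of my own, not something from the Intel or Datapoint manuals:

              #include <stdio.h>

              /* Split an 8080 opcode byte into its three octal digits (2-3-3 bits).
               * Register codes: B=0 C=1 D=2 E=3 H=4 L=5 M=6 A=7. */
              static const char *reg8080 = "BCDEHLMA";

              int main(void)
              {
                  unsigned char op = 0x41;        /* MOV B,C */
                  unsigned hi = (op >> 6) & 3;    /* octal digit 1: instruction class */
                  unsigned mid = (op >> 3) & 7;   /* octal digit 2: destination */
                  unsigned lo = op & 7;           /* octal digit 3: source */
                  printf("%02X = octal %o%o%o", op, hi, mid, lo);
                  if (hi == 1)                    /* class 1 = MOV dst,src */
                      printf("  -> MOV %c,%c", reg8080[mid], reg8080[lo]);
                  printf("\n");                   /* prints: 41 = octal 101  -> MOV B,C */
                  return 0;
              }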

      • segfaultbuserr 2 years ago
        VAX's predecessor, the PDP-11, is also a little-endian architecture in its basic form (and the PDP-11 was also a major source of influence to many microprocessor designers, just like VAX's influence on Unix workstations).

        The "PDP-endian" is only a quirk due to its Floating Point Unit's long integer and double-precision floating point formats. The FPU was an extra module attached to the processor, and the original PDP-11 did not even have an FPU. It only appeared on later models: on low-end machines a simplified FPU version was available for separate purchase with limited functionalities, and only high-end models had the full FPU. On a system without FPU installed, you basically don't need to worry about "PDP-endian", it's a pure little-endian machine. But for convenience, the Unix C compiler always stored long integers in PDP-endian to avoid swapping endians. Because the same Unix and C software ran on all machines with or without FPU, all Unix programmers needed to worry about it, thus the PDP-endian folklore.

        But why did the PDP-11 FPU use this strange format? @aka_pugs from Twitter did some digging, and found the PDP-endian format was already in use as a softfloat format by DEC's PDP-11 Fortran compiler. So the FPU was made compatible with that...

        • pencilguin 2 years ago
          PDP-11 long ints were mixed-endian: high 16 bits, then low 16 bits. But within each half, the low 8 bits, then the upper 8 bits. It was sometimes called the "NUXI" format, for how it scrambled the bytes in "UNIX".
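
          A quick way to see the scrambling: lay out the 32-bit value 0x0A0B0C0D as the PDP-11 convention does (high 16-bit word first, low byte first within each word) and compare with plain little- and big-endian. A hedged C sketch, just writing the byte orders out by hand rather than querying real hardware:

              #include <stdio.h>

              int main(void)
              {
                  /* Byte layouts of the 32-bit value 0x0A0B0C0D, lowest address first. */
                  unsigned char big[4]    = {0x0A, 0x0B, 0x0C, 0x0D}; /* big-endian */
                  unsigned char little[4] = {0x0D, 0x0C, 0x0B, 0x0A}; /* little-endian */
                  unsigned char pdp[4]    = {0x0B, 0x0A, 0x0D, 0x0C}; /* middle-endian: high
                                                                         word first, low byte
                                                                         first in each word */
                  const char *name[3] = {"big", "little", "pdp"};
                  unsigned char *buf[3] = {big, little, pdp};
                  for (int i = 0; i < 3; i++) {
                      printf("%-6s:", name[i]);
                      for (int j = 0; j < 4; j++)
                          printf(" %02X", buf[i][j]);
                      printf("\n");
                  }
                  return 0;
              }
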
          • segfaultbuserr 2 years ago
            What I was saying is that the basic PDP-11, as originally designed, is a pure 16-bit machine with no 32-bit capabilities. The Unix C and other compilers used the middle-endian format for 32-bit integers largely as an artificial choice to be compatible with the FPU, as middle-endian was the FPU's native long integer format. But the FPU was only a later hardware extension, not an inherent part of the basic system, and it's not enforced by the basic PDP-11 instruction set. It's entirely possible to modify the UNIX C compiler to store long integers in little endian.
        • ufo 2 years ago
          I think it's neat that Arabic, where we got the numbers from, is an RTL language. From their point of view the numbers are little endian.
          • masklinn 2 years ago
            OTOH Arabic got its numerals from Indic systems, and Indic scripts are generally LTR, which points to big endian being “more natural”. It also matches the spelling of positional numerals in most languages, and does make sense from a convenience point of view: when talking, it’s easier to round off by just stopping as you go than to figure out what rounding you should apply beforehand.

            If you spell out numbers in little-endian, once you start you’re committed to spelling it out in full, whereas big endian lets you stop at basically any point you feel like.

            • ithkuil 2 years ago
              Zwei und dreissig ("two and thirty", 32)? Wahid wa ishrun ("one and twenty", 21)?

              Yes, most modern languages have lost little-endianness, and those that kept it use it only for the first 100 numbers.

              That's because, as you correctly point out, for bigger and bigger numbers big endian is more useful when saying them aloud, since you can often ignore the less significant digits.

              My original point was different though:

              You can easily render little endian hexdumps equally readable as big endian hexdumps by just writing them in the order that is meant for numbers, namely right to left.

              We even align numbers to the right in spreadsheets. That's the same thing.

              Look at an old DEC manual (digital Unix, or VMS) and you'll see hexdumps where the numeric part is aligned from center towards the left and the ASCII part is aligned from the center to the right.

              With this layout you can easily read multibyte numbers naturally.
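
              A rough sketch of a simplified variant of the idea: print each 32-bit word's bytes from its highest address down to its lowest, so a little-endian value reads naturally (my own toy example, not DEC's actual dump format):

                  #include <stdint.h>
                  #include <stdio.h>
                  #include <string.h>

                  /* Dump 32-bit words with bytes printed highest-address-first, so a
                   * little-endian value such as 0x12345678 appears as "12 34 56 78". */
                  static void dump_rtl(const unsigned char *p, size_t words)
                  {
                      for (size_t w = 0; w < words; w++) {
                          for (int b = 3; b >= 0; b--)
                              printf("%02X ", p[w * 4 + b]);
                          printf("\n");
                      }
                  }

                  int main(void)
                  {
                      uint32_t v = 0x12345678;   /* stored little-endian on most hosts */
                      unsigned char buf[4];
                      memcpy(buf, &v, sizeof v);
                      dump_rtl(buf, 1);          /* prints 12 34 56 78 on a little-endian host */
                      return 0;
                  }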

          • renox 2 years ago
            > Little endian is not a hack. It's a natural way to represent numbers

            For representing integers, maybe; for real numbers it's quite weird.

          • noobermin 2 years ago
            So, this is interesting so far and I'm bookmarking it for the depth of the author's historical knowledge, but saying "horizontal microcode" was the main difference that "no one talks about"... I mean, I was told this was the very difference that makes RISC distinct from x86 and friends (what I see now were VAX-like archs): the simpler transistor logic without the crazy micro-programs, and the pipelining. I thought this was common knowledge. Is there some other context people think of when they talk about RISC that I'm unaware of?
            • baryphonic 2 years ago
              You're absolutely right. I was waiting to see what the big misunderstanding was about RISC, and I actually rolled my eyes when I saw _horizontal microcode_ bolded. It's an implementation detail.
              • kragen 2 years ago
                You seem to have read noobermin as saying the opposite of what I read them as saying, because to my eyes the conversation looks like this:

                    <noobermin> X is Y
                    <baryphonic> You're absolutely right.  X is totally not Y
            • klelatti 2 years ago
              The author seems to have been quite wound up by claims that "my high end desktop / server (insert ARM / RISC-V to taste) is better than your x86 because 'RISC'".

              Fair enough. It's an 80's debate really. ISA is probably by a long margin not the most important factor in these comparisons.

              But he says it makes no difference at all without evidence.

              And he ignores that there is a whole world of simpler (especially in-order) cores where ISA probably does matter a lot.

              • noobermin 2 years ago
                In the case it was missed, he actually does talk about that towards the end, that it matters for low powered microcontrollers and for people unwilling to pay for an ARM license, but for the higher powered machines (ARM machines, x86_64 etc), they are both OoO and that is more of what makes them powerful for modern computing ("real" computing by his definition) than does the RISC vs VAX lineage matters.
                • klelatti 2 years ago
                  He talks about it for microcontrollers but conveniently ignores a whole range of CPUs between this class and OoO because they don't fit his argument. eg from Wiki:

                  > The Cortex-A53 is the most widely used architecture for mobile SoCs since 2014 to the present day, making it one of the longest-running ARM processors for mobile devices. It is currently featured in most entry-level and lower mid-range SoCs, while higher-end SoCs used the newer ARM Cortex-A55. The latest SoCs still using the Cortex-A53 are MediaTek Helio G37, both of which are entry-level SoCs designed for budget smartphones.

                  These may not be the most exciting CPUs but does the ISA matter here? Yes probably quite a bit. Does it matter that they are compatible with beefier OoO cores in say a big.LITTLE configuration. Yes it does.

                  I wouldn't mind too much if he had said I'm just talking about high end. But he implicitly claims to cover everything except 100k transistor CPUs.

                  • zozbot234 2 years ago
                    ISA is a very real constraint on the width of insn decode for something like x86-64, when compared to AArch64 and even more clearly, to RV64C (which competes in code density with x86-64).
                • KingOfCoders 2 years ago
                  "In contrast, 16-bit processors could only address 64-kilobytes of memory, and weren't really practical for real computing."

                  ?

                  • kragen 2 years ago
                    Yeah, this is a mistake given that the 8088 was a 16-bit processor that could address a mebibyte of memory, and whether you like it or not, it was certainly a consequential design for the history.
                    • noobermin 2 years ago
                      See their definition of "real" computing (the author even uses sarcastic quotes elsewhere):

                      >The interesting parts of CPU evolution are the three decades from 1964 with IBM's System/360 mainframe and 2007 with Apple's iPhone. The issue was a 32-bit core with memory-protection allowing isolation among different programs with virtual memory. These were real computers, from the modern perspective: real computers have at least 32-bit and an MMU (memory management unit).

                      • KingOfCoders 2 years ago
                        Mostly everyone else seems to define the bitness of CPUs by their capacity to add numbers in one go, not by the address space they can address. What "everyone" calls 8-bit computers could address 64 KB of address space, not 256 bytes. Everyone can call a table a chair, but it doesn't make communication easier.
                        • noobermin 2 years ago
                          I'm not sure I understand what you are saying and how it is a refutation. The line you pulled was in fact his argument for why memory addressing is an important qualifier for his definition and bit-length of registers is insufficient in his mind. The line is from where he argues that even though the 68k is considered a 16-bit machine due to how the ALU works, it has a 24-bit address space, making it closer to a "real" computer by his definition. I'd reckon he'd say (as he did :) ) that the 6502 isn't a "real" computer because it can only address 64K of memory.

                          I think "modern" computing is a better term than "real" computing for what he means, but it's merely presumptuous of a definition, it's nowhere as extreme as calling a table a chair. I'd suggest not letting a definition poison your mind too much as the post is interesting if you can get past the tone.

                          • abiloe 2 years ago
                            > Mostly everyone else seems to define the bitness of CPUs by their capacity to add numbers in one go, not by the address space they can address.

                            Nah, only historically. Yes, "8-bit" refers to the ALU. Back in the day, 16-bit was similar. But even then it was muddy, because when talking about operating systems like Unix or NT the key question about bitness would be in the context of a 32-bit flat addressing model, not really the data width.

                            By the time the 64-bit era rolled around, the "64 bits" definitely referred to address space. The original Pentium had a 64-bit data path and had instructions (MMX, e.g. PADDQ) that could operate on 64-bit numbers - no one would call it 64-bit. By the early 2000s with the big push to mainstreaming 64-bit, it was all about breaking out of the 4GB address space limitation - not the width of data.

                            > define the bitness of CPUs by their capacity to add numbers in one go

                            This is nebulous and therefore troublesome to define. Are we talking about the ISA or the internal circuitry (ALU and/or data path)? Is the 68000 a 16-bit or 32-bit CPU?

                      • atan2 2 years ago
                        I cannot believe I lived to read the expression "anti-VAX" in this context! :)
                        • moomin 2 years ago
                          I don’t doubt the author knows a lot about this, but the case being made constantly has things that even a cursory reading highlights as nonsense. Like RISC requiring a high-level language compiler and operating system, while the dominant “RISC” chip on the planet originally had an operating system written entirely in assembly. And the technical distinction between “real” and “not real CPUs” quietly ignoring the fact that the 80s and 90s were completely dominated by “not real” computers.
                          • snvzz 2 years ago
                            The article (like most RISC hit pieces) neglects the implicit value of simplicity.

                            Complexity needs to be justified, and the article does a very poor job there.

                            • Jweb_Guru 2 years ago
                              It's not really a hit piece and I think he does a good job of arguing that high performance chips are not all that simple anyway anymore regardless of ISA.
                              • pencilguin 2 years ago
                                There has not been any detectable simplicity in anything called RISC in decades. Even things that seem simple looked at from outside are fiendishly fiddly when you look closer.
                              • Taniwha 2 years ago
                                Completely ignores the reason why RISC architectures took over: the L1 I-cache moved on chip, and suddenly the main reason for complex instruction encodings went away.
                                • kragen 2 years ago
                                  While that was very important, wasn't the ARM2 with no cache significantly faster than contemporary CISC processors like the 80386 with no cache?

                                  I know Dhrystone isn't real but https://www.realworldtech.com/arms-race/2/ says an 8 MHz Archimedes got 4901 Dhrystones per second to the 16-MHz 386's 3626 Dhrystones per second. https://en.wikipedia.org/wiki/Instructions_per_second gives the 8-MHz ARM2 4 [Dhrystone] MIPS at 8 MHz and the 16-MHz "i386DX" 2.15 MIPS at 16 MHz. In fact, the cacheless ARM2 even beat the CISC 68020, which did have a tiny cache!

                                  Also, I think squished RISC instruction encodings like Thumb, MIPS16, and RVC seem pretty competitive with popular CISCs on code density; RVC even seems to best them. So even if your data access is competing with instruction fetch for memory bandwidth because you don't have an icache, you'd probably still get more instructions per memory cycle out of RVC than out of i386 or AMD64.

                                  • klelatti 2 years ago
                                    Yes I think cache was an important motivator for some early designs (eg the IBM 801 - shameless plug below! [1]) as the central idea was that fast instruction cache would replace fast microcode store thus removing any code size penalty.

                                    In fact I don't think code size was that much bigger for these designs so cache was probably less important than they initially thought.

                                    The Arm team recognised that memory bandwidth was key for a cacheless design and so designed to maximise this and make the most of it - hence the outperformance.

                                    [1] https://thechipletter.substack.com/p/the-first-risc-john-coc...

                                • IshKebab 2 years ago
                                  So x86 chips are only inefficient because they're fast? Or because Intel only makes laptop and desktop chips?

                                  So how did they fail sooo badly at breaking into the mobile CPU market? Their Android phones were notoriously slow and inefficient.

                                  Also isn't one of the reasons the M1 is so fast because it has so many instruction decoders which is much easier because of the ISA?

                                  The author clearly knows a lot of history but it wasn't an especially convincing argument. Especially the idiotic ranting about what makes something a "real" computer.

                                  • jonstewart 2 years ago
                                    Especially, just dismissing Hennessy and Patterson as a lousy book and ignoring what they said at the time makes the blogpost apparent for what it is: an illogical rant. On the one hand, Dave Patterson, Turing Award winner. On the other, Rob Graham, notorious troll. Hmmm…
                                  • childintime 2 years ago
                                    Strongly opinionated with a real message, I loved it.

                                    Through the RISC story we pay a cultural debt we owe to RISC. It is story telling, about a time long gone, and the tale is mythical in nature. In opposition to the myth, as the article states, RISC by itself is no longer an ideal worth pursuing.

                                    This is relevant to the other Big Myth of our tech times, the Unix Story, and by extension to Linux. UNIX is mythical, having birthed OS and file abstractions, as well as C. It was a big idea event. But its design is antithetical to what a common user today needs, owning many devices and installing software that can't be trusted, at all, yet needs to be cooperative.

                                    When Unix was born, many users had to share the same machine, and resources were scarce to the point there was an urgent need to share them, between users. Unix created the system administrator concept and glorified him. But today Unix botches the ideals it was once born of, the ideals of software modularity and reusability. Package managers are a thing, yet people seem blind to the fact they actually bubble up from hell. Many PM's have come already and none will ever cure the disease.

                                    Despite this the younger generations see Unix through rosy glasses, as the pinnacle of software design, kinda like a Statue of Liberty, instead of the destruction of creative forces it actually results in. I posit Linux's contribution to the world is actually negative now. We don't articulate the challenges ahead, we're just procrastinating on Linux. It's the only game in town. But the money is still flowing, servers are still a thing, and so the myth is still alive.

                                    The Unix Myth has become a toxic lie, and as collateral Linus has become a playmate for the tech titans. I'm waiting for him to come out and do the right thing, for it is evil for the Myth to continue to govern today's reality.

                                    • kragen 2 years ago
                                      Listen, if you write a software system that puts more power at my disposal than Unix does, I'm happy to try it. But I suspect the Unix design has more wisdom in it than you think, because when I've tried to do better, even with the benefit of hindsight, I've always fallen short.
                                      • childintime 2 years ago
                                        Thanks for replying, others just treat me like a troll and downvote, beholden to the myth that Linux is the gold standard. That's exactly the willful blindness I am talking about.

                                        You talk about network effects, as Linux is the only game in town, currently. Implicitly I talk about that too, that's why I mention Linus, I expect leadership. Why develop Linux? There's not much ROI, unless change is the keyword.

                                        Indeed the challenge is to create a new operating system suited to the demands that exist today. Fuchsia is a step, the only thing I can point to right now, but it is hardly accessible.

                                        Note that Android works to overcome the need for a system administrator. That's because Linux works against what it needs to be: an invisible OS. The most prolific use of Linux isn't really a success story.

                                        Furthermore, you suggest "power". Perhaps you talk about piping and shell tools. These are indeed part of the myth, good ideas but, pardon me, terribly executed. They do not compose, they are not scalable. Again because of the time frame they were conceived in, this was impossible. But that is the refrain throughout. As a result everything is just messy, resulting in huge time sinks.

                                        Indeed I hope to one day have the disposition to truly go back to basics, and make a runtime, a substrate if you will, that will run fully inspectable code, with an execution that can be visualized, questioned and reasoned about. That way users and tools can adjust any process, repeatedly, or just once. Incrementally so.

                                        All machine details would be hidden, and that includes (obviously, for me) everything about binaries: compilation, ABIs. Something current OSes (not counting the Web) don't do.

                                        In the end the best OS is an invisible OS.

                                        • kragen 2 years ago
                                          There are lots of exciting options to explore!
                                        • thechao 2 years ago
                                          At this point, I think "Unix" maps cleanly onto a Chesterton's Gate.
                                      • cestith 2 years ago
                                        32-bit versions of OS/2 and multiple versions of Unix ran on the 80386 and 80486 long before Windows NT ever ran on most desktops. Client PCs were mostly Windows 95/98/ME until the XP era. Servers and some professional workstations were NT 3.1, 3.51, and 4.0 then Windows 2000. Few business desktops and home computers ran NT/2000 at all.
                                        • mpweiher 2 years ago
                                          This "debunking" is itself mostly plausible-sounding bunk.

                                          It gets a lot of details simply wrong. For example, the 68030 wasn't "around 100000 transistors", it was 273000 [1]. The 80386 was very similar at 275000 [2]. By comparison, the ARM1 was around 25000 transistors[3], and yet delivered comparable or better performance. That's a factor of 10! So RISC wasn't just a slight re-allocation of available resources, it was a massive leap.

                                          Furthermore, the problem with the complex addressing modes in CISC machines wasn't just a matter of a tradeoff vs. other things this machinery could be used for, the problem was that compilers weren't using these addressing modes at all. And since the vast majority of software was written in high-level language and thus via compilers, the chip area and instruction space dedicated to those complex instructions was simply wasted. And one of the reasons that compilers used sequences of simple instructions instead of one complex instruction was that even on CISCs, the sequence of simple instructions was often faster than the single complex instruction.

                                          Calling the seminal book by Turing award winners Patterson and Hennessy "horrible" without any discernible justification is ... well it's an opinion, and everybody is entitled to their opinion, I guess. However, when claiming that "Everything you know about RISC is wrong", you might want to actually provide some evidence for your opinions...

                                          Or this one: "These 32-bit Unix systems from the early 1980s still lagged behind DEC's VAX in performance." What "early 1980s" 32-bit Unix systems were these? The Mac came out in 1984, and it had the 16-bit 68000 CPU. The 68020 was only launched in 1984; I doubt many 32-bit designs based on it made it out the door in the "early 1980s". The first 32-bit Sun, the 68020-based Sun-3, was launched in September of 1985, so the second half of the 1980s - I don't think that qualifies as "early". And of course the Sun-3 was faster than the VAX-11. The VAX 8600 and later were introduced around the same time as the Sun-3.

                                          Or "it's the thing that nobody talks about: horizontal microcode". Hmm...actually everybody talked about the RISC CPUs not having microcode, at least at the time. So I guess it's technically true that "nobody" talked about horizontal microcode...

                                          He seems to completely miss one of the major simplifying benefits of a load/store architecture: simplified page fault handling. When you have a complex instruction with possibly multiple references to memory, each of those references can cause a fault, so you need complex logic to back out of and restart those instructions at different stages. With a load/store architecture, the instruction that faults is a load. Or a store. And that's all it does.

                                          It also isn't true that it was the Pentium and OoO that beat the competing RISCs. Intel was already doing that earlier, with the 386 and 486. What allowed Intel to beat superior architectures was that Intel was always at least one fab generation ahead. And being one fab generation ahead meant that they had more transistors to play with (Moore's Law) and those transistors were faster/used less power (Dennard scaling). Their money generated an advantage that sustained the money that sustained the advantage.

                                          As stated above, the 386 had 10x the transistors of the ARM1. It also ran at significantly faster clock speeds (16-25 MHz vs. 8 MHz). With comparable performance. But comparable performance was more than good enough when you had the entire software ecosystem behind you, efficiency be damned. Advantage Wintel.

                                          Now that Dennard scaling has been dead and buried for a while, Moore's law is slowing and Intel is no longer one fab generation ahead, x86 is behind ARM and not by a little either. Superior architecture can finally show its superiority in general purpose computing and not just in extremely power sensitive applications. (Well part of the reason is that power-consumption has a way of dominating even general purpose computing).

                                          That doesn't mean that everything he writes is wrong, it certainly is true that a complex OoO Pentium and a complex OoO PowerPC were very similar, and only a small percent of the overall logic was decode.

                                          But I don't think his overall conclusion is warranted, and with so much of what he writes being simply wrong, the rest that is more hand-wavy doesn't convince. Just because instruction decode is not a big part doesn't mean it can't be important for performance. For example, it is claimed that one of the reasons the M1 is comparatively faster than x86 designs is that it has one more instruction decode unit. And the reason for that is not so much that it takes so much less space, but that the units can operate independently, whereas with a variable length instruction stream you need all sorts of interconnects between the decode units, and these interconnects add significant complexity and latency.

                                          Right now, RISC, in the form of ARM in general and Apple's MX CPUs in particular, is eating x86's lunch, and no, it's not a coincidence.

                                          I just returned my Intel Macbook to my former employer and good riddance. My M1 is sooooo much better in just about every respect that it's not even funny.

                                          [1] https://en.wikipedia.org/wiki/Motorola_68030

                                          [2] https://en.wikipedia.org/wiki/I386

                                          [3] https://www.righto.com/2015/12/reverse-engineering-arm1-ance...

                                          • erwan577 2 years ago
                                            I strongly agree with the point that "the problem was that compilers weren't using these addressing modes at all".

                                            At least in the 80s, microcomputer compilers were very primitive compared to what we have now, which maintained a strong need for ASM. Dev tools used to be very expensive and proprietary too.

                                            GCC slowly started to change that, beginning in 1987.

                                            So there was a time when software started to be mainly compiled from high-level languages, but with stupid compilers, and CPU designers had to live with that.

                                            • snvzz 2 years ago
                                              >whereas with a variable length instruction stream you need all sorts of interconnects between the decode units, and these interconnects add significant complexity and latency.

                                              I find worth noting this is not always the case.

                                              e.g. RISC-V C extension provides variable length instructions, but they're still either 16 or 32 bit.

                                              Special care has been put into making the decoding overhead of dealing with this situation negligible, and it is indeed so. There's a benefit, transistor-budget-wise, the moment there's any on-die cache or on-die ROM. Any chip that's smaller than that is going to be very specialized and can simply omit C. In any chip that's larger, C is a net benefit.

                                              As a practical example, the RISC-V based Ascalon by Jim Keller's team is an 8-wide (like M1), 10-issue CPU.

                                              However, you're absolutely right the wild sort of variable instruction length that is seen in CISC architectures like x86 is a huge issue that massively complicates implementations and outright imposes a practical limit in decoder width.

                                              OTOH in aarch64, the adoption of a fixed instruction size, thus tanking code density, was unenlightened to the point of being brain-dead; we see the cache sizes M1/M2 need just to deal with this, and I'm afraid ARM will be gone for other reasons (non-technical, to do with mismanagement) before they have a chance to correct course and re-introduce compressed instructions.

                                              As for the rest of the article, I generally agree with you that it presents outright wrong information as facts and then tries to push the wrong conclusion. It is utter bull, practically nothing of value can be found in there. I'm not even surprised, as it is pretty much the norm in RISC opposition.

                                              • cesarb 2 years ago
                                                > e.g. RISC-V C extension provides variable length instructions, but they're still either 16 or 32 bit.

                                                It's more than that. In RISC-V, you only need the first two bits of each instruction to determine whether it's a 16 bit or 32 bit instruction; you don't need to decode an instruction to know its length.
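
                                                A sketch of that length rule as the C extension defines it for the 16/32-bit case (longer encodings are reserved behind further bits):

                                                    #include <stdint.h>

                                                    /* RISC-V length rule: if the two lowest bits of the first 16-bit
                                                     * parcel are not 0b11, it's a compressed 16-bit instruction;
                                                     * otherwise it's (at least) a 32-bit instruction. */
                                                    static int insn_length_bytes(uint16_t first_parcel)
                                                    {
                                                        return (first_parcel & 0x3) == 0x3 ? 4 : 2;
                                                    }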

                                                > [...] we see the cache sizes M1/M2 need just to deal with this, [...]

                                                Do the M1/M2 need these cache sizes, or do they have these cache sizes because they can have these cache sizes, due to having a 4x larger page size by default? (Normally, page size wouldn't be that much of a problem for instruction caches, but for x86 it is because the x86 ISAs don't require explicit instruction cache invalidation on self-modifying code; x86 processors would likely have larger L1 instruction cache sizes if they could get away with it.)

                                                • jabl 2 years ago
                                                  > In RISC-V, you only need the first two bits of each instruction to determine whether it's a 16 bit or 32 bit instruction

                                                  Isn't it one bit in the beginning(?) of each 16-bit instruction? So a 32-bit instruction has this information duplicated in the same place in the latter 16-bit half, since a decoder has to be able to decide whether it's trying to decode a 16-bit instruction or whether it's in the middle of a 32-bit instruction.

                                                  The above assuming that the common strategy for implementing a parallel decoder for RVC is to start decoding at each 16-bit offset, and then throw away those cases where it turns out that it was in the middle of a 32-bit instruction, and that RVC has been designed with this implementation strategy in mind.

                                            • erwan577 2 years ago
                                              One of the ideas I take from the piece is that CPU design success is intimately tied to the software ecosystem of the day, and Memory Management Units were a big thing for C language multitasking.

                                              I wonder if Rust or similar could make the MMU transistors and energy budget redundant.

                                              Disclaimer: I am a 68k fan.

                                              • Someone 2 years ago
                                                > I wonder if Rust or similar could make the MMU transistors and energy budget redundant.

                                                The MMU does two things:

                                                - shields processes from each other

                                                - creates the illusion that the machine has more memory than it has

                                                To do away with the former without giving up its benefits, the CPU would, somehow, have to know the code it runs won’t interfere with other processes. It could trust a particular compiler to produce code that’s safe, and rust could provide such a compiler, but then, the CPU would have to prevent Mallory (https://en.wikipedia.org/wiki/Alice_and_Bob#Cast_of_characte...) from producing a binary that he claims was created by that rust compiler, but isn’t.

                                                One way would be to make the CPU run that compiler. The CPU then would not be able to run anything other than code compiled by that rust compiler. That may be seen as prohibitive.

                                                Even if it isn’t, the CPU probably would not want to commit to being tied to one particular compiler. Checking that the actual output of the compiler is safe may be easier. That’s one reason why byte codes were invented. They decrease the coupling between programming language and CPU, allowing evolution of a compiler (often even compiler_s_, supporting multiple languages) independent of the byte code.

                                                So, yes, you could use rust for the first item, but you probably don’t want to. Technologies such as the JVM, Microsoft CLR, or WASM are more suited for that kind of stuff.

                                                Also, if you want to give processes the illusion that the machine has more memory than it has, you would still need an MMU. It could be a bit simpler, but it still would be an MMU.

                                                • adwn 2 years ago
                                                  > I wonder if Rust or similar could make the MMU transistors and energy budget redundant.

                                                  No, those concerns are completely independent of each other. Rust's memory safety protects from accidentally accessing the wrong memory within the same address space, while the MMU protects against accessing (accidentally or intentionally) any memory in other address spaces.

                                                  In addition, the address translation done by the MMU has many more applications, like swapping, memory-mapped files, shared memory, copy-on-write after fork, or stack guard pages, none of which can be done by software alone.
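
                                                  Memory-mapped files are a good concrete example: the kernel just points page-table entries at its page cache and lets the MMU fault pages in on demand, which no language runtime can substitute for. A minimal POSIX sketch, error handling omitted:

                                                      #include <fcntl.h>
                                                      #include <stdio.h>
                                                      #include <sys/mman.h>
                                                      #include <sys/stat.h>
                                                      #include <unistd.h>

                                                      int main(void)
                                                      {
                                                          int fd = open("/etc/hostname", O_RDONLY);   /* any readable file */
                                                          struct stat st;
                                                          fstat(fd, &st);
                                                          /* The file appears directly in our address space; pages are
                                                           * faulted in lazily via the MMU, no read() copies involved. */
                                                          char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
                                                          fwrite(p, 1, st.st_size, stdout);
                                                          munmap(p, st.st_size);
                                                          close(fd);
                                                          return 0;
                                                      }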

                                                  • flohofwoe 2 years ago
                                                    > ...accessing the wrong memory within the same address space

                                                    On systems without MMU there's only one shared address space (like on the Amiga, you only had lightweight processes/threads called Exec Tasks which all ran in the same global address space).

                                                    Rust could definitely help to isolate memory accesses of applications that all run in the same address space.

                                                    • erwan577 2 years ago
                                                      Rust should help with the stability issues that plagued the 80s/90s before fully enforced memory protection. Now I wonder if the MMU really brings other benefits.

                                                      In my latest workstation, I do not use virtual memory to use the disk as extra memory for my running software; I use compressed memory as a disk cache. 30 years of progress has changed the requirements and use cases. Maybe the MMU will survive in a different form that would require a new name.

                                                      I read that 4K virtual memory pages and TLB caches are becoming a performance bottleneck, so this needs to be redesigned anyway.

                                                  • jabl 2 years ago
                                                    Hypothetically, sure. One can imagine a system getting rid of virtual memory, and instead using e.g. some kind of capability system to prevent programs from reading memory they're not allowed to. In reality, there's so much software that assumes each process gets its own private address space that I find it very hard to imagine what a transition to this new MMU-less world would look like. Maybe something CHERI-like as an intermediate step?
                                                    • noobermin 2 years ago
                                                      Btw, giving rust like safety is even more fine-grained than this, it would have to ensure ownership for pieces of memory within one program, which seems amazingly tedious to do in hardware.
                                                      • jabl 2 years ago
                                                        Well, that's what CHERI does, more or less.
                                                    • noobermin 2 years ago
                                                      We had a "sample" of this, at least according to the author, although it was in a less savory direction from those who care about memory safety and things like that: The M1 was optimized to run js shit faster apparently, which he claims is why the CPU is better for mobile machines (macbooks). Supposedly rust is too new for a new arch to design around it.

                                                      Tbh, and maybe this is just the limits of my imagination, but I'm not sure what rust's guarantees would mean at the ISA level; they usually concern safety at the application level. Systems programming in general still needs loads of unsafe blocks to actually work (see the debate a few weeks ago where Linus Torvalds critiqued a patch where rust folks wanted to change memory allocators in Linux so they could play nicer with safe rust code).

                                                      Like, ownership and move semantics are really a higher-level concept, and anything that happens within a single page the MMU will not care about with machines today, so this wouldn't be a small evolution but a completely different kind of arch. Again, maybe I'm just too uninformed or lack the imagination.

                                                      • klelatti 2 years ago
                                                        Here is the JS reference in the article.

                                                        > For example, they added a lot of great JavaScript features, cognizant of the ton of online and semi-offline apps that are written in JavaScript. In contrast, Intel attempts to optimize a chip simultaneously for laptops, desktops, and servers, leading poorly optimizations for laptops.

                                                        Now there is a JS feature in the M1. It's the FJCVTZS instruction "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero" which ensures this conversion follows the JS specification. [1]

                                                        And this does indeed improve JS performance for Arm CPUs. But why does JS behave this way? Because it was specified to follow what x86 does!

                                                        So to say that 'M1 is optimised for JS but x86 isn't etc' is just plain wrong.

                                                        Also:
                                                        - Apple didn't do it, Arm did.
                                                        - It has nothing whatsoever to do with memory management.
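
                                                        For reference, the operation FJCVTZS accelerates is essentially ECMAScript's double-to-int32 conversion (truncate toward zero, then wrap modulo 2^32); roughly this, in portable C:

                                                            #include <math.h>
                                                            #include <stdint.h>

                                                            /* Approximation of ECMAScript ToInt32: NaN/Inf become 0, otherwise
                                                             * truncate toward zero and wrap modulo 2^32 into a signed 32-bit
                                                             * value. JS engines do this constantly; FJCVTZS does it in one
                                                             * instruction instead of a multi-instruction fixup sequence. */
                                                            static int32_t js_to_int32(double d)
                                                            {
                                                                if (!isfinite(d))
                                                                    return 0;
                                                                double m = fmod(trunc(d), 4294967296.0);  /* 2^32 */
                                                                if (m < 0)
                                                                    m += 4294967296.0;
                                                                return (int32_t)(uint32_t)m;
                                                            }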

                                                        [1] https://stackoverflow.com/questions/50966676/why-do-arm-chip...

                                                        • masklinn 2 years ago
                                                          Indeed, the few things the article talks about that I already know something about (very much including the javascript feature of the M1 you outline) are completely wrong, which makes the entire thing extremely suspect.

                                                          An other banger:

                                                          > Apple also does crazy things like putting a high end GPU (graphics processor) on the same chip.

                                                          They’re good but they’re not especially high-end, unless compared to other embedded GPUs like the ones in AMD’s APU, which… are rather comparable overall, and also single-die.

                                                      • flohofwoe 2 years ago
                                                        WASM would probably be a better fit for this problem. But I believe that software security ultimately needs to be tackled on the hardware level. It would be a bleak future if I'm forced to write programs for some platforms in a specific 'secure' high level language, this would hamper progress by competition in the programming language design space.
                                                        • pvg 2 years ago
                                                          > Memory Management Units were a big thing for C language multitasking.

                                                          How do you figure? The article outlines MMU development since well before C, it's not like people came up with memory protection because of C.

                                                          • kragen 2 years ago
                                                            PoohBear, it's because Smalltalk, Oberon, and J2ME systems didn't need MMUs. People came up with MMUs because of low-level languages, a category which includes assembly and C. But we're just rehearsing debates that are half a century old.

                                                            (On the other hand, the B5000 had hardware memory protection despite being programmed in Algol. The B5000 inspired Smalltalk, which inspired Oberon and Java. But its memory protection didn't use an MMU.)

                                                            • erwan577 2 years ago
                                                              I was a user at the time, and also a reader of many computing magazines.

                                                              GUI environments of the 80s all brought multitasking with them, and system stability was mediocre to very bad... All writers pointed to memory protection as the cure for all this. See also Mac OS history for a more detailed use case.

                                                              Rising software size and complexity made the industry abandon assembly programming for higher-level languages, and for GUI apps this quickly meant C/C++.

                                                          • nuc1e0n 2 years ago
                                                            So my takeaway from this article is this: RISC largely displaced CISC except in legacy situations as you could get better throughput for the same number of transistors by moving work into the compiler. In turn Out-of-Order execution largely displaced RISC as you could get better throughput for the same number of transistors by moving more work into the compiler.

                                                            How else might processor topology design dogma be hindering the performance we could get by having better compilers? This is especially important now the transistor budget isn't nearly so flexible.

                                                            • thethirdone 2 years ago
                                                              > In turn Out-of-Order execution largely displaced RISC as you could get better throughput for the same number of transistors by moving more work into the compiler.

                                                              What work does OoO execution displace to the compiler? I thought that OoO CPUs get better performance on the exact same programs compared to in order CPUs.

                                                              • nuc1e0n 2 years ago
                                                                Ensuring register accesses are interleaved in a good way, right?
                                                                • thethirdone 2 years ago
                                                                  Register renaming allows CPUs to eliminate stalls due to reuse of a register; I have not noticed any compiler putting particular emphasis on interleaving accesses well.

                                                                  That is actually more of a problem on in-order CPUs, because a single stall will hold up the entire CPU instead of just taking longer to commit while other stuff is going on.
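
                                                                  A toy illustration of the renaming mechanism (not modelled on any real core): every architectural write gets a fresh physical register, so a later write to the same name never waits on earlier readers or writers of it.

                                                                      #include <stdio.h>

                                                                      /* Toy rename table: architectural register -> current physical register.
                                                                       * Each write allocates a new physical register, removing WAW/WAR hazards. */
                                                                      enum { NARCH = 4 };
                                                                      static int map[NARCH] = {0, 1, 2, 3};
                                                                      static int next_phys = NARCH;

                                                                      static int rename_read(int r)  { return map[r]; }
                                                                      static int rename_write(int r) { return map[r] = next_phys++; }

                                                                      int main(void)
                                                                      {
                                                                          printf("r1 = load A -> writes p%d\n", rename_write(1));
                                                                          printf("r2 = r1 + 1 -> reads  p%d\n", rename_read(1));
                                                                          printf("r1 = load B -> writes p%d, independent of the first load\n",
                                                                                 rename_write(1));
                                                                          return 0;
                                                                      }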

                                                            • pencilguin 2 years ago
                                                              The article is almost completely right, aside from missing that VAX was little-endian.

                                                              But if the 68k was really a 16-bit design, then the Z-80 was really a 4-bit chip, because that was the size of its ALU. What matters, really, is the register size, and how much work you can do in one instruction. Federico Faggin ("fajjeen", btw) recognized that the Z-80 did not need its 8-bit result in the next clock cycle anyway, so it took two 4-bit cycles, and nobody was the wiser.
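
                                                              The nibble trick is easy to sketch: do the low 4 bits first, carry into the high 4 bits, and the programmer never notices. A hedged illustration of the idea, not a model of the actual Z-80 datapath:

                                                                  #include <stdint.h>
                                                                  #include <stdio.h>

                                                                  /* An 8-bit add performed as two passes through a 4-bit ALU:
                                                                   * low nibble first, then high nibble plus the low nibble's carry. */
                                                                  static uint8_t add8_via_4bit_alu(uint8_t a, uint8_t b)
                                                                  {
                                                                      uint8_t lo = (a & 0xF) + (b & 0xF);            /* first ALU cycle  */
                                                                      uint8_t hi = (a >> 4) + (b >> 4) + (lo >> 4);  /* second ALU cycle */
                                                                      return (uint8_t)((hi << 4) | (lo & 0xF));
                                                                  }

                                                                  int main(void)
                                                                  {
                                                                      printf("%u\n", add8_via_4bit_alu(0x3C, 0x2F)); /* prints 107 (0x6B) */
                                                                      return 0;
                                                                  }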

                                                              • programmer_dude 2 years ago
                                                                The article does clear up a few things. Could have been a little less acerbic though.
                                                                • rbanffy 2 years ago
                                                                  It kind of pushed their own definition of what RISC is. They also confuse the definition of what a 16-bit computer is and ignore the many commercial successes in 16-bit minicomputers such as the Nova. It is clever to point out that the most useful definition of RISC is that it’s a tradeoff.

                                                                  The point remains that a simplified ISA that's easy to decode (and, more recently, to implement dynamic reordering for) will always have an edge by freeing up resources that can be dedicated to executing the workload rather than housekeeping (as in resolving all inter-instruction dependencies).

                                                                  OTOH, going too far in that direction gives you VLIW, which has proven itself to be a pain more often than not.

                                                                  • snvzz 2 years ago
                                                                    >It kind of pushed their own definition of what RISC is.

                                                                    That's intentional. A straw man tends to be easier to attack.

                                                                    • rbanffy 2 years ago
                                                                      I kind of agree there is no hard definition because what's "complex" and "reduced" are subjective things. The venerable 6502 was called "RISC before it was cool" and I have to agree its minimalistic ISA made it incredibly capable (and fast, clock for clock) compared to its more ambitious contemporaries such as the Intel 8080/8085, the Motorola 6809, and its arch-nemesis, the Z-80.
                                                                • djmips 2 years ago
                                                                  This article is a polemic. Don't take it personally. Enjoy! Also it's clear they are well versed in the topic. You may not agree but it is great food for thought.
                                                                  • kragen 2 years ago
                                                                    I don't think Rob has ever designed a single CPU, much less measured the effects of different tradeoffs.
                                                                    • noobermin 2 years ago
                                                                      To be fair, likely none of the readers here have designed a single CPU either :)
                                                                      • ruslan 2 years ago
                                                                        I toyed with my own implementation of one-stage RV32I for FPGAs, does this count ? ;)
                                                                        • formerly_proven 2 years ago
                                                                          Some variation of designing, building or extending a CPU or ISA is standard in comp-sci curricula.
                                                                          • kragen 2 years ago
                                                                            It's a standard thing to do in EE curricula; you normally do it in a one-semester class, and there are literally thousands of open-source synthesizable CPU cores on GitHub now. Some one-semester classes go so far as to design ASICs and, if they pass DRCs, get them fabbed through something like MOSIS or CMP.

                                                                            To take three examples to show that designing a CPU is less work than writing a novel:

                                                                            - Chuck Thacker's "A Tiny Computer", fairly similar to the Nova, is a page and a half of synthesizable Verilog; it runs at 66 MHz in 200 (6-input) LUTs of a Virtex-5: https://www.cl.cam.ac.uk/~swm11/examples/bluespec/Tiny3/Thac...

                                                                            - James Bowman's J1A is more like Chuck Moore's MuP21 and is about three pages of synthesizable Verilog: https://github.com/jamesbowman/swapforth/blob/master/j1a/ver... and https://github.com/jamesbowman/swapforth/blob/master/j1a/ver.... You can build it with Claire Wolf's iCEStorm (yosys, etc.) and run it on any but Lattice's tiniest FPGAs; it takes up 1162 4-input LUTs.

                                                                            - Ultraembedded's uriscv is about 11 pages of Verilog and implements the full RV32IMZicsr instruction set, including interrupt handling (but not virtual memory or supervisor mode): https://github.com/rolandbernard/kleine-riscv/tree/master/sr...

                                                                            In all three cases, this doesn't include testbenches and other verification work, but as I understand it, that's usually only two or three times as much work as the logic design itself.
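
                                                                            To put a number on "less work than writing a novel", here's roughly how little a complete (if useless) software-simulated CPU needs; an invented toy ISA, nothing to do with any of the designs above:

                                                                                #include <stdio.h>

                                                                                /* A toy accumulator machine: 4-bit opcode, 4-bit operand.
                                                                                 * 0=halt 1=load-imm 2=add-imm 3=store[n] 4=print. Not a real ISA. */
                                                                                int main(void)
                                                                                {
                                                                                    unsigned char prog[] = {0x15, 0x27, 0x40, 0x00}; /* acc=5; acc+=7; print */
                                                                                    unsigned char mem[16] = {0}, acc = 0;
                                                                                    for (int pc = 0; ; pc++) {
                                                                                        unsigned op = prog[pc] >> 4, n = prog[pc] & 0xF;
                                                                                        if (op == 0) break;
                                                                                        else if (op == 1) acc = n;
                                                                                        else if (op == 2) acc += n;
                                                                                        else if (op == 3) mem[n] = acc;
                                                                                        else if (op == 4) printf("%u\n", acc);   /* prints 12 */
                                                                                    }
                                                                                    return 0;
                                                                                }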

                                                                            Maybe we should have a NaCpuDeMo, National CPU Design Month, like NaNoWriMo.

                                                                            I haven't quite done it myself. Last time I played https://nandgame.com/ it took me a couple of hours to play through the hardware design levels. But that's not really "design" in the sense of defining the instruction set (which is, like Thacker's design, kind of Nova-like), thinking through state machine design, and trying different pipeline depths; you're mostly just doing the kind of logic minimization exercises you'd normally delegate to yosys.

                                                                            In https://github.com/kragen/calculusvaporis I designed a CPU instruction set, wrote a simulator for it, wrote and tested some simple programs, designed a CPU at the RTL level, and sketched out gate-level logic designs to get an estimate of how big it would be. But I haven't simulated the RTL to verify it, written it down in an HDL, or breadboarded the circuit, so I'm reluctant to say that this qualifies as "designing a single CPU" either. (Since it's not 01982 anymore maybe you should also include a simple compiler backend before you say a new ISA is really designed?)

                                                                            But I also wouldn't say I'm "well versed in the topic". I can say things about what makes CPUs fast or slow, but I don't know them from my own experience; I'm mostly just repeating things I've heard from people I judge as credible on CPU design. But what is that credibility judgment based on? How would I know if I was just believing a smooth charlatan who doesn't really know any more than I do? And I think Rob is in the same situation as I am, just worse, because he has even less experience.

                                                                        • jonstewart 2 years ago
                                                                          Just because Rob Graham is of a certain age doesn’t mean he’s well-versed in the topic.
                                                                          • signa11 2 years ago
                                                                            • pvg 2 years ago
                                                                              They're only HN-dupes if they get some nontrivial amount of discussion and that post didn't.
                                                                              • kosolam 2 years ago