The Neoverse V1 Microarchitecture: X1 with SVE?

Starting off with the new Neoverse V1, the design is of familiar origin, but it also has a few distinct features that appear for the first time in an Arm CPU. As noted in the introduction, the V1 was designed at the same time as the Cortex-X1 by the same team at Arm’s Austin design centre, with large similarities between the two microarchitectures when it comes to their block structures.

What’s notable about the V1, in comparison to the X1 and of course the predecessor N1, is that this is now an SVE-capable processor with two native 256b SIMD pipelines, and that it introduces server-only features such as coherent L1I caches, bFloat16 execution capabilities, and a slew of distinct characteristics we’ll cover in just a bit.

The architectural features of the Neoverse V1 are probably the most complicated to describe – essentially, it’s a v8.4 baseline architecture which also pulls in v8.5 and v8.6 features for the HPC-oriented workloads the design is aimed at. Given that we talked about Armv9 only a month ago, this may seem a bit odd, but again we have to remember that the V1 was designed some time ago and that customers have had the IP for quite a while now, currently taping out or having already taped out V1 processors.

The big promise of the V1 is its extremely large performance jump over the N1, coming in at an IPC increase of +50%. This sounds large, and it is, but it’s also not all that surprising given that the microarchitecture is essentially two design generations newer than the N1, even though from an infrastructure product standpoint it’s only one generation newer.

From a high-level pipeline and microarchitecture view, the Neoverse V1 is very similar to the X1. It’s still an extremely short pipeline design with a minimum of 11 stages, with Arm putting a lot of focus on this aspect of their microarchitectures to reduce branch misprediction penalties as much as possible. This aspect of the microarchitecture has remained relatively static over the last few iterations of the Austin family of designs starting with the A76, so Arm notes that the frequency capabilities of the V1 are essentially unchanged compared to the N1, with performance boosts coming solely from increased IPC.

The V1 inherits many of the front-end improvements introduced with the Cortex-A77 and Cortex-X1 generations, such as doubled bandwidth for the decoupled fetch unit, a much larger L2 BTB of up to 8K entries, and a rearrangement and resizing of the lower-level BTBs, with the L0 (nanoBTB) growing to 96 entries and the L1 BTB (microBTB) no longer being present compared to the Neoverse N1.

Compared to the N1, the V1 also adds new structures that hadn’t been present in that design, such as a macro-op cache holding up to 3K decoded instructions. Dispatch bandwidth out of the Mop cache is 8-wide, while the actual instruction decoder this generation is 5-wide, much the same as on the X1.

The out-of-order window size is essentially doubled compared to the Neoverse N1, with the ROB growing to 256 entries. This is actually a tad larger than what Arm was willing to disclose for the Cortex-X1, where the company had only talked about an “OoO window size of 224”, so in this regard the V1 appears to differentiate itself from the X1.

On the back-end integer execution pipelines, the design also pulls in the many changes we’ve seen with the A77 generation, which among other things include a doubling of the branch execution ports, and a new complex ALU capable of simple operations such as additions as well as more complex ones such as multiplications and divisions.

Obviously enough, the new SIMD pipelines are very different on the V1, given that this is Arm’s first ever SVE-capable microarchitecture. The design has two pipelines with seemingly two dedicated schedulers, with native capability for 256b-wide SVE vectors. The design is fully backwards compatible with 128b NEON/FP operations, in which case the pipelines essentially act as 4x128b units, meaning the core has the same execution width as the X1 in that regard.
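To give an idea of what software actually executes on these pipelines, here is a minimal sketch of a vector-length-agnostic SAXPY loop using the Arm C Language Extensions (ACLE) SVE intrinsics; the function name and the SAXPY kernel itself are just illustrative choices, not something from Arm’s disclosures. The same binary processes eight single-precision lanes per vector on a 256b implementation such as the V1, and four lanes on a 128b design, without recompilation.

    #include <arm_sve.h>
    #include <stdint.h>

    /* Vector-length-agnostic y[i] += a * x[i].
       The loop never hard-codes the vector width: svcntw() returns the
       number of 32-bit lanes the hardware provides (8 on a 256b V1),
       and the predicate handles the loop tail. */
    void saxpy_sve(float a, const float *x, float *y, int64_t n) {
        for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
            svbool_t pg = svwhilelt_b32_s64(i, n);   /* active lanes, incl. tail */
            svfloat32_t vx = svld1_f32(pg, x + i);   /* predicated contiguous load */
            svfloat32_t vy = svld1_f32(pg, y + i);
            vy = svmla_n_f32_x(pg, vy, vx, a);       /* vy += vx * a */
            svst1_f32(pg, y + i, vy);                /* predicated store */
        }
    }

Built with a compiler flag along the lines of -march=armv8.4-a+sve, this is the sort of loop the two 256b pipelines would chew through, while the same code remains valid on any other SVE vector length.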

Compared to the N1, the new design also supports the new bFloat16 and Int8 data formats, which greatly increase the AI and ML inferencing performance capabilities of the core.
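For readers unfamiliar with the format, bfloat16 is simply the top half of an IEEE-754 float32 (1 sign bit, 8 exponent bits, 7 mantissa bits), which is why it retains float32’s dynamic range while halving storage and memory bandwidth. The helper functions below are a hypothetical illustration of that layout using the usual round-to-nearest-even truncation (NaN handling omitted for brevity), not anything specific to Arm’s implementation:

    #include <stdint.h>
    #include <string.h>

    /* bfloat16 <-> float32 conversion: bf16 is the upper 16 bits of a float32. */
    static uint16_t float_to_bf16(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* type-pun safely */
        bits += 0x7FFF + ((bits >> 16) & 1);     /* round to nearest even */
        return (uint16_t)(bits >> 16);           /* keep sign, exponent, top 7 mantissa bits */
    }

    static float bf16_to_float(uint16_t h) {
        uint32_t bits = (uint32_t)h << 16;       /* low mantissa bits are simply zero */
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }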

On the memory subsystem side, we also see the increased unit count found on the Cortex-X1, including two load/store units and one dedicated load unit, meaning the core is capable of up to 3 loads per cycle and 2 stores per cycle. SVE vector bandwidth is 2x32B per cycle for loads, and 32B per cycle for stores – in other words, up to two full 256b vectors’ worth of loads every cycle.

The core naturally includes the data parallelism improvements seen on the X1 in order to increase MLP (Memory-level parallelism) capabilities.

The L2 cache has also adopted a similar design to that of the X1: it is now 1 cycle faster at the same 1MB size, and doubles the number of banks for increased access parallelism.

Arm here discloses a quite large reduction in system-level latency for the V1. Besides structural improvements, new-generation prefetchers are a big part of this, such as the introduction of a new type of temporal prefetcher which is able to latch onto arbitrary access patterns over time, recognise subsequent iterations of the same pattern, and pull the data in ahead of time.
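As a rough illustration of what “arbitrary but recurring” means here, consider a linked-list walk whose nodes are scattered in memory: a stride prefetcher can’t help, but because the address sequence repeats identically on every pass, a temporal prefetcher can record it on the first traversal and replay it on later ones. The sketch below is purely illustrative of the access pattern, not of Arm’s implementation:

    #include <stddef.h>

    /* A toy access pattern of the kind a temporal prefetcher can learn:
       the node order is irregular in memory, but the sequence of addresses
       repeats on every outer pass, so after the first traversal the
       prefetcher can replay it and hide the cache misses. */
    struct node { struct node *next; long payload; };

    long sum_repeatedly(const struct node *head, int passes) {
        long total = 0;
        for (int p = 0; p < passes; p++) {       /* same address sequence each pass */
            for (const struct node *n = head; n != NULL; n = n->next)
                total += n->payload;             /* dependent loads defeat stride prefetchers */
        }
        return total;
    }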

Arm discloses that the core has new dynamic prefetching behaviour that plays a major role in reducing L2-to-interconnect traffic, which is a critical metric in large core-count systems where every byte of bandwidth needs to be put to actual use and cannot be wasted on wrongly speculated prefetches.

Comments
  • Oxford Guy - Tuesday, April 27, 2021 - link

    ‘Fast-forward to 2021, the Neoverse N1 design today employed in designs such as the Ampere Altra is still competitive, or beating the newest generation AMD or Intel designs – a situation that which a few years ago seemed anything but farfetched.’

    Hmm... That last bit is odd. Either it’s just ‘farfetched’ or it’s ‘expected’.
  • eastcoast_pete - Tuesday, April 27, 2021 - link

Yes, those slides look very promising; now eagerly awaiting an eventual test of one or two of these in an actual silicone. I guess then we'll see how they measure up.
  • mode_13h - Tuesday, April 27, 2021 - link

    Silicone - From Wikipedia, the free encyclopedia

    Not to be confused with the chemical element silicon.

    A silicone or polysiloxane is a polymer made up of siloxane (−R2Si−O−SiR2−, where R = organic group). They are typically colorless, oils or rubber-like substances. Silicones are used in sealants, adhesives, lubricants, medicine, cooking utensils, and thermal and electrical insulation.
  • eastcoast_pete - Thursday, April 29, 2021 - link

    I'll have to take this up with auto-correct. It keeps changing silicon to silicone. Now that I forced it again to leave silicon alone (for the umpteenth time), maybe it will stop (:
  • Mondozai - Tuesday, April 27, 2021 - link

    Fantastic overview by Andrew. AT's most underrated reporter. Hopefully he gets more responsibility to cover more things in the future.
  • Linustechtips12#6900xt - Tuesday, April 27, 2021 - link

    AGREED
  • dotjaz - Tuesday, April 27, 2021 - link

    Good, finally confirmed N2 is in fact ARMv9 as suspected. Now we'll just have to wait and see how the new mobile counterparts are. Hopefully we'll see some real improvements.

    It'll be interesting to see how small the new low power v9 core is given that it has to have a 128b SVE2 pipeline instead of 2x64b NEON.
  • mode_13h - Wednesday, April 28, 2021 - link

    > finally confirmed N2 is in fact ARMv9 as suspected.
    > Now we'll just have to wait and see how the new mobile counterparts are.
    > Hopefully we'll see some real improvements.

    The data presented on N2 doesn't give me much hope that v9 changed much, besides the feature baseline. I was hoping for something slightly revolutionary, but it's certainly not that.
  • dotjaz - Thursday, April 29, 2021 - link

    > hoping for something slightly revolutionary

We've known for a couple of years ARMv9 is just ARMv8.x rebased. Your hopes weren't realistic to begin with. Besides, what "revolutionary" features would you expect ISAs to include? Can you name one? ARMv8.5a+SVE2 already has everything you need to design an excellent and efficient uarch. Why re-invent the wheel just for the sake of it?
  • mode_13h - Thursday, April 29, 2021 - link

    > We've known for a couple of years ARMv9 is just ARMv8.x rebased.

    You knew this according to where? It's one thing to assume that, and clearly it wasn't an unreasonable assumption, but it's another thing to *know* it. So, how did you *know* it?

    > Besides, what "revolutionary" features would you expect ISAs to include? Can oyu name one?

    It's a fair question. Generally speaking, anything that would help improve efficiency. Maybe things like scheduling hints or maybe some kind of tags to indicate memory writes that are thread-private and terminal reads. Just some examples, off the top of my head.

    > ARMv8.5a+SVE2 already has everything you need to design an excellent and efficient uarch.

    The issue I see is that IPC and efficiency gains are going to become ever more hard-won, so there needs to be some more creativity in redefining the SW/HW interface to unlock further gains. ARMv9 is going to be with us for probably another decade and it could end up having to compete with yet-to-be-identified alternatives like maybe RISC VI or something completely out of left-field. So, I see it as a wasted opportunity. A pragmatic decision, for sure, but a little disappointing.
