Taking Small Steps Forward

Today AMD unveiled its new Opteron 6300 series server processors, code-named Abu Dhabi. The Opteron 6300 contains the new Piledriver cores, an evolutionary improvement of the Bulldozer cores.

We did an in-depth analysis of the Bulldozer core and came to the conclusion that three primary weak spots resulted in its underwhelming performance:

  1. The L1 instruction cache: when running two threads simultaneously, the cache miss rate increased significantly; the associativity is too low.
  2. The branch misprediction penalty
  3. Lower than expected clock speed
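
To make the first point more concrete, here is a minimal sketch of how low associativity can hurt when two threads share one instruction cache. It is a toy LRU model, not a description of AMD's actual replacement policy. The 64KB size, the 64-byte line, the 2-way versus 4-way comparison, and the deliberately worst-case access pattern are all assumptions chosen for illustration, and the function and variable names are hypothetical.

```python
from collections import OrderedDict

LINE_SIZE = 64          # bytes per cache line (assumed)
CACHE_BYTES = 64 * 1024 # total cache size (assumed)
STRIDE = 32 * 1024      # addresses this far apart share a set in both configs


def miss_rate(addresses, cache_bytes, ways):
    """Run an address trace through an LRU set-associative cache model."""
    num_sets = cache_bytes // (LINE_SIZE * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // LINE_SIZE
        index, tag = line % num_sets, line // num_sets
        lru = sets[index]
        if tag in lru:
            lru.move_to_end(tag)         # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) == ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[tag] = True
    return misses / len(addresses)


def loop(base, iterations=1000):
    """One thread's fetch stream: two code lines that alias to the same set."""
    return [base, base + STRIDE] * iterations


thread_a = loop(0)
thread_b = loop(1024 * 1024)
# Interleave the two streams to mimic two threads fetching at the same time.
both = [addr for pair in zip(thread_a, thread_b) for addr in pair]

print(f"one thread,  2-way: {miss_rate(thread_a, CACHE_BYTES, 2):.1%}")
print(f"two threads, 2-way: {miss_rate(both, CACHE_BYTES, 2):.1%}")
print(f"two threads, 4-way: {miss_rate(both, CACHE_BYTES, 4):.1%}")
```

In this contrived trace a single thread fits comfortably in two ways, but two threads competing for the same set evict each other on every fetch. Doubling the associativity at the same capacity removes the conflict. Real code is far less pathological, so treat this as the direction of the effect, not its size.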

Secondary bottlenecks were the high latency and low bandwidth of the L2 cache, and the very high latency of the L3 cache, which significantly increased the overall memory latency.

The lack of clock speed has been partially solved in Piledriver with the use of hard edge flops and the resonant clock mesh, which is especially useful for clock speeds beyond 3GHz. Vishera, the desktop chip with Piledriver cores, runs at clock speeds of up to 4GHz, 11% higher than Bulldozer, without any measurable increase in power consumption. As you can see further below, the clock speed increases are a lot smaller for the Opteron 6300: about 4-6%. The fastest but hottest (140W TDP) Opteron now clocks at 2.8GHz instead of 2.7GHz, and the "regular" Opteron 6380 now runs at 2.5GHz instead of 2.4GHz (Opteron 6278). That means that the Opteron is still not able to fully leverage its deeply pipelined, high clock speed architecture: the power envelope of 115W still limits the maximum clock speed. The more complex and less deeply pipelined Intel Xeon E5 runs at 2.7GHz with a 115W TDP.
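
As a quick check on the server figures, the sketch below computes the relative clock increase for the two Opteron pairs quoted above. Only the frequencies come from the text; the helper name and dictionary labels are made up for this example.

```python
def uplift_percent(old_ghz, new_ghz):
    """Relative clock speed increase from old_ghz to new_ghz, in percent."""
    return (new_ghz - old_ghz) / old_ghz * 100.0


# Frequencies as quoted in the text: (previous generation, Opteron 6300).
pairs = {
    "140W TDP top model": (2.7, 2.8),
    "Opteron 6278 -> 6380": (2.4, 2.5),
}

for name, (old, new) in pairs.items():
    print(f"{name}: +{uplift_percent(old, new):.1f}%")
```

Both pairs land at roughly 4%, the low end of the 4-6% range and well short of the desktop part's 11%.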

Piledriver also comes with a few small improvements in the branch prediction unit. Two out of three of the worst bottlenecks got somewhat wider. The most important bottleneck, the L1 Icache, is only going to be fixed with the next iteration, Steamroller.

The L2 cache latency and bandwidth have not changed, but AMD made quite a few optimizations. From AMD engineering:

"While the total bandwidth available between the L2 and the rest of the core did not change from Bulldozer to Piledriver, the existing bandwidth is now used more effectively. Some unnecessary instruction decode hint data writes to the L2 that were present in Bulldozer have been removed in Piledriver. Also, some misses sent to the L2 that would get canceled in Bulldozer are prevented from being sent to the L2 at all in Piledriver. This allows the L2's existing resources to be applied toward more useful work."

We talked about the whole list of other improvements when we looked at Trinity:

  • Smarter prefetching
  • A perceptron branch predictor that supplements the primary BPU
  • Larger L1 TLB
  • Schedulers that free up tokens more quickly
  • Faster FP and integer dividers and SYSCALL/SYSRET (kernel/system call instructions)
  • Faster Store-to-Load forwarding

Lastly, the new Opteron 6300 can now support one DDR3 DIMM per channel at 1866MHz. With two DIMMs per channel (2 DPC), you get 1600MHz at 1.5V.
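
To put those memory speeds in perspective, here is a rough sketch of the theoretical peak bandwidth per socket. It assumes the quoted "MHz" figures are DDR3 transfer rates in MT/s, that each channel is 64 bits (8 bytes) wide, and that there are four memory channels per socket. None of those details are stated in this article, and the function name is illustrative.

```python
BYTES_PER_TRANSFER = 8  # assumed: one 64-bit DDR3 channel moves 8 bytes per transfer


def peak_bandwidth_gbs(mega_transfers_per_s, channels=4):
    """Theoretical peak bandwidth in GB/s (channels=4 per socket is an assumption)."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


for label, rate in (("1 DPC, DDR3-1866", 1866), ("2 DPC, DDR3-1600", 1600)):
    print(f"{label}: {peak_bandwidth_gbs(rate):.1f} GB/s per socket")
```

Under these assumptions, one DIMM per channel gives about 59.7 GB/s of peak bandwidth per socket versus 51.2 GB/s with two DIMMs per channel, so the extra capacity of 2 DPC costs roughly 14% of peak bandwidth. Sustained bandwidth in practice will be lower than either figure.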

We're still working to get hardware in house for testing, but we wanted to provide some analysis of what to expect with Abu Dhabi in the meantime.

22 Comments

  • gamoniac - Tuesday, November 6, 2012 - link

    On top of that, there are licensing costs. Windows Server 2012, for example, can be licensed per processor rather than by core count. When that comes into play, it can quickly inflate the TCO when comparing 4-socket vs 2-socket servers.
  • alpha754293 - Friday, November 9, 2012 - link

    There are a lot of programs that have different licensing methods.

    Ansys is per core.

    Windows actually makes it potentially quite cost effective - especially if you're running a virtualization server because you can throw a lot of VM tiles on a 8-module(?) Opteron 6300 so while you might have to pay more for the additional sockets, it might save you money because you don't have to run twice the number of servers to handle the same number of VM tiles. It really depends on what you're doing with it.

    (I think that Enterprise Linux is also licensed in the same way (per socket).)
  • alpha754293 - Friday, November 9, 2012 - link

    uhh....it depends.

    For some of our larger runs (both at my work and also my CFD runs at home, and also the research that I used to be doing for the university) - we had to write restart files on a regular interval in the event that something goes wrong or the power goes out or something like that.

    That's our kind of "backup". Although unlike say...the financial sector where they want five 9's uptime (99.999%), our restriction isn't THAT bad, but the professional HPC centers will have HA of some kind implemented.

    I think that you saw the last time that you ran the LS-DYNA benchmarks on the Opteron 6274 that the way that AMD are counting the cores (integer cores, not FP cores) - means that there was only like 7-8% performance benefit for HPC applications (which isn't much given twice the "core" count).

    The FPU itself runs into something akin to thread contention issues. (It still boils down to fighting for FPU resources).

    But if say...for example, you have a properly, well coded Photoshop - and they are learning on how to write MPP codes from HPC, it can take what they already do quite well, and make it run even better. Fewer cores perhaps, but if the cores ARE available, it will know how to best break up the problem so that it would be able to better run the same task in parallel vs. the more like...quasi-parallel (multi-threaded) approach that a lot of these programs use nowadays.

    (Imagine if you're batch processing images and it's able to spawn multiple instances of the batch solver/processor so that you can work on multiple images at the same time rather than working on them one at a time, but working on them in a multi-threaded manner.)

    Or imagine if the Flash plugin was multi-core capable/aware so when you have 146 tabs, it doesn't crash your browser session. ;o) (Oh the joys of being a researcher.)
  • Kevin G - Monday, November 5, 2012 - link

    Any idea when these will hit e-tail? I have a dual socket G34 board for which two Opteron 6320's or two 6374's would be a good match. Still haven't decided between high clock and high core count. When you get up to 32 simultaneous threads, things really start to hit diminishing returns.
  • MySchizoBuddy - Monday, November 5, 2012 - link

    All of the new Opteron chips can be used in 4P configurations, while none of the listed Xeons can. Can you add the Xeons that work in 4P configurations as well?
  • Stuka87 - Monday, November 5, 2012 - link

    Xeons that work in Quad Socket configs cost significantly more and do not really compete with the Opterons.

    But it would be interesting to see the cost to performance difference between the two.
  • Kjella - Monday, November 5, 2012 - link

    Right under "AMD Opteron 6300 versus 6200 SKUs" the leftmost column says Xeon E5, where it should say Opteron 6300. Anyway, now AMD can't even get a review sample out the door? Seriously? Either they're too incompetent or the benchmarks would be too embarrassing, either way it's not good.
  • PsiAmp - Tuesday, November 6, 2012 - link

    Why are you comparing two CPUs that have 64% price difference and say cheaper one has 12% less performance and is not attractive?

    You need to compare products of similar price points. Or take into account price difference, which you didn't mention at all.
  • JohanAnandtech - Tuesday, November 6, 2012 - link

    Can you be more specific and tell me which CPU comparison you are talking about? The CPUs I compared had a 4 to 15% price difference ( 6386 SE vs 2665 or 6366HE vs 2630L).
  • DeaDSOuLz - Monday, November 12, 2012 - link

    Strange, I have had 2 Opteron 6376 for about 3 weeks. So getting them out early shouldn't have been an issue. Of course I bought about 2 thousand of the 6274 over the last 12 months, which may have something to do with it.
