With our 2010 server upgrade we're doing more than just replacing hardware: we're moving to a fully virtualized server environment. We're constructing two private clouds, one for our heavy database applications and one for everything else. The point of creating two independent clouds is to equip them with different levels of hardware - more memory and I/O for the DB cloud and something a bit more reasonable for the main cloud. Within each cloud we're duplicating all hardware completely to make our environment much more manageable.

The first hardware to arrive for the upgrade was our CPUs. We're moving from a 28-server setup to a 12-server environment. Each server has two CPU sockets, and we're populating them with Intel Xeon L5640s.


The L5640 is a 32nm Westmere processor with 6 cores and 12MB of L3 cache per chip. The L indicates a lower-voltage part. The L5640 carries a 60W TDP thanks to its lower 2.26GHz clock speed. We're mostly power constrained in our racks, so saving on power is a top priority.

Each server will have two of these chips, giving us 12 cores and 24 threads per server. We've reviewed Intel's Xeon X5670 as well as the L5640 in particular. As Johan concluded in his review, the L5640 makes sense for us since we have hard power limits at the rack level and are charged extra for overages.
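To put some rough numbers on the power angle, here's a quick back-of-envelope sketch comparing Intel's published TDPs for the 60W L5640 against the 95W X5670 across all 24 sockets. Keep in mind that TDP is a thermal design ceiling rather than measured draw, so treat this as an upper-bound illustration, not what our power meters will actually read.

    # Rough CPU power budget comparison using Intel's published TDP figures.
    # TDP is a design ceiling, not real-world draw, so these are upper bounds.

    SERVERS = 12            # the new environment: 12 dual-socket boxes
    SOCKETS_PER_SERVER = 2

    tdp_watts = {
        "Xeon L5640 (2.26GHz, low voltage)": 60,
        "Xeon X5670 (2.93GHz, standard voltage)": 95,
    }

    for cpu, tdp in tdp_watts.items():
        total = SERVERS * SOCKETS_PER_SERVER * tdp
        print(f"{cpu}: {total}W total CPU TDP")        # 1440W vs. 2280W

    saved = SERVERS * SOCKETS_PER_SERVER * (95 - 60)
    print(f"Picking the L5640 trims the CPU TDP budget by {saved}W")  # 840W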

There's not much else to show off at this point, but over the coming days and weeks you'll see more documentation from our upgrade process.


Hopefully this will result in better performance for all of you as well as more uptime, since we can easily scale hardware within our upcoming cloud infrastructure.

Comments

  • Toadster - Tuesday, August 31, 2010 - link

    You guys need to check out Intel Node Manager technology, where you can do group power capping at the server, rack, row, or whole-datacenter level. This is one of the use cases for Node Manager: capping power across a group of servers. You can run higher-performing CPUs as long as the power cap is sustained across the group (rack).
  • bryanW1995 - Tuesday, August 31, 2010 - link

    Why is there a hard power cap? Are you running this out of Anand's basement or something?
  • Pedro80 - Tuesday, August 31, 2010 - link

    Nice one :-)
  • Calin - Tuesday, August 31, 2010 - link

    Funny :)
    They probably are hosted in a large datacenter that is itself running at its peak power draw (they usually can't get any more power from the utility)
  • Casper42 - Wednesday, September 1, 2010 - link

    Lots of CoLo facilities charge you based on power utilization because of the density that newer servers and virtualization have brought. More power used means more heat produced, which must be cooled as well.

    In a shared CoLo, if every customer spiked their usage at the same time, you also run the risk of exceeding some choke point within the DC and bringing down a lot more than your own rack. So penalties for exceeding your power cap are motivation for the admins to keep their machines in line.

    Not to mention that AT is in Europe somewhere and they seem to be more power conscious over there than the US.
  • Adul - Tuesday, August 31, 2010 - link

    I missed these server upgrade articles. I had not realized how many servers AnandTech runs on now. Wow.
  • haplo602 - Tuesday, August 31, 2010 - link

    Anand, can you post your average HW fault ratios for the previous infrastructure?

    What bugs me most about the recent trend of many-core CPUs is that a failure in a CPU renders more and more resources unavailable with each generation.

    If you had a 4x4 config before (4 sockets with 4 cores each), one socket failure was 25% of CPU resources. Now you upgrade to a 2x6 config (2 sockets with 6 cores each), but each socket failure takes you down by 50% ...

    Maybe AT is not that critical on this, but quite a few applications are.
  • JasperJanssen - Tuesday, August 31, 2010 - link

    Well, turning off single defective chips isn't a feature that's present in simple dual/quad-socket Xeon systems in the first place. That's the sort of thing you get in HA systems, and it would quite simply (especially today) cost more than an n+2 arrangement of servers. Probably even when n is 1.

    The only way to run with one chip fewer is to drive to the datacenter, physically remove the dead chip, and put the server back into production. If you go to the trouble of that journey, you might as well bring along some spare parts and install them immediately.

    These sorts of features are slightly more common for memory, because a) there are typically more modules installed per server, and b) the chances of something going wrong with memory are higher. See, for example, the Wikipedia page for IBM's ChipKill, among others. Even so, they only use that on mid-range servers and upwards -- and basic Intel quads don't come under that heading.

    Anand's redundancy is based on the fact that the hardware pools consist of 12 physical servers, among which the running virtual servers can be (re)distributed as needed. My WAG is they won't even have anything in place to do that automatically, since there won't be all that many hardware failures in a system as relatively small as this.

    This structure is part of what makes virtualisation so attractive. It's not just that resources can be managed by putting multiple services onto a single host machine; it's especially that in the case of a hardware failure, or even excessive load on a single virtual service, you can redistribute the services to make better use of your hardware. Multi-core CPUs are one of the biggest driving forces behind virtualisation, that and the fact that many services are getting relatively less resource-hungry (i.e., they're not getting bigger at the same pace as the hardware improves -- think DNS, IRC, DHCP, etc.).
  • haplo602 - Tuesday, August 31, 2010 - link

    Eh, I guess working with HP 9000-class HW biases my view a bit. I really did not know that Xeon/Opteron systems were that dumb.
  • Milleman - Tuesday, August 31, 2010 - link

    Anand, it would be great if you could also mention the hardware wear and tear over long-term usage. How long did the cooling fans last? How many have you changed? Power supply breakdowns? Hard disk breakdowns?

    These are interesting points for those who like to build servers that will mechanically last as long as possible.

    /M
