RAS Features of the Lenovo ThinkSystem SR950 and SR850

Published: 25 Oct 2017
Form Number: LP0777
PDF size: 10 pages, 523 KB

Abstract

Server downtime is very costly to enterprises, especially for business-critical or mission-critical workloads. Always-on operation has become a global requirement that affects almost every aspect of our lives. The Lenovo ThinkSystem SR850 and SR950 provide multiple levels of Reliability, Availability and Serviceability (RAS) capabilities to help keep servers and their applications running.

Introduction

Server reliability, availability, and serviceability (RAS) are crucial issues for modern enterprise IT shops that deliver mission-critical applications and services, and application delivery failures can be extremely costly per hour of system downtime. In addition, the likelihood of such failures increases statistically with the size of the servers, data, and memory required for these deployments.

Mission-critical applications such as database, enterprise resource planning (ERP), customer relationship management (CRM), and business intelligence (BI) applications need to be available 24/7 on a wide area or global basis.

While clustering and virtualization can help meet availability requirements, they are not adequate solutions for very large databases, BI, and high-end transactional systems. A failure affecting a single core business application can easily cost hundreds of thousands or even millions of dollars per hour. All this leads to a need for scalable and highly resilient servers that are well suited for critical business applications and large-scale consolidation.

Always On

Time is money. Even a few minutes of downtime can result in significant costs and bring internal business operations to a standstill. Downtime can also harm a company’s relationships with its customers, suppliers, and partners. A lack of reliability can damage a company’s reputation and result in lost business.

The growth of new applications has ratcheted database processing and business analytics to the top of the list for server workloads. These workloads demand continuous availability from the enterprise platforms on which they run.

"Always on" has become a global requirement and impacts many aspects of our lives:

  • Maximize productivity - Manufacturers need to keep their production line up and running. System downtime should not interrupt it.
  • Control access - Facility security companies prevent external threats to organizations. Security application downtime shouldn’t be an internal threat.
  • Protect profit - Retailers have sales targets to meet day in, day out. Transaction system downtime shouldn’t get in the way.
  • Protect lives - First responders take care of emergencies 24 x 7 x 365. Application downtime shouldn’t be one of them.
  • Ensure quality care and privacy - Healthcare institutions need to access patient information and be HIPAA compliant all the time. System downtime shouldn’t compromise either one.
  • Process transactions - Financial services organizations manage thousands of transactions a second. Processing system downtime simply can’t happen.

The Cost of Downtime

The ITIC 2016 survey found that 98% of organizations say that a single hour of downtime costs over $100,000; 81% of respondents indicated that 60 minutes of downtime costs their business over $300,000; and a record one-third (33%) of enterprises report that one hour of downtime costs their firms $1 million to over $5 million.

Figure 1. Cost of hourly downtime in enterprises, 2016-2017
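
To put hourly figures like these in context, the following sketch converts an availability percentage into expected annual downtime and cost. The availability levels and the flat $100,000 hourly rate (the lowest threshold cited in the survey) are illustrative assumptions, not survey results.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability_pct: float) -> float:
    """Expected hours of downtime per year for a given availability percentage."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_pct / 100.0)


def annual_downtime_cost(availability_pct: float, cost_per_hour: float) -> float:
    """Expected yearly cost of downtime at a flat (assumed) hourly rate."""
    return annual_downtime_hours(availability_pct) * cost_per_hour


if __name__ == "__main__":
    ASSUMED_COST_PER_HOUR = 100_000  # illustrative, in US dollars
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        cost = annual_downtime_cost(pct, ASSUMED_COST_PER_HOUR)
        print(f"{pct}% available: {hours:.2f} h/year, ${cost:,.0f}/year")
```

Under these assumptions, 99.9% availability corresponds to roughly 8.76 hours of downtime per year, or about $876,000 at $100,000 per hour.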

Server RAS Defined

RAS in relation to servers is defined as follows:

Reliability – Increasing the mean time between hardware failures and ensuring data integrity. Data integrity is protected through error detection and correction or, if an error is not correctable, through error containment.

  • Error Detection and Self-Healing
  • Minimizes outage opportunities
  • Correct results continuously

Availability – Refers to uninterrupted system and application operation even in the presence of uncorrectable errors

  • Reduce frequency and duration of outages
  • Self-diagnosing: work around faulty components or “self-heal”
  • Never stops or slows down

Serviceability – Means a system can be maintained without disrupting operation. This capability requires both thoughtful platform design and innovative systems management.

  • Avoid repeat failures with accurate diagnostics
  • Concurrent repair on higher failure rate items
  • Easy to repair and upgrade
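
One common way to relate these three terms is the steady-state availability formula, MTBF / (MTBF + MTTR): reliability raises the mean time between failures (MTBF), and serviceability lowers the mean time to repair (MTTR). The sketch below is a minimal illustration of that formula. The hour values are hypothetical and are not measurements of any Lenovo server.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as the fraction of time the system is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical scenarios, chosen only to show the direction of each effect.
    scenarios = {
        "baseline": (10_000, 4.0),
        "more reliable (MTBF doubled)": (20_000, 4.0),
        "more serviceable (MTTR quartered)": (10_000, 1.0),
    }
    for name, (mtbf, mttr) in scenarios.items():
        print(f"{name}: {availability(mtbf, mttr):.5%}")
```

Either lever improves availability, which is why the definitions above treat fault avoidance and fast, non-disruptive repair as complementary.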

Key RAS Features of the SR950 and SR850

The ThinkSystem SR950 and ThinkSystem SR850 support self-monitoring and self-healing capabilities. These capabilities enable the servers to monitor key subsystems for errors and automatically repair known issues.

Figure 2. SR850 (top) and SR950 (bottom)

Detecting and correcting problems (or isolating problems that cannot be immediately rectified) is important to maintain system integrity and protect mission-critical data. Support for multiple layers of system component redundancy and subsequent automated failover functionality ensures a higher level of availability. The SR950 and SR850 take advantage of predictive failure analysis to identify problematic components before they fail, allowing them to be replaced during regular maintenance cycles, and ultimately minimizing service costs.

Lenovo platform RAS innovation features include:

  • Automated processor failover
  • Automated firmware backup
  • Automated memory page sorting and page retire
  • Advanced transaction recovery

The servers also offer solution-level RAS with software stack integration:

  • VMware virtualization
  • Microsoft virtualization

The servers have the following RAS features:

  • Single Device Data Correction (SDDC, also known as Chipkill), Adaptive Double-Device Data Correction (ADDDC, also known as Redundant Bit Steering or RBS), memory mirroring, and memory rank sparing provide redundancy in the event of a non-correctable memory failure.
  • The Dual M.2 Boot Adapter supports RAID-1, which enables two installed M.2 drives to be configured as a redundant pair.
  • Hot-swap redundant power supplies and hot-swap redundant fans provide availability for mission-critical applications.
  • The power-source-independent light path diagnostics uses LEDs to lead the technician to failed (or failing) components, which simplifies servicing, speeds up problem resolution, and helps improve system availability.
  • The LCD system information display panel provides more detailed diagnostics by displaying all error messages and VPD data needed for a service call, thereby aiding with problem resolution and system uptime.
  • Hot-swap drives support RAID redundancy for data protection and greater system uptime.
  • Solid-state drives (SSDs) offer more reliability than traditional mechanical HDDs for greater uptime.
  • Proactive Platform Alerts (including PFA and SMART alerts) cover processors, voltage regulators, memory, internal storage (SAS/SATA HDDs and SSDs, NVMe SSDs, M.2 storage, flash storage adapters), fans, power supplies, RAID controllers, and server ambient and subcomponent temperatures.
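
Predictive failure analysis is commonly built on counting correctable errors per component and raising an alert when the count within a recent time window crosses a threshold. The sketch below illustrates that general idea only. The class name, threshold, and window length are hypothetical and do not describe Lenovo's actual PFA implementation.

```python
from collections import defaultdict, deque


class CorrectableErrorMonitor:
    """Flags a component when too many correctable errors fall inside a sliding window.

    Timestamps are assumed to arrive in non-decreasing order.
    """

    def __init__(self, threshold: int = 10, window_seconds: float = 3600.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # component id -> recent error timestamps

    def record(self, component: str, timestamp: float) -> bool:
        """Record one correctable error; return True if the component should be flagged."""
        events = self._events[component]
        events.append(timestamp)
        # Discard errors that have aged out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


if __name__ == "__main__":
    monitor = CorrectableErrorMonitor(threshold=3, window_seconds=60.0)
    for t in (0.0, 10.0, 20.0):
        flagged = monitor.record("DIMM 7", t)
        print(f"t={t:>5}: flagged={flagged}")
```

A windowed count, rather than a lifetime total, lets isolated transient errors age out while a burst of errors on one component still triggers an alert before the component fails outright.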

RAS Features with Lenovo XClarity

In addition to the key RAS features of the SR950 and SR850 listed above, Lenovo XClarity, a centralized systems management solution, continuously monitors system parameters, triggers alerts, and performs recovery actions in case of failures to minimize downtime.

XClarity has the following RAS features:

  • Provides the tools that administrators need to deploy platforms more quickly and manage them more easily.
  • Allows servers to ‘call home’ if they detect an issue, so that a potential problem can be addressed before it causes an outage.
  • XClarity Provisioning Manager collects and saves service data to a USB key or a remote CIFS shared folder for troubleshooting and to reduce service time.
  • The XClarity Administrator Mobile app, running on a supported smartphone and connected to the server through the service-enabled USB port, enables additional local systems management functions.
  • Automatically restarts the server after a momentary loss of AC power (based on the power policy setting in the XClarity Controller service processor).
  • Collects and downloads diagnostic data, including logs, service data, and inventory, to help identify the cause of an issue.

Unique Hardware Serviceability

The design of the ThinkSystem SR950 is based on a modular service model where access is from the front and rear only. This means that nearly all parts can be removed from the front or rear of the system, even parts that are located in the center of the server (e.g., fans, memory DIMMs, and processors). This design helps reduce time and cost associated with installing and maintaining systems, and can reduce the chance of errors occurring while working with the system.

Figure 3. Fans are accessible from the front of the SR950 server

To learn more about the design and usability of SR950, read the article Usability in the Design of the ThinkSystem SR950.

Intel RAS Features

The Intel Xeon Scalable Family processors offer Advanced and Standard RAS features.

  • Bronze and Silver processors support Standard RAS features.
  • Gold and Platinum processors support Standard and Advanced RAS features.

The SR950 and SR850 use Gold and Platinum processors exclusively, so they offer both Standard and Advanced RAS features.

The following table lists the Intel Advanced RAS features.

Table 1. Intel Advanced RAS features

| Advanced RAS feature | Category | Benefit |
| --- | --- | --- |
| Viral Mode of error containment | Reliability | Enhanced error containment to improve data integrity; complementary to corrupt data containment mode |
| MCA Recovery-Execution path | Reliability | OS layer assisted recovery from uncorrectable data errors to prevent system reset |
| MCA Recovery-Non execution path | Reliability | OS layer assisted recovery from uncorrectable data errors detected by Patrol scrubber or LLC Explicit Write Back |
| Local Machine Check (LMCE) based Recovery | Reliability | Enhances MCA recovery-Execution path event, and increases the possibility of recovery |
| SDDC +1, Adaptive DDDC (MR) +1 | Reliability | Adaptive virtual lockstep delivers up to two DRAM device corrections. Also supports single DRAM correction, as well as single-bit correction post final DRAM device map out. |
| PCI Express Live Error Recovery | Reliability | PCIe root port error containment, and the opportunity to dynamically recover from the error |
| Intel® UPI Dynamic Link width reduction | Availability | Enables the interconnect to continue operating in the presence of a persistent interconnect link failure |
| Address range/Partial Memory Mirroring | Reliability | OS managed memory mirroring of selective ranges, increases data integrity at efficient cost |
| MCA 2.0 Recovery (as per eMCA gen2 architecture) | Reliability | Firmware first model enables a reliable error sourcing capability with the ability to write to the MSR |

The following table lists the Intel Standard RAS features.

Table 2. Intel Standard RAS features

| Standard RAS feature | Category | Benefit |
| --- | --- | --- |
| Advanced Error Detection and Correction (AEDC) | Reliability | Enhanced fault coverage within processor cores, and attempt to recover via instruction retry |
| Error Detection and Correction | Reliability | Extensive error detection and correction capability across the silicon and the interconnects |
| Corrupt Data containment-Core | Reliability | Uncorrectable data explicitly marked and delivered synchronously to the consuming core to assist error containment and increase system reliability |
| Corrupt Data containment-UnCore | Reliability | Uncorrectable data explicitly marked and delivered synchronously to the requestor, to assist error containment and increase system reliability |
| SDDC, Adaptive Data Correction (SR) | Reliability | Adaptive virtual lockstep delivers single DRAM device corrections, at bank granularity. Also supports single DRAM correction. |
| PCIe “Stop and Scream” | Reliability | PCIe root port corrupt data containment feature, increases data integrity |
| Memory Mirroring- Intra iMC | Reliability | Increase data integrity by creating a redundant/mirrored copy of data in system DRAM |
| DDR4 memory RANK Sparing | Reliability | Reserved/spare DRAM ranks are used to dynamically map out the failing DRAM rank into the spare ranks |
| Predictive Failure Analysis | Serviceability | Extensive error logs to assist software in predicting failures |
| Failed DIMM Isolation | Serviceability | Extensive error logs to help software identify the failing DIMM |
| Virtual (soft) Partitioning | Reliability | Virtual Machine Monitor ability to make use of hardware recovery, signaling and error logs |
| Error reporting via IOMCA | Serviceability | Unified error reporting of the IIO logic to the OS |
| Error reporting through MCA 2.0 (eMCA gen2) | Serviceability | Firmware first model enables a reliable error sourcing capability |
| Error reporting through eMCA gen1 | Serviceability | Firmware first model enables reliable error sourcing capability |
| PCIe Card Hot Plug NVMe (Add, Remove, Swap) | Serviceability | Hot add and replace of NVMe drives |
| PCI Express ECRC | Reliability | PCI Express end-to-end CRC checking, increasing system reliability |
| PCIe Corrupt Data Containment (Data Poisoning) | Reliability | PCIe corrupt data mode of operation, synchronous signaling of the corrupted data along with data, increases system reliability |
| PCIe Link CRC Error Check and Retry | Reliability | PCIe link CRC error check and retry; improves system reliability and recovery from transient errors |
| PCIe Link Retraining and Recovery | Reliability | PCIe link retraining and attempted recovery from persistent link transient errors |
| Mem SMBus hang recovery | Reliability | Software ability to reset memory SMBus interface to recover from hang condition |
| DDR4 Command/Address Parity Check and Retry | Reliability | DDR4 address and command parity check and retry in the event of errors |
| Time-out timer Schemes | Serviceability | Hierarchy of transaction time-outs to assist system debug and reliable error sourcing |
| Intel® UPI Link Level Retry | Reliability | Intel UPI link’s ability to perform CRC check and retry on errors for higher degree of system reliability |
| Intel® UPI Protocol Protection via 16 bit Rolling CRC | Reliability | Detection of transient data errors over Intel UPI interconnects, via 16-bit CRC error checking |
| Processor BIST | Serviceability | At power-up, the processor’s built-in self-test engine tests the internal cache structure and provides the results to the system BIOS |
| Socket disable for FRB | Availability | The capability to selectively disable a socket at boot time, allowing the system to power on in a failover configuration |
| Core disable for FRB | Availability | The capability to disable failing cores at boot time and map out the failing core |
| PIROM for System Information Storage | Serviceability | On-package Processor Information ROM |
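
Several of the link-level features above (PCIe Link CRC Error Check and Retry, Intel UPI Link Level Retry) share one pattern: the sender appends a CRC, the receiver recomputes it, and a mismatch triggers retransmission. The sketch below shows that pattern in software using `binascii.crc_hqx` from the Python standard library (a 16-bit CRC-CCITT). The polynomial, initial value, frame layout, retry limit, and function names are assumptions for illustration; the actual hardware implementations are not described in this article.

```python
import binascii
from typing import Callable

CRC_INIT = 0xFFFF  # assumed initial value for the 16-bit CRC


def make_frame(payload: bytes) -> bytes:
    """Append a 16-bit CRC (big-endian) to the payload."""
    crc = binascii.crc_hqx(payload, CRC_INIT)
    return payload + crc.to_bytes(2, "big")


def check_frame(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailing CRC."""
    if len(frame) < 2:
        return False
    payload, received_crc = frame[:-2], int.from_bytes(frame[-2:], "big")
    return binascii.crc_hqx(payload, CRC_INIT) == received_crc


def send_with_retry(
    payload: bytes,
    channel: Callable[[bytes], bytes],
    max_attempts: int = 3,
) -> bytes:
    """Send a frame through `channel`, retrying while the receiver's CRC check fails."""
    frame = make_frame(payload)
    for _ in range(max_attempts):
        received = channel(frame)
        if check_frame(received):
            return received[:-2]  # strip the CRC and deliver the payload
    raise RuntimeError("persistent link error: retries exhausted")
```

A transient error is absorbed by a retry and never reaches the consumer, while a persistent error exhausts the retry budget and is escalated, which is the point at which features such as link retraining or link width reduction would take over in hardware. The CRC-CCITT variant is used here only because it ships with Python.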

 

Conclusion

The Lenovo ThinkSystem SR950 and SR850 RAS technologies drive the outstanding system availability and uninterrupted application performance needed to host business or mission-critical applications.

Enterprises whose productivity and success depend on large-scale, mission-critical applications require a scale-up, high-availability server. The SR950 and SR850 belong on the shortlist for any enterprise that is looking at refreshing its high-availability or mission-critical systems.

About the author

Randall Lundin is the Mission Critical Product Manager in the Lenovo Data Center Group. He is responsible for managing and planning Lenovo’s 4-socket and 8-socket servers. Randall has also authored and contributed to numerous Lenovo Press publications in the Mission Critical space.

Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo XClarity
Lenovo®
ThinkAgile
ThinkSystem

The following terms are trademarks of other companies:

Intel® and Xeon® are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Microsoft® and SQL Server® are trademarks of Microsoft Corporation in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.