Lenovo Intelligent Computing Orchestration (LiCO) Product Guide

Updated: 12 Nov 2018
Form Number: LP0858
PDF size: 16 pages, 1.3 MB

Abstract

Lenovo Intelligent Computing Orchestration (LiCO) is a software solution that simplifies the management and use of distributed clusters for High Performance Computing (HPC) workloads and Artificial Intelligence (AI) model development.

This product guide provides essential presales information to understand LiCO and its key features, specifications and compatibility. This guide is intended for technical specialists, sales specialists, sales engineers, IT architects, and other IT professionals who want to learn more about LiCO and consider its use in HPC solutions.

Change History

Changes in the November 12 update:

  • Updates for LiCO Version 5.2

Introduction

Lenovo Intelligent Computing Orchestration (LiCO) is a software solution that simplifies the management and use of distributed clusters for High Performance Computing (HPC) workloads and Artificial Intelligence (AI) model development. LiCO leverages an open source cluster management software stack, consolidating the management, monitoring and scheduling functions into a single platform.

The unified platform simplifies interaction with the underlying compute resources, enabling customers to take advantage of popular open source cluster tools while reducing the effort and complexity of using them for both HPC and AI.


Did You Know?

LiCO enables a single cluster to be used for both HPC and AI workloads simultaneously, with multiple users accessing the cluster at the same time. Running more workloads can increase utilization of cluster resources, driving more value from the environment.

What's New in LiCO 5.2

Lenovo recently announced LiCO Version 5.2, which improves the ease of use and capabilities of LiCO. New features include:

  • Queue management functionality, providing the ability to create and manage workload queues from within the GUI
  • Enablement on the Lenovo ThinkSystem SD650 and SR670 systems
  • Exclusive mode, to select whether to dedicate or share systems when requesting resources
  • Support for NVIDIA GPU Cloud (NGC) Container images
  • Lenovo Accelerated AI templates to provide easy-to-use training and inference functionality for a variety of AI use cases
  • Enhancements to storage management within LiCO

Part numbers

The following table lists the ordering information for LiCO.

Table 1. Ordering information
Description                                           | LFO        | Software CTO | Feature code
Lenovo HPC AI LiCO Software 90 Day Evaluation License | 7S090004WW | 7S09CTO2WW   | B1YC
Lenovo HPC AI LiCO Software w/1 yr S&S                | 7S090001WW | 7S09CTO1WW   | B1Y9
Lenovo HPC AI LiCO Software w/3 yr S&S                | 7S090002WW | 7S09CTO1WW   | B1YA
Lenovo HPC AI LiCO Software w/5 yr S&S                | 7S090003WW | 7S09CTO1WW   | B1YB

Note: LiCO is only configurable in the x-config configurator.
https://lesc.lenovo.com/products/hardware/configurator/worldwide/bhui/asit/x-config.jnlp

Features

For cluster users, LiCO provides the following benefits:

  • A web-based portal to execute, monitor and manage HPC and AI jobs on a distributed cluster
  • Enhanced end-user functionality to support AI model training and management
  • Workflow templates to provide an intuitive starting point for less experienced users
  • Management of private space on shared storage through the GUI
  • Monitoring of job progress and log access
  • Dedicated tools leveraging neural networks for image classification training (Intel Caffe-based)
  • In-flight visualizations, testing, and validation capabilities for image classification training (Intel Caffe-based)
  • TensorBoard visualization tools integrated into the interface (TensorFlow-based)
  • Lenovo Accelerated AI templates to provide training and inference capabilities for many common AI use cases
  • Container-based user management of supported AI frameworks (through Singularity)
  • Console access for advanced cluster users with command-line skills

For cluster administrators, LiCO provides the following benefits:

  • A single cluster management portal consolidating monitoring, alarms, and reporting
  • LiCO user management and multi-user support with user and billing groups
  • The ability to create, manage, and monitor queues for logically grouping compute resources
  • Compatibility with popular shared file systems (Spectrum Scale, NFS, Lustre)
  • Command-line access to the underlying open source stack components for skilled administrators
  • Report generation for job activity, alarms, and actions in the cluster
  • Generation of notifications and alarms based on cluster status

To meet the varying needs of an organization, the LiCO web portal supports three access roles: administrators, users, and operators.

Features for LiCO Administrators

For cluster administrators, LiCO provides a sophisticated monitoring solution, built on OpenHPC tooling. The following menus are available to administrators:

  • Home menu for administrators – provides dashboards giving a global overview of the health of the cluster. Utilization is given for the CPUs, GPUs, memory, storage, and network. Node status is given, indicating which nodes are being used for I/O, compute, login, and management. Job status is also given, indicating runtime for the current job, and the order of jobs in the queue. The Home menu is shown in the following figure.

    Figure 1. Administrator Home Menu

  • User menu – provides dashboards to control user groups and users, determining permissions and access levels (based on LDAP) for the organization. Administrators can also control and provision billing groups for accurate accounting.
  • Monitor menu – provides dashboards for interactive monitoring and reporting on cluster nodes, including a list of the nodes, or a physical look at the node topology. Administrators may also use the Monitor menu to drill down to the component level, examining statistics on cluster CPUs, GPUs, jobs, and operations. Administrators can access alerts that indicate when these statistics reach unwanted values (for instance, GPU temperature reaching critical levels). These alerts are created using the Settings menu. The figures below display the component, alert, and GPU view dashboards.

    Figure 2. Administrator Component dashboard

    Figure 3. Administrator Alert dashboard

    Figure 4. GPU View dashboard

  • Reports menu – allows administrators to generate reports on jobs, alerts, or actions for a given time interval. Administrators may export these reports as a spreadsheet, in a PDF, or in HTML.
  • Admin menu – allows administrators to examine processes and assets, monitor VNC sessions, and download web logs.
  • Settings menu – allows administrators to set up automated notifications and alerts. Administrators may enable the notifications to reach users and interested parties via email, SMS, and WeChat. Administrators may also enable notifications and alerts via uploaded scripts.

    The Settings menu also allows administrators to create and modify queues. These queues allow administrators to subdivide hardware based on different types or needs. For example, one queue may contain only systems with GPUs, while another queue may contain CPU-only systems. This allows the user running the job to select the queue that best suits their requirements. Within the Settings menu, administrators can also set the status of queues, bringing them up or down, draining them, or marking them inactive.
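
The queue statuses mentioned above (up, down, drain, inactive) resemble the partition states of SLURM, one of the schedulers validated with LiCO. The following Python sketch models those four states under the assumption that LiCO queues behave like SLURM partitions; the names QueueState, accepts_submissions, starts_queued_jobs, and pick_queue, as well as the queue names in the usage example, are hypothetical and are not part of LiCO.

```python
from enum import Enum


class QueueState(Enum):
    """Queue states, modelled on SLURM partition states (an assumption)."""
    UP = "UP"
    DOWN = "DOWN"
    DRAIN = "DRAIN"
    INACTIVE = "INACTIVE"


# (accepts new submissions, starts queued jobs) for each state, following
# the partition "State" descriptions in SLURM's slurm.conf documentation.
_BEHAVIOUR = {
    QueueState.UP: (True, True),
    QueueState.DOWN: (True, False),
    QueueState.DRAIN: (False, True),
    QueueState.INACTIVE: (False, False),
}


def accepts_submissions(state: QueueState) -> bool:
    """True if new jobs may be added to a queue in this state."""
    return _BEHAVIOUR[state][0]


def starts_queued_jobs(state: QueueState) -> bool:
    """True if jobs already waiting in the queue may start running."""
    return _BEHAVIOUR[state][1]


def pick_queue(queues: dict, needs_gpu: bool) -> str:
    """Return the first queue that accepts jobs and matches the GPU need.

    `queues` maps a queue name to a (QueueState, has_gpu) tuple.
    """
    for name, (state, has_gpu) in queues.items():
        if accepts_submissions(state) and has_gpu == needs_gpu:
            return name
    raise LookupError("no suitable queue is accepting submissions")


if __name__ == "__main__":
    # Hypothetical cluster: one GPU queue is draining, the other is up.
    example = {
        "gpu-a": (QueueState.DRAIN, True),
        "gpu-b": (QueueState.UP, True),
        "cpu": (QueueState.UP, False),
    }
    print(pick_queue(example, needs_gpu=True))
```

In this model a draining queue finishes the work it already holds but rejects new jobs, which is why the usage example skips the first GPU queue.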

Features for LiCO Operators

For the purpose of monitoring clusters but not overseeing user access, LiCO provides the Operator designation. LiCO Operators have access to a subset of the dashboards provided to Administrators; namely, the dashboards contained in the Home, Monitor, and Reports menus:

  • Home menu for operators – provides dashboards giving a global overview of the health of the cluster. Utilization is given for the CPUs, GPUs, memory, storage, and network. Node status is given, indicating which nodes are being used for I/O, compute, login, and management. Job status is also given, indicating runtime for the current job, and the order of jobs in the queue.
  • Monitor menu – provides dashboards for interactive monitoring and reporting on cluster nodes, including a list of the nodes, or a physical look at the node topology. Operators may also use the Monitor menu to drill down to the component level, examining statistics on cluster CPUs, GPUs, jobs, and operations. Operators can access alarms that indicate when these statistics reach unwanted values (for instance, GPU temperature reaching critical levels). These alarms are created by Administrators using the Settings menu (for more information on the Settings menu, see the Features for LiCO Administrators section).
  • Reports menu – allows operators to generate reports on jobs, alerts, or actions for a given time interval. Operators may export these reports as a spreadsheet, in a PDF, or in HTML.

Features for LiCO Users

Those designated as LiCO users have access to dashboards related specifically to HPC and AI tasks. Users can add jobs to the queue, and monitor their results through the dashboards. The following menus are available to users:

  • Home menu for users – provides dashboards giving a global overview of the resources available in the cluster. Availability is given for the CPUs, GPUs, memory, storage, and network. Jobs and job statuses are also given, indicating the runtime for the current job, and the order of jobs in the queue. Additionally, a list of recent job templates is given for both HPC and AI workloads. The figure below displays the home menu.

    Figure 5. User Home Menu

  • Submit job menu – allows users to set up a job and submit it to the queue. The user first picks a job template, then names the job, enters the relevant parameters, and submits it. LiCO displays key data about the queue through the templates, including whether the queue is up and which nodes or cores it can access. Users submitting jobs can select the Exclusive checkbox to dedicate systems to their job. For instance, if a job is known to be computationally demanding, selecting Exclusive ensures that LiCO does not provision any other jobs on the same node.

    Alternatively, if the Exclusive checkbox is not selected, the user can specify the number of CPU cores to use, which allows multiple jobs to run concurrently on that system. Depending on the selected template, the parameters relevant to the job will change.

    Users can take advantage of Lenovo Accelerated AI templates and industry-standard HPC and AI templates, submit generic jobs as scripts via the Common Job template, or create their own templates requesting specified parameters.

    The figures below display three job templates.

    Figure 6. AI Job Template

    Figure 7. HPC Job Template

    Figure 8. LiCO AI training template

  • Jobs menu – displays a dashboard listing queued jobs and their statuses. In addition, users can select a job and see its results and logs while the job is in progress or after completion.
  • Model Library menu – displays options for running neural network AI workloads with Intel Caffe. This includes a list of the available datasets, the neural network topologies that have been created, and the image classification models built from those datasets and topologies. Users can leverage existing datasets or partition new datasets into training, validation, and test data sets. The user can also use existing topologies to get started quickly with a model, or create new topologies to solve their given problem.

    After creating a model, users can train the model as a job. The model will be trained and the accuracy will be evaluated on both the training and validation datasets to help control for overfitting. LiCO provides users with graphs showing model statistics at each epoch including model accuracy, training loss, and processing speed. After the model has finished training, the user can navigate to the testing data set in order to perform unbiased model evaluation or comparison as needed. LiCO also supports pre-trained models – users can upload source, weight, and topology files to create accurate models while requiring little training.

    The following figure shows an example of the graph statistics for a given model run using Intel Caffe.

    Figure 9. Neural Network results

  • Expert mode menu – recommended for users who are familiar with the command line interface for the OpenHPC tools. The Expert mode menu provides console access that allows users to log in to the Management node where LiCO is located. Users who log in through this console can submit HPC and AI jobs and manage workloads using the CLI.
  • Admin menu – allows users to manage the AI framework containers and directly access active nodes with VNC. For AI jobs, users can upload Singularity containers with their own frameworks, including those from the NVIDIA GPU Cloud (NGC), allowing for flexibility in their framework environment. The VNC dashboard provides a real-time display of all the VNC sessions in a cluster created by the user. The Manage Files dashboard allows users to create, move, preview, edit, and delete files within their file system on the LiCO machine through a drag-and-drop interface.
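
The choice between Exclusive mode and a specified number of CPU cores, described under the Submit job menu, can be illustrated with a SLURM batch script, since SLURM is one of the schedulers validated with LiCO. The Python sketch below generates such a script. It is not LiCO's own implementation: the function build_sbatch_script and its parameters are hypothetical, and the mapping of the Exclusive checkbox to the sbatch option --exclusive is an assumption. The #SBATCH options themselves are standard SLURM options.

```python
from typing import Optional


def build_sbatch_script(job_name: str, queue: str, command: str,
                        exclusive: bool,
                        cpu_cores: Optional[int] = None) -> str:
    """Build a SLURM batch script for a dedicated or a shared job.

    Hypothetical helper: it only shows how the two submission modes
    could be expressed with standard sbatch options.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={queue}",
    ]
    if exclusive:
        # Dedicated mode: no other job shares the allocated nodes.
        lines.append("#SBATCH --exclusive")
    else:
        # Shared mode: request only the cores the job needs, so that
        # other jobs can run on the same system concurrently.
        if cpu_cores is None or cpu_cores < 1:
            raise ValueError("a shared job needs a positive cpu_cores value")
        lines.append("#SBATCH --ntasks=1")
        lines.append(f"#SBATCH --cpus-per-task={cpu_cores}")
    lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Queue names and commands below are placeholders.
    print(build_sbatch_script("train", "gpu", "python train.py",
                              exclusive=True))
    print(build_sbatch_script("postprocess", "cpu", "./postprocess.sh",
                              exclusive=False, cpu_cores=4))
```

The shared variant requests a single task with a fixed core count, which is one simple way to leave the remaining cores on a node available to other jobs.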

Lenovo Accelerated AI

Lenovo Accelerated AI provides a set of templates that aim to make AI training and inference simpler, more accessible, and faster to implement. The Accelerated AI templates differ from the other templates in LiCO in that they do not require the user to input a program; rather, they simply require a workspace (with associated directories) and a dataset.

The following use cases are supported with Lenovo Accelerated AI templates:

  • Image Classification
  • Object Detection
  • Instance Segmentation
  • Medical Image Segmentation
  • Seq2Seq
  • Memory Network
  • Image GAN

The following figure displays the Lenovo Accelerated AI templates.

Figure 10. Lenovo Accelerated AI templates

Each use case is supported by both a training and inference template. The training templates provide similar parameter inputs to those in the Models section of the Model Library tab, such as batch size and learning rate. These parameter fields are pre-populated with default values, but are fully tunable by those with data science knowledge. The templates also provide visual analytics with TensorBoard; the TensorBoard graphs continually update in-flight as the job runs, and the final statistics are available after the job has completed.

The following figure displays the embedded TensorBoard interface for a job. TensorBoard provides visualizations for TensorFlow jobs running in LiCO, whether through Lenovo Accelerated AI templates or the standard TensorFlow AI templates.

Figure 11. TensorBoard in LiCO

LiCO also provides inference templates, which allow users to run predictions on new data using models that have been trained with Lenovo Accelerated AI templates. For the inference templates, users only need to provide a workspace, an input directory (the location of the data on which inference will be performed), an output directory, and the location of the trained model. The job will run, and upon completion, the output directory will contain the analyzed data. For visual templates such as Object Detection, images can be previewed directly from within LiCO’s Manage Files interface.
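
The input-directory and output-directory pattern used by the inference templates can be sketched in Python as follows. This is a minimal illustration of the general batch-inference pattern, not LiCO's implementation: the function run_inference is hypothetical, the predict callable stands in for a trained model, and writing one JSON file per input is an assumed output format.

```python
import json
from pathlib import Path
from typing import Callable


def run_inference(input_dir: str, output_dir: str,
                  predict: Callable[[bytes], dict]) -> int:
    """Apply `predict` to every file in input_dir and store the results.

    One JSON result file per input file is written to output_dir.
    Returns the number of files processed.
    """
    src = Path(input_dir)
    dst = Path(output_dir)
    if not src.is_dir():
        raise NotADirectoryError(f"input directory not found: {src}")
    dst.mkdir(parents=True, exist_ok=True)

    processed = 0
    for item in sorted(src.iterdir()):
        if not item.is_file():
            continue  # skip sub-directories
        result = predict(item.read_bytes())
        (dst / f"{item.name}.json").write_text(json.dumps(result))
        processed += 1
    return processed


if __name__ == "__main__":
    import tempfile

    # Placeholder "model": reports the size of each input file.
    with tempfile.TemporaryDirectory() as workspace:
        inputs = Path(workspace) / "input"
        inputs.mkdir()
        (inputs / "cat.jpg").write_bytes(b"not a real image")
        count = run_inference(str(inputs), str(Path(workspace) / "output"),
                              lambda data: {"bytes": len(data)})
        print(f"processed {count} file(s)")
```

Keeping the model behind a callable means the directory handling stays the same regardless of which use case (for example, Object Detection) the trained model serves.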

The following two figures display an input file to the Object Detection inference template, as well as the corresponding output.

Figure 12. JPG file containing an image of a cat, used as input to the inference job


Figure 13. LiCO output displaying the section of the JPG containing the cat image

Subscription & Support

LiCO is enabled through a per-CPU and per-GPU subscription and support model. Once entitled for all the processors contained within the cluster, the customer has access to LiCO package updates and Lenovo support for the length of the acquired term.
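
As a rough sizing aid, the number of processors to entitle can be tallied from the cluster's node inventory. The Python sketch below assumes that "per-CPU" means per processor socket and that every CPU and GPU in the cluster is counted once. The NodeGroup class, the function processors_to_entitle, and the node counts in the usage example are hypothetical. Confirm actual entitlement quantities with your Lenovo sales representative.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeGroup:
    """A set of identical nodes in the cluster (hypothetical model)."""
    count: int
    cpus_per_node: int
    gpus_per_node: int = 0


def processors_to_entitle(groups: Iterable[NodeGroup]) -> dict:
    """Sum the CPUs and GPUs across all node groups."""
    groups = list(groups)
    cpus = sum(g.count * g.cpus_per_node for g in groups)
    gpus = sum(g.count * g.gpus_per_node for g in groups)
    return {"cpus": cpus, "gpus": gpus, "total": cpus + gpus}


if __name__ == "__main__":
    # Hypothetical cluster: eight 2-socket CPU nodes and
    # two 2-socket nodes with four GPUs each.
    cluster = [
        NodeGroup(count=8, cpus_per_node=2),
        NodeGroup(count=2, cpus_per_node=2, gpus_per_node=4),
    ]
    print(processors_to_entitle(cluster))
```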

Lenovo will provide interoperability support for all software tools defined as validated with LiCO, and development support (Level 3) for specific Lenovo-supported tools only. Open source and supported-vendor bugs/issues will be logged and tracked with their respective communities or companies if desired, with no guarantee from Lenovo for bug fixes. Additional support options may be available; please contact your Lenovo sales representative for more information.

LiCO can be acquired as part of a Lenovo Scalable Infrastructure (LeSI) solution or for “roll your own” (RYO) solutions outside of the LeSI framework, and LiCO software package updates are provided directly through the Lenovo Electronic Delivery system. More information on LeSI is available in the LeSI product guide, available from https://lenovopress.com/lp0900.

Validated Software Components

LiCO depends on a number of software components that must be installed before LiCO for it to function properly. Each LiCO software release is validated against a defined configuration of software tools and Lenovo systems, to make deployment more straightforward and enable support. Other management tools, hardware systems and configurations outside the defined stack may be compatible with LiCO, though not formally supported; to determine compatibility with other solutions, please check with your Lenovo sales representative.

The following software components are validated by Lenovo as part of the overall LiCO software solution entitlement:

  • Lenovo Development Support (L1-L3)
    • Graphical User Interface: LiCO
    • System Management & Provisioning: xCAT/Confluent
  • Lenovo Configuration Support (L1 only)
    • Job Scheduling & Orchestration: SLURM; Torque/Maui (HPC only)
    • System Monitoring: Nagios
    • Application Monitoring: Ganglia
    • Container Support (AI): Singularity
    • AI Frameworks (AI): Caffe, Intel-Caffe, TensorFlow, MxNet, Neon

The following software components are validated for compatibility with LiCO:

  • Supported by their respective software provider
    • Operating System: CentOS/RHEL 7.5, SUSE SLES 12 SP3
    • File Systems: IBM Spectrum Scale, Lustre
    • Job Scheduling & Orchestration: IBM Spectrum LSF
    • Development Tools: GNU compilers, Intel Cluster Toolkit

Supported servers

The following Lenovo servers are supported to run LiCO. Each server must run one of the supported operating systems as well as the validated software stack, as described in the Validated Software Components section.

  • ThinkSystem SD530 – The Lenovo ThinkSystem SD530 is an ultra-dense and economical two-socket server in a 0.5U rack form factor. With up to four SD530 server nodes installed in the ThinkSystem D2 enclosure, and the ability to cable and manage up to four D2 enclosures as one asset, you have an ideal high-density 2U four-node (2U4N) platform for enterprise and cloud workloads. The SD530 also supports a number of high-end GPU options with the optional GPU tray installed, making it an ideal solution for AI Training workloads. For more information, see the product guide at https://lenovopress.com/lp0635-thinksystem-sd530-server.
  • ThinkSystem SD650 – The Lenovo ThinkSystem SD650 direct water cooled server is an open, flexible and simple data center solution for users of technical computing, grid deployments, analytics workloads, and large-scale cloud and virtualization infrastructures. The direct water cooled solution is designed to operate by using warm water, up to 50°C (122°F). Chillers are not needed for most customers, meaning even greater savings and a lower total cost of ownership. The ThinkSystem SD650 is designed to optimize density and performance within typical data center infrastructure limits, being available in a 6U rack mount unit that fits in a standard 19-inch rack and houses up to 12 water-cooled servers in 6 trays. For more information, see the product guide at https://lenovopress.com/lp0636-thinksystem-sd650-direct-water-cooled-server.
  • ThinkSystem SR630 – The Lenovo ThinkSystem SR630 is an ideal 2-socket 1U rack server for small businesses up to large enterprises that need industry-leading reliability, management, and security, as well as maximizing performance and flexibility for future growth. The SR630 server is designed to handle a wide range of workloads, such as databases, virtualization and cloud computing, virtual desktop infrastructure (VDI), infrastructure security, systems management, enterprise applications, collaboration/email, streaming media, web, and HPC. For more information, see the product guide at https://lenovopress.com/lp0643-lenovo-thinksystem-sr630-server.
  • ThinkSystem SR650 – The Lenovo ThinkSystem SR650 is an ideal 2-socket 2U rack server for small businesses up to large enterprises that need industry-leading reliability, management, and security, as well as maximizing performance and flexibility for future growth. The SR650 server is designed to handle a wide range of workloads, such as databases, virtualization and cloud computing, virtual desktop infrastructure (VDI), enterprise applications, collaboration/email, and business analytics and big data. For more information, see the product guide at https://lenovopress.com/lp0644-lenovo-thinksystem-sr650-server.
  • ThinkSystem SR670 – The Lenovo ThinkSystem SR670 is a purpose-built 2-socket 2U 4-GPU node, designed for optimal performance for high-end computation required by both Artificial Intelligence and High Performance Computing workloads. Supporting the latest NVIDIA GPUs and Intel Xeon Scalable processors, the SR670 supports hybrid clusters for organizations that may want to consolidate infrastructure, improving performance and compute power, while maintaining optimal TCO. For more information, see the product guide at https://lenovopress.com/lp0923-thinksystem-sr670-server.

Additional Lenovo ThinkSystem and System x servers may be compatible with LiCO. Contact your Lenovo sales representative for more information.

LiCO Implementation services

Customers who do not have the cluster management software stack required to run with LiCO may engage Lenovo Professional Services to install LiCO and the necessary open-source software. Lenovo Professional Services can provide comprehensive installation and configuration of the software stack, including operation verification, as well as post-installation documentation for reference. Contact your Lenovo sales representative for more information.

Client PC requirements

A web browser is used to access LiCO's monitoring dashboards. To fully utilize LiCO’s monitoring and visualization capabilities, the client PC should meet the following specifications:

  • Hardware: CPU of 2.0 GHz or above and 1 GB or more of RAM
  • Display resolution: 1280 x 800 or higher
  • Browser: Chrome or Firefox is recommended

Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
System x®
ThinkSystem

The following terms are trademarks of other companies:

Intel® and Xeon® are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.