Lenovo Intelligent Computing Orchestration (LiCO) Product Guide

Updated: 16 Apr 2019
Form Number: LP0858
PDF size: 19 pages, 2.4 MB

Abstract

Lenovo Intelligent Computing Orchestration (LiCO) is a software solution that simplifies the management and use of distributed clusters for High Performance Computing (HPC) workloads and Artificial Intelligence (AI) model development.

This product guide provides essential presales information to understand LiCO and its key features, specifications and compatibility. This guide is intended for technical specialists, sales specialists, sales engineers, IT architects, and other IT professionals who want to learn more about LiCO and consider its use in HPC solutions.

Change History

Changes in the 16 April 2019 update:

  • Updated for LiCO 5.3

Introduction

Lenovo Intelligent Computing Orchestration (LiCO) is a software solution that simplifies the management and use of distributed clusters for High Performance Computing (HPC) workloads and Artificial Intelligence (AI) model development. LiCO leverages an open source cluster management software stack, consolidating the management, monitoring and scheduling functions into a single platform.

The unified platform simplifies interaction with the underlying compute resources, enabling customers to take advantage of popular open source cluster tools while reducing the effort and complexity of using them for both HPC and AI.


Did You Know?

LiCO enables a single cluster to be used for both HPC and AI workloads simultaneously, with multiple users accessing the cluster at the same time. Running more workloads can increase utilization of cluster resources, driving more value from the environment.

What's new in LiCO 5.3

Lenovo recently announced LiCO version 5.3, which improves the ease of use and capabilities of LiCO with the following enhancements:

  • End-to-end AI training workflows for Image Classification, Object Detection, and Instance Segmentation
  • Option to copy existing jobs into the original template, with existing parameters pre-filled and modifiable
  • Enablement on the Lenovo ThinkSystem SR950
  • Support for the Keras and Chainer AI frameworks, and the latest MXNet optimizations for Intel CPU training
  • Integration support for HBase and MongoDB big data sources
  • Integration support for trained AI model publishing to git repositories
  • REST interface to instantiate LiCO AI training functions from DevOps tools

Part numbers

The following table lists the ordering information for LiCO.

Table 1. Ordering information
Description                                           | LFO        | Software CTO | Feature code
Lenovo HPC AI LiCO Software 90 Day Evaluation License | 7S090004WW | 7S09CTO2WW   | B1YC
Lenovo HPC AI LiCO Software w/1 yr S&S                | 7S090001WW | 7S09CTO1WW   | B1Y9
Lenovo HPC AI LiCO Software w/3 yr S&S                | 7S090002WW | 7S09CTO1WW   | B1YA
Lenovo HPC AI LiCO Software w/5 yr S&S                | 7S090003WW | 7S09CTO1WW   | B1YB

Note: LiCO is only configurable in the x-config configurator.
https://lesc.lenovo.com/products/hardware/configurator/worldwide/bhui/asit/x-config.jnlp

Features

For cluster users, LiCO provides the following benefits:

  • A web-based portal to execute, monitor and manage HPC and AI jobs on a distributed cluster
  • Enhanced end-user functionality to support AI model training and management
  • Workflow templates to provide an intuitive starting point for less experienced users
  • Management of private space on shared storage through the GUI
  • Monitoring of job progress and log access
  • HPC Runtime module definition and pre-loading with job execution
  • Lenovo Accelerated AI templates to provide training and inference capabilities for many common AI use cases
  • TensorBoard visualization tools integrated into the interface (TensorFlow-based)
  • Container-based user management of supported AI frameworks and HPC applications (through Singularity)
  • Console access for advanced cluster users with command-line skills

For cluster administrators, LiCO provides the following benefits:

  • A single cluster management portal consolidating monitoring, alarms, and reporting
  • LiCO user management and multi-user support with user and billing groups
  • The ability to create, manage, and monitor queues for logically grouping compute resources
  • Compatibility with popular shared file systems (Spectrum Scale, NFS, Lustre)
  • Command-line access to the underlying open source stack components for skilled administrators
  • Report generation for job activity, alarms, and actions in the cluster
  • Generation of notifications and alarms based on cluster status

To facilitate the varying needs of an organization, the LiCO web portal supports three access roles: administrators, operators, and users.

Features for LiCO Administrators

For cluster administrators, LiCO provides a sophisticated monitoring solution, built on OpenHPC tooling. The following menus are available to administrators:

  • Home menu for administrators – provides dashboards giving a global overview of the health of the cluster. Utilization is given for the CPUs, GPUs, memory, storage, and network. Node status is given, indicating which nodes are being used for I/O, compute, login, and management. Job status is also given, indicating runtime for the current job, and the order of jobs in the queue. The Home menu is shown in the following figure.

    LiCO Administrator Home Menu
    Figure 1. Administrator Home Menu

  • User menu – provides dashboards to control user groups and users, determining permissions and access levels (based on LDAP) for the organization. Administrators can also control and provision billing groups for accurate accounting.
  • Monitor menu – provides dashboards for interactive monitoring and reporting on cluster nodes, including a list of the nodes, or a physical look at the node topology. Administrators may also use the Monitor menu to drill down to the component level, examining statistics on cluster CPUs, GPUs, jobs, and operations. Administrators can access alerts that indicate when these statistics reach unwanted values (for instance, GPU temperature reaching critical levels). These alerts are created using the Settings menu. The figures below display the Component, Alert, and GPU View dashboards.

    Administrator Component dashboard
    Figure 2. Administrator Component dashboard

    Administrator Alert dashboard
    Figure 3. Administrator Alert dashboard

    GPU View Dashboard
    Figure 4. GPU View dashboard

  • Reports menu – allows administrators to generate reports on jobs, alerts, or actions for a given time interval. Administrators may export these reports as a spreadsheet, PDF, or HTML file.
  • Admin menu – provides the administrator with the capability to examine processes and assets, monitor VNC sessions, and download web logs.
  • Settings menu – allows administrators to set up automated notifications and alerts. Notifications can reach users and interested parties via email, SMS, or WeChat, and notifications and alerts can also be enabled through uploaded scripts.

    The Settings menu also allows administrators to create and modify queues, which subdivide the cluster hardware by type or need. For example, one queue may contain only GPU-equipped systems while another contains CPU-only systems, allowing users to select the queue that best matches their job's requirements. Within the Settings menu, administrators can also set the status of queues, bringing them up or down, draining them, or marking them inactive (a sketch of the equivalent scheduler operations is shown below).
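
Under the hood, these queues correspond to queues or partitions in the scheduler from the validated software stack (SLURM, or Torque/Maui for HPC-only deployments; see the Validated software components section). As a rough illustration of what the Settings menu automates, and not a description of LiCO's internal implementation, the following sketch changes the state of a Slurm partition from the command line; the partition name gpu is a hypothetical example.

```python
# Minimal sketch, assuming a Slurm-backed cluster (the scheduler in LiCO's
# validated stack). The partition name "gpu" is a hypothetical example.
import subprocess

def set_queue_state(partition: str, state: str) -> None:
    """Set a Slurm partition (a LiCO queue) to UP, DOWN, DRAIN, or INACTIVE."""
    subprocess.run(
        ["scontrol", "update", f"PartitionName={partition}", f"State={state}"],
        check=True,
    )

def show_queue(partition: str) -> str:
    """Return the current status of a partition, as reported by sinfo."""
    result = subprocess.run(
        ["sinfo", "-p", partition], capture_output=True, text=True, check=True
    )
    return result.stdout

if __name__ == "__main__":
    set_queue_state("gpu", "DRAIN")   # stop new jobs while running jobs finish
    print(show_queue("gpu"))
```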

Features for LiCO Operators

For monitoring clusters without overseeing user access, LiCO provides the Operator role. LiCO Operators have access to a subset of the dashboards provided to Administrators, namely those contained in the Home, Monitor, and Reports menus:

  • Home menu for operators – provides dashboards giving a global overview of the health of the cluster. Utilization is given for the CPUs, GPUs, memory, storage, and network. Node status is given, indicating which nodes are being used for I/O, compute, login, and management. Job status is also given, indicating runtime for the current job, and the order of jobs in the queue.
  • Monitor menu – provides dashboards for interactive monitoring and reporting on cluster nodes, including a list of the nodes, or a physical look at the node topology. Operators may also use the Monitor menu to drill down to the component level, examining statistics on cluster CPUs, GPUs, jobs, and operations. Operators can access alerts that indicate when these statistics reach unwanted values (for instance, GPU temperature reaching critical levels). These alerts are created by Administrators using the Settings menu (for more information, see the Features for LiCO Administrators section).
  • Reports menu – allows operators to generate reports on jobs, alerts, or actions for a given time interval. Operators may export these reports as a spreadsheet, PDF, or HTML file.

Features for LiCO Users

Those designated as LiCO users have access to dashboards related specifically to HPC and AI tasks. Users can add jobs to the queue, and monitor their results through the dashboards. The following menus are available to users:

  • Home menu for users – provides dashboards giving a global overview of the resources available in the cluster. Availability is given for the CPUs, GPUs, memory, storage, and network. Jobs and job statuses are also given, indicating the runtime for the current job, and the order of jobs in the queue. Additionally, a list of recent job templates is given for both HPC and AI workloads. The figure below displays the home menu.

    User Home Menu
    Figure 5. User Home Menu

  • Submit job menu – allows users to set up a job and submit it to the queue. The user first selects a job template, then names the job, enters the relevant parameters, and submits it. LiCO displays key data about the queue through the templates, including whether the queue is up and which nodes or cores it can access. Users submitting jobs can select the Exclusive checkbox to dedicate systems to their job; for instance, if a job is known to be computationally demanding, selecting Exclusive ensures that LiCO will not schedule any other jobs on the same node.

    Alternatively, if the Exclusive checkbox is not selected, the user can specify the number of CPU cores to use, which allows multiple jobs to run concurrently on the same system. Depending on the selected template, the parameters relevant to the job will change. A minimal sketch of an equivalent command-line submission follows this list.

    Users can take advantage of Lenovo Accelerated AI templates and industry-standard HPC and AI templates, submit generic jobs as scripts via the Common Job template, or create their own templates requesting specified parameters.

    The two figures below display two job templates.

    AI Job Template
    Figure 6. AI Job Template

    LiCO HPC Job Template
    Figure 7. HPC Job Template

    LiCO also provides TensorBoard monitoring when running certain TensorFlow workloads, as shown in the following figure.

    LiCO and TensorBoard monitoring
    Figure 8. LiCO and TensorBoard monitoring

  • Jobs menu – displays a dashboard listing queued jobs and their statuses. In addition, users can select a job to see its results and logs, both while it is in progress and after completion.
  • Expert mode menu – recommended for users who are familiar with the command-line interface for the OpenHPC tools. The Expert mode menu provides console access that allows users to log in to the management node where LiCO is located. Users who log in through this console can submit HPC and AI jobs and manage workloads using the CLI.
  • AI Studio menu – allows users to label data, optimize hyperparameters, and test and publish trained models from within an end-to-end workflow in LiCO. AI Studio supports Image Classification, Object Detection, and Instance Segmentation workflows.
  • Admin menu – allows users to access a number of capabilities not directly associated with deploying workloads to the cluster. From the Admin menu, the user can manage Singularity container images for deployment through job templates, access active VNC sessions, access shared storage space through a drag-and-drop interface, pre-define runtime modules, and provision API and git interfaces.
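
For reference, the web form's scheduling options map onto ordinary batch submissions on the underlying scheduler. The sketch below assumes a Slurm-backed cluster and shows roughly how the Exclusive checkbox and the CPU-core count correspond to sbatch options; the script name train.sh and the queue name gpu are hypothetical examples, and this illustrates scheduler behavior rather than LiCO's internal implementation.

```python
# Minimal sketch, assuming a Slurm-backed cluster. The script name "train.sh"
# and the queue name "gpu" are hypothetical examples.
import subprocess

def submit_job(script: str, queue: str, name: str,
               exclusive: bool = False, cpu_cores: int = 1) -> None:
    """Submit a batch job roughly the way LiCO's Submit job form does."""
    cmd = ["sbatch", f"--job-name={name}", f"--partition={queue}"]
    if exclusive:
        cmd.append("--exclusive")                    # dedicate whole nodes to this job
    else:
        cmd.append(f"--cpus-per-task={cpu_cores}")   # share nodes with other jobs
    cmd.append(script)
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    submit_job("train.sh", queue="gpu", name="demo", exclusive=False, cpu_cores=4)
```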

Lenovo Accelerated AI

Lenovo Accelerated AI provides a set of templates that aim to make AI training and inference simpler, more accessible, and faster to implement. The Accelerated AI templates differ from the other templates in LiCO in that they do not require the user to input a program; rather, they simply require a workspace (with associated directories) and a dataset.

The following use cases are supported with Lenovo Accelerated AI templates:

  • Image Classification
  • Object Detection
  • Instance Segmentation
  • Medical Image Segmentation
  • Seq2Seq
  • Memory Network
  • Image GAN

The following figure displays the Lenovo Accelerated AI templates.

Lenovo Accelerated AI templates
Figure 9. Lenovo Accelerated AI templates

The following figure displays the list of template parameters for the Image Classification - Train template.

Lenovo Accelerated AI training template
Figure 10. Lenovo Accelerated AI training template

Each use case is supported by both a training and inference template. The training templates provide similar parameter inputs to those in the Models section of the Model Library tab, such as batch size and learning rate. These parameter fields are pre-populated with default values, but are fully tunable by those with data science knowledge. The templates also provide visual analytics with TensorBoard; the TensorBoard graphs continually update in-flight as the job runs, and the final statistics are available after the job has completed.
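
As an illustration of the kind of parameters these templates expose, and not LiCO's internal code, the following sketch trains a small TensorFlow/Keras model with a tunable batch size and learning rate and writes TensorBoard event files that a TensorBoard instance can visualize while the job runs; the log directory and the stand-in dataset are hypothetical placeholders.

```python
# Illustrative sketch only: shows the kind of parameters (batch size, learning
# rate) and TensorBoard logging the training templates expose. The log
# directory and the stand-in dataset are hypothetical placeholders.
import tensorflow as tf

BATCH_SIZE = 32        # pre-populated default in the template, tunable
LEARNING_RATE = 1e-3   # pre-populated default in the template, tunable
LOG_DIR = "./tensorboard_logs"

# Small stand-in dataset (MNIST) for demonstration purposes.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# The TensorBoard callback writes event files that a TensorBoard instance
# (such as the one embedded in LiCO) can visualize while the job runs.
tb = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR)
model.fit(x_train, y_train, batch_size=BATCH_SIZE, epochs=2, callbacks=[tb])
```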

The following figure displays the embedded TensorBoard interface for a job. TensorBoard provides visualizations for TensorFlow jobs running in LiCO, whether through Lenovo Accelerated AI templates or the standard TensorFlow AI templates.

TensorBoard in LiCO
Figure 11. TensorBoard in LiCO

LiCO also provides inference templates, which allow users to run predictions on new data using models that have been trained with Lenovo Accelerated AI templates. For the inference templates, users only need to provide a workspace, an input directory (the location of the data on which inference will be performed), an output directory, and the location of the trained model. The job will run, and upon completion, the output directory will contain the analyzed data. For visual templates such as Object Detection, images can be previewed directly from within LiCO’s Manage Files interface.
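
As a rough sketch of what such an inference run involves, and not the templates' actual implementation, the following example loads a previously trained Keras model and writes a prediction for each image found in an input directory to an output directory; the directory names, model path, and input size are hypothetical.

```python
# Minimal sketch, not LiCO's implementation: load a trained model and run
# inference over every image in an input directory, writing one prediction
# file per image to the output directory. Paths and input size are examples.
import os
import numpy as np
import tensorflow as tf

MODEL_PATH = "workspace/trained_model"   # location of the trained model
INPUT_DIR = "workspace/input"            # images to run inference on
OUTPUT_DIR = "workspace/output"          # where results are written

model = tf.keras.models.load_model(MODEL_PATH)
os.makedirs(OUTPUT_DIR, exist_ok=True)

for name in os.listdir(INPUT_DIR):
    if not name.lower().endswith((".jpg", ".jpeg", ".png")):
        continue
    raw = tf.io.read_file(os.path.join(INPUT_DIR, name))
    img = tf.image.decode_image(raw, channels=3, expand_animations=False)
    img = tf.image.resize(img, (224, 224)) / 255.0   # assumed model input size
    preds = model.predict(tf.expand_dims(img, 0))
    np.savetxt(os.path.join(OUTPUT_DIR, name + ".txt"), preds)
```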

The following two figures display an input file to the Object Detection inference template, as well as the corresponding output.

Photo of a cat
Figure 12. JPG file containing image of cat for input into inference job


Figure 13. LiCO output displaying the section of the JPG containing the cat image

AI Studio

LiCO AI Studio provides an end-to-end workflow for Image Classification, Object Detection, and Instance Segmentation, with training based on Lenovo Accelerated AI pre-defined models. A user can import an unprocessed, unlabeled set of images, label them, train multiple instances across a grid of parameter values, test the output models for validation, and publish them to a git repository for use in an application environment. Additionally, users can initiate the steps in AI Studio through a REST API call to use LiCO as part of a DevOps toolchain.
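
The REST interface lets DevOps tools drive these steps programmatically. The following sketch illustrates the general pattern only; the host, endpoint path, payload fields, and token are hypothetical placeholders rather than LiCO's documented API, which should be taken from the LiCO documentation.

```python
# Hypothetical illustration of driving a LiCO AI Studio training step over
# REST from a DevOps pipeline. The host, endpoint path, payload fields, and
# token are placeholders, not LiCO's documented API.
import requests

LICO_HOST = "https://lico.example.com"   # hypothetical host
TOKEN = "..."                            # token obtained from a LiCO login

payload = {
    "workflow": "image_classification",  # hypothetical field names
    "dataset": "cats_vs_dogs",
    "hyperparameters": {"batch_size": 32, "learning_rate": 0.001},
}

resp = requests.post(
    f"{LICO_HOST}/api/ai-studio/train",  # hypothetical endpoint
    json=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
print("Submitted training job:", resp.json())
```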

LiCO dataset file with labeled image
Figure 14. LiCO dataset file with labeled image 

LiCO AI Studio model tuning
Figure 15. LiCO AI Studio model tuning

LiCO Trained model repository displaying published model location
Figure 16. Trained model repository displaying published model location

HPC Runtime Module Management

LiCO allows the user to pre-define modules and environment variables to load at the time of job execution through job submission templates. These user-defined modules eliminate the need to manually load required modules before job submission, further simplifying the process of running HPC workloads on the cluster. Through the Runtime interface, users can choose from the modules available on the system, define their loading order, and specify environment variables for repeatable, reliable job deployment.
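
As an illustration of what a Runtime definition automates, and not LiCO's internal code, the sketch below assembles a job script that loads a set of environment modules in a defined order and exports environment variables before the workload runs; the module names, variable, and application command are hypothetical examples.

```python
# Illustrative sketch: compose a job script that loads environment modules in
# a defined order and sets environment variables, the way a LiCO Runtime
# definition does automatically. Module names and variables are hypothetical.
MODULES = ["gnu8", "openmpi3", "fftw"]      # loaded in this order
ENV_VARS = {"OMP_NUM_THREADS": "8"}

def build_job_script(command: str) -> str:
    """Return a shell script that loads modules, sets variables, then runs the job."""
    lines = ["#!/bin/bash"]
    lines += [f"module load {m}" for m in MODULES]
    lines += [f"export {k}={v}" for k, v in ENV_VARS.items()]
    lines.append(command)
    return "\n".join(lines)

if __name__ == "__main__":
    print(build_job_script("mpirun ./my_hpc_app"))
```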

HPC runtime module list
Figure 17. HPC runtime module list

MPI job template with custom module setup
Figure 18. MPI job template with custom module setup

Container Image Management

LiCO provides both users and administrators with the ability to upload and manage application environment images. These images can provide users with AI frameworks, HPC applications, and other software environments. LiCO uses Singularity images, which may be built from Docker containers or imported from NVIDIA GPU Cloud (NGC) or other image repositories. Users looking to deploy a particular image can create a custom template that will deploy the container and run workloads in that environment.
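
The images LiCO manages correspond to ordinary Singularity image files. As an illustration of how such an image might be produced outside LiCO, the sketch below builds a Singularity image from a public Docker image using the singularity build command; the image names are examples only, and newer Singularity installations may use the apptainer CLI instead.

```python
# Minimal sketch: build a Singularity image from a Docker source, the kind of
# image LiCO manages and deploys through job templates. Image names are
# examples; newer installs may use the "apptainer" CLI instead of "singularity".
import subprocess

def build_image(sif_path: str, docker_ref: str) -> None:
    """Build a Singularity image file (SIF) from a Docker image reference."""
    subprocess.run(
        ["singularity", "build", sif_path, f"docker://{docker_ref}"],
        check=True,
    )

if __name__ == "__main__":
    build_image("tensorflow_gpu.sif", "tensorflow/tensorflow:latest-gpu")
```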

Container management through the Administrator portal
Figure 19. Container management through the Administrator portal

Subscription & support

LiCO is enabled through a per-CPU and per-GPU subscription and support model which, once entitled for all the processors contained within the cluster, gives the customer access to LiCO package updates and Lenovo support for the length of the acquired term.

Lenovo will provide interoperability support for all software tools defined as validated with LiCO, and development support (Level 3) for specific Lenovo-supported tools only. Open source and supported-vendor bugs/issues will be logged and tracked with their respective communities or companies if desired, with no guarantee from Lenovo for bug fixes. Additional support options may be available; please contact your Lenovo sales representative for more information.

LiCO can be acquired as part of a Lenovo Scalable Infrastructure (LeSI) solution or for “roll your own” (RYO) solutions outside of the LeSI framework, and LiCO software package updates are provided directly through the Lenovo Electronic Delivery system. More information on LeSI is available in the LeSI product guide, available from https://lenovopress.com/lp0900.

Validated software components

LiCO depends on a number of software components that must be installed before LiCO itself in order for it to function properly. Each LiCO software release is validated against a defined configuration of software tools and Lenovo systems, to make deployment more straightforward and enable support. Other management tools, hardware systems, and configurations outside the defined stack may be compatible with LiCO, though not formally supported; to determine compatibility with other solutions, please check with your Lenovo sales representative.

The following software components are validated by Lenovo as part of the overall LiCO software solution entitlement:

  • Lenovo Development Support (L1-L3)
    • Graphical User Interface: LiCO
    • System Management & Provisioning: xCAT/Confluent
  • Lenovo Configuration Support (L1 only)
    • Job Scheduling & Orchestration: SLURM; Torque/Maui (HPC only)
    • System Monitoring: Nagios
    • Application Monitoring: Ganglia
    • Container Support (AI): Singularity
    • AI Frameworks (AI): Caffe, Intel-Caffe, TensorFlow, MXNet, Neon, Chainer

The following software components are validated for compatibility with LiCO:

  • Supported by their respective software provider
    • Operating System: CentOS/RHEL 7.5, SUSE SLES 12 SP3
    • File Systems: IBM Spectrum Scale, Lustre
    • Job Scheduling & Orchestration: IBM Spectrum LSF v9
    • Development Tools: GNU compilers, Intel Cluster Toolkit

Supported servers

The following Lenovo servers are supported to run LiCO. These servers must run one of the supported operating systems as well as the validated software stack, as described in the Validated software components section.

  • ThinkSystem SD530 – The Lenovo ThinkSystem SD530 is an ultra-dense and economical two-socket server in a 0.5U rack form factor. With up to four SD530 server nodes installed in the ThinkSystem D2 enclosure, and the ability to cable and manage up to four D2 enclosures as one asset, you have an ideal high-density 2U four-node (2U4N) platform for enterprise and cloud workloads. The SD530 also supports a number of high-end GPU options with the optional GPU tray installed, making it an ideal solution for AI Training workloads. For more information, see the product guide at https://lenovopress.com/lp1041-thinksystem-sd530-server-xeon-sp-gen-2.
  • ThinkSystem SD650 – The Lenovo ThinkSystem SD650 direct water cooled server is an open, flexible and simple data center solution for users of technical computing, grid deployments, analytics workloads, and large-scale cloud and virtualization infrastructures. The direct water cooled solution is designed to operate by using warm water, up to 50°C (122°F). Chillers are not needed for most customers, meaning even greater savings and a lower total cost of ownership. The ThinkSystem SD650 is designed to optimize density and performance within typical data center infrastructure limits, being available in a 6U rack mount unit that fits in a standard 19-inch rack and houses up to 12 water-cooled servers in 6 trays. For more information, see the product guide at https://lenovopress.com/lp1042-thinksystem-sd650-server-xeon-sp-gen-2.
  • ThinkSystem SR630 – Lenovo ThinkSystem SR630 is an ideal 2-socket 1U rack server for small businesses up to large enterprises that need industry-leading reliability, management, and security, as well as maximizing performance and flexibility for future growth. The SR630 server is designed to handle a wide range of workloads, such as databases, virtualization and cloud computing, virtual desktop infrastructure (VDI), infrastructure security, systems management, enterprise applications, collaboration/email, streaming media, web, and HPC. For more information, see the product guide at https://lenovopress.com/lp1049-thinksystem-sr630-server-xeon-sp-gen2.
  • ThinkSystem SR650 – The Lenovo ThinkSystem SR650 is an ideal 2-socket 2U rack server for small businesses up to large enterprises that need industry-leading reliability, management, and security, as well as maximizing performance and flexibility for future growth. The SR650 server is designed to handle a wide range of workloads, such as databases, virtualization and cloud computing, virtual desktop infrastructure (VDI), enterprise applications, collaboration/email, and business analytics and big data. For more information, see the product guide at https://lenovopress.com/lp1050-thinksystem-sr650-server-xeon-sp-gen2.
  • ThinkSystem SR670 – The Lenovo ThinkSystem SR670 is a purpose-built 2 socket 2U 4GPU node, designed for optimal performance for high-end computation required by both Artificial Intelligence and High Performance Computing workloads. Supporting the latest NVIDIA GPUs and Intel Xeon Scalable processors, the SR670 supports hybrid clusters for organizations that may want to consolidate infrastructure, improving performance and compute power, while maintaining optimal TCO. For more information, see the product guide at https://lenovopress.com/lp0923-thinksystem-sr670-server.
  • ThinkSystem SR950 – The Lenovo ThinkSystem SR950 is Lenovo’s flagship server, suitable for mission-critical applications that need the most processing power possible in a single server. The powerful 4U ThinkSystem SR950 can expand from two to as many as eight Intel Xeon Scalable Family processors. The modular design of SR950 speeds upgrades and servicing with easy front or rear access to all major subsystems that ensures maximum performance and maximum server uptime. For more information, see the product guide at https://lenovopress.com/lp1054-thinksystem-sr950-server-xeon-sp-gen-2.

Additional Lenovo ThinkSystem and System x servers may be compatible with LiCO. Contact your Lenovo sales representative for more information.

LiCO Implementation services

Customers who do not have the cluster management software stack required to run with LiCO may engage Lenovo Professional Services to install LiCO and the necessary open-source software. Lenovo Professional Services can provide comprehensive installation and configuration of the software stack, including operation verification, as well as post-installation documentation for reference. Contact your Lenovo sales representative for more information.

Client PC requirements

A web browser is used to access LiCO's monitoring dashboards. To fully utilize LiCO’s monitoring and visualization capabilities, the client PC should meet the following specifications:

  • Hardware: CPU of 2.0 GHz or above and 1 GB or more of RAM
  • Display resolution: 1280 x 800 or higher
  • Browser: Chrome or Firefox is recommended


Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both:
Lenovo®
System x®
ThinkSystem

The following terms are trademarks of other companies:

Intel® and Xeon® are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Other company, product, or service names may be trademarks or service marks of others.