Why Customers are Moving High Performance Computing Workloads to Amazon Web Services
June 7, 2017 08:03 CEST
Amazon Web Services (AWS) offers access to cloud infrastructure that can run a wide variety of high performance computing (HPC) applications. Such workloads represent some of the most computationally challenging in the world, including those in engineering, financial risk analysis, molecular dynamics, weather prediction, genomic analysis, and many more. And since the service is backed by a massive reserve of servers in Amazon’s Elastic Compute Cloud (EC2), no problem is too big for AWS.
Using AWS for high performance computing enables organizations, both public and private, to advance scientific understanding, create better products, and extract new insights from datasets. Global enterprises employ the platform to help accelerate their product development and manufacturing efforts, evaluate financial risks, and develop better business practices. Research and academic institutions turn to AWS to run simulations at scales that were previously impractical, enabling scientists to gain new insights into their research. Startups use AWS to deploy traditional HPC applications in new ways to drive the innovation these companies are bringing to market.
There is growing awareness among HPC users that the cloud can provide near-instant access to computing resources that might otherwise be unavailable locally or impractical to maintain in-house. With the cloud, HPC clusters can be commissioned and decommissioned in just minutes, rather than in days or weeks. Especially in situations where scalability and elasticity are paramount, the cloud offers the most practical delivery model for performance-demanding applications.
A Wide Spectrum of HPC Applications
In addition to the many researchers in the public sector using HPC in the cloud, commercial enterprises have also been increasing their use, augmenting or in some cases replacing their legacy clusters. Pharmaceutical companies, for example, are taking advantage of the scalability of EC2 to accelerate drug discovery with large-scale computational chemistry applications. Life science and agricultural firms are avoiding the capital expenditures of procuring clusters every few years by running their genomic analysis workloads on EC2.
Unilever, for example, augmented their existing HPC capacity with EC2, enabling the company to process genetic sequences twenty times faster than with their in-house system. “The key advantage that AWS has over running this workflow on Unilever’s existing cluster is the ability to scale up to a much larger number of parallel compute nodes on demand,” said Pete Keeley, eScience Technical Lead for R&D IT at Unilever.
Similarly, AWS enabled Novartis to massively accelerate pre-clinical R&D in computational chemistry. “We completed the equivalent of thirty-nine years of computational chemistry in just under nine hours for a cost of around $4,200,” noted Steve Litster, Global Head of Scientific Computing at Novartis.
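To put the Novartis figures in perspective, the short calculation below works out the parallelism and unit cost they imply. It is a back-of-the-envelope sketch, not Novartis's own accounting: it assumes that "thirty-nine years" means serial-equivalent compute time on one worker and that a year is 365.25 days, and it treats nine hours as an upper bound on the wall-clock time.

```python
HOURS_PER_YEAR = 365.25 * 24  # assumption: average calendar year

compute_years = 39       # from the quote
wall_clock_hours = 9     # "just under nine hours", so an upper bound
total_cost_usd = 4200    # "around $4,200"

# Total serial-equivalent work, expressed in hours.
compute_hours = compute_years * HOURS_PER_YEAR

# Number of workers that must have run concurrently, at minimum.
implied_parallelism = compute_hours / wall_clock_hours

# Cost of one serial-equivalent hour of computation.
cost_per_compute_hour = total_cost_usd / compute_hours

print(f"Serial-equivalent compute hours: {compute_hours:,.0f}")
print(f"Implied parallel workers (lower bound): {implied_parallelism:,.0f}")
print(f"Cost per serial-equivalent hour: ${cost_per_compute_hour:.4f}")
```

Under these assumptions the run implies roughly 38,000 concurrent workers at a little over one cent per serial-equivalent hour.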
In the manufacturing domain, firms around the world are successfully deploying third-party and in-house developed applications for computer aided design (CAD), electronic design automation (EDA), 3D rendering, and parallel materials simulations. These firms routinely launch simulation clusters consisting of many thousands of CPU cores, for example, to run thousands or even millions of parallel parametric sweeps.
In the financial services sector, organizations ranging from hedge funds to global banks to independent auditing agencies such as FINRA are using AWS to run complex financial simulations, predict future outcomes, back-test proprietary trading algorithms, and help meet regulatory requirements. Stochastic simulations and other “pleasingly parallel” applications, such as Monte Carlo financial risk analysis, are particularly well suited to EC2 Spot Instances, which allow users to bid on unused capacity at cost savings of up to 90 percent off the normal hourly rate.
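The reason Monte Carlo risk analysis parallelizes so well is that each batch of simulated paths is independent of every other batch. The sketch below illustrates the pattern with a toy one-day Value-at-Risk estimate. The function names, the normal return model, and its parameters are illustrative assumptions, and local threads merely stand in for the separate instances a real deployment would use.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_chunk(seed, n_paths, mu=0.0005, sigma=0.02):
    """Simulate n_paths one-day portfolio returns; chunks share no state."""
    rng = random.Random(seed)  # per-chunk seed: independent and reproducible
    return [rng.gauss(mu, sigma) for _ in range(n_paths)]


def value_at_risk(returns, confidence=0.99):
    """Loss threshold exceeded in roughly (1 - confidence) of the paths."""
    ordered = sorted(returns)
    index = int((1.0 - confidence) * len(ordered))
    return -ordered[index]


def run_simulation(n_chunks=8, paths_per_chunk=5000):
    # Each chunk could run on a separate worker; threads stand in here.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        chunks = list(
            pool.map(simulate_chunk, range(n_chunks), [paths_per_chunk] * n_chunks)
        )
    # The only coordination step is merging results at the end.
    merged = [r for chunk in chunks for r in chunk]
    return value_at_risk(merged)


if __name__ == "__main__":
    print(f"99% one-day VaR: {run_simulation():.4f}")
```

Because a chunk depends only on its seed, a chunk lost to an interrupted worker can simply be rerun elsewhere, which is what makes this class of workload a natural fit for interruptible spot capacity.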
As the capabilities and performance of Amazon’s platform have continued to advance, the types of HPC applications that are running on EC2 have also evolved, with open source and commercial software applications being successfully deployed on it across industries and application categories. Today EC2 provides a comprehensive set of capabilities to support traditional HPC workloads, as well as emerging use cases, such as machine learning, the internet of things (IoT), data lake solutions, and high performance data analytics.
What Makes AWS Compelling for HPC
Demand for performance-demanding applications continues to grow, driven by the ever-increasing need for more accurate simulations, faster turn-around time, and greater insights into ever-larger datasets. In addition to this increasing demand, the time and the expense required to procure, deploy and manage physical HPC infrastructure has led many users to consider using AWS, either to augment their existing HPC clusters, or to replace them entirely.
AWS allows HPC users to scale applications horizontally and vertically and eliminates the need for jobs to wait in queues. Horizontal scalability is provided by the natural elasticity of the cloud, in which additional compute nodes can be automatically added as needed. Vertical scalability is provided by the wide range of EC2 instance types, which cater to different computational profiles based on the application’s processor, memory, and I/O requirements.
TLG Aerospace is one example of a customer taking advantage of the scalability that AWS affords, in this case for the company’s computational fluid dynamics (CFD) simulations. Prior to moving this CFD work to AWS, TLG couldn’t run jobs of more than 1,000 nodes, resulting in lost opportunities. “We are definitely saving money by actively monitoring jobs to catch problems early and reduce rework,” explained Andrew McComas, Engineering Manager at TLG. “We can also use it to reduce unnecessary cost in larger jobs that may otherwise run longer than required.”
HPC users deploying their applications on AWS discover that running workloads in the cloud is not simply a means of doing the same kinds of work as before. Instead, they are seeing that the cloud enables a new way for distributed teams to collaborate. Such collaboration in manufacturing can include using the cloud as a secure, globally accessible platform for production yield analysis, or enabling joint design efforts using remote 3D visualization.
A good example of this is the experience of General Electric (GE), which leveraged AWS to develop an innovative manufacturing platform known as the Crowd-driven Ecosystem for Evolutionary Design (CEED). It connects people, materials, models, simulations, and equipment in an ITAR-compliant, secure, and distributed global environment. According to Joe Salvo, Manager, Business Integration Technology Laboratory, at General Electric, “this could change the way manufacturing is architected.”
In a typical HPC environment, individual HPC users will submit their jobs to a shared resource using a batch scheduler. Depending on the mix of jobs being submitted, their inter-dependencies and priorities, and whether they are optimized for the shared resource, the HPC cluster may or may not be efficiently utilized. When workloads are highly variable, such as when there is a simultaneous high demand for simulations from many different groups, or when there are unexpected high-priority jobs being submitted, queue wait times can grow dramatically, resulting in job completion times far greater than the actual time needed to complete each job.
When running HPC applications with AWS, the problem of queue contention is eliminated, because every job or every set of related jobs can be provided with its own resources. Moreover, these resources can be customized for the unique set of applications for which they are used. This results in a more efficient use of infrastructure.
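The effect of queue contention can be illustrated with a deliberately simplified model. The sketch below compares a first-in, first-out schedule on a fixed number of slots with the case where every job receives its own resources at submission. All names are hypothetical, each job is assumed to occupy exactly one slot, and the time needed to provision cloud resources is ignored.

```python
import heapq


def shared_cluster_turnaround(jobs, slots):
    """FIFO schedule on a fixed number of slots; returns turnaround per job.

    jobs: list of (arrival_time, runtime) tuples sorted by arrival_time.
    """
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    turnaround = []
    for arrival, runtime in jobs:
        earliest = heapq.heappop(free_at)
        start = max(arrival, earliest)  # wait in the queue if all slots are busy
        finish = start + runtime
        heapq.heappush(free_at, finish)
        turnaround.append(finish - arrival)
    return turnaround


def dedicated_turnaround(jobs):
    """Each job gets its own resources at submission: turnaround equals runtime."""
    return [runtime for _, runtime in jobs]


# A burst of six 2-hour jobs submitted at the same moment onto 2 slots.
burst = [(0.0, 2.0)] * 6
shared = shared_cluster_turnaround(burst, slots=2)
dedicated = dedicated_turnaround(burst)

print("shared cluster:", shared)
print("dedicated     :", dedicated)
```

In this toy burst the last jobs on the shared cluster take three times their actual runtime to complete, while with dedicated resources every job finishes in its own runtime.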
HGST, a business division of Western Digital, is using AWS to support its engineering workloads, which span CAD, CFD, EDA and molecular dynamics simulations. AWS enables millions of parallel parameter sweeps, running months of simulations in just hours. At peak execution, HGST utilizes over 70,000 Intel cores, leveraging the EC2 spot market to minimize cost.
The Breadth of Capability
AWS provides a broad set of capabilities with more than 90 major products and services, encompassing computation, networking, storage, and cluster management.
With regard to computational capabilities, EC2 offers a number of instances optimized for different processor and memory requirements. Instances equipped with different families of Intel Xeon processors are available, including Haswell, Broadwell and Skylake CPUs. EC2 also includes support for graphics and accelerated instances with GPUs and FPGAs.
A number of networking options are also available for latency- and bandwidth-sensitive workloads. For example, EC2 instances can be launched within a Placement Group to reduce latency and obtain full bisection bandwidth between instances. Support for Single Root I/O Virtualization (SR-IOV) is also available to provide higher I/O performance and lower CPU utilization. This feature also provides a higher rate of packet delivery, lower inter-instance latencies, and very low network jitter. The capabilities of EC2 Enhanced Networking and SR-IOV directly benefit MPI workloads.
A large number of storage capabilities are supported as well. These include the Amazon Elastic Block Store for local I/O, the Amazon Elastic File System for home directories and scratch files, and, for parallel file system support, Intel Lustre and BeeGFS. For object storage, Amazon offers Simple Storage Service (S3), a web service that promises 99.999999999 percent durability and scales to trillions of objects. HPC customers have seen the benefit of moving to object storage for cloud-based simulation.
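The durability figure quoted for S3 is easier to grasp as an expected loss rate. The calculation below is a sketch that assumes the percentage is an average annual durability per object, which is how AWS documentation describes it. The dataset size is hypothetical.

```python
from decimal import Decimal

DURABILITY = Decimal("0.99999999999")        # eleven nines, from the text
ANNUAL_LOSS_RATE = Decimal(1) - DURABILITY   # assumption: per object, per year


def expected_annual_losses(object_count):
    """Average number of objects lost per year under the assumed loss rate."""
    return Decimal(object_count) * ANNUAL_LOSS_RATE


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return Decimal(1) / expected_annual_losses(object_count)


stored = 10_000_000  # hypothetical dataset size
print(f"Expected losses per year: {expected_annual_losses(stored)}")
print(f"Years per lost object:    {years_per_lost_object(stored)}")
```

Under that reading, a store of ten million objects would on average lose a single object once every 10,000 years.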
For data archiving and backup, Amazon Glacier is a cloud storage service offered as a convenient alternative to tape. For HPC users, it’s especially useful for long-term backup of research data. AWS also provides a data lake solution, which allows customers to store all their data in a single centralized repository.
For cluster management, AWS offers a useful set of tools for building, deploying, managing and monitoring HPC clusters. For batch processing, there is the appropriately named AWS Batch, which enables users to provision the optimal compute resources for their applications. Another powerful tool is AWS CloudFormation, which gives developers and systems administrators an easy way to create and manage a collection of related AWS resources. It also enables you to build reference architectures for specific HPC workflows and applications. NICE EnginFrame, an advanced, commercially supported HPC portal, is also available.
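As a rough illustration of what a CloudFormation reference architecture can look like, the sketch below generates a minimal template that places a few compute nodes in a cluster placement group. It is a starting point under stated assumptions, not an official AWS architecture: the logical names, the node count, and the `c4.8xlarge` instance type are illustrative choices, the machine image is left as a parameter, and networking, security groups, and storage are omitted.

```python
import json


def build_cluster_template(node_count=2, instance_type="c4.8xlarge"):
    """Return a CloudFormation template (as a dict) for a small cluster."""
    resources = {
        # A cluster placement group keeps instances close together on the network.
        "ClusterPlacementGroup": {
            "Type": "AWS::EC2::PlacementGroup",
            "Properties": {"Strategy": "cluster"},
        }
    }
    for i in range(node_count):
        resources[f"ComputeNode{i}"] = {
            "Type": "AWS::EC2::Instance",
            "Properties": {
                "ImageId": {"Ref": "NodeImageId"},  # supplied at stack creation
                "InstanceType": instance_type,
                "PlacementGroupName": {"Ref": "ClusterPlacementGroup"},
            },
        }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative HPC cluster sketch (hypothetical names)",
        "Parameters": {"NodeImageId": {"Type": "AWS::EC2::Image::Id"}},
        "Resources": resources,
    }


template_json = json.dumps(build_cluster_template(node_count=4), indent=2)
print(template_json)
```

Generating the template from code keeps the node count and instance type as ordinary function arguments, so the same sketch can be reused for clusters of different sizes.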
For resource monitoring, AWS offers Amazon CloudWatch, a service that allows you to collect and track metrics, save and monitor log files, set alarms, and automatically react to changes in AWS resources. And for cost management, there is AWS Cost Explorer, which provides a set of tools to track all costs associated with your HPC infrastructure on AWS. It offers cost tagging, alarms, forecasting and report generation.
AWS enables HPC customers to take their applications to the cloud with the NICE EnginFrame and DCV products, which are proven technologies both on-premises and on AWS. NICE EnginFrame provides a common interface to manage clusters on AWS or in local data centers, and NICE DCV provides the remote visualization technology to access your applications wherever they are deployed. Users can also choose AppStream, which is a managed service for secure streaming of desktop applications using DCV technology.
To learn more about how AWS can help you achieve better results with your HPC workloads, come join us at ISC High Performance 2017 in Frankfurt, Germany, which takes place from June 18-22. We will be delivering over 35 presentations from AWS technical teams, along with our partners Alces Flight, Altair, Ansys, ARM, BeeGFS, Cadfem, Cycle Computing, Ellexus, Fly Elephant, Intel, Rescale, Techila and Zenotech. Also, stop by the AWS booth, G-822, during the exhibition, which runs June 19-21, or contact us directly at firstname.lastname@example.org if you would like to arrange a briefing session with us or our partners.