GROMACS for Intel® Xeon Phi™ Coprocessor

Purpose

This code recipe describes how to get, build, and use the GROMACS* code with support for the Intel® Xeon Phi™ coprocessor with Intel® Many-Integrated Core (MIC) architecture.

Introduction

GROMACS is a versatile package for performing molecular dynamics, using Newtonian equations of motion, for systems with hundreds to millions of particles. GROMACS is primarily designed for biochemical molecules, such as proteins, lipids, and nucleic acids, that have a multitude of complicated bonded interactions. However, because GROMACS is extremely fast at calculating the non-bonded interactions that typically dominate simulations, many researchers also use it for research on non-biological systems, such as polymers.

GROMACS supports all the usual algorithms expected from a modern molecular dynamics implementation.

The GROMACS code is maintained by developers around the world. The code is available under the GNU General Public License from www.gromacs.org.

Code Support for Intel® Xeon Phi™ coprocessor

GROMACS 5.0-RC1 has been released with Intel Xeon Phi coprocessor native/symmetric support. The code is currently available at http://www.gromacs.org/Downloads, or via ftp at ftp://ftp.gromacs.org/pub/gromacs/gromacs-5.0-rc1.tar.gz.

Only source code is available; configure and build it using the instructions provided below.

In GROMACS 5.0, the code was restructured with a platform-independent SIMD layer, which simplifies moving to a new instruction set architecture by redefining macros in localized header files. Version 5.0 with Intel Xeon Phi coprocessor support provides a 16-way neighbor list for enabling 512-bit vector registers (KNC/KNL), support for FMA (fused multiply-add) instructions and mask registers, and heavy optimizations for Intel Xeon Phi coprocessor native computations. These optimizations include the following:

16-wide SIMD non-bonded computations optimized with MIC-intrinsics
4-wide SIMD intrinsics in PME (Particle Mesh Ewald method)
Improvements to force reduction over OpenMP* threads
Resolved issues of Intel® Math Kernel Library enabling for FFT, BLAS, LAPACK

Version 5.0 code performance scales well on the host node's Intel® Xeon® processor. However, coprocessor native/symmetric support has known scaling challenges beyond two nodes, and Intel is diligently working to resolve them. Additionally, an offload version is in development, which will asynchronously move Particle-Particle computations to the Intel Xeon Phi coprocessor.

Code Access

This version of GROMACS code supports both message passing and threading programming models of the Intel Xeon processor (referred to as 'host' in this document) with the Intel Xeon Phi coprocessor (referred to as 'coprocessor' in this document) in both a single node and a cluster environment.

To get access to the code and test workloads:

Go to the downloads page: http://www.gromacs.org/Downloads
Download the gromacs-5.0-rc1.tar.gz code.
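The two steps above can also be scripted. The following is a minimal sketch, assuming wget and tar are available; the URL is the FTP location given earlier in this recipe, while WORK_DIR, fetch_gromacs, and the build-mic directory name are illustrative and not part of GROMACS.

```shell
#!/bin/sh
# Sketch: fetch and unpack the GROMACS 5.0-rc1 source tarball.
# WORK_DIR, fetch_gromacs and build-mic are illustrative names (assumptions).
GMX_VERSION="5.0-rc1"
GMX_TARBALL="gromacs-${GMX_VERSION}.tar.gz"
GMX_URL="ftp://ftp.gromacs.org/pub/gromacs/${GMX_TARBALL}"
WORK_DIR="${WORK_DIR:-gromacs-build}"

fetch_gromacs() {
    mkdir -p "$WORK_DIR" && cd "$WORK_DIR" || return 1
    # Download only if the tarball is not already present.
    [ -f "$GMX_TARBALL" ] || wget "$GMX_URL" || return 1
    tar -xzf "$GMX_TARBALL" || return 1
    # Separate build directory, so that "cmake .." points at the source tree.
    # Assumes the tarball unpacks into gromacs-5.0-rc1/.
    mkdir -p "gromacs-${GMX_VERSION}/build-mic"
}

# Printing the URL has no side effects; call fetch_gromacs to download.
echo "$GMX_URL"
```

Calling fetch_gromacs performs the download and leaves an empty build directory in which the cmake configuration can be run.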

Build Directions

Configure cmake by adding these essential parameters:

-DGMX_FFT_LIBRARY=mkl #enable MKL for FFT
-DGMX_MPI=ON #enable MPI
-DGMX_OPENMP=ON #enable OpenMP
-DCMAKE_EXE_LINKER_FLAGS="-L$ZLIB_DIR/lib64" #path to MIC zlib
-DCMAKE_C_FLAGS="-O3 -mmic -I$ZLIB_DIR/include" #C compiler flags should include -mmic and the zlib include dir
-DCMAKE_CXX_FLAGS="-O3 -mmic -I$ZLIB_DIR/include" #the same for the C++ compiler
-DGMX_SKIP_DEFAULT_CFLAGS=ON #omit default compiler flags

make -j 12

Full cmake configuration for Intel Xeon Phi coprocessor

cmake .. \
  -DBUILD_SHARED_LIBS=OFF \
  -DGMX_PREFER_STATIC_LIBS=ON \
  -DGMX_BUILD_MDRUN_ONLY=ON \
  -DGMX_FFT_LIBRARY=mkl \
  -DCMAKE_INSTALL_PREFIX=$GROMACS_INSTALL_DIR \
  -DGMX_MPI=ON \
  -DGMX_OPENMP=ON \
  -DGMX_GPU=OFF \
  -DGMX_XML=OFF \
  -DGMX_SOFTWARE_INVSQRT=OFF \
  -DGMX_SKIP_DEFAULT_CFLAGS=ON \
  -DCMAKE_EXE_LINKER_FLAGS="-L$ZLIB_DIR/lib64 -mkl=sequential" \
  -DCMAKE_C_COMPILER=mpiicc \
  -DCMAKE_C_FLAGS="-std=gnu99 -O3 -mmic -vec-report1 -fno-alias -ip -funroll-all-loops -fimf-domain-exclusion=15 -g -DNDEBUG -I$ZLIB_DIR/include" \
  -DCMAKE_CXX_COMPILER=mpiicpc \
  -DCMAKE_CXX_FLAGS="-std=c++0x -O3 -mmic -vec-report1 -fno-alias -ip -funroll-all-loops -fimf-domain-exclusion=15 -g -DNDEBUG -I$ZLIB_DIR/include"
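The configuration above depends on two environment variables, ZLIB_DIR and GROMACS_INSTALL_DIR, being set before cmake runs. A small guard such as the following sketch can catch a missing variable early; check_build_env is a hypothetical helper name, not part of GROMACS or cmake.

```shell
#!/bin/sh
# Sketch: fail early if a variable referenced by the cmake flags is unset.
# check_build_env is an illustrative helper name (assumption).
check_build_env() {
    missing=0
    for var in ZLIB_DIR GROMACS_INSTALL_DIR; do
        # Indirect lookup of the variable named in $var (POSIX sh has no ${!var}).
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            echo "error: $var is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Typical use from the build directory (commented out to avoid side effects):
#   check_build_env && cmake .. <flags as listed in this section> && make -j 12
```

The function returns a non-zero status when either variable is empty or unset, so it can be chained in front of the cmake call with &&.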

Running Workloads on Intel Xeon Phi coprocessor Only

To run the workload on the Intel Xeon Phi coprocessor only, do the following:

Source the Intel® compiler, so libraries can be found.
export LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
Set up extra environment variables for run time.
export I_MPI_MIC=1
export I_MPI_PIN_MODE=pm
export I_MPI_PIN_DOMAIN=omp
export KMP_AFFINITY=verbose,compact,0
export OMP_NUM_THREADS=$NTHREADS
Create an appropriate machinefile list for MPI, such as:
<nodename>-mic<device_id>:<MIC_PPN>
Where
nodename = name of host, check by `uname -n`
device_id = 0, 1, etc., depending on MIC device used
MIC_PPN = number of MPI processes executed on the MIC card
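As an illustration of the machinefile entry format described above, the following sketch prints one entry for a given host, device, and rank count. The helper name mic_machinefile_line and the host name node01 are made up for the example.

```shell
#!/bin/sh
# Sketch: print one coprocessor machinefile entry.
# mic_machinefile_line and "node01" are illustrative names (assumptions).
mic_machinefile_line() {
    # $1 = host name, $2 = coprocessor device id, $3 = MPI ranks on that card
    printf '%s-mic%s:%s\n' "$1" "$2" "$3"
}

# 30 ranks on the first coprocessor of host "node01".
mic_machinefile_line node01 0 30
# prints: node01-mic0:30

# For the local host, the name can be taken from uname -n:
#   mic_machinefile_line "$(uname -n)" 0 30 > machinefile
```

With Intel MPI, a file like this is typically passed to the launcher through its -machinefile option; the exact launch command is not given in this recipe.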

Running Workloads on the Host Processor and Coprocessor

To run workloads on both the host's Intel Xeon processor and the Intel Xeon Phi coprocessor, do the following:

Source the Intel® compiler, so libraries can be found.
export LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
Set up extra environment variables for run time.
export I_MPI_MIC=1
export I_MPI_PIN_MODE=pm
export MIC_OMP_NUM_THREADS=$NTHREADS #num of OMP threads on MIC
export IVB_OMP_NUM_THREADS=1 #num of OMP threads on Host
export MIC_KMP_AFFINITY=verbose,compact,0 #KMP_AFFINITY for MIC threads
export IVB_KMP_AFFINITY=verbose,compact,1 #KMP_AFFINITY for Host threads
Create the appropriate machinefile list for MPI, such as:
<nodename>:<HOST_PPN>
<nodename>-mic<device_id>:<MIC_PPN>
Where
nodename = name of host, check by `uname -n`
HOST_PPN = number of MPI processes executed on the host
device_id = 0, 1, etc., depending on MIC device used
MIC_PPN = number of MPI processes executed on the Intel Xeon Phi coprocessor
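For a node with several coprocessors, the machinefile needs one host entry plus one entry per card. The sketch below generates such a file and the matching total rank count; symmetric_machinefile, total_ranks, and the host name node01 are hypothetical, and the rank counts (24 on the host, 30 per card) are the ones listed in the performance section of this recipe.

```shell
#!/bin/sh
# Sketch: machinefile for symmetric mode (host plus N coprocessors).
# symmetric_machinefile, total_ranks and "node01" are illustrative (assumptions).
# $1 = host name, $2 = ranks on the host,
# $3 = number of coprocessors, $4 = ranks per coprocessor
symmetric_machinefile() {
    printf '%s:%s\n' "$1" "$2"
    dev=0
    while [ "$dev" -lt "$3" ]; do
        printf '%s-mic%s:%s\n' "$1" "$dev" "$4"
        dev=$((dev + 1))
    done
}

# Total MPI ranks for the same layout:
# $1 = host ranks, $2 = number of coprocessors, $3 = ranks per coprocessor
total_ranks() {
    echo $(( $1 + $2 * $3 ))
}

symmetric_machinefile node01 24 2 30
# prints:
#   node01:24
#   node01-mic0:30
#   node01-mic1:30
total_ranks 24 2 30
# prints: 84
```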
Mpiexec executes the wrapper script mdrun.sh, which runs an Intel Xeon processor binary on the host and an MIC binary on the Intel Xeon Phi coprocessor:
if ; then
  export OMP_NUM_THREADS=${MIC_OMP_NUM_THREADS}
  export KMP_AFFINITY=${MIC_KMP_AFFINITY}
  "$BIN_DIR/mdrun_mpi.MIC" "$@"
else
  export OMP_NUM_THREADS=${IVB_OMP_NUM_THREADS}
  export KMP_AFFINITY=${IVB_KMP_AFFINITY}
  "$BIN_DIR/mdrun_mpi.IVB" "$@"
fi
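The test condition of the wrapper is not shown in this recipe. The following sketch is one possible way to write it, under the assumption that `uname -m` reports k1om when the script runs on the coprocessor; that test, and the helper names select_target and run_mdrun, are not taken from the original script.

```shell
#!/bin/sh
# Sketch of an mdrun.sh-style wrapper with an ASSUMED architecture test.
# Assumption: "uname -m" prints "k1om" on the coprocessor and something
# else (for example x86_64) on the host.
select_target() {
    case "$1" in
        k1om) echo MIC ;;
        *)    echo IVB ;;
    esac
}

run_mdrun() {
    target=$(select_target "$(uname -m)")
    if [ "$target" = MIC ]; then
        export OMP_NUM_THREADS="${MIC_OMP_NUM_THREADS}"
        export KMP_AFFINITY="${MIC_KMP_AFFINITY}"
    else
        export OMP_NUM_THREADS="${IVB_OMP_NUM_THREADS}"
        export KMP_AFFINITY="${IVB_KMP_AFFINITY}"
    fi
    # Replace the shell with the binary that matches this rank's architecture.
    exec "$BIN_DIR/mdrun_mpi.$target" "$@"
}

# In an actual wrapper script the last line would be: run_mdrun "$@"
```

Keeping the architecture decision in a separate function makes it possible to check the selection logic on the host without a coprocessor present.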

Optimizing performance differs for each workload. The KMP_AFFINITY variable allows you to easily adjust affinity for MPI ranks and the number of threads per rank on the Intel Xeon Phi coprocessor. Experimenting with the values for KMP_AFFINITY allows you to get the best performance for a given workload.
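One simple way to experiment is to loop over a few affinity types and rerun the workload for each. The sketch below only prints the candidate settings; the choice of compact, scatter, and balanced as candidates is an assumption about what is worth comparing (this recipe itself only uses compact), and affinity_candidates is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list KMP_AFFINITY settings to compare on the coprocessor.
# The candidate list is an assumption, not a recommendation from the recipe.
affinity_candidates() {
    for type in compact scatter balanced; do
        echo "verbose,${type}"
    done
}

for setting in $(affinity_candidates); do
    echo "candidate: MIC_KMP_AFFINITY=${setting}"
    # Here one would export MIC_KMP_AFFINITY="$setting", launch the workload,
    # and record the performance reported for this setting.
done
```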

Performance Testing Results^1,2

The following graph shows the results achieved from the GROMACS code using the hardware and software configurations shown below. Up to 1.8x performance speedup can be achieved on the RF workload with symmetric mode, using both processors and coprocessors together. To achieve these results, the engineers used these affinity settings:

2 CPU : 24 MPI x 1 OMP
2 CPU + 1 Coprocessor : 24 MPI x 1 OMP + 30 MPI x 8 OMP
2 CPU + 2 Coprocessor : 24 MPI x 1 OMP + 30 MPI x 8 OMP + 30 MPI x 8 OMP
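As a quick arithmetic check of these layouts, the number of software threads is the number of MPI ranks times the OpenMP threads per rank. The sketch below compares that with the hardware listed in the platform configuration (24 host cores; 61 coprocessor cores with 4 hardware threads each); threads_used is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: threads used = MPI ranks x OpenMP threads per rank.
# threads_used is an illustrative helper name (assumption).
threads_used() {
    echo $(( $1 * $2 ))
}

echo "host:        $(threads_used 24 1) threads on 24 cores"
echo "coprocessor: $(threads_used 30 8) threads of $(( 61 * 4 )) hardware threads"
# prints:
#   host:        24 threads on 24 cores
#   coprocessor: 240 threads of 244 hardware threads
```

So the host layout places one thread on each core, and each coprocessor layout uses 240 of the 244 available hardware threads.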

Testing Platform Configurations

The following hardware was used for the above recipe and performance testing.^3

Server Configuration:

2-socket/24 cores:
Processor: Intel® Xeon® processor E5-2697 V2 @ 2.70GHz (12 cores) with Intel® Hyper-Threading Technology
Operating System: Red Hat Enterprise Linux* 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 x86_64 x86_64 GNU/Linux
Memory: 64GB
Coprocessor: 2X Intel® Xeon Phi™ Coprocessor (Board SKU "C0-7120P/7120X/7120"): 61 cores @ 1.238 GHz, 4-way Intel Hyper-Threading Technology^4, Memory: 15872 MB
Intel® Many-core Platform Software Stack Version 2.1.6720-21
Intel® C++ Compiler Version 14.0.1.106

GROMACS

FFT: Intel® Math Kernel Library
Configuration parameters were modified to achieve optimal performance

DISCLAIMERS:

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

1. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

2. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

3. For more information go to http://www.intel.com/performance

4. Available on select Intel® processors. Requires an Intel® HT Technology-enabled system. Consult your PC manufacturer. Performance will vary depending on the specific hardware and software used. For more information including details on which processors support HT Technology, visit http://www.intel.com/info/hyperthreading.

Intel, the Intel logo, Xeon and Xeon Phi are trademarks of Intel Corporation in the US and/or other countries.

*Other names and brands may be claimed as the property of others.
