Deferred Rendering for OpenGL ES3.0 on Android

Elaborazione parallela

Vettorizzazione

Area tema:

IDZone

↧

Vectorization in Julia

September 15, 2014, 4:20 pm

Latest and popular articles on Intel Technologies

≫ Next: How to analyze Intel® Xeon Phi™ coprocessor applications using Intel® VTune™ Amplifier XE 2015

≪ Previous: Intel® Xeon Phi™ Coprocessor code named “Knights Landing” - Application Readiness

Julia is a new language for technical computing that combines interactive scripting convenience with high performance. Version 0.3 was released Aug. 20, 2014, and introduces experimental support for vectorizing loops, which can significantly improve performance of some kernels. This article explains how to use the vectorization feature effectively. This article is based on material that I first presented at JuliaCon 2014

What Is Vectorization?

"Vectorization" has two different meanings in Julia, both related to operating on chunks of data:

Writing your code in terms of operations that operate on whole arrays. For example, writing d=a+b-c where the variables all indicate array objects. See the Julia @devec package for more information about this style of code.
Compiler transformations that improve performance by using SIMD (Single Instruction Multiple Data) instructions that operate on chunks of data. For example, hardware with Intel® Advanced Vector Extensions (Intel® AVX) can do eight 32-bit floating-point additions at once.

The first definition concerns how you write your code. The second definition concerns code generation. This article concerns the second definition.

Julia 0.3 has vectorization capabilities that can exploit SIMD instructions when executing loops, under the right conditions. Sometimes you have to give it a little nudging. Here is an example function and its invocation on two random arrays with 1003 32-bit values:

function axpy(a,x,y)
    @simd for i=1:length(x)
        @inbounds y[i] += a*x[i]
    end
end

n = 1003
x = rand(Float32,n)
y = rand(Float32,n)
axpy(1.414f0, x, y)

The @simd and @inbounds will be explained later. When the call to axpy happens, the Julia just-In-time (JIT) compiler generates code for an instance of axpy specialized to the arguments' types. For serial execution (that is, non-SIMD code) each iteration of the instance behaves as if the code were written like this:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n=length(x)
    i = 1
    @inbounds while i<=n
        t1 = x[i]
        t2 = y[i]
        t3 = w*t1[i]
        t4 = t2+t3
        y[i] = t4
        i += 1
    end
end

Here each assignment in the loop represents one machine instruction.

The @simd macro gives the compiler license to vectorize without checking whether it will change the program's visible behavior. The vectorized code will behave as if the code were written to operate on chunks of the arrays. For example, for hardware with 4-wide execution units, Julia might generate code as if the source were written as follows, using fictional Julia operations + and * that operate on tuples:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n=length(x)
    i = 1
    # Vectorized loop - four logical iterations per physical iteration
    @inbounds while i+3<=n
        t1 = (x[i],x[i+1],x[i+2],x[i+3])   # Load tuple
        t2 = (y[i],y[i+1],y[i+2],y[i+3])   # Load tuple
        t3 = a*t1                          # Scalar times tuple
        t4 = t2+t3                         # Tuple add
        (y[i],y[i+1],y[i+2],y[i+3]) = t4   # Tuple store
        i += 4                             # Finished 4 logical iterations
    end
    # Scalar loop for remaining iterations
    @inbounds while i<=n
        t1 = x[i]
        t2 = y[i]
        t3 = a*t1
        t4 = t2+t3
        y[i] = t4
        i += 1
    end
end

As long as the contents of array y do not overlap the contents of array x, the result will be the same as the serial code. However, if there is overlap, the results might differ, because @simdgives the compiler license to reorder operations. Here is a diagram showing the original order:

@simd transposes the order of operations for chunks of iterations, like this:

The horizontal arrows are academic here, since with 4-wide execution hardware, each row of operations happens instantaneously. However, @simd gives license to "vectorize" across chunks of iterations wider than the execution hardware. For example, the order above is also valid for 2-wide execution hardware too. In practice, the compiler often uses chunks that are wider than the execution hardware, so that multiple operations can be overlapped. So do not assume that the chunk size for transposition is the natural hardware size. Furthermore, @simd actually grants the compiler even more reordering latitude as I will discuss later.

Implicit vs. Explicit Vectorization

Vectorization is a program transform. For any program transform, a compiler has to ask three questions:

Is the transform possible?
Is the transform legal?
Is the transform profitable?

The part of a compiler that answers these questions for vectorization is called the vectorizer. The first question relates to the capabilities of the vectorizer and hardware. Some vectorizers can handle only very simple loops. Others can deal with complicated control flow. If the vectorizer determines that it does not know how to vectorize the loop, there is no point in asking the next two questions.

The meaning of "legal" depends on the language specification. Julia currently does not have a formal specification, so in practice "legal" means that the visible program behavior is identical to that of unvectorized code. Sometimes the behavior will be slightly different in a way that the programmer did not care about, but the vectorizer must nonetheless assume vectorization is not legal unless the programmer grants explicit license.

The meaning of "profitable" can depend on context. For purposes here, "profitable" means "runs faster", though in other contexts it might be "takes less code space in memory". Vectorization often, but not always, improves performance at the cost of increasing code space.

Vectorization can be either implicit or explicit. In implicit vectorization, the compiler proves that the transposition of operations is legal. For Julia, the hard part of the proof is proving that no output array overlaps with any input or output array. If the compiler can otherwise vectorize the code, but cannot prove the absence of overlap, it may generate the vector code anyway, with a run-time check that jumps to the remainder loop if overlap is detected. Here is our example with a run-time check:

function axpy(a::Float32, x::Array{Float32,1}, y::Array{Float32,1})
    n=length(x)
    i = 1
    if !overlap(x,y)                           # "overlap" is fictional function
        # Vectorized loop - four logical iterations per physical iteration
        @inbounds while i+3<=n
            t1 = (x[i],x[i+1],x[i+2],x[i+3])   # Load tuple
           ...
            (y[i],y[i+1],y[i+2],y[i+3]) = t4   # Tuple store
            i += 4                             # Finished 4 logical iterations
        end
    end
    # Scalar loop for remaining iterations
    @inbounds while i<=n
        t1 = x[i]
        ...
        y[i] = t4
        i += 1
    end
end

In this example, the run-time check is relatively cheap since only two arrays are involved. You won't see a noticeable performance difference by adding @simd to the example since all it would do is to remove the run-time check. However, the cost of the check can grow quadratically with the number of arrays referenced by the loop body. Given M output arrays and N input arrays, the cost is O(M*(M+N)). Furthermore, a run-time check is impractical in cases that involve tricky subscripting patterns, such as:

for i=1:n
    t = w[j[i]]    # "gather"
    w[k[i]] = t    # "scatter"
end

Here, proving that transposition of chunks does not change the visible program behavior amounts to detailed inspection of j[i] and k[i] that, in the absence of special hardware support, can be slower than just executing the code serially.

In explicit vectorization, you as the programmer guarantee that vectorization is legal. In Julia, that's done be prefixing a for loop with @simd. Be careful using it. If you use @simd on a loop what was not legal to vectorize, your results may be wrong.

What You Promise With @simd

For some technical reasons, @simd actually grants more latitude than just permission to transpose evaluations. It also tells the compiler that all iterations are independent in both of the following senses:

No iteration reads or writes a location that is written by another iteration.
No iteration waits on another iteration.

The second sense is what distinguishes an @simd loop from multi-threaded loops found in some other languages. Currently, it's an academic point for Julia until it acquires shared memory multi-threading capabilities. To summarize, when you use @simd, you promise that iterations do not communicate with each other.

Reductions

Reductions are an exception to the no communication rule for @simd. Reduction operations will work as long as the vectorizer recognizes them as such. The rules for making them recognizable need to be formalized, but for now use +=, *=, &=, |=, or $=, or expand the op= into its equivalent syntactic form. For example, "s += expr" can be written as "s =s + expr". The reduction variable should be a local variable. Here is an example reduction:

function summation(x)
    s = zero(x[1])
    @simd for i=1:length(x)
        @inbounds s += x[i]
    end
    s
end

Integer min/max reductions also work. If the compiler fails to recognize your reduction, it will refuse to vectorize the code.

For 4-wide SIMD execution, the vectorized code acts as if it were written like this:

function summation(x::Array{Float32,1})
    n = length(x)
    t = (0f0,0f0,0f0,0f0)          # Initialize partial sums
    i=1
    @inbounds while i+3<=n
        # Four logical iterations per physical iteration
        t += (x[i],x[i+1],x[i+2],x[i+3])
        i += 4
    end
    s = (t[1]+t[2]) + (t[3]+t[4])  # Merge partial sums
    @inbounds while i<=n
        s += x[i]
        i += 1
    end
    s
end

Vectorization of reductions not only reorders loads and stores, it also reorders the reduction operations. Here is a drawing of how data flows through a serial summation:

Here is how it flows through a vectorized summation:

Since flloating-point addition is commutative, but not associative, the reordering may cause a different result. Whether the result from the reordered summation is more or less accurate than from the original serial order depends on the addends being summed. In fact, given random addends, the vector order probably gives a slightly more accurate result. Summing rand(Float32,1000), the odds are better than 6 to 1 that the four-way vector summation gives a more accurate result than the original serial code.

Of course, under some conditions, the serial sum may be more accurate. For example, if you are sorting your addends by magnitude before adding them, and there are a few addends with relatively large magnitude which cancel each other, the vectorized summation will likely give a less accurate result. But that kind of tricky case tends to be easily identifiable by noticing the sort.

The Julia compiler does implicit vectorization of reductions only if the result is the same as the serial code, as it will be for integers. Otherwise you need to use @simd. The following table summarizes the situation.

Which Reductions Vectorize in Julia 0.3
	Integer	Float32 and Float64
Implicit	+, *, &, \|, $, min, max	none
Explicit (@simd)	+, *, &, \|, $, min, max	+, *

Floating-point min/max reductions might work in some future version of Julia. The issue is that LLVM recognizes floating-point min reductions for min(x,y) defined as x<y?x:y, but Julia uses a trickier definition of min that conforms with the IEEE floating-point standard:

min{T<:FloatingPoint}(x::T, y::T) = ifelse((y < x) | (x != x), y, x)

The expression x!=x is true when x is an IEEE NaN (Not a Number) value.

Speedup Surprise

The reassociation allowed by @simd sometimes improves speed by more than you would think possible. For example, I've seen @simd speed up the summation example by 12x on an Intel 4th Generation processor, even though the hardware has only 8-wide execution units. This is possible because the reassociation lets the compiler do the summation with conceptual 32-wide vector operations, each synthesized from four 8-wide operations. The compiler then overlaps those 8-wide operations in a way that lets the hardware pipeline them efficiently in a way that was not possible in the serial code.

Inspecting Whether Code Vectorizes

There is currently no feedback from the compiler on whether code vectorized. Feedback might be possible in future versions of Julia since the most recent version of the LLVM vectorizer implements such feedback. In lieu of feedback, the best thing is to learn how to skim LLVM code to check if vectorization is happening as you expect. You don't have to be a compiler expert since the vectorizer leaves some footprints in the code.

To inspect the LLVM code for the earlier axpy example, invoke the macro @code_llvm like this:

julia> @code_llvm axpy(1.414f0, x, y)

where the arguments are the same as the earlier example. Since code generation pays attention only to the types, not the values of the arguments, you can also use:

julia> @code_llvm axpy(0.0f0, Float32[],Float32[])

The macro dumps the LLVM code for the top-level expression. There's also a similar function code_llvm that is invoked like this:

julia> code_llvm(axpy,(Float32,Array{Float32,1},Array{Float32,1}))

The function code_llvm takes a function as its first argument and a tuple of argument types as its second argument.

Because Julia uses a just-in-time Compiler (JIT), the LLVM output depends on your processor. Indeed, one of the benefits of a JIT is that you can get code tailored to your processor. What I get on an Intel 4th Generation Core i7 processor for @code_llvm axpy(1.414f0, x, y) is too long to include completely here. See the attached .txt file if you are interested in seeing it all. For skimming purposes, here are the relevant lines (... denotes deleted portions)

  ...
vector.ph:                                        ; preds = %if
  %broadcast.splatinsert12 = insertelement <8 x float> undef, float %0, i32 0
  %broadcast.splat13 = shufflevector <8 x float> %broadcast.splatinsert12, <8 x float> undef, <8 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  ...
  %wide.load = load <8 x float>* %24, align 4
  ...
  %wide.load9 = load <8 x float>* %26, align 4
  ...
  %wide.load10 = load <8 x float>* %28, align 4
  ...
  %wide.load11 = load <8 x float>* %30, align 4
  %31 = fmul <8 x float> %wide.load10, %broadcast.splat13
  %32 = fmul <8 x float> %wide.load11, %broadcast.splat13
  %33 = fadd <8 x float> %wide.load, %31
  %34 = fadd <8 x float> %wide.load9, %32
  store <8 x float> %33, <8 x float>* %24, align 4
  store <8 x float> %34, <8 x float>* %26, align 4
  ...
  br i1 %35, label %middle.block, label %vector.body
  ...

The footprints to look for are the labels prefixed with vector, and types of the form <n x float>. Those labels are added by the vectorizer for the vectorized loop. The operations on <n x float> are the SIMD instructions. In this example, they correspond to AVX instructions on 8-wide vectors. Note that the vectorizer used a chunk size of 16 here, using pairs of 8-wide instructions for each conceptually 16-wide operation.

By the way, Julia has more features for inspecting code at different levels used by the Julia compiler. For example, you can use code_native to see the native code generated by compiler. Leah Hanson's blog is a good introduction to the various levels of "introspection".

Vectorization Recommendations for Julia 0.3

Earlier, I described how vectorization is a transform, and to decide whether to perform a transform, a compiler must answer three questions:

Is it possible?
Is it legal?
Is it profitable?

@simd provides a way to force the answer about legality. But you also have to make vectorization possible within the capabilities of the compiler, and profitable for the given hardware. So you have to learn how to cater to the limitations of the vectorizer. The following sections describe the catering rules and their rationale. A summary of those rules will be presented afterwards.

Trip Count Must Be Obvious

The trip count of a loop is the number of times the body is executed. The vectorized code needs to be able to calculate that trip count before commencing the loop, so that it knows how many chunk iterations to execute. Fortunately, Julia has concise for-loop syntax from which the trip count is obvious when the the loop is over a range object such as m:n. Stick with that form and the trip count should not be an issue.

The trip count is an issue when a for-loop is applied to iterable objects other than ranges, because a regular Julia for-loop is really just a glorified while-loop that marches an iterator across the object until the iterator says "no more". For sake of being able to compute a trip count, @simd causes a for-loop to executed differently. Given a loop of the form:

    @simd for i=r…
    end

execution of @simd assumes that:

length(r) returns the trip count.
first(r) returns the first index value.
step(r) returns the stride between successive index values.

A regular for-loop does not assume such expressions are valid. See Section 2.11 of this paper for how regular for-loops are handled.

The Loop Body Should Be Straight-Line Code

The current vectorizer for Julia generally requires that the loop body not contain branches or function calls. But practically all operations in Julia are function calls, furthermore complicated by run-time dispatch (i.e. run-time overload resolution). Fortunately, the Julia compiler is good at eliminating calls to short functions by inlining them, as long as it can infer the argument types at compilation time. So the key is learning how to write type-stable code, which lets the inference mechanism work well. See John Myles White's article to learn about type-stable code.

Code with constructs that might throw exceptions also contains branches, and so will not vectorize. This is why the @inbounds notation is currently necessary. It turns off subscript checking that might throw an exception. Be sure that your subscripts are in bounds when using @inbounds, otherwise you can corrupt your Julia session.

Short conditional expressions involving &&, ||, or ?: will sometimes vectorize. Here is an example that vectorized with Julia 0.3 when I tried it on an Intel(R) 4th Generation Core i7 processor:

function clip(x, a, b)
    @simd for i=1:length(x)
        @inbounds x[i] = ifelse(x[i]<a,a,ifelse(x[i]>b,b,x[i]))
    end
end

# Shows that code vectorizes for Float32 arrays.
@code_llvm clip(Float32[],0.0f0,0.0f0)

Vectorization of p?q:r depends on the compilers ability to figure out whether it can legally/profitably replace it with the equivalent of Julia's ifelse(p,q.r). It's not always legally the same, because ifelse evaluates both expressions q and r regardless of the value of p. If you intend a loop body to be vectorizable, consider writing it with ifelse instead of ?:, like this:

function clip(x, a, b)
    @simd for i=1:length(x)
        @inbounds x[i] = ifelse(x[i]<a,a,ifelse(x[i]>b,b,x[i]))
    end
end

The current Julia definitions of min and max use ifelse, so this example could be written more simply using minand max, like this:

function clip(x, a, b)
    @simd for i=1:length(x)
        @inbounds x[i] = max(x[i]<a,a,min(b,x[i]))
    end
end

Subscripts Should Be Unit Stride

The amount that an array subscript changes between iterations is called its stride. The current vectorizer targets loops that have unit-stride access patterns. In practice, that means that for a @simd loop with index i, you want array subscripts to either be:

loop_invariant_value
i
i + loop_invariant_value
i- loop_invariant_value

The vectorizer will tolerate an occasional non-unit stride index, such as 2i, but be warned that the resulting code may be slow. For example, I tried this example, which has an access with stride 2:

function stride2(a, b, x, y)
    @simd for i=1:length(y)
        @inbounds y[i] = a * x[2i] + b
    end
end

@code_llvm stride2(0.0f,0.0f,Float32[],Float32[])

On an Intel 4th Generation Core i7 processor, the LLVM output showed that the compiler synthesized the load x[2i] from a bunch of scalar loads, so clumsily that removing the @simd actually speeds up the example by about 1.4x. I'm hoping futures versions of Julia will do better, but for now I recommend sticking with unit-stride subscripts when using @simd.

When working with nested loops on two-dimensional arrays, use @simd on the inner loop and make that loop index the leftmost subscript of arrays. This rule comes naturally to Fortran programmers, but rubs against habit for C/C++ programmers. Here is an example:

function updateV(irange, jrange, U, Vx, Vy, A)
    for j in jrange
        @simd for i in irange
            @inbounds begin
                Vx[i,j] += (A[i,j+1]+A[i,j])*(U[i,j+1]-U[i,j])
                Vy[i,j] += (A[i+1,j]+A[i,j])*(U[i+1,j]-U[i,j])
            end
        end
    end
end

# Shows that code vectorizes for Float32
R = 1:8
A = Float32[]
@code_llvm updateV(R,R,A,A,A,A)

Use 32-Bit Floating-Point Arithmetic

You'll notice that all the examples that I've shown use Float32 instead of Float64. As far as I can tell the LLVM 3.3 vectorizer in Julia 0.3 on Intel hardware refuses to vectorize 64-bit math. Hopefully the situation will improve in the future.

Summary Recommendations for Effective Vectorization in Julia

For implicit or explicit vectorization:

No cross-iteration dependencies
Straight-line loop body only. Use ifelse for conditionals.
Use @inbounds
Make sure that all calls are inlined. Write type-stable code.
Use unit-stride subscripts
Use 32-bit floating-point arithmetic
Reduction variables should be local variables.

For implicit vectorization, the additional constraints are:

Access no more than about 4 arrays inside the loop.
Do not use floating-point reductions.

Otherwise, use explicit vectorization by marking your loop with @simd.

Future Directions

@simd is a first step in adding vectorization to Julia. There is much room for improvement, particularly if Julia is to match the performance of statically typed languages. Some possible future improvements are:

Report to the user why a loop failed to vectorize.
Vectorize 64-bit arithmetic.
Vectorize complex arithmetic.
Vectorize tuple math. This was prototyped, but took too much compilation time for Julia 0.3. With changes to LLVM it may be practical in the future.
Vectorize loops without@inbounds. The semantics of @simd are designed to allow doing subscript checking (and throwing an exception if necessary) before the loop really starts. Or perhaps the checks could be vectorized.
Vectorize loops bodies that have complicated control-flow. This may become more important and practical with instruction set extensions such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) that support masking.
Add support for limited forms of cross-iteration dependencies that are useful in practice. See my notes for more details.

LLVM hackers looking for projects: please take note!

Conclusion

There is a joke that vectorizers in the 1970's did not work well, but they taught programmers how to write code that could be vectorized by simple vectorizers. Such is the case with the current Julia implementation: Learn how to cater to the vectorizer and it can deliver. Remember too that Julia is an interactive "scripting" language. That such a language can often be at least half as fast as vectorized static languages such as C/C++/Fortran, without restricting that performance to a set of precompiled library routines, is amazing.

Acknowledgments

Jeff Bezanson pointed out ifelse as a way to avoid branches. Jim Cownie suggested corrections and improvements to an earlier draft. Elliot Saba, Patrick O'Leary, and Jacob Quinn corrected errors in the first public draft. Jacob pointed out the convenience of using @code_llvm instead of code_llvm. Thanks goes to the LLVM project for a framework that enables new languages to get going quickly. Thanks goes to the Julia people for starting a modern language for technical computing and making it open source.

Area tema:

IDZone

Ultimo aggiornamento:

Lunedì, 15 Settembre, 2014

↧

How to analyze Intel® Xeon Phi™ coprocessor applications using Intel® VTune™ Amplifier XE 2015

September 16, 2014, 6:38 am

Latest and popular articles on Intel Technologies

≫ Next: Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics

≪ Previous: Vectorization in Julia

Introduction

Intel® VTune™ Amplifier XE 2015 now includes some new capabilities for analyzing Intel® Xeon Phi™ coprocessor applications. This article will step through this analysis on a Intel® Xeon Phi™ coprocessor and also outline some of the new capabilities.

Compiling and running the application

The application we will be using is one of the samples include in VTune Amplifier. It is located in /opt/intel/vtune_amplifier_xe_2015/samples/en/C++/matrix_vtune_amp_xe.tgz. To build the application on Linux*:

First source the environment for the Intel® Compiler you are using.
1. source /opt/intel/compiler_xe_2015/compilervars.sh intel64
Untar the sample in a directory where you have permission
1. tar xvzf matrix_vtune_amp_xe.tgz
By default the sample does not use OpenMP*. You will need to modify the Makefile
1. cd matrix/linux
2. Edit the Makefile
3. Comment the default PARAMODEL and uncomment the OpenMP PARAMODEL.
Build the application to run native on the Intel® Xeon Phi™ coprocessor
1. cd matrix/linux
2. make mic
The make command from step #4 will create a Intel Xeon Phi native matrix.mic executable. It will also copy the file to mic0:/tmp.
Verify the libiomp5.so library is available on your Intel Xeon Phi coprocessor.
Run the application
1. /tmp/matrix.mic

Addr of buf1 = 0x7fec2b054010

Offs of buf1 = 0x7fec2b054180

Addr of buf2 = 0x7fec23fd3010

Offs of buf2 = 0x7fec23fd31c0

Addr of buf3 = 0x7fec1cf52010

Offs of buf3 = 0x7fec1cf52100

Addr of buf4 = 0x7fec15ed1010

Offs of buf4 = 0x7fec15ed1140

Threads #: 240 OpenMP threads

Matrix size: 3840

Using multiply kernel: multiply1

Freq = 1.090908 GHz

Execution time = 23.866 seconds

Running the application using VTune Amplifier

Source /opt/intel/vtune_amplifier_2015/amplxvar.sh
Start the VTune Amplifier GUI
1. amplxe-gui
Create a VTune Amplifier project
1. File->New->Project
2. There are several new options in the “Target System” menu pull down
3. .
We will be selecting the menu item Intel Xeon Phi coprocessor (native)
Specify the "Launch Application" menu item
Specify the application name /tmp/matrix.mic
1. Note: This application is located on the Intel Xeon Phi coprocessor's file system.
Click on Ok
To analyze your application
1. Click on "New Analysis"
2. Click on "Advanced Hotspots"
3. Click Start
4. VTune Amplifier will launch the application and then finalize the result

Summary

VTune Amplifier has made some significant improvements in the analysis of Intel Xeon Phi coprocessor applications. This article has explained how to launch native applications under VTune Amplifier using the new GUI interface available in the 2015 release, but you can easily analyze offloaded applications using a very similar method, using the “Intel Xeon Phi coprocessor (host launch)” menu item shown in a menu above. Changes for the 2015 release also impact the command line interface, amplxe-cl. Look for another article to explain that.

Intermedio

Intel® SDK per applicazioni OpenCL™

Area tema:

IDZone

↧

Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics

September 16, 2014, 12:37 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel Composer XE 2015 Silent Installation Guide

≪ Previous: How to analyze Intel® Xeon Phi™ coprocessor applications using Intel® VTune™ Amplifier XE 2015

Downloads

Download Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics [PDF 673KB]

Download OpenCL Zero Copy code sample [ZIP 22.4KB]

Introduction

This document provides guidance to OpenCL™ developers who want to optimize applications running on Intel® processor graphics. Specifically, this document shows you how to minimize the memory footprint of applications and reduce the amount of copying on buffers in the shared physical memory system of an Intel® System on Chip (SoC) solution. It also provides working source code to demonstrate these principles.

The OpenCL 1.2 Specification includes memory allocation flags and API functions that developers can use to create applications with minimal memory footprint and maximum performance. This is accomplished by eliminating extra copies during execution, referred to as zero copy behavior. This document augments the OpenCL API specification by giving guidance specific to Intel processor graphics.

Key Takeaway

To create zero copy buffers, do one of the following:

Use CL_MEM_ALLOC_HOST_PTR and let the runtime handle creating a zero copy allocation buffer for you
If you already have the data and want to load the data into an OpenCL buffer object, then use CL_MEM_USE_HOST_PTR with a buffer allocated at a 4096 byte boundary (aligned to a page and cache line boundary) and a total size that is a multiple of 64 bytes (cache line size).

When reading or writing data to these buffers from the host, use clEnqueueMapBuffer(), operate on the buffer, then call clEnqueueUnmapMemObject(). This paper contains code samples to demonstrate the best known practices on Intel® platforms.

Motivation

Memory management within the GPU driver has a complicated set of memory usage scenarios that need to be considered. Applications can inform the driver of their usage by specifying flags during memory allocation as well as through specific memory access or transfer APIs called during runtime. Sometimes driver implementations need to create or manage internal copies of memory buffers to facilitate servicing these API calls. For example, internal memory buffer copies might be created to support the memory layout preferred by the CPU or GPU or to improve caching behavior. Such copies may be necessary in these scenarios, but they can detrimentally impact performance. Application developers need device-specific knowledge in order to know how to avoid these copies.

Definitions

Before going into the technical details, here are some definitions of terms used in this article.

Host memory: Memory accessible on the OpenCL host.
Device memory: Memory accessible on the OpenCL device.
Zero copy: Refers to the concept of using the same copy of memory between the host, in this case the CPU, and the device, in this case the integrated GPU, with the goal of increasing performance and reducing the overall memory footprint of the application by reducing the number of copies of data.
Zero copy buffers: Buffers created via the clCreateBuffer() API that follow the rules for zero copy. This is implementation dependent so the rules on one device may be different than another.
Shared Physical Memory: The host and the device share the same physical DRAM. This is different from shared virtual memory, when the host and device share the same virtual addresses, and is not the subject of this paper. The key hardware feature that enables zero copy is the fact that the CPU and GPU have shared physical memory. Shared physical and shared virtual memories are not mutually exclusive.
Virtual Memory: The memory model used by the operating system to give the process perceived ownership of its own dedicated memory space. Pointers that programmers operate on are not physical memory addresses but instead virtual addresses that are part of a virtual address space. The platform handles conversions between these virtual addresses and the physical memory address.
Intel processor graphics: The term used when referring to current Intel graphics solutions. Product names for Intel GPUs integrated in SoC include Intel® Iris™ graphics, Intel® Iris™ Pro graphics, or Intel® HD Graphics depending on the exact SoC. For additional hardware architecture details see the Intel® Gen 7.5 Compute Architecture document referenced at the end of this document or http://ark.intel.com/.

Intel® Processor Graphics with Shared Physical Memory

Intel processor graphics shares memory with the CPU. Figure 1 shows their relationship. While not shown in this figure, several architectural features exist that enhance the memory subsystem. For example, cache hierarchies, samplers, support for atomics, and read and write queues are all utilized to get maximum performance from the memory subsystem.

Figure 1.Relationship of the CPU, Intel® processor graphics, and main memory. Notice a single pool of memory is shared by the CPU and GPU, unlike discrete GPUs that have their own dedicated memory that must be managed by the driver.

Benefits of Zero Copy

With Intel processor graphics, using zero copy always results in better performance relative to the alternative of creating a copy on the host or the device. Unlike other architectures with non-uniform memory architectures, memory shared between the CPU and GPU can be efficiently accessed by both devices.

Memory Buffers in OpenCL

The performance of buffer operations in OpenCL can be different on different OpenCL implementations. Here, we clarify the behavior on Intel processor graphics.

Creating Buffers with clCreateBuffer()

One use case for OpenCL is when memory is already populated on the host and you want the device to read this data. In this case use the flag CL_MEM_USE_HOST_PTR to create the buffer. This may be the case when we are using OpenCL with existing codebases. When using the CL_MEM_USE_HOST_PTR flag, if we want to guarantee a zero copy buffer on Intel processor graphics, we need to ensure that we adhere to two device-dependent alignment and size rules. We must create a buffer that is aligned to a 4096 byte boundary and have a size that is a multiple of 64 bytes. Also note that if we write into this buffer, we will overwrite the original contents of the buffer. For clarity we include a short code sequence at the end of this section to test a buffer to determine if it meets this criteria.

CL_MEM_USE_HOST_PTR: Use this when a buffer is already allocated as page-aligned with _aligned_malloc() instead of malloc() and a size that is a multiple of 64 bytes:

int *pbuf = (int *)_aligned_malloc(sizeof(int) * 1024, 4096);

Using _aligned_malloc() requires the use of _aligned_free() when deallocating. On Linux*, Android*, and Mac OS* see documentation for mem_align() or posix_memalign(). Do not use free() on memory allocated with _aligned_malloc(). Create the buffer and its associated cl_mem object using:

cl_mem myZeroCopyCLMemObj = clCreateBuffer(ctx,…CL_MEM_USE_HOST_PTR…);

A second case is when the data will be generated on the device but may be read back on the host. In this case leverage the CL_MEM_ALLOC_HOST_PTR flag to create the data. Do not worry about using the example code below to test that the memory is sized and allocated at a proper base address, the runtime will handle this for you.

A third case is when data is generated on the host but your application is in control of the initialization of the buffer. In this case you create the buffer, then initialize it. For example, suppose you are reading in input from a file. The difference between this case and the use of CL_MEM_USE_HOST_PTR is if the buffer has already been populated. To initialize the contents of the buffer, use the OpenCL map and unmap API functions described later.

CL_MEM_ALLOC_HOST_PTR: Use this flag when you have not yet allocated the memory and want OpenCL to ensure that you have a zero copy buffer.

buf = clCreateBuffer(ctx, ….CL_MEM_ALLOC_HOST_PTR, ….)

Table 1.Different application scenarios showing which flags to pass to clCreateBuffer() to enable the use of zero copy buffers on Intel® processor graphics

Flag(s)	When to use the flag(s) to enable a zero copy scenario
CL_MEM_USE_HOST_PTR	Buffer was already created in existing application code and the alignment and size rules were followed when the buffer was allocated, or you want control over the buffer allocation and do not want to rely on OpenCL. In cases when you don't want to incur the cost of a copy that would take place with CL_MEM_ALLOC_HOST_PTR \| CL_MEM_COPY_HOST_PTR. In cases when data can be safely overwritten by OpenCL or you know the data will not be overwritten because your application controls any writes to the buffer.
CL_MEM_ALLOC_HOST_PTR	You want the OpenCL runtime to handle the alignment and size requirements. In cases when you may be reading data from a file or another I/O stream. A brand new application being written to use OpenCL and not a port from existing code. Buffer will be initialized in host or device code and not by a library decoupled from your control. Don't forget to map and unmap the buffer during initialization.
CL_MEM_ALLOC_HOST_PTR \| CL_MEM_COPY_HOST_PTR	You want the OpenCL runtime to handle the size and alignment requirements. In cases when you may be reading or writing data from a file or another I/O stream and aren't allowed to write to the buffer you are given. Buffer is not already in a properly aligned and sized allocation and you want it to be. You are okay with the performance cost of the copy relative to the length of time your application executes, for example at initialization. Porting existing application code where you don't know if it has been aligned and sized properly. The buffer used to create the OpenCL buffer needs the data to be unmodified and you want to write to the buffer

In summary, most cases can use CL_MEM_ALLOC_HOST_PTR on Intel processor graphics. Do not forget when initializing the buffer contents to first map the buffer, write to the buffer, and then unmap the buffer. In some cases the data may already be in an aligned and properly sized allocation, allowing you to use CL_MEM_USE_HOST_PTR.

This short function in C verifies that the pointer and size of the allocation adheres to the alignment and size rules:

unsigned int verifyZeroCopyPtr(void *ptr, unsigned int sizeOfContentsOfPtr)
{
	int status; //so we only have one exit point from function
	if((uintptr_t)ptr % 4096 == 0) //page alignment and cache alignment
	{
		if(sizeOfContentsOfPtr % 64 == 0) //multiple of cache size
		{
			status = 1;
		}
		else status = 0;
	}
	else status = 0;
	return status;

Accessing the Buffer on the Host

When directly accessing any buffer on the host, zero copy buffer or not, you are required to map and unmap the buffer in OpenCL 1.2. See below and the sample code for details.

Accessing the Buffer on the Device

Accessing the buffer on the device is no different than any other buffer; no code change is required. You only need to worry about the host-side interaction and the map and unmap APIs.

Use clEnqueueMapBuffer() and clEnqueueUnmapMemObject()

The APIs clEnqueueReadBuffer(), clEnqueueWriteBuffer(), and clEnqueueCopyBuffer() are not recommended, especially for large buffers since they require the contents of the buffer to be copied. Sometimes, however, these APIs might be beneficial, for example, if you are reading the contents out of the buffer because you want to reuse it immediately on the GPU in a double buffering scenario. In this case, it is useful to make a host-side copy and let the device continue to operate on the original buffer. To facilitate host read or write access to a memory buffer that has been shared with Intel processor graphics, use the APIs clEnqueueMapBuffer() and clEnqueueUnmapMemObject().

Example use of clEnqueueMapBuffer():

mappedBuffer = (float *)clEnqueueMapBuffer(queue, cl_mem_image, CL_TRUE, CL_MAP_READ, 0, imageSize, 0, NULL, NULL, NULL);

Example use of clEnqueueUnmapMemObject():

clEnqueueUnmapMemObject(queue, cl_mem_image, mappedBuffer, 0, NULL, NULL);

Caveats on Other Platforms

The behavior described above may not be the same on all platforms. It is best to check the vendor's documentation. The OpenCL API specification provides a vendor the ability to create zero copy buffers. It does not guarantee to always return a zero copy buffer. In fact, the specification actually states that a copy may be created. In the documentation for CL_MEM_USE_HOST_PTR:

"OpenCL implementations are allowed to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device."

An Observation about the Virtual Address Space

You may notice that the address you are validating to be on a 4096 byte page boundary is a virtual address boundary and not a physical address. Is this a problem? While in theory it could be an issue, after all the OS could have mapped a virtual address to any physical address, this does not happen in our implementation. You are assured that if the virtual address is page aligned, then the physical address is page aligned. More details on how and why this works is beyond the scope of this article.

Validating Zero Copy Behavior

One downside of the OpenCL 1.2 API is there is no runtime mechanism to validate that a copy has or has not occurred. For example, when a clEnqueueMapBuffer() executes, was a copy created or not? The way we verified these samples was to time a short program that declared the output buffer with _aligned_malloc() and compare the result when using malloc() with an image of size 1024x1024 over several iterations to verify that the aligned allocation was significantly faster from just before the map to just after the unmap. These timings also include driver overhead. We have left the timing code in the sample if you want to verify this yourself.

A Zero Copy Example

We ported to OpenCL a well-known BSD licensed codebase that simulates ambient occlusion called AOBench available at: https://code.google.com/p/aobench/. We start with a straightforward mapping to OpenCL, the way any reasonable programmer would do as a first attempt. Next we show two different versions where we create the computed ambient occlusion resultant image as a zero copy buffer. The first shows the code when you want to allocate the buffer and pass this buffer to the OpenCL runtime. This might be common when you already have an existing application that makes use of CL_MEM_USE_HOST_PTR. The second is further simplified and is useful when you want the OpenCL runtime to handle the buffer allocation and uses CL_MEM_ALLOC_HOST_PTR. We have left it up to you to implement the third possibility: a short, but useful, exercise to create the buffer using CL_MEM_ALLOC_HOST_PTR, map the pointer, populate the buffer on the host, then unmap the pointer. We focus on the output image buffer for this example; other buffers could be treated similarly.

Sample File and Directory Structure

This section contains details on how this code is partitioned. The emphasis was to create a simple C example and not a product quality implementation. Most of the code is common across all of the samples.

Source Files:

Common/host_common.cpp: Boilerplate to start up and manage an OpenCL context, compile the source, create queues, and handle general cleanup.
Common/kernels.cl: OpenCL kernel code for this particular example. The contents of this file are of no significance other than demonstrating a complete application.
Include/host_common.h: header file for host_common.cpp, various initialization values, and function and variable declarations.
Include/scene.h: scene graph functions and variables used in this sample.

NotZeroCopy/main.c: source file that contains the functions for doing a straightforward port of the sample code. The functions we are interested in are initializeDeviceData() and runClKernels(). In initializeDeviceData() we used a standard malloc() without forcing an alignment, resulting in a surface that will not support zero copy. Also, in runCLKernels() we used the standard API call clEnqueueReadBuffer(). Functionally, this is 100% correct, however it is not optimal for performance.
ZeroCopyUseHostPtr/main.c: source file that contains functions that demonstrate modifications to NotZeroCopy. Notice the use of _aligned_malloc() instead of malloc() from NotZeroCopy in initializeDeviceData(). Also, when we call clCreateBuffer() we use the flag CL_MEM_USE_HOST_PTR.
ZeroCopyAllocHostPtr/main.c: source file that contains functions that demonstrate the modifications required when using the allocation mechanism of the runtime. Specifically, notice the fact that now we do not even need to call _aligned_malloc() as we did in the ZeroCopyUseHostPtr example. Instead, we simply pass the flag CL_MEM_ALLOC_HOST_PTR and pass the size of the buffer we want to allocate.

The other files and directories are generated automatically by the Microsoft Visual Studio* IDE.

Microsoft Visual Studio 2012 Configuration:

The Microsoft Visual Studio 2012 solution, OpenCLZeroCopy.sln, has three projects: NotZeroCopy.vcxproj, ZeroCopyAllocHostPtr.vcxproj, and ZeroCopyUsedHostPtr.vcxproj. The OpenCLZeroCopy.props property sheet holds the settings specific to this project. These settings include: the system path to the cl.h header file, the pointer to the OpenCL.lib library to link to, and the pointer to the local include directory for this example. You may need to change these settings for your build environment.

Building and Running the Examples

Build Requirements

First, make sure you have downloaded and installed the Intel® SDK for OpenCL™ Applications available here: https://software.intel.com/en-us/vcsource/tools/opencl-sdk. Also, be sure to install Microsoft Visual Studio 2012 (MSVC 2012) IDE. Next, open the solution file OpenCLZeroCopy.sln in MSVC 2012.

Paths in the Property Sheet

Instead of making changes for each build of each executable, MSVC supports the use of property sheets. If you change a property sheet, the change propagates to all builds that include this property sheet in their build settings. We have included figures here that show the path names on our system if you decide you need to change them. Alternatively, you can use the environment variable: $(INTELOCLSDKROOT), which defaults to C:\Program Files (x86)\Intel\OpenCL SDK\3.0\. You may have a more recent version. We have a relative path for the include files used across each of the executables and use the default installation location for the Intel OpenCL SDK. For more information on property sheets consult the MSDN documentation: http://msdn.microsoft.com/en-us/library/z1f703z5(v=vs.90).aspx .

Figure 2.Additional include directories used in this code sample.

You can access the property manager from the View menu and select the Property Manager Menu item.

Figure 3.Additional library directories used in this code sample.

Figure 4.Notice the opencl.lib file is added as an additional library.

Running the Example

Build this sample code by selecting Build->Build Solution from the main menu. All of the executables should be generated. You can run them within Visual Studio directly or go to the Debug and/or Release directories that are located in the same location as the OpenCLZeroCopy solution file.

What's Coming in OpenCL 2.0: Shared Virtual Memory (SVM)

This paper has focused on understanding the use of buffers that can be shared on platforms that support shared physical memory (SPM) such as the Intel CPUs and Intel processor graphics. OpenCL 2.0 will have APIs to expose shared virtual memory on architectures that can support it. This will allow you to not just have a shared buffer for writing, but also to share virtual addresses on the CPU and GPU. For example, you could leverage SVM to update a scene graph on the CPU using a physics simulation then use the GPU to calculate the final image.

Acknowledgements

Lots of folks have encouraged the development of this and other collateral. Some provided feedback and some helped create the space to get it done: Stephen Junkins, Murali Sundaresan, David Blythe, Aaron Kunze, Allen Hux, Mike Macpherson, Pavan Lanka, Girish Ravunnikutty, Ben Ashbaugh, Sergey Lyalin, Maxim Shevstov, Arnon Peleg, Vadim Kartoshkin, Deepti Joshi, Uri Levy, and Shiri Manor.

References

OpenCL 1.2 specification: https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf
OpenCL 2.0 specification, composed of three books: the OpenCL C Language specification, the OpenCL Runtime API, and the OpenCL extensions: https://www.khronos.org/registry/cl/specs/
AOBench: https://code.google.com/p/aobench/
Stephen Junkins' whitepaper: Intel® Gen 7.5 Compute Architecture: https://software.intel.com/sites/default/files/managed/f3/13/Compute_Architecture_of_Intel_Processor_Graphics_Gen7dot5_Aug2014.pdf. A must-read for anybody using OpenCL on Intel Processor Graphics platforms.

About the Author

Adam Lake – Adam works in the Visual Products Group as a Senior Graphics Architect and Voting Representative to the Khronos OpenCL Standards Body. He has worked on GPGPU programming for 12+ years. Previously he has worked in VR, 3D, graphics, and stream programming language compilers.

* Other names and brands may be claimed as the property of others.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission from Khronos.
Copyright © 2014 Intel Corporation. All rights reserved.

OpenCL*

zero copy

zero copy buffers

OpenCL memory buffers

Area tema:

IDZone

↧

Intel Composer XE 2015 Silent Installation Guide

September 17, 2014, 8:49 am

Latest and popular articles on Intel Technologies

≫ Next: Using the Develop Tab

≪ Previous: Getting the Most from OpenCL™ 1.2: How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics

Intel^® Composer XE 2015 for Linux
"Silent" or non-interactive Installation Instructions

Navigation:

Linux and Mac OS X Compilers Installation Help Center: /en-us/articles/intel-compilers-linux-installation-help

Contents of this document:

Silent Command Line Installations

Starting with release 11.0, the Linux installation programs for Compiler Professional are built using the PSET (Program Startup Experience Technologies) 2.0 core. This PSET core is a framework of tools built by Intel to provide a robust set of installation and licensing features that are consistent across Intel product lines. A similar PSET core is used for the Windows* and Mac OS* X installation packages as well.

One feature provided in the PSET core is support for the "silent" install. Historically, "silent" really meant "non-interactive". At this point, "silent" also means "does not report copious amounts of information", assuming there are no problems during the installation. The silent install method allows the user to perform a command line installation of an entire package with no need to answer prompts or make product selections.

A historical note specific to Linux installs: there has been a "silent" install capability in the Linux Compiler products since version 9.0. This legacy version, included in versions 9.1, 10.0, 10.1, and 11.0 as well, performed the same function. But starting with version 11.1, the legacy silent install embedded in the "inner install components" has been removed. The new PSET core silent install is the only method still supported - that is, older silent installation tools in v11.0 and older are no longer supported other than the RPM command line install method (see details on RPM-based installs below).

Silent Install Steps: "From Scratch"

To run the silent install, follow these steps:

For reasons outlined below we recommend that a working product license or server license is in place before beginning. The file should be world-readable and located in a standard Intel license file directory, such as the default linux license directory /opt/intel/licenses. For more details, keep reading.
Create or edit an existing silent install configuration file. This file controls the behavior of the installation. Here is an example file. A similar file can be edited and placed in any directory on the target system. After this example we explain the configuration file contents.

# silent.cfg
# Patterns used to check silent configuration file
#
# anythingpat - any string
# filepat     - the file location pattern (/file/location/to/license.lic)
# lspat       - the license server address pattern (0123@hostname)
# snpat      - the serial number pattern (ABCD-01234567)

# accept EULA, valid values are: {accept, decline}
ACCEPT_EULA=accept

# install mode for RPM system, valid values are: {RPM, NONRPM}
INSTALL_MODE=RPM

# optional error behavior, valid values are: {yes, no}
CONTINUE_WITH_OPTIONAL_ERROR=yes

# install location, valid values are: {/opt/intel, filepat}
PSET_INSTALL_DIR=/opt/intel

# continue with overwrite of existing installation directory, valid values are: {yes, no}
CONTINUE_WITH_INSTALLDIR_OVERWRITE=yes

# list of components to install, valid values are: {ALL, DEFAULTS, anythingpat}
COMPONENTS=DEFAULTS

# installation mode, valid values are: {install, modify, repair, uninstall}
PSET_MODE=install

# this one is optional
# directory for non-RPM database, valid values are: {filepat}
#NONRPM_DB_DIR=filepat

# Choose 1 of the 2 activation options - either serial or license
# license is needed if system does not have internet connectivity to Intel
#
# Serial number, valid values are: {snpat}
#ACTIVATION_SERIAL_NUMBER=snpat
#
# License file or license server, valid values are: {lspat, filepat}
#ACTIVATION_LICENSE_FILE=/put/a/full/path/and/licensefile.lic
#
# and based on the above, set the activation type: again, recommend using a license_file.
# exist_lic will look in the normal places for an existing license.
# Activation type, valid values are: {exist_lic, license_server, license_file, trial_lic, serial_number}
ACTIVATION_TYPE=exist_lic

# the next block is for Cluster Edition installations.  Leave commented for non-cluster installs
# Select 'yes' if head node installation can be used from compute nodes, valid values are: {yes, no}
#CLUSTER_INSTALL_AUTOMOUNT=yes
#
# Path to the cluster description file, valid values are: {filepat}
#CLUSTER_INSTALL_MACHINES_FILE=filepat

# Intel(R) Software Improvement Program opt-in, valid values are: {yes, no}
PHONEHOME_SEND_USAGE_DATA=no

# Perform validation of digital signatures of RPM files, valid values are: {yes, no}
SIGNING_ENABLED=yes

Running the Silent Installation

Once you have created your silent installation configuration file, installation is quite simple. First, extract the compiler full package tar file in a temporary directory. For purposes of this example we use /tmp as our temporary directory. You may use any directory in which you have full file permissions. Do not untar the package in the directory where you intend to install the compiler, the temporary directory should be disjoint from your final target installation directory.

Untar the compiler package (assumes the package is copied to /tmp). Your compiler version and package name may differ than that shown below:

cd /tmp
tar -zxvf l_ccompxe_2015.0.090.tgz or
tar -zxvf l_fcompxe_2015.0.090.tgz

Now cd to the extracted directory

cd l_ccompxe_2015.0.090 or
cd l_fcompxe_2015.0.090

Run the install.sh installer program, passing the full path to your configuration file with the --silent option

./install.sh --silent /tmp/silent.cfg

where "silent.cfg" is replaced by the name you used to create your silent configuration file. You may use any name for this file.

DONE. If your configuration file is accepted the installer will now progress with the installation without further input from you, and no output will appear unless there is an error.

CONFIGURATION FILE FORMAT

A few comments on the directives inside the silent install configuration file:

ACCEPT_EULA=accept

This directive tells the install program that the invoking user has agreed to the End User License Agreement or EULA. This is a mandatory option and MUST be set to 'accept'. If this is not present in the configuration file, the installation will not complete. By using the silent installation program you are accepting the EULA.
The EULA is in a plain text file in the same directory as the installer. It has file name "license". Read this before proceeding as using the silent installer means you have read and agree to the EULA. If you have questions, go to our user forum: https://software.intel.com/en-us/forums/intel-software-development-products-download-registration-licensing

INSTALL_MODE=RPM

This directive tells the install program that the RPM method should be used to install the software. This will only work if the install user is "root" or has full root priveleges and your distribution support RPM for package management. In some cases, where the operating system of the target system does not support RPM or if the install program detects that the version of RPM supported by the operating system is flawed or otherwise incompatible with the install program, the installation will proceed but will switch to non-RPM mode automatically. This is the case for certain legacy operating systems (e.g. SLES9) and for operating systems that provide an RPM utility, but do not use RPM to store or manage system-installed operating system infrastructure (e.g. Ubuntu, Debian). THUS, Ubuntu and Debian users set this to INSTALL_MODE=NONRPM.
If the you do not want to use RPM, then this line should read "INSTALL_MODE=NONRPM". In this case, the products will be installed to the same location, but instead of storing product information in the system's RPM database, the Intel product install information will be stored in a flat file called "intel_sdp_products.db", usually stored in /opt/intel (or in $HOME/intel for non-root users). To override this default, use configuration file directive NONRPM_DB_DIR

NONRPM_DB_DIR

If INSTALL_MODE=NONRPM the directive NONRPM_DB_DIR can be used to override the default directory for the installation database. The default is /opt/intel or in $HOME/intel for non-root users. The format for this directive is:
NONRPM_DB_DIR=/path/to/your/db/directory

ACTIVATION=exist_lic

This directive tells the install program to look for an existing license during the install process. This is the preferred method for silent installs. Take the time to register your serial number and get a license file (see below). Having a license file on the system simplifies the process. In addition, as an administrator it is good practice to know WHERE your licenses are saved on your system. License files are plain text files with a .lic extension. By default these are saved in /opt/intel/licenses which is searched by default. If you save your license elsewhere, perhaps under an NFS folder, set environment variable INTEL_LICENSE_FILE to the full path to your license file prior to starting the installation or use the configuration file directive ACTIVATION_LICENSE_FILE to specify the full pathname to the license file.
Options for ACTIVATION are { exist_lic, license_file, server_lic, serial_number, trial_lic }
- exist_lic directs the installer to search for a valid license on the server. Searches will utilize the environment variable INTEL_LICENSE_FILE, search the default license directory /opt/intel/licenses, or use the ACTIVATION_LICENSE_FILE directive to find a valid license file.
- license_file is similar to exist_lic but directs the installer to use ACTIVATION_LICENSE_FILE to find the license file.
- server_lic is similar to exist_lic and exist_lic but directs the installer that this is a client installation and a floating license server will be contacted to active the product. This option will contact your floating license server on your network to retrieve the license information. BEFORE using this option make sure your client is correctly set up for your network including all networking, routing, name service, and firewall configuration. Insure that your client has direct access to your floating license server and that firewalls are set up to allow TCP/IP access for the 2 license server ports. server_lic will use INTEL_LICENSE_FILE containing a port@host format OR a client license file. The formats for these are described here https://software.intel.com/en-us/articles/licensing-setting-up-the-client-floating-license
- serial_number directs the installer to use directive ACTIVATION_SERIAL_NUMBER for activation. This method will require the installer to contact an external Intel activation server over the Internet to confirm your serial number. Due to user and company firewalls, this method is more complex and hence error prone of the available activation methods. We highly recommend using a license file or license server for activation instead.
- trial_lic is used only if you do not have an existing license and intend to temporarily evaluate the compiler. This method creates a temporary trial license in Trusted Storage on your system.
No license file but you have a serial number? If you have only a serial number, please visit https://registrationcenter.intel.com to register your serial number. As part of registration, you will receive email with an attached license file. If your serial is already registered and you need to retrieve a license file, read this: https://software.intel.com/en-us/articles/how-do-i-manage-my-licenses
Save the license file in /opt/intel/licenses/ directory, or in your preferred directory and set INTEL_LICENSE_FILE environment variable to this non-default location. If you have already registered your serial number but have lost the license file, revisit https://registrationcenter.intel.com and click on the hyperlinked product name to get to a screen where you can cut and paste or mail yourself a copy of your registered license file.
Still confused about licensing? Go to our licensing FAQS page https://software.intel.com/en-us/articles/licensing-faq

ACTIVATION_LICENSE_FILE

This directive instructs the installer where to find your named-user or client license. The format is:
ACTIVATION_LICENSE_FILE=/use/a/path/to/your/licensefile.lic where licensefile.lic is the name of your license file.

CONTINUE_WITH_OPTIONAL_ERROR

This directive controls behavior when the compiler encounters an "optional" error. These errors are non-fatal errors and will not prevent the installation to proceed if the user has set CONTINUE_WITH_OPTIONAL_ERROR=yes. Examples of optional errors include an unrecognized or unsupported linux distribution or version or certain prerequisites for a product cannot be found at the time of installation (such as a supported Java runtime or missing 32bit development libraries for 32bit tool installation). Fatal errors found during installation will cause the installer to abort with appropriate messages printed.
CONTINUE_WITH_OPTIONAL_ERROR=yes directs the installer to ignore non-fatal installation issues and continue with the installation.
CONTINUE_WITH_OPTIONAL_ERROR=no directs the installer to abort with appropriate warning messages for the non-fatal error found during the installation.

PSET_INSTALL_DIR

This directive specifies the target directory for the installation. The Intel Compilers default to /opt/intel for installation target. Set this directive to the root directory for the final compiler installation.

CONTINUE_WITH_INSTALLDIR_OVERWRITE

Determines the behavior of the installer if the PSET_INSTALL_DIR already contains a existing installation of this specific compiler version. The Intel compiler allows co-existence of multiple versions on a system. This directive does not affect this behavior, each version of the compiler will have a unique installation structure that does not overwrite other versions. This directive dictates behavior when the SAME VERSION is already installed in the PSET_INSTALL_DIR.
CONTINUE_WITH_INSTALLDIR_OVERWRITE=yes directs the installer to overwrite the existing compiler version of the SAME VERSION
CONTINUE_WITH_INSTALLDIR_OVERWRITE=no directs the installer to exit if an existing compiler installation of the SAME VERSION already exists in PSET_INSTALL_DIR

COMPONENTS

A typical compiler package contains multiple sub-packages, such as MKL, IPP, TBB, Debugger, etc. This directive allows the user to control which sub-packages to install.
COMPONENTS=DEFAULTS directs the installer to install the pre-determined default packages for the compiler (recommended setting). The defaults may not include some sub-packages deemed non-essential or special purpose. An example is the cluster components of MKL, which are only needed in a distributed memory installation. If you're not sure of the defaults you can do a trial installation of the compiler in interactive mode and select CUSTOMIZE installation to see and select components.
COMPONENTS=ALL directs the installer to install all packages for the compiler.
COMPONENTS=<pattern> allows the user to specify which components to install. The components vary by compiler version and package. For example, pattern can be "mkl,ipp,tbb.composer"

PSET_MODE

Sets the installer mode. The installer can install, remove, modify, or repair an installation.
PSET_MODE=install directs the installer to perform an installation
PSET_MODE=remove directs the installer to remove a previous installation. If multiple versions of the compiler are installed, the installer removes the most recent installation. This information is kept in the RPM database or the non-rpm database depending on the mode used for the installation.
PSET_MODE=modify allows the user to redo an installation. The most common scenario is to overwrite an existing installation with more COMPONENTS set or unset.
PSET_MODE=repair directs the installer to retry an installation again, checking for missing or damaged files, directories, and symbolic links, permissions, etc.

CLUSTER_INSTALL_AUTOMOUNT (optional)

This directive is only needed for installation of the Intel(R) Parallel Studio XE 2015 Cluster Edition product. For Composer and Professional Editions leave this directive commented out.
CLUSTER_INSTALL_AUTOMOUNT=yes tells the installer to only perform the main package installation on a cluster head node or admin node in a directory that is remote mounted on all the cluster compute nodes. This prevents the cluster installation from replicating all files on all nodes. The head or admin nodes has the tools installed whereas compute nodes assume the PSET_INSTALL_DIR is remote mounted - hence they do not need a full installation and just a few symbolic links and other small changes as necessary.
CLUSTER_INSTALL_AUTOMOUNT=no directs the installer to use CLUSTER_INSTALL_MACHINES_FILES to find all cluster nodes and perform local installations as if those nodes were stand-alone servers. This requires additional time and replicates files on all nodes.

CLUSTER_INSTALL_MACHINES_FILE (optional)

This directive is only needed for installation of the Intel(R) Parallel Studio XE 2015 Cluster Edition product. For Composer and Professional Editions leave this directive commented out.
This directive instructs the installer where to find the machines file for a cluster installation. The machines file is any text file with the names of all the cluster hosts on which to install the compiler. The work performed on each host depends on CLUSTER_INSTALL_AUTOMOUNT (see above)
CLUSTER_INSTALL_MACHINES_FILE=/your/path/to/your/machinefile/machinefile.txt

PHONEHOME_SEND_USAGE_DATA

This directive guides the installer in the user's intent for the optional Intel Software Improvement Program. This setting determines whether or not the compiler periodically sends customer usage information back to Intel. The intent is for Intel to gather information on what compiler options are being used, amongst other information. More information on the Intel Software Improvement Program can be found here: https://software.intel.com/en-us/articles/software-improvement-program.
PHONEHOME_SEND_USAGE_DATA=no directs the installer to configure the compiler to not send usage data back to the Intel Software Improvement Program.
PHONEHOME_SEND_USAGE_DATA=yes directs the installer to configure the compiler to send usage data back to the Intel Software Improvement Program. Setting this to YES is your consent to opt-into this program.

SIGNING_ENABLED

Directs the installer whether or not to check RPM digital signatures. Checking signatures is recommended. It allows the installer to find data corruption from such things as incomplete downloads of compiler packages or damaged RPMs.
SIGNING_ENABLED=yes directs the installer to check RPM digital signatures.
SIGNING_ENABLED=no directs the installer to skip the checking of RPM digital signatures.

Silent Install Steps: "Copy and Repeat" Method for Silent Configuration File Creation

If you need to make the same sort of installation over and over again, one way to get the silent installation configuration file right the first time is to run the installation program once interactively, using the options that meet the local needs, and record these options into a configuration file that can be used to replicate this same install via silent install for future installations.

To do this, the user simply needs to add the "duplicate" option to the script invocation, and run a normal interactive install, as follows:

prompt> ./install.sh --duplicate /tmp/silent.cfg

This "dash dash duplicate" option will put the choices made by you into the file specified on the command line. You can modify this recorded configuration file as appropriate and use it to perform future silent installations.

RPM Command Line Installations

The files associated the Linux Compiler Professional products are stored in "RPM" files. RPMs (short for Red Hat Package Manager). They are grouped according to certain file type guidelines. Each major product component will consist of one more or of these RPMs. For non-RPM systems and for users who choose to install the product without using the RPM database of their target systems, an "underneath the hood" utility is embedded inside the installation program tools to extract the contents of the RPM files.

RPM Embedded Installation Functionality

Starting with the version 11.1 packages, the Linux Compiler Professional packaging includes RPM files that also contain embedded installation functionality. This means that key install behaviors such as environment script updating and symbolic link creation, which used to be only in the install program itself, are now embedded in the RPM files. As a result, the experienced user can make use of the RPM files directly in order to install and remove Intel Composer XE 2011 for Linux and intel Compiler Professional 11.1 for Linux products.

Warning: this is truly for the experienced, Linux system savvy user. Most RPM command capabilities require root privileges. Improper use of rpm commands can corrupt and destroy a working system.

The changes done for the Linux compiler products are intended to ease the job of deploying in enterprise deployments, including cluster environments.

Product Layout for Composer XE 2015

Here is an example (for C++ package 2011.2.137)

Top directory contents of l_ccompxe_2011.2.137 package:

cd_eject.sh - CD eject script used by install.sh
install.sh - install script
install_GUI.sh - GUI front-end to the installer using X11. Only used for interactive, graphical installation method.
license - end user license agreement
support.txt - package version and contents information
pset - installation and licensing content directory used by the Intel installers
rpm - directory containing all product content in RPM file format, plus the EULA and LPGL license

This is an example Composer XE 2015 rpm directory. This directory listing is for the initial Composer XE 2015 C++ release, your version strings will vary by compiler versions: https://software.intel.com/en-us/articles/intel-compiler-and-composer-update-version-numbers-to-compiler-version-number-mapping NOTE: this is not intended to be a comprehensive list for every compiler. RPMs vary by compiler edition, components, and may vary by release. Please list your 'rpm' directory for a list specific to your compiler. The following is intended as a representative list:

EULA.txt                                             intel-ipp-sc-090-8.2-0.i486.rpm
intel-ccompxe-090-15.0-0.noarch.rpm                  intel-ipp-sc-090-8.2-0.x86_64.rpm
intel-compilerproc-090-15.0-0.i486.rpm               intel-ipp-sc-common-090-8.2-0.noarch.rpm
intel-compilerproc-090-15.0-0.x86_64.rpm             intel-ipp-st-090-8.2-0.i486.rpm
intel-compilerproc-common-090-15.0-0.noarch.rpm      intel-ipp-st-090-8.2-0.x86_64.rpm
intel-compilerproc-devel-090-15.0-0.i486.rpm         intel-ipp-st-devel-090-8.2-0.i486.rpm
intel-compilerproc-devel-090-15.0-0.x86_64.rpm       intel-ipp-st-devel-090-8.2-0.x86_64.rpm
intel-compilerpro-common-090-15.0-0.noarch.rpm       intel-ipp-st-devel-common-090-8.2-0.noarch.rpm
intel-compilerpro-common-pset-090-15.0-0.noarch.rpm  intel-ipp-vc-090-8.2-0.i486.rpm
intel-compilerproc-vars-090-15.0-0.noarch.rpm        intel-ipp-vc-090-8.2-0.x86_64.rpm
intel-compilerpro-devel-090-15.0-0.i486.rpm          intel-ipp-vc-common-090-8.2-0.noarch.rpm
intel-compilerpro-devel-090-15.0-0.x86_64.rpm        intel-mkl-090-11.2-0.i486.rpm
intel-compilerpro-vars-090-15.0-0.noarch.rpm         intel-mkl-090-11.2-0.x86_64.rpm
intel-gdb-090-7.7-0.i486.rpm                         intel-mkl-cluster-090-11.2-0.i486.rpm
intel-gdb-090-7.7-0.x86_64.rpm                       intel-mkl-cluster-090-11.2-0.x86_64.rpm
intel-gdb-cdt-090-7.7-0.x86_64.rpm                   intel-mkl-cluster-common-090-11.2-0.noarch.rpm
intel-gdb-cdt-source-090-7.7-0.x86_64.rpm            intel-mkl-cluster-devel-090-11.2-0.i486.rpm
intel-gdb-common-090-7.7-0.noarch.rpm                intel-mkl-cluster-devel-090-11.2-0.x86_64.rpm
intel-gdb-mic-090-7.7-0.x86_64.rpm                   intel-mkl-common-090-11.2-0.noarch.rpm
intel-gdb-mpm-090-7.7-0.x86_64.rpm                   intel-mkl-devel-090-11.2-0.i486.rpm
intel-gdb-python-source-090-7.7-0.noarch.rpm         intel-mkl-devel-090-11.2-0.x86_64.rpm
intel-gdb-source-090-7.7-0.noarch.rpm                intel-mkl-f95-common-090-11.2-0.noarch.rpm
intel-gdb-toplevel-090-7.7-0.noarch.rpm              intel-mkl-f95-devel-090-11.2-0.i486.rpm
intel-ipp-ac-090-8.2-0.i486.rpm                      intel-mkl-f95-devel-090-11.2-0.x86_64.rpm
intel-ipp-ac-090-8.2-0.x86_64.rpm                    intel-mkl-gnu-090-11.2-0.i486.rpm
intel-ipp-ac-common-090-8.2-0.noarch.rpm             intel-mkl-gnu-090-11.2-0.x86_64.rpm
intel-ipp-common-090-8.2-0.noarch.rpm                intel-mkl-gnu-devel-090-11.2-0.i486.rpm
intel-ipp-di-090-8.2-0.i486.rpm                      intel-mkl-gnu-devel-090-11.2-0.x86_64.rpm
intel-ipp-di-090-8.2-0.x86_64.rpm                    intel-mkl-mic-090-11.2-0.x86_64.rpm
intel-ipp-di-common-090-8.2-0.noarch.rpm             intel-mkl-mic-devel-090-11.2-0.x86_64.rpm
intel-ipp-gen-090-8.2-0.i486.rpm                     intel-mkl-pgi-090-11.2-0.i486.rpm
intel-ipp-gen-090-8.2-0.x86_64.rpm                   intel-mkl-pgi-090-11.2-0.x86_64.rpm
intel-ipp-gen-common-090-8.2-0.noarch.rpm            intel-mkl-pgi-devel-090-11.2-0.i486.rpm
intel-ipp-jp-090-8.2-0.i486.rpm                      intel-mkl-pgi-devel-090-11.2-0.x86_64.rpm
intel-ipp-jp-090-8.2-0.x86_64.rpm                    intel-mkl-sp2dp-090-11.2-0.x86_64.rpm
intel-ipp-jp-common-090-8.2-0.noarch.rpm             intel-mkl-sp2dp-devel-090-11.2-0.x86_64.rpm
intel-ipp-mt-090-8.2-0.i486.rpm                      intel-openmp-090-15.0-0.i486.rpm
intel-ipp-mt-090-8.2-0.x86_64.rpm                    intel-openmp-090-15.0-0.x86_64.rpm
intel-ipp-mt-devel-090-8.2-0.i486.rpm                intel-openmp-devel-090-15.0-0.i486.rpm
intel-ipp-mt-devel-090-8.2-0.x86_64.rpm              intel-openmp-devel-090-15.0-0.x86_64.rpm
intel-ipp-mx-090-8.2-0.i486.rpm                      intel-sourcechecker-common-090-15.0-0.noarch.rpm
intel-ipp-mx-090-8.2-0.x86_64.rpm                    intel-sourcechecker-devel-090-15.0-0.i486.rpm
intel-ipp-mx-common-090-8.2-0.noarch.rpm             intel-sourcechecker-devel-090-15.0-0.x86_64.rpm
intel-ipp-rr-090-8.2-0.i486.rpm                      intel-tbb-090-4.3-0.noarch.rpm
intel-ipp-rr-090-8.2-0.x86_64.rpm                    intel-tbb-devel-090-4.3-0.noarch.rpm
intel-ipp-rr-common-090-8.2-0.noarch.rpm

Installing Compilers With the RPM Command Line

To install a Linux compiler solution set via RPM command line, you should first ensure that a working license file or other licensing method (such as floating or network-served licenses) is already in place. There is no license checking performed during RPM installation. However, if you install without a license file you will get an 'cannot check out license' error when you try to use the compiler.

You are assumed to have complied with the End User License Agreement (EULA) if you are performing an RPM command line installation. The EULA is present in the parent installation directory ( license or license.txt file). Please read this license agreement. It is assumed you agree to this license agreement if you proceed with an rpm installation.

Once a license file or license method is in place, the user can install the products directly with these simple steps:

Login as root or 'su' to root
Composer XE 2015: 'cd' to the package/rpm directory ( e.g. /tmp/l_ccompxe_2015.0.090/rpm )
Run the RPM install command
- rpm -i *.rpm

This completes without error in most cases. If some system-level prerequisites, for required system libraries for example, are not met by the target operating system, a dependency warning may be returned by the rpm install. There are no embedded detailed dependency checks inside the RPM install capabilities for required commands such as g++ or for optional requirements such as a valid supported operating system or supported JRE. The embedded requirements are kept simple to ease installation for the general case, with an exception. The exception is the requirement for a /usr/lib/libstdc++.so.6 library to exist on the target system, and must match in 64bit or 32bit (there will be 2 copies of this library, one 64bit and one 32bit in 2 separate /lib paths, if you wish to be able to compile in 64bits and 32bits).

The second requirement is that the target operating system have at least the 3.0 version of "lsb" component installed. Availability of this LSB component will, in the vast majority of cases, also ensure that other necessary system level libraries are available. See LSB Support below for more information on getting the 'lsb' capability onto a target system.

If you believe that you have effectively installed the correct requirements on the target system and the dependency failures still persist, there is a fallback option, the "--nodeps" (dash dash nodeps) rpm switch. Invoking 'rpm -i' with the --nodeps option will allow the rpm installation to succeed in most cases.

prompt> rpm -i --nodeps *.rpm

Again, this will get you past the perceived dependency issues, which may be unique to a particular distribution of Linux and not really a problem for the resulting installation. But there is no assurance of complete success other than testing the resulting installation.

Other Special RPM Install Cases

If you are installing RPMs using the rpm command line, but using a multi-architecture package (such as the "combo" IA-32 / Intel64 package or a DVD package), you may want to install all of the RPMs that match their specific target machine's architecture. Or, if you are installing onto an Intel64 system and want to include both the IA-32 and Intel64 components, you may want both of these included. Here are some example rpm command line invocations:

prompt> rpm -i *.noarch.rpm *.486.rpm
- installs all components needed for operation on IA-32 architecture
prompt> rpm -i *.noarch.rpm *.i486.rpm *.x86_64.rpm
- installs all components needed for operation on both IA-32 and Intel 64 architecture

Certain Linux distributions do not like the idea of two RPM files having the same base name. For example, the rpm versions of certain distros might complain that there is more than one instance of intel-cproc023-11.1-1 on the command line when installing both the IA-32 and Intel64 RPMs onto the same machine. For these distros, use the "--force" ( dash dash force ) command line switch:

prompt> rpm -i --force *.noarch.rpm *.i486.rpm *.x86_64.rpm

Customizing the RPM Command Line

The rpm command has a long list of available options, including hooks to install from FTP and HTTP RPM repositories, features to examine contents of installed RPM-based programs and uninstalled RPM package files, etc. Most of these are beyond the scope of this document. See the Links section for references to external documentation on RPM. Here are a couple of additional RPM switches, however, which may be routinely useful.

prompt> rpm -i --prefix /my_NFS_dir/intel_compiler_directory/this_version *.rpm
- instructs rpm to use directory /my_NFS_dir/intel_compiler_directory/this_version as the root installation directory
prompt> rpm -i --replacefiles *.rpm
- directs rpm to replace any existing files using the new rpm files
prompt> rpm -i --replacepkgs *.rpm
- directs rpm to replace any existing package on the system using the new RPM files, even if they are already installed ... this may be useful in test applications where newer versions of a package with the same name are being tested

Uninstallation Using RPM

Since the installation of Intel Linux compiler packages includes in its deliver all of the uninstall scripts, the easiest way to perform a product uninstall is to simply run the uninstall script that is created by the install process. If you have a need to automate rpm-based uninstalls, however, a couple of "tricks" can be employed to make this simpler. These should be used with caution, as with any system command performed from a privileged account.

Here is an example command line that will remove all RPM packages from a Linux Composer XE 2015 package number "090":

rpm -e --allmatches `rpm -qa | grep intel- | grep 090`
- note use of back-quotes
- note that this only removes compiler packages. You may wish to use a similar method to remove intel-mkl, intel-ipp, intel-gdb, intel-openmp and other intel packages

Some Linux distributions will also complain about "multiple matches" during the uninstall process. In this case, the "--allmatches" switch mentioned above can also be employed here.

A Short Word on Updates

The rpm structure and command set support the application of updates or "patches" to existing installations. For example a util-1.1-2.rpm package may be issued that adds fixed content to some pre-existing util-1.1-1.rpm. The existing release process for Linux Composer XE includes support for "version co-existence" or multiple installs of separate product versions. So each new iteration of the product is unique from the previous version. This means that Intel compiler packages are not available in "patch" form. All product releases are stand-alone versions. So use of the 'rpm -U' upgrade capability is not supported by our product delivery model at this time.

LSB Support

LSB, or Linux Standard Base, is an effort sponsored by the Linux Foundation (http://www.linuxfoundation.org) to improve the interoperability of Linux operating systems and application software. Intel is a major participant in Linux Foundation activities and has embraced LSB as a viable means of improving our products and our customers' use of those products. To that end, we have included establishing LSB compliance as a part of our goals for our products and software packages in the future.

For the purposes of the Intel Composer XE 2015 for Linux our primary objective is to product packages that adhere to LSB packaging requirements. Most of the RPM changes mentioned above were done for this purpose. To be specific, however, we should draw a distinction between product compliance and package compliance. Because our compiler products must support a vast array of legacy constructs, the applications themselves may or may not be "certifiable" within the LSB guidelines, but our packages, i.e. our RPMs and install programs should be. This is the primary reason for inclusion of the "lsb >= 3.0" embedded requirements being added to our RPMs.

Some of these Linux distributions come with LSB support already included in the operating system by default (e.g. SLES11). For others, an external or optional package must be installed. If supporting an environment that is using RPM command line installation and want to enable that site / system / systems to be able to install without using the dreaded "--nodeps" option, the best best is to acquire and install the companion LSB solution for that operating system.

The Linux Foundation website contains links to download resources for LSB, as to many of the vendor-specific support sites. Check out these sites for information on adding LSB support to an existing operating system.

For RPM-based systems, a user can check on the status of LSB for their system, using a command like this:

prompt> rpm -q --provides lsb

This will tell if an 'lsb' RPM package is already installed and, if so, what version.

For our non-RPM supported operating systems, Ubuntu and Debian, the privileged user can use the Debian 'apt-get' facility to easily install the latest version of LSB supported by the specific distribution:

prompt> apt-get install lsb

Redistribution Package Installations

Redistribution packages allow applicaitons built with the Intel compilers to be run on client systems that do not have the Intel compilers installed (i.e. end-user systems). These are ONLY needed on systems without the Intel compilers installed. A redistribution package has all the Intel dynamic libraries possibly needed for a dynamically linked application. Alternatively, you can explore to -static-intel compiler switch to statically link all required Intel libraries into an application. Redistribution packages were officially supported with the 11.0 release and beyond.

Installation is simple. Once you extract the contents of the downloaded tarball (or accessing the redist contents of a DVD/image directory or media), you should simply invoke the "install.sh" script provided. The user is instructed to accept a EULA, but there is no run-time license enforcement or any other software licensing included in the redist packages. An uninstall script is produced during the redist install process, which provided for removal of the contents.

A note of caution: if the redist packages are installed on top of an existing Compiler package of the same release, it will land on and replace existing files in that compiler installation by default. Similarly, if the redist uninstall operation is run and the redist and compiler packages are sharing the same directory space, removing the redist package will break the compiler installation. Since the redist packages are not intended for use by compiler users on their development machines, this should not be an issue in most environments. But it is mentioned here in case situations come up where this might explain problems that have occurred.

Uninstall Instructions

As mentioned above, a standard uninstall script is included with each product installation, regardless of whether the install was performed using menu installs, RPM command line installs, or "silent" installs. In all cases, using the provided uninstall script should work and is the usual preferred method of removing installed product.. There is one uninstall feature, however, that is undocumented and can be used to make life a little easier. Here's an example invocation of that feature:

prompt> /opt/intel/composer_xe_2015.<update>.<build>/bin/uninstall.sh --default

This "--default" ( dash dash default ) option tells the uninstall script to use the "remove all" option and remove any compiler components associated with the specific package (in this case all components, including C/C++, Fortran, IDB, MKL, TBB, and IPP, if installed). There is no uninstall program interaction when this switch is used.

Note

As noted in the Intel® Software Development Product End User License Agreement, the Intel® Software Development Product you install will send Intel the product’s serial number and other system information to help Intel improve the product and validate license compliance. No personal information will be transmitted.

Links of Interest

The following links are provided for reference information.

Maximum RPM - http://www.rpm.org/max-rpm/

Excellent on-line resource for understanding RPMs and their usage.

Linux Foundation - http://www.linuxfoundation.org

Navigation:

Linux and Mac OS X Compilers Installation Help: /en-us/articles/intel-compilers-linux-installation-help
Licensing FAQS https://software.intel.com/en-us/articles/licensing-faq
Question: User Forums https://software.intel.com/en-us/forum

Strumenti per i cluster

Compilatori

Strumenti di sviluppo

Area tema:

IDZone

↧

Using the Develop Tab

September 17, 2014, 12:35 pm

Latest and popular articles on Intel Technologies

≫ Next: Parallel Programming with C#

≪ Previous: Intel Composer XE 2015 Silent Installation Guide

Code Editor and GUI Designer Tools

The Develop tab provides two views: a Code editor view and a GUI Design view. The Code view shows the files in your project directory, available web services, the code editor window, and a Live Development pane. If you created your app using either App Starter or App Designer, you can access these GUI layout editors in the Design view. To switch between these views when editing an HTML file, use the [ CODE | DESIGN ] buttons.

The Brackets code editor and App Designer (and App Starter) GUI design tools are all optional tools. You are NOT required to use them to build an Intel XDK hybrid HTML5 mobile web app. You are welcome to use your favorite code editor and/or favorite user interface layout tools. You can also implement your app's UI layout manually. The "Live Layout" feature does require the use of the Brackets editor, but no other features of the Intel XDK are directly dependent on these tools. Thus, if you have an existing web app that you are translating into a hybrid mobile web app, you can simply import that layout and code into a project and continue to work directly on the source, you do not need to "shoehorn" an existing app into App Designer.

A Note about App Designer and App Starter

If you created your app using either App Starter or App Designer (e.g., using the “(+) Start a New Project” button at the bottom of the Projects tab), you can use these GUI layout editors on the Develop tab's Design view.

Use App Starter to build a UI using the App Framework mobile-optimized UI library– or, use App Starter to learn how to build App Framework applications by hand (by reviewing the code that App Starter creates).
With App Designer you can build a UI based on a responsive grid system and one of several UI widget libraries, including the App Framework UI library.

App Designer utilizes a media query grid system for creating responsive web UI layouts. This media query grid system enables your app to resize and adapt to portrait and landscape views on phones, tablets and even Ultrabook^TM devices. To get started, see the App Designer Documentation and Tutorial page.

When you open an HTML file in the Develop tab, if that project was created using either App Designer or App Starter, use the [ CODE | DESIGN ] buttons above the file project tree to switch between the Code and the Design (GUI) views.

Don’t forget to check out the App Framework UI Components documentation page and the App Framework CSS Style Builder for more information about the App Framework UI library, which has been optimized for use with HTML5 hybrid mobile apps.

Code Editor Capabilities

You can edit project files with the built-in Brackets* code editor or with your favorite code editor, alongside the Intel® XDK. The Intel XDK tools automatically detect when project files are changed (as the result of a save when using your external editor) and will prompt you if additional actions are required due to changes to project files.

If you are unfamiliar with the Brackets HTML5 code editor built into the Develop tab, read Using the Editor in the Intel® XDK Develop Tab.

NOTE: The built-in Brackets editor includes a curated list of Brackets extensions. From the code editor menu, choose File > Extension Manager… to see the list of editor extensions that are available. There is no mechanism available to include your own custom Brackets extensions.

Web Service Capabilities

In the Code view below the file tree, the Intel XDK lets you explore a collection of third-party web service APIs (cloud services). In addition to the built-in third-party web services, the Develop tab helps you integrate other existing web services for use within the Intel XDK, such as those developed specifically for your app. For more information, see Exploring and Integrating Web Services in the Intel XDK.

Live Development Capabilities

The Live Development Tasks pane appears on the right side of the Code view in the Develop tab. This pane makes the process of previewing your project's code in a browser or device quick and efficient. The following Live Development Tasks pane shows expanded Run My App, Live Layout Editing, and Connected Device areas.

	Run My App runs your app either on USB-connected mobile Android* device(s) or on virtual devices in the Intel XDK device emulator. Changes appear after you save project files and reload/restart your app.
	Live Layout Editing lets you view your app on WiFi-connected Android and/or Apple iOS* device(s) or in a browser window. Changes appear immediately after you make edits using the built-in Intel XDK editor, or after you save project files using an external editor.
	Connected Devices shows the devices connected by USB cable or WiFi to your development system.

For information about using Live Development, see Using Live Development in the Intel(R) XDK.

Resources

For an overview of the Intel XDK and its related software, access the Intel® Developer Zone at: http://xdk.intel.com
For a quick summary of the Intel XDK product and links to related documentation, see the Intel® XDK Overview and Documentation Page
For information about using Live Development capabilities, see Using Live Development in the Intel XDK.
For information about using the built-in editor, see Using the Editor in the Intel XDK Develop Tab.
For information about exploring and integrating web services, see Exploring and Integrating Web Services in the Intel XDK.
For a short tutorial about using the Intel XDK development environment, see the Tutorial: Get Started with the Intel® XDK.

Legal Information - *Other names and brands may be claimed as the property of others.
Visit Support Forums - Submit feedback on this page

↧

Parallel Programming with C#

September 17, 2014, 1:53 pm

Latest and popular articles on Intel Technologies

≫ Next: NWChem* for the Intel® Xeon Phi™ Coprocessor

≪ Previous: Using the Develop Tab

By Bruno Sonnino

Multicore processors have been around for many years, and today, they can be found in most devices. However, many developers are doing what they’ve always done: creating single-threaded programs. They’re not taking advantage of all the extra processing power. Imagine you have many tasks to perform and many people to perform them, but you are using only one person because you don’t know how to ask for more. It’s inefficient. Users are paying for extra power, but their software is not allowing them to use it.

Multiple-thread processing isn’t new for seasoned C# developers, but it hasn’t always been easy to develop programs that use all the processor power. This article shows the evolution of parallel programming in C# and explains how to use the new Async paradigm, introduced in C# version 5.0.

What Is Parallel Programming?

Before talking about parallel programming, let me explain two concepts closely related to it: synchronous and asynchronous execution modes. These modes are important for improving the performance of your apps. When you execute a program synchronously, the program runs all tasks in sequence, as shown in Figure 1. You fire the execution of each task, and then wait until it finishes before firing the next one.

Figure 1. Synchronous execution

When executing asynchronously, the program doesn’t run all tasks in sequence: it fires the tasks, and then waits for their end, as shown in Figure 2.

Figure 2. Asynchronous execution

If asynchronous execution takes less total time to finish than synchronous execution, why would anybody choose synchronous execution? Well, as Figure 1 shows, every task executes in sequence, so it’s easier to program. That’s the way you’ve been doing it for years. With asynchronous execution, you have some programming challenges:

You must synchronize tasks. Say that in Figure 2 you run a task that must be executed after the other three have finished. You will have to create a mechanism to wait for all tasks to finish before launching the new task.
You must address concurrency issues. If you have a shared resource, like a list that is written in one task and read in another, make sure that it’s kept in a known state.
The program logic is completely scrambled. There is no logical sequence anymore. The tasks can end at any time, and you don’t have control of which one finishes first.

In contrast, synchronous programming has some disadvantages:

It takes longer to finish.
It may stop the user interface (UI) thread. Typically, these programs have only one UI thread, and when you use it as a blocking operation, you get the spinning wheel (and “not responding” in the caption title) in your program—not the best experience for your users.
It doesn’t use the multicore architecture of the new processors. Regardless of whether your program is running on a 1-core or a 64-core processor, it will run as quickly (or slowly) on both.

Asynchronous programming eliminates these disadvantages: it won’t hang the UI thread (because it can run as a background task), and it can use all the cores in your machine and make better use of machine resources. So, do you choose easier programming or better use of resources? Fortunately, you don’t have to make this decision. Microsoft has created several ways to minimize the difficulties of programming for asynchronous execution.

Asynchronous Programming models in Microsoft .NET

Asynchronous programming isn’t new in Microsoft .NET: it has been there since the first version, in 2001. Since then, it has evolved, making it easier for developers to use this paradigm. The Asynchronous Programming Model (APM) is the oldest model in .NET and has been available since version 1.0. Because it’s complicated to implement, however, Microsoft introduced a new model in .NET 2.0: the Event-Based Asynchronous Pattern (EAP). I don’t discuss these models, but check out the links in “For More Information” if you’re interested. EAP simplified things, but it wasn’t enough. So in .NET 4.0, Microsoft implemented a new model: the Task Parallel Library (TPL).

The Task Parallel Library

The TPL is a huge improvement over the previous models. It simplifies parallel processing and makes better use of system resources. If you need to use parallel processing in your programs, TPL is the way to go.

For the sake of comparison, I’ll create a synchronous program that calculates the prime numbers between 2 and 10,000,000. The program shows how many prime numbers it can find and the time required to do so:

GitHub - synchronous program code sample

This is not the best algorithm for finding prime numbers, but it can show the differences between approaches. On my machine (which has an Intel® Core™ i7 3.4 GHz processor), this program executes in about 3 seconds. I use the Intel® VTune™ Amplifier to analyze the program. This is a paid program, but a 30‑day trial version is available (see “For More Information” for a link).

I run the Basic Hotspots analysis in the synchronous version of the program and get the results in Figure 3.

Figure 3. VTune™ analysis for the synchronous version of the Prime Numbers program

Here, you can see that the program took 3.369 seconds to execute, most of which was spent in IsPrimeNumber (3.127 s), and it uses only one CPU. The program does not make good use of the resources.

The TPL introduces the concept of a task, which represents an asynchronous operation. With the TPL, you can create tasks implicitly or explicitly. To create a task implicitly, you can use the Parallel class—a static class that has the For, ForEach, and Invoke methods. For and ForEach allow loops to run in parallel; Invoke allows you to queue several actions in parallel.

This class makes it easy to convert the synchronous version of my program into a parallel one:

GitHub – Parallel program code sample

The processing is broken into 10 parts, and I’ve used Parallel.For to execute each part. At the end of the processing, the counts of the lists are summed and shown. This code is similar to the synchronous version. I analyze it with the VTune Amplifier and get the results in Figure 4.

Figure 4. VTune™ analysis for the parallel version of the Prime Numbers program

The program executes in 1 second, and all eight processors in my machine are used. I have the best of both worlds: efficient usage of resources and ease of use.

You could also create the tasks explicitly and use them in the program, with the Task class. You can create a new Task and use the Start method to start it or use the more streamlined methods Task.Run and Task.Factory.StartNew, which create and start a task, respectively. You can create the same parallel program by using the Task class with a program like this one:

GitHub – Task program code sample

Task.WaitAll waits for all tasks to finish; only then does it continue the execution. If you analyze the program with the VTune Amplifier, you get a result similar to the parallel version.

Parallel Linq

Parallel Linq (PLINQ) is a parallel implementation for the LINQ query language. With PLINQ, you can transform your LINQ queries into parallel versions simply by using the AsParallel extension method. For example, a simple modification to the synchronous version improves the performance a great deal:

GitHub – PLINQ program code sample

Adding AsParallel to Enumerable.Range changes the sequential version to a parallel version of the query. If you run this version, you see a great improvement in the VTune Amplifier (Figure 5).

Figure 5. VTune™ analysis for the PLINQ version

With this simple change, the program runs in 1 second and uses all eight processors. However, there is a catch: the position of AsParallel interferes with the parallelism of the operation. If you change the line to:

return Enumerable.Range(minimum, count).Where(IsPrimeNumber).AsParallel().ToList();

. . . you won’t see an improvement because the IsPrimeNumber method, which takes most of the processing time, won’t be executed in parallel.

Async Programming

C# version 5 introduced two new keywords: async and await. Although it doesn’t seem like a lot, the addition is a huge improvement. These keywords are central to asynchronous processing in C#. When you use parallel processing, sometimes you need to twist the execution sequence completely. Async processing restores the sanity of your code.

When you use the async keyword, you can write code the same way you wrote synchronous code. The compiler takes care of all the complexity and frees you to do what you do best: writing the logic.

To write an async method, follow these guidelines:

The method signature must have the async keyword.
By convention, the method name should end with Async (this is not enforced, but it is a best practice).
The method should return Task, Task<T>, or void.

To use this method, you should wait for the result (i.e., use the await method). Following these guidelines, when the compiler finds an awaitable method, it starts to execute it and will continue the execution of other tasks. When the method is complete, the execution returns to its caller. The program to calculate the prime numbers with async becomes:

GitHub – Async program code sample

Notice that I have created a new method: ProcessPrimesAsync. When you use await in a method, it must be marked as async, and Main cannot be marked as async. That’s why I created this method, which returns void. When Main executes the method, without the await keyword, it starts it but doesn’t wait for finish. For that reason, I’ve added Console.ReadLine (or the program would end before the execution). The rest of the program is similar to the synchronous version.

Notice also that the primes variable is not a Task<List<int>> but a List<int>. This is a compiler trick so that I don’t have to deal with Task to call an async method. With the await keyword, the compiler calls the method, frees resources until the method is complete, and—when the method returns—will transform the Task result into a normal result. When you call return in the method, you should not return Task<T> but the normal return value, as you would do in a synchronous method.

If you run this program, you will see that it doesn’t run faster than the synchronous version because it has just one task. To make it run faster, you must create multiple tasks and synchronize them. You can do that with this change:

GitHub – Parallel async program code sample

With this new method, the program creates 10 tasks but doesn’t wait for them. It awaits them in this line:

var results = await Task.WhenAll(primes);

The results variable is an array of List<int>: there are no tasks anymore. When I run this version in the VTune Amplifier, it shows that all tasks run in parallel (see Figure 6).

Figure 6. VTune™ analysis for the parallel Async version of the Prime Numbers program

The async and await keywords add a new twist to asynchronous processing in C#, but this change involves a lot more than what I’ve shown in this article. There are also cancellation of tasks, exception handling, and task coordination.

Conclusions

There are many ways to create a parallel executing program in C#. With multicore processors, there’s no excuse for creating single-threaded programs: you won’t be using the system resources, and you’ll penalize your users with unneeded delays.

The improvements in the C# language with the async keyword restore sequential ordering in the code while efficiently using system resources. There are still a few issues to keep in mind, like concurrency, task synchronization, and cancellation, but these are minor compared with what you needed to create a good parallel program. If you learn and apply the techniques I describe here and start to create parallel programs, you will make better use of system resources—and have happier users.

For More Information

GitHub code samples repository : https://github.com/bsonnino/ParallelProgramming
Intel VTune Amplifier XE 2015: https://software.intel.com/en-us/intel-vtune-amplifier-xe
APM: http://msdn.microsoft.com/en-us/library/ms228963(v=vs.110).aspx
EAP: http://msdn.microsoft.com/en-us/library/wewwczdw(v=vs.110).aspx
BackgroundWorker class: http://msdn.microsoft.com/en-us/library/system.componentmodel.backgroundworker(v=vs.110).aspx

About the Author

Bruno Sonnino is a Microsoft Most Valuable Professional (MVP) located in Brazil. He is a developer, consultant, and author having written five Delphi books, published in Portuguese by Pearson Education Brazil, and many articles for Brazilian and American magazines and websites.

Notices

UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL INJURY OR DEATH MAY OCCUR.

Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information.

The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Any software source code reprinted in this document is furnished under a software license and may only be used or copied in accordance with the terms of that license.

Intel, the Intel logo, and Core are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.

Sviluppatori

Studenti

Intel® Xeon Phi™ Coprocessor

Processori Intel® Atom™

Processori Intel® Core™

Elaborazione parallela

Area tema:

IDZone

↧

NWChem* for the Intel® Xeon Phi™ Coprocessor

September 17, 2014, 2:55 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel® Compilers for Linux*: Application Porting Guide

≪ Previous: Parallel Programming with C#

Purpose

This code recipe describes how to get, build, and use the NWChem* code that includes support for the Intel® Xeon Phi™ Coprocessor with Intel® Many-Integrated Core (MIC) architecture.

Introduction

NWChem provides scalable computational chemistry tools. NWChem codes treat large scientific computational chemistry problems efficiently, and they can take advantage of parallel computing resources, from high-performance parallel supercomputers to conventional workstation clusters.

NWChem software handles

Biomolecules, nanostructures, and solid-state
From quantum to classical, and all combinations
Ground and excited-states
Gaussian basis functions or plane-waves
Wide scalability, from one to thousands of processors
Properties and relativistic effects

NWChem is actively developed by a consortium of developers and maintained by the Environmental Molecular Sciences Laboratory (EMSL) located at the Pacific Northwest National Laboratory (PNNL) in Washington State. The code is distributed as open-source under the terms of the Educational Community License version 2.0 (ECL 2.0).

The current version of NWChem can be downloaded from http://www.nwchem-sw.org. Current support for Intel® Xeon Phi™ coprocessors is included in NWChem 6.5 or later. The latest development version, which can be downloaded at https://svn.pnl.gov/svn/nwchem/trunk and might contain additional NWChem modules with support for the Xeon Phi coprocessor. Please check the release notes and NWChem documentation for further information.

Code Access

NWChem code supports the Intel® Language Extensions for Offload of operations of the Intel® Xeon® processor (referred to as ‘host’ in this document) with the Intel Xeon Phi coprocessor (referred to as ‘coprocessor’ in this document) in a single node and in a cluster environment.

To download NWChem, please go to http://www.nwchem-sw.org/index.php/Download and download the latest version. It is advisable to download the source code version, so that you can configure NWChem for your system as desired.

Build Directions

The build of NWChem with offload support for Intel Xeon Phi coprocessors is split into three steps.

Configure NWChem for your system.
Enable offload support.
Build NWChem.

Configure

Set the following configuration options (the following are in bash syntax):

export ARMCI_NETWORK=OPENIB
export ARMCI_DEFAULT_SHMMAX_UBOUND=65536
export USE_MPI=y
export NWCHEM_MODULES=all\ python
export USE_MPIF=y
export USE_MPIF4=y
export MPI_HOME=$I_MPI_HOME/intel64
export MPI_INCLUDE="$MPI_HOME"/include
export MPI_LIB="$MPI_HOME"/lib
export LIBMPI="-lmpi -lmpigf -lmpigi -lrt -lpthread"
export MKLROOT=/msc/apps/compilers/intel/14.0/composer_xe_2013_sp1.1.106/mkl/
export SCALAPACK_LIB=" -mkl -openmp -lmkl_scalapack_ilp64 -lmkl_blacs_intelmpi_ilp64 -lpthread -lm"
export SCALAPACK="$SCALAPACK_LIB"
export LAPACK_LIB="-mkl -openmp  -lpthread -lm"
export BLAS_LIB="$LAPACK_LIB"
export BLASOPT="$LAPACK_LIB"
export USE_SCALAPACK=y
export SCALAPACK_SIZE=8
export BLAS_SIZE=8
export LAPACK_SIZE=8
export PYTHONHOME=/usr
export PYTHONVERSION=2.6
export PYTHONLIBTYPE=so
export USE_PYTHON64=y
export USE_CPPRESERVE=y
export USE_NOFSCHECK=y

Enable Offload Support

Set the following environment variables to enable offload support:

export USE_OPENMP=1
export USE_OFFLOAD=1

Build

To build NWChem, issue the following commands:

cd $NWCHEM_TOP/src
make FC=ifort CC=icc AR=xiar

This will build NWChem with support for Intel Xeon Phi coprocessors for the CCSD(T) method (as of 12 August 2014). Coprocessor support for more NWChem methods will follow in the future.

If you are running a cluster based on Intel® True Scale Fabric, please check the NWChem documentation for the correct configuration settings to use.

Running Workloads Using NWChem CCSD(T) Method

To run the CCSD(T) method you will need to use a proper NWChem input file that triggers this module. You can find an example input file in the Appendix of this document. Other input files that use the CCSD(T) method can be found on the NWChem website at http://www.nwchem-sw.org.

To run the code only on hosts in the traditional mode using plain Global Arrays (GA), run the following command:

$ OMP_NUM_THREADS=1 mpirun –np 768 –perhost 16 nwchem input.nw

This command will execute NWChem using a file called “input.nw” with 768 GA ranks and 16 processes per node (a total of 48 machines).

To enable OpenMP* threading on the host and use fewer total GA ranks run the following command:

$ OMP_NUM_THREADS=2 mpirun –np 384 –perhost 8 nwchem input.nw

This directs NWChem to use eight GA ranks per node and launches two threads for each process on the node. Because it uses less GA ranks, less communication takes place; thus, you should observe a speed-up compared to the plain method above.

Our next step is to enable offloading to the Intel Xeon Phi coprocessor, by executing this command:

$ NWC_RANKS_PER_DEVICE=2 OMP_NUM_THREADS=4 mpirun –np 384 –perhost 8 nwchem input.nw

The NWC_RANKS_PER_DEVICE environment variable enables offloading, if it is set to an integer larger than 0. It also controls how many GA ranks from the host will offload to each of the compute node’s coprocessors

In the example, we assume that the node contains two coprocessors, and NWChem should allocate two GA ranks per coprocessor. Hence, 4 out 8 GA ranks assigned to a particular compute node will offload to the coprocessors. During offload, a host core is idle; thus, we double the number of OpenMP threads for the host (OMP_NUM_THREADS=4 ) in order to fill the idle core with work from another GA rank.

NWChem itself automatically detects the available coprocessors in the system and properly partitions them for optimal use.

For best performance, you should also enable turbo mode on both the host system and the coprocessors, plus set the following environment variable to use large pages on the coprocessor devices:

export MIC_USE_2MB_BUFFER=16K

In all of the above cases, NWChem will produce the output files as requested in the input file.

Once NWChem prints the last lines on the console log, you will find a line that reports the total runtime consumed:

Total times  cpu:           wall:

The reported runtimes will show considerable speedup for the OpenMP threaded version, as well as the offload version. Of course, the exact runtimes will depend on your system configuration. Experiment with the above settings to control OpenMP and offloading in order to find the best possible values for your system.

Performance Testing^1,2

The following chart shows the speedups achieved on NWChem using the configuration listed below. Your performance may be different, depending on configurations of your systems, system optimizations, and NWChem settings described above.

Testing Platform Configurations

Nodes	Intel® Xeon® processor cores	Intel® Xeon Phi™ coprocessor cores	Heterogeneous cores
130	2080	15600	17680
230	3680	27600	31280
360	5760	43200	48960
450	7200	54000	61200

Server Configuration:

Atipa Visione vf442, 2-socket/16 cores, Intel® C600 IOH
Processors: Two Intel® Xeon® processor E5-2670 @ 2.60GHz (8 cores) with Intel® Hyper-Threading Technology³
Operating System: Scientific Linux* 6.5
Memory: 128GB DDR3 @ 1333 MHz
Coprocessors: 2X Intel® Xeon Phi™ Coprocessor 5110P, GDDR5 with 3.6 GT/s, Driver v3.1.2-1, FLASH image/micro OS 2.1.02.390
Intel® Composer XE 14.0.1.106

Appendix: Example Input File

start  example

title example

echo

memory stack   4800 mb heap 200 mb global 4800 mb noverify

geometry units angstrom noprint
symmetry c1
C     -0.7143     6.0940    -0.00
C      0.7143     6.0940    -0.00
C      0.7143    -6.0940     0.00
C     -0.7143    -6.0940     0.00
C      1.4050     4.9240    -0.00
C      1.4050    -4.9240     0.00
C     -1.4050    -4.9240     0.00
C     -1.4050     4.9240     0.00
C      1.4027     2.4587    -0.00
C     -1.4027     2.4587     0.00
C      1.4027    -2.4587    -0.00
C     -1.4027    -2.4587     0.00
C      1.4032    -0.0000    -0.00
C     -1.4032     0.0000     0.00
C      0.7258     1.2217    -0.00
C     -0.7258     1.2217     0.00
C      0.7258    -1.2217     0.00
C     -0.7258    -1.2217     0.00
C      0.7252     3.6642    -0.00
C     -0.7252     3.6642     0.00
C      0.7252    -3.6642     0.00
C     -0.7252    -3.6642     0.00
H     -1.2428     7.0380    -0.00
H      1.2428     7.0380    -0.00
H      1.2428    -7.0380     0.00
H     -1.2428    -7.0380     0.00
H      2.4878     4.9242    -0.00
H     -2.4878     4.9242     0.00
H      2.4878    -4.9242    -0.00
H     -2.4878    -4.9242     0.00
H      2.4862     2.4594    -0.00
H     -2.4862     2.4594     0.00
H      2.4862    -2.4594    -0.00
H     -2.4862    -2.4594     0.00
H      2.4866    -0.0000    -0.00
H     -2.4866     0.0000     0.00
end

basis spherical noprint
H    S
     13.0100000              0.0196850
      1.9620000              0.1379770
      0.4446000              0.4781480
H    S
      0.1220000              1.0000000
H    P
      0.7270000              1.0000000
#BASIS SET: (9s,4p,1d) -> [3s,2p,1d]
C    S
   6665.0000000              0.0006920             -0.0001460
   1000.0000000              0.0053290             -0.0011540
    228.0000000              0.0270770             -0.0057250
     64.7100000              0.1017180             -0.0233120
     21.0600000              0.2747400             -0.0639550
      7.4950000              0.4485640             -0.1499810
      2.7970000              0.2850740             -0.1272620
      0.5215000              0.0152040              0.5445290
C    S
      0.1596000              1.0000000
C    P
      9.4390000              0.0381090
      2.0020000              0.2094800
      0.5456000              0.5085570
C    P
      0.1517000              1.0000000
C    D
      0.5500000              1.0000000
#END
end

scf
#thresh 1.0e-10
#thresh 1.0e-4
#tol2e 1.0e-10
#tol2e 1.0e-8
#noscf
singlet
rhf
vectors input atomic output pent_cpu_768d.movecs
direct
noprint "final vectors analysis" multipole
end

tce
freeze atomic
ccsd(t)
thresh 1e-4
maxiter 10
io ga
tilesize 24
end

set tce:pstat t
set tce:nts  t

task tce energy

Intel® C++ Composer XE

Application porting guide

Area tema:

IDZone

↧

Intel® Compilers for Linux*: Application Porting Guide

September 18, 2014, 1:59 am

Latest and popular articles on Intel Technologies

≫ Next: Intel® XDK "Cordova for Android" Build Options

≪ Previous: NWChem* for the Intel® Xeon Phi™ Coprocessor

This paper describes application porting when using Intel® Compilers for Linux*. The Intel C/C++ compiler is compatible to the GNU* compilers in terms of source, binary and command-line compatibility. The Intel® C/C++ and Fortran Compilers help make your applications run at top speed on Intel's platforms, including those based on the IA-32, Intel® 64 and Intel® Xeon Phi architectures. The compilers also provide compatibility with commonly used Linux* software development tools.

Compilatore C++ Intel®

Compilatore Fortran Intel®

Strumenti di sviluppo

Area tema:

IDZone

↧

Intel® XDK "Cordova for Android" Build Options

September 18, 2014, 9:56 am

Latest and popular articles on Intel Technologies

≫ Next: Intel XDK Documentation Quick Links

≪ Previous: Intel® Compilers for Linux*: Application Porting Guide

The Intel® XDK "Cordova for Android" build system requires a special configuration file in your project source directory to direct the build process. The Cordova build option is based on the open source Apache* Cordova CLI build system. When you use the Cordova build option your application project files are submitted to the Intel XDK build server where a Cordova CLI system is hosted and maintained, there is no need to install the open source Cordova CLI system on your workstation.

The followingintelxdk.config.xml options pertain only to Android builds. They will not affect builds for other target platforms. It is perfectly okay to include these options in your build config file when submitting a build for other platform targets (such as iOS and Windows Phone 8). These options will be ignored when building for those other targets. Thus, you only need a singleintelxdk.config.xml build config file for your project, regardless of the number of Cordova targets you intend to build for.

NOTE: the "Crosswalk for Android" build system is also a Cordova build target but does not currently support the use of theintelxdk.config.xml build config file. An interactive build configuration page is provided for that purpose.

For detailed information regarding the structure and contents of theintelxdk.config.xml file please read Using the Intel XDK Cordova Build Option article.

Android Launch Icon Specifications

If no icon files are provided with your project, the build system will provide default icons. It is highly recommended that you replace the default icons with icons of your own before submitting your application to a store. See this article on the Android Developer site for details regarding Android launch icons. If you do not provide custom icons it is likely that your application will be rejected from the Android store.

Icon files must be provided in PNG format. The height and width numbers in the table (below) are in pixels.

Density	Width	Height
xxxhdpi ‡	192	192
xxhdpi †	144	144
xhdpi	96	96
hdpi	72	72
mdpi	48	48
ldpi *	36	36

* ldpi icon files are optional; if not provided the Android OS will automatically downscale your hdpi icon by a factor of two.

† xxhdpi icon files are only supported on hi-res Android 4.4 and above devices; these icon resolutions are not supported by Cordova 3.3.

‡ xxxhdpi icon files are not supported by any Android devices, at this time; these icon resolutions are not supported by Cordova 3.3.

The launch icon you include with your application, when you submit it to the Google Play Store, must be 512x512 pixels in size. This icon is not included in your application, it is part of your store submission package. The Intel XDK does not include a store submission tool, you must submit your application manually using your Android developer account.

Android Splash Screen Image Specifications

Your application will display a splash screen during initialization. This is done to provide a "getting ready" indication while your app and the underlying device API infrastructure initializes. The build system provides default images if no splash files are provided with your project. It is highly recommended that you replace the default splash screen images with your own before submitting your application to a store. See the splash screen section of the Cordova for * Build Options article for more information.

Splash screen images can be provided in PNG, JPG or JPEG formats. The height and width numbers in the table (below) are the minimum recommended pixel sizes for the respective screen densities. See this article on the Android Developer site for details regarding Android screen sizes. NOTE: the dimensions shown below assume a landscape orientation, reverse the numbers for portrait.

Density	Width	Height
xhdpi	960	720
hdpi	640	480
mdpi	470	320
ldpi	426	320

For greater adaptability of your splash screens, you should use 9-patch images. For more information see this StackOverflow posting and Android developer tools article.

NOTE: there have been issues with splash screens on the Cordova for Android platform when building with the Intel XDK. We have resolved the issue regarding landscape and portrait splash screens (they now work as expected). However, we are still working on making nine-patch splash screens work. When nine-patch splash screens work this notice will be removed. Until then for best results, design your splash screens to a 16:9 screen ratio and that should minimize the distortion on most modern Android phones and tablets (for example, the Samsung S3 has a 16:9 screen ratio and the Nexus 7 2013 edition has a 16:8 ratio). The splash screen resolutions specified above are minimum recommend dimensions, not absolute required dimensions.

Android Build Preferences

`android-minSdkVersion`
Specifies the minimum required Android operating system version on which the application will install and run. For best results it is recommended that you specify Android 2.3.3 or higher (Android 2.3.3 = 10), this value is the default if no minimum is provided. See "Versioning Your Applications" for an overview regarding how to assign version numbers to Android applications.
`<preference name="android-minSdkVersion" value="<N>" />`
`<N>` is a number representing the minimum supported Android version.

`android-targetSdkVersion`
Specifies the target Android operating system. This is the version of Android you have tested against and will be used by the operating system to insure that future versions of your application continue to run by observing compatibility behaviors. See "Versioning Your Applications" for an overview regarding how to assign version numbers to Android applications.
`<preference name="android-targetSdkVersion" value="<N>" />`
`<N>` is a number representing the target Android version. The value 19 is recommended to minimize the effect of Android 4.4 "webview quirks mode" when running your application on Android 4.4+ devices. However, some applications may need this "webview quirks mode" behavior and should then specify a lower API level (such as 17 or 18) Start with 19 and insure your app works on Android 4.4+ devices; if you see issues on Android 4.4+ devices, try using 17 or 18 and see if your app works better there.

`android-installLocation`
Specifies where the application can be installed (internal memory, external memory, sdcard, etc.).
`<preference name="android-installLocation" value="<LOCATION>" />`
`<LOCATION>` indicates where the application can be installed.

`android-windowSoftInputMode`
Specifies how the application interacts with the on-screen soft keyboard.
`<preference name="android-windowSoftInputMode" value="<INPUTMODE>" />`
`<INPUTMODE>` determines the state and features of the keyboard.

`android-permission`
For adding application device permissions.
`<preference name="android-permission" value="<PERMISSION"> />`
`<PERMISSION>` is the permission identifier. The list of valid permissions may vary depending on the version of the Android operating system that is targeted. See the Manifest Permission List on the Android Developer site for a complete list of Android permissions. NOTE: the NETWORK permission will always be part of your Cordova application, even if no Cordova plugins have been included in your application. This is due to the way the Cordova framework communicates ("bridges the gap") between the HTML5 JavaScript layer and the underlying native code layer.

`android-signed`
Allows you to indicate if the application will be signed (for distribution in the Android store).
`<preference name="android-signed" value="<bool>" />`
`<bool>`indicates whether the application should be signed. A value of "`false`" indicates the application will not be signed, and "`true`" indicates the application will be signed. The default is "`true`".

↧

Intel XDK Documentation Quick Links

September 18, 2014, 4:03 pm

Latest and popular articles on Intel Technologies

≫ Next: How to use Intel® Advisor XE 2015 to model suitability on an Intel® Xeon Phi™ coprocessor

≪ Previous: Intel® XDK "Cordova for Android" Build Options

Having trouble searching the website? Use the following search examples in a Google* search bar and it will help to narrow your results to articles from this site only. For example, the first example searches for articles that mention the word "build." The second example searches for tutorials, etc.

The following links are provided here in case you need quick access to detailed device API documentation. These are useful (but optional) APIs that you can use to build hybrid mobile HTML5 apps with the Intel XDK.

For support, please visit the Intel XDK forum.

To manage your online account, and project files that have been pushed to our build and test server, visit the Intel XDK App Center.

Additional Resources

Overview: Intel XDK Development Tools
App Starter video
Crosswalk Overview
jQuery* API
Google Chrome* Developer Tools (CDT)
Intel XDK - What We Did and Why
Archived Intel XDK API Documentation (for "legacy" builds only)

Useful Emulator Debug Hints

The Emulator tab in the Intel XDK simulates a subset of the device APIs that are available to your application. The only way to debug these non-simulated APIs is by using on-device debugging via App Preview or a built application.

Note that the Emulator is implemented as a Chrome browser embedded inside the Intel XDK. It simulates device viewport sizes, touch events, user agent strings and various device APIs (other than those shown below) for a convenient debugging experience using standard Chrome Dev Tools. It does not emulate an actual device or the underlying operating system. The memory, CPU and HTML5 rendering capabilities of your actual device can be quite different. It is best to think of the devices presented inside the Intel XDK Emulator tab as a collection of "ideal" devices with nearly unlimited memory, CPU and HTML5 features.

A list of the unimplemented APIs in the Emulator is provided below:

device.addRemoteScript: undefined (does not throw an exception)
device.blockRemotePages (writes an invisible message and then does nothing)
facebook object: unimplemented
file object: unimplemented
oauth object: unimplemented
playingtrack object: unimplemented
the player object does not implement the following methods:
- show
- hide
- playPodcast
- setAudioCurrentTime
- getAudioCurrentTime
- clearAudioCurrentTime
- startStation
- watchAudioCurrentTime
- clearAudioCurrentTimeWatch
- play
- pause
- stop
- volume
- rewind
- ffwd
- setColors
- setPosition
- startShoutcast

↧

How to use Intel® Advisor XE 2015 to model suitability on an Intel® Xeon Phi™ coprocessor

September 22, 2014, 6:26 am

Latest and popular articles on Intel Technologies

≫ Next: Debugging Intel® Xeon Phi™ Applications on Linux* Host

≪ Previous: Intel XDK Documentation Quick Links

Introduction

Intel® Advisor XE 2015 now includes some new capabilities for analyzing Intel® Xeon Phi™ coprocessor applications. This article steps through this analysis on an Intel Xeon Phi coprocessor and also outlines some of the new capabilities.

Building the application

The application we are using is one of the samples included in the Intel Advisor XE. It is located in C:\Program Files (x86)\Intel\Advisor XE 2015\samples\en\C++\tachyon_Advisor.zip. To build the application on the Microsoft Windows* OS:

First source the environment for the Intel® compiler you are using.
- Run C:\Program Files (x86)\Intel\compiler_xe_2015\bin\compilervars.bat intel64.
Unzip the sample in a directory where you have permission. We will unzip to C:\advisor_samples.
Build the application.
- Open the solution file C:\advisor_samples\tachyon_Advisor\tachyon_Advisor.sln using Microsoft Visual Studio* 2012.
- In the Microsoft Visual Studio IDE right-click 2_tachyon_annotated and select Set As Startup Project.
- Make sure you are set for building in Release mode, then click Build > Rebuild Solution.

Running the application using the Advisor XE 2015 suitability analysis

First bring up the Intel Advisor XE 2015 Workflow.

Click Tools > Advisor XE 2015 > Open Advisor Workflow.
Click the Collect Suitability Analysis button.

Some key observations:

By default the Intel Advisor XE does its modeling on a host CPU. In this case it assumes you have 8 CPUs. You can change the CPU count by using the CPU Count drop-down list.
Note how much speedup you can expect.
Also note the scalability graph. This indicates if you have the type of workload that will scale (that is, get faster) when you add additional CPUs.

Showing Suitability for an Intel Xeon Phi coprocessor

Click the Target System drop-down list and select Intel Xeon Phi.

Some key observations

By default Advisor XE models your application with 128 coprocessor threads. You can modify this with the Coprocessor Threads drop-down list.
In the scalability graph the area in green indicates if your parallel region has enough parallelism to be ready for running on an Intel Xeon Phi coprocessor.
The following indicators show where you should be looking to improve performance:
- Load Imbalance
- Runtime Overhead
- Lock Contention

One interesting note with this application is that when the CPU Count is 8 there is not a significant load imbalance. If you expand the Load Imbalance item on the graph you can see the following:

If you increase the number of CPUs the load imbalance does become significant. In this case we have set the number of CPUs to 128.

Note: In the Task Modeling area you can see that on average there are 512 tasks for this parallel region. When you have only 8 CPUs there is plenty of work to assign to each CPU, but when you have 128 you see the load imbalance. To test an algorithm with a greater amount of work, use the slider in the Task Modeling section. Slide to 5x and then click Apply.

As you can see, when we increase the amount of work we decrease the load imbalance.

Showing suitability for offloading your parallel region to an Intel Xeon Phi coprocessor

Click the Target System drop-down list and select Offload to Intel Xeon Phi.

Modeling different data sets using Intel Advisor XE 2015

You can also model the scalability of your parallel regions using different data sets. For example you can test “what-if” you had 5, 25 or 125 times the number of tasks/work. What would be the resulting speedup and scalability?

Go to the Task Modeling region.
Use the slide to select 5x.
Click Apply.

You can also model average task duration by using the slider titled Avg. Task Duration.

Advanced modeling of Intel Xeon Phi coprocessor vectorization

Intel Advisor XE also has the ability to model how your application would run on an Intel Xeon Phi coprocessor both considering vectorization and not considering vectorization.

If you click Intel Xeon Phi Advanced Modeling you can model the suitability with and without vectorization:

Summary

Intel Advisor XE 2015 is a powerful tool for modeling the scalability of your application. Using the new features for modeling on an Intel Xeon Phi coprocessor you can easily tell if your workload will scale to the high number of coprocessor threads. The features to dynamically change the size of your data set size let you see if your algorithm will see benefits from additional scaling without having to make any code changes.

Intel(R) Xeon Phi(TM) Coprocessor

Sviluppo multithread

Area tema:

IDZone

↧

Debugging Intel® Xeon Phi™ Applications on Linux* Host

September 23, 2014, 6:22 am

Latest and popular articles on Intel Technologies

≫ Next: Debugging Intel® Xeon Phi™ Applications on Windows* Host

≪ Previous: How to use Intel® Advisor XE 2015 to model suitability on an Intel® Xeon Phi™ coprocessor

Introduction

Intel® Xeon Phi™ coprocessor is a product based on the Intel® Many Integrated Core Architecture (Intel® MIC). Intel® offers a debug solution for this architecture that can debug applications running on an Intel® Xeon Phi™ coprocessor.

There are many reasons for the need of a debug solution for Intel^® MIC. Some of the most important ones are the following:

Developing native Intel^® MIC applications is as easy as for IA-32 or Intel^® 64 hosts. In most cases they just need to be cross-compiled (-mmic).
Yet, Intel^® MIC Architecture is different to host architecture. Those differences could unveil existing issues. Also, incorrect tuning for Intel^® MIC could introduce new issues (e.g. alignment of data, can an application handle more than hundreds of threads?, efficient memory consumption?, etc.)
Developing offload enabled applications induces more complexity as host and coprocessor share workload.
General lower level analysis, tracing execution paths, learning the instruction set of Intel^® MIC Architecture, …

Debug Solution for Intel® MIC

For Linux* host, Intel offers a debug solution for Intel® MIC which is based on GNU* GDB. It can be used on the command line for both host and coprocessor. There is also an Eclipse* IDE integration that eases debugging of applications with hundreds of threads thanks to its user interface. It also supports debugging offload enabled applications.

How to get it?

There are currently two ways to obtain Intel’s debug solution for Intel® MIC Architecture on Linux* host:

Intel^® Manycore Platform Software Stack (MPSS):
http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss
(currently MPSS 3.3.1 for Linux*)
First version became available with MPSS 2.1.
Intel^® Composer XE (for C/C++ or Fortran):
http://software.intel.com/en-us/intel-composer-xe
It requires at least Intel^® Composer XE 2013 SP1 or later.

Both packages contain the same debug solutions for Intel® MIC Architecture!

Note:
Intel® Composer XE 2013 SP1 contains GNU* GDB 7.5. With later versions, GNU* GDB 7.7 is available. However, all MPSS versions only have version 7.5 right now.

Why use GNU* GDB provided by Intel?

New features/improvements offered back to GNU* community
Latest GNU* GDB versions in future releases
Improved C/C++ & Fortran support thanks to Project Archer and contribution through Intel
Increased support for Intel^® architecture (esp. Intel^® MIC)
Additional debugging capabilities – more later

Latest Intel related HW support and features are provided in the debug solution from Intel!

Why is Intel providing a Command Line and Eclipse* IDE Integration?

The command line with GNU* GDB has the following advantages:

Well known syntax
Lightweight: no dependencies
Easy setup: no project needs to be created
Fast for debugging hundreds of threads
Can be automatized/scripted

Using the Eclipse* IDE provides more features:

Comfortable user interface
Most known IDE in the Linux* space
Use existing Eclipse* projects
Simple integration of the Intel enhanced GNU* GDB
Works also with Photran* plug-in to support Fortran
Supports debugging of offload enabled applications
(not supported by command line)

Deprecation Notice

Intel® Debugger is deprecated (incl. Intel® MIC Architecture support):

Intel® Debugger for Intel® MIC Architecture was only available in Composer XE 2013 & 2013 SP1
Intel® Debugger is not part of Intel® Composer XE 2015 anymore

Users are advised to use GNU* GDB that comes with Intel® Composer XE 2013 SP1 and later!

You can provide feedback via either your Intel® Premier account (http://premier.intel.com) or via the Debug Solutions User Forum (http://software.intel.com/en-us/forums/debug-solutions/).

Features

Intel’s GNU* GDB, starting with version 7.5, provides additional extensions that are available on the command line:

Support for Intel^® Many Integrated Core Architecture (Intel^® MIC Architecture):
Displays registers (zmmX & kX) and disassembles the instruction set
Support for Intel^® Transactional Synchronization Extensions (Intel^® TSX):
Helpers for Restricted Transactional Memory (RTM) model
(only for host)
Data Race Detection (pdbx):
Detect and locate data races for applications threaded using POSIX* thread (pthread) or OpenMP* models
Branch Trace Store (btrace):
Record branches taken in the execution flow to backtrack easily after events like crashes, signals, exceptions, etc.
(only for host)
Pointer Checker:
Assist in finding pointer issues if compiled with Intel^® C++ Compiler and having Pointer Checker feature enabled
(only for host)
Register support for Intel^® Memory Protection Extensions (Intel^® MPX) and Intel^® Advanced Vector Extensions 512 (Intel^® AVX-512):
Debugger is already prepared for future generations

The features for Intel® MIC highlighted above are described in the following.

Register and Instruction Set Support

Compared to Intel® architecture on host systems, Intel® MIC Architecture comes with a different instruction and register set. Intel’s GNU* GDB comes with transparently integrated support for those. Use is no different than with host systems, e.g.:

Disassembling of instructions:
```
		(gdb) disassemble $pc, +10

		Dump of assembler code from 0x11 to 0x24:

		0x0000000000000011 <foobar+17>: vpackstorelps %zmm0,-0x10(%rbp){%k1}

		0x0000000000000018 <foobar+24>: vbroadcastss -0x10(%rbp),%zmm0

		⁞

		
```
In the above example the first ten instructions are disassembled beginning at the instruction pointer ($pc). Only first two lines are shown for brevity. The first two instructions are Intel® MIC specific and their mnemonic is correctly shown.

Listing of mask (kX) and vector (zmmX) registers:


		(gdb) info registers zmm

		k0   0x0  0

		     ⁞

		zmm31 {v16_float = {0x0 <repeats 16 times>},

		      v8_double = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},

		      v64_int8 = {0x0 <repeats 64 times>},

		      v32_int16 = {0x0 <repeats 32 times>},

		      v16_int32 = {0x0 <repeats 16 times>},

		      v8_int64 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},

		      v4_uint128 = {0x0, 0x0, 0x0, 0x0}}

Also registers have been extended by kX (mask) and zmmX (vector) register sets that come with Intel® MIC.

If you use the Eclipse* IDE integration you’ll get the same information in dedicated windows:

Disassembling of instructions:
Listing of mask (kX) and vector (zmmX) registers:

Data Race Detection

A quick excursion about what data races are:

A data race happens…
If at least two threads/tasks access the same memory location w/o synchronization and at least one thread/task is writing.

Example:
Imaging the two functions thread1()& thread2() are executed concurrently by different threads.


		int a = 1;

		int b = 2;

		                                         | t

		int thread1() {      int thread2() {     | i

		  return a + b;        b = 42;           | m

		}                    }                   | e

		                                         v

Return value of thread1() depends on timing: 3 vs. 43!
This is one (trivial) example of a data race.

What are typical symptoms of data races?

Data race symptoms:
- Corrupted results
- Run-to-run variations
- Corrupted data ending in a crash
- Non-deterministic behavior
Solution is to synchronize concurrent accesses, e.g.:
- Thread-level ordering (global synchronization)
- Instruction level ordering/visibility (atomics)
  Note:
  Race free but still not necessarily run-to-run reproducible results!
- No synchronization: data races might be acceptable

GDB data race detection points out unsynchronized data accesses. Not all of them might incur data races. It is the responsibility of the user to decide which ones are not expected and filter them (see next).
Due to technical limitations not all unsynchronized data accesses can be found, e.g.: 3rd party libraries or any object code not compiled with –debug parallel (see next).

How to detect data races?

Prepare to detect data races:
- Only supported with Intel^® C++/Fortran Compiler (part of Intel^® Composer XE):
  Compile with -debug parallel (icc, icpc or ifort)
  Only objects compiled with-debug parallel are analyzed!
- Optionally, add debug information via –g

Enable data race detection (PDBX) in debugger:


		(gdb) pdbx enable

		(gdb) c

		data race detected

		1: write shared, 4 bytes from foo.c:36

		3: read shared, 4 bytes from foo.c:40

		Breakpoint -11, 0x401515 in L_test_..._21 () at foo.c:36

		*var = 42; /* bp.write */

Data race detection requires an additional library libpdbx.so.5:

Keeps track of the synchronizations
Part of Intel® C++ & Fortran Compiler
Copy to coprocessor if missing
(found at <composer_xe_root>/compiler/lib/mic/libpdbx.so)

Supported parallel programming models:

OpenMP*
POSIX* threads

Data race detection can be enabled/disabled at any time

Only memory access are analyzed within a certain period
Keeps memory footprint and run-time overhead minimal

There is finer grained control for minimizing overhead and selecting code sections to analyze by using filter sets.

More control about what to analyze with filters:

Add filter to selected filter set, e.g.:
```
		(gdb) pdbx filter line foo.c:36

		(gdb) pdbx filter code 0x40518..0x40524

		(gdb) pdbx filter var shared

		(gdb) pdbx filter data 0x60f48..0x60f50

		(gdb) pdbx filter reads # read accesses

		
```
Those define various filter on either instructions by specifying source file and line or the addresses (range), or variables using symbol names or addresses (range) respectively. There is also a filter to only report accesses that use (read) data in case of a data race.
There are two basic configurations, that are exclusive:
- Ignore events specified by filters (default behavior)
```
				(gdb) pdbx fset suppress

				
```
- Ignore events not specified by filters
```
				(gdb) pdbx fset focus

				
```
  The first one defines a white list, whilst the latter one blacklists code or data sections that should not be analyzed.
Get debug command help
```
		(gdb) help pdbx

		
```
This command will provide additional help on the commands.

Use cases for filters:

Focused debugging, e.g. debug a single source file or only focus on one specific memory location.
Limit overhead and control false positives. Detection involves some runtime and memory overhead at runtime. The more filters narrow down the scope of analysis, the more the overhead will be reduced. This can also be used to exclude false positives. Those can occur if real data races are detected, but without any impact on application’s correctness by design (e.g. results of multiple threads don’t need to be globally stored in strict order).
Exclude 3rd party code for analysis

Some additional hints using PDBX:

Optimized code (symptom):


		(gdb) run

		data race detected

		1: write question, 4 bytes from foo.c:36

		3: read question, 4 bytes from foo.c:40

		Breakpoint -11, 0x401515 in foo () at foo.c:36

		*answer = 42;

		(gdb)

Incident has to be analyzed further:
- Remember: data races are reported on memory objects
- If symbol name cannot be resolved: only address is printed
Recommendation:
Unoptimized code (-O0) makes it easier to understand due to removed/optimized away temporaries, etc.
Reported data races appear to be false positives:
- Not all data races are bad… user intended?
- OpenMP*: Distinct parallel sections using the same variable (same stack frame) can result in false positives

Note:
PDBX is not available for Eclipse* IDE and will only work for remote debugging of native coprocessor applications. See section Debugging Remotely with PDBX for more information on how to use it.

Debugging on Command Line

There are multiple versions available:

Debug natively on Intel^® Xeon Phi™ coprocessor
Execute GNU* GDB on host and debug remotely

Debug natively on Intel® Xeon Phi™ coprocessor
This version of Intel’s GNU* GDB runs natively on the coprocessor. It is included in Intel® MPSS only and needs to be made available on the coprocessor first in order to run it. Depending on the MPSS version it can be found at the provided location:

MPSS 2.1: /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb
MPSS 3.*: included in gdb-7.*+mpss3.*.k1om.rpm as part of package mpss-3.*-k1om.tar
(for MPSS 3.1.2, please see Errata, for MPSS 3.1.4 use mpss-3.1.4-k1om-gdb.tar)

For MPSS 3.* the coprocessor native GNU* GDB requires debug information from some system libraries for proper operation. Please see Errata for more information.

Execute GNU* GDB on host and debug remotely
There are two ways to start GNU* GDB on the host and debug remotely using GDBServer on the coprocessor:

Intel^® MPSS:
- MPSS 2.1: /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
- MPSS 3.*: <mpss_root>/sysroots/x86_64-mpsssdk-linux/usr/bin/k1om-mpss-linux/k1om-mpss-linux-gdb
- GDBServer:
  /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdbserver
  (same path for MPSS 2.1 & 3.*)
Intel^® Composer XE:
- Source environment to start GNU* GDB:
```
				$ source debuggervars.[sh|csh]

				$ gdb-mic

				
```
- GDBServer:
  <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver

The sourcing of the debugger environment is only needed once. If you already sourced the according compilervars.[sh|csh] script you can omit this step and gdb-mic should already be in your default search paths.

Attention: Do not mix GNU* GDB & GDBServer from different packages! Always use both from either Intel® MPSS or Intel® Composer XE!

Debugging Natively

Make sure GNU* GDB is already on the target by:

Copy manually, e.g.:


		$ scp /usr/linux-k1om-4.7/linux-k1om/usr/bin/gdb mic0:/tmp

Add to the coprocessor image (see Intel® MPSS documentation)

Run GNU* GDB on the Intel® Xeon Phi™ coprocessor, e.g.:
```
		$ ssh –t mic0 /tmp/gdb

		
```
Initiate debug session, e.g.:

Attach:
```
		(gdb) attach <pid>
```
<pid> is PID on the coprocessor
Load & execute:
```
		(gdb) file <path_to_application>
```
<path_to_application> is path on coprocessor

Some additional hints:

If native application needs additional libraries:
Set $LD_LIBRARY_PATH, e.g. via:
```
		(gdb) set env LD_LIBRARY_PATH=/tmp/

		
```
…or set the variable before starting GDB
If source code is relocated, help the debugger to find it:
```
		(gdb) set substitute-path <from> <to>
```
Change paths from <from> to<to>. You can relocate a whole source (sub-)tree with that.

Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!

Debugging Remotely

Copy GDBServer to coprocessor, e.g.:
```
		$ scp <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver mic0:/tmp
```
During development you can also add GDBServer to your coprocessor image!
Start GDB on host, e.g.:
```
		$ source debuggervars.[sh|csh]

		$ gdb-mic

		
```
Note:
There is also a version named gdb-ia which is for IA-32/Intel® 64 only!

Connect:


		(gdb) target extended-remote | ssh -T mic0 /tmp/gdbserver --multi –

Set sysroot from MPSS installation, e.g.:
```
		(gdb) set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/

		
```
If you do not specify this you won't get debugger support for system libraries.
Debug:

Attach:
```
		(gdb) file <path_to_application>

		(gdb) attach <pid>
```
<path_to_application> is path on host, <pid> is PID on the coprocessor
Load & execute:
```
		(gdb) file <path_to_application>

		(gdb) set remote exec-file <remote_path_to_application>
```
<path_to_application> is path on host, <remote_path_to_application> is path on the coprocessor

Some additional hints:

If remote application needs additional libraries:
Set $LD_LIBRARY_PATH, e.g. via:


		(gdb) target extended-remote | ssh mic0 LD_LIBRARY_PATH=/tmp/ /tmp/gdbserver --multi -

If source code is relocated, help the debugger to find it:
```
		(gdb) set substitute-path <from> <to>
```
Change paths from <from> to <to>. You can relocate a whole source (sub-)tree with that.
If libraries have different paths on host & target, help the debugger to find them:
```
		(gdb) set solib-search-path <lib_paths>
```
<lib_paths> is a colon separated list of paths to look for libraries on the host

Debugging is no different than on host thanks to a real Linux* environment on the coprocessor!

Debugging Remotely with PDBX

PDBX has some pre-requisites that must be fulfilled for proper operation. Use pdbx check command to see whether PDBX is working:

First step:
```
		(gdb) pdbx check

		checking inferior...failed.

		
```
Solution:
Start a remote application (inferior) and hit some breakpoint (e.g. b main& run)
Second step:
```
		(gdb) pdbx check

		checking inferior...passed.

		checking libpdbx...failed.

		
```
Solution:
Use set solib-search-path <lib_paths> to provide the path of libpdbx.so.5 on the host.
Third step:
```
		(gdb) pdbx check

		checking inferior...passed.

		checking libpdbx...passed.

		checking environment...failed.

		
```
Solution:
Set additional environment variables on the target for OpenMP*. Those need to be set with starting GDBServer (similar to setting $LD_LIBRARY_PATH).

$INTEL_LIBITTNOTIFY32=""
$INTEL_LIBITTNOTIFY64=""
$INTEL_ITTNOTIFY_GROUPS=sync

Debugging with Eclipse* IDE

Intel offers an Eclipse* IDE debugger plug-in for Intel® MIC that has the following features:

Seamless debugging of host and coprocessor
Simultaneous view of host and coprocessor threads
Supports multiple coprocessor cards
Supports both C/C++ and Fortran
Support of offload extensions (auto-attach to offloaded code)
Support for Intel^® Many Integrated Core Architecture (Intel^® MIC Architecture): Registers & Disassembly

The plug-in is part of both Intel® MPSS and Intel® Composer XE.

Pre-requisites

In order to use the provided plug-in the following pre-requisites have to be met:

Supported Eclipse* IDE version:
- 4.4 with Eclipse C/C++ Development Tools (CDT) 8.3 or later
- 4.3 with Eclipse C/C++ Development Tools (CDT) 8.1 or later
- 4.2 with Eclipse C/C++ Development Tools (CDT) 8.1 or later
- 3.8 with Eclipse C/C++ Development Tools (CDT) 8.1 or later

We recommend: Eclipse* IDE for C/C++ Developers (4.4)

Java* Runtime Environment (JRE) 6.0 or later (7.0 for Eclipse* 4.4)
For Fortran optionally Photran* plug-in
Remote System Explorer (aka. Target Management) to debug native coprocessor applications
Only for plug-in from Intel^® Composer XE, source debuggervars.[sh|csh] for Eclipse* IDE environment!

Install Intel® C++ Compiler plug-in (optional):
Add plug-in via “Install New Software…”:

This Plug-in is part of Intel^® Composer XE (<composer_xe_root>/eclipse_support/cdt8.0/). It adds Intel® C++ Compiler support which is not mandatory for debugging. For Fortran the counterpart is the Photran* plug-in. These plug-ins are recommended for the best experience.

Note:
Uncheck “Group items by category”, as the list will be empty otherwise!
In addition, it is recommended to disable checking for latest versions. If not done, installation could take unnecessarily long and newer components might be installed that did not come with the vanilla Eclipse package. Those could cause problems.

Install Plug-in for Offload Debugging

Add plug-in via “Install New Software…”:

Plug-in is part of:

Intel^® MPSS:
- MPSS 2.1: <mpss_root>/eclipse_support/
- MPSS 3.*: /usr/share/eclipse/mic_plugin/
Intel^® Composer XE:<composer_xe_root>/debugger/cdt/

Configure Offload Debugging

Create a new debug configuration for “C/C++ Application”
Click on “Select other…” and select MPM (DSF) Create Process Launcher:
The “MPM (DSF) Create Process Launcher” needs to be used for our plug-in. Please note that this instruction is for both C/C++ and Fortran applications! Even though Photran* is installed and a “Fortran Local Application” entry is visible (not in the screenshot above!) don’t use it. It is not capable of using MPM.
In “Debugger” tab specify MPM script of Intel’s GNU* GDB:
- Intel^® MPSS:
  - MPSS 2.1: <mpss_root>/mpm/bin/start_mpm.sh
  - MPSS 3.*: /usr/bin/start_mpm.sh
    (for MPSS 3.1.1, 3.1.2 or 3.1.4, please see Errata)
- Intel^® Composer XE:
  <composer_xe_root>/debugger/mpm/bin/start_mpm.sh
  
  Here, you finally add Intel’s GNU* GDB for offload debugging (using MPM (DSF)). It is a script that takes care of setting up the full environment needed. No further configuration is required (e.g. which coprocessor cards, GDBServer & ports, IP addresses, etc.); it works fully automatic and transparent.

Start Offload Debugging

Debugging offload enabled applications is not much different than applications native for the host:

Create & build an executable with offload extensions (C/C++ or Fortran)
- Use offload pragmas/directives or OpenMP* 4.0
- Use MYO (_Cilk_shared, _Cilk_spawn, …)
  
  For getting started examples, please see:
  - <composer_xe_root>/Samples/en_US/[C++|Fortran]/mic_samples
  - http://software.intel.com/en-us/articles/offload-programming-fortran-and-c-code-examples
  - Or search via Intel® Developer Zone’s Content Library:
    http://software.intel.com/en-us/search/site
Don’t forget to add debug information (-g) and reduce optimization level if possible (-O0)
Start debug session:
- Host & target debugger will work together seamlessly
- All threads from host & target are shown and described
- Debugging is same as used from Eclipse* IDE

This is an example (Fortran) of what offload debugging looks like. On the left side we see host & mic0 threads running. One thread (11) from the coprocessor has hit the breakpoint we set inside the loop of the offloaded code. Run control (stepping, continuing, etc.), setting breakpoints, evaluating variables/memory, … work as they used to.

Additional Requirements for Offload Debugging

For debugging offload enabled applications additional environment variables need to be set:

Intel® MPSS 2.1:
COI_SEP_DISABLE=FALSE
MYO_WATCHDOG_MONITOR=-1
Intel® MPSS 3.*:
AMPLXE_COI_DEBUG_SUPPORT=TRUE
MYO_WATCHDOG_MONITOR=-1

Set those variables before starting Eclipse* IDE!

Those are currently needed but might become obsolete in the future. Please be aware that the debugger cannot and should not be used in combination with Intel® VTune™ Amplifier XE. Hence disabling SEP (as part of Intel® VTune™ Amplifier XE) is valid. The watchdog monitor must be disabled because a debugger can stop execution for an unspecified amount of time. Hence the system watchdog might assume that a debugged application, if not reacting anymore, is dead and will terminate it. For debugging we do not want that.

Note:
Do not set those variables for a production system!

For Intel® MPSS 3.2 and later:
MYO debug libraries are no longer installed with Intel MPSS 3.2 by default. This is a change from earlier Intel MPSS versions. Users must install the MYO debug libraries manually in order to debug MYO enabled applications using the Eclipse plug-in for offload debugging. For Intel MPSS 3.2 (and later) the MYO debug libraries can be found in the package mpss-myo-dbg-* which is included in the mpss-*.tar file.

MPSS 3.2 and 3.2.1 do not support offload debugging with Intel® Composer XE 2013 SP1, please see Errata for more information!

Configure Native Debugging

Configure Remote System Explorer
To debug native coprocessor applications we need to configure the Remote System Explorer (RSE).

Note:
Before you continue, make sure SSH works (e.g. via command line). You can also specify different credentials (user account) via RSE and save the password.

The basic steps are quite simple:

Show the Remote System window:
Menu Window->Show View->Other…
Select: Remote Systems->Remote Systems
Add a new system node for each coprocessor:

Context menu in window Remote Systems: New Connection…

Select Linux, press Next>
Specify hostname of the coprocessor (e.g. mic0), press Next>
In the following dialogs select:
- ssh.files
- processes.shell.linux
- ssh.shells
- ssh.terminals

Repeat this step for each coprocessor!

Transfer GDBServer
Transfer of the GDBServer to the coprocessor is required for remote debugging. We choose /tmp/gdberver as target on the coprocessor here (important for the following sections).

Transfer the GDBServer to the coprocessor target, e.g.:


	$ scp <composer_xe_root>/debugger/gdb/target/mic/bin/gdbserver mic0:/tmp

During development you can also add GDBServer to your coprocessor image!

Note:
See section Debugging on Command Line above for the correct path of GDBServer, depending on the chosen package (Intel^® MPSS or Intel^® Composer XE)!

Debug Configuration

To create a new debug configuration for a native coprocessor application (here: native_c++) create a new one for C/C++ Remote Application.

Set Connection to the coprocessor target configured with RSE before (here: mic0).

Specify the remote path of the application, wherever it was copied to (here: /tmp/native_c++). We’ll address how to manually transfer files later.

Set the flag for “Skip download to target path.” if you don’t want the debugger to upload the executable to the specified path. This can be meaningful if you have complex projects with external dependencies (e.g. libraries) and don’t want to manually transfer the binaries.
(for MPSS 3.1.2 or 3.1.4, please see Errata)

Note that we use C/C++ Remote Application here. This is also true for Fortran applications because there’s no remote debug configuration section provided by the Photran* plug-in!

In Debugger tab, specify the provided Intel GNU* GDB for Intel® MIC (here: gdb-mic).

In the above example, set sysroot from MPSS installation in .gdbinit, e.g.:


	set sysroot /opt/mpss/3.1.4/sysroots/k1om-mpss-linux/

You can use .gdbinit or any other command file that should be loaded before starting the debugging session. If you do not specify this you won't get debugger support for system libraries.

Note:
See section Debugging on Command Line above for the correct path of GDBServer, depending on the chosen package (Intel^® MPSS or Intel^® Composer XE)!

In Debugger/Gdbserver Settings tab, specify the uploaded GDBServer (here: /tmp/gdbserver).

Build Native Application for the Coprocessor

Configuration depends on the installed plug-ins. For C/C++ applications we recommend to install the Intel® C++ Compiler XE plug-in that comes with Composer XE. For Fortran, install Photran* (3^rd party) and select the Intel® Fortran Compiler manually.

Make sure to use the debug configuration and provide options as if debugging on the host (-g). Optionally, disabling optimizations by –O0 can make the instruction flow comprehendible when debugging.

The only difference compared to host builds is that you need to cross-compile for the coprocessor: Use the –mmic option, e.g.:

After configuration, clean your build. This is needed because Eclipse* IDE might not notice all dependencies. And finally, build.

Note:
That the configuration dialog shown only exists for the Intel® C++ Compiler plug-in. For Fortran, users need to install the Photran* plug-in and switch the compiler/linker to ifort by hand plus adding -mmic manually. This has to be done for both the compiler & linker!

Start Native Debugging

Transfer the executable to the coprocessor, e.g.:

Copy manually (e.g. via script on the terminal)
Use the Remote Systems window (RSE) to copy files from host and paste to coprocessor target (e.g. mic0):

Select the files from the tree (Local Files) and paste them to where you want them on the target to be (e.g. mic0)
Use NFS to mirror builds to coprocessor (no need for update)
Use debugger to transfer (see earlier)

Note:
It is crucial that the executable can be executed on the coprocessor. In some cases the execution bits might not be set after copying.

Start debugging using the C/C++ Remote Application created in the earlier steps. It should connect to the coprocessor target and launch the specified application via the GDBServer. Debugging is the same as for local/host applications.

Note:
This works for coprocessor native Fortran applications the exact same way!

Documentation

More information can be found in the official documentation:

Intel^® MPSS:
- MPSS 2.1:
  <mpss_root>/docs/gdb/gdb.pdf
  <mpss_root>/eclipse_support/README-INTEL
- MPSS 3.*:
  not available yet (please see Errata)
Intel^® Composer XE:
<composer_xe_root>/Documentation/[en_US|ja_JP]/debugger/gdb/gdb.pdf
<composer_xe_root>/Documentation/[en_US|ja_JP]/debugger/gdb/eclmigdb_config_guide.pdf

The PDF gdb.pdf is the original GNU* GDB manual for the base version Intel ships, extended by all features added. So, this is the place to get help for new commands, behavior, etc.
README-INTEL from Intel® MPSS contains a short guide how to install and configure the Eclipse* IDE plug-in.
PDF eclmigdb_config_guide.pdf provides an overall step-by-step guide how to debug with the command line and with Eclipse* IDE.

Using Intel^® C++ Compiler with the Eclipse* IDE on Linux*:
http://software.intel.com/en-us/articles/intel-c-compiler-for-linux-using-intel-compilers-with-the-eclipse-ide-pdf/
The knowledgebase article (Using Intel® C++ Compiler with the Eclipse* IDE on Linux*) is a step-by step guide how to install, configure and use the Intel® C++ Compiler with Eclipse* IDE.

Errata

With the recent switch from MPSS 2.1 to 3.1 some packages might be incomplete or missing. Future updates will add improvements. Currently, documentation for GNU* GDB is missing.
For MPSS 3.1.2 and 3.1.4 the respective package mpss-3.1.[2|4]-k1om.tar is missing. It contains binaries for the coprocessor, like the native GNU* GDB for the coprocessor. It also contains /usr/libexec/sftp-server which is needed if you want to debug native applications on the coprocessor and require Eclipse* IDE to transfer the binary automatically. As this is missing you need to transfer the files manually (select “Skip download to target path.” in this case).
As a workaround, you can use mpss-3.1.1-k1om.tar from MPSS 3.1.1 and install the binaries from there. If you use MPSS 3.1.4, the native GNU* GDB is available separately via mpss-3.1.4-k1om-gdb.tar.
With MPSS 3.1.1, 3.1.2 or 3.1.4 the script <mpss_root>/mpm/bin/start_mpm.sh uses an incorrect path to the MPSS root directory. Hence offload debugging is not working. You can fix this by creating a symlink for your MPSS root, e.g. for MPSS 3.1.2:

$ ln -s /opt/mpss/3.1.2 /opt/mpss/3.1

Newer versions of MPSS correct this. This workaround is not required if you use the start_mpm.sh script from the Intel(R) Composer XE package.
For MPSS 3.* the coprocessor native GNU* GDB requires debug information from some system libraries for proper opteration.
Beginning with MPSS 3.1, debug information for system libraries is not installed on the coprocessor anymore. If the coprocessor native GNU* GDB is executed, it will fail when loading/continuing with a signal (SIGTRAP).
Current workaround is to copy the .debug folders for the system libraries to the coprocessor, e.g.:

$ scp -r /opt/mpss/3.1.2/sysroots/k1om-mpss-linux/lib64/.debug root@mic0:/lib64/
MPSS 3.2 and 3.2.1 do not support offload debugging with Intel® Composer XE 2013 SP1.
Offload debugging with the Eclipse plug-in from Intel® Composer XE 2013 SP1 does not work with Intel MPSS 3.2 and 3.2.1. A configuration file which is required for operation by the Intel Composer XE 2013 SP1 package has been removed with Intel MPSS 3.2 and 3.2.1. Previous Intel MPSS versions are not affected. Intel MPSS 3.2.3 fixes this problem (there is no version of Intel MPSS 3.2.2!).

Intel® Xeon Phi™ Coprocessor developer zone

URL:

Intel® Many Integrated Core Architecture Forum

Area tema:

IDZone

↧

Debugging Intel® Xeon Phi™ Applications on Windows* Host

September 23, 2014, 6:25 am

Latest and popular articles on Intel Technologies

≫ Next: Intel(R) Compiler XE 2015 for Linux Installation Aborts with Segmentation Fault

≪ Previous: Debugging Intel® Xeon Phi™ Applications on Linux* Host

Introduction

There are many reasons for the need of a debug solution for Intel^® MIC. Some of the most important ones are the following:

Developing native Intel^® MIC applications is as easy as for IA-32 or Intel^® 64 hosts. In most cases they just need to be cross-compiled (/Qmic).
Yet, Intel^® MIC Architecture is different to host architecture. Those differences could unveil existing issues. Also, incorrect tuning for Intel^® MIC could introduce new issues (e.g. alignment of data, can an application handle more than hundreds of threads?, efficient memory consumption?, etc.)
Developing offload enabled applications induces more complexity as host and coprocessor share workload.
General lower level analysis, tracing execution paths, learning the instruction set of Intel^® MIC Architecture, …

Debug Solution for Intel® MIC

For Windows* host, Intel offers a debug solution, the Intel® Debugger Extension for Intel® MIC Architecture Applications. It supports debugging offload enabled application as well as native Intel® MIC applications running on the Intel® Xeon Phi™ coprocessor.

How to get it?

To obtain Intel’s debug solution for Intel® MIC Architecture on Windows* host, you need the following:

Intel^® Composer XE (for C/C++ or Fortran):
http://software.intel.com/en-us/intel-composer-xe
Only Intel® Composer XE contains the debug solution for Windows* host. However, Intel^® Manycore Platform Software Stack (Intel^® MPSS) is also required for this to work:
http://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss
(currently 3.3.1 for Windows*)

Debug Solution as Integration

Debug solution from Intel® based on GNU* GDB:

Full integration into Microsoft Visual Studio*, no command line version needed
Available with Intel® Composer XE 2013 SP1 and later

Note:
Pure native debugging on the coprocessor is also possible by using Intel’s version of GNU* GDB for the coprocessor. This is covered in the following article for Linux* host:
http://software.intel.com/en-us/articles/debugging-intel-xeon-phi-applications-on-linux-host

Why integration into Microsoft Visual Studio*?

Microsoft Visual Studio* is established IDE on Windows* host
Integration reuses existing usability and features
Fortran support added with Intel® Fortran Composer XE

Components Required

The following components are required to develop and debug for Intel® MIC Architecture:

Intel® Xeon Phi™ coprocessor
Windows* Server 2008 RC2, Windows* 7 or later
Microsoft Visual Studio* 2012 or later
Support for Microsoft Visual Studio* 2013 was added with Intel® Composer XE 2013 SP1 Update 1.
Intel® MPSS 3.1 or later
C/C++ development:
Intel® C++ Composer XE 2013 SP1 for Windows* or later
Fortran development:
Intel® Fortran Composer XE 2013 SP1 for Windows* or later

Configure & Test

It is crucial to make sure that the coprocessor setup is correctly working. Otherwise the debugger might not be fully functional.

Setup Intel^® MPSS:

Follow Intel^® MPSS readme-windows.pdf for setup
Verify that the Intel^® Xeon Phi™ coprocessor is running

Before debugging applications with offload extensions:

Use official examples from:
C:\Program Files (x86)\Intel\Composer XE 2013 SP1\Samples\en_US
Verify that offloading code works

It is crucial to make sure that the coprocessor setup is correctly working. Otherwise the debugger might not be fully functional.

Prerequisite for Debugging

Debugger integration for Intel^® MIC Architecture only works when debug information is being available:

Compile in debug mode with at least the following option set:
/Zi (compiler) and /DEBUG (linker)
Optional: Unoptimized code (/Od) makes debugging easier
(due to removed/optimized away temporaries, etc.)

Applications can only be debugged in 64 bit

Set platform to x64
Verify that /MACHINE:x64 (linker) is set!

Debugging Applications with Offload Extension

Start Microsoft Visual Studio* IDE and open or create an Intel^® Xeon Phi™ project with offload extensions. Examples can be found in the Samples directory of Intel® Composer XE, that is:

C:\Program Files (x86)\Intel\Composer XE 2013 SP1\Samples\en_US

C++\mic_samples.zip or
Fortran\mic_samples.zip

We’ll use intro_SampleC from the official C++ examples in the following.

Compile the project with Intel^® C++/Fortran Compiler.

Characteristics of Debugging

Set breakpoints in code (during or before debug session):
- In code mixed for host and coprocessor
- Debugger integration automatically dispatches between host/coprocessor
Run control is the same as for native applications:
- Run/Continue
- Stop/Interrupt
- etc.
Offloaded code stops execution (offloading thread) on host
Offloaded code is executed on coprocessor in another thread
IDE shows host/coprocessor information at the same time:
- Breakpoints
- Threads
- Processes/Modules
- etc.
Multiple coprocessors are supported:
- Data shown is mixed:
  Keep in mind the different processes and address spaces
- No further configuration needed:
  Debug as you go!

Setting Breakpoints

Note the mixed breakpoints here:
The ones set in the normal code (not offloaded) apply to the host. Breakpoints on offloaded code apply to the respective coprocessor(s) only.
The Breakpoints window shows all breakpoints (host & coprocessor(s)).

Start Debugging

Start debugging as usual via menu (shown) or <F5> key:

While debugging, continue till you reach a set breakpoint in offloaded code to debug the coprocessor code.

Thread Information

Information of host and coprocessor(s) is mixed. In the example above, the threads window shows two processes with their threads. One process comes from the host, which does the offload. The other one is the process hosting and executing the offloaded code, one for each coprocessor.

Additional Requirements

For debugging offload enabled applications additional environment variables need to be set:

Intel® MPSS 2.1:
COI_SEP_DISABLE=FALSE
MYO_WATCHDOG_MONITOR=-1
Intel® MPSS 3.*:
AMPLXE_COI_DEBUG_SUPPORT=TRUE
MYO_WATCHDOG_MONITOR=-1

Set those variables before starting Visual Studio* IDE!

Note:
Do not set those variables for a production system!

Debugging Native Coprocessor Applications

Pre-Requisites

Create a native Intel^® Xeon Phi™ coprocessor application and transfer & execute the application to the coprocessor target:

Use micnativeloadex.exe provided by Intel^® MPSS for an application C:\Temp\mic-examples\bin\myApp, e.g.:

> "C:\Program Files\Intel\MPSS\bin\micnativeloadex.exe""C:\Temp\mic-examples\bin\myApp" -d 0
Option –d 0 specifies the first device (zero based) in case there are multiple coprocessors per system
The application is executed directly after transfer

micnativeloadex.exe transfers the specified application to the specified coprocessor and directly executes it. The command itself will be blocked until the transferred application terminates.
Using micnativeloadex.exe also takes care about dependencies (i.e. libraries) and transfers them, too.

Other ways to transfer and execute native applications are also possible (but more complex):

SSH/SCP
NFS
FTP
etc.

Debugging native applications with Start Visual Studio* IDE is only possible via Attach to Process…:

micnativeloadex.exe has been used to transfer and execute the native application

Make sure the application waits till attached, e.g. by:


		static int lockit = 1;

		while(lockit) { sleep(1); }

After having attached, set lockit to 0 and continue.
No Visual Studio* solution/project is required.

Only one coprocessor at a time can be debugged this way.

Configuration

Open the options via TOOLS/Options… menu:

It tells the debugger extension where to find the binary and sources. This needs to be changed every time a different coprocessor native application is being debugged.

The entry solib-search-path directories works the same as for the analogous GNU* GDB command. It allows to map paths from the build system to the host system running the debugger.

The entry Host Cache Directory is used for caching symbol files. It can speed up lookup for big sized applications.

Attach

Open the options via TOOLS/Attach to Process… menu:

Specify the Intel(R) Debugger Extension for Intel(R) MIC Architecture. Set the IP and port the GDBServer should be executed with. The usual port for GDBServer is 2000 but we recommend to use a non-privileged port (e.g. 16000).
After a short delay the processes of the coprocessor card are listed. Select one to attach.

Note:
Checkbox Show processes from all users does not have a function for the coprocessor as user accounts cannot be mapped from host to target and vice versa (Linux* vs. Windows*).

Intel(R) Xeon Phi(TM) Coprocessor

Desktop

Intel® Xeon Phi™ Coprocessor developer zone

Per iniziare

Sviluppo multithread

URL:

Intel® Many Integrated Core Architecture Forum

Area tema:

IDZone

↧

Intel(R) Compiler XE 2015 for Linux Installation Aborts with Segmentation Fault

September 23, 2014, 12:24 pm

Latest and popular articles on Intel Technologies

≫ Next: vectorization support: unroll factor set to xxxx

≪ Previous: Debugging Intel® Xeon Phi™ Applications on Windows* Host

Problem: the Intel(R) Compiler XE 2015 for Linux installation aborts with segmentation fault.

During installation of the Intel(R) Parallel Studio XE 2015 for Linux products, the Compiler version 15.0 initial release, Version 15.0.0.090 Build 20140723 the installer aborts with a segmentation fault:

--------------------------------------------------------------------------------
Initializing, please wait...
--------------------------------------------------------------------------------
./install.sh: line 769: 6168 Segmentation fault (core dumped) $pset_engine_cli_binary --tmp_dir=$user_tmp --TEMP_FOLDER=$temp_folder --log_file=$log_file $silent_params $duplicate_params $params --PACKAGE_DIR=$fullpath --PSET_MODE=install

A bug report has been entered for this issue and a fix will come in a future product update

Work around: The trigger for this segmentation fault is uncommented line(s) in /etc/fuse.conf.
To work around this issue, you can rename this file, install the compiler, rename back to /etc/fuse.conf. OR an alternative is comment out all lines in /etc/fuse.conf before the installation.

Area tema:

IDZone

↧

vectorization support: unroll factor set to xxxx

September 23, 2014, 12:38 pm

Latest and popular articles on Intel Technologies

≫ Next: Disclosure of H/W prefetcher control on some Intel processors

≪ Previous: Intel(R) Compiler XE 2015 for Linux Installation Aborts with Segmentation Fault

Vectorization Diagnostic 15144

vectorization support: unroll factor set to xxxx

The vectorizer prints this message to tell the user that the loop was unrolled by a factor of xxxx, where xxxx is the unroll level or factor.

For example:

program fdiag15144
implicit none
integer, parameter :: M=1024
real :: Dy(M) = 3.124
real :: Dx(M) = 42.0, Da = .24
integer :: i

  do i = 1 , M
    Dy(i) = Dy(i) + Da*Dx(i)
  end do

end program fdiag15144

And compiling this, we see:

ifort -O2 -xhost -opt-report -opt-report-phase=vec -opt-report-file=stdout diag15144.f90

Begin optimization report for: FDIAG15144

    Report from: Vector optimizations [vec]


LOOP BEGIN at diag15144.f90(8,3)
   remark #15399: vectorization support: unroll factor set to 4
   remark #15300: LOOP WAS VECTORIZED
LOOP END
===========================================================================

In essence, the compiler is translating the loop to something similar to this:

do i = mp1 , N , 4
  Dy(i) = Dy(i) + Da*Dx(i)
  Dy(i+1) = Dy(i+1) + Da*Dx(i+1)
  Dy(i+2) = Dy(i+2) + Da*Dx(i+2)
  Dy(i+3) = Dy(i+3) + Da*Dx(i+3)
end do

This allows the compiler to pack 4 elements of Dx into a vector register, 4 copies of Da into another vector register, multiply those 2, load 4 elements of Dy into a vector register and add the 4 Da*Dx(i) elements to Dy, storing the 4 results to a vector register and flushing this out to memory as the new values of Dy.

Back to the list of vectorization diagnostics for Intel Fortran

For more complete information about compiler optimizations, see our Optimization Notice.

Area tema:

IDZone

↧

Disclosure of H/W prefetcher control on some Intel processors

September 24, 2014, 1:37 pm

Latest and popular articles on Intel Technologies

≫ Next: Intel Parallel Compute Center MIC Training Resources

≪ Previous: vectorization support: unroll factor set to xxxx

Disclosure of H/W prefetchers control on some Intel processors

This article discloses the MSR setting that can be used to control the various h/w prefetchers that are available on Intel processors based on the following microarchitectures: Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.

The above mentioned processors support 4 types of h/w prefetchers for prefetching data. There are 2 prefetchers associated with L1-data cache (also known as DCU) and 2 prefetchers associated with L2 cache. There is a Model Specific Register (MSR) on every core with address of 0x1A4 that can be used to control these 4 prefetchers. Bits 0-3 in this register can be used to either enable or disable these prefetchers. Other bits of this MSR are reserved.

Prefetcher	Bit# in MSR 0x1A4	Description
L2 hardware prefetcher	0	Fetches additional lines of code or data into the L2 cache
L2 adjacent cache line prefetcher	1	Fetches the cache line that comprises a cache line pair (128 bytes)
DCU prefetcher	2	Fetches the next cache line into L1-D cache
DCU IP prefetcher	3	Uses sequential load history (based on Instruction Pointer of previous loads) to determine whether to prefetch additional lines

If any of the above bits are set to 1 on a core, then that particular prefetcher on that core is disabled. Clearing that bit (setting it to 0) will enable the corresponding prefetcher. Please note that this MSR is present in every core and changes made to the MSR of a core will impact the prefetchers only in that core. If hyper-threading is enabled, both the threads share the same MSR.

Most BIOS implementations are likely to leave all the prefetchers enabled (i.e MSR 0x1A4 value at 0) as prefetchers are either neutral or positively impact the performance for a large number of applications. However, how these prefetchers may impact your application is going to be highly dependent on the data access patterns in your application.

These bits can be enabled or disabled at any time. Any changes will impact the prefetchers (and hence the performance of all the applications) running on all the cores where the changes are applied.

Tools that measure memory latencies and bandwidth may want to explicitly set the prefetchers to a known state for more controlled measurements. They can change the prefetcher settings during measurement but should restore them back to the original state on completion. For example, Intel Memory Latency Checker tool (http://www.intel.com/software/mlc) modifies the prefetchers through writes to MSR 0x1a4 to measure accurate latencies and restores them to the original state on exit.

https://software.intel.com/mic-developer

Area tema:

IDZone

↧

Intel Parallel Compute Center MIC Training Resources

September 25, 2014, 4:05 pm

Latest and popular articles on Intel Technologies

≫ Next: OPTIMIZING STORAGE SOLUTIONS USING THE INTEL® INTELLIGENT STORAGE ACCELERATION LIBRARY

≪ Previous: Disclosure of H/W prefetcher control on some Intel processors

Training Resources for the Intel^® Many Integrated Core (Intel MIC Architecture)

Just getting started with The Intel^® Xeon Phi^TMCoprocessor and need some help getting started OR are you already developing for Phi but looking for more advanced help? This is the page for you.

Bootstrapping

The first step is to join the MIC/Phi community. All information related is linked from a central web portal. You need to add this site to your Bookmarks:

and you need to Bookmark the User Forum - this is where all questions can be asked - beginner, intermediate, advanced, Ninja:

https://software.intel.com/en-us/forums/intel-many-integrated-core

Training Videos and Materials

We recommend starting with the following:

Intel® Xeon® Processors & Intel® Xeon Phi™ Coprocessors - Introduction to High Performance Application Development for Multicore and Manycore (2 day)
1. This two day webinar series introduces you to the world of multicore and manycore computing with Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors. Expert technical teams at Intel discuss development tools, programming models, vectorization, and execution models that will get your development efforts powered up to get the best out of your applications and platforms.
Finding the Right Fit for Your Application on Intel® Xeon Processors and Intel® Xeon Phi™ Coprocessors
1. Not all applications are created equal. Some are chomping at the bit to harvest as much parallelism as a target platform can provide. Those may be good candidates for running on an Intel® Xeon Phi™ Coprocessor. Other applications are scalar (not vectorized) and sequential (not threaded). They won't even make full use of an Intel Xeon processor, much less an Intel Xeon Phi Coprocessor. Before moving to a highly-parallel platform, the developer-tuner needs to expose enough parallelism to hit the limits of the Intel Xeon platform. Join us as we explore finding the right fit for applications on Intel(r) Xeon processors and Intel(r) Xeon Phi coprocessors.
Windows Host Users: An Introduction to Intel® Visual Fortran Development on Intel® Xeon Phi™ coprocessor
1. The Intel® Visual Fortran Composer XE SP1 release includes support for Intel® Xeon Phi™ coprocessors on Windows*. This webinar introduces the development environment for developing Fortran applications for the Intel® Xeon Phi™ coprocessor for Windows*. You will learn about the system configuration including details of the Intel® Manycore Platform Software Stack (Intel® MPSS), integrations with Microsoft Visual Studio*, Fortran offload programming models, developing and debugging offload and native applications, and existing limitations.

Intel hosts live Webinars on a regular basis. The sessions are recorded during the live broadcast. A few weeks after the broadcasts Intel will post the recording and any materials for the session.

Intel Software Tools Technical Webinar Series - this site lists upcoming or recent Webinars. Check back often, don't miss the opportunity to ask questions during the live broadcast.

Compiler Tuning and Advanced Optimization Guide for Intel Xeon Phi^TM Coprocessors and Intel^® Xeon^TM processors

Programming and Compiling for Intel® Many Integrated Core Architecture

The Intel compiler team has compiled a comprehensive guide to building and tuning applications for Phi. Topics range from help getting started for those new to the Intel compilers to advanced and Ninja Phi programming optimization and vectorization techniques. Also Floating Point model and accuracy, and differences between Phi and Xeon. This is a must-read for ALL Phi programmers! At a minimum Bookmark this for your future reference.

For more complete information about compiler optimizations, see our Optimization Notice.

Phi Training Resources

Sviluppatori

Linux*

Per iniziare

- Intel® AVX – Intel® Advanced Vector Extensions

Area tema:

IDZone

↧

OPTIMIZING STORAGE SOLUTIONS USING THE INTEL® INTELLIGENT STORAGE ACCELERATION LIBRARY

September 24, 2014, 9:49 am

Latest and popular articles on Intel Technologies

≫ Next: Intel® System Studio 2015 Beta for Linux Hosts Silent Installation Guide

≪ Previous: Intel Parallel Compute Center MIC Training Resources

With the growing number of devices connected to the Cloud/Internet, data is being generated from many different sources including smartphones, tablets, and Internet of Things devices. The demand for storage is growing every year. The combination of the Intel® Xeon® processor family and the Intel® Intelligent Storage Library (Intel® ISA-L) can provide developers with the tools to process data securely and quickly and even reduce storage space requirements.

In China, Intel collaborated with Qihoo 360 Technology Company Ltd to integrate Intel ISA-L into their storage solution. This resulted in a 10x performance increase and two-thirds reduction in required storage space. Read the case study.

Intel® ISA-L provides the tools to help accelerate and optimize storage on Intel® architecture (IA) for everything from small office NAS appliances up to enterprise storage systems. The functions provided in this library help with storage recoverability, data integrity, data security, and faster data compression mechanisms. This article provides a high level functional overview of Intel ISA-L.

Intel ISA-L provides the following collection of functions for use in storage applications:

RAID (Redundant Array of Inexpensive Disks) functions allow faster parity computation that can be used by a RAID provider. The RAID functions calculate and operate on XOR and P+Q parity. The mathematics of RAID are based on Galois finite-field arithmetic to find one or two parity bytes for each byte in N sources such that single or dual disk failures (one or two erasures) can be corrected.
Erasure Code (EC) functions allow breaking up of objects into smaller fragments, storing the fragments in different places, and regenerating the data from any combination of smaller numbers of those fragments. These EC functions implement a general Reed-Solomon type encoding for blocks of data to protect against erasure of whole blocks. Individual operations can be described in terms of arithmetic in the Galois finite field GF(2⁸) with the particular field-defining primitive or reducing polynomial x⁸+ x⁴+ x³+ x²+1 (0x1d).
CRC (cyclic redundancy check) functions permit the system to detect accidental changes to raw data during transmission. The receiver can ask the transmitter to resend the package until the CRC matches. Functions in the CRC section are fast implementations of cyclic redundancy checks using IA specialized instructions such as PCLMULQDQ, carry-less multiplication.
Multi-buffer Hashing (MbH) functions provide cryptographic hash functions that use the capabilities of IA. Intel ISA-L supports MD5, SHA1, SHA256, and SHA512. These MbH functions are used to increase the performance of the secure hash algorithms on a single processor core by operating on multiple jobs at once. By buffering jobs, the algorithm can exploit the instruction-level parallelism inherent in modern IA cores to an extent not possible in a serial implementation.
Encryption functions provide accelerated encryption by using the Intel® AES-NI instruction set.
Compression functions provide a fast, DEFLATE compatible compression routine. DEFLATE is a widely used binary compression standard that forms the basis of zlib, gzip and zip. The Intel ISA-L implementation of compression is written to be faster than zlib-1 with only a small sacrifice in compression ratio. This is well suited for high-throughput storage applications.

Depending on the platform capability, Intel ISA-L can run on various Intel® processor families. Improvements are obtained by speeding up the computations through the use of the following instruction sets:

Intel® ISA-L also includes unit tests, performance tests and samples written in C which can be used as usage examples.

Additional development information:

The library supports several generations of Intel® processors by providing multi-binary versions of some functions which developers can compile to deploy as a single binary which will detect and optimize based on the processor in use. Alternatively, developers can reduce code size by calling just one version.
The calling convention for most functions in the library is C binding. Individual functions are written to be statically or dynamically linked with an application.
To build the Intel ISA-L functions, use Yasm* Assembler version v1.2.0 or later.
Some functions of the Intel ISA-L require the input parameters to be aligned on a 16B or 32B boundary.

See the Intel ISA-L API reference manual for more details.

For developers who are interested in using the Intel ISA-L, the open source version (limited to Erasure Code functions) is available at 01.org.

To access the full library of functions provided with Intel ISA-L, provide your contact or email information with a brief description about your company to the comment field.

URL:

- Intel® AVX2 - Intel® Advanced Vector Extensions 2

- Intel® AES-NI – Intel® Advanced Encryption Standard - New Instruction

- Intel® SSE – Intel® Streaming SIMD Extensions

Intel® Storage Acceleration Library (Open Source Version)

Erasure Code and Intel® Intelligent Storage Acceleration Library

Swift* with Erasure Coding for Storage

Intel and Qihoo 360 Internet Portal Datacenter - Big Data Storage Optimization Case Study

Area tema:

IDZone