
Adding Plugins to Your Intel® XDK Cordova App

Apache* Cordova* plugins are a very important tool for enhancing the features and functionality of your Intel® XDK HTML5 mobile application. They provide a way to extend your app's JavaScript API, resulting in a much tighter integration of your app with a mobile device's software and hardware. There are hundreds of Cordova (and Adobe* PhoneGap*) plugins available for use with your app. They can be found in the Apache Cordova Plugins Registry and similar plugin registries, as well as many open source github repos. If you work for a large company your IT department might even maintain a set of Cordova plugins they've developed for mobile apps used by your company's employees, for example.

The Intel XDK references and uses Cordova plugins in a variety of locations throughout the development cycle of your HTML5 mobile application. The most apparent usage is in the Projects and Build tabs. In the Projects tab (described below) you select which Cordova plugins will be included as a part of your app. The Build tab will then automatically add those plugins to your app package when it builds your app, in order to include the "JavaScript extension API" provided by that plugin and used by your application. Less obvious is their use in the Brackets editor (the text editor in the Develop tab) and the Emulate tab.

In the Develop tab, the plugins are used to implement code hinting (a common editor feature frequently referred to as "intelli-sense" or "auto-hinting"). The editor automatically provides API method and property hints for the core Cordova and Intel XDK APIs. At this time code hinting is provided for ALL core plugins, regardless of which ones you've actually selected on the Projects tab.

The Emulate tab makes note of which core Cordova plugins you've selected on the Projects tab and will only present to your application those APIs that correspond to the selected core Cordova plugins, when your app runs inside the emulator. The complete set of APIs provided by the Intel XDK plugins are always presented to your app by the emulator, regardless of which have been selected on the Projects tab.

NOTE: at this time the Test, Debug, Profile and Services tabs are not affected by the plugins you choose on the Projects tab; App Preview and App Preview Crosswalk are also not affected by the plugin settings. These debug and test features support only the core Cordova and Intel XDK plugins that are included with the Intel XDK.

Only the Build tab makes use of any third-party (non-core) plugins that are specified for inclusion in your app on the Project tab. Future editions of the Intel XDK will expand the use of third-party plugins throughout the development cycle. However, at this time the only way you can test and debug any code that accesses third-party plugins is to build an application and run it on a real device.

What is a Cordova Plugin?

Paraphrasing the Cordova Plugin Development Guide:

A plugin is a package of code that allows your Cordova HTML5 app to communicate with the native platform on which it runs. Plugins provide access to platform functionality that is ordinarily unavailable to browser-based apps. The core Cordova (and Intel XDK) APIs are implemented as plugins. Many other plugins are available that enable features such as barcode scanners, NFC communication and access to local phone and tablet software databases (such as phone contacts and calendar entries).

Plugins consist of a JavaScript API and native code modules (for each platform supported by the plugin) that support the plugin's JavaScript API. In essence, when your app calls the plugin's JavaScript API, it is redirected to the underlying plugin native code that implements that API on the device. For example, the JavaScript API is redirected to Java code on an Android device or to Objective C code on an iOS device. Plugins can be complex or simple: providing APIs as complex as a persistent database engine or as simple as a method to turn on the camera's flash LED.
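To make the JavaScript-to-native redirection concrete, here is a minimal sketch of the JavaScript side of a plugin. The "Flashlight" service and "turnOn" action names are hypothetical, and the real bridge call in Cordova 3.x is cordova.exec; here the bridge function is passed in as a parameter so the sketch is self-contained and testable without a device:

```javascript
// Minimal sketch of a plugin's JavaScript layer: a thin wrapper that
// forwards each API call across the Cordova bridge to native code.
// In a real plugin the bridge function is cordova.exec; here it is a
// parameter. The "Flashlight" service and "turnOn" action names are
// hypothetical, for illustration only.
function makeFlashlight(exec) {
  return {
    turnOn: function (success, error) {
      // The native side would dispatch to the platform's "Flashlight"
      // class (Java, Objective-C, ...) and run its "turnOn" action.
      exec(success, error, "Flashlight", "turnOn", []);
    }
  };
}

// Fake bridge for illustration: records what would cross into native code.
var calls = [];
var flashlight = makeFlashlight(function (success, error, service, action, args) {
  calls.push(service + "." + action);
  success("on"); // pretend the native call succeeded
});
flashlight.turnOn(function (state) { console.log("flash is " + state); });
console.log(calls); // [ 'Flashlight.turnOn' ]
```

The app only ever sees the JavaScript wrapper; which native implementation answers the call is decided by the platform the app was built for.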

Do I Need to Learn How to Write Native Code in Java and Objective C and C# and ???

Absolutely not. Most plugins can be pulled directly from a plugin registry or github repo and used as-is, without learning how the plugin operates internally. You do not need to "compile" anything to use a properly structured plugin; most are ready to use without any configuration or internal programming required.

You will need to learn how to use the plugin's JavaScript API in order to use it within your app; but that is to be expected. You can think of a Cordova plugin as a JavaScript library that extends the device-specific features your app can access, features that you typically are not able to access from a standard browser or webview (the embedded browser that interprets your HTML5 hybrid app). Plugins provide the extra features that distinguish a mobile app from a browser-based web app.
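Because a plugin's API object simply does not exist when the plugin is absent (for example, in a plain browser or webview), a defensive check can keep your app from throwing undefined errors. A minimal sketch, with the global object passed in as a parameter so it is testable outside a webview (the Camera plugin path is used as the example):

```javascript
// Feature-detect a plugin API before using it. In a real app `global`
// would be `window`; it is a parameter here so the check can run
// outside a webview.
function hasPluginApi(global, path) {
  // Walk a dotted path such as "navigator.camera.getPicture".
  var obj = global;
  var parts = path.split(".");
  for (var i = 0; i < parts.length; i++) {
    if (obj == null || !(parts[i] in Object(obj))) return false;
    obj = obj[parts[i]];
  }
  return typeof obj === "function";
}

// Simulated window object for illustration.
var fakeWindow = { navigator: { camera: { getPicture: function () {} } } };
console.log(hasPluginApi(fakeWindow, "navigator.camera.getPicture")); // true
console.log(hasPluginApi({}, "navigator.camera.getPicture"));         // false
```

When the check fails, the app can fall back to standard browser behavior instead of crashing on a missing API.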

Some important points to keep in mind regarding Cordova plugins and the Intel XDK:
  • Since most plugins are third-party libraries, the Intel XDK may not have explicit knowledge of a specific plugin's internal functionality or code. The debug tools included with the Intel XDK only provide direct support for the "core" Cordova and Intel XDK plugins.
  • Not all plugins are created equal; many are available only for the Android and iOS platforms. The "core" Cordova plugins and Intel XDK API plugins support a wide range of Cordova platforms. Be sure to confirm that the plugins you plan to use support the platforms on which you plan to deploy your app, or use platform and feature detection techniques to implement an alternate solution for unsupported platforms.
  • Not all plugins support all platforms with identical API behavior. In other words, some aspects of a plugin API may vary as a function of the platform (this is usually due to platform details, not because the plugin is incomplete or deficient). Variations include properties that have no meaning on some platforms or methods that do not exist on other platforms. See the plugin's documentation for these details (some plugins include a "quirks" documentation section), and use platform and feature detection techniques to properly handle these API "quirks."
  • The Intel XDK does not include a mechanism to judge the quality of a plugin. There are many resources on the web, including the cordova-plugins and phonegap-plugins tags on StackOverflow, that can be used to determine which plugins are the most reliable and how to work around bugs associated with specific plugins. In addition, you can get support directly from the author of many github-hosted plugins if you have issues with those plugins.
  • Some third-party plugins are written for older versions of Cordova and may not work with Cordova 3.x. The Intel XDK requires plugins that have been written for Cordova 3.x. If you are unable to find a version of a third-party plugin that works with Cordova 3.x, it may be possible to convert a pre-3.x plugin to work with Cordova 3.x.
  • The "core" Cordova APIs and the Intel XDK APIs are all written as Cordova 3.x plugins. The "core" Cordova 3.x plugins are maintained by the Cordova CLI development community; the Intel XDK plugins are maintained by the Intel XDK development team.
  • Third-party Cordova plugins cannot be used with Intel XDK "legacy" builds (see the Build tab); they can only be used with Cordova and Crosswalk for Android builds. However, "legacy" builds do include a collection of "core" Cordova plugins when you build with the "Gold" option; these "legacy" plugins are based on the Cordova 2.9.0 release and are enabled in your app by including <script src="cordova.js"></script> after the "intelxdk.js" script include.
  • The AppMobi services (such as PushMobi) that are included in the "legacy" build system were not, at the time this document was written, available as Cordova plugins. If you cannot identify an equivalent alternative and require the use of an AppMobi service, your only choices are to continue using the "legacy" build system or to request that AppMobi provide a Cordova 3.x compatible plugin for the AppMobi service your app requires.
  • If you are developing your own Cordova plugin you may need to install and use the Cordova CLI system on your development system. You can share your plugin with others without requiring that they also install the Cordova CLI; only you (the plugin developer) need to install it, and only for plugin development. The Intel XDK does not require that the Cordova CLI be installed on your development system in order to include a plugin in your app.
  • The Intel XDK does not provide a mechanism to debug the native code in a Cordova plugin; that must be done using native code development tools that are specific to the native platform. The Emulate tab does not use a plugin's native code to simulate its APIs; it relies on code written for the node-webkit environment on which it runs to simulate the native code component. Thus, for those plugins that the Emulate tab does support, only the JavaScript component of each plugin is used within the Emulate tab.
  • The Intel App Preview applications that you download from the respective app stores for quick testing of your HTML5 mobile apps do not include support for third-party Cordova plugins. To debug an app that relies on a non-core Cordova plugin you must either use feature detection to skip over or "fake out" plugin-specific code when it is not present (as will be the case when your app runs inside App Preview, the Emulate tab, or the Debug and Test tabs), or build your app (with the Build tab) so that it includes the third-party plugins and then run and debug the built app on a real device.
  • At the time this document was written, the Intel App Preview applications for Android, iOS and Windows 8 were still based on the "legacy" build container and, therefore, do not precisely represent the behavior of your application inside a Cordova build container. You can still use the "legacy" App Preview to debug your Cordova app, but keep in mind that there will be some minor behavior and feature differences. Ultimately, the App Preview apps will be updated so they are built from the Cordova container and more accurately represent a Cordova-built app.
  • At the time this document was written, the Intel XDK Cordova build system was based on Cordova CLI version 3.3.
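As a sketch of the platform-detection technique mentioned in the points above, the following branches on the platform string reported by the core Cordova Device plugin (window.device.platform). The device object is passed in as a parameter so the logic can be exercised without a real device, and the table of platforms supported by the imaginary barcode plugin is purely hypothetical:

```javascript
// Choose a scanner implementation based on the runtime platform.
// `device` stands in for window.device (core Cordova Device plugin).
function pickScanner(device) {
  // Hypothetical support table: assume our barcode plugin only ships
  // native code for Android and iOS.
  var supported = { "Android": true, "iOS": true };
  if (device && supported[device.platform]) {
    return "native-plugin-scanner";  // the plugin's native scanner
  }
  return "html5-fallback-scanner";   // e.g., a getUserMedia-based fallback
}

console.log(pickScanner({ platform: "Android" }));  // native-plugin-scanner
console.log(pickScanner({ platform: "windows8" })); // html5-fallback-scanner
```

The same shape works for handling per-platform API "quirks": detect the platform once, then route around the property or method that is missing there.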

Including a Cordova Plugin in Your Intel XDK App

Including any of the "core" Cordova plugins and the Intel XDK plugins in your app is very easy. The Projects tab for your app contains a list of "included" plugins that can be added to or removed from your app by simply selecting or clearing the respective check box next to the plugin name. See the following screenshot for an example application.

Details about which APIs are included with each core Cordova plugin can be found in the API Reference section of the Apache Cordova Documentation pages. See the Intel XDK API Reference Documentation for API and platform details regarding the Intel XDK API plugins.

In the image above there are four blue buttons at the bottom of the plugin selection panel: "Select All," "Select Minimum," "Select None" and "Reset Plugin Defaults":

  • Select All: enables ALL core Cordova and ALL Intel XDK plugins to be included in your app. This is convenient but not advised for the production version of your app. This is also the default state of the plugins when you either create a new project or import an existing project built using version 0876 or earlier of the Intel XDK. Including ALL plugins is roughly equivalent to the state of the plugins in an app built using the "legacy" build system. Selecting ALL plugins also means you are subjecting your app to a large number of permissions that must be accepted by the end user during the installation of your app (on Android and Windows 8 platforms; iOS requests permission from the end user when the app utilizes an API that requires it). Including all plugins also means your app will be much larger than necessary.

  • Select Minimum: enables a very small set of the core Cordova and Intel XDK plugins. This is a recommended minimum, not a required minimum set of plugins. If you are using the Intel XDK device ready event you will need to include at least the Intel XDK "Base" plugin. If you are using the Cordova device ready event you do not need to include any plugins. Obviously, if you are using either Cordova or Intel XDK APIs (that is, APIs beyond the standard HTML5 APIs) you will have to include the plugin(s) that correspond to the APIs your app requires. See the references above for information about which plugins provide which APIs.

  • Select None: clears all "core" Cordova and Intel XDK plugins from your project. It has no effect on third-party plugins imported using the "Third-Party Plugins" panel.

  • Reset Plugin Defaults: resets the version level for each of the core Cordova plugins to match the versions that are supplied with the Intel XDK. This button will appear to have no effect if you have never changed the version number of any of the core Cordova plugins. More about plugin versions is provided below. Note that the Intel XDK plugins do not have a selectable version number, so this button has no effect on those plugins.
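Regarding the device ready events mentioned in the "Select Minimum" description above, a common pattern is to queue plugin API calls until the container fires its "deviceready" event, so no plugin API is touched before it exists. A minimal sketch; the fake document object below merely simulates the event so the pattern can be demonstrated end to end:

```javascript
// Queue plugin calls until "deviceready" fires. `doc` stands in for the
// webview's `document`, which makes the pattern testable here.
function makeReadyQueue(doc) {
  var ready = false, pending = [];
  doc.addEventListener("deviceready", function () {
    ready = true;
    pending.splice(0).forEach(function (fn) { fn(); });
  }, false);
  return function (fn) { ready ? fn() : pending.push(fn); };
}

// Fake document with a one-event dispatcher, for illustration only.
var handlers = [];
var fakeDoc = {
  addEventListener: function (name, fn) { handlers.push(fn); },
  fire: function () { handlers.forEach(function (fn) { fn(); }); }
};
var whenReady = makeReadyQueue(fakeDoc);
var log = [];
whenReady(function () { log.push("plugin call"); });   // queued
fakeDoc.fire();                                        // container is ready
whenReady(function () { log.push("late call"); });     // runs immediately
console.log(log); // [ 'plugin call', 'late call' ]
```

In a real app you would pass the actual `document`, and use the Intel XDK "intel.xdk.device.ready" event instead if you selected the Intel XDK "Base" plugin.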

About the Core Plugins

The core Cordova plugins and Intel XDK plugins are included with the Intel XDK; because these plugins are included as part of the Intel XDK, they have a "default version" associated with them. These are the versions that the plugin specifications are reset to when you push the "Reset Plugin Defaults" button (described above).

You can change a plugin's version number by selecting the edit button (hover over a plugin as shown in the image below):

and then entering the desired version number of the plugin (as shown below):

See the Apache Cordova Documentation pages for details regarding the core Cordova plugins. The git repo where each core Cordova plugin is maintained includes details about plugin versions, etc. You can quickly determine which versions are available for a particular plugin by inspecting it in the Apache Cordova Plugins Registry. At this time, only the Build tab uses the plugin version number, other Intel XDK components use a fixed set of core Cordova plugins.

In the September, 2014 release of the Intel XDK we added a list of featured plugins as part of the "Core" or "Included" plugins. These plugins are, technically, third-party plugins that may include tighter integration with the Intel XDK than other third-party plugins. For example, in some cases there is direct support for these featured plugin APIs in the Emulate, Test and Debug tabs. Also, some featured plugins may include access to the plugin source via a github repo, and some may incorporate proprietary code. One example of a featured plugin is the APP Security API, which enables non-security experts to take advantage of the security properties and capabilities available on your target mobile platforms.

Including Third-Party Plugins

There are two ways third-party plugins can be included as part of your application: via a public repo or a local directory. You select the specific method by selecting "Import Local Plugin" or "Get Plugin from the Web" in the "Third-Party Plugins" section of the "Plugins and Permissions" panel on the Project tab.

Two public repos are supported: a git repo (such as github) or the Apache Cordova Plugins Registry. When using the Cordova registry you only need the plugin ID, which can be found in the Cordova registry entry (see the image below for an example of referencing the Cordova registry with just a plugin ID). You can also optionally provide a plugin version number (more info below) as part of the Plugin ID field in the "Get Plugin from the Web" dialog box.

If your third-party plugin is being retrieved from the Cordova registry the Name and Plugin ID are sufficient. In that case, check the "Plugin is located in the Apache Cordova Plugins Registry" box and select the "Import" button.

Otherwise, if your third-party plugin is located in a git repo you must also include the address of that git repo. The git repo must be publicly accessible on the Internet because the "git pull" used to retrieve the plugin is performed by the cloud-based build server, not by the Intel XDK on your local system.

If you are familiar with the Cordova CLI plugin add command you can use that syntax to include a specific plugin version based on either the version numbers stored in the Cordova registry or a git reference ID. Details can be found in the Advanced Plugin Options section of the Cordova CLI doc pages (about three-quarters down the page). If you do not specify a plugin version or reference ID your app will be built using the most recent version available in the Cordova registry or the default branch when retrieved from a public git repo.

Importing a third-party plugin that resides in a local directory requires that the plugin be located within your project's "source directory." Normally this directory is named "www" and is located in your application's project directory (see the "Project Info" section of the Project tab for the name and location of your project's "source directory"). A local plugin will be included with the source bundle that is uploaded to the cloud-based build server; the entire contents of your project's "source directory" are included as part of that uploaded bundle.

References to your third-party plugins, whether imported from a local directory or a public repo, are listed in the "Third-Party Plugins" section of the Projects tab (see the image below for an example). The Name field you specified above is arbitrary and is used strictly as an identifier here and in the build message log. The Plugin ID must match that specified inside the plugin.xml file (see the registry or the plugin's git repo). At this time there is no way to edit or inspect the data you provided during the plugin import process; if you need to change the Name or Plugin ID or other fields you must delete the plugin reference (click the (X) icon) and re-import the plugin with the revised Name, Plugin ID and other fields.

The following screenshot is typical of what you will find when you inspect a plugin that is located in the Apache Cordova Plugins Registry. Note the plugin ID field, supported platforms, plugin version and supported Cordova CLI version number (aka "Engine Number"). At the time this document was written the Intel XDK build server was based on Cordova CLI version 3.3.

Building Your Cordova App

To build an app package, based on the Cordova container and your selected Cordova plugins, go to the Build tab and select the platform for which you want to generate an installable package under the "Cordova 3.x Hybrid Mobile App Platforms" heading.

The Crosswalk for Android and Android platforms both generate APKs for Android devices. See Using the Intel XDK “Crosswalk for Android” Build Option and the Crosswalk Overview for more details.

When you initiate a build you'll be asked if you want to "Upload to the build server?" Normally you should choose to "Upload Code" when this question is asked. The usual exception is when you have previously uploaded your code and successfully built for one platform and are now building for a second platform without any changes to your application between builds. In that case there is no need to upload your code to the build server a second time.

A successful upload of your application source bundle will result in a screen similar to the following. To initiate a build click the "Build App Now" button. Unlike the "legacy" build system, there are no options associated with this step; your options are stored in an intelxdk.config.platform.xml file in your project folder. See Adding Build Options to Your Intel® XDK Cordova App Using intelxdk.config.additions.xml for information on how to add build options that are not accessible via the Project tab's user interface.

iOS builds include one option for supplying your Apple developer certificate. This certificate is stored in the build system under your Intel XDK userid; it is only necessary to provide it once for all applications you build under your login.

When the build system completes successfully you will see a screen similar to the following. If the build system encounters issues the build log will include an error message indicating the nature of the problem that is preventing a successful build. If you consistently get build errors visit the Intel XDK forum for help (a link to the forum can be found under the (?) pulldown of the Intel XDK).

A Simple Example Using Cordova Plugins

The device screenshots below show a simple HTML5 hybrid app running on a real Android device. This application was built using the plugins selected in the screenshots shown in the preceding sections of this document. For example, the core Cordova "Device," "Media," "Accelerometer" and "Compass" plugins were selected from the "Core Cordova Plugins" column of the "Included Plugins" section on the Projects tab, and the Intel XDK "Base" plugin was selected from the "Intel XDK Cordova Plugins" column of that same section. Also, the "Cordova StatusBar" third-party plugin was included using the "Get Plugin from the Web" feature.

The app generates a list of available plugins dynamically, at runtime, by inspecting a special Cordova 3.x JavaScript property. The results are then printed into a <textarea> element at the bottom of the app's index.html page.

Note that the core Cordova "Device" plugin version number has been changed from the default value of 0.2.5 to 0.2.10. Also note that the Cordova version number reported is 3.3.0, which matches the Cordova CLI used by the Intel XDK Cordova build system. Unlike the Intel XDK "legacy" build system, where all Intel XDK APIs are always available, in the Cordova build system only those APIs associated with Intel XDK plugins that have been selected are present, in this case, only the methods and properties belonging to the Intel XDK "Base" plugin.

The only difference between the two device screenshots above is the visibility of the status bar at the top of the screen. Touching the "Toggle Status Bar" button calls the StatusBar plugin's StatusBar.hide() or StatusBar.show() method, depending on the visible state of the status bar, which is determined by checking the plugin's StatusBar.isVisible property. Had the plugin not been included these methods and this property would not be available to the app and references to them would result in JavaScript undefined errors.
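The toggle logic just described can be sketched as follows. The hide()/show() methods and isVisible property match the StatusBar plugin API named in the text; the guard clause covers environments (such as App Preview or the Emulate tab) where the plugin is absent, and the plugin object is passed in as a parameter (it would be window.StatusBar in the built app) so the sketch is testable:

```javascript
// Toggle the status bar via the StatusBar plugin, guarding against the
// plugin being missing. `sb` stands in for window.StatusBar.
function toggleStatusBar(sb) {
  if (!sb || typeof sb.hide !== "function") {
    return "statusbar-plugin-missing"; // plugin not built into this app
  }
  if (sb.isVisible) { sb.hide(); return "hidden"; }
  sb.show();
  return "shown";
}

// Simulated plugin object for illustration.
var fakeStatusBar = {
  isVisible: true,
  hide: function () { this.isVisible = false; },
  show: function () { this.isVisible = true; }
};
console.log(toggleStatusBar(fakeStatusBar)); // hidden
console.log(toggleStatusBar(fakeStatusBar)); // shown
console.log(toggleStatusBar(undefined));     // statusbar-plugin-missing
```

Without the guard, the same call inside App Preview would throw a JavaScript undefined error, exactly as the text warns.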

Real Devices vs Simulated Devices

The following screenshot shows the same app running in the Emulate tab of the Intel XDK. There are several differences that are worthy of note and help to illustrate some key differences between running your plugin-enabled app in the emulator (or in App Preview or when using the Debug tab) compared to running your app on a real device.

Remember that the screenshots shown above were taken on a real device that was running the app built using the Android Cordova build tile on the Intel XDK Build tab.

What's different?

  • The Cordova versions do not match: the Emulate tab is using Cordova 3.4, whereas the built app running on the real device is using Cordova 3.3 (in practice, this is a minor difference).
  • The core Cordova "Device" plugin is using the default plugin version (0.2.5), whereas the built app running on the real device is using the later version that was specified manually on the Projects tab (0.2.10).
  • The third-party "StatusBar" plugin does not show up in the list of included plugins when the app runs on the Emulate tab. As a result, touching the "Toggle Status Bar" button results in no change to the device status bar when it runs inside the Emulate tab, because the StatusBar API is not present in that runtime environment.

Running this app inside of App Preview (either indirectly via the Test tab or directly from the App Preview menu) or on the Debug tab (which runs the app in a special preview build of Crosswalk for Android) will show yet another set of results. If you run this app in those environments you will see a very long list of "core" plugins listed, as if you had selected all the plugins on the "Included Plugins" list of the Projects tab. This is normal and is due to the way these preview applications work. You will also not see any third-party plugins listed. Likewise, those third-party plugin APIs will not be accessible from within these app test environments.

A version of the above app can be retrieved from this Github repo: https://github.com/xmnboy/test-third-party-plugin.


Media: Video Transcoding Sample


A native console application sample that performs transcoding of an elementary video stream from one compressed format to another. It includes the following features:

  • multiple video streams transcoding
  • video resizing, de-interlacing
  • video rotation via User Plug-in Sample
  • video rotation via User Plug-in Sample using Intel® OpenCL™

Download the sample here

    Intel® System Studio - Multicore Programming with Intel® Cilk™ Plus


    Overview

    Intel System Studio not only provides a variety of signal processing primitives via:

    • Intel® Integrated Performance Primitives (Intel® IPP), and
    • Intel® Math Kernel Library (Intel® MKL)

    It also allows developing high-performance low-latency custom code using the:

    • Intel C++ Compiler together with Intel Cilk Plus.

    Intel Cilk Plus is built into the compiler, so it can be used wherever an efficient threading runtime is needed to extract even small amounts of parallelism. For example, when using a library for signal processing (e.g., Intel IPP) it is now possible to make use of multicore even when this library does not provide internal multithreading. This article sketches how to effectively introduce multicore parallelism even without introducing it into each of the important algorithms. This is possible by employing a parallel pattern called a pipeline.

    Introduction

    Intel Cilk Plus in concert with the compiler enables forward-scaling task parallelism for C and C++ with a runtime-dynamic scheduler that maps an arbitrary number of tasks to a limited set of workers (pool of threads). This allows for composable designs where multicore parallelism can be added without supervising call chains (nested parallelism), and without oversubscribing resources. Intel Cilk Plus guarantees to unfold parallelism according to the number of workers in the pool instead of an unbound resource usage according to the number of scheduled tasks. Intel Cilk Plus is part of the Intel® C++ Compiler, and there are no additional steps required to start working with Intel Cilk Plus in C and C++ code.

    There are two main usages of multicore that are important for embedded applications:

    1. Singular, long-running or permanent tasks, including background tasks. This category usually implements event handling, e.g., queuing user interactions.
    2. Multiple tasks that operate on the same data. This kind of data-parallelism can still include an embarrassingly parallel structure but may require synchronization constructs (locks).

    In the first category, the task is in a 1:1 relationship to an OS thread, and there is no need to involve Intel Cilk Plus. An OS-specific thread interface (e.g., POSIX threads / Pthreads) can be used. However, C++ libraries provide an easy interface to interact with OS threads and synchronization primitives in a portable manner. With C++11 in particular, no additional library is needed. Also, Intel® Threading Building Blocks (Intel® TBB) includes a rich set of primitives, including std::thread (in case C++11 is not available; TBB_IMPLEMENT_CPP0X). Note that with Intel System Studio, Intel TBB can be built from source.

    Case Study and Example

    In the second category above, a threading runtime maps many tasks to a limited pool of worker threads. Let's consider an application where we want to introduce some multicore parallelism, but without introducing it into each of the important algorithms. This is a realistic use case where, for example, a signal processing library is used that does not provide internal multi-threading, or whose internal multi-threading should not be used because application-level threading is more efficient for smaller data sets and latency-constrained applications, and offers better control.

    --------     -----------     ---------
    | Read | --> | Process | --> | Print |
    -------1     2---------3     4--------

    The signal processing pipeline consists of three stages: (1) the signal reader, which reads/parses values into its own dedicated buffer #1; (2) the actual processing stage, which operates out of place on buffers #2 and #3; and (3) a final print stage, which reads its own buffer #4 and prints it.

    struct { size_t i, n; } stage[] = { { 0, 0 }, { 1, 0 }, { 2, 0 }, { 3, 0 } };  
    for (size_t i = 0; i <= nsteps && (0 < stage[1].n || 0 == i); ++i) {
       read_signal   (size,       x + stage[0].i, y + stage[0].i, stage[0].n);
       process_signal(stage[1].n, x + stage[1].i, y + stage[1].i,
                                  x + stage[2].i, y + stage[2].i);
       print_signal  (stage[3].n, x + stage[3].i, y + stage[3].i, std::cout);
    
       stage[2].n = stage[1].n;  
       std::rotate(stage, stage + 4 - 1, stage + 4); // quad-buffering
    }  

    The previously introduced pipeline now indirectly assigns one of four buffers to each stage. Here, the quad-buffering approach uses two actual quad-buffers, x and y, for a signal that consists of multiple components. At each cycle of the loop, each stage's destination buffer is rotated by one position within the ring buffer (called "stage") to become the source of the next stage.

    Intel® Cilk™ Plus

    for (size_t i = 0; i <= nsteps && (0 < stage[1].n || 0 == i); ++i) {
       cilk_spawn read_signal   (size,       x + stage[0].i, y + stage[0].i, stage[0].n);
       cilk_spawn process_signal(stage[1].n, x + stage[1].i, y + stage[1].i,
                                             x + stage[2].i, y + stage[2].i);
                  print_signal  (stage[3].n, x + stage[3].i, y + stage[3].i, std::cout);
       cilk_sync;

       stage[2].n = stage[1].n;
       std::rotate(stage, stage + 4 - 1, stage + 4); // quad-buffering
    }

    Compared to the previously shown code, cilk_spawn and cilk_sync have been added. In order to generate a serial version of a program that uses Intel Cilk Plus (keywords, reducers, etc.) one can compile with the "-cilk-serialize" option (with just cilk_spawn, cilk_sync, and cilk_for one can simply elide these keywords using the preprocessor). Note that the above multi-buffering approach actually allows calling read_signal, process_signal, and print_signal in any order which can be of interest with Intel Cilk Plus' continuation-passing style.

    Thinking of cilk_spawn as asynchronously launching an invocation explains what runs concurrently before it is synced by cilk_sync. However, the worker thread that executes the first cilk_spawn also executes the spawned function (i.e., read_signal in the previous code fragment); this is in contrast to what a library-based threading runtime is able to achieve. The continuation, however, is eventually stolen by another worker (i.e., after the sequence point behind read_signal; hence the next spawn). There are also a number of implicit synchronization points (where cilk_sync can be omitted). These are mostly obvious, but they also complete the definition of the language extension in the presence of exceptions.

    /*A*/
    for (size_t i = 0; i < N; ++i) {
        cilk_spawn little_work(i);
    }

    /*B*/
    cilk_for (size_t i = 0; i < N; ++i) {
        little_work(i);
    }

    /*C*/
    for (size_t i = 0; i < N; i += G) {
        cilk_spawn work(i, std::min(G, N - i));
    }

    /*D*/
    void work(size_t I, size_t N) {
        for (size_t i = I; i < N; ++i) little_work(i);
    }
    In situation A (above), with only a little work for each induction of i, the keyword cilk_for is introduced in code B not only to amortize the cilk_spawn, but also to employ a launch scheme similar to a binary tree. Intel Cilk Plus allows adjusting the grain size of a cilk_for (#pragma cilk grainsize=<expression>) using a runtime expression. The grain size G in code C is able to accumulate more work in function D. With respect to the launch scheme, examples B and C are still not equivalent: splitting the loop range according to a binary tree avoids accumulating the total launch overhead on a single core.

    There are two notable consequences of Intel Cilk Plus' continuation-passing style: (1) a thread continues with what is locally prepared or "hot in cache", and (2) the instructions of a scope (in the sense of a C/C++ scope) may not all be executed by the same thread. A conclusion from #1 is that tuning a sequential program maps to a tuned parallel program in a more straightforward manner. In the case of #2, thread-local storage with a lifetime according to the scope cannot be used with Intel Cilk Plus. To leave no myth standing, it should also be said that Intel Cilk Plus uses regular OS threads to perform the work. However, tasks are conceptually very lightweight user-space objects similar to fibers; this is not much different from other threading libraries such as Intel TBB.

    Dynamic scheduling employs workers as needed, and the grain size varies the amount of available parallelism of a cilk_for loop. Of course, with cilk_spawn the number of spawned functions directly determines the amount of parallelism that can be exploited, e.g., the number of pipeline stages that can run in parallel. To summarize, setting the number of workers (see the code fragment below) for different parallel sections is not the way to adjust concurrency.

    int main(int argc, char* argv[])
    {
      const char *const threads = 1 < argc ? argv[1] : 0;
      const int nthreads = 0 != threads ? std::atoi(threads) : -1;
      if (0 <= nthreads) __cilkrts_set_param("nworkers", threads);
    }

    The above example code illustrates three cases: (1) no given argument omits setting up the number of workers and defers this step to a later point, (2) a given "0" argument on the command line instructs the API to automatically initialize with all available workers according to the system and the process' affinity mask, and (3) an explicit number of workers is given. The API takes precedence over the environment variable CILK_NWORKERS, which also exists for convenience. Note that an explicit number of workers is not necessarily truncated to the number of available hardware threads.

    Increasing the grain size lowers the number of workers involved in a cilk_for loop. The default (when the pragma is not used) aims to maximize parallelism, and depends on (1) the size of the iteration space as well as (2) the total number of workers (see the above code). A constant grain size removes these two dependencies. However, the order of processing these partitions may still vary, and hence introduce non-deterministic results, in particular with floating-point data where the order of calculations impacts the final result. There are actually many more reasons for not reproducing the same bit-wise result on a system or across systems.

    Conclusions

    The pipeline pattern is only able to extract a limited amount of parallelism, and the longest running stage always becomes the bottleneck. In the above example, the pipeline consists of only three stages that can run in parallel where two of them are I/O. The latter can be an additional burden (in terms of scalability) if the I/O functionality uses locks in order to protect internal state.

    More interesting in our example is that successive fork-joins (every cycle of our pipeline) demand an efficient threading runtime. Of course, one can try to hide the overhead with larger buffer sizes within the parallel pipeline. However, a parallel region that executes longer obviously increases the latency of the application.

    Software optimizations including in-core optimizations can save energy. Cache-blocking can avoid unnecessary memory loads. With multicore one can take this idea further with cache-oblivious algorithms. However, multicore parallelism by itself is able to save energy in the following way:

    • 4x the die area of a microprocessor gives 2x the performance in one core, but
    • 4x the performance when the same area is dedicated to 4 cores.

    Pollack's rule should be considered alongside the fact that performance may scale linearly with clock frequency, but energy consumption will roughly scale with the square of the clock frequency. Amdahl's Law limits the practical use of a system that only provides performance in the presence of parallelism. However, it is very attractive to prepare signal processing applications to make use of multiple cores because of possible energy savings, or to consolidate specialized hardware by loading the system with various different tasks in addition to accelerating the signal processing itself.

    Case Study

    One can find out more by reading the attached case study (2-up), which carries out the above example.


    Memory profiling techniques using Intel System Studio


    Introduction

    One of the problems with developing embedded systems is the detection of memory errors, such as:

    • Memory leaks
    • Memory corruption
    • Allocation / de-allocation API mismatches
    • Inconsistent memory API usage, etc.

    These memory errors degrade the performance of any embedded system. Designing and programming an embedded application requires great care: the application must be robust enough to handle every possible error that can occur, and care should be taken to anticipate these errors and handle them accordingly, especially in the area of memory.

    In this article we describe how to use Intel® System Studio to find dynamic memory issues in any embedded application.

    Intel® System Studio 2015

    Intel® System Studio is a comprehensive, integrated tool suite that provides developers with advanced system tools and technologies to help accelerate the delivery of the next generation of power-efficient, high-performance, and reliable embedded and mobile devices.

    For more information about Intel® System Studio, see http://software.intel.com/en-us/intel-system-studio

    Dynamic Memory Analysis

    Dynamic memory analysis is the testing and evaluation of an embedded application for any memory errors during runtime.

    Advantage of dynamic memory analysis: the analysis is performed by actually executing the application, so it observes real runtime behavior. For dynamic memory analysis to be effective, the target program must be executed with sufficient test inputs to exercise the entire program.

    Intel® Inspector for Systems

    Intel® Inspector for Systems helps developers identify and resolve memory and threading correctness issues in their unmanaged C, C++, and Fortran programs, as well as in the unmanaged portion of mixed managed and unmanaged programs. Additionally, the tool identifies threading correctness issues in managed .NET C# programs.

    Intel® Inspector for Systems currently identifies the following types of dynamic memory problems.

    Problem Type: Description

  • Incorrect memcpy call: when an application calls the memcpy function with two pointers that overlap within the range to be copied.
  • Invalid deallocation: when an application calls a deallocation function with an address that does not correspond to dynamically allocated memory.
  • Invalid memory access: when a read or write instruction references memory that is logically or physically invalid.
  • Invalid partial memory access: when a read or write instruction references a block (2 bytes or more) of memory where part of the block is logically invalid.
  • Memory growth: when a block of memory is allocated but not deallocated within a specific time segment during application execution.
  • Memory leak: when a block of memory is allocated, never deallocated, and not reachable at application exit (there is no pointer available to deallocate the block).
  • Memory not deallocated: when a block of memory is allocated, never deallocated, but still reachable at application exit (there is a pointer available to deallocate the block).
  • Mismatched allocation/deallocation: when a deallocation is attempted with a function that is not the logical reflection of the allocator used.
  • Missing allocation: when an invalid pointer is passed to a deallocation function. The invalid address may point to a previously released heap block.
  • Uninitialized memory access: when a read of an uninitialized memory location is reported.
  • Uninitialized partial memory access: when a read instruction references a block (2 bytes or more) of memory where part of the block is uninitialized.
  • Cross-thread stack access: when a thread accesses a different thread's stack.

     

    Conclusion: Intel® System Studio provides a dynamic memory analysis feature to help you build robust embedded applications.


    Digital Security and Surveillance on 4th generation Intel® Core™ processors Using Intel® System Studio 2015


    This article presents the advantages of developing embedded digital video surveillance systems to run on 4th generation Intel® Core™ processors with Intel® HD Graphics, in combination with the Intel® System Studio 2015 software development suite. Intel® HD Graphics is useful for developing many types of computer vision functions in video management software, while Intel® System Studio 2015 is an embedded application development suite useful for developing robust digital video surveillance applications.


    Getting the Most out of your Intel® Compiler with the New Optimization Reports


    The performance improvement an application gets from being compiled with optimization can be enhanced by understanding and acting on optimization reports. Fortunately, this has become much easier with the latest compilers from Intel. Modern optimizing compilers can often make code transformations that greatly improve application performance, but this may depend on how the original code is written and how much information is available to the compiler. The Intel® Compiler’s optimization report tells the programmer which optimizations were performed and why other optimizations were not performed. A programmer can use this feedback to tune code to enable additional compiler optimizations and further enhance application performance.

    Prior versions of the Intel compiler provided much potentially valuable information scattered through a series of different reports, but the messages were not logically ordered and were sometimes cryptic or confusing, especially in the presence of inlining or of multiple, compiler-generated loop versions. Some of the information was not actionable or immediately useful to the programmer. The single report stream could be hard to navigate, hard for other tools to access, and was unsuited to the parallel builds which are increasingly used to reduce build times on modern, multi-core processors.

    Starting with the version 15.0 compiler in Intel® Parallel Studio XE 2015, the optimization report has been comprehensively redesigned to integrate all the individual reports into a single, user-friendly report and to address the limitations described above. This article explains the features of the new optimization report and how they may be used to understand which optimizations the compiler did or did not perform and to guide further application tuning.

    Enabling and Controlling the Report

    The command line switches for enabling and high level control of the new optimization report are listed in figure 1 for the Intel Compilers for Windows*, Linux* and OS X*. In most cases, the version of a switch for Linux or OS X starts with -q and the corresponding version for Windows starts with /Q. The switches are the same for both C/C++ and Fortran compilers.

    Linux* and OS X*:  -qopt-report[=N]
    Windows*:          /Qopt-report[:N]
    Functionality:     Enables the report; N=1-5 specifies an increasing level of detail (default N=2)

    Linux* and OS X*:  -qopt-report-file=stdout | stderr | filename
    Windows*:          /Qopt-report-file:stdout | stderr | filename
    Functionality:     Controls where the report is written (default is to a file with extension .optrpt)

    Linux* and OS X*:  (no equivalent)
    Windows*:          /Qopt-report-format:vs
    Functionality:     Report is formatted to enable display in Microsoft* Visual Studio*

    Linux* and OS X*:  -qopt-report-routine=fn1[,fn2,…]
    Windows*:          /Qopt-report-routine:fn1[,fn2,…]
    Functionality:     Emit report only for functions whose name contains fn1 [or fn2…] as a substring

    Linux* and OS X*:  -qopt-report-filter="filename,ln1-ln2"
    Windows*:          /Qopt-report-filter:"filename,ln1-ln2"
    Functionality:     Emit report only for lines ln1-ln2 of file filename

    Linux* and OS X*:  -qopt-report-phase=phase1[,phase2,…]
    Windows*:          /Qopt-report-phase:phase1[,phase2,…]
    Functionality:     Optimization information is provided only for the specified optimization phases.

    Figure 1a

    Optimization Phase: Description

    vec:     Automatic and explicit vectorization using SIMD instructions
    par:     Automatic parallelization by the compiler
    loop:    Memory, cache usage, and other loop optimizations
    openmp:  Explicit threading using OpenMP directives
    ipo:     Inter-Procedural Optimization, including inlining
    pgo:     Profile Guided Optimization (using run-time feedback)
    cg:      Optimizations during code generation
    offload: Offload and data transfer to Intel® Xeon Phi™ coprocessors
    all:     Reports on all optimization phases (default)

    Figure 1b

    Report Output

    The report is disabled by default and may be enabled by the switch -qopt-report. By default, for compatibility with parallel builds, a separate report corresponding to each object file is created with file extension .optrpt in the same directory as the object file. The report output may be redirected to a different, named file, or to stderr or stdout, using the switch -qopt-report-file.

    For debug builds with -g on Linux or OS X, /Zi on Windows, some loop optimization information is embedded in the assembly code and in the object file. This makes the loop structure in the assembly code easier to understand, and makes optimization information from the compiler available for use by other software tools.

    Optimization reports can sometimes be very large. They may be restricted to functions of interest using the switch -qopt-report-routine, or to a particular range of line numbers within a source file using the switch -qopt-report-filter.

    Layout of Loop-Related Reports

    Messages relating to the optimization of nested loops are displayed in a hierarchical manner, as illustrated in figure 2. The compiler generates a “LOOP BEGIN” message for each loop in the compiler-generated code, along with the initial source line and column number, and a corresponding “LOOP END” message. Indenting is used to make clear the nesting structure.   There may be multiple compiler-generated loops for a single source loop and the nesting structure may differ from that of the source code. A loop may be “distributed” (split) into two or more sub-loops. The partial report displayed in figure 2 shows that the outer loop at line 6 of the source code has become two inner loops in the optimized generated code.

    Figure 2


    This hierarchical display allows compiler optimizations to be associated directly with the particular loop in the generated code to which they apply.

    SIMD load instructions in a vectorized loop are most efficient when the data to be loaded are aligned to a memory address that is a multiple of the SIMD register width. To achieve this, the compiler may “peel” off a few initial iterations, so that the vectorized kernel can operate on data that are better aligned. Any small number of left-over iterations after the vectorized kernel may be optimized as a separate “remainder” loop. Figure 3 shows how such “peel” and “remainder” loops are identified in the optimization report.

    Figure 3


    Using the Loop and Vectorization reports

    The goal of the new optimization report is not just to help you understand what the compiler did, but also to help you understand the obstacles that it encountered, so that you can help it to do better. We will illustrate this with the simple C example in Figure 4 (the report and its interpretation are very similar for C++ and Fortran). The function foo() loops over the input array theta, does a calculation involving a math function, and returns the result in the array sth.

    #include <math.h>
    void foo (float * theta, float * sth)  {
      int i;
      for (i = 0; i < 256; i++)
           sth[i] = sin(theta[i]+3.1415927);
    }

     

    $ icc -c -qopt-report=2 -qopt-report-phase=loop,vec -qopt-report-file=stderr foo.c
    
    Begin optimization report for: foo(float *, float *)
    
        Report from: Loop nest & Vector optimizations [loop, vec]
    
    LOOP BEGIN at foo.c(4,3)
    <Multiversioned v1>
       remark #25228: Loop multiversioned for Data Dependence
       remark #15399: vectorization support: unroll factor set to 2
       remark #15300: LOOP WAS VECTORIZED
    LOOP END
    
    LOOP BEGIN at foo.c(4,3)
    <Multiversioned v2>
       remark #15304: loop was not vectorized: non-vectorizable loop instance from multiversioning
    LOOP END

    Figure 4

    The report shows that the compiler generated two loop versions corresponding to the single loop in the source code (this is known as "multiversioning"), and explains that this is because of data dependence. The compiler does not know at compile time whether the pointer arguments theta and sth might be aliased, i.e., whether the data they point to might overlap in a way that would make vectorization unsafe. Therefore, the compiler creates two versions of the loop, one vectorized, one not. The compiler inserts a run-time test for data overlap so that the vectorized loop is executed if it is safe to do so; otherwise, the non-vectorized loop version is executed.

    If the programmer knows that the two pointer arguments are not aliased, he or she can communicate that to the compiler, either using the command line option -fargument-noalias (Linux or OS X) or /Qalias-args- (Windows), or using the restrict keyword along with -restrict (Linux or OS X) or /Qrestrict (Windows). Alternatively, the compiler can be told directly that it is safe to vectorize the loop, using #pragma ivdep or #pragma omp simd (the latter requires the -qopenmp or -qopenmp-simd switch). In each of these cases, only the vectorized version of the loop is generated, and the compiler does not need to generate any run-time tests for data overlap. In the present example, we use the command line switch and increase the level of detail in the report as in Figure 5:

    $ icc -c -fargument-noalias  -qopt-report=4 -qopt-report-phase=loop,vec -qopt-report-file=stderr foo.c
    
    Begin optimization report for: foo(float *, float *)
    
        Report from: Loop nest & Vector optimizations [loop, vec]
    
    LOOP BEGIN at foo.c(4,3)
       remark #15389: vectorization support: reference theta has unaligned access   [ foo.c(5,14) ]
       remark #15389: vectorization support: reference sth has unaligned access   [ foo.c(5,5) ]
       remark #15381: vectorization support: unaligned access used inside loop body   [ foo.c(5,5) ]
       remark #15399: vectorization support: unroll factor set to 2
       remark #15417: vectorization support: number of FP up converts: single precision to double precision 1
       [ foo.c(5,14) ]
       remark #15418: vectorization support: number of FP down converts: double precision to single precision 1
       [ foo.c(5,5) ]
       remark #15300: LOOP WAS VECTORIZED
       remark #15450: unmasked unaligned unit stride loads: 1
       remark #15451: unmasked unaligned unit stride stores: 1
       remark #15475: --- begin vector loop cost summary ---
       remark #15476: scalar loop cost: 114
       remark #15477: vector loop cost: 40.750
       remark #15478: estimated potential speedup: 2.790
       remark #15479: lightweight vector operations: 9
       remark #15480: medium-overhead vector operations: 1
       remark #15481: heavy-overhead vector operations: 1
       remark #15482: vectorized math library calls: 1
       remark #15487: type converts: 2
       remark #15488: --- end vector loop cost summary ---
       remark #25015: Estimate of max trip count of loop=64
    LOOP END

    Figure 5

    The report shows that only a single loop version was generated. The cost summary shows that the estimated speedup from vectorization is about 2.79. Not bad, but we can do better. We note the remarks 15417 and 15418 about conversions between single and double precision at columns 14 and 5 of line 5, and the presence of 2 type converts in the summary. Checking the source code, we see that the array theta is single precision, but the literal constant 3.1415927 defaults to double precision; this causes the result of the addition to be double precision and so the double precision version of the sine function is called, only for the result to be converted back to single precision for storage into sth. This impacts performance in two ways: it takes longer to calculate a sine function to higher precision; and because a double takes twice the space of a float in the SIMD register, the vector instructions can only operate on half as many elements at a time. If we modify the source code by making the literal constant and/or the sine function explicitly single precision,

    sth[i] = sinf(theta[i]+3.1415927f);

    then the warnings about precision conversions go away, and the estimated speedup almost doubles, to 5.4. This is because most of the time goes in the vectorized math library call (remark #15482), and rather little in the more lightweight vector operations (remark #15479).

    Next, we notice that the estimated maximum trip count of the vectorized loop is 64, (remark #25015), compared to the original loop iteration count of 256. So each vector operation is acting on 4 floats, that is, 16 bytes.  This is because by default, we are compiling for Intel® Streaming SIMD Extensions, (Intel® SSE), for which the vector width is 16 bytes. If we have an Intel® processor with support for Intel® Advanced Vector Instructions (Intel® AVX), which have a vector width of 32 bytes, we can target these with the compiler option -xavx. This causes the following changes in the report:

    remark #15477: vector loop cost: 11.620
    remark #15478: estimated potential speedup: 9.440
    …
    remark #25015: Estimate of max trip count of loop=32

    If we had targeted an Intel® Xeon Phi™ coprocessor, the maximum trip count would have been 16 and the vector width would have been 16 floats or 64 bytes.

    We now look at the messages relating to alignment. Accesses to memory that are aligned to a 32-byte boundary for Intel® AVX (or 16 bytes for Intel® SSE, or 64 bytes for Intel® Xeon Phi™ coprocessors) are typically more efficient than memory accesses that are not so aligned. Remark #15381 is a general warning that an unaligned memory access was detected somewhere within the loop. Remarks #15389, #15450 and #15451 tell us that when the compiler generates loads of theta and stores to sth, it assumes that the data are unaligned. Since theta and sth are passed in as arguments, the compiler does not know their alignment. Data may be aligned where they are declared by using __declspec(align(32)) (Windows) or __attribute__((align(32))) (Linux or OS X), or where they are allocated, e.g., by using _mm_malloc() or POSIX memalign(). If the arguments to function foo() are known to be aligned, the keyword __assume_aligned() may be used to inform the compiler:

    __assume_aligned(theta,32);
    __assume_aligned(sth,32);

    These keywords should only be used if you are sure that the pointer arguments of the function will always point to aligned data; there is no run-time check. After recompiling with the __assume_aligned keyword, only aligned memory accesses are reported, e.g.:

    remark #15388: vectorization support: reference theta has aligned access

    The estimated speedup due to vectorization increases by about 20%:

    remark #15477: vector loop cost: 9.870
    remark #15478: estimated potential speedup: 11.130

    Now that sth is aligned, the compiler has the possibility of generating streaming stores (also known as non-temporal stores) directly to memory. This may be worthwhile if the stored data are unlikely to be accessed again in the near future (i.e., before being evicted from cache). This avoids a "read-for-ownership" of the cache line, which may be beneficial for applications that read and write a lot of data and whose performance is limited by the available memory bandwidth. It also frees up cache for more productive uses. The compiler finds it worthwhile to generate streaming stores automatically only for amounts of data much larger than in this example, typically several megabytes. If the iteration count is increased to 2000000, or if #pragma vector nontemporal is placed before the loop, the compiler generates streaming store instructions and the following additional messages appear in the optimization report:

    remark #15467: unmasked aligned streaming stores: 1
    remark #15412: vectorization support: streaming store was generated for sth

    Even for such a tiny function, the optimization report can be a rich source of information!

    Example of the IPO Report on Inlining

    The IPO report gives information about optimizations across function boundaries. Here, we will focus on inlining.

    Figure 6

    Figure 6 shows schematically a main program that twice calls a small, static function foo() and then calls printf to print a final result. foo() calls a large static function bar(). Each live function gets its own inlining report. Thus main(), whose body starts at line 24, column 19, gets foo() inlined at line 35 and at line 36. foo() in turn gets bar() inlined at line 21. main() also calls printf() at line 37; printf is marked as external because its content is not visible to the compiler. bar(), whose body starts at line 3, column 42, does not contain any function calls. The static function foo(), whose body starts at line 13, column 42, is marked as dead because all of the calls to it within the source file are inlined; therefore, since it can't be called externally, the compiler does not need to generate a standalone version of the function.

    Any indirect function calls would also be shown at report level 3, marked “INDIRECT”. At higher levels, the sizes of all called functions visible to the compiler are displayed, along with the increase in size of the calling function when they are inlined.

    At the head of the inlining phase of the optimization report is a list of the values of the inlining parameters that were used, next to the compiler switches that can be used to modify them. These can be used to control the amount of inlining, based on the information in the report. For example, changing the argument of -inline-factor (/Qinline-factor on Windows) from 100 to 200 doubles all the size limits used to control what may be inlined. Inlining of individual functions can be requested or inhibited using pragmas such as inline, noinline and forceinline, or by the corresponding function attributes using __attribute__ or __declspec keywords. For more detail, see the Intel Compiler User and Reference Guides.

    Other Report Phases

    A report on automatic parallelization (threading) by the compiler, structured similarly and integrated with the vectorization and loop reports, can be obtained using -qopt-report-phase=par. -qopt-report-phase=openmp produces a report on threading constructs resulting from OpenMP pragmas or directives. A report on Profile Guided Optimization, including which functions had useful profiles, may be obtained using -qopt-report-phase=pgo. -qopt-report-phase=cg reports on optimizations during code generation, such as intrinsic function lowering (conversion to lower level constructs). -qopt-report-phase=loop reports on additional loop and memory optimizations, such as cache blocking, prefetching, loop interchange, loop fusion, etc. A summary of data scheduled for transfer to and from an Intel Xeon Phi coprocessor may be obtained with -qopt-report-phase=offload.

    For further information, see the Intel® Parallel Studio XE 2015 Composer Edition  Compiler User and Reference Guides at https://software.intel.com/en-us/compiler_15.0_ug_c and https://software.intel.com/en-us/compiler_15.0_ug_f.

    Summary

    The new, consolidated optimization report in the Intel® C/C++ and Fortran Compilers 15.0 provides a wealth of information in a readily accessible format. Information not only about which optimizations were performed, but also about those that could not be performed, can guide the programmer in further tuning to improve application performance.


    Heading Home with the Intel XDK

    Creating a simple personalized app

    For a long time, I never understood the popularity of texting. While a phone call is a quick way to verbally connect, and email is for textual information but with no guarantee of when you'll get a response, texting seems to combine the worst of both worlds - a lousy keyboard and no idea when you'll hear back.

    I eventually had to give in and try it, if for no other reason than to effectively communicate with my teenager. I have to admit, I quickly discovered it does have a place. It's a way to send quick communications that don't need an immediate response, if any. It's often easier in certain social situations to fire off a text than it would be to open up a laptop, or excuse oneself to go make a call. Unlike my teenager, however, I can't seem to text and walk. (I'm sure it would be a violation of workplace safety standards, at least that's the excuse I'm going with.)

    Nevertheless, I do like to let my wife know when I'm heading home, which tends to slow the process of actually heading home, as it takes me a while to compose and type a missive such as:

    My dearest wife, like the Monarch Butterfly and the Blue Whale, I'm about to embark on an odyssey that I hope will bring me home in short order (barring traffic, hurricanes and whaling vessels).

    As you can imagine, it takes me a while to get going.

    Now, while I may be too old to learn to thumb-type like a teenager, I do have one advantage. I know how to create mobile apps. In fact, with the Intel XDK, a lot of simple tasks can be done really easily, so I could create a simple texting app that does just what I need.

    In fact, it's so easy, all I really need is one line of code:

    intel.xdk.device.sendSMS('Clever message that could only come from me', '5555551234');

    This one line of code will send an SMS message, using your default messaging app. In particular, it will drop you into your messaging app with the text and phone number already filled in, and all you need do is hit the send button. Of course, the other important thing to know is exactly where to put this one line of code such that it correctly serves its purpose. Well it turns out there are two useful possibilities.

    When you create a new project in the Intel XDK, there are a number of choices, but for our purpose we can choose "Start with a Blank Project". The structure of the blank project has changed a bit with the latest release. While all the code used to be in the main index.html file, it has recently been rearranged for improved reliability and for consistency with best practices. Now when you open a blank project, there are several JavaScript files provided. There is also a generic header button labeled "Touch Me" in index.html.

    If all you want to do is use the provided button, you'll need to modify www/js/app.js, in the function myEventHandler. Simply replace the contents of that function with the line of code above, but be sure to use the phone number you're interested in messaging and you may want to change the text of the message.

    If you've done that, you should be able to start the app, hit the button, respond to a few dialogs, and then hit send in your messaging app, and you're done. OK, it can still involve a number of button pushes, but at least there's no typing or finding of contacts involved. On my phone, I have to start the app (1), push the button (2), then the phone asks me which app I want to use (3), Just once or Always? (4), and finally send (5).

    Of course, we can get that down a bit. If I can get myself to commit to a particular messaging app, we should be down to 3 taps, but we can go one better. Since the sole purpose of this app is to send a single message to a single number, I don't really need the button push at all; I can just immediately send a message when the app starts up. Now, you may not want to do this, in case you occasionally start apps by mistake, but you should be able to avoid actually sending the message in that case, as you still need to hit send in the messaging app. If you do decide to send the missive on startup, you'll need to change where you add the one line of code above.

    To modify the startup behavior of the app, find the file www/js/init-app.js and add the line above to the beginning of the function app.initEvents, and now it should immediately go to sending the message when you start the app.

    I've already gone on too long here, but you can of course make your app prettier, give it a custom icon, and so on, but if you just want to try an incredibly simple, yet potentially useful app, give it a try.

    Diagnostic 15521: Loop was not vectorized: compile time constraints prevent loop optimization.


    Product Version: Intel(R) Visual Fortran Compiler XE 15.0.0.070

    Cause:

    The vectorization report generated when using the Visual Fortran Compiler's optimization options (-O2 -Qopt-report:2 -traceback -check:bounds -check:nostack) states that the loop was not vectorized due to compile time constraints.

    Example:

    The example below generates the following remark in the optimization report:

    subroutine foo (a,n)
    
     implicit none
    
    interface
       function bar(n)
         integer :: bar
         integer, intent(in) :: n
       end function bar
    end interface
    
        integer, intent(in) :: n
        real,intent(inout) :: a(n)
        integer :: i
    
  do i=1, bar(n)
         a(i) = 0
      end do
    
end subroutine foo

    Report from: Vector & Auto-parallelization optimizations [vec, par]

    LOOP BEGIN f15521.f90(16,3)
       remark #15532: loop was not vectorized: compile time constraints prevent loop optimization. Consider using -O3.
    LOOP END

    Resolution:

    Compiling with -check:nobounds allows the loop to be vectorized. The resulting optimization report includes the following remarks:

       remark #25084: Preprocess Loopnests: Moving Out Store   
       remark #15399: vectorization support: unroll factor set to 2
       remark #15300: LOOP WAS VECTORIZED

     

    See also:

    Requirements for Vectorizable Loops

    Vectorization Essentials

    Vectorization and Optimization Reports

    Back to the list of vectorization diagnostics for Intel Fortran

     


    Signal Processing Usage for Intel® System Studio – Intel® MKL vs. Intel® IPP


    Employing performance libraries can be a great way to streamline and unify the computational execution flow for data intensive tasks, thus minimizing the risk of data stream timing issues and heisenbugs. Here we will describe the two libraries that can be used for signal processing within Intel® System Studio.

    Intel® Integrated Performance Primitives (Intel® IPP)

    Performance libraries such as the Intel IPP contain highly optimized algorithms and code for common functions including signal processing, image processing, video/audio encode/decode, cryptography, data compression, speech coding, and computer vision. Advanced instruction sets help the developer take advantage of new processor features that are specifically tailored for certain applications. One calls Intel IPP as if it were a black box of computation for a low-power or embedded device – 'in' flows the data and 'out' comes the result. In this fashion, using the Intel IPP can take the place of many processing units created for specific computational tasks. Intel IPP excels in a wide variety of domains where intelligent systems are utilized.

    Without the benefit of highly optimized performance libraries, developers would need to optimize computationally intensive functions themselves carefully to obtain adequate performance. This optimization process is complicated, time consuming, and must be updated with each new processor generation. Intelligent systems often have a long lifetime in the field and there is a high maintenance effort to hand-optimize functions.

    Signal processing and advanced vector math are the two function domains that are most in demand across the different types of intelligent systems. Frequently, a digital signal processor (DSP) is employed to assist the general purpose processor with these types of computational tasks. A DSP may come with its own well-defined application interface and library function set. However, it is usually poorly suited for general purpose tasks; DSPs are designed to quickly execute basic mathematical operations (add, subtract, multiply, and divide). The DSP repertoire includes a set of very fast multiply and accumulate (MAC) instructions to address matrix math evaluations that appear frequently in convolution, dot product and other multi-operand math operations. The MAC instructions that comprise much of the code in a DSP application are the equivalent of SIMD instruction sets. Like the MAC instructions on a DSP, these instruction sets perform mathematical operations very efficiently on vectors and arrays of data. Unlike a DSP, the Single Instruction Multiple Data (SIMD) instructions are easier to integrate into applications using complex vector and array mathematical algorithms since all computations execute on the same processor and are part of a unified logical execution stream.

    For example, an algorithm that changes image brightness by adding (or subtracting) a constant value to each pixel of that image must read the RGB values from memory, add (or subtract) the offset, and write the new pixel values back to memory. When using a DSP coprocessor, that image data must be packaged for the DSP (placed in a memory area that is accessible by the DSP), signaled to execute the transformation algorithm, and finally returned to the general-purpose processor. Using a general-purpose processor with SIMD instructions simplifies this process of packaging, signaling, and returning the data set. Intel IPP primitives are optimized to match each SIMD instruction set architecture so that multiple versions of each primitive exist in the library.

    Intel IPP can be reused over a wide range of Intel® Architecture based processors, and due to automatic dispatching, the developer’s code base will always pick the execution flow optimized for the architecture in question without having to change the underlying function call (Figure 1). This is especially helpful if an embedded system employs both an Intel® Core™ processor for data analysis/aggregation as well as a series of Intel® Atom™ Processor based SoCs for data pre-processing/collection. In that scenario, the same code base may be used in part on both the Intel® Atom™ Processor based SoCs in the field and the Intel® Core™ processor in the central data aggregation point.

                                            

    Figure 1: Library Dispatch for Processor Targets

    With specialized SoC components for data streaming and I/O handling combined with a limited user interface, one may think that there are not a lot of opportunities to take advantage of optimizations and/or parallelism, but that is not the case. There is room for:

    • heterogeneous asynchronous multi-processing (AMP) based on different architectures, and

    • synchronous multi-processing (SMP) taking advantage of Intel® Hyper-Threading Technology and the dual-core design used in the latest generation of processors designed for low-power intelligent systems.

    Both concepts often coexist in the same SoC. Code with failsafe real-time requirements is protected within its own wrapper managed by a modified round-robin real-time scheduler, while the rest of the operating system (OS) and application layers are managed using standard SMP multi-processing concepts. Intel Atom Processors contain two Intel Hyper-Threading Technology based cores and may contain an additional two physical cores, resulting in a quad-core system. In addition, Intel Atom Processors support the Intel® SSSE3 instruction set. A wide variety of Intel IPP functions found at http://software.intel.com/en-us/articles/new-atom-support are tuned to take advantage of Intel Atom Processor architecture specifics as well as Intel SSSE3.

                                          

    Figure 2: Intel IPP is tuned to take advantage of the Intel Atom Processor and the Intel SSSE3 instruction set

    Throughput intensive applications can benefit from the use of Intel SSSE3 vector instructions and parallel execution of multiple data streams through the use of extra-wide vector registers for SIMD processing. As just mentioned, modern Intel Atom Processor designs provide up to four virtual processor cores. This fact makes threading an interesting proposition. While there is no universal threading solution that is best for all applications, the Intel IPP has been designed to be thread-safe:

    -       Primitives within the library can be called simultaneously from multiple threads within your application.

    -       The threading model you choose may have varying granularity.

    -       Intel IPP functions can directly take advantage of the available processor cores via OpenMP*.

    -       Intel IPP functions can be combined with OS-level threading using native threads or Intel® Cilk™ Plus.

     The quickest way to multithread an application that uses the Intel IPP extensively is to take advantage of the OpenMP* threading built into the library. No significant code rework is required. However, only about 15-20 percent of Intel IPP functions are threaded. In most scenarios it is therefore preferable to also look to higher level threading to achieve optimum results. Since the library primitives are thread safe, the threads can be implemented directly in the application, and the performance primitives can be called directly from within the application threads. This approach provides additional threading control and allows meeting the exact threading needs of the application.

                             

    Figure 3: Function level threading and application level threading using the Intel IPP

    When applying threading at the application level, it is generally recommended to disable the library’s built-in threading. Doing so eliminates competition for hardware thread resources between the two threading models, and thus avoids oversubscription of software threads for the available hardware threads.

     Intel IPP provides flexibility in linkage models to strike the right balance between portability and footprint management.

    Table 1: Intel IPP Linkage Model Comparison

    The standard dynamic and dispatched static models are the simplest options to use in building applications with the Intel IPP. The standard dynamic library includes the full set of processor optimizations and provides the benefit of runtime code sharing between multiple Intel IPP-based applications. Detection of the runtime processor and dispatching to the appropriate optimization layer is automatic.

    If the number of Intel IPP functions used in your application is small, and the standard shared library objects are too large, using a custom dynamic library may be an alternative.

    To optimize for minimal total binary footprint, linking against a non-dispatched static version of the library may be the approach to take. This approach yields an executable containing only the optimization layer required for your target processor. This model achieves the smallest footprint at the expense of restricting your optimization to one specific processor type and one SIMD instruction set. This linkage model is useful when a self-contained application running on only one processor type is the intended goal. It is also the recommended linkage model for use in kernel mode (ring 0) or device driver applications.

    Intel IPP addresses both the needs of the native application developer found in the personal computing world and the intelligent system developer who must satisfy system requirements with the interaction between the application layer and the software stack underneath. By taking the Intel IPP into the world of middleware, drivers and OS interaction, it can be used for embedded devices. The limited dependency on OS libraries and its support for flexible linkage models makes it simple to add to embedded cross-build environments with popular GNU* based cross-build setups like Poky-Linux* or MADDE*.

    Developing for intelligent systems and small form factor devices frequently means that native development is not a feasible option. Intel IPP can be easily integrated with a cross-build environment and be used with cross-build toolchains that accommodate the flow requirements of many of these real-time systems. Use of the Intel IPP allows embedded intelligent systems to take advantage of vector instructions and extra-wide vector registers on the Intel Atom Processor. Developers can also meet determinism requirements without increasing the risks associated with cross-architecture data handshakes of complex SoC architectures.

    Developing for embedded small form factor devices also means that applications with deterministic execution flow requirements have to interface more directly with the system software layer and the OS scheduler. Software development utilities and libraries for this space need to be able to work with the various layers of the software stack, whether it is the end-user application or the driver that assists with a particular data stream or I/O interface. The Intel IPP has minimal OS dependencies and a well-defined ABI to work with the various modes. One can apply highly optimized functions for embedded signal and multimedia processing across the platform software stack while taking advantage of the underlying application processor architecture and its strengths, all without redesigning and retuning the critical functions with successive hardware platform upgrades.

    Intel® Math Kernel Library (Intel® MKL)

    Intel® MKL includes routines and functions optimized for Intel® processor-based computers running operating systems that support multiprocessing. Intel MKL includes a C-language interface for the Discrete Fourier transform functions, as well as for the Vector Mathematical Library and Vector Statistical Library functions.

    The Intel® Math Kernel Library includes the following groups of routines:

    -       Basic Linear Algebra Subprograms (BLAS):

    • Vector operations
    • Matrix-vector operations
    • Matrix-matrix operations

    -       Sparse BLAS Level 1, 2, and 3 (basic operations on sparse vectors and matrices)

    -       LAPACK routines for solving systems of linear equations

    -       LAPACK routines for solving least squares problems, eigenvalue and singular value problems, and Sylvester's equations

    -       Auxiliary and utility LAPACK routines

    -       ScaLAPACK computational, driver and auxiliary routines

    -       PBLAS routines for distributed vector, matrix-vector, and matrix-matrix operation

    -       Direct and Iterative Sparse Solver routines

    -       Vector Mathematical Library (VML) functions for computing core mathematical functions on vector arguments

    -       Vector Statistical Library (VSL) functions for generating vectors of pseudorandom numbers with different types of statistical distributions and for performing convolution and correlation computations

    -       General Fast Fourier Transform (FFT) Functions, providing fast computation of Discrete Fourier Transform via the FFT algorithms

    -       Tools for solving partial differential equations - trigonometric transform routines and Poisson solver

    -       Optimization Solver routines for solving nonlinear least squares problems through the Trust-Region (TR) algorithms and computing Jacobi matrix by central differences

    -       Basic Linear Algebra Communication Subprograms (BLACS) that are used to support a linear algebra oriented message passing interface

    -       Data Fitting functions for spline-based approximation of functions, derivatives and integrals of functions, and search

    Intel IPP and Intel MKL for Signal Processing

    The next question is when to use one Fourier Transform over another with respect to Intel IPP and Intel MKL.

    DFT processing time can dominate a software application. Using a fast algorithm, the Fast Fourier Transform (FFT), reduces the number of arithmetic operations from O(N²) to O(N log₂ N). Intel MKL and Intel IPP are highly optimized for Intel architecture-based multi-core processors using the latest instruction sets, parallelism, and algorithms.

    Read further to decide which FFT is best for your application.

    Table 2: Comparison of Intel MKL and Intel IPP Functionality

     

     Intel MKL

    Intel IPP

    Target Applications

    Mathematical applications for engineering, scientific and financial applications

    Media and communications applications for audio, video, imaging, speech recognition and signal processing

    Library Structure

    • Linear algebra
    • BLAS
    • LAPACK
    • ScaLAPACK
    • Fast Fourier transforms
    • Vector math
    • Vector statistics
    • Random number generators
    • Convolution and correlation
    • Partial differential equations
    • Optimization solvers
    • Audio coding
    • Image processing, compression and color conversion
    • String processing
    • Cryptography
    • Computer vision
    • Data compression
    • Matrix math
    • Signal processing
    • Speech coding and recognition
    • Video coding
    • Vector math
    • Rendering

    Linkage Models

    Static, dynamic, custom dynamic

    Static, dynamic, custom dynamic

    Operating Systems

    Linux*

    Linux*

    Processor Support

    IA-32 and Intel® 64 architecture-based and compatible platforms, IA-64

    IA-32 and Intel® 64 architecture-based and compatible platforms, IA-64, Intel IXP4xx Processors


    Intel MKL and Intel IPP Fourier Transform Features

    The Fourier transforms provided by Intel MKL and Intel IPP are targeted at each library's respective application domains: engineering and scientific for MKL, and media and communications for IPP. The table below highlights specifics of the MKL and IPP Fourier transforms that will help you decide which may be best for your application.

    Table 3: Comparison of Intel MKL and Intel IPP DFT Features

    Feature

    Intel MKL

    Intel IPP

    API

    DFT
    Cluster FFT
    FFTW 2.x and 3.x

    FFT
    DFT

    Interfaces

    C

    LP64 (64-bit long and pointer)
    ILP64 (64-bit int, long,  and pointer)

    C

    Dimensions

    1-D up to 7-D

    1-D (Signal Processing)
    2-D (Image Processing)

    Transform Sizes

    32-bit platforms - maximum size is 2^31-1
    64-bit platforms - maximum size is 2^64

    FFT - powers of 2 only

    DFT - 2^32 maximum size (*)

    Mixed Radix Support

    2,3,5,7 kernels (**)

    DFT - 2,3,5,7 kernels (**)

    Data Types

    (See Table 3 for detail)

    Real & Complex
    Single- & Double-Precision

    Real & Complex
    Single- & Double-Precision

    Scaling

    Transforms can be scaled by an arbitrary floating point number (with precision the same as input data)

    Integer ("fixed") scaling

    • Forward 1/N
    • Inverse 1/N
    • Forward + Inverse  SQRT (1/N)

    Threading

    Platform dependent

    • IA-32: All (except 1D when performing a single transform and sizes are not power of two)
    • Intel® 64: All (except in-place power of two)
    • IA-64: All

    Can use as many threads as needed on MP systems.

    1D and 2D

     

    Accuracy


    High accuracy

     

    High accuracy


    Data Types and Formats

    The Intel MKL and Intel IPP Fourier transform functions support a variety of data types and formats for storing signal values. Mixed types interfaces are also supported. Please see the product documentation for details.

    Table 4: Comparison of Intel MKL and Intel IPP Data Types and Formats

    Feature

    Intel MKL

    Intel IPP

    Real FFTs

    Precision

    Single, Double

    Single, Double

    1D Data Types

    Real for all dimensions

    Signed short, signed int, float, double

    2D Data Types

    Real for all dimensions

    Unsigned char, signed int, float

    1D Packed Formats

    CCS, Pack, Perm, CCE

    CCS, Pack, Perm

    2D Packed Formats

    CCS, Pack, Perm, CCE

    RCPack2D

    3D Packed Formats

    CCE

    N/A

    Format Conversion Functions

     

     

    Complex FFTs

    Precision

    Single, Double

    Single, Double

    1D Data Types

    Complex for all dimensions

    Signed short, complex short, signed int, complex integer, complex float, complex double

    2D Data Types

    Complex for all dimensions

    Complex float

    Formats Legend
    CCE - stores the values of the first half of the output complex conjugate-even signal
    CCS - same format as CCE format for 1D, is slightly different for multi-dimensional real transforms
    For 2D transforms, CCS, Pack, and Perm are supported; they are not supported for 3D and higher ranks
    Pack - compact representation of a complex conjugate-symmetric sequence
    Perm - same as Pack format for odd lengths, arbitrary permutation of the Pack format for even lengths
    RCPack2D - exploits the complex conjugate symmetry of the transformed data to store only half of the resulting Fourier coefficients

    Performance

    The Intel MKL and Intel IPP are optimized for current and future Intel® processors, and they are specifically tuned for two different usage areas:

    • Intel MKL is suitable for large problem sizes
    • Intel IPP is specifically designed for smaller problem sizes including those used in multimedia, data processing, communications, and embedded C/C++ applications.

    Choosing the Best FFT for Your Application

    Before making a decision, developers must understand the specific requirements and constraints of the application. Developers should consider these questions:

    • What are the performance requirements for the application? How is performance measured, and what is the measurement criteria? Is a specific benchmark being used? What are the known performance bottlenecks?
    • What type of application is being developed? What are the main operations being performed and on what kind of data?
    • What API is currently being used in the application for transforms? What programming language(s) is the application code written in?
    • Does the FFT output data need to be scaled (normalized)? What type of scaling is required?
    • What kind of input and output data does the transform process? What are the valid and invalid values? What type of precision is required?

    How to use Intel® Inspector for Systems


    Background

    Intel® System Studio is the new embedded software tool suite and includes Intel® Inspector for Systems. This article will explain the steps you need to follow to run Inspector for Systems on an embedded platform.

    Overview

    We currently recommend that you run Intel Inspector for Systems on your host and not on your embedded system. However, analyzing on an embedded system is possible. We will use Yocto Project* version 1.2 as an example. This platform supports many Intel board support packages (BSPs), and it also allows you to work without running any physical embedded hardware by letting you develop via an emulator that it provides. The following steps explain how to set up an application and then run an Intel® Inspector for Systems collection on it via the Yocto Project* emulator (runqemu). Here are the steps we will take to run our collection:

    1. Setting up a Yocto Project* 1.2 environment.
      1. Cross compilers
      2. Yocto Project* pre-built kernel
      3. File system to NFS mount from host
    2. Install Intel System Studio
      1. Copy installation to root file system created above.
    3. Cross compiling the tachyon sample application
      1. Build the application
      2. Copy to root file system created above.
    4. Start a QEMU emulator session
      1. Login to emulator
      2. cd /home/root
    5. Run an Intel Inspector for Systems on the tachyon sample code
    6. On your Linux* host, open the Inspector for Systems results and view them in the Inspector for Systems GUI

    Setting up a Yocto Project* 1.2 environment

    1. Download the pre-built toolchain, which includes the runqemu script and support files
      download from: http://downloads.yoctoproject.org/releases/yocto/yocto-1.2/toolchain/
      1. The following tool chain tarball is for a 32-bit development host system and a 32-bit target architecture: poky-eglibc-i686-i586-toolchain-gmae-1.2.tar.bz2
      2. You need to install this tar ball on your Linux* host in the root “/” directory. This will create an installation area “/opt/poky/1.2”
    2. Downloading the Pre-Built Linux* Kernel:
      You can download the pre-built Linux* kernel (*zImage-qemu<arch>.bin OR vmlinux-qemu<arch>.bin).
      1.   http://downloads.yoctoproject.org/releases/yocto/yocto-1.2/machines/qemu/
        1. download: bzImage-qemux86.bin
        2. This article assumes this file is located at ~/yocto/bzImage-qemux86.bin
    3. Create a file system
      1. from: http://downloads.yoctoproject.org/releases/yocto/yocto-1.2/machines/qemu/
      2. Download core-image-sato-sdk-qemux86.tar.bz2
      3. source /opt/poky/1.2/environment-setup-i586-poky-linux 
        mkdir -p ~/yocto/file_system/
        runqemu-extract-sdk core-image-sato-sdk-qemux86.tar.bz2  ~/yocto/file_system/
      4. This will create a root file system that you can access for your host and emulated session.

    Install Intel® Inspector for Systems

    1. Install Intel® System Studio on your Linux* host.
    2. Copy the Intel Inspector for Systems installation to the file system you created above.
      1. cp inspector_for_systems ~/yocto/file_system/home/root

    Cross compiling the tachyon sample code

    1. The tachyon sample code is provided as part of the Inspector for Systems release.
    2. On your Linux* host      
      1. cd ~/yocto
      2. untar tachyon: tar xvzf /opt/intel/systems_studio_2013.0.xxx/inspector_for_systems/samples/en/C++/tachyon_insp_xe.tgz
      3. You will need to modify the tachyon sample as follows:
      4. In the top-level Makefile, comment out the line containing CXX.
      5. In the lower-level Makefile.gmake ('tachyon/common/gui/Makefile.gmake'), add the following lines:
    UI = x
    EXE = $(NAME)$(SUFFIX)
    CXXFLAGS += -I$(OECORE_TARGET_SYSROOT)/X11
    LIBS += -lpthread -lX11
    #LIBS += -lXext
    CXXFLAGS += -DX_NOSHMEM
      6. source /opt/poky/1.2/environment-setup-i586-poky-linux
      7. make
      8. Copy the tachyon binary and the created libtbb.so file to ~/yocto/file_system/home/root/test

     

    Start a QEMU emulator session

    1. source /opt/poky/1.2/environment-setup-i586-poky-linux 

    2. runqemu bzImage-qemux86.bin ~/yocto/file_system/

    Run Intel® Inspector for Systems on the tachyon sample code

    1. Login to the QEMU emulator session
      1. Log in as user root (no password)
      2. cd /home/root/inspector_for_systems
    2. You should see the tachyon binaries and inspector directory in the file system copied from above.
    3. source /home/root/inspector_for_systems/inspxe-vars.sh
    4. Run an Inspector collection:

    Create directory test; cd test

    inspxe-cl -no-auto-finalize -collect mi2 ../tachyon_find_hotspots

    Note: the above will perform a level 2 memory checking analysis.

    Run inspxe-cl -collect help to see some other collections you can run.

    On your Linux* host: Open the Inspector for Systems results

    1. You can view the results you created above on your Linux host. You should see a directory ~/yocto/file_system/home/root/test/r000mi2.
    2. To view these results in Intel® Inspector for Systems  on your Linux host
      1. source /opt/intel/system_studio_2014.0.xxx/inspector_for_systems/inspxe-vars.sh
      2.  inspxe-gui ~/yocto/file_system/home/root/test/r000mi2
      3. The results open in the Inspector for Systems GUI.

     

    Summary

    Intel Inspector for Systems is a powerful tool for finding correctness errors in your code.  


    Media SDK Tutorials for Client and Server


    The Media Software Development Kit (Media SDK) Tutorials show you how to use the Media SDK by walking you step by step through use-case examples, from simple to increasingly complex usages.

    The Tutorials are divided into the following parts (sections):

    1. Introduces the Media SDK session concept via a very simple sample.
    2-4. Illustrate how to utilize the three core SDK components: Encode, Decode, and VPP (video pre/post processing).
    5. Showcases transcode workloads, utilizing the components described in earlier sections.
    6. Showcases more advanced and compound usages of the SDK.

    For simplicity and uniformity the Tutorials focus on the H.264 (AVC) video codec. Other codecs are supported by Intel® Media SDK and can be utilized in a similar way.

    Additional information on the tutorials can be found at https://software.intel.com/en-us/articles/media-sdk-tutorial-tutorial-samples-index. The Media SDK is available for free through the Intel® INDE Starter Edition for client and mobile development, or the Intel® Media Server Studio for datacenter and embedded usages. 

     Download the Media SDK tutorial in the following available formats:

    Quick installation instructions:

    • For Linux, set the MFX_HOME environment variable:
      export MFX_HOME=/opt/intel/mediasdk
    • For Windows, set INTELMEDIASDKROOT and build with Microsoft* VS2012.

    Previous versions of the Tutorials package:


    Intel® System Studio Case Studies


    Enhancing In-Vehicle-Infotainment Application Reliability and Performance using Intel® System Studio

    The x86-based in-vehicle infotainment (IVI) solution is available to reduce the time and cost of developing in-vehicle technologies by providing application-ready solutions consisting of compute modules, automotive middleware, and development kits. Compared to the traditional approach, Intel products help you simplify, accelerate, and reduce the cost of technology integration, solution development, and product testing. This application note focuses on development tools that enhance the reliability and performance of IVI applications.

    Learn more

    Attachment : Intel® System Studio for IVI Applications.pdf


    Boosting Long Term Evolution (LTE) Application Performance with Intel® System Studio

    Intel® System Studio presents a wide variety of tools within a single tool chain for signal processing to application processing to obtain great computing performance, short development cycle and product simplification.

    Learn more

    Attachment: intel-system-studio-boost-lte-study.pdf


    Usage of Intel® System Studio with MinnowBoard MAX

    MinnowBoard MAX is an open hardware embedded board designed with the Intel® Atom™ E38xx series SOC (known as Bay Trail). The board targets the small and low cost embedded market and was designed to appeal to both embedded developers and the maker community.

    Intel® System Studio is a comprehensive and integrated tool suite that provides developers with advanced system tools and technologies to help accelerate the delivery of the next generation of power-efficient, high-performance, and reliable embedded and mobile devices. Using Intel® System Studio, it is easy to develop a prototype of your application.

    Learn more


    Enhancing In-Vehicle-Infotainment Application Reliability and Performance using Intel® System Studio


    Download now : Intel® System Studio for IVI Applications.pdf

     

    As the number of car users increases around the world, the demand for more sophisticated and advanced in-vehicle infotainment (IVI) is also increasing. Automotive developers face a big challenge in terms of hardware and software capability to satisfy these emerging needs. The x86-based solution is available to reduce the time and cost of developing in-vehicle technologies by providing application-ready solutions consisting of compute modules, automotive middleware, and development kits.

    Intel® System Studio 2015 allows you to develop for embedded and mobile Android* and Tizen* IVI systems, adds cross-development from Windows* hosts, and provides expanded JTAG debug support for all IA platforms. Intel® System Studio can be used in various stages of In-Vehicle-Infotainment development, from debugging BIOS in the hardware layer to performance tuning of the HMI layer.


    GLS Test Page HTML5 Tools

    Intel® XDK
    The easy and fast way to get your apps to market.

    Intel® XDK HTML5 Cross-platform Development Tool provides a simplified workflow to enable developers to easily design, debug, build, and deploy HTML5 web and hybrid apps across multiple app stores, and form factor devices.

    Experience Intel® XDK - the easy and fast way to get your apps to market.

    Get Intel XDK now.
    Download
    Intel XDK is available as a free download for Windows* 7 & 8, Apple OS X*, and Ubuntu* Linux
    HTML5 App Development
    Create compelling, content-rich apps using common UI Frameworks, Apache* Cordova* and third-party plugins for advertising and in-app purchasing, as well as a host of backend, authentication, and social media services.
    Integrated HTML5 Workflow
    From your idea to app store.
    Develop
    Start creating the next generation of HTML5 apps for the mobile world.

    Build cross-platform apps easily for many app stores and web platforms

    • Built on Web technologies HTML, CSS, JavaScript*, and Node-Webkit back-end
    • Hosted on Windows*, OS X* and Ubuntu* Linux*

    Jumpstart Development

    • Start with a number of samples or templates for both hybrid and web apps
    • Use the App Designer UI Builder to quickly prototype or refine the UI of your app
    • Start from scratch and edit in the open source Brackets* Editor

    Multiple UI Frameworks & APIs

    • jQuery* Mobile, App Framework*, Twitter Bootstrap*, and Topcoat* - all you need to create great, responsive UIs
    • Full support for Apache* Cordova* device APIs for your hybrid app, and the many 3rd-party plugins

    Web Services & Plugins for Content-rich, Interactive apps

    • Easily add web services, such as datafeeds, backend datastores, authentication from a number of providers
    • Add in ads and in-app purchasing, and other monetization services from Google*, appMobi*
    • Deliver immersive surround sound mobile app experiences with Dolby* Audio API
    • Safeguard data and storage for Hybrid Apps with App Security APIs
    Development made easy.
    Responsive on any device.
    App Dev User
    Intuitive UI Design with App Designer
    • Drag & Drop UI Builder
    • Round-trip capable: modify in App Designer and the Editor
    Built-in JS editing of UI elements
    • Enables custom JS code editing of the UI element
    • Start with common UI Frameworks: App Framework, jQuery* Mobile, Twitter* Bootstrap, Topcoat*
    Accelerate Code Development
    App Dev User
    Brackets* Editor
    • Efficient Coding - Switch between HTML5 project files with code hints
    • Auto completion - Speeds up coding without knowing the exact syntax
    ...or, use your own favorite editor
    Expanded Device API Support
    App Dev User
    Cordova* 3.X - More device capabilities
    • Android*, iOS*, Windows* 8, Windows* Phone 8
    • Supports Cordova plugins
    • Emulation support for Cordova device APIs
    Extends Hybrid Capabilities
    App Dev User
    Crosswalk* Runtime for Android
    • Web Runtime Performance - Web developers can now create applications with native-like performance with WebGL* and WebAudio*
    • Built on Open source Foundation - Enables better performance, flexibility, and ease of deployment to many app stores
    • Standards - Provides native platform and full screen capabilities using HTML5, CSS3, and JavaScript
    Essential Debugging Tools
    Get your app to the stores faster.
    Debug
    Intel® XDK helps decrease your testing and debugging time and get your app to market faster.

    Testing

    • Preview your app while editing in a separate browser window or on your device with Live Preview
    • Use the App Preview app for iOS*, Android*, Windows* 8, and Windows* Phone 8 for full testing on your device

    Emulator

    • Simulate your app running on different phone and tablet skins before deployment
    • Quickly switch to debugger to debug your app inside the device emulator

    JS Remote Debugger and Profiler for Android

    • Efficiently debug apps remotely with JS Remote Debugger for Android
    • View memory, frames, and events profiling results to get the best performance out of your app
    More app reliability.
    Tune App Performance
    App Dev User
    Remote JS Performance Profiler for Android
    • Quickly pinpoints app performance bottlenecks
    • Round-trip capable, modify within the app
    Xlint Platform-Compatibility CSS Checker
    • Xlint* is a Brackets* HTML5 editor extension
    • Reports cross-platform CSS3 compatibility issues
    • Tests against W3C specs for animations, color level, shadings
    Work Efficiently / Stay On Schedule
    App Dev User
    App Preview on-device Testing
    • Enables on-device testing of hybrid apps, without going through app store submission process
    Live Development Side-by-side testing
    • Run your local project files on your USB-connected testing device with the push of a button in the Intel® XDK
    • Automatically push your files as you save your edits for quick iteration cycles
    • Use live layout viewing to instantly see styling and layout changes you make to your CSS and HTML
    On-device simultaneous testing while creating and editing the app over Wi-Fi or USB, on Android* and iOS* devices
    Easier Build, Faster Deployment
    Building your application for every platform has never been this fast and easy.
    Deploy

    Build cross-platform apps easily for many app stores and web platforms.

    • Select your target store and build! Your app is ready for deployment
    • Hybrid or web apps for Apple* App Store, Google* Play, Windows* 8 and Windows* 8 Phone Stores. Also, Amazon*, Tizen*, Facebook*, Chrome* Stores.
    • For Android* devices, build with Crosswalk*, an open source web-runtime to greatly improve your media and games apps with high-performance WebGL* and WebAudio* support
    Create apps for every need and every device.
    Easier Build
    App Dev User
    Build Hybrid and Web Apps
    • Write Hybrid or Web apps once, and deploy to many app stores
    Faster Deployment
    App Dev User
    Reach More App Stores
    • Apple* App Store
    • Google* Play
    • Nook* Store
    • Amazon Store
    • Windows Store
    • Tizen Store
    • Facebook
    • Chrome

    intel.xdk.cache


    For persistent caching of data between application sessions.

    This object is intended to provide local storage for data to speed up applications. It can be used in conjunction with, or as an alternative to, the HTML5 local database. Its methods provide features similar to browser cookies and file caching.

    For cookies, the intention is that you would use setCookie to save string data in name-value pairs. Cookies persist between application sessions. Data values may be retrieved using the getCookie command or from the getCookieList command as soon as the "intel.xdk.device.ready" event is fired.

    The media cache commands are meant to provide quicker access to files such as videos and images. Adding files to the media cache will expedite access to them when the application runs. These are files that are cached across sessions and are not in your application bundle. See the section on events below for more information about events fired from the cache section of intel.xdk.

    intel.xdk.cache.addToMediaCache (method): Gets a file from the Internet and caches it locally on the device.
    intel.xdk.cache.addToMediaCacheExt (method): Gets a file from the Internet and caches it locally on the device, firing progress events.
    intel.xdk.cache.clearAllCookies (method): Clears all data stored using the setCookie method.
    intel.xdk.cache.clearMediaCache (method): Removes all files from the local cache on the device.
    intel.xdk.cache.getCookie (method): Retrieves the value of a cookie previously saved using the setCookie command.
    intel.xdk.cache.getCookieList (method): Returns an array containing the names of all cookies previously saved using the setCookie command.
    intel.xdk.cache.getMediaCacheList (method): Returns an array containing the names of all previously cached files.
    intel.xdk.cache.getMediaCacheLocalURL (method): Returns a URL that you can use to access a cached media file.
    intel.xdk.cache.removeCookie (method): Clears data previously saved using the setCookie method.
    intel.xdk.cache.removeFromMediaCache (method): Removes a file from the local cache on the device.
    intel.xdk.cache.setCookie (method): Sets a chunk of data that will persist from session to session.
    intel.xdk.cache.media.add (event): Fires when data is cached.
    intel.xdk.cache.media.clear (event): Fires once all files are removed from the local file cache.
    intel.xdk.cache.media.remove (event): Fires when data is removed from the cache.
    intel.xdk.cache.media.update (event): Fires repeatedly to track caching progress.

    Supplemental Documentation

    addToMediaCache

    This command will get a file from the Internet and cache it locally on the device. It can then be referenced in a special directory named _mediacache off the root of the bundle. Once this command is run, the “intel.xdk.cache.media.add” event is fired. If there is already a file cached with that name, it is overwritten.

    
    
            intel.xdk.cache.addToMediaCache(urlToCache);
    
            function cacheUpdated(e){
                    alert(e.url + " cached successfully");
            }
            document.addEventListener("intel.xdk.cache.media.add", cacheUpdated, false);
    addToMediaCacheExt

    This command will get a file from the Internet and cache it locally on the device. It can then be referenced in a special directory named _mediacache off the root of the bundle. As this method executes, the "intel.xdk.cache.media.update" event is fired repeatedly to track the progress of the file caching. If there is already a file cached with that name, it is overwritten. A unique id is passed to this method so that the events it fires can be identified by origin. This command replaces the deprecated addToMediaCache command.

    
            intel.xdk.cache.addToMediaCacheExt(urlToCache, uniqueID);
    
            function cacheUpdated(evt) {
                var outString = "";
                outString += "current bytes downloaded: " + evt.current;
                outString += " total bytes in download: " + evt.total;
                var percentage = evt.current / evt.total
                outString += " percentage downloaded: " + percentage + "%";
                outString += " the unique id is: " + evt.id;
                outString += "the URL is: " + evt.url;
                alert(outString);
            }
    
            function cacheComplete(evt) {
                var outString = "";
                outString += "The procedure succeeded (" + evt.success + ") ";
                outString += " the unique id is: " + evt.id;
                outString += "the URL is: " + evt.url;
                alert(outString);
            }
    
            document.addEventListener("intel.xdk.cache.media.update", cacheUpdated, false);
            document.addEventListener("intel.xdk.cache.media.add", cacheComplete, false);
    clearAllCookies

    This method will clear all data stored using the setCookie method.

    
            intel.xdk.cache.clearAllCookies();
    clearMediaCache

    This command will remove all files from the local cache on the device. Once this command is run the “intel.xdk.cache.media.clear” event is fired.

    
            intel.xdk.cache.clearMediaCache();
    
            function cacheCleared(){
                    alert("cache cleared successfully");
            }
    
            document.addEventListener("intel.xdk.cache.media.clear", cacheCleared, false);
    getCookie

    This method will get the value of a cookie previously saved using the setCookie command. If no such cookie exists, the value returned will be “undefined”.

    
            var value = intel.xdk.cache.getCookie("userid");
    getCookieList

    This method will return an array containing the names of all cookies previously saved using the setCookie command. These names can then be used in calls to getCookie.

    
            var cookiesArray = intel.xdk.cache.getCookieList();
            for (var x = 0; x < cookiesArray.length; x++) {
                alert(cookiesArray[x] + " = " + intel.xdk.cache.getCookie(cookiesArray[x]));
            }
    getMediaCacheList

    This method will get an array containing the names of all files previously cached using the addToMediaCache command. These names can then be used in calls to getMediaCacheLocalURL.

    
            var cacheArray = intel.xdk.cache.getMediaCacheList();
            for (var x = 0; x < cacheArray.length; x++) {
                alert(intel.xdk.cache.getMediaCacheLocalURL(cacheArray[x]));
            }
    getMediaCacheLocalURL

    This method will return a URL that you can use to access the cached media file. If the requested URL is not cached, the value returned will be "undefined".

    
            var localurl = intel.xdk.cache.getMediaCacheLocalURL("http://myweb.com/image/logo.gif");
    removeCookie

    This method will clear data previously saved using the setCookie method.

    
            intel.xdk.cache.removeCookie("userid");
    removeFromMediaCache

    This command will remove a file from the local cache on the device. Once this command is run the “intel.xdk.cache.media.remove” event is fired.

    
            intel.xdk.cache.removeFromMediaCache(urlToRemove);
    
            function cacheUpdated(e){
                    alert(e.url + " removed successfully");
            }
            document.addEventListener("intel.xdk.cache.media.remove", cacheUpdated, false);
    setCookie

    Call this method to set a chunk of data that will persist from session to session. The data is automatically purged once the expiration date lapses. The data can be retrieved using the getCookie command.

    
            function saveInfo() {
                //add a cookie
                var name = prompt('Enter information name:');
                var value = prompt('Enter information value:');
                var daysTillExpiry = prompt('Days until cookie expires (-1 for never):');
                try {
                        if (name.indexOf('.')== -1){
                                intel.xdk.cache.setCookie(name,value,daysTillExpiry);
                        } else{
                                alert('cookie names may not include a period');
                        }
                } catch(e) {
                        alert("error in saveInfo: " + e.message);
                }
            }
    media.add

    This event fires once a file is added to the local file cache using the intel.xdk.cache.addToMediaCache command. The url property on the event object will contain the URL of the remote file cached.

    media.clear

    This event fires once all files are removed from the local file cache using the intel.xdk.cache.clearMediaCache command.

    media.remove

    This event fires once a file is removed from the local file cache using the intel.xdk.cache.removeFromMediaCache command.

    media.update

    This event is fired repeatedly as the intel.xdk.cache.addToMediaCacheExt method runs. It will return an event object that contains several parameters. The first parameter is the URL of the remote file cached. The second is the unique ID assigned when the command was called. The third is the current number of bytes downloaded and cached so far, and the final parameter is the total number of bytes in the file.
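    The current/total arithmetic implied by this event can be sketched as a small helper; the function name below is hypothetical and is not part of the intel.xdk API:

```javascript
// Hypothetical helper: turns the payload of an
// "intel.xdk.cache.media.update" event into a progress string.
// The fields (url, id, current, total) are the four parameters
// documented above.
function formatCacheProgress(evt) {
    var percentage = Math.round((evt.current / evt.total) * 100);
    return evt.id + ": " + percentage + "% of " + evt.url;
}

// Example payload, similar to what the event delivers:
console.log(formatCacheProgress({
    url: "http://myweb.com/image/logo.gif",
    id: "dl-1",
    current: 512,
    total: 2048
}));
```

    A real handler would be registered with document.addEventListener("intel.xdk.cache.media.update", ...), as in the addToMediaCacheExt example above.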





    cl_intel_simultaneous_sharing OpenCL extension in a new driver


    There is a new Intel® Iris™ and HD Graphics Driver update posted for Haswell and Broadwell, and it contains the cl_intel_simultaneous_sharing OpenCL extension. The documentation for that extension follows.

    Name String

        cl_intel_simultaneous_sharing

    Version

        Version 7, October 14, 2014

    Extension Type

        OpenCL platform extension

    Dependencies

        OpenCL 1.2. This extension is written against revision 19 of the

        OpenCL 1.2 Specification and revision 19 of the OpenCL 1.2 Extension

        Specification.

    Overview

        Currently the OpenCL 1.2 Extension Spec forbids specifying interoperability
        with multiple graphics APIs at clCreateContext or clCreateContextFromType
        time and defines that CL_INVALID_OPERATION should be returned in such
        cases, as noted e.g. in the chapters dedicated to sharing memory objects
        with Direct3D 10 and Direct3D 11.

     

        The goal of this extension is to relax these restrictions and to allow
        specifying simultaneously those combinations of interoperabilities that
        are supported by a given OpenCL device.

    New Tokens

        Accepted as a property being queried in the <param_name> parameter

        of clGetDeviceInfo:

            CL_DEVICE_SIMULTANEOUS_INTEROPS_INTEL        0x4104
            CL_DEVICE_NUM_SIMULTANEOUS_INTEROPS_INTEL    0x4105
    

    Additions to chapter 4 of the OpenCL 1.2 Specification

        4.2 Querying Devices

     

        Extend table 4.3 to include the following entry:

        --------------------------------------------------------------------------
    
        cl_device_info             Return Type  Description
    
        --------------             -----------  -----------
    
        CL_DEVICE_NUM_SIMULTANEOUS cl_uint      Number of supported combinations
        _INTEROPS_INTEL                         of graphics API interoperabilities
                                                that can be enabled simultaneously
                                                within the same context.
    
                                                The minimum value is 1.
    
        CL_DEVICE_SIMULTANEOUS     cl_uint[]    List of <n> combinations of context
        _INTEROPS_INTEL                         property names describing graphic APIs
                                                that the device can interoperate with
                                                simultaneously by specifying the
                                                combination in the <properties>
                                                parameter of clCreateContext and
                                                clCreateContextFromType.
    
                                                Each combination is a set of 2
                                                or more property names and is
                                                terminated with zero.
    
                                                <n> is the value returned by the
                                                query for
                                                CL_DEVICE_NUM_SIMULTANEOUS_INTEROPS_INTEL.
        --------------------------------------------------------------------------

     

        4.4 Contexts

      

        Add to the list of errors for clCreateContext:

     

        “CL_INVALID_OPERATION if a combination of interoperabilities with multiple graphics

        APIs is specified which is not on the list of valid combinations returned by

        the query for CL_DEVICE_SIMULTANEOUS_INTEROPS_INTEL.”

     

        Add to the list of errors for clCreateContextFromType the same new errors

        described above for clCreateContext.

    Additions to section 9.6.4 of the OpenCL 1.2 Extension Specification

        Replace the section about CL_CONTEXT_INTEROP_USER_SYNC property support with:

     

        “OpenCL / OpenGL sharing does not support the CL_CONTEXT_INTEROP_USER_SYNC property

        defined in table 4.5. Specifying this property when creating a context with OpenCL /

        OpenGL sharing will return an appropriate error or be ignored for OpenGL sharing if

        sharing with another graphics API supporting the CL_CONTEXT_INTEROP_USER_SYNC

        property is also specified.”

     

        Replace the description of CL_INVALID_PROPERTY error code with:

     

        “errcode_ret returns CL_INVALID_PROPERTY if an attribute name other than those

        specified in table 4.5 or if CL_CONTEXT_INTEROP_USER_SYNC is specified in properties

        and there is no graphics API interoperability specified that supports it.”

    Additions to section 9.9.5 of the OpenCL 1.2 Extension Specification

        Remove the following description of CL_INVALID_PROPERTY error code:

     

        “CL_INVALID_OPERATION if Direct3D 10 interoperability is specified by setting

        CL_INVALID_D3D10_DEVICE_KHR to a non-NULL value, and interoperability with another

        graphics API is also specified.”

    Additions to section 9.11.5 of the OpenCL 1.2 Extension Specification

        Remove the following description of CL_INVALID_PROPERTY error code:

     

        “CL_INVALID_OPERATION if Direct3D 11 interoperability is specified by setting

        CL_INVALID_D3D11_DEVICE_KHR to a non-NULL value, and interoperability with another

        graphics API is also specified.”

    Additions to cl_intel_dx9_media_sharing extension specification:

        Remove the following description of CL_INVALID_PROPERTY error code:

     

        “CL_INVALID_OPERATION if DirectX 9 interoperability is specified by setting  

        CL_CONTEXT_D3D9_DEVICE_INTEL, CL_CONTEXT_D3D9EX_DEVICE_INTEL, or

        CL_CONTEXT_DXVA_DEVICE_INTEL to a non-NULL value, and interoperability with any

        other graphics API is also specified.”

    Example Usage

        cl_uint  SimInteropsNum;
        cl_uint* SimInterops;
        cl_uint* SimInteropsBase;
        size_t   SimInteropsSize;

        clGetDeviceInfo( deviceID,
                         CL_DEVICE_NUM_SIMULTANEOUS_INTEROPS_INTEL,
                         sizeof( SimInteropsNum ),
                         &SimInteropsNum,
                         NULL );
        clGetDeviceInfo( deviceID,
                         CL_DEVICE_SIMULTANEOUS_INTEROPS_INTEL,
                         0,
                         NULL,
                         &SimInteropsSize );
        SimInteropsBase = SimInterops = new cl_uint[ SimInteropsSize / sizeof( cl_uint ) ];
        clGetDeviceInfo( deviceID,
                         CL_DEVICE_SIMULTANEOUS_INTEROPS_INTEL,
                         SimInteropsSize,
                         SimInterops,
                         NULL );

        bool SimInteropsCheck[] = { false, false, false };
        bool GLD3D11SimInteropSupported = false;
        cl_uint property = 0;

        for( cl_uint i = 0; i < SimInteropsNum; i++ )
        {
            SimInteropsCheck[0] = false;
            SimInteropsCheck[1] = false;
            SimInteropsCheck[2] = false;
            do
            {
                property = *SimInterops++;
                if( property == CL_GL_CONTEXT_KHR )
                    SimInteropsCheck[0] = true;
                if( property == CL_WGL_HDC_KHR )
                    SimInteropsCheck[1] = true;
                if( property == CL_CONTEXT_D3D11_DEVICE_KHR )
                    SimInteropsCheck[2] = true;
            }
            while( property != 0 );

            if( SimInteropsCheck[0] && SimInteropsCheck[1] && SimInteropsCheck[2] ){
                GLD3D11SimInteropSupported = true;
                printf("This device supports GL and D3D11 simultaneous sharing.\n");
                break;
            }
        }
        if( !GLD3D11SimInteropSupported ){
            printf("This device doesn't support GL and D3D11 simultaneous sharing.\n");
        }
        delete[] SimInteropsBase;
    

     

     

     

     

  • Intel® Xeon Phi™ Coprocessor code named “Knights Landing” - Application Readiness


    As part of the application readiness efforts for future Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors (code named Knights Landing), developers are interested in improving two key aspects of their workloads:

    1. Vectorization/code generation
    2. Thread parallelism

    This article focuses mainly on vectorization/code generation and lists helpful tools and resources for thread parallelism.

    1) Vectorization

    • Intel® Advanced Vector Extensions 512 (Intel® AVX-512) will be first implemented on the processor and coprocessor and will also be supported on some future Intel Xeon processors scheduled to be introduced after Knights Landing.
      For more details on Intel AVX-512 refer to: https://software.intel.com/en-us/blogs/2013/avx-512-instructions
    • Intel AVX-512 offers significant improvements and refinements over the Intel® Initial Many Core Instructions (Intel® IMCI) found on current Intel® Xeon Phi™ coprocessors code named Knights Corner.
    • Today’s Intel® Compilers (14.0+) can compile your code for Knights Landing, and you can run the resulting binary on the Intel® Software Development Emulator (Intel® SDE). Intel® Compilers are available as part of Intel® Parallel Studio XE (available for trial and purchase here) and product documentation can be found here
    • Intel SDE is an emulator for upcoming instruction set architecture (ISA) extensions. It allows you to run programs that use new instructions on existing hardware that lacks those new instructions.
    • Intel SDE is useful for performance analysis, compiler development tuning, and application development of libraries.
    • Intel SDE for Sandy Bridge, Ivy Bridge, Haswell, Broadwell, Skylake (Client), Goldmont, and Knights Landing (KNL) is available here: http://www.intel.com/software/sde
    • Please note that Intel SDE is a software emulator mainly used for emulating future instructions. It is not cycle accurate and can be very slow (up to 100x slower than native execution). It is not a performance-accurate emulator.
    • Instruction Mix:
      • Intel SDE comes with several useful emulator-enabled pin tools, one of which is the mix histogramming tool.
      • This mix histogramming tool can compute histograms using any of the following: dynamic instructions executed, instruction length, instruction category, and ISA extension grouping.
      • The mix-mt tool can also display the top N most frequently executed basic blocks and disassemble them.
    • Plethora of information from instruction mix reports:
      • Top basic blocks in terms of instruction %, dynamic instruction execution for evaluating compiler code generation, function-based instruction count breakdown, instruction count of each ISA type etc.
      • With appropriate parser scripts you can also evaluate FLOP counts, INT counts, memory operation counts, SIMD intensity (operations/instructions), etc.

    Compiling your application for Knights Landing

    Use the latest Intel Compilers (14.0+) and compile with the “-xMIC-AVX512” compiler knob to generate Knights Landing (KNL) binary.

    To run your application on Intel SDE

    • sde -knl -- ./<knl.exe> <args>

    OR you can run your MPI application as

    • mpirun -n <no. of ranks> sde -knl -- ./<knl.exe> <args>

    To generate “Instruction Mix” reports using Intel SDE for different architectures:

    Intel Xeon Phi coprocessor

    Knights Landing

    • sde -knl -mix -top_blocks 100 -iform 1 -- ./<knl.exe> <args>

    You can also run the corresponding Intel Xeon processor binary on Intel SDE for comparisons and analysis purposes:

    Intel Xeon processor

    Ivy Bridge

    • sde -ivb -mix -top_blocks 100 -iform 1 -- ./<ivb.exe> <args>

    Haswell

    • sde -hsw -mix -top_blocks 100 -iform 1 -- ./<hsw.exe> <args>

    It is recommended to generate instruction mix reports using single-thread MPI/OpenMP* runs (OMP_NUM_THREADS=1) to simplify analysis.

    For resolving thread parallelism issues refer to the thread parallelism section below.

    Example Analysis using instruction mix report from Intel SDE

    Extracted Kernel from http://www.berkeleygw.org/

    Total Dynamic Instruction Reduction:

    • Intel AVX -> Intel AVX2 Reduction: 1.08x
    • Intel AVX2 -> Intel AVX-512 Reduction: 3.15x

    Function Level Breakdown

    Further Breakdown on isa-set categories

    Significant % of x87 code for Intel AVX and Intel AVX2 for this kernel

    Basic Blocks

    Intel SDE also provides the top basic blocks for your run, based on hot instruction execution counts.

    If you look at the top basic blocks, you see a significant number of x87 instructions in this kernel for the Intel AVX/AVX2 code. Below is just a snippet of the first basic block for Intel AVX2 instruction mix report.

    The corresponding source code for the above basic block is line 459 (as highlighted above).

    Looking at the source, we observed that line 459 involves a “complex” division; the compiler generates an x87 sequence to conform to strict IEEE semantics and to avoid any overflows and underflows.

    The way to avoid this is to compile with -fp-model fast=2. This allows the compiler to assume that real and imaginary parts of the double precision denominator lie in the approximate range, so it generates simple code without the tricks above. It can then generate vector Intel AVX/AVX2 instructions for the entire loop.
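    To make the pattern concrete, here is a minimal, hypothetical sketch of such a loop (not the actual BerkeleyGW source; the array contents are made up for illustration). The complex divisions in it are the kind of statement that, under the default floating-point model, the Intel compiler expands into a scalar x87 sequence, and that -fp-model fast=2 allows it to vectorize:

    ```cpp
    #include <complex>
    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical kernel: element-wise complex division, the pattern
        // discussed above. Values here are only for illustration.
        const int n = 8;
        std::vector<std::complex<double>> num(n), den(n), out(n);
        for (int i = 0; i < n; ++i) {
            num[i] = std::complex<double>(1.0 + i, 2.0);
            den[i] = std::complex<double>(3.0, 4.0 + i);
        }

        // With the default -fp-model, the compiler may emit a scalar x87
        // sequence for this division to guard against overflow/underflow;
        // with -fp-model fast=2 it is free to vectorize the loop.
        for (int i = 0; i < n; ++i)
            out[i] = num[i] / den[i];

        // (1+2i)/(3+4i) = (11+2i)/25
        printf("%.3f %.3f\n", out[0].real(), out[0].imag());
        return 0;
    }
    ```

    Comparing the instruction mix reports of the two builds (with and without -fp-model fast=2) should show the x87 share of this loop dropping in the second build.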

    The EXECUTIONS count in the basic block is the number of times this basic block was executed, and ICOUNT gives the total number of instructions executed for this basic block across all executions. Thus ICOUNT/EXECUTIONS gives the number of instructions in this basic block.

    In addition, the vectorization optimization report generated by the compiler (using -qopt-report=5) can be combined with the SDE top basic blocks for a first-pass ‘vectorization study’. Compiling with -qopt-report=5 generates an optimization report file, kernel.optrpt. You can take the source line reported for a basic block (for example, line 459 above) and map it to the optimization report to find whether your loop/basic block was vectorized (and, if not, why not). The optimization report also contains messages such as whether arrays in the loop were accessed with aligned or unaligned references.

    This is just an example of the kind of analysis that is possible with instruction mix reports from Intel SDE, but a lot more analysis is possible. For more details please see https://software.intel.com/en-us/articles/intel-software-development-emulator

    Configurations for the run: The instruction mix for the extracted kernel was generated using Intel® SDE version 7.2, the application was compiled with Intel® Compilers version 14.0.2 20140120. The run was conducted by Intel Engineer Karthik Raman. For more information go to http://www.intel.com/performance

    2) Thread Parallelism

    Efficient parallelism is key for applications in the HPC domain to achieve good performance and cluster scaling. This is more critical than ever with many-core architectures (like the Intel Xeon Phi coprocessor) and the increasing core counts of Intel Xeon processors.

    The parallelism can be across several layers such as instruction level (super scalar), data level (SIMD/vectorization), thread level: shared memory (OpenMP) and/or distributed memory (MPI). Many HPC programs are moving to hybrid shared memory/distributed memory programming model where both OpenMP and MPI are used.

    You can test the thread scalability and efficiency of your application using existing hardware (Intel Xeon processor and/or Intel Xeon Phi coprocessor (Knights Corner)).

    Many tools are available for thread scalability analysis. A few are listed below:

    1. OpenMP scalability analysis using Intel® VTune™ Amplifier XE 2015
      Serial vs. Parallel time, Spin Overheads, Potential gains possible etc.
    2. Intel® Trace Analyzer and Collector
      To understand MPI application behavior, quickly find bottlenecks, and achieve high performance for parallel cluster applications
    3. Intel® Inspector XE 2015
      Memory and threading error debugger and thread dependency analysis.

    Using the Develop Tab


    The Develop tab provides a full view of the files in your project directory. You can edit project files with the built-in code editor or with your favorite code editor, alongside the Intel® XDK. You can also initiate live development preview activities and explore web services using this tab. The Intel XDK tools automatically detect when project files are changed (as the result of a save when using your external editor) and will prompt you if additional actions are required due to changes to project files.

    Code Editor and GUI Designer Tools

    The Develop tab provides two views: a Code editor view and a GUI Design view. The Code view shows the files in your project directory, available web services, the code editor window, and a Live Development pane. If you created your app using either App Starter or App Designer, you can access these GUI layout editors in the Design view. To switch between these views when editing an HTML file, use the [ CODE | DESIGN ] buttons.

    The Brackets code editor and App Designer (and App Starter) GUI design tools are all optional tools. You are NOT required to use them to build an Intel XDK hybrid HTML5 mobile web app. You are welcome to use your favorite code editor and/or favorite user interface layout tools. You can also implement your app's UI layout manually. The "Live Layout" feature does require the use of the Brackets editor, but no other features of the Intel XDK are directly dependent on these tools. Thus, if you have an existing web app that you are translating into a hybrid mobile web app, you can simply import that layout and code into a project and continue to work directly on the source; you do not need to "shoehorn" an existing app into App Designer.

    A Note about App Designer and App Starter

    If you created your app using either App Starter or App Designer (e.g., using the “(+) Start a New Project” button at the bottom of the Projects tab), you can use these GUI layout editors on the Develop tab's Design view.

    • Use App Starter to build a UI using the App Framework mobile-optimized UI library, or use App Starter to learn how to build App Framework applications by hand (by reviewing the code that App Starter creates).
    • With App Designer you can build a UI based on a responsive grid system and one of several UI widget libraries, including the App Framework UI library.

    App Designer utilizes a media query grid system for creating responsive web UI layouts. This media query grid system enables your app to resize and adapt to portrait and landscape views on phones, tablets, and even Ultrabook™ devices. To get started, see the App Designer Documentation and Tutorial page.

    When you open an HTML file in the Develop tab, if that project was created using either App Designer or App Starter, use the [ CODE | DESIGN ] buttons above the file project tree to switch between the Code and the Design (GUI) views.

    Don’t forget to check out the App Framework UI Components documentation page and the App Framework CSS Style Builder for more information about the App Framework UI library, which has been optimized for use with HTML5 hybrid mobile apps.

    Code Editor Capabilities

    You can edit project files with the built-in Brackets* code editor or with your favorite code editor, alongside the Intel® XDK. The Intel XDK tools automatically detect when project files are changed (as the result of a save when using your external editor) and will prompt you if additional actions are required due to changes to project files.

    If you are unfamiliar with the Brackets HTML5 code editor built into the Develop tab, read Using the Editor in the Intel® XDK Develop Tab.

    NOTE: The built-in Brackets editor includes a curated list of Brackets extensions. From the code editor menu, choose File > Extension Manager… to see the list of editor extensions that are available. There is no mechanism available to include your own custom Brackets extensions.

    Web Service Capabilities

    In the Code view below the file tree, the Intel XDK lets you explore a collection of third-party web service APIs (cloud services). In addition to the built-in third-party web services, the Develop tab helps you integrate other existing web services for use within the Intel XDK, such as those developed specifically for your app. For more information, see Exploring and Integrating Web Services in the Intel XDK.

    Live Development Capabilities

    The Live Development Tasks pane appears on the right side of the Code view in the Develop tab. This pane makes the process of previewing your project's code in a browser or device quick and efficient. The following Live Development Tasks pane shows expanded Run My App, Live Layout Editing, and Connected Device areas.

    • Run My App runs your app either on USB-connected mobile Android* device(s) or on virtual devices in the Intel XDK device emulator. Changes appear after you save project files and reload/restart your app.
    • Live Layout Editing lets you view your app on WiFi-connected Android and/or Apple iOS* device(s) or in a browser window. Changes appear immediately after you make edits using the built-in Intel XDK editor, or after you save project files using an external editor.
    • Connected Devices shows the devices connected by USB cable or WiFi to your development system.

    For information about using Live Development, see Using Live Development in the Intel XDK.

    Resources


    Legal Information    -     *Other names and brands may be claimed as the property of others.
    Visit Support Forums   -   Submit feedback on this page

    Using the Editor in the Intel® XDK Develop Tab


    Contents
      Legal Information
      Opening, Saving, and Adding a File
      Customizing Your Editing Environment
      Managing Editor Extensions
      Standard Editing Keys for Daily Use
      Sample Output from a Find Command
      File/Navigation Features
      Using Code Autocompletion
      Using Quick Edit
      Using Quick Open
      Resources

    The Develop tab of Intel® XDK lists those files associated with the active project in the left sidebar. After you open or create a project, click the Develop tab. If needed, click the CODE button to use the built-in code editor:

    • When editing HTML files associated with certain projects, the toolbar displays CODE and DESIGN buttons to let you switch between the Editor view and GUI Design view. The GUI Design view appears for App Designer and App Starter projects only.
    • After you open and modify a file, a Working Files section appears above the project name. This section is equivalent to a "tab view" in other editors. To add a file immediately to the Working Files section, double-click its name in the left sidebar file tree.
    • The left sidebar below the project name shows all files associated with the project in a file tree. The files displayed in the file tree are updated as files are added or removed in the file system below the project root.
    • The Editor view (shown) appears for most files. After you open HTML files associated with certain projects, a GUI view may appear instead; click the CODE button to switch to the Editor view.

    This built-in editor in the Develop tab is tightly integrated with the Emulate tab. After you save a file in the Develop tab's built-in editor, that file can be automatically reloaded when you click the Emulate tab. In contrast, if you save a modified file using your favorite editor external to the Intel XDK, you need to click the Reload button to update all files in the Emulate tab. If you save a modified file using an editor external to the Intel XDK, that file will be silently reloaded in the built-in editor when you open it. To control Emulator automatic reloading, use the Emulate tab settings. This Intel XDK built-in editor is based on Brackets*. To install and manage Brackets extensions, see Managing Editor Extensions.

    For projects created with the App Starter or App Designer wizards, the Design view appears providing access to a GUI layout editor when you open an HTML file in the Develop tab. To switch between the Editor and the Design (GUI) views, click the CODE or DESIGN buttons. This tab also supports Live Preview and Live Development. For details, see Develop Tab - Code Editor and UI Designer.


    Opening, Saving, and Adding a File

    To open a project file for editing - such as index.html - click its name in the sidebar file tree. The selected file’s content appears.

    To open and edit files not in the project's file tree, click the File > Open menu item (or use the keyboard shortcut). Once opened, these non-project files will appear in the Working Files section.

    After you modify the opened file, a Working Files section appears above the project name. To add a file immediately to the Working Files section, double-click its name.

    Saving a File

    To save a file, do one of the following:

    • Click the File > Save menu item.
    • On Microsoft Windows* systems, press Ctrl+S.
    • On Apple Mac* systems, press Cmd+S.

    To save the file with a different name or in a different directory, click the File > Save As... menu item (or use the keyboard shortcut Ctrl/Cmd+Shift+S).
    When using File > Open or File > Save As..., you can import files or save files from other projects, such as JavaScript libraries, images, and so on.

    Using Context Menus

    Both the Working Files section and the file tree sidebar support context menus. To display the context menu, right click in the Working Files section or the file tree sidebar.

    You can use the context menu to add a folder or file. To add a file to the project, right click in or below the file tree and select New File. To create the file in a specific directory, right click on the directory name, such as in the js directory:



    To name the file, type the file's name in the sidebar.

    Inserting a File's Contents into the Editor

    To copy a file's contents into the editor, drag and drop the file from your native OS file system's GUI interface (such as Microsoft Windows Explorer* or Apple Finder*) into the editor window. For example, after you add a new project file using a context menu in the file tree, drag-and-drop the file to add its contents into the editor.

    Customizing Your Editing Environment

    The screen above shows the file tree sidebar and line numbers. In addition to adding editor extensions (see Managing Editor Extensions), you can customize the editing window by using the View menu or its key combination shortcuts (with Microsoft Windows* systems, press Ctrl; with Apple Mac* systems, press Cmd):

    To Do This                                                    | Key Combination   | View Menu Item
    Show/Hide the sidebar (toggle)                                | Ctrl/Cmd+Shift+H  | View > Hide Sidebar (for increased width) or View > Show Sidebar (to use the sidebar)
    Increase font size                                            | Ctrl/Cmd++        | View > Increase Font Size
    Decrease font size                                            | Ctrl/Cmd+-        | View > Decrease Font Size
    Restore font size                                             | Ctrl/Cmd+0        | View > Restore Font Size
    Highlight active line (toggle)                                |                   | View > Highlight Active Line
    Show line numbers (toggle)                                    |                   | View > Line Numbers
    Allow word wrap (toggle)                                      |                   | View > Word Wrap
    Show/Hide graphical hover tips for colors and images (toggle) |                   | View > Quick View on Hover
    Display the JSHint pane (toggle)                              |                   | View > Lint Files on Save
    For other actions, open the View menu                         |                   | View

    To configure JSHint, create a .jshintrc file in the project root, as described at http://www.jshint.com/docs/. Unlike the standalone JSHint, only the .jshintrc file in the project root is read for configuration settings. When View > Lint Files on Save is active, an indicator in the lower-right shows an orange triangle with the number of JSHint problems, or a green checkmark if no issues are detected; a white circle appears if View > Lint Files on Save is disabled. Click the orange triangle to show or hide the JSHint panel.
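    For example, a minimal .jshintrc in the project root might look like the following sketch. The option names shown (undef, unused, browser, globals) are standard JSHint options; the intel global entry is an illustrative assumption for code that uses the intel.xdk namespace:

    ```json
    {
      "undef": true,
      "unused": true,
      "browser": true,
      "globals": {
        "intel": false
      }
    }
    ```

    See the JSHint documentation for the full list of supported options.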

    Managing Editor Extensions


    To view and manage your extensions added to this Brackets-based editor, click the File menu and select Extension Manager....
    To view the:
    • Available Brackets extensions not yet installed, click the Available tab. Use the vertical scroll bar to view the complete list of available third-party extensions. After reading the extension's documentation by clicking the More info... link, consider clicking the Install button to install that extension.
    • Brackets extensions already installed, click the Installed tab. With the Intel XDK, one extension installed by default is XLint for Intel XDK, a static analyzer provided by Intel; XLint applies only to CSS files. To read an extension's documentation, click the More info... link. To update an extension, click the Update button (if available) near its name. To remove an extension, click Remove near its name.

    Once you install an extension, it is effective immediately, unless the extension informs you that a restart is required.

    For more information about extensions, see: Brackets* editor documentation.

    Standard Editing Keys for Daily Use

    The following table lists editing actions you might use every day. Most key combinations are used by other editors (with Windows systems, press Ctrl; with Mac systems, press Cmd).

    To Do This                                          | Key Combination            | Edit Menu Item
    Cut selected code                                   | Ctrl/Cmd+X                 | Edit > Cut
    Copy selected code                                  | Ctrl/Cmd+C                 | Edit > Copy
    Paste selected code                                 | Ctrl/Cmd+V                 | Edit > Paste
    Undo last action                                    | Ctrl/Cmd+Z                 | Edit > Undo
    Redo last action                                    | Ctrl/Cmd+Y                 | Edit > Redo
    Select all lines                                    | Ctrl/Cmd+A                 | Edit > Select All
    Select current line                                 | Ctrl/Cmd+L                 | Edit > Select Line
    Add multi-line cursor to previous line              | Alt+Shift+up-arrow         | Edit > Add Cursor to Previous Line
    Add multi-line cursor to next line                  | Alt+Shift+down-arrow       | Edit > Add Cursor to Next Line
    Find string (matches are highlighted)               | Ctrl/Cmd+F                 | Edit > Find
    Find next                                           | F3                         | Edit > Find Next
    Find previous                                       | Shift+F3                   | Edit > Find Previous
    Find string in files                                | Ctrl/Cmd+Shift+F           | Edit > Find in Files
    Multi-cursor select all matches of selected string  | Alt+F3                     | Edit > Find All and Select
    Multi-cursor select next match of selected string   | Ctrl/Cmd+B                 | Edit > Add Next Match to Selection
    Indent                                              | Ctrl/Cmd+Shift+]           | Edit > Indent
    Unindent                                            | Ctrl/Cmd+Shift+[           | Edit > Unindent
    Insert/Remove line comment (toggle)                 | Ctrl/Cmd+/                 | Edit > Toggle Line Comment
    Insert/Remove block comment (toggle)                | Ctrl/Cmd+Shift+/           | Edit > Toggle Block Comment
    Duplicate                                           | Ctrl/Cmd+D                 | Edit > Duplicate
    Delete selected line                                | Ctrl/Cmd+Shift+D           | Edit > Delete Line
    Move selected line up                               | Ctrl/Cmd+Shift+up-arrow    | Edit > Move Line Up
    Move selected line down                             | Ctrl/Cmd+Shift+down-arrow  | Edit > Move Line Down
    Show code hints (autocomplete; also see Using Code Autocompletion below) | Ctrl/Cmd+Space | Edit > Show Code Hint
    Show parameter hints                                | Ctrl/Cmd+Shift+Space       | Edit > Show Parameter Hint
    For other actions, open the Edit menu               |                            | Edit

    To allow characters you type to appear on multiple lines, enable multiple cursor support. For example, if you move a file to a subdirectory, you may need to insert the directory name on multiple lines that reference that file name. You can hold down the Alt key and drag vertically (or diagonally), or use the Edit menu items above, such as Add Cursor to Previous/Next Line or Find All and Select/Add Next Match to Selection. To learn more about using multiple cursors, see http://blog.brackets.io/2014/04/15/brackets-0-38-release-multiple-cursors/.

    In addition to the Edit menu, you can access several common functions from a context menu in the editor view. Select the line(s) of text, right click to display the context menu, and choose the menu item, such as Copy, Cut, Paste, and Select All.

    Sample Output from a Find Command

    This built-in editor provides many helpful visual cues. For example, press Ctrl/Cmd+F (Edit > Find) and type the search string <script. The editor highlights occurrences of <script and its scroll bar indicates where in the file the string occurs:


    File/Navigation Features

    You can right click in the file tree sidebar or Working Files section to display a context menu. To display the native OS file system's GUI interface (such as Microsoft Windows Explorer* or Apple Finder*), choose the Show in OS menu item. Similarly, the OS file system's GUI interface appears when you click File > Open or File > Save As... (or equivalent keyboard shortcuts).

    This editor also provides advanced navigation features. The following table lists navigation or file actions (with Windows systems, press Ctrl; with Mac systems, press Cmd):

    To Do This                                                              | Key Combination                                | Menu Item
    Quickly open a file from a displayed list of project files (also see Using Quick Open below) | Ctrl/Cmd+Shift+O          | Navigate > Quick Open
    Quickly go to a line number (nnn) in this file                          | Ctrl/Cmd+Shift+O and type :nnn, or Ctrl/Cmd+G  | Navigate > Quick Open (type :nnn) or Navigate > Go to Line
    Quickly go to a function definition (func) in this file (also see Using Quick Open) | Ctrl/Cmd+Shift+O and type @func, or Ctrl/Cmd+J | Navigate > Quick Open (type @func) or Navigate > Jump to Definition
    Open the next Working File for editing                                  | Ctrl/Cmd+Tab                                   | Navigate > Next Document
    Open the previous Working File for editing                              | Ctrl/Cmd+Shift+Tab                             | Navigate > Previous Document
    Go to a specified line number in this file                              | Ctrl/Cmd+G                                     | Navigate > Go to Line
    Go to the location of a function definition                             | Ctrl/Cmd+T                                     | Navigate > Quick Find Definition
    Jump to the location of the selected function's definition              | Ctrl/Cmd+J                                     | Navigate > Jump to Definition
    In the JSHint Errors pane (View > Lint Files on Save), go to the first error and show its location in the editor | F8         | Navigate > Go to First Error/Warning
    Use the inline editor to edit a selected function or CSS definition (see Using Quick Edit below) | Ctrl/Cmd+E (toggle)       | Navigate > Quick Edit (toggle)
    Get pop-up help on the selected keyword; limited to CSS properties (to close, press Esc or Ctrl/Cmd+K) | Ctrl/Cmd+K           | Navigate > Quick Docs
    Show a file's location in the sidebar file tree                         |                                                | Navigate > Show in File Tree
    For other actions, open the Navigate menu                               |                                                | Navigate

    Using Code Autocompletion

    This built-in editor displays options (hints) as you type HTML code, JavaScript code, or CSS code.

    For code hints to appear while editing JavaScript* files, you may need to type a delimiter character and/or move the mouse pointer to a certain position before valid tag names, parameter keywords, values, or file names automatically appear. For example, if you only type intel.xdk, no hints appear until you type the trailing period:

    The provided code hints are based on the analysis of the files included in the project. Also, code hints are built-in for the JavaScript language itself, the intel.xdk namespace, the Apache Cordova* 2.9.0 API, the jQuery library, and the browser’s global objects. You can configure the JavaScript code autocompletion by placing a .jscodehints file in the project root. For the format of this file, see https://github.com/adobe/brackets/wiki/JavaScript-Code-Hints#configuration.
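    As a sketch, a .jscodehints file in the project root might look like this. The keys shown (excluded-directories, excluded-files, max-file-size) are among those described on the Brackets wiki page linked above; the particular values are illustrative assumptions:

    ```json
    {
      "excluded-directories": ["/bower_components/"],
      "excluded-files": ["jquery.min.js"],
      "max-file-size": 524288
    }
    ```

    Excluding large third-party libraries from analysis can make hinting faster without affecting hints for your own code.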

    With HTML tags you need to press Ctrl/Cmd+space or click Edit > Show Code Hint. For example:

    1. On a new line, start typing <scr and press Ctrl/Cmd+space. The hint choice(s) appear in a context menu.
    2. Scroll to and click one of the displayed items. In this case we must choose script!
    3. After you choose the tag name, type a space to view possible parameter keyword values.
    4. Scroll to and click charset.
    5. With the mouse cursor between the quotation mark characters, scroll to and click one of the displayed parameter values, such as utf-8 or a value needed for your app.
    6. To enter another parameter, move the mouse cursor back to a space after the <script tag and press Ctrl/Cmd+space (Edit > Show Code Hint) to view the list of parameters again.
    7. Scroll to and click type or a value needed for your app.
    8. With the mouse cursor between the two quotation mark characters, scroll to and click a displayed parameter value, such as text/javascript.
    9. After you have added all the parameters, close the <script tag by typing a > character. The <script> tag is automatically terminated by an inserted </script> tag.

    TIP: To enable the automatic display of parameters or parameter value hints (before you press Ctrl/Cmd+space for HTML code), you:

    • Need to position the mouse cursor at the proper place so the editor knows what you want to do - especially if you edit an existing line. For example, to display a list of project images, type:
    • May need to type a delimiter character to provide the proper syntax, such as a space character, quotation mark ("), period (.), or =. For example, to display image files, type <image src=" and press Ctrl/Cmd+space.

    Using Quick Edit

    Quick edit displays a pop-up editing window for the selected definition within the current window. It appears with a grey background and is limited to JavaScript (such as functions) and CSS code. The Quick Edit feature is available only at places where a corresponding definition is available.

    For example, using the Towers of Hanoi demo app's index.html file, position the mouse pointer on the div shown below:

    After you press Ctrl/Cmd+E or click Navigate > Quick Edit, the corresponding definition from the CSS file toh.css appears in a pop-up pane with a grey background, so you can view/edit that definition:

    After you edit it, press Ctrl/Cmd+S to save the inlined pop-up file. To close the pop-up editor pane, press Esc or Ctrl/Cmd+E, or click Navigate > Quick Edit.

    Similarly, when editing a JavaScript file, position the mouse pointer on a function reference (such as Create_Pins) and press Ctrl/Cmd+E or click Navigate > Quick Edit to display and edit the function's definition using Quick Edit. For example, when editing toh.js:

    After you press Ctrl/Cmd+E or click Navigate > Quick Edit:


    Using Quick Open

    Use Quick Open to navigate quickly to various files, line numbers, or function definitions.

    Press Ctrl/Cmd+Shift+O or click Navigate > Quick Open to display a pop-up pane in the upper right of the editing window. It initially displays a list of files:

    To Do This                                                                | Type This
    Filter (reduce) the list of file names                                    | Letter(s) of a file name to match
    Go to a line number in this file (similar to Navigate > Go To Line)       | A colon followed by a number, in the format :nnn
    Display hints of the functions defined in this file                       | An at-sign character: @
    Go to a function definition (similar to Navigate > Quick Find Definition) | An at-sign character and a function name xxx, in the format @xxx

    Resources



    Intel® System Debugger - Intel® Processor Trace Support
