- A GPU is optimized for parallel, throughput-oriented computing; it is the CPU that is optimized for sequential computing where latency matters.
- Performance can vary across compilers and compiler versions when using compiler directives to accelerate applications.
- SPMD is short for single program, multiple data.
- GPU is short for graphics processing unit.
- In a heterogeneous computing environment, the kernel code is usually handled by the device.
- The total time to complete a parallel job is limited by the thread that takes the longest to finish.
- In a multi-threaded program, the number of calls to the join function equals the number of calls to the create function.
- Deadlock is not limited to shared-memory applications; it can also occur in distributed, message-passing programs.
- Mutual exclusion is a mechanism that ensures only one process can access the critical section at any given time.
- In pipelined computing, each stage contributes a part of the overall problem and passes its result on to the next stage.