Parallel Arrays C++ Example



In computing, a group of parallel arrays (also known as structure of arrays or SoA) is a form of implicit data structure that uses multiple arrays to represent a singular array of records. It keeps a separate, homogeneous data array for each field of the record, each having the same number of elements. Then, objects located at the same index in each array are implicitly the fields of a single record. Pointers from one object to another are replaced by array indices. This contrasts with the normal approach of storing all fields of each record together in memory (also known as array of structures or AoS). For example, one might declare an array of 100 names, each a string, and 100 ages, each an integer, associating each name with the age that has the same index.

Examples

An example in C using parallel arrays:

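A minimal sketch, with illustrative names and values: the two arrays share an index, so name[i] and age[i] together form the fields of one record.

    #include <stdio.h>

    int main(void) {
        /* Two parallel arrays: element i of each array belongs to record i. */
        const char *name[] = {"Alice", "Bob", "Carol"};
        int age[] = {34, 27, 45};
        int n = 3;

        for (int i = 0; i < n; i++) {
            /* The shared index i ties the fields of a record together. */
            printf("%s is %d years old\n", name[i], age[i]);
        }
        return 0;
    }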

in Perl (using a hash of arrays to hold references to each array):

Or, in Python:

Pros and cons

Parallel arrays have a number of practical advantages over the normal approach:

  • They can be used in languages which support only arrays of primitive types and not of records (or perhaps don't support records at all).
  • Parallel arrays are simple to understand, particularly for beginners who may not fully understand records.
  • They can save a substantial amount of space in some cases by avoiding alignment issues. For example, some architectures work best if 4-byte integers are always stored beginning at memory locations that are multiples of 4. If the previous field was a single byte, 3 bytes might be wasted. Many modern compilers can automatically avoid such problems, though in the past some programmers would explicitly declare fields in order of decreasing alignment restrictions.
  • If the number of items is small, array indices can occupy significantly less space than full pointers, particularly on some architectures.
  • Sequentially examining a single field of each record in the array is very fast on modern machines, since this amounts to a linear traversal of a single array, exhibiting ideal locality of reference and cache behaviour (see the sketch after this list).
  • They may allow efficient processing with SIMD instructions in certain instruction set architectures.
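
As a rough illustration of the padding and single-field-traversal points above (the type and function names here are my own, not from the original), compare summing one field under an array-of-structures layout with the parallel-array layout:

    #include <stdint.h>

    /* Array of structures (AoS): each record is likely padded to 8 bytes, and
       summing one field strides over the unrelated field and padding as well. */
    struct Person { uint8_t flag; int32_t age; };

    long sum_ages_aos(const struct Person *people, int n) {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += people[i].age;
        return total;
    }

    /* Parallel arrays (SoA): the ages are contiguous, so this loop is a dense
       linear scan with good cache behaviour and easy vectorization. */
    long sum_ages_soa(const int32_t *ages, int n) {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += ages[i];
        return total;
    }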

Several of these advantages depend strongly on the particular programming language and implementation in use.

However, parallel arrays also have several strong disadvantages, which serve to explain why they are not generally preferred:

  • They have significantly worse locality of reference when visiting the records non-sequentially and examining multiple fields of each record, because the various arrays may be stored arbitrarily far apart.
  • They obscure the relationship between fields of a single record (e.g. no type information relates the arrays to one another, so an index may be used erroneously).
  • They have little direct language support (the language and its syntax typically express no relationship between the arrays in the parallel array, and cannot catch errors).
  • Since the bundle of fields is not a 'thing', passing it around is tedious and error-prone. For example, rather than calling a function to do something to one record (or structure or object), the function must take the fields as separate arguments. When a new field is added or changed, many parameter lists must change, whereas passing whole objects would avoid such changes entirely.
  • They are expensive to grow or shrink, since each of several arrays must be reallocated. Multi-level arrays can ameliorate this problem, but they impact performance due to the additional indirection needed to find the desired elements.
  • Perhaps worst of all, they greatly raise the possibility of errors. Any insertion, deletion, or move must always be applied consistently to all of the arrays, or the arrays will no longer be synchronized with each other, leading to bizarre outcomes.

The bad locality of reference can be alleviated in some cases: if a structure can be divided into groups of fields that are generally accessed together, an array can be constructed for each group, with its elements being records containing only those subsets of the larger structure's fields (see data-oriented design). This is a valuable way of speeding up access to very large structures with many members, while keeping the portions of the structure tied together. An alternative to tying them together using array indices is to use references, but this can be less efficient in time and space.

Another alternative is to use a single array, where each entry is a record structure. Many languages provide a way to declare actual records, and arrays of them. In other languages it may be feasible to simulate this by declaring an array of n*m size, where m is the size of all the fields together, packing the fields into what is effectively a record, even though the particular language lacks direct support for records. Some compiler optimizations, particularly for vector processors, are able to perform this transformation automatically when arrays of structures are created in the program.
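
A hedged sketch of the n*m packing trick just described (the field layout is illustrative): one flat array holds m slots per record, and record i's field j lives at index i*m + j.

    enum { M = 2 };   /* slots per simulated record: 0 = age, 1 = height */

    /* data is a flat array of n*M ints; record i occupies data[i*M .. i*M + M - 1]. */
    int get_field(const int *data, int i, int field) {
        return data[i * M + field];
    }

    void set_field(int *data, int i, int field, int value) {
        data[i * M + field] = value;
    }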


OpenMP is one of the most popular solutions for parallel computation in C/C++. OpenMP is a mature API that has been around for about two decades; the first OpenMP API spec came out for Fortran (yes, Fortran). OpenMP provides a high level of abstraction and allows compiler directives to be embedded in the source code.

Ease of use and flexibility are amongst the main advantages of OpenMP. In OpenMP, you do not see how each and every thread is created, initialized, managed and terminated. You will not see a function declaration for the code each thread executes. You will not see how the threads are synchronized or how reduction will be performed to procure the final result. You will not see exactly how the data is divided between the threads or how the threads are scheduled. This, however, does not mean that you have no control. OpenMP has a wide array of compiler directives that allow you to decide each and every aspect of parallelization: how you want to split the data, static or dynamic scheduling, locks, nested locks, routines to set multiple levels of parallelism, and so on.

Another important advantage of OpenMP is that it is very easy to convert a serial implementation into a parallel one. In many cases, serial code can be made to run in parallel without having to change the source code at all. This makes OpenMP a great option whilst converting a pre-written serial program into a parallel one. Further, it is still possible to run the program in serial; all the programmer has to do is remove the OpenMP directives.

Understanding OpenMP

First, let’s see what OpenMP is:

OpenMP, short for “Open Multi-Processing”, is an API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran - on most platforms, processor architectures and operating systems.

OpenMP consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. So basically, when we use OpenMP, we use directives to tell the compiler how our code should be run in parallel. Programmers do not have to write the explicit thread-management code themselves; they just have to tell the compiler what to parallelize. It is imperative to note that the compiler does not check whether the given code is parallelizable or whether there are any race conditions; it is the responsibility of the programmer to do the required checks for parallelism.

OpenMP is designed for multi-processor/core, shared-memory machines and can only be run on shared-memory computers. OpenMP programs accomplish parallelism exclusively through the use of threads. There’s a master thread that forks a number of slave threads that do the actual computation in parallel. The master plays the role of a manager. All the threads exist within a single process.

By default, each thread executes the parallelized section of code independently. Work-sharing constructs can be used to divide a task among the threads so that each thread executes its allocated part of the code. Therefore, both task parallelism and data parallelism can be achieved using OpenMP.

Though not the most efficient method, OpenMP provides one of the easiest parallelization solutions for programs written in C and C++.

Linear Search

For our first example, let’s look at linear search.

Linear search or sequential search is a method for finding a target value within a list. It sequentially checks each element of the list for the target value until a match is found or until all the elements have been searched.

Linear search is one of the simplest algorithms to implement and has a worst-case complexity of O(n), i.e. the algorithm has to scan through the entire list to find the element; this happens when the required element isn’t in the list or is present right at the end.

By parallelizing the implementation, we make the multiple threads split the data amongst themselves and then search for the element independently on their part of the list.

Here’s the serial implementation:
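
A minimal serial sketch (the function name and the int-array signature are assumptions, not the original code):

    // Returns the index of key in arr[0..n-1], or -1 if it is not present.
    int linear_search(const int *arr, int n, int key) {
        for (int i = 0; i < n; i++) {
            if (arr[i] == key)
                return i;   // early return as soon as the key is found
        }
        return -1;
    }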

Parallelizing Linear Search through OpenMP

In order to use OpenMP’s directives, we will have to include the header file 'omp.h'. When compiling, we’ll have to include the flag -fopenmp. All the directives start with #pragma omp ... .
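
For example, with GCC or Clang (the file name here is assumed):

    g++ -fopenmp linear_search.cpp -o linear_search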

In the above serial implementation, there is a window to parallelize the for loop. To parallelize the for loop, the OpenMP directive is #pragma omp parallel for. This directive tells the compiler to parallelize the for loop below. As I’ve said before, the compiler makes no checks to see if the loop is parallelizable; it is the responsibility of the programmer to make sure that the loop can be parallelized.


Whilst parallelizing the loop, it is not possible to return from within the if statement if the element is found. This is due to the fact that returning from the loop body would be an invalid branch out of an OpenMP structured block. Hence we will have to change the implementation a bit.


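A sketch of the reworked serial version, keeping the assumed signature from above but printing matches instead of returning early:

    #include <cstdio>

    // Scans the entire array and prints every match: no early return, and
    // nothing declared outside the loop body is modified.
    void linear_search(const int *arr, int n, int key) {
        for (int i = 0; i < n; i++) {
            if (arr[i] == key)
                std::printf("Found %d at index %d\n", key, i);
        }
    }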

The above snippet will keep on scanning the input till the end regardless of a match; it does not have any invalid branches out of the OpenMP block. Also, we can be sure that there won’t be any racing, since we are not modifying any variable declared outside the loop. Now, let’s parallelize this:
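
A minimal sketch of the parallel version; the only change from the sketch above is the directive on the loop:

    #include <cstdio>

    // Each thread scans its share of the iterations and prints any match it finds.
    void parallel_linear_search(const int *arr, int n, int key) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            if (arr[i] == key)
                std::printf("Found %d at index %d\n", key, i);
        }
    }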

It is as simple as this: all that had to be done was adding the compiler directive, and it gets taken care of completely. The implementation didn’t have to be changed much. We didn’t have to worry about the actual implementation, scheduling, data splitting and other details. There’s a high level of abstraction. Also, the code will still run in serial once the OpenMP directives have been removed, albeit with the modification.

It is noteworthy to mention that with the parallel implementation, each and every element will be checked regardless of a match, though in parallel. This is due to the fact that no thread can return directly after finding the element. So, our parallel implementation will be slower than the serial implementation if the element to be found is present in the range [0, (n/p)-1], where n is the length of the array and p is the number of parallel threads/sub-processes.

Further, if there is more than one instance of the required element present in the array, there is no guarantee that the parallel linear search will return the first match. The order in which threads run and terminate is non-deterministic; there is no way of knowing which thread will return first or last. To preserve the order of the matched results, another attribute (the index) has to be added to the results.

You can find the complete code of Parallel Linear Search here

Still have questions? Find me on Codementor

Selection Sort

Now, let’s look at our second example - Selection Sort.

Selection sort is an in-place comparison sorting algorithm. Selection sort is noted for its simplicity, and it has performance advantages over more complicated algorithms in certain situations, particularly where auxiliary memory is limited.

In selection sort, the list is divided into two parts, the sorted part at the left end and the unsorted part at the right end. Initially, the sorted part is empty and the unsorted part is the entire list.

The smallest/largest element is selected from the unsorted array and swapped with the leftmost element, and that element becomes a part of the sorted array. This process continues moving unsorted array boundary by one element to the right.

Selection Sort has the time complexity of O(n²), making it unsuitable for large lists.

By parallelizing the implementation, we make the multiple threads split the data amongst themselves and then search for the largest element independently on their part of the list. Each thread locally stores its own largest element. Then, these per-thread results are reduced into a single overall result.

Here’s the serial implementation:
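
A minimal serial sketch (here the largest element of the unsorted part is swapped to the end, which mirrors the description above and matches the maximum reduction used later):

    #include <utility>   // std::swap

    // Serial selection sort: repeatedly find the largest element of the
    // unsorted part arr[0..end] and swap it into position end.
    void selection_sort(int *arr, int n) {
        for (int end = n - 1; end > 0; end--) {
            int max_index = 0;
            for (int i = 1; i <= end; i++) {   // this inner scan is the parallelizable part
                if (arr[i] > arr[max_index])
                    max_index = i;
            }
            std::swap(arr[max_index], arr[end]);
        }
    }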

Parallelizing Selection Sort through OpenMP

First, let’s look at potential parallelization windows. The outer loop is not parallelizable owing to the fact that there are frequent changes made to the array and that every ith iteration needs the (i-1)th to be completed.


In selection sort, the parallelizable region is the inner loop, where we can spawn multiple threads to look for the maximum element in the unsorted array division. This could be done by making sure each thread has its own local copy of the local maximum. Then we can reduce each local maximum into one final maximum.

Reduction can be performed in OpenMP through the directive:
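
In outline, attached to a loop directive and using the placeholder names referenced below:

    #pragma omp parallel for reduction(op : va)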

where op defines the operation that needs to be applied whilst performing reduction on variable va.

However, in the implementation, we are not looking for the maximum element, instead we are looking for the index of the maximum element. For this we need to declare a new custom reduction. The ability to describe our own custom reduction is a testament to the flexibility that OpenMP provides.

Reduction can be declared by using:
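
The general form is roughly (the initializer clause is optional):

    #pragma omp declare reduction(identifier : type-list : combiner) initializer(initializer-expression)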

The declared reduction operates on a struct, so our custom maximum-index reduction will look something like this:

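A sketch of such a reduction; the struct name Compare, the reduction name maximum, and the INT_MIN seed are my own choices:

    #include <climits>

    // Carries both the value and the index of the current maximum.
    struct Compare {
        int val;
        int index;
    };

    // Custom reduction: merge two partial results by keeping the one with the
    // larger value. Each thread's private copy starts at {INT_MIN, 0}, so any
    // real element wins the comparison.
    #pragma omp declare reduction(maximum : struct Compare : omp_out = omp_in.val > omp_out.val ? omp_in : omp_out) \
            initializer(omp_priv = {INT_MIN, 0})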

Now, let’s work on parallelizing the inner loop through OpenMP. We’ll need to store both the maximum value as well as its index.
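
A sketch of the parallelized inner loop, reusing the Compare struct and the maximum reduction from above:

    #include <utility>   // std::swap

    // Selection sort whose inner maximum search runs in parallel.
    void parallel_selection_sort(int *arr, int n) {
        for (int end = n - 1; end > 0; end--) {
            struct Compare best;
            best.val = arr[end];     // seed the shared result with the current last element
            best.index = end;

            // Each thread works on a private copy of best; the copies are merged
            // by the custom maximum reduction when the loop finishes.
            #pragma omp parallel for reduction(maximum : best)
            for (int i = 0; i < end; i++) {
                if (arr[i] > best.val) {
                    best.val = arr[i];
                    best.index = i;
                }
            }

            std::swap(arr[best.index], arr[end]);   // move the maximum to the end
        }
    }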

“Correctness”

Now that we’ve parallelized our serial implementation, let’s see if the program produces the required output. For that, we can have a simple verify function that checks if the array is sorted.
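
A minimal sketch of such a check (the function name is an assumption):

    #include <cstdio>

    // Returns true if arr[0..n-1] is sorted in non-decreasing order.
    bool verify_sorted(const int *arr, int n) {
        for (int i = 1; i < n; i++) {
            if (arr[i - 1] > arr[i]) {
                std::printf("Not sorted (at index %d)\n", i);
                return false;
            }
        }
        std::printf("Sorted\n");
        return true;
    }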


After running the new sort implementation with the verify function for 100000 elements:

So, the parallel implementation is equivalent to the serial implementation and produces the required output.

You can find the complete code of Parallel Selection sort here.

Mergesort

Mergesort is one of the most popular sorting techniques. It is the typical example for demonstrating the divide-and-conquer paradigm.

Merge sort (also commonly spelled mergesort) is an efficient, general-purpose, comparison-based sorting algorithm.

Mergesort has a worst-case serial growth of O(n log n).

Sorting an array: A[p .. r] using mergesort involves three steps.

1) Divide Step

If a given array A has zero or one element, simply return; it is already sorted. Otherwise, split A[p .. r] into two subarrays A[p .. q] and A[q + 1 .. r], each containing about half of the elements of A[p .. r]. That is, q is the halfway point of A[p .. r].

2) Conquer Step


Conquer by recursively sorting the two subarrays A[p .. q] and A[q + 1 .. r].

3) Combine Step

Combine the elements back in A[p .. r] by merging the two sorted subarrays A[p .. q] and A[q + 1 .. r] into a sorted sequence. To accomplish this step, we will define a procedure MERGE (A, p, q, r).

We can parallelize the “conquer” step, where the array is recursively sorted over the left and right subarrays: the left and the right subarrays can be sorted in parallel.

Here’s the serial implementation:
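
A minimal sketch following the MERGE(A, p, q, r) outline above (zero-based, inclusive indices; std::vector is my choice of container):

    #include <vector>

    // Merge the two sorted subarrays arr[p..q] and arr[q+1..r] back into arr[p..r].
    void merge(std::vector<int> &arr, int p, int q, int r) {
        std::vector<int> left(arr.begin() + p, arr.begin() + q + 1);
        std::vector<int> right(arr.begin() + q + 1, arr.begin() + r + 1);
        int n1 = (int)left.size(), n2 = (int)right.size();
        int i = 0, j = 0, k = p;
        while (i < n1 && j < n2)
            arr[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < n1) arr[k++] = left[i++];
        while (j < n2) arr[k++] = right[j++];
    }

    // Serial mergesort over arr[p..r].
    void merge_sort(std::vector<int> &arr, int p, int r) {
        if (p < r) {
            int q = p + (r - p) / 2;    // divide: halfway point
            merge_sort(arr, p, q);      // conquer: left half
            merge_sort(arr, q + 1, r);  // conquer: right half
            merge(arr, p, q, r);        // combine
        }
    }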

Parallelizing Merge Sort through OpenMP

As stated before, the parallelizable region is the “conquer” part. We need to make sure that the left and the right sub-arrays are sorted simultaneously. We need to run both the left and right sections in parallel.

This can be done in OpenMP using directive:
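
Presumably this refers to the sections construct:

    #pragma omp parallel sections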

And each section that has to be parallelized should be enclosed with the directive:
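
That is the per-section marker:

    #pragma omp section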

Now, let’s work on parallelizing both sections through OpenMP:
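
A sketch of the parallel conquer step, reusing the merge routine from the serial sketch above (note that nested parallel regions are serialized by default unless nesting is enabled):

    #include <vector>

    // Mergesort in which the two recursive calls run in separate OpenMP sections.
    void merge_sort_parallel(std::vector<int> &arr, int p, int r) {
        if (p < r) {
            int q = p + (r - p) / 2;

            #pragma omp parallel sections
            {
                #pragma omp section
                {
                    merge_sort_parallel(arr, p, q);        // sort the left half
                }
                #pragma omp section
                {
                    merge_sort_parallel(arr, q + 1, r);    // sort the right half
                }
            }

            merge(arr, p, q, r);   // combine the two sorted halves
        }
    }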

The above will parallelize both the left and the right recursive calls.

“Correctness”

Now that we’ve parallelized our serial mergesort implementation, let’s see if the program produces the required output. For that, we can use the verify function that we used for our selection sort example.


Great, so the parallel implementation works. You can find the parallel implementation here


That’s it for now; if you have any comments, please leave them below.

