# Fastest sorting algorithm for distributed systems (Parallel Radix Sort) [Difficulty: Medium]

Hi everybody!

The time has come to talk about **parallelization** and **MPI**, the heart of our topic and my project. First, the problem the parallel Radix Sort can solve will be presented. Then, we will see the algorithm before talking about my implementation. Lastly, it will be interesting to expose performance results.

## Problem the parallel Radix Sort can solve

Every day we collect more and more data to be analyzed. With today's **large-scale data** and **large-scale problems**, **parallel computing** is a powerful ally and an important tool. While processing data, one recurrent step is sorting. In my second blog post, I highlighted the importance of sorting algorithms. **Parallel sorting** is one of the important components of parallel computing. It allows us to **reduce the sorting time** and to **sort more data, amounts that can't be sorted serially**. Indeed, the data we want to sort may be so huge that it can't fit on one single computer, because **computers are memory bounded**. Or it may take too much time using only one computer, time we can't afford. Thus, for both memory and time reasons, we can use **distributed systems**, with our **data distributed across several computers**, **and sort them that way**. But how? Let's first define the context properly and state what we want.

**What we have**

- **P resources** able to communicate between themselves, store and process our data. They can be CPUs, GPUs, computers… **Each resource has a unique rank between 0 and P-1**.
- **N data**, **distributed equally** across our resources. This means that **each resource has the same amount of data**.
- The data are too huge to be stored or processed by only one resource.

**What we want**

- **Sort** the data across the resources, in a distributed way. After the sort, resource 0 should have the lowest part of the sorted data, resource 1 the next, and so on.
- We want to be **as fast as possible** and **consume as little memory as possible** on each resource.

Like all **distributed** and **parallel** systems, an issue we have to be careful of is the **communication** between resources. It has to be as minimal as possible for parallel algorithms to be efficient; otherwise, we spend more time communicating than treating the problem itself. Through communications, we exchange data and information between the resources. The more data we exchange, the more time it takes.

There is a bunch of **parallel sorts**, among them **sample sort**, **bitonic sort**, **column sort**, etc. Each of them has its pros and cons, but, as far as I know, not many satisfy all of our requirements above. Often, to be efficient, they need to gather all the data, or a huge part of it, on one resource at one point or another. This is not suitable. They can be implemented without gathering the data on one resource but, most of the time, they are then no longer efficient enough because of the communication overhead involved. The **parallel Radix Sort** is one of those which meets our requirements while being efficient. It is currently known as **the fastest internal sorting method for distributed-memory multiprocessors**.

Now, we will use the word **processor** instead of resource because it is the word often used in **HPC**.

## Parallel Radix Sort

I recommend reading my two previous blog posts, where I detailed the **serial Radix Sort**, because the parallel version is entirely based on it. So if you don't know the Radix Sort, it is better to read them first. **The notations used here were introduced in those posts**.

In general, **parallel sorts** consist of multiple rounds of serial sort, called **local sort**, performed by each processor in parallel, followed by movement of keys among processors, called the **redistribution step**. Local sort and data redistribution may be interleaved and iterated a few times depending on the algorithms used. The **parallel Radix Sort also follows this pattern**.

We assume that each processor has the same amount of data; otherwise the processors' workload would be **unbalanced** because of the **local sort**. Indeed, if one processor has more data than the others, it will take more time to achieve its local sort and the total sorting time will be greater. However, if the data are equally distributed, the processors will take the same time to achieve their local sort and none of them will have to wait for another to finish. As a result, the total sorting time will be shorter.

The idea of the **parallel Radix Sort** is, **for each key digit**, to first sort the data locally on each processor according to the **digit** value. All the processors do this concurrently. Then, we **compute which processors have to send which portions of their local data to which other processors** in order to have the distributed list sorted across processors **according to that digit**. After iterating over the last key digit, the distributed array is sorted as we want. Below is the parallel algorithm.

```
Input:  rank (rank of the processor),
        L (portion of the distributed data held by this processor)
Output: the distributed data sorted across the processors,
        with the same amount of data on each processor

1. for each key digit i, from the least significant digit to the most significant digit:
2.     use Counting Sort or Bucket Sort to sort L according to the i'th key digit
3.     share information with the other processors to figure out which local data
       to send where, and what to receive from which processors, in order to have
       the distributed array sorted across processors according to the i'th key digit
4.     proceed to these exchanges of data between processors
```
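To make the loop concrete, here is a minimal single-process simulation of it, with P in-memory vectors standing in for P processors and no MPI involved. It works in base 10 to match the example below (in practice base 256 is used), and all names are mine, not from the actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using List = std::vector<unsigned>;

// i'th decimal digit of a key (i = 0 is the least significant digit)
unsigned digit(unsigned key, unsigned i) {
    for (unsigned k = 0; k < i; ++k) key /= 10;
    return key % 10;
}

// One simulated run of the algorithm: procs[p] plays the role of processor p.
void parallel_radix_sort_sim(std::vector<List>& procs, unsigned num_digits) {
    const std::size_t chunk = procs[0].size();  // same amount of data everywhere
    for (unsigned i = 0; i < num_digits; ++i) {
        // step 2: each processor stably sorts its local data on the i'th digit
        for (List& local : procs)
            std::stable_sort(local.begin(), local.end(),
                [i](unsigned a, unsigned b) { return digit(a, i) < digit(b, i); });
        // steps 3 and 4, simulated: gather digit value by digit value, rank by
        // rank (this order is what makes the redistribution stable), then
        // re-split the result into equal chunks, one per processor
        List merged;
        for (unsigned d = 0; d < 10; ++d)
            for (const List& local : procs)
                for (unsigned key : local)
                    if (digit(key, i) == d) merged.push_back(key);
        for (std::size_t p = 0; p < procs.size(); ++p)
            procs[p].assign(merged.begin() + p * chunk,
                            merged.begin() + (p + 1) * chunk);
    }
}
```

In the real algorithm, of course, step 3 only exchanges counts and step 4 moves each portion directly to its destination processor; the explicit `merged` list exists here only because everything lives in one memory space.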

Each processor runs the algorithm with its rank and its portion of the distributed data. This is a "high-level" parallel Radix Sort algorithm. It describes what to do but not how to do it, because there are many ways of doing steps **3.** and **4.**, depending on parameters like the architecture, environment and communication tool used. Let's go through an **example** and then I will explain how I have chosen to implement it.

We start with the distributed unsorted list above to run our parallel Radix Sort. **P** equals **3** because there are **3** **processors** and **N** equals **15**. We will run the example in **base 10** for simplicity but don’t forget that **we can use any number base we want** and in practice, base 256 is used as explained in my previous posts. Also for simplicity, we deal with **unsigned integer** data and the sorting keys are the numbers themselves. To sort **signed integers** and other data types, please refer to my previous article.

The keys having fewer digits (2, 1 and 5) have been extended to the same number of digits as the maximum in the list (two here) by adding zeros as MSD. If you don't understand why, please refer to my **second article**. It doesn't change the value of the keys.

The challenging part when implementing can be the **redistribution step**. We have to think about, **for a given processor, what information from the other processors is required to figure out where to send the local data**. It is not complicated: we want the distributed array to be sorted according to one particular key digit (the i'th digit). In our example, the value of a digit is between 0 and 9. We have sorted the data locally on each processor, therefore each processor knows how many of its keys have their i'th digit equal to a given digit value. By sharing this information with the other processors, and getting theirs in return, we can determine which data each processor has to send and receive.

In the example above, all the processors (**from rank 0 to P-1**) first have to send their local data whose key **LSD** equals 0 to processor **p0**, until that one is full and can't receive data anymore. There is no such data here, so we continue with the next digit value, which is 1. Processor **p0** keeps its data whose key LSD equals 1, then receives those from processor **p1**, and finally those from processor **p2**, which has none. **The processor order really matters** and has to be respected. Once done, we repeat the same with the value 2, then 3 and so on until 9. When **p0** is full, we continue by filling **p1** and so on. **Careful**: **the redistribution step has to be stable** too, like the sort used in each iteration of the Radix Sort, and for the same reasons. This is why we said **the order matters and has to be respected**; otherwise it doesn't sort.

The algorithm is still correct if we swap the local sort step and the data redistribution step. In practice, however, it is not suitable because, to send data efficiently, we often need the data to be contiguous in memory. Two elements having the same i'th digit will most probably be sent to the same processor, so it pays off to sort the data locally first.

We still have one iteration to do according to the algorithm. Let’s finish.

We have done the same thing but according to the last digit this time. The algorithm is now finished and, good news, our distributed array is sorted as we wanted.

**I have presented here the parallel LSD Radix Sort**. But, like the serial version, there is a variant called the parallel **MSD** Radix Sort, which is the parallelization of the **serial MSD** version. **I have implemented both** to sort the **int8_t** data type, and my performance results were better with my **LSD** version. This is why I continued with the **LSD** version to generalize it to other integer sizes. It is also the reason why, since the beginning, I have focused on presenting the **LSD** version and don't go into much more detail on the **MSD** one.

## My implementation

**MPI** is used for communications between processors.

I have used the **Counting Sort** and not the Bucket Sort, because my implementation with the Bucket Sort had extra overhead due to memory management. Indeed, unless we make a first pass through the keys to count them before moving them into the buckets, we don't know the length of the buckets in advance. Therefore, each of my buckets was a **std::vector**, and although they are very well implemented and optimized, I still lost performance to memory management. The problem is absolutely not the **std::vector** class itself; it comes from the fact that on each processor, each bucket has a different size, depending on the **key characteristics**, and we can't predict them. So, instead of making a first pass to count and find out the length of each bucket before creating them with the appropriate sizes, I opted for the Counting Sort, which is ultimately almost the same, because we also count before moving the data. The difference is that we don't use buckets anymore; we use a prefix sum instead.
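As a sketch of that idea (not the actual dimer::sort code), here is a stable Counting Sort of 32-bit keys on their i'th byte, using a 256-entry count array and an exclusive prefix sum instead of buckets:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable Counting Sort of `keys` on their byte'th byte (byte = 0 is the LSB).
void counting_sort_byte(std::vector<uint32_t>& keys, unsigned byte) {
    std::array<std::size_t, 256> count{};                  // zero-initialized
    const unsigned shift = 8 * byte;
    for (uint32_t k : keys)                                // 1st pass: count
        ++count[(k >> shift) & 0xFF];
    std::size_t sum = 0;                                   // exclusive prefix sum:
    for (auto& c : count) {                                // count[v] becomes the
        std::size_t tmp = c;                               // start offset of the
        c = sum;                                           // keys with byte value v
        sum += tmp;
    }
    std::vector<uint32_t> out(keys.size());
    for (uint32_t k : keys)                                // 2nd pass: stable move
        out[count[(k >> shift) & 0xFF]++] = k;
    keys.swap(out);
}
```

Applying it for bytes 0 through 3 gives a full LSD Radix Sort of uint32_t keys, since each pass is stable.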

To do step **3.** of the algorithm, I save the **local counts** from the Counting Sort on each processor and share them with the other processors via **MPI_Allgather**. That way, **each processor knows, for each digit value, how many keys with that i'th byte value there are on each processor**, and from that it is easy to figure out where to send which data, as explained in the example of the **Parallel Radix Sort** section (just above this "My implementation" section).
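The bookkeeping after the allgather can be sketched as follows; here `all_counts` plays the role of the buffer MPI_Allgather would fill, and the function name is mine, not from the actual code:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// all_counts[p][v] = number of keys on processor p whose i'th byte equals v
// (what each processor holds after the MPI_Allgather of the local counts).
// Returns, for processor `rank`, the global start position of its keys with
// byte value v: every key with a smaller byte value comes first, then the
// keys with value v on lower-ranked processors (processor order = stability).
// Dividing these positions by the chunk size per processor then tells which
// processor each portion of local data must be sent to.
std::array<std::size_t, 256>
global_offsets(const std::vector<std::array<std::size_t, 256>>& all_counts,
               std::size_t rank) {
    std::array<std::size_t, 256> offset{};
    std::size_t sum = 0;
    for (std::size_t v = 0; v < 256; ++v)
        for (std::size_t p = 0; p < all_counts.size(); ++p) {
            if (p == rank) offset[v] = sum;  // our keys with value v start here
            sum += all_counts[p][v];
        }
    return offset;
}
```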

Step **4.** is managed using MPI **one-sided communications** instead of send and receive, which are two-sided communications. I tried both and, most of the time, performance was either similar or better with one-sided communications. MPI one-sided communications are more efficient when the hardware supports **RMA** operations. We can use them in step **4.** of the parallel Radix Sort because we don't need synchronization for each data movement, only once they are all done. The data movements are **totally independent** in step **4.**; they can be made in **any order** as long as we know where each element has to go.

In terms of memory, **on each processor**, I use a **static array** (**std::array**) of **256 integers** for the **local counts**. In addition to that, I keep a "duplicate" of the input array. The original local array is used to receive the data from the other processors in step **4.**; thus, at the end, the array is sorted in this same input buffer. The copy is used to store the locally sorted data and send them to the right processors. It is possible to implement it without duplicating the array, but a huge amount of communications and synchronizations would be necessary, and the communications would no longer be independent. In my opinion, that is too much time lost for the memory gained.

As said previously, **there are several ways of doing steps 3. and 4.** It is also possible, for example, to build the global counts across processors (for step **3.**) using other MPI collective communications like MPI_Scan. To send the data in step **4.**, we can also use MPI_Alltoallv instead of one-sided communications, but that requires sorting the data locally again after receiving. I tried several alternatives, and what I have exposed here is the one giving me the best performance.

## Performance analysis

As **hardware**, I have used the ICHEC **Kay** cluster to run all the **benchmarks** presented in this section. The **framework** used to get the **execution times** is **google benchmark**. The numbers to sort are generated with **std::mt19937**, a Mersenne Twister **pseudo-random generator** of 32-bit numbers with a state size of 19937 bits. For each execution time measurement, I use **ten different seeds** (1, 2, 13, 38993, 83030, 90, 25, 28, 10 and 73) to generate **ten different arrays** (of the same length) to sort. Then, the **mean** of the execution times is registered as the execution time for that length. These ten seeds were chosen **randomly**. I proceed like that because the execution time also depends on the data, so it gives us a more accurate estimation.
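The input generation can be reproduced along these lines; the function name, array length and value range are placeholders of mine, only the generator and the idea of fixed seeds come from the text:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One benchmark input array from a seed: the same seed always yields the
// same array, so every sort implementation is timed on identical data.
std::vector<int32_t> make_input(uint32_t seed, std::size_t n) {
    std::mt19937 gen(seed);                        // 32-bit Mersenne Twister
    std::uniform_int_distribution<int32_t> dist;   // defaults to [0, INT32_MAX]
    std::vector<int32_t> v(n);
    for (int32_t& x : v) x = dist(gen);
    return v;
}
```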

#### Strong scaling

Strong scaling is defined as **how the solution time varies with the number of processors for a fixed total problem size**.
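In other words, strong scaling is usually summarized by the speedup S(P) = T(1) / T(P) and the parallel efficiency S(P) / P; perfect scalability means S(P) = P. A tiny helper, with made-up timings in the usage below:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// times[p - 1] is the measured time with p processors, for a fixed problem size.
double speedup(const std::vector<double>& times, std::size_t procs) {
    return times[0] / times[procs - 1];
}

// Efficiency of 1.0 means the run is on the perfect scalability curve.
double efficiency(const std::vector<double>& times, std::size_t procs) {
    return speedup(times, procs) / static_cast<double>(procs);
}
```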

We can notice that my implementation seems good because, in this case, it is better than std::sort and boost::spreadsort whatever the number of processors used. Plus, it seems to be very close to the perfect scalability. My implementation here is faster than boost::spreadsort even in serial (with only one processor) because, when the array is not distributed, I am using **ska_sort_copy** instead of boost::spreadsort. **ska_sort_copy** is a great and very fast implementation of the serial Radix Sort. Actually, it is the fastest I have found, which is why it is my reference to compute the perfect scalability.

The problem with the previous graph is the **scale**. It is difficult to distinguish the perfect scalability and dimer::sort curves because their times are too low compared to std::sort's execution time. This is why we often use a **log-log scale** to plot execution times in computer science.

With a log-log scale we can see better what is happening. Now, we are able to tell that up to 8 processors, the perfect scalability and dimer::sort curves are the same. It means that up to 8 processors, theoretically, we can't do better than what we already have to sort 1GB of int8_t. Beyond 8 processors this is not the case anymore, most probably because of communication overhead, since there are more and more processors. Let's verify it with another graph.

#### Distribution of execution time

Indeed, when the number of processors increases, the communication part in the total execution time increases too. The **local sort** step (in blue) contains zero communication. All communications are done during the **data redistribution** step (in orange). These communications are one **MPI_Allgather** and several MPI one-sided communications.

Until now, we have seen performance results only for int8_t. That is good but limited, because the int8_t value range is too small and not representative of real-world problems and data. What about **int32_t**, which is the "normal" integer size?

#### int32_t, sorting time as a function of the array length

The performance is also satisfying. Using dimer::sort, we go faster than anything else once the array is big enough, and this is what we expect when we use HPC tools. For small array sizes, we don't really expect an improvement, because they can easily be managed serially, and quickly enough. Plus, we add extra steps to treat the problem in a parallel way; therefore, when the problem size is small, it is faster serially because we don't have these extra steps. It becomes faster in parallel when these extra steps are a small workload compared to the time needed to sort serially.

## Demonstration

Let’s finish by visualizing my implementation! We easily distinguish the two different steps of the algorithm in this visualization.
