StackOverflow    [ Questions ]   [ Tags ]   [ Users ]   [ Badges ]   [ Unanswered ]  |  [LOG IN]  |  [SIGN UP]

Home > Questions > Why is processing a sorted array faster...

Why is processing a sorted array faster than
processing an unsorted array?


c++   java   performance   branch-prediction   cpu-architecture

Asked: 2012-06-27  |  Modified: 2023-08-15  |  Viewed: 26,847,312 times  |  Asked by: Bjorn Pollex  |  Reputation: 5,432
[linked: 53]  [related: 89]

-------------------------------------------------------------------------------------------------------------------

 [+]

26847

 [-]


[bookmark]
[follow]

Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster:

#include <algorithm>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might just be a language or compiler anomaly, so I tried Java... the same result. Why does sorting the data speed up the code?


[share]   [edit]   [follow]   [flag]     edited Jun 28, 2012 at 22:00  |  Bjorn Pollex (5.4k rep)

-------------------------------------------------------------------------------------------------------------------
Comments (38):
▲ 1847   Mysticial: "Upvoted. This is one of the most interesting performance questions I've seen on SO." — Jun 27, 2012
▲ 923   templatetypedef: "The answer explains branch prediction! Read it." — Jun 27, 2012
▲ 412   user3932000: "This should be a required reading for every CS undergraduate." — Jun 28, 2012

26 Answers
[Sorted by: Highest score (default)]
-------------------------------------------------------------------------------------------------------------------

 [+]

88,430

 [-]


✓ ACCEPTED
Branch Prediction.

What is a Branch Predictor?

Consider a railroad junction:
                    [ ONCOMING TRAIN ]
                           |
              o============o============o
             /               (switch)    \
            o                             o
     [TRACK A]                       [TRACK B]

Imagine you are a train operator who cannot see ahead of the junction. You must guess: will the train go left or right?

You stop the train, ask the switchman which track to use, then start again. Each guess costs time. But what if you could predict the switch?

If you guess right, the train continues without stopping.
If you guess wrong, the train must back up and take the other track. This costs time.

This is essentially what happens with an if-statement in your CPU.

Modern CPUs are pipelined — they fetch, decode, and execute multiple instructions simultaneously, before the previous instructions finish. When the CPU encounters a branch (an if statement), it must predict which branch will be taken so the pipeline can continue loading instructions.

A wrong prediction causes the pipeline to be flushed and refilled — a costly penalty of approximately 10-20 clock cycles.


Why does sorting help?

With the UNSORTED array:

data[] =  91  38 201  17 142  55 238   3  99 176 ...
          ^--- Crossing 128? UNPREDICTABLE!
          The CPU cannot predict the pattern.
          Misprediction rate: ~50%
          Result: SLOW (11.54 seconds)

With the SORTED array:

data[] =  2   3   5   7  11  13 ... 128 131 139 ...
          |<---- all below 128 ---->|<-- all above -->|
                                    ^
          The branch predictor sees:
          FFFFFFFFF...FFFTTTTTT...TTT  (F=false, T=true)
          Pattern: all F's then all T's. VERY PREDICTABLE!
          Misprediction rate: ~0% (except the transition)
          Result: FAST (1.93 seconds)

The fix: Branchless code
You can avoid branch prediction entirely using a bitwise trick:
// Original (branch-dependent):
if (data[c] >= 128)
    sum += data[c];

// Branchless version (C++):
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

// This is fast regardless of sorted order.
// No branch = no misprediction penalty.

On the author's machine, the branchless version runs in 1.70 seconds regardless of whether the data is sorted or not.

Further reading:
► Wikipedia: Branch predictor
► Agner Fog's optimization manuals
► Intel 64 and IA-32 Architectures Optimization Reference Manual


answered Jun 27, 2012 at 13:10  |  Mysticial (rep: 462,137)     [share]   [edit]   [follow]   [flag]

Comments (53):
▲ 4,201   Zan Lynx: "The railroad analogy should be in every computer science textbook." — Jun 27, 2012
▲ 1,876   Andy Prowl: "This answer is a work of art. Truly." — Jun 28, 2012
▲ 934   Puppy: "I've been programming for 15 years and never fully understood branch prediction until this post." — Jun 28, 2012
▲ 203   Nick: "Note: on ARM processors the misprediction penalty is typically less severe (~3-5 cycles)." — Jun 30, 2012

-------------------------------------------------------------------------------------------------------------------

 [+]

34,187

 [-]
If you are curious about even more optimizations...

For those interested in the low-level details: modern CPUs use a Two-Level Adaptive Predictor (or similar) that keeps a history table of recent branch outcomes. For a sorted array, after the first few iterations the predictor becomes 100% accurate (long runs of identical outcomes).

The exact penalty depends on the CPU microarchitecture:
CPU              | Pipeline depth | Misprediction penalty
-----------------|----------------|----------------------
Intel Pentium 4  |   ~20 stages   |   ~20 cycles
Intel Core 2     |   ~14 stages   |   ~15 cycles
Intel Core i7    |   ~14-19 stages|   ~15-20 cycles
AMD Athlon 64    |   ~12 stages   |   ~12 cycles

answered Jun 27, 2012 at 15:44  |  Daniel Lemire (rep: 21,548)     [share]   [edit]   [follow]
[ Question Stats ]
Asked:
2012-06-27
Modified:
2023-08-15
Viewed:
26,847,312 times
Answers:
26
Linked:
53
Related:
89

[ Hot Network Qs ]
Why does Java have int and Integer?

What is 'undefined behavior' in C?

How do I undo the last git commit?

What is the difference between == and === in JavaScript?

How to check if a string contains a substring in C++?

[ Related Tags ]
c++ x 312,441
java x 289,002
performance x 44,129
branch-prediction x 187
cpu-architecture x 1,432
optimization x 21,334

Contact
feedback@stackoverflow.com

Stack Overflow  |  Questions  |  Jobs  |  Developer Survey  |  Help  |  Privacy Policy  |  Terms of Service
site design © 1998-1999 Stack Exchange Inc.  |  user contributions licensed under cc by-sa 3.0 with attribution required
rev 1998.10.14.1234  |  Best viewed in Netscape 4.0 at 1024x768  |  Courier New recommended
🎵 MIDI Player
♫ LOADING...
0:00
Initialising...
🎵 MIDI Player
♫ LOADING...
0:00
Initialising...
🎵 MIDI Player
♫ LOADING...
0:00
Initialising...
🎵 MIDI Player
♫ LOADING...
0:00
Initialising...
🎵 MIDI Player
♫ LOADING...
0:00
Initialising...
🎵 MIDI Player
♫ LOADING...
0:00
Initialising...