Parallelizing simple algorithms : Fibonacci

This entry is part 2 of 2 in the series Parallelizing simple algorithms
Fibonacci Blocks

Fibonacci Blocks

It took me a little more than I expected, but finally I managed to write the first post for the Parallelizing simple algorithms series.

As I “promised”, I will start these series by parallelizing the Fibonacci Number sequence generation. As you probably already know, Fibonacci numbers are the integer sequence produced by the following relationship:

   F0 = 0
   F1 = 1
   Fn = Fn-2 + Fn-1

The resulting sequence is: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, . . .

Lets start with programming!

Erlang is a functional language, so programming with Erlang is based on functions and recursion.

For these two reasons, implementing the above recursive definition of Fibonacci sequence is straightforward.

Single process solution
fib0(0) ->
    0;
fib0(1) ->
    1;
fib0(N) ->
    fib0(N-1) + fib0(N-2).

I believe that this implementation is self-documented, right? This is the plain, single threaded (or single process in Erlang terminology) solution.

Lets parallelize it a little :-).

Two processes solution

First, lets break the computation into two processes:

fib1(0) ->
    0;
fib1(1) ->
    1;
fib1(N) ->
    Self = self(),
    spawn(fun() ->
		  Self ! fib0(N-1)
	  end),
    spawn(fun() ->
		  Self ! fib0(N-2)
	  end),
    receive
	F1 ->
	    receive
		F2 ->
		    F1 + F2
	    end
    end.

Two process are spawned, each of them running the sequential code of fib0. After the two processes complete, they send back the results to the process that spawned them. This process acts as a coordinator.

In the introductory post of the series, I presented the necessary theory behind parallelization. As you can easily udnerstand, the solution fib1 uses Data parallelism, since it repeats the same computation on different data in parallel.

Unlimited parallelism

Lets write the solution that uses Unlimited parallelism:

fib2(0) ->
    0;
fib2(1) ->
    1;
fib2(N) ->
    Self = self(),
    spawn(fun() ->
		  Self ! fib2(N-1)
	  end),
    spawn(fun() ->
		  Self ! fib2(N-2)
	  end),
    receive
	F1 ->
	    receive
		F2 ->
		    F1 + F2
	    end
    end.

In this case, every call to fib2 creates 2 new worker processes. So, in order to calculate fib2(2), 2 extra processes are spawned, fib2(3), 4 extra, fib2(4), 8 extra, …, fib2(N), 2N extra processes. It should be obvious that this exponential process growth cannot work.. As usual, unlimited parallelism is not an appropriate solution.

So, 1-process, 2-processes, 2N-processes.. what else? Well, in the introductory post I omitted to write about something rather important.

Amdahl’s law

Amdahl’s law “sets” a limitation on the performance improvement that can be achieved in case that only one part of the system is improved. In parallel programming the meaning of the law can be simplified in the following example: if a piece of code cannot be parallelized, it immediately sets a limit to the speedup that can be achieved by parallelizing. For example, if a sequential application runs in 10 seconds and the bigger non-parallelizable part executes in 3 seconds, parallelizing the system is impossible to achieve better performance than these 3 seconds! Read more about Amdahl’s law on Wikipedia.

Why did I mention Amdahl’s law? Well, right now the best solution is the 2-threaded one. But, why not trying with more processes?

Four processes solution

Just to mention that by saying 4 process, I mean only the worker ones, excluding the coordinators. For example, in this case, the four worker processes are complemented with 3 coordinator ones (imagine it as a binary tree, where the leaves are the workers, and the others the coordinators).

fib3(0) ->
    0;
fib3(1) ->
    1;
fib3(N) ->
    Self = self(),
    spawn(fun() ->
		  Self ! fib1(N-1)
	  end),
    spawn(fun() ->
		  Self ! fib1(N-2)
	  end),
    receive
	F1 ->
	    receive
		F2 ->
		    F1 + F2
	    end
    end.

fib3 follows fib1‘s logic. It spawns 2 fib1 processes, which in turn become coordinators and spawn 2 more processes each, so 4 workers in total.

Lets make a more generic solution.

Generic solution
fib4(0, _) ->
    0;
fib4(1, _) ->
    1;
fib4(N, M) when N >= M ->
    Self = self(),
    spawn(fun() ->
		  Self ! fib4(N-1, M)
	  end),
    spawn(fun() ->
		  Self ! fib4(N-2, M)
	  end),
    receive
	F1 ->
	    receive
		F2 ->
		    F1 + F2
	    end
    end;
fib4(N, M) when N < M ->
    Self = self(),
    spawn(fun() ->
		  Self ! fib0(N-1)
	  end),
    spawn(fun() ->
		  Self ! fib0(N-2)
	  end),
    receive
	F1 ->
	    receive
		F2 ->
		    F1 + F2
	    end
    end.
 
fib4_5(N) ->
    fib4(N, N-4).
fib4_6(N) ->
    fib4(N, N-5).
fib4_7(N) ->
    fib4(N, N-6).
fib4_8(N) ->
    fib4(N, N-7).
fib4_9(N) ->
    fib4(N, N-8).
fib4_10(N) ->
    fib4(N, N-9).

fib4(N, M) spawns 2 new processes while N >= M. That means that if we call fib4(20, 18), we will have 10 fib0 worker processes. To check it, draw the “spawning” binary tree..

Disappointment

I never understood why people use Fibonacci as a parallelization example.. The sequential implementation below should explain why!

fib5(0) ->
    0;
fib5(1) ->
    1;
fib5(N) ->
    fib_bottom(0, 1, 2, N).
 
fib_bottom(L, H, C, N) when C >= N ->
    L+H;
fib_bottom(L, H, C, N) ->
    fib_bottom(H, L+H, C+1, N).

It is not complicated and, as you will see in the benchmarking section, it is fast.

Benchmarking

In order to make the measuring process easier, I created some testing functions in this module. Rename it from “test.txt” to “test.erl” to use it. The interface is:

run(Module, Functions, Parameters, Repeats)

Which runs, times, and generates the average time to Repeats runs of Functions in the Module, with Parameters. Parameters is a list of parameters. One item of the list is used as a parameter at a time.

In order to run my tests, I booted my Ubuntu and used a virtual console, so that I manage a somehow unobstructed execution.

Measurements

So, first:

test:run(fib, [fib0, fib1, fib2, fib3, fib4_6, fib4_8, fib5], [10, 15, 20, 25], 3).

I will present the output parameter by parameter:

Parameter: 10 -- Start ----------
>>fib0:   AVG: 44.7 micros   | [40,38,56]
>>fib1:   AVG: 106.7 micros  | [128,97,95]
>>fib2:   AVG: 2202.0 micros | [2341,2274,1991]
>>fib3:   AVG: 99.7 micros   | [93,131,75]
>>fib4_6: AVG: 1584.0 micros | [1481,1570,1701]
>>fib4_8: AVG: 2207.7 micros | [2332,2145,2146]
>>fib5:   AVG: 8.3 micros    | [21,2,2]

The results are not very “stable”, but we can draw some conclusions. The computation (data) is very short in this case, so the overhead of parallelizing is more than the benefit gained. This overhead is due to the process creation and the communication (small, because the processes run locally).

As expected, the non-recursive execution is the fastest and the performance is decreasing as the concurrency increases. From now on, I will simply not take into consideration the fib5 solution. It will be the fastest, so it is not interesting..

Then,

Parameter: 15 -- Start ----------
>>fib0:   AVG: 200.7 micros   | [190,218,194]
>>fib1:   AVG: 215.3 micros   | [225,200,221]
>>fib2:   AVG: 28233.3 micros | [29247,28091,27362]
>>fib3:   AVG: 243.3 micros   | [261,278,191]
>>fib4_6: AVG: 1822.0 micros  | [1887,1758,1821]
>>fib4_8: AVG: 4603.7 micros  | [4906,4408,4497]
>>fib5:   AVG: 2.7 micros     | [3,3,2]

We can already see that parallelizing is close to take effect. Both with the Fibonacci of 10 and of 15, it is obvious that the unlimited parallelism (fib2) is not working!

Parameter: 20 -- Start ----------
>>fib0:   AVG: 2332.0 micros   | [2403,2294,2299]
>>fib1:   AVG: 1391.3 micros   | [1528,1335,1311]
>>fib2:   AVG: 349473.3 micros | [340365,331448,376607]
>>fib3:   AVG: 2222.7 micros   | [2146,2377,2145]
>>fib4_6: AVG: 3894.0 micros   | [4029,3909,3744]
>>fib4_8: AVG: 6970.3 micros   | [7371,6468,7072]
>>fib5:   AVG: 3.3 micros      | [4,3,3]

Parallelism “won”!! We can also notice that the measurements are stabilizing as the computation grows bigger. The two workers version (fib1) is the fastest. My processor has 2 cores, right?

Lets continue

Parameter: 25 -- Start ----------
>>fib0:   AVG: 23961.0 micros   | [24605,23901,23377]
>>fib1:   AVG: 23197.3 micros   | [23195,23126,23271]
>>fib2:   AVG: 3915766.3 micros | [3833718,4018613,3894968]
>>fib3:   AVG: 12501.0 micros   | [13569,11962,11972]
>>fib4_6: AVG: 13220.3 micros   | [13163,13159,13339]
>>fib4_8: AVG: 15519.0 micros   | [15137,15212,16208]
>>fib5:   AVG: 3.0 micros       | [4,3,2]

The winner is fib3 (4 worker processes)! So, why isn’t the best solution the one with #threds == #cores? Do you remember Amdahl’s law?

What happens is that the longer execution (of the two workers), as explained by the Amdahl’s law, sets the lower limit that can be achieved with parallelizing. fib1(25) spawns fib0(24) and fib0(23). If we independently measure them:

2> timer:tc(fib, fib0, [24]).
{21558,75025}
3> timer:tc(fib, fib0, [23]).
{13324,46368}

We can understand that the fib0(24) call set the performance limit (~21500 microseconds). On the other hand, fib3(25) uses the fib0(23), fib0(22), fib0(22), and fib0(21) workers:

4> timer:tc(fib, fib0, [23]).
{13106,46368}
5> timer:tc(fib, fib0, [22]).
{8297,28657}
6> timer:tc(fib, fib0, [21]).
{5036,17711}

We can see that fib0(23) sets the time limit. Just to mention that the numbers appearing in these last measurements tend to be higher because I am taking them with the full OS loaded.

We can notice this effect in the CPU utilization also (taken from a call to fib1(40)):

The Cores Utilization while computing fib1(40)


After the faster of the two workers finishes (fib0(38)), the processing power is not fully utilized!

No more Unlimited Parallelism

As the fib2 is out of the contest (in the opposite way than fib5), I decided not to use it for tests with parameter higher than 25 (I was bored to wait :-P). With this in mind, I continued with the following tests:

test:run(fib, [fib0, fib1, fib3, fib4_6, fib4_8, fib5], [30, 40, 41, 42], 3).
Parameter: 30 -- Start ----------
>>fib0:   AVG: 152250.0 micros | [159005,151430,146315]
>>fib1:   AVG: 103349.7 micros | [99814,105109,105126]
>>fib3:   AVG: 88625.7 micros  | [87651,94793,83433]
>>fib4_6: AVG: 88146.3 micros  | [89569,91849,83021]
>>fib4_8: AVG: 84386.3 micros  | [84099,84560,84500]
>>fib5:   AVG: 3.3 micros      | [4,3,3]

Parameter: 40 -- Start ----------
>>fib0:   AVG: 26544721.7 micros | [23168055,28247057,28219053]
>>fib1:   AVG: 21077023.7 micros | [21047387,21032536,21151148]
>>fib3:   AVG: 17733562.3 micros | [16394906,17871455,18934326]
>>fib4_6: AVG: 16344892.0 micros | [16330718,16220437,16483521]
>>fib4_8: AVG: 16306948.3 micros | [16206762,16446090,16267993]
>>fib5:   AVG: 10.0 micros       | [5,22,3]

Parameter: 41 -- Start ----------
>>fib0:   AVG: 47261960.3 micros | [47514272,47015432,47256177]
>>fib1:   AVG: 34266284.3 micros | [32578166,34148118,36072569]
>>fib3:   AVG: 33014777.7 micros | [30061008,36471809,32511516]
>>fib4_6: AVG: 29779055.7 micros | [29715648,30000853,29620666]
>>fib4_8: AVG: 29979546.3 micros | [30140758,29720501,30077380]
>>fib5:   AVG: 4.0 micros        | [5,4,3]

Parameter: 42 -- Start ----------
>>fib0:   AVG: 78229815.0 micros | [77388307,79200132,78101006]
>>fib1:   AVG: 55877460.3 micros | [54076288,56712128,56843965]
>>fib3:   AVG: 52787426.3 micros | [50520162,52976514,54865603]
>>fib4_6: AVG: 50517868.7 micros | [52275541,50447961,48830104]
>>fib4_8: AVG: 49112005.0 micros | [49638298,50027250,47670467]
>>fib5:   AVG: 4.7 micros        | [5,5,4]

As we can see, the fib4_8 was generally the fastest. In case of parameter 30, there was a time decrease of ((152250-84386.3)/152250) = 44.6%!

More processes

Now, lets compare the fib4_* solutions, since they “look” more competitive with higher parameters.

test:run(fib, [fib4_6, fib4_7, fib4_8, fib4_9, fib4_10], [30, 40, 41, 42], 3).
Parameter: 30 -- Start ----------
>>fib4_6:  AVG: 138976.0 micros | [132737,154900,129291]
>>fib4_7:  AVG: 129955.3 micros | [130810,129461,129595]
>>fib4_8:  AVG: 131024.3 micros | [130824,130597,131652]
>>fib4_9:  AVG: 131907.3 micros | [131885,132068,131769]
>>fib4_10: AVG: 141973.3 micros | [133670,158155,134095]

Parameter: 40 -- Start ----------
>>fib4_6:  AVG: 16494758.3 micros | [16551363,16241427,16691485]
>>fib4_7:  AVG: 16330959.0 micros | [16080028,16451705,16461144]
>>fib4_8:  AVG: 16324207.7 micros | [16235259,16569521,16167843]
>>fib4_9:  AVG: 16436575.7 micros | [16565366,16617853,16126508]
>>fib4_10: AVG: 16563677.0 micros | [16627788,16308796,16754447]

Parameter: 41 -- Start ----------
>>fib4_6:  AVG: 32710283.7 micros | [32047340,30147917,35935594]
>>fib4_7:  AVG: 38032535.3 micros | [35256794,41837955,37002857]
>>fib4_8:  AVG: 36848240.7 micros | [36764344,38615563,35164815]
>>fib4_9:  AVG: 36471677.7 micros | [38606656,36677249,34131128]
>>fib4_10: AVG: 35954456.0 micros | [34594636,37284090,35984642]

Parameter: 42 -- Start ----------
>>fib4_6:  AVG: 58263948.0 micros | [59179553,58509795,57102496]
>>fib4_7:  AVG: 62005141.7 micros | [58846929,57270566,69897930]
>>fib4_8:  AVG: 59224815.0 micros | [65590241,56918707,55165497]
>>fib4_9:  AVG: 56413852.3 micros | [57657222,57665314,53919021]
>>fib4_10: AVG: 54196771.0 micros | [54126552,54178425,54285336]

As you can see, the results are very close to each other. Though, it looks like that as the computation increases, the more processes created help. Of course, we saw that unlimited parallelism does not help, so there is a cap after which instead of gaining performance, we decrease it.

One last thing

I kept the best for the end ;-). The tests of fib0 were done with SMP enabled. So, although the code is single-threaded, Erlang runtime did its best to utilize the 2 cores. That’s why I turned of SMP in Erlang and run the fib0 tests again:

test:run(fib, [fib0], [10, 20, 30, 40], 3).
Parameter: 10 -- Start ----------
>>fib0: AVG: 327.0 micros      | [958,12,11]

Parameter: 20 -- Start ----------
>>fib0: AVG: 2510.0 micros      | [2553,2514,2463]

Parameter: 30 -- Start ----------
>>fib0: AVG: 301458.7 micros    | [301585,301662,301129]

Parameter: 40 -- Start ----------
>>fib0: AVG: 37499354.0 micros  | [37525814,37473104,37499144]

Parameter: 41 -- Start ----------
>>fib0: AVG: 63639971.0 micros  | [63708948,63542924,63668041]

Parameter: 42 -- Start ----------
>>fib0: AVG: 104382637.3 micros | [105213567,104180367,103753978

Comparing the fastest solution for each parameter with the fib0 gives:

Parameter fib0 (μs) Best (μs) Decrease
10 327.0 99.7 69.5%
20 2510.0 1391.3 44.5%
30 301458.7 84386.3 72.0%
40 37499354.0 16324207.7 56.4%
41 63639971.0 32710283.7 48.6%
42 104382637.3 54196771.0 48.0%

Wow! 72% less time.. Not bad for a Core2Duo processor!

Conlusions

So, what conclusions can we draw from what we saw?

  • Data parallelization is straightforward in some problems
  • The benefit from parallelizing can be quite big
  • Anmdhl’s law effect should always be taken into account
  • Unlimited parallelism.. does not work
  • SMP helps even with “sequential” code
  • Erlang is cool for these kind of problems.. Recursion and parallelization is more than simple!
  • Fibonacci is both a good and bad example for parallelization. The non-recursive solution runs in O(n) steps, where n is the input to Fibonacci function

You can get the source code here. (rename from fib.txt to fib.erl)

That’s it..

Mergesort will probably be the next “victim”!

Series NavigationParallelizing simple algorithms series

4 Responses to “Parallelizing simple algorithms : Fibonacci”

  • Yu Wang:

    Good article!
    I am also looking to MapReduce which is good to distributed computing.
    Maybe you can write an Erlang library for MapReduce.

    Here’s a presentation about Erlang I find very interesting.
    http://www.infoq.com/presentations/The-Evolution-of-the-Erlang-VM

    • Thank you! MapReduce should be “simple” to implement in Erlang..

      Very interesting video btw..

    • Dante D. Dinawanao:

      There’s the Disco Project: http://discoproject.org/. According to its website:

      Disco is an open-source, large scale data analysis platform. The platform includes an implementation of MapReduce, among other things. As the original framework, Disco supports parallel computations over massive data sets, running on an unreliable cluster of computers.

      The Disco core is written in Erlang, a functional language that is designed for building robust, fault-tolerant, distributed applications. Users of Disco typically write jobs in Python, making it possible to express even complex algorithms in only in tens of lines of code. This means that you can rapidly develop programs to process massive amounts of data.

  • […] can find a detailed post about Fibonacci sequence here. I duplicate some results here in order to show you that, some times, converting the function to […]

Leave a Reply

*