{"id":209,"date":"2011-02-27T18:28:18","date_gmt":"2011-02-27T17:28:18","guid":{"rendered":"http:\/\/trigonakis.com\/blog\/?p=209"},"modified":"2011-05-02T10:59:37","modified_gmt":"2011-05-02T09:59:37","slug":"parallelizing-simple-algorithms-fibonacci","status":"publish","type":"post","link":"http:\/\/trigonakis.com\/blog\/2011\/02\/27\/parallelizing-simple-algorithms-fibonacci\/","title":{"rendered":"Parallelizing simple algorithms : Fibonacci"},"content":{"rendered":"<div class=\"seriesmeta\">This entry is part 2 of 2 in the series <a href=\"http:\/\/trigonakis.com\/blog\/series\/parallelizing-simple-algorithms\/\" class=\"series-62\" title=\"Parallelizing simple algorithms\">Parallelizing simple algorithms<\/a><\/div><div id=\"attachment_210\" style=\"width: 188px\" class=\"wp-caption alignleft\"><a href=\"http:\/\/trigonakis.com\/blog\/wp-content\/uploads\/2011\/02\/FibonacciBlocks.png\"><img aria-describedby=\"caption-attachment-210\" loading=\"lazy\" class=\"size-full wp-image-210 \" title=\"Fibonacci Blocks\" src=\"http:\/\/trigonakis.com\/blog\/wp-content\/uploads\/2011\/02\/FibonacciBlocks.png\" alt=\"Fibonacci Blocks\" width=\"178\" height=\"110\" \/><\/a><p id=\"caption-attachment-210\" class=\"wp-caption-text\">Fibonacci Blocks<\/p><\/div>\n<p>It took me a little more than I expected, but finally I managed to write the first post for the <a href=\"http:\/\/trigonakis.com\/blog\/2011\/01\/23\/parallelizing-simple-algorithms-series\/\">Parallelizing simple algorithms series<\/a>.<\/p>\n<p>As I &#8220;promised&#8221;, I will start these series by parallelizing the <strong>Fibonacci Number<\/strong> sequence generation. As you probably already know, Fibonacci numbers are the integer sequence produced by the following relationship:<\/p>\n<pre>   F<code><sub>0<\/sub><\/code> = 0\r\n   F<code><sub>1<\/sub><\/code> = 1\r\n   F<code><sub>n<\/sub><\/code> = F<code><sub>n-2<\/sub><\/code> + F<code><sub>n-1<\/sub><\/code><\/pre>\n<p>The resulting sequence is: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, . . .<\/p>\n<h4>Lets start with programming!<\/h4>\n<p><!--more--><\/p>\n<p>Erlang is a functional language, so programming with Erlang is based on <strong>functions<\/strong> and <strong>recursion<\/strong>.<\/p>\n<p> For these two reasons, implementing the above recursive definition of Fibonacci sequence is straightforward.<\/p>\n<h5>Single process solution<\/h5>\n<pre lang=\"erlang\">\r\nfib0(0) ->\r\n    0;\r\nfib0(1) ->\r\n    1;\r\nfib0(N) ->\r\n    fib0(N-1) + fib0(N-2).\r\n<\/pre>\n<p>I believe that this implementation is self-documented, right? This is the plain, single threaded (or <strong>single process<\/strong> in Erlang terminology) solution.<\/p>\n<p>Lets <strong>parallelize<\/strong> it a little :-).<\/p>\n<h5>Two processes solution<\/h5>\n<p>First, lets break the computation into two processes:<\/p>\n<pre lang=\"erlang\">\r\nfib1(0) ->\r\n    0;\r\nfib1(1) ->\r\n    1;\r\nfib1(N) ->\r\n    Self = self(),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib0(N-1)\r\n\t  end),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib0(N-2)\r\n\t  end),\r\n    receive\r\n\tF1 ->\r\n\t    receive\r\n\t\tF2 ->\r\n\t\t    F1 + F2\r\n\t    end\r\n    end.\r\n<\/pre>\n<p>Two process are spawned, each of them running the sequential code of <code>fib0<\/code>. After the two processes complete, they send back the results to the process that spawned them. This process acts as a <em>coordinator<\/em>.<\/p>\n<p>In the <a href=\"http:\/\/trigonakis.com\/blog\/2011\/01\/23\/parallelizing-simple-algorithms-series\/\">introductory post<\/a> of the series, I presented the necessary theory behind parallelization. As you can easily udnerstand, the solution <code>fib1<\/code> uses <strong>Data parallelism<\/strong>, since it repeats the same computation on different data in parallel.<\/p>\n<h5>Unlimited parallelism<\/h5>\n<p>Lets write the solution that uses <strong>Unlimited parallelism<\/strong>:<\/p>\n<pre lang=\"erlang\">\r\nfib2(0) ->\r\n    0;\r\nfib2(1) ->\r\n    1;\r\nfib2(N) ->\r\n    Self = self(),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib2(N-1)\r\n\t  end),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib2(N-2)\r\n\t  end),\r\n    receive\r\n\tF1 ->\r\n\t    receive\r\n\t\tF2 ->\r\n\t\t    F1 + F2\r\n\t    end\r\n    end.\r\n<\/pre>\n<p>In this case, every call to <code>fib2<\/code> creates 2 new <em>worker<\/em> processes. So, in order to calculate <code>fib2(2)<\/code>, 2 extra processes are spawned, <code>fib2(3)<\/code>, 4 extra, <code>fib2(4)<\/code>, 8 extra, &#8230;, <code>fib2(N)<\/code>, 2<sup>N<\/sup> extra processes. It should be obvious that this exponential process growth cannot work.. As usual, unlimited parallelism is not an appropriate solution.<\/p>\n<p>So, 1-process, 2-processes, 2<sup>N<\/sup>-processes.. what else? Well, in the introductory post I omitted to write about something rather important.<\/p>\n<h5>Amdahl&#8217;s law<\/h5>\n<p>Amdahl&#8217;s law &#8220;sets&#8221; a limitation on the performance improvement that can be achieved in case that only one part of the system is improved. In parallel programming the meaning of the law can be simplified in the following example: if a piece of code cannot be parallelized, it immediately sets a limit to the speedup that can be achieved by parallelizing. For example, if a sequential application runs in 10 seconds and the bigger non-parallelizable part executes in 3 seconds, parallelizing the system is impossible to achieve better performance than these 3 seconds! Read more about Amdahl&#8217;s law on <a href=\"https:\/\/secure.wikimedia.org\/wikipedia\/en\/wiki\/Amdahls_law\" target=\"_new\">Wikipedia<\/a>.<\/p>\n<p>Why did I mention Amdahl&#8217;s law? Well, right now the best solution is the 2-threaded one. But, why not trying with more processes?<\/p>\n<h5>Four processes solution<\/h5>\n<p>Just to mention that by saying 4 process, I mean only the worker ones, excluding the coordinators. For example, in this case, the four worker processes are complemented with 3 coordinator ones (imagine it as a binary tree, where the leaves are the workers, and the others the coordinators).<\/p>\n<pre lang=\"erlang\">\r\nfib3(0) ->\r\n    0;\r\nfib3(1) ->\r\n    1;\r\nfib3(N) ->\r\n    Self = self(),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib1(N-1)\r\n\t  end),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib1(N-2)\r\n\t  end),\r\n    receive\r\n\tF1 ->\r\n\t    receive\r\n\t\tF2 ->\r\n\t\t    F1 + F2\r\n\t    end\r\n    end.\r\n<\/pre>\n<p><code>fib3<\/code> follows <code>fib1<\/code>&#8216;s logic. It spawns 2 <code>fib1<\/code> processes, which in turn become coordinators and spawn 2 more processes each, so 4 workers in total.<\/p>\n<p>Lets make a more generic solution.<\/p>\n<h5>Generic solution<\/h5>\n<pre lang=\"erlang\">\r\nfib4(0, _) ->\r\n    0;\r\nfib4(1, _) ->\r\n    1;\r\nfib4(N, M) when N >= M ->\r\n    Self = self(),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib4(N-1, M)\r\n\t  end),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib4(N-2, M)\r\n\t  end),\r\n    receive\r\n\tF1 ->\r\n\t    receive\r\n\t\tF2 ->\r\n\t\t    F1 + F2\r\n\t    end\r\n    end;\r\nfib4(N, M) when N < M ->\r\n    Self = self(),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib0(N-1)\r\n\t  end),\r\n    spawn(fun() ->\r\n\t\t  Self ! fib0(N-2)\r\n\t  end),\r\n    receive\r\n\tF1 ->\r\n\t    receive\r\n\t\tF2 ->\r\n\t\t    F1 + F2\r\n\t    end\r\n    end.\r\n\r\nfib4_5(N) ->\r\n    fib4(N, N-4).\r\nfib4_6(N) ->\r\n    fib4(N, N-5).\r\nfib4_7(N) ->\r\n    fib4(N, N-6).\r\nfib4_8(N) ->\r\n    fib4(N, N-7).\r\nfib4_9(N) ->\r\n    fib4(N, N-8).\r\nfib4_10(N) ->\r\n    fib4(N, N-9).\r\n<\/pre>\n<p><code>fib4(N, M)<\/code> spawns 2 new processes while N >= M. That means that if we call <code>fib4(20, 18)<\/code>, we will have 10 <code>fib0<\/code> worker processes. To check it, draw the &#8220;spawning&#8221; binary tree..<\/p>\n<h5>Disappointment<\/h5>\n<p>I never understood why people use Fibonacci as a parallelization example.. The sequential implementation below should explain why!<\/p>\n<pre lang=\"erlang\">\r\nfib5(0) ->\r\n    0;\r\nfib5(1) ->\r\n    1;\r\nfib5(N) ->\r\n    fib_bottom(0, 1, 2, N).\r\n\r\nfib_bottom(L, H, C, N) when C >= N ->\r\n    L+H;\r\nfib_bottom(L, H, C, N) ->\r\n    fib_bottom(H, L+H, C+1, N).\r\n<\/pre>\n<p>It is not complicated and, as you will see in the benchmarking section, it is fast.<\/p>\n<h4>Benchmarking<\/h4>\n<p>In order to make the measuring process easier, I created some testing functions in this <a href=\"http:\/\/trigonakis.com\/blog\/wp-content\/uploads\/2011\/02\/test.txt\">module<\/a>. Rename it from &#8220;test.txt&#8221; to &#8220;test.erl&#8221; to use it. The interface is:<\/p>\n<pre lang=\"erlang\">\r\nrun(Module, Functions, Parameters, Repeats)\r\n<\/pre>\n<p>Which runs, times, and generates the average time to <i>Repeats<\/i> runs of <i>Functions<\/i> in the <i>Module<\/i>, with <i>Parameters<\/i>. <i>Parameters<\/i> is a list of parameters. One item of the list is used as a parameter at a time.<\/p>\n<p>In order to run my tests, I booted my Ubuntu and used a <a href=\"https:\/\/secure.wikimedia.org\/wikipedia\/en\/wiki\/Virtual_console\" target=\"_new\">virtual console<\/a>, so that I manage a somehow unobstructed execution.<\/p>\n<h5>Measurements<\/h5>\n<p>So, first:<\/p>\n<pre lang=\"erlang\">\r\ntest:run(fib, [fib0, fib1, fib2, fib3, fib4_6, fib4_8, fib5], [10, 15, 20, 25], 3).\r\n<\/pre>\n<p>I will present the output parameter by parameter:<\/p>\n<pre>\r\nParameter: 10 -- Start ----------\r\n>>fib0:   AVG: 44.7 micros   | [40,38,56]\r\n>>fib1:   AVG: 106.7 micros  | [128,97,95]\r\n>>fib2:   AVG: 2202.0 micros | [2341,2274,1991]\r\n>>fib3:   AVG: 99.7 micros   | [93,131,75]\r\n>>fib4_6: AVG: 1584.0 micros | [1481,1570,1701]\r\n>>fib4_8: AVG: 2207.7 micros | [2332,2145,2146]\r\n>>fib5:   AVG: 8.3 micros    | [21,2,2]\r\n<\/pre>\n<p>The results are not very &#8220;stable&#8221;, but we can draw some conclusions. The computation (data) is very short in this case, so the overhead of parallelizing is more than the benefit gained. This overhead is due to the <strong>process creation<\/strong> and the <strong>communication<\/strong> (small, because the processes run locally).<\/p>\n<p>As expected, the non-recursive execution is the fastest and the performance is decreasing as the concurrency increases. From now on, I will simply not take into consideration the <code>fib5<\/code> solution. It will be the fastest, so it is not interesting..<\/p>\n<p>Then,<\/p>\n<pre>\r\nParameter: 15 -- Start ----------\r\n>>fib0:   AVG: 200.7 micros   | [190,218,194]\r\n>>fib1:   AVG: 215.3 micros   | [225,200,221]\r\n>>fib2:   AVG: 28233.3 micros | [29247,28091,27362]\r\n>>fib3:   AVG: 243.3 micros   | [261,278,191]\r\n>>fib4_6: AVG: 1822.0 micros  | [1887,1758,1821]\r\n>>fib4_8: AVG: 4603.7 micros  | [4906,4408,4497]\r\n>>fib5:   AVG: 2.7 micros     | [3,3,2]\r\n<\/pre>\n<p>We can already see that parallelizing is close to take effect. Both with the Fibonacci of 10 and of 15, it is obvious that the <strong>unlimited parallelism (<code>fib2<\/code>) is not working<\/strong>!<\/p>\n<pre>\r\nParameter: 20 -- Start ----------\r\n>>fib0:   AVG: 2332.0 micros   | [2403,2294,2299]\r\n>>fib1:   AVG: 1391.3 micros   | [1528,1335,1311]\r\n>>fib2:   AVG: 349473.3 micros | [340365,331448,376607]\r\n>>fib3:   AVG: 2222.7 micros   | [2146,2377,2145]\r\n>>fib4_6: AVG: 3894.0 micros   | [4029,3909,3744]\r\n>>fib4_8: AVG: 6970.3 micros   | [7371,6468,7072]\r\n>>fib5:   AVG: 3.3 micros      | [4,3,3]\r\n<\/pre>\n<p><strong>Parallelism &#8220;won&#8221;<\/strong>!! We can also notice that the measurements are stabilizing as the computation grows bigger. The two workers version (<code>fib1<\/code>) is the fastest. My processor has 2 cores, right? <\/p>\n<p>Lets continue<\/p>\n<pre>\r\nParameter: 25 -- Start ----------\r\n>>fib0:   AVG: 23961.0 micros   | [24605,23901,23377]\r\n>>fib1:   AVG: 23197.3 micros   | [23195,23126,23271]\r\n>>fib2:   AVG: 3915766.3 micros | [3833718,4018613,3894968]\r\n>>fib3:   AVG: 12501.0 micros   | [13569,11962,11972]\r\n>>fib4_6: AVG: 13220.3 micros   | [13163,13159,13339]\r\n>>fib4_8: AVG: 15519.0 micros   | [15137,15212,16208]\r\n>>fib5:   AVG: 3.0 micros       | [4,3,2]\r\n<\/pre>\n<p>The winner is <strong><code>fib3<\/code><\/strong> (4 worker processes)! So, why isn&#8217;t the best solution the one with <strong>#threds == #cores<\/strong>? Do you remember <i>Amdahl&#8217;s law<\/i>?<\/p>\n<p>What happens is that the longer execution (of the two workers), as explained by the Amdahl&#8217;s law, sets the lower limit that can be achieved with parallelizing. <code>fib1(25)<\/code> spawns <code>fib0(24)<\/code> and <code>fib0(23)<\/code>. If we independently measure them:<\/p>\n<pre>\r\n2> timer:tc(fib, fib0, [24]).\r\n{21558,75025}\r\n3> timer:tc(fib, fib0, [23]).\r\n{13324,46368}\r\n<\/pre>\n<p>We can understand that the <code>fib0(24)<\/code> call set the performance limit (~21500 microseconds). On the other hand, <code>fib3(25)<\/code> uses the <code>fib0(23)<\/code>, <code>fib0(22)<\/code>, <code>fib0(22)<\/code>, and <code>fib0(21)<\/code> workers:<\/p>\n<pre>\r\n4> timer:tc(fib, fib0, [23]).\r\n{13106,46368}\r\n5> timer:tc(fib, fib0, [22]).\r\n{8297,28657}\r\n6> timer:tc(fib, fib0, [21]).\r\n{5036,17711}\r\n<\/pre>\n<p>We can see that fib0(23) sets the time limit. Just to mention that the numbers appearing in these last measurements tend to be higher because I am taking them with the full OS loaded.<\/p>\n<p>We can notice this effect in the CPU utilization also (taken from a call to <code>fib1(40)<\/code>):<br \/>\n<div id=\"attachment_249\" style=\"width: 410px\" class=\"wp-caption aligncenter\"><a href=\"http:\/\/trigonakis.com\/blog\/wp-content\/uploads\/2011\/02\/CPU-utilization.png\"><img aria-describedby=\"caption-attachment-249\" loading=\"lazy\" src=\"http:\/\/trigonakis.com\/blog\/wp-content\/uploads\/2011\/02\/CPU-utilization.png\" alt=\"\" title=\"CPU-utilization\" width=\"400\" height=\"226\" class=\"size-full wp-image-249\" srcset=\"http:\/\/trigonakis.com\/blog\/wp-content\/uploads\/2011\/02\/CPU-utilization.png 400w, http:\/\/trigonakis.com\/blog\/wp-content\/uploads\/2011\/02\/CPU-utilization-300x169.png 300w\" sizes=\"(max-width: 400px) 100vw, 400px\" \/><\/a><p id=\"caption-attachment-249\" class=\"wp-caption-text\">The Cores Utilization while computing fib1(40)<\/p><\/div><br \/>\nAfter the faster of the two workers finishes (<code>fib0(38)<\/code>), the processing power is not fully utilized!<\/p>\n<h5>No more Unlimited Parallelism<\/h5>\n<p>As the fib2 is out of the contest (in the opposite way than <code>fib5<\/code>), I decided not to use it for tests with parameter higher than 25 (I was bored to wait :-P). With this in mind, I continued with the following tests:<\/p>\n<pre lang=\"erlang\">\r\ntest:run(fib, [fib0, fib1, fib3, fib4_6, fib4_8, fib5], [30, 40, 41, 42], 3).\r\n<\/pre>\n<pre>\r\nParameter: 30 -- Start ----------\r\n>>fib0:   AVG: 152250.0 micros | [159005,151430,146315]\r\n>>fib1:   AVG: 103349.7 micros | [99814,105109,105126]\r\n>>fib3:   AVG: 88625.7 micros  | [87651,94793,83433]\r\n>>fib4_6: AVG: 88146.3 micros  | [89569,91849,83021]\r\n>>fib4_8: AVG: 84386.3 micros  | [84099,84560,84500]\r\n>>fib5:   AVG: 3.3 micros      | [4,3,3]\r\n\r\nParameter: 40 -- Start ----------\r\n>>fib0:   AVG: 26544721.7 micros | [23168055,28247057,28219053]\r\n>>fib1:   AVG: 21077023.7 micros | [21047387,21032536,21151148]\r\n>>fib3:   AVG: 17733562.3 micros | [16394906,17871455,18934326]\r\n>>fib4_6: AVG: 16344892.0 micros | [16330718,16220437,16483521]\r\n>>fib4_8: AVG: 16306948.3 micros | [16206762,16446090,16267993]\r\n>>fib5:   AVG: 10.0 micros       | [5,22,3]\r\n\r\nParameter: 41 -- Start ----------\r\n>>fib0:   AVG: 47261960.3 micros | [47514272,47015432,47256177]\r\n>>fib1:   AVG: 34266284.3 micros | [32578166,34148118,36072569]\r\n>>fib3:   AVG: 33014777.7 micros | [30061008,36471809,32511516]\r\n>>fib4_6: AVG: 29779055.7 micros | [29715648,30000853,29620666]\r\n>>fib4_8: AVG: 29979546.3 micros | [30140758,29720501,30077380]\r\n>>fib5:   AVG: 4.0 micros        | [5,4,3]\r\n\r\nParameter: 42 -- Start ----------\r\n>>fib0:   AVG: 78229815.0 micros | [77388307,79200132,78101006]\r\n>>fib1:   AVG: 55877460.3 micros | [54076288,56712128,56843965]\r\n>>fib3:   AVG: 52787426.3 micros | [50520162,52976514,54865603]\r\n>>fib4_6: AVG: 50517868.7 micros | [52275541,50447961,48830104]\r\n>>fib4_8: AVG: 49112005.0 micros | [49638298,50027250,47670467]\r\n>>fib5:   AVG: 4.7 micros        | [5,5,4]\r\n<\/pre>\n<p>As we can see, the fib4_8 was generally the fastest. In case of parameter 30, there was a time decrease of ((152250-84386.3)\/152250) = <strong>44.6%<\/strong>!<\/p>\n<h5>More processes<\/h5>\n<p>Now, lets compare the fib4_* solutions, since they &#8220;look&#8221; more competitive with higher parameters.<\/p>\n<pre lang=\"erlang\">\r\ntest:run(fib, [fib4_6, fib4_7, fib4_8, fib4_9, fib4_10], [30, 40, 41, 42], 3).\r\n<\/pre>\n<pre>\r\nParameter: 30 -- Start ----------\r\n>>fib4_6:  AVG: 138976.0 micros | [132737,154900,129291]\r\n>>fib4_7:  AVG: 129955.3 micros | [130810,129461,129595]\r\n>>fib4_8:  AVG: 131024.3 micros | [130824,130597,131652]\r\n>>fib4_9:  AVG: 131907.3 micros | [131885,132068,131769]\r\n>>fib4_10: AVG: 141973.3 micros | [133670,158155,134095]\r\n\r\nParameter: 40 -- Start ----------\r\n>>fib4_6:  AVG: 16494758.3 micros | [16551363,16241427,16691485]\r\n>>fib4_7:  AVG: 16330959.0 micros | [16080028,16451705,16461144]\r\n>>fib4_8:  AVG: 16324207.7 micros | [16235259,16569521,16167843]\r\n>>fib4_9:  AVG: 16436575.7 micros | [16565366,16617853,16126508]\r\n>>fib4_10: AVG: 16563677.0 micros | [16627788,16308796,16754447]\r\n\r\nParameter: 41 -- Start ----------\r\n>>fib4_6:  AVG: 32710283.7 micros | [32047340,30147917,35935594]\r\n>>fib4_7:  AVG: 38032535.3 micros | [35256794,41837955,37002857]\r\n>>fib4_8:  AVG: 36848240.7 micros | [36764344,38615563,35164815]\r\n>>fib4_9:  AVG: 36471677.7 micros | [38606656,36677249,34131128]\r\n>>fib4_10: AVG: 35954456.0 micros | [34594636,37284090,35984642]\r\n\r\nParameter: 42 -- Start ----------\r\n>>fib4_6:  AVG: 58263948.0 micros | [59179553,58509795,57102496]\r\n>>fib4_7:  AVG: 62005141.7 micros | [58846929,57270566,69897930]\r\n>>fib4_8:  AVG: 59224815.0 micros | [65590241,56918707,55165497]\r\n>>fib4_9:  AVG: 56413852.3 micros | [57657222,57665314,53919021]\r\n>>fib4_10: AVG: 54196771.0 micros | [54126552,54178425,54285336]\r\n<\/pre>\n<p>As you can see, the results are very close to each other. Though, it looks like that as the computation increases, the more processes created help. Of course, we saw that unlimited parallelism does not help, so there is a cap after which instead of gaining performance, we decrease it.<\/p>\n<h5>One last thing<\/h5>\n<p>I kept the best for the end ;-). The tests of <code>fib0<\/code> were done with SMP enabled. So, although the code is single-threaded, Erlang runtime did its best to utilize the 2 cores. That&#8217;s why I turned of SMP in Erlang and run the <code>fib0<\/code> tests again:<\/p>\n<pre lang=\"erlang\">\r\ntest:run(fib, [fib0], [10, 20, 30, 40], 3).\r\n<\/pre>\n<pre>\r\nParameter: 10 -- Start ----------\r\n>>fib0: AVG: 327.0 micros      | [958,12,11]\r\n\r\nParameter: 20 -- Start ----------\r\n>>fib0: AVG: 2510.0 micros      | [2553,2514,2463]\r\n\r\nParameter: 30 -- Start ----------\r\n>>fib0: AVG: 301458.7 micros    | [301585,301662,301129]\r\n\r\nParameter: 40 -- Start ----------\r\n>>fib0: AVG: 37499354.0 micros  | [37525814,37473104,37499144]\r\n\r\nParameter: 41 -- Start ----------\r\n>>fib0: AVG: 63639971.0 micros  | [63708948,63542924,63668041]\r\n\r\nParameter: 42 -- Start ----------\r\n>>fib0: AVG: 104382637.3 micros | [105213567,104180367,103753978\r\n<\/pre>\n<p>Comparing the fastest solution for each parameter with the <code>fib0<\/code> gives:<\/p>\n<table>\n<tr>\n<th>Parameter<\/th>\n<th>fib0 (\u03bcs)<\/th>\n<th>Best (\u03bcs)<\/th>\n<th>Decrease<\/th>\n<\/tr>\n<tr>\n<td>10<\/td>\n<td>327.0<\/td>\n<td>99.7<\/td>\n<td>69.5%<\/td>\n<\/tr>\n<tr>\n<td>20<\/td>\n<td>2510.0<\/td>\n<td>1391.3<\/td>\n<td>44.5%<\/td>\n<\/tr>\n<tr>\n<td>30<\/td>\n<td>301458.7<\/td>\n<td>84386.3<\/td>\n<td>72.0%<\/td>\n<\/tr>\n<tr>\n<td>40<\/td>\n<td>37499354.0<\/td>\n<td>16324207.7<\/td>\n<td>56.4%<\/td>\n<\/tr>\n<tr>\n<td>41<\/td>\n<td>63639971.0<\/td>\n<td>32710283.7<\/td>\n<td>48.6%<\/td>\n<\/tr>\n<tr>\n<td>42<\/td>\n<td>104382637.3<\/td>\n<td>54196771.0<\/td>\n<td>48.0%<\/td>\n<\/tr>\n<\/table>\n<p>Wow! <strong>72% less time..<\/strong> Not bad for a Core2Duo processor!<\/p>\n<h5>Conlusions<\/h5>\n<p>So, what conclusions can we draw from what we saw?<\/p>\n<ul>\n<li>Data parallelization is straightforward in some problems<\/li>\n<li>The benefit from parallelizing can be quite big<\/li>\n<li>Anmdhl&#8217;s law effect should always be taken into account<\/li>\n<li>Unlimited parallelism.. does not work<\/li>\n<li>SMP helps even with &#8220;sequential&#8221; code<\/li>\n<li>Erlang is cool for these kind of problems.. Recursion and parallelization is more than simple!<\/li>\n<li>Fibonacci is both a good and bad example for parallelization. The non-recursive solution runs in <code>O(n)<\/code> steps, where <code>n<\/code> is the input to Fibonacci function<\/li>\n<\/ul>\n<p>You can get the source code <a href=\"http:\/\/trigonakis.com\/blog\/wp-content\/uploads\/2011\/02\/fib1.txt\">here<\/a>. (rename from <code>fib.txt<\/code> to <code>fib.erl<\/code>)<\/p>\n<p><strong>That&#8217;s it<\/strong>..<\/p>\n<p><strong>Mergesort<\/strong> will probably be the next &#8220;victim&#8221;!<\/p>\n","protected":false},"excerpt":{"rendered":"<div class=\"seriesmeta\">This entry is part 2 of 2 in the series <a href=\"http:\/\/trigonakis.com\/blog\/series\/parallelizing-simple-algorithms\/\" class=\"series-62\" title=\"Parallelizing simple algorithms\">Parallelizing simple algorithms<\/a><\/div><p>It took me a little more than I expected, but finally I managed to write the first post for the Parallelizing simple algorithms series. As I &#8220;promised&#8221;, I will start these series by parallelizing the Fibonacci Number sequence generation. As you probably already know, Fibonacci numbers are the integer sequence produced by the following relationship: [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"jetpack_publicize_message":"","jetpack_is_tweetstorm":false,"jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false}}},"categories":[14,40,5,28],"tags":[66,26,39,27,42],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p1ouW6-3n","_links":{"self":[{"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/posts\/209"}],"collection":[{"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/comments?post=209"}],"version-history":[{"count":56,"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/posts\/209\/revisions"}],"predecessor-version":[{"id":584,"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/posts\/209\/revisions\/584"}],"wp:attachment":[{"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/media?parent=209"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/categories?post=209"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/trigonakis.com\/blog\/wp-json\/wp\/v2\/tags?post=209"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}