<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Parallel Matrix Multiplication with the Task Parallel Library (TPL)</title>
	<atom:link href="http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/feed/" rel="self" type="application/rss+xml" />
	<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/</link>
	<description></description>
	<lastBuildDate>Tue, 20 Dec 2011 21:26:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: Fabien Gigante</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-343</link>
		<dc:creator>Fabien Gigante</dc:creator>
		<pubDate>Tue, 20 Dec 2011 21:26:00 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-343</guid>
		<description>You&#039;re welcome.

In the meantime, I also tried an OpenCL implementation. For large matrices the performance gain is really significant. (It works better with float than with double on my GPU.)

If you have time, adding an OpenCL implementation to your benchmark could be a good idea, for the sake of completeness in your new post.

As a starting point, here is my very first draft of an OpenCL implementation:

namespace MatrixMultiplication
{
  public class OpenCL
  {
    static public ComputeContext Context { get; private set; }
    static public ComputeDevice Device { get; private set; }
    static OpenCL()
    {
      ComputePlatform platform = ComputePlatform.Platforms.FirstOrDefault();
      if (platform == null) throw new Exception(&quot;ERROR - No OpenCL platform available&quot;);
      ComputeContextPropertyList properties = new ComputeContextPropertyList(platform);
      Context = new ComputeContext(ComputeDeviceTypes.All, properties, null, IntPtr.Zero);
      Device = Context.Devices.FirstOrDefault();
      if (Device == null) throw new Exception(&quot;ERROR - No OpenCL device available&quot;);
      if (!Device.Extensions.Contains(&quot;cl_khr_fp64&quot;)) throw new Exception(&quot;ERROR - OpenCL device doesn&#039;t support doubles&quot;);
    }

    public static TimeSpan Multiply(int N)
    {
      Random rand = new Random();
      MatrixOpenCL b = new MatrixOpenCL(N, N); b.Apply(d =&gt; rand.Next());
      MatrixOpenCL c = new MatrixOpenCL(N, N); c.Apply(d =&gt; rand.Next());
      Stopwatch stopwatch = Stopwatch.StartNew();
      MatrixOpenCL a = b * c;
      stopwatch.Stop();
      return stopwatch.Elapsed;
    }
  }

  public class MatrixOpenCL
  {
    const String programSource = @&quot;
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void MatrixMultiply(__global double* C, __global double* A, __global double* B, int aCols, int bCols)
{
  int i = get_global_id(0);
  int j = get_global_id(1);
  double value = 0;
  for (int k = 0; k &lt; aCols; ++k)
    value += A[i*aCols+k] * B[k*bCols+j];
  C[i*bCols+j] = value;
}&quot;;

    static ComputeKernel matrixMultiplyKernel;
    static MatrixOpenCL()
    {
      ComputeProgram program = new ComputeProgram(OpenCL.Context, new string[] { programSource });
      program.Build(null, null, null, IntPtr.Zero);
      matrixMultiplyKernel = program.CreateKernel(&quot;MatrixMultiply&quot;);
    }

    public int NumLines { get; private set; }
    public int NumCols { get; private set; }
    double[] elements;
    public MatrixOpenCL(int numLines, int numCols)
    {
      NumLines = numLines; NumCols = numCols;
      elements = new double[numLines*numCols];
    }
    public static MatrixOpenCL operator *(MatrixOpenCL a, MatrixOpenCL b)
    {
      if (a.NumCols != b.NumLines) throw new ArgumentException();
      MatrixOpenCL c = new MatrixOpenCL(a.NumLines, b.NumCols);
      ComputeBuffer bufA = new ComputeBuffer(OpenCL.Context, ComputeMemoryFlags.ReadOnly, a.elements);
      ComputeBuffer bufB = new ComputeBuffer(OpenCL.Context, ComputeMemoryFlags.ReadOnly, b.elements);
      ComputeBuffer bufC = new ComputeBuffer(OpenCL.Context, ComputeMemoryFlags.ReadWrite, c.elements);
      matrixMultiplyKernel.SetMemoryArgument(0, bufC);
      matrixMultiplyKernel.SetMemoryArgument(1, bufA);
      matrixMultiplyKernel.SetMemoryArgument(2, bufB);
      matrixMultiplyKernel.SetValueArgument(3, a.NumCols);
      matrixMultiplyKernel.SetValueArgument(4, b.NumCols);
      ComputeCommandQueue commands = new ComputeCommandQueue(OpenCL.Context, OpenCL.Device, ComputeCommandQueueFlags.None);
      long y = ((long)c.NumCols + OpenCL.Device.MaxWorkGroupSize - 1) / OpenCL.Device.MaxWorkGroupSize;
      long x = ((long)c.NumLines * c.NumCols / y + OpenCL.Device.MaxWorkGroupSize - 1) / OpenCL.Device.MaxWorkGroupSize;
      commands.Execute(matrixMultiplyKernel, null, new long[] { c.NumLines, c.NumCols }, new long[] { c.NumLines / x, c.NumCols / y }, null);
      commands.ReadFromBuffer(bufC, ref c.elements, true, null);
      commands.Finish(); commands.Dispose();
      bufA.Dispose(); bufB.Dispose(); bufC.Dispose();
      return c;
    }
    public double this[int i, int j]
    {
      get { return elements[i * NumCols + j]; }
      set { elements[i * NumCols + j] = value; }
    }
    public void Apply(Func&lt;double, double&gt; function)
    {
      Parallel.For(0, elements.Length, i =&gt; elements[i] = function(elements[i]));
    }

  }

}</description>
		<content:encoded><![CDATA[<p>You&#8217;re welcome.</p>
<p>In the meantime, I also tried an OpenCL implementation. For large matrices the performance gain is really significant. (It works better with float than with double on my GPU.)</p>
<p>If you have time, adding an OpenCL implementation to your benchmark could be a good idea, for the sake of completeness in your new post.</p>
<p>As a starting point, here is my very first draft of an OpenCL implementation:</p>
<p>namespace MatrixMultiplication<br />
{<br />
  public class OpenCL<br />
  {<br />
    static public ComputeContext Context { get; private set; }<br />
    static public ComputeDevice Device { get; private set; }<br />
    static OpenCL()<br />
    {<br />
      ComputePlatform platform = ComputePlatform.Platforms.FirstOrDefault();<br />
      if (platform == null) throw new Exception(&quot;ERROR - No OpenCL platform available&quot;);<br />
      ComputeContextPropertyList properties = new ComputeContextPropertyList(platform);<br />
      Context = new ComputeContext(ComputeDeviceTypes.All, properties, null, IntPtr.Zero);<br />
      Device = Context.Devices.FirstOrDefault();<br />
      if (Device == null) throw new Exception(&quot;ERROR - No OpenCL device available&quot;);<br />
      if (!Device.Extensions.Contains(&quot;cl_khr_fp64&quot;)) throw new Exception(&quot;ERROR - OpenCL device doesn&#039;t support doubles&quot;);<br />
    }</p>
<p>    public static TimeSpan Multiply(int N)<br />
    {<br />
      Random rand = new Random();<br />
      MatrixOpenCL b = new MatrixOpenCL(N, N); b.Apply(d =&gt; rand.Next());<br />
      MatrixOpenCL c = new MatrixOpenCL(N, N); c.Apply(d =&gt; rand.Next());<br />
      Stopwatch stopwatch = Stopwatch.StartNew();<br />
      MatrixOpenCL a = b * c;<br />
      stopwatch.Stop();<br />
      return stopwatch.Elapsed;<br />
    }<br />
  }</p>
<p>  public class MatrixOpenCL<br />
  {<br />
    const String programSource = @&quot;<br />
#pragma OPENCL EXTENSION cl_khr_fp64 : enable<br />
__kernel void MatrixMultiply(__global double* C, __global double* A, __global double* B, int aCols, int bCols)<br />
{<br />
  int i = get_global_id(0);<br />
  int j = get_global_id(1);<br />
  double value = 0;<br />
  for (int k = 0; k &lt; aCols; ++k)<br />
    value += A[i*aCols+k] * B[k*bCols+j];<br />
  C[i*bCols+j] = value;<br />
}&quot;;</p>
<p>    static ComputeKernel matrixMultiplyKernel;<br />
    static MatrixOpenCL()<br />
    {<br />
      ComputeProgram program = new ComputeProgram(OpenCL.Context, new string[] { programSource });<br />
      program.Build(null, null, null, IntPtr.Zero);<br />
      matrixMultiplyKernel = program.CreateKernel(&quot;MatrixMultiply&quot;);<br />
    }</p>
<p>    public int NumLines { get; private set; }<br />
    public int NumCols { get; private set; }<br />
    double[] elements;<br />
    public MatrixOpenCL(int numLines, int numCols)<br />
    {<br />
      NumLines = numLines; NumCols = numCols;<br />
      elements = new double[numLines*numCols];<br />
    }<br />
    public static MatrixOpenCL operator *(MatrixOpenCL a, MatrixOpenCL b)<br />
    {<br />
      if (a.NumCols != b.NumLines) throw new ArgumentException();<br />
      MatrixOpenCL c = new MatrixOpenCL(a.NumLines, b.NumCols);<br />
      ComputeBuffer bufA = new ComputeBuffer(OpenCL.Context, ComputeMemoryFlags.ReadOnly, a.elements);<br />
      ComputeBuffer bufB = new ComputeBuffer(OpenCL.Context, ComputeMemoryFlags.ReadOnly, b.elements);<br />
      ComputeBuffer bufC = new ComputeBuffer(OpenCL.Context, ComputeMemoryFlags.ReadWrite, c.elements);<br />
      matrixMultiplyKernel.SetMemoryArgument(0, bufC);<br />
      matrixMultiplyKernel.SetMemoryArgument(1, bufA);<br />
      matrixMultiplyKernel.SetMemoryArgument(2, bufB);<br />
      matrixMultiplyKernel.SetValueArgument(3, a.NumCols);<br />
      matrixMultiplyKernel.SetValueArgument(4, b.NumCols);<br />
      ComputeCommandQueue commands = new ComputeCommandQueue(OpenCL.Context, OpenCL.Device, ComputeCommandQueueFlags.None);<br />
      long y = ((long)c.NumCols + OpenCL.Device.MaxWorkGroupSize - 1) / OpenCL.Device.MaxWorkGroupSize;<br />
      long x = ((long)c.NumLines * c.NumCols / y + OpenCL.Device.MaxWorkGroupSize - 1) / OpenCL.Device.MaxWorkGroupSize;<br />
      commands.Execute(matrixMultiplyKernel, null, new long[] { c.NumLines, c.NumCols }, new long[] { c.NumLines / x, c.NumCols / y }, null);<br />
      commands.ReadFromBuffer(bufC, ref c.elements, true, null);<br />
      commands.Finish(); commands.Dispose();<br />
      bufA.Dispose(); bufB.Dispose(); bufC.Dispose();<br />
      return c;<br />
    }<br />
    public double this[int i, int j]<br />
    {<br />
      get { return elements[i * NumCols + j]; }<br />
      set { elements[i * NumCols + j] = value; }<br />
    }<br />
    public void Apply(Func&lt;double, double&gt; function)<br />
    {<br />
      Parallel.For(0, elements.Length, i =&gt; elements[i] = function(elements[i]));<br />
    }</p>
<p>  }</p>
<p>}</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-342</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Tue, 20 Dec 2011 02:54:00 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-342</guid>
		<description>Fabien,

I made the adjustment you mentioned and the two parallel implementations are virtually identical. I will make a new post soon with those results. Thanks for reading and testing.

-Ian</description>
		<content:encoded><![CDATA[<p>Fabien,</p>
<p>I made the adjustment you mentioned and the two parallel implementations are virtually identical. I will make a new post soon with those results. Thanks for reading and testing.</p>
<p>-Ian</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-341</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Tue, 22 Nov 2011 21:50:00 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-341</guid>
		<description>Thanks Fabien, I&#039;ll take another look.</description>
		<content:encoded><![CDATA[<p>Thanks Fabien, I&#8217;ll take another look.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fabien Gigante</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-340</link>
		<dc:creator>Fabien Gigante</dc:creator>
		<pubDate>Sun, 13 Nov 2011 23:48:00 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-340</guid>
		<description>Looking at the IL, I was able to dig more into the problem described in my previous comment.

The performance difference very likely comes from the fact that, in the SimpleParallel lambda, the C# compiler decided to capture the variables by reference (equivalent to the C++11 lambda closure [&amp;]). By providing a static method, I forced those variables to be passed by value, which improved performance.

To confirm this analysis, I changed the static function to the following signature, forcing the values to be passed by reference. As expected, it brings back the original performance of the SimpleParallel method.

private static void Multiply(int i, ref int N, ref double[][] A, ref double[][] B, ref double[][] C)

The next challenge was to modify the initial lambda expression to get the good performance without needing a static function. The following syntax worked: it was probably explicit enough for the C# compiler to understand that it could capture the variables by value.

Parallel.For(0, N, i =&gt; { int n = N; double[][] a = A, b = B, c = C; for (int k = 0; k &lt; n; k++) for (int j = 0; j &lt; n; j++) c[i][j] += a[i][k] * b[k][j]; });
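
A minimal sketch of the same idea as a complete program, not the benchmark code itself: the C# compiler hoists captured variables into fields of a compiler-generated closure object, and the lambda below copies them into locals once per row so that the inner loops touch only locals. The names CaptureSketch, Filled and Multiply are made up for this illustration, and whether this fully explains the timing difference is still my assumption.

```csharp
using System;
using System.Threading.Tasks;

public static class CaptureSketch
{
    // Hypothetical helper: builds an n-by-n jagged matrix filled with one value.
    public static double[][] Filled(int n, double value)
    {
        var m = new double[n][];
        for (int i = 0; i != n; i++)
        {
            m[i] = new double[n];
            for (int j = 0; j != n; j++) m[i][j] = value;
        }
        return m;
    }

    // Multiplies two square jagged matrices, one row of the result per parallel iteration.
    public static double[][] Multiply(double[][] A, double[][] B)
    {
        int N = A.Length;
        double[][] C = Filled(N, 0.0);
        Parallel.For(0, N, i =>
        {
            // N, A, B and C are captured, so they live in the closure object.
            // Read each of them once into a local before entering the loops.
            int n = N;
            double[] ai = A[i], ci = C[i];
            double[][] b = B;
            for (int k = 0; k != n; k++)
                for (int j = 0; j != n; j++)
                    ci[j] += ai[k] * b[k][j];
        });
        return C;
    }

    public static void Main()
    {
        double[][] C = Multiply(Filled(4, 1.0), Filled(4, 2.0));
        // Every entry is the sum of four products 1.0 * 2.0.
        Console.WriteLine(C[0][0]);
    }
}
```

Each parallel iteration writes only its own row of C, so no locking is needed.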

</description>
		<content:encoded><![CDATA[<p>Looking at the IL, I was able to dig more into the problem described in my previous comment.</p>
<p>The performance difference very likely comes from the fact that, in the SimpleParallel lambda, the C# compiler decided to capture the variables by reference (equivalent to the C++11 lambda closure [&amp;]). By providing a static method, I forced those variables to be passed by value, which improved performance.</p>
<p>To confirm this analysis, I changed the static function to the following signature, forcing the values to be passed by reference. As expected, it brings back the original performance of the SimpleParallel method.</p>
<p>private static void Multiply(int i, ref int N, ref double[][] A, ref double[][] B, ref double[][] C)</p>
<p>The next challenge was to modify the initial lambda expression to get the good performance without needing a static function. The following syntax worked: it was probably explicit enough for the C# compiler to understand that it could capture the variables by value.</p>
<p>Parallel.For(0, N, i =&gt; { int n = N; double[][] a = A, b = B, c = C; for (int k = 0; k &lt; n; k++) for (int j = 0; j &lt; n; j++) c[i][j] += a[i][k] * b[k][j]; });</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fabien Gigante</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-339</link>
		<dc:creator>Fabien Gigante</dc:creator>
		<pubDate>Sun, 13 Nov 2011 23:17:00 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-339</guid>
		<description>I modified the SimpleParallel method as follows, and it becomes almost as fast as OptimizedParallel.

The only difference is that I call a static method instead of a lambda function in the Parallel.For loop.

Therefore, I&#039;m wondering whether the improvement of OptimizedParallel is really due to better partitioning than the default partitioning of Parallel.For, or whether lambda functions cause significant overhead in the processing... and, if so, why there would be such overhead.

public static TimeSpan Multiply(int N)
{
    var C = new double[N][];
    var A = new double[N][];
    var B = new double[N][];
    Util.Initialize(N, A, B, C);
    Stopwatch stopwatch = Stopwatch.StartNew();
    Parallel.For(0, N, i =&gt; Multiply(i, N, A, B, C));
    stopwatch.Stop();
    return stopwatch.Elapsed;
}

private static void Multiply(int i, int N, double[][] A, double[][] B, double[][] C)
{
    for (int k = 0; k &lt; N; k++)
        for (int j = 0; j &lt; N; j++)
            C[i][j] += A[i][k] * B[k][j];
}</description>
		<content:encoded><![CDATA[<p>I modified the SimpleParallel method as follows, and it becomes almost as fast as OptimizedParallel.</p>
<p>The only difference is that I call a static method instead of a lambda function in the Parallel.For loop.</p>
<p>Therefore, I&#8217;m wondering whether the improvement of OptimizedParallel is really due to better partitioning than the default partitioning of Parallel.For, or whether lambda functions cause significant overhead in the processing&#8230; and, if so, why there would be such overhead.</p>
<p>public static TimeSpan Multiply(int N)</p>
<p>{<br />
var C = new double[N][];</p>
<p>var A = new double[N][];<br />
var B = new double[N][];<br />
Util.Initialize(N, A, B, C);<br />
Stopwatch stopwatch = Stopwatch.StartNew();<br />
Parallel.For(0, N, i =&gt; Multiply(i, N, A, B, C));</p>
<p>stopwatch.Stop();<br />
return stopwatch.Elapsed;</p>
<p>}<br />
private static void Multiply(int i, int N, double[][] A, double[][] B, double[][] C)</p>
<p>{<br />
for (int k = 0; k &lt; N; k++) for (int j = 0; j &lt; N; j++) C[i][j] += A[i][k] * B[k][j];</p>
<p>}</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Interesting Ski Reading &#187; Lab49 Blog</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-30</link>
		<dc:creator>Interesting Ski Reading &#187; Lab49 Blog</dc:creator>
		<pubDate>Tue, 06 Apr 2010 20:52:51 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-30</guid>
		<description>[...] Matrix Multiplication with the Task Parallel Library [...]</description>
		<content:encoded><![CDATA[<p>[...] Matrix Multiplication with the Task Parallel Library [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Interesting Ski Reading &#171; Tales from a Trading Desk</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-29</link>
		<dc:creator>Interesting Ski Reading &#171; Tales from a Trading Desk</dc:creator>
		<pubDate>Tue, 06 Apr 2010 20:00:34 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-29</guid>
		<description>[...] Matrix Multiplication with the Task Parallel Library [...]</description>
		<content:encoded><![CDATA[<p>[...] Matrix Multiplication with the Task Parallel Library [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: markmichaelis</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-28</link>
		<dc:creator>markmichaelis</dc:creator>
		<pubDate>Tue, 06 Apr 2010 13:50:25 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-28</guid>
		<description>Great job Ian!</description>
		<content:encoded><![CDATA[<p>Great job Ian!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: The Morning Brew - Chris Alcock &#187; The Morning Brew #573</title>
		<link>http://innovatian.com/2010/03/parallel-matrix-multiplication-with-the-task-parallel-library-tpl/comment-page-1/#comment-27</link>
		<dc:creator>The Morning Brew - Chris Alcock &#187; The Morning Brew #573</dc:creator>
		<pubDate>Tue, 06 Apr 2010 08:23:08 +0000</pubDate>
		<guid isPermaLink="false">http://innovatian.com/?p=362#comment-27</guid>
		<description>[...] Parallel Matrix Multiplication with the Task Parallel Library (TPL) - Ian Davis takes a look at the various methods of matrix multiplication and shows how they can be parallelized using the Task Parallel Library, and takes a look at the performance improvement that can be realised. [...]</description>
		<content:encoded><![CDATA[<p>[...] Parallel Matrix Multiplication with the Task Parallel Library (TPL) &#8211; Ian Davis takes a look at the various methods of matrix multiplication and shows how they can be parallelized using the Task Parallel Library, and takes a look at the performance improvement that can be realised. [...]</p>
]]></content:encoded>
	</item>
</channel>
</rss>

