Parallel Computing in Matlab

1. Definition

Here is the definition for multithreading (From Wikipedia):

In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. It generally results from a fork of a computer program into two or more concurrently running tasks.

On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking): the processor switches between different threads. On a multiprocessor or multi-core system, the threads or tasks will actually run at the same time, with each processor or core running a particular thread or task.

The difference between multithreading and parallel computing has been vague. Different people have different understandings.

Here (http://www.danielmoth.com/Blog/threadingconcurrency-vs-parallelism.aspx), parallel computing is defined only for multi-core or multi-CPU systems.

Here (http://stackoverflow.com/questions/1740304/difference-between-multithreaded-and-parallel-programming), it says “multithreading is a type of parallelism, so parallel programming would include multithreading. Other types of parallelism include multiprocessing (where you use multiple processes that communicate with each other) and distributed computing (where tasks run on different computers).”

Recently, when people talk about parallel computing, they usually mean using a multi-core CPU, multiple computers, or a GPU. But generally speaking, I prefer the second, broader definition, since people have been using the term “parallel” for a long time in many contexts, such as parallel port vs. serial port. Time division is also common in optical communication, and I think it can be considered parallelism too.

The difference between multithreading and multitasking is explained here: http://zone.ni.com/devzone/cda/tut/p/id/6424

2. Multithreading in matlab

http://www.mathworks.com/support/solutions/en/data/1-WNXI8/index.html?solution=1-WNXI8

Prior to MATLAB 7.4 (R2007a) MATLAB was a single threaded application. So on a hyperthreaded dual processor machine which runs 4 threads, two per CPU, MATLAB will take only 25% of CPU time corresponding to 1 thread.

As of MATLAB 7.4 (R2007a), MATLAB can be multi-threaded for certain applications.

http://www.mathworks.com/support/solutions/en/data/1-46OY0H/index.html?solution=1-46OY0H

Multithreaded computations, introduced in MATLAB R2007a, are turned on by default as of MATLAB R2008a release.

In other words, prior to Matlab 7.4 there was no multithreading, so even a multi-core computer didn’t help. In later versions multithreading is built in, so we can benefit from a multi-core computer.
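A quick way to see whether the built-in multithreading matters on a given machine is to time a multithreaded operation with the default thread count and again with a single thread. The following is only a minimal sketch, assuming a release that provides maxNumCompThreads; the matrix size and variable names are illustrative and not taken from the MathWorks notes above.

```matlab
% Time a matrix multiply with the default number of computational
% threads and again with one thread. The size 1500 is arbitrary.
A = rand(1500);

nDefault = maxNumCompThreads;      % current maximum number of threads
tic; B = A * A; tDefault = toc;

nPrev = maxNumCompThreads(1);      % limit to one thread, keep the old value
tic; C = A * A; tSingle = toc;
maxNumCompThreads(nPrev);          % restore the previous setting

fprintf('%d thread(s): %.3f s, 1 thread: %.3f s\n', ...
        nDefault, tDefault, tSingle);
```

On a single-core machine the two timings should be close; on a multi-core machine the difference depends on the matrix size and on which functions are multithreaded in that release.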

I actually found that some people have written mex files in another language such as C++ or Java to achieve multithreading in Matlab. For example: http://www.instructables.com/id/Matlab-Multithreading-EASY/.

I tried to compile it on Ubuntu with Matlab 2010a and on Windows with Matlab 2010b. Both attempts failed with the same errors other people reported:

mex mythreadprog.cpp

Error mythreadprog.cpp: 11 syntax error; found `MyThread’ expecting `;’
Error mythreadprog.cpp: 19 redeclaration of `mexFunction’ previously declared at .\mex.h 135
Warning mythreadprog.cpp: 23 assignment of pointer to void function(pointer to void) to pointer to int function(pointer to void)
Error mythreadprog.cpp: 33 syntax error; found `MyThread’ expecting `;’
Warning mythreadprog.cpp: 58 missing return value
Error mythreadprog.cpp: 33 undefined size for `void cdecl’
4 errors, 2 warnings

C:\PROGRA~1\MATLAB\R2007B\BIN\MEX.PL: Error: Compile of ‘mythreadprog.cpp’ failed.

Another multithreading example, written in C, is here:

http://robertoostenveld.ruhosting.nl/index.php/multithreading-matlab-mex/

3. Vectorization

Matlab is a matrix language. Vectors and matrices are used in Matlab to make computing faster and improve performance. You can find more details about techniques for improving performance in Matlab here: http://www.mathworks.com/help/techdoc/matlab_prog/f8-784135.html and http://www.mathworks.com/support/solutions/en/data/1-15NM7/?solution=1-15NM7.

As I understand it, vectorization is not parallel computing. Vectorization improves performance through higher memory efficiency. But of course, vector and matrix computations can also use parallel computing.
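To make the idea concrete, here is a small sketch comparing an element-by-element loop with the equivalent vectorized expression. The formula and the variable names are made up for illustration; timings will vary by machine and release.

```matlab
n = 100000;
x = linspace(0, 1, n);

% Loop version: one scalar operation per iteration.
yLoop = zeros(1, n);
tic
for k = 1:n
    yLoop(k) = x(k)^2 + 1;
end
tLoop = toc;

% Vectorized version: one expression over the whole array.
tic
yVec = x.^2 + 1;
tVec = toc;

fprintf('loop: %.6f s, vectorized: %.6f s\n', tLoop, tVec);
fprintf('max difference: %g\n', max(abs(yLoop - yVec)));
```

Both versions compute the same values; only the number of interpreted operations differs.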

4. Parallel computing in Matlab

A great article can be found here: https://docs.google.com/viewer?url=http%3A%2F%2Fweb.eecs.utk.edu%2F~luszczek%2Fpubs%2Fparallelmatlab.pdf

The parallel computing toolbox has been available for Matlab since R2006b, while multithreading became built in only in more recent versions. Multithreading seems to be aimed at multi-core computers; we won’t benefit from it on a single-core computer. For more information about parallel computing in Matlab, check http://www.mathworks.com/products/parallel-computing/ (“Parallel Computing Toolbox™ lets you solve computationally and data-intensive problems using multicore processors, GPUs, and computer clusters.”), http://www.mathworks.com/help/toolbox/distcomp/f3-6010.html and http://www.mathworks.com/help/toolbox/optim/ug/briutum.html.

More links on parallel computing in Matlab:

http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V4G-4WGHJMS-1&_user=699445&_coverDate=11%2F30%2F2009&_rdoc=1&_fmt=high&_orig=gateway&_origin=gateway&_sort=d&_docanchor=&view=c&_searchStrId=1699822384&_rerunOrigin=google&_acct=C000039258&_version=1&_urlVersion=0&_userid=699445&md5=85820ea2f90115e9f443437913a5f2b2&searchtype=a

https://docs.google.com/viewer?url=http%3A%2F%2Fwww.serc.iisc.ernet.in%2FComputingFacilities%2Fsoftware%2Fmatlab-7.11%2Fmatlab_parallel_toolbox%2Fdistcomp.pdf

http://www.mathworks.com/products/parallel-computing/builtin-parallel-support.html

http://www.mathworks.com/products/distriben/

http://www.mathworks.com/support/product/DM/installation/ver_current/

Even though R2006b was single-threaded, people could still benefit from the parallel computing toolbox on a networked cluster of computers, and from distributed arrays, which use memory more efficiently. “each lab has to store and process means a more efficient use of memory and faster processing”.

“Parallel for-loops let you run a for-loop across your labs simultaneously.” One thing I don’t understand is that when I tried parallel for-loops on R2006b, I could only have one lab, yet parfor still ran faster than the for loop.

5. Some test results

Note: Factors That Affect Speed (http://www.mathworks.com/help/toolbox/optim/ug/briutum.html)

a. Parallel overhead. There is overhead in calling parfor instead of for. If function evaluations are fast, this overhead could become appreciable. In particular, solving a problem in parallel can be slower than solving the problem serially.

b. No nested parfor loops. When executing serially, parfor loops run slower than for loops. Therefore, for best performance, ensure that only your outermost parallel loop calls parfor.
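The overhead point in a. can be illustrated by timing the same loops with a cheap body and with a costly body. This is only a sketch under assumptions: it uses the matlabpool command as in the tests in this post (releases from R2013b onward use parpool instead), it assumes two local workers can be started, and the eig workload and all variable names are invented for the example.

```matlab
matlabpool open 2                  % assumes two local workers are available

% Cheap body: the work per iteration is tiny, so overhead dominates.
n = 100000;
M = zeros(1, n);
tic
for i = 1:n
    M(i) = i;
end
tForCheap = toc;

P = zeros(1, n);
tic
parfor i = 1:n
    P(i) = i;
end
tParCheap = toc;

% Costly body: each iteration does real work (eigenvalues of a random matrix).
m = 40;
a = zeros(1, m);
tic
for i = 1:m
    a(i) = max(abs(eig(rand(200))));
end
tForCostly = toc;

b = zeros(1, m);
tic
parfor i = 1:m
    b(i) = max(abs(eig(rand(200))));
end
tParCostly = toc;

matlabpool close

fprintf('cheap body:  for %.4f s, parfor %.4f s\n', tForCheap, tParCheap);
fprintf('costly body: for %.4f s, parfor %.4f s\n', tForCostly, tParCostly);
```

Whether parfor wins in the costly case depends on the number of cores and on the workload, so the numbers should be read as a comparison on one machine, not as a general result.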

Here is some test code, with results, using Matlab R2010b on Windows XP.

Code:

%%%%%%% 1 %%%%%%%

clear all

M=zeros(1,100000);

tic

for i=1:100000

M(i)=i;

end

toc

%%%%%%%% 2 %%%%%%%%%%%%

clear all

tic

for i=1:100000

M(i)=i;

end

toc

%%%%%%%%%3 %%%%%%%%%%%

clear all

M=zeros(1,100000);

tic

parfor i=1:100000

M(i)=i;

end

toc

%%%%%%%% 4 %%%%%%%%%%%%

clear all

tic

parfor i=1:100000

M(i)=i;

end

toc

%%%%%%%% 5 %%%%%%%%%%%%

clear all

tic

M = 1:100000;

toc

%%%%%%%%% 6 %%%%%%%%%%%

clear all

matlabpool open 2

tic

parfor i=1:100000

M(i)=i;

end

toc

matlabpool close

%%%%%%%%% 7 %%%%%%%%%%%

clear all

matlabpool open 2

tic

parfor i=1:100

fprintf('labindex = %d\n', labindex);

M(i)=i;

end

toc

matlabpool close

%%%%%%%%% 8 %%%%%%%%%%%

clear all

matlabpool open 2

tic

spmd

labindex

end

toc

matlabpool close

%%%%%%%%%%%%%%%%%%%%

Results:

1) regular for loop using pre-allocation:

Elapsed time is 0.000813 seconds.

2) regular for loop without pre-allocation:

Elapsed time is 20.986812 seconds.

3) parfor-loop with pre-allocation

Elapsed time is 0.157174 seconds.

4) parfor-loop without pre-allocation

Elapsed time is 0.157019 seconds.

5) using vector

Elapsed time is 0.000505 seconds.

6) open a worker pool of size 2 (the most I can create with my computer)

Elapsed time is 0.261995 seconds.

7) labindex = 1

8 ) Lab 1: 1, Lab 2: 2

In R2006b, 4) is also faster than 2). Like I said, I don’t know the exact answer. Here, in Matlab 2010b, 4) is faster than 2) too, without opening multiple worker labs. Basically, I don’t understand how the parfor-loop works. It is described as running in parallel, simultaneously: different iterations can run at the same time, and the order is not important. But if there’s only a single thread (R2006b), is it using multiple CPU cores? How does it do that if I don’t open multiple worker labs using matlabpool? If it’s using a single core and a single thread, how is the “parallelism” done? It seems the answer has to be “a CPU core can be used without creating a worker lab.” But when I checked the processes for 1) through 4), only one Matlab process was running, taking 50% of the CPU.

3) and 4) take about the same time, and both are slower than 1) (about 0.157 s versus 0.0008 s). This may be because the parfor processing itself takes about as long as the pre-allocated loop in 1), with the parfor overhead added on top.

In 6), when matlabpool open is used, the parfor-loop takes even longer. With or without pre-allocation the time is about the same. And 7) shows all parfor iterations reporting lab 1; lab 2 is not used, and I don’t understand why. I also noticed that in 6) and 7) there are 3 Matlab processes in total: one of them is at about 0%, and each of the other two takes about 50%. In other words, when two worker labs are created, two CPU cores are used, for 100% CPU in total. But when I print the labindex in each parfor iteration, it is always 1, and using two cores is not faster than using just one. It seems one core is working for nothing. I tried changing 6) from “parfor i=1:100000” to “parfor (i=1:100000,1)”, and it takes about the same time.

It seems I’m missing some point here.
Using spmd in 8) does seem to have two workers working for sure.
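One way to probe the question raised by tests 7) and 8) is to print labindex inside spmd and, inside parfor, to ask for the current task instead of labindex. This is a hedged sketch: it assumes two local workers, uses matlabpool as in the tests above (parpool on releases from R2013b onward), and assumes the object returned by getCurrentTask exposes an ID property; the variable names are illustrative.

```matlab
matlabpool open 2                  % assumes two local workers are available

% spmd: the body runs once on every lab, so labindex differs per lab.
spmd
    fprintf('spmd body on lab %d of %d\n', labindex, numlabs);
end

% parfor: record which task handled each iteration instead of labindex.
n = 100;
taskId = zeros(1, n);
parfor i = 1:n
    t = getCurrentTask();          % empty when the body runs on the client
    if isempty(t)
        taskId(i) = 0;
    else
        taskId(i) = t.ID;
    end
end
disp(unique(taskId))               % more than one ID suggests several workers ran iterations

matlabpool close
```

If more than one distinct ID shows up, the iterations were spread over several workers even though labindex printed 1 every time.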

9 Responses

  1. Nice post. I think anyone serious about GPU computing in MATLAB has to be using Jacket by AccelerEyes though. The GPU feature in PCT is just marketing bologna and performs worse than a single CPU in most cases that we’ve tested (except for single FFT functions, but even then Jacket is faster). MW has a horrible design approach to that feature. Anyway, just thought I’d point that out. Good luck!

  2. I double-checked 4) and 2) in Matlab R2006b. They actually run in about the same time, which makes sense. The article here (https://docs.google.com/viewer?url=http%3A%2F%2Fweb.eecs.utk.edu%2F~luszczek%2Fpubs%2Fparallelmatlab.pdf) explains it quite well. parfor by itself uses “low-level threading” implicit parallelism. In other words, it’s doing multithreading. Of course that won’t help in R2006b unless a cluster of computers is used and worker labs are created. When parfor is used with matlabpool, it’s explicit parallelism, and labs are created. So the reason it runs slower with multiple labs than without matlabpool may be the extra work needed to coordinate the different labs. This seems to answer all the questions above.

  3. Some of the related questions and answers can be found here: http://www.mathworks.com/matlabcentral/newsreader/view_thread/306658.

  4. 1. Parallel computing in Matlab is achieved by using multiple workers (labs). A worker is just a Matlab instance. One client manages one or more workers. Basically, multiple Matlab instances are started automatically rather than by hand, and all of them are connected.

    Each single instance is still single-threaded. Multiple instances working together provide the multithreading.

    2. Vectorization:

    Vectorization is not considered as parallelism.
    “When you call a vectorized function on a vector instead of calling that same function for each individual element, you’re making 1 function call instead of numel(theVector) function calls. Calling a function does have some overhead associated with it, so reducing the number of calls may (depending on what the function is doing) save you some time.” – Steve

    This explains vectorization.

    3. Worker, Lab, Core:

    1 worker = 1 lab, but a worker is not tied to a CPU core. Multiple Matlab instances can run on a single-core computer. More CPU cores will definitely help, but how the cores are used depends on availability and on what the Matlab code does.

    “Parallel Computing Toolbox does not force different workers to execute
    on different cores (even though this is technically possible using “core
    affinity” mechanisms provided by your OS). You can force this yourself
    either via the Task Manager in Windows, or something like taskset on
    Linux.

    What we’ve found when we’ve investigated this is that typically the OS
    does a good job of migrating the worker processes between cores as it
    sees fit.”

    — Eric

    Multithreading can even be achieved on a single-core computer. As of R2009b, a number of Matlab functions can benefit from multithreaded computation:

    http://www.mathworks.com/support/solutions/en/data/1-4PG4AN/?solution=1-4PG4AN

    Multithreading can be achieved explicitly using matlabpool open and spmd.

    So there are two kinds of parallelism: implicit parallelism (the kind that just happens for you automatically when running on a multi-core/multi-processor system) and explicit parallelism via parfor and the Parallel Computing Toolbox.

    4. Parfor

    Ex. 7) shows all parfor iterations reporting lab 1, with lab 2 apparently unused.
    Actually, labindex is designed specifically for spmd, not for parfor.

    (http://www.mathworks.com/matlabcentral/newsreader/view_thread/306658
    http://www.mathworks.com/matlabcentral/newsreader/view_thread/306207#831053)

    parfor doesn’t do parallel computing unless labs/workers have been created. (http://www.mathworks.com/matlabcentral/newsreader/view_thread/306207#831053) So the reason 4) is still faster than 2) may be that, without matlabpool, parfor works as it did in the R2006b version, achieving parallelism using NUMLABS.

    “New parallel for-loop (parfor-loop) functionality automatically executes a loop body in parallel on dynamically allocated cluster resources, allowing interleaved serial and parallel code. ” (http://www.mathworks.com/help/toolbox/distcomp/rn/bq7xin2-1.html#bq7xipz)

  5. Ex. 4) and 3) take about the same time. It seems parfor does the pre-allocation automatically, which is why 4) is faster than the plain for loop in 2). Steven confirmed it in http://www.mathworks.com/matlabcentral/newsreader/view_thread/306658#833649.

    For Ex. 6), I can create 8 labs with matlabpool open myConf 8, but the time is about the same:

    Elapsed time is 0.244412 seconds.

  6. Ex. 6)

    matlabpool open 1
    Elapsed time is 0.134579 seconds.

    matlabpool open 2
    Elapsed time is 0.100496 seconds.

    matlabpool open 8
    Elapsed time is 0.244412 seconds.

    This confirms that it runs fastest when the number of threads equals the number of cores.
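A comparison like the one in the last response can be scripted by opening pools of different sizes in a loop. This is a rough sketch, not the code used for the numbers above: it assumes the function form matlabpool('open', n) is accepted on the release in use (parpool(n) plays that role from R2013b onward), and the pool sizes, the eig workload and the variable names are all illustrative.

```matlab
poolSizes = [1 2];                 % sizes to try; adjust to the machine
elapsed = zeros(size(poolSizes));
m = 40;                            % number of loop iterations

for p = 1:numel(poolSizes)
    matlabpool('open', poolSizes(p));   % assumed function form of matlabpool
    r = zeros(1, m);
    tic
    parfor i = 1:m
        r(i) = max(abs(eig(rand(200))));
    end
    elapsed(p) = toc;
    matlabpool close
end

for p = 1:numel(poolSizes)
    fprintf('%d worker(s): %.4f s\n', poolSizes(p), elapsed(p));
end
```

A loop body with real work per iteration is used here because, with a body as cheap as M(i)=i, the timings mostly measure overhead.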
