# parallel_run

parallel calls to a function

### Syntax

[R1[,…,Rm]]=parallel_run(A1[,…,Ak], f[,types][,dims], p)

### Arguments

- Ai
Argument matrix of scalars (real) for function f. The Ai matrices do not need to have the same number of columns. Let n be the largest number of columns for the Ai : f will be called

`n`

times with arguments Ai(:,k) for k=1…n. For Ai with less than n columns, the columns are 'recycled' and the kth call is passed Ai(:,(modulo(k-1,n)+1).- f
A scilab macro or a string containing the name of the function to call.

- types
strings containing the type name (as per typeof) of results of the function

**f**. Currently, only "constant" for reals (scalar or matrix) is supported. Defaults to "constant".- dims
matrix with 1 or 2 columns containing the dimensions of the results of the function

**f**. Defaults to 1 (i.e. scalar results).- p
parameters list (as created with init_param) used for tuning the parallelization strategy. See

**§Tuning the Parallelization with Configuration Parameters**for the various parameters that can be set.- Ri
vectors of n (see above) columns containing the results of the function called. The number of rows are given by the corresponding size argument if any (1 otherwise).

### Description

parallel_run makes parallel calls (on a multicore system) to the provided function on the supplied vectors of arguments. The function can be the name of either a compiled foreign function (see ilib_for_link) or a Scilab macro. In the latter case, the macro should not rely on side effects because some of them will be lost (those performed in other processes than the main scilab process). The number of calls (and dimension of the result vectors) is given by the length of the longest vector of arguments.

Except for side effects, [R1[,…,Rm]]=parallel_run(A1[,…,Ak], 'f'[,types][,dims][,C]) is equivalent to the Scilab pseudo code:

for i = 1:max(size(A1,'c'),…,size(Ak,'c')) b1 = modulo(i-1,size(A1,'c')) + 1 bk = modulo(i-1,size(Ak,'c')) + 1 [R1(:,i),…,Rm(:,i)] = f(A1(:b1),…,Ak(:bk)); end

Note that the equivalent code does not need `types`

, `dims`

or `p`

.
`parallel_run`

needs them for performance reasons : `types`

and `dims`

are used
to pre-allocate the result variables and `p`

is used to fine tune the
parallelization strategy.

#### Limitations

In this current version of Scilab, `parallel_run`

uses only one core on Windows platforms.

In order to provide the most efficient implementation possible, the least that parallel_run can do is to avoid imposing an overhead over an explicit loop. Unfortunately, this can only be achieved on some data structures.
For matrices of scalar, extracting a single column to pass as argument to the function can be done very efficiently.
Not so for matrices of strings or sparse matrices (not to mention lists !).
For now, parallel_run only handles arguments and results of scalar matrices of real values and the **types** argument is not used.

Furthermore, no locking primitives are available in Scilab to handle concurrent accesses to shared variables. For this reason, the concurrent execution of Scilab macros cannot be safe in a shared memory environment and each parallel execution must be done in a separate memory space. As a result, one should not rely on side effects such as modifying variables from outer scope : only the data stored into the result variables will be copied back into the calling environment.

As it is not possible to synchronize the access to Java Virtual Machine across separate memory spaces, Scilab macros called by parallel_run are not allowed to use the JVM. This unfortunately encompass all the graphics functions as the Scilab GUI is based on Java. The spawned Scilab process are run in "nwni" mode their calls to disp() are redirected to the standard output.

#### Advanced usage

To achieve maximum performance and void any unwarranted copy and heavy context switching, it is possible to use parallel_run on compiled foreign functions. Such and advanced usage of parallel_run is detailed bellow in **§ Advanced Usage : Compiled Foreign Functions**.

### Examples

#### First steps

First we rewrite the most trivial loop with parallel_run:

for i = 1:10 res(i) = i*i; end; // for parallel_run, we need to have a function performing the computation function a=g(arg1) a=arg1*arg1 endfunction res = parallel_run(1:10, g); // => res = [1 4 9 16 25 36 49 64 81 100];

The matrix res was filled with res(i)= g(i), with concurrent calls to g(). To see that the calls to `g`

are not sequential, we can add some output to the function :

function a=g(arg1) disp("called on "+string(arg1)) a = arg1*arg1 disp("computed "+string(a)) endfunction res = parallel_run(1:6, g); // the actual output depends on the architecture (i.e. 4 cores) and current load

called on 3 computed 9 called on 2 called on 5 computed 25 computed 4 called on 1 called on 6 computed 36 computed 1 called on 4 computed 16

As we can see, not only were the calls to `g`

'out of order', but they where also interleaved.

This example is for illustrative purposes only ! When the operations performed by the function are available in vector form, it is much more efficient to use these operators than to call parallel_run. For instance, the most efficient way to perform g() on a vector of values would be to rewrite it as a= arg1 .* arg1 an call it directly on the matrix of arguments (i.e res= g(1:10);). parallel_run is intended for more involved functions. |

To illustrate the running time gain, we instrument the next example with calls to getdate(). However, the next examples will focus only on the illustrative purpose irrespectively of any actual gain. Running the simplest Scilab function with a relevant signature on datasets small enough to be printed is unlikely to benefit from multicore parallel scheduling ! Each example will perform a computation with an explicit loop, and then the same computation with parallel_run.

#### Wallclock gain

function r=f(n) if(n == 0) r = 1; else r = n - m(f(n-1)); end; endfunction function r=m(n) if(n == 0) r = 0; else r = n - f(m(n-1)); end; endfunction; n_max = 40; t0 = getdate(); for i = 1:n_max r(i)= m(i); end etime(getdate(), t0) // output the wallclock time for the explicit loop computation t0 = getdate(); r = parallel_run(1:n_max,"m"); // output the wallclock time for the parallel_run computation, // it should be lower that the previous on multicore architectures. etime(getdate(), t0)

#### Plurality of arguments and returned values

It is of course possible to have more than one argument and more than one returned value :

function [r_min, r_med, r_max]=min_med_max(a, b, c) r_min = min(a,b,c); r_med = median([a,b,c]); r_max = max(a,b,c); endfunction N = 10; A = rand(1:N); B = rand(1:N); C = rand(1:N); Min = zeros(N); Med = Min; Max = Min; for i = 1:N [Min(i), Med(i), Max(i)]= min_med_max(A(i), B(i), C(i)); end; [Min, Med, Max] = parallel_run(A,B,C,"min_med_max");// equivalent to the previous loop.

As we have seen in the Note of §First steps, the best performance is achieved by making use of vector (1xn matrices) operations. With parallel_run, it is also possible to take advantage of those operations by taking vector arguments and returning vector results. When returning a vector, one must use the `dims`

so that `parallel_run`

knows beforehand the number of returned values to expect.

#### Vector arguments and results

function r=quantiles(data, ranks) tmp = gsort(data,'g','i'); r = tmp(ranks); endfunction N = 100; data = matrix(rand(1:N*10), N, 10); R4_6_7 = matrix(zeros(1:3*10),3,10); for i = 1:10 R4_6_7(:,i) = quantiles(data(:,i),[4;6;7]); end; // note the last argument to inform that quantiles returns 3 scalars. R4_6_7 = parallel_run(data,[4;6;7],"quantiles", 3);

While it is only possible to pass vectors (1 x n) as columns of argument and/or result matrices, it is easy to reshape any matrix into such vector to circumvent this restriction. The previous example was passing arguments with different numbers of columns. In fact, the second argument to pass to `quantiles`

only has one columns, because all the calls to the function must be made with the same ranks. This column is recycled as if the second argument had as much columns as the first, all with the same values.

**Note :** In the previous example, the shape of the second argument is of utmost importance for parallel_run ! While it was possible to write
*R4_6_7(:,i)= quantiles(data(:,i),[4 ,6,7]);*
(note the comma) in the explicit loop, calling

*parallel_run(data,[4;6;7],"quantiles", 3)*would have computed a very different result :

--> // with previous definitions --> parallel_run(data,[4;6;7],"quantiles", 3) ans = 0.0000426 0.0000512 0.0000387 0.0000271 0.0000175 0.0000462 0.0000243 0.0000302 0.0000483 0.0000099 0.0000799 0.0000681 0.0000516 0.0000435 0.0000576 0.0000645 0.0000700 0.0000478 0.0000604 0.0000252 0.0000963 0.0000953 0.0000649 0.0000452 0.0000650 0.0000805 0.0000793 0.0000615 0.0000666 0.0000335 --> parallel_run(data,[4,6,7],"quantiles", 3) ans = 0.0000426 0.0000681 0.0000649 0.0000271 0.0000576 0.0000805 0.0000243 0.0000478 0.0000666 0.0000099 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

The second call to `parallel_run`

has a second argument of 3 columns and 1 row, so each call to `quantiles`

only gets 1 scalar as second argument. The first call to `quantiles`

gets '4' and returns the 4th value of the first columns of data. This is confirmed by the fact that the returned value (0.0000426) is the same as the first returned by the previous call to `parallel_run`

(which returned the 4th, 6th and 7th values). For the second call to `parallel_run`

, the two other values of the first result column (here 0. ) are undefined because we stated that `quantiles`

would return 3 values (the last argument to `parallel_run`

) while in fact `quantiles`

only returned one (because it only got one rank as second argument). The second call to `quantiles`

gets '6' and returns the 6th value of the second columns of data. This is confirmed by the fact that the returned value (0.0000681) is the same as the first returned by the previous call to `parallel_run`

(which returned the 4th, 6th and 7th values). For the second call to `parallel_run`

, the two other values of the second result column (here 0. ) are undefined for the reason we have just seen. The third call to `quantiles`

gets '7' and computes the 7th rank. For the next calls, the argument matrix with only three columns has its columns recycled so that `quantiles`

gets '4' and then '6' and then '7' and then '4' etc.

#### Handling other data types

##### Handling other argument types

While it is not possible to pass or return most Scilab types with functions called by `parallel_run`

, this does not mean that one cannot interact with any data structure. The easiest is to read from any kind of data type: as the global environment is shared amongst the parallel execution context, one can just pass column numbers instead of columns for instance, and let the macro access the data.

##### Handling a sparse matrix argument

The following code sample shows how to access to elements of a sparse matrix, to compute quantiles of those elements.

N = 5; // size of a block L = 10; // number of blocks data = spzeros(N*L,N*L); for k = 1:L // init diagonal blocks of sparse matrix with random numbers kk = (k-1)*N+1:k*N; data(kk,kk) = matrix(rand(1:N*N),N,N); end // function to compute quantiles in blocks from the data sparse matrix function r=quantiles_sparse(indices_i, indices_j, ranks) tmp = gsort(full(data(indices_i, indices_j)),'g','i'); r = tmp(ranks); endfunction // Compute quantiles of diagonal blocks // a) with explicit loop R3_6_9 = matrix(zeros(1:3*L),3,L); for k = 1:L kk = (k-1)*N+1 : k*N; R3_6_9(:,k) = quantiles_sparse(kk,kk, [3,6,9]); end // b) with parallel_run idx = matrix(1:(L*N), N,L); R3_6_9 = parallel_run(idx,idx,[3;6;9],"quantiles_sparse",3);

##### Handling a matrix of strings argument

The following code sample shows how to access to elements of a matrix of strings, to compute the number of digits in those strings..

a = ["a1";"b2b";"1cc2c";"4d555"] function result=count_digits(str) result = sum(isdigit(str)) endfunction for i = 1:size(a,'r') nb_digits(:,i) = count_digits(a(i)); end // We must define another function taking an index argument // and accessing the string from the variable in outer scope: function result=count_digits_i(index) result = sum(isdigit(a(index))); endfunction nb_digits = parallel_run(1:size(a,'r'),"count_digits_i");

#### Handling other result types

Other result types are much harder to handle and should be avoid if at all possible. It would be useless to write to variables from outer scope because each execution environment has its own copy of the memory space. The only solution is to find a mapping to a vector of scalars and channel the result data through it.

##### Handling a sparse matrix argument and result type

The following code sample shows how to access to elements of a sparse matrix, to compute the invert of a block diagonal sparse matrix, and return the result in a sparse matrix.

N = 5; // size of a block L = 10; // number of blocks data = spzeros(N*L,N*L); for k = 1:L // init sparse matrix with random diagonal blocks kk = (k-1)*N+1:k*N; data(kk,kk) = matrix(rand(1:N*N),N,N); end // Invert the diagonal blocks of the sparse matrix with explicit loop r = data; for k = 1:L kk = (k-1)*N+1 : k*N; r(kk,kk) = inv(data(kk,kk)); end // Function to compute the invert of a block from the sparse matrix // it cannot take a sparse matrix argument so it must take the block number (k) // a retrieve the block from the variable data in outer scope // it also cannot put the result directly in a sparse matrix result // so we return the full matrix of the inverted block function r=invert_one_block(block_size, k) kk = (k-1)*block_size+1 : k*block_size; r = full(inv(data(kk,kk))); // data from global scope endfunction // Function to make a sparse matrix and fill it with the full results // from parallel_run function r=to_sparse(full_diag, block_size) // full_diag is L * block_size^2 s = size(full_diag); nb_blocks = s(2); r = data; // data from global scope for k = 1:s(2) kk = (k-1)*block_size+1 : k*block_size; r(kk,kk) = matrix(full_diag(1:s(1),k),block_size,block_size) end endfunction // call parallel_run() to operate on blocks of the sparse matrix // and to_sparse() to build the sparse result r = to_sparse(parallel_run(N,1:L,"invert_one_block",[N,N]),N);

##### Handling matrices of strings as argument and result type

The following uses the workaround seen in **§Handling a matrix of strings argument** to access a string matrix argument, and shows how to return strings by using their numeric representation with ascii(), to flip odd characters in strings.

a = ["one a";"b";"c and a word"]; // only works for ASCII function p=censor_odds(str) odds = (0:((length(str)-1)/2))*2+1; p = ascii(str); p(odds) = ascii('#'); p = ascii(p); endfunction for i = 1:size(a,'r') res(i) = censor_odds(a(i)); end function p=censor_odds_i(i, max_len) // pass an index and the max length strlen = length(a(i)); odds = (0:((strlen-1)/2))*2+1; p = ascii(a(i)+blanks(max_len-strlen)); p(odds) = ascii('#'); endfunction function v=censor_v(codes) for i = 1:size(codes,'r') v(i) = ascii(codes(i,:)) end endfunction len = max(length(a)); res = censor_v(parallel_run(1:size(a,'r'), len, "censor_odds_i",len)');

### Advanced Usage : Compiled Foreign Functions

When aiming at maximum performance, it sometimes makes sense to rewrite a few lines of Scilab code in a compiled language such as C++, C or Fortran.
ilib_for_link allows to do it easily from Scilab, and `parallel_run`

allows you to run your compiled code on your multicore architecture. Moreover, as it is possible to make compiled code thread-safe, it is also possible to launch compiled foreign function in lightweight (with less scheduling overhead) shared memory workers (threads), as will be seen in REF Parameters.

#### Compiled Foreign Functions API

The restrictions imposed on the data types handled by `parallel_run`

() allows to use a simpler API :

void f(void const* const* args, void *const* res)

Both arguments refer to data of opaque type `(void*)`

and the implementation has to know which kind of data it receives and cast the pointers accordingly.
(Note that current implementation only supports double data, but integer types will soon be available.)

**void const* const* args** is a pointer to an array of the data of each argument Ai. If there is only one argument, then only args[0] is valid and contains the pointer to its data. Each argument i can be a vector (it is a column of the Scilab matrix argument) and args[i-1] contains the pointer to its data. If argument i is a scalar (Scilab matrix Ai has only one row), only args[i][0] is valid and contains the value. Note : the data referenced by **args is declared const because it refers to arguments of the function. However, the data is currently copied from from underlying Scilab variables before being passed to the function, so it can safely be overwritten by casting away the constness(but conversely cannot be used as in/out parameters). If k arguments are provided to parallel_run, the args[0]...args[k-1] will always be valid for each call, even if some arguments had less columns than others : parallel_run will handle the "recycling" shown in **§Description**.

**void *const* res** is a pointer to an array of the data of each result Ri. If there is only one argument, then only res[0] is valid and contains the pointer to its data. Each result i (counted from 0) can be a vector as specified by the dims argument and res[i] contains the pointer to its data. If res i is a scalar (default when no dims argument is used), only res[i][0] is valid and available to store the value.

#### Examples

The following examples are the compiled foreign function rewrite of those from the previous section. However, we will not handle other data types because accessing those data structures from compiled code would be outside the scope of this document. Furthermore, when speed is of essence (as usual when one goes to the length of rewriting some Scilab code into a compiled language) one should strive to stick to dense matrices.

##### First steps

The most simple application of parallel_run is to

function make_compiled_function(name, ext, code) filename = name + ext mputl(code, filename) ilib_for_link(fun_name,filename,[],"c") exec loader.sce mdelete(filename) endfunction fun_name = 'test_fun'; c_prog = ['#include <math.h>' 'void '+fun_name+'(void const* const* args, void *const* res) {' '*((double*)*res)= (((double*)args[0])[0])*(((double*)args[0])[0]);' '}' ]; make_compiled_function(fun_name, '.c',c_prog); res = parallel_run(1:10, fun_name); // res = [1., 4., 9., 16., 25., 36., 49., 64., 81., 100.];

##### Wallclock gain

In order to illustrate the parallelization gains, the next example includes an explicit loop calling a foreign c function with call. The remaining examples will only include the parallel_run version, as their purpose was already illustrated in the previous section.

function make_compiled_function(name, ext, code) filename = name + ext; mputl(code, filename); ilib_for_link(name,filename,[],"c"); exec loader.sce; mdelete(filename); endfunction fun_name = 'male'; c_prog = ['static int m(int);' 'static int f(int n) { return n ? (n-m(f(n-1))) : 1 ; }' 'static int m(int n) { return n ? (n-f(m(n-1))) : 0 ; }' 'void '+fun_name+'(double* res, double* n) ' '{ *res= (double)m((int)*n);}' ]; make_compiled_function(fun_name, '.c',c_prog); t0 = getdate(); for i = 1:200 res_c(i) = call(fun_name, i,2,"d","out", [1,1],1,"d"); end etime(getdate(), t0) // note that we ran the computation until m(200)! fun_name = 'male_p'; // same implementation, only change is the API c_prog(4:5) = ['void '+fun_name+'(void const* const* args, void *const* res) ' '{ *((double*)*res)= (double)m((int)((double*)args[0])[0]); }' ]; make_compiled_function(fun_name, '.c',c_prog); t0 = getdate(); res = parallel_run(1:200, fun_name); etime(getdate(), t0) // execution time is shorter on multicore

##### Plurality of arguments and returned values

It is of course possible to have more than one argument, and more than one returned value :

N=10; A = rand(1:N); B = rand(1:N); C = rand(1:N); f1=['#include <algorithm>', 'extern ""C""{', ' void min_med_max_cxx(void const*const* args, void*const* res){', ' double a(((double const*const*)args)[0][0])', ' ,b(((double const*const*)args)[1][0]),c(((double const*const*)args)[2][0]);', ' if(b<a){ std::swap(a,b);}', ' if(c<a){ std::swap(a,c);}', ' if(c<b){ std::swap(b,c);}', ' ((double*const*)res)[0][0]=a; ((double*const*)res)[1][0]=b; ((double*const*)res)[2][0]= c;', ' return;', ' }', '}']; mputl(f1, 'min_med_max_cxx.cxx'); ilib_for_link('min_med_max_cxx','min_med_max_cxx.cxx',[],"c") exec loader.sce [Min, Med, Max] = parallel_run(A,B,C, "min_med_max_cxx")

##### Vector arguments and results

f1 = [ '#include<algorithm>', 'extern ""C""{', ' void quantiles_cxx(void const*const* args, void*const* res){', ' double * data(((double *const*)args)[0]);', ' int data_size(static_cast<int>(((double const*const*)args)[1][0]));', ' double const* ranks(((double *const*)args)[2]);', ' int ranks_size(static_cast<int>(((double *const*)args)[3][0]));', ' std::sort(data,data+data_size);', ' for(int i(0); i != ranks_size; ++i)', ' { ((double*const*)res)[0][i]= data[static_cast<int>(ranks[i])-1]; }', ' return;', ' }', ' }' ]; mputl(f1,'quantiles_cxx.cxx'); ilib_for_link('quantiles_cxx','quantiles_cxx.cxx',[],"c") exec loader.sce N = 10; data = matrix(rand(1:N*N),N,N); // note the last argument to inform that quantiles returns 3 scalars. R4_6_7 = parallel_run(data',N,[4;6;7],3,"quantiles_cxx", 3);

### Tuning the Parallelization with Configuration Parameters

As we have seen in the syntax, it is possible to add a configuration parameter as a last argument to parallel_run. This argument is handled by the `params`

module and created withinit_param() (further information on how to handle parameters can be found in the help pages of add_param, set_param and remove_param).

#### Number of workers

The number of computing resources used in parallel can be set by the parameter`nb_workers`

. The default value (0) uses as many workers as there are cores available.#### Shared (threads) or separate (process) memory

While threads are the most efficient (lightweight) parallelization model, current Scilab implementation does not allow threadsafe code, so threads can only be used for foreign function (which must be threadsafe!) and processes are used for Scilab macros. The behavior is controlled par the`shared_memory`

parameter which is a numeric value used as a boolean : threads are used if `shared_memory`

is not 0 and `f`

is a foreign function. This default value (0) implies separate memory (processes).#### Dynamic scheduling

The number of function calls assigned to each worker can either be static (each worker gets the same number of function calls to do) or dynamic. In the latter case, function calls are assigned to workers when they are available. The behavior is controlled par the`dynamic_scheduling`

parameter which is a numeric value used as a boolean : dynamic scheduling is used if `dynamic_scheduling`

is not 0. This default value (1) implies dynamic scheduling. Static scheduling is more efficient when there are many calls to the function `f`

and they will all take the same amount of time, while dynamic scheduling should be preferred when the amount a time for each call can varies a lot.#### Chunk size

In order to reduce dynamic scheduling overhead, the n function calls are dispatched in chunks. The chunk size can be set by the parameter`chunk_size`

. The default value (0) uses chunks of size 1. For static scheduling, there is virtually no scheduling overhead and the difference between the minimum and maximum number of function calls performed by the workers is at most 1.#### Prologue and Epilogue

When using separate memory, it is not uncommon to perform process-specific initialization and finalization. For example, when making Monte Carlo simulations, one must insure that each process uses different random number generator initial state, otherwise all the processes would perform the same pseudo-random sequences in parallel ! The`prologue`

(resp. `epilogue`

) parameter is a string value either empty ("" default) or containing the name of a macro taking one scalar argument to be run upon starting (resp. ending) a new process (separate memory worker, see above). Each process is given a unique scalar passed as argument to both prologue and epilogue macros.
function init_rand(n) rand('seed',n); endfunction; function res=rand_macro(nb) res = rand(1, nb); endfunction nb = 5; // We use a configuration parameter to force nb_workers = 2 even on monocore // without seeding the 2 workers have the same rng state. res = parallel_run([nb,nb], 'rand_macro', nb, init_param('nb_workers', 2)); // when setting the seed, they should have different random numbers. res = parallel_run([nb,nb], 'rand_macro', nb, init_param('nb_workers', 2,'prologue','init_rand'));

### See also

- feval — multiple evaluation
- typeof — explicit type or overloading code of an object
- ilib_for_link — utility for shared library management with link
- init_param — Initialize the structure which will handles the parameters list
- call — Fortran or C user routines call

## Comments

Add a comment:Please login to comment this page.