3. F90 and Python Implementations of the Test Problem

Several serial and parallel implementations of the test problem described in the previous section were written and executed. Their descriptions and processing performance results are shown below. The implementations include F90, taken as the reference, standard Python, Fortran-to-Python (F2Py), Cython, and Numba (including Numba-GPU). In this work, all parallel versions are based on the Message Passing Interface (MPI) communication library, which is wrapped by MPI for Python (mpi4py) [5] so that MPI processes can be executed from the Python environment. Thread-based parallelization is restricted to the execution of a Numba-compiled Python function on a GPU, which is discussed later. Parallelization with mpi4py allows the use of one or more computer nodes.

3.1. F90 Serial and Parallel

The F90 implementation is described first because it was taken as the reference for the Python implementations; it was compiled with GNU gfortran. The serial version implements the algorithm described in the previous section, and the parallel version employs the standard MPI non-blocking communication functions MPI_Isend() and MPI_Irecv(). At the end of each time step, synchronization is necessary so that each sub-domain updates the phantom zones of all its neighboring sub-domains. For a square grid with $ N \times N $ points and $ p $ MPI processes, each process is assigned a sub-domain with a total of $ [(N / p) +2] \times [(N / p) +2 ] $ points, where the two extra points in each dimension hold the phantom zones. The performance-critical part of the code updates the domain grid using the 5-point stencil (Listing 1).

Listing 1. Time-consuming part of the F90 code of the test problem.

do j = 2,by+1
  do i = 2,bx+1
    anew(i,j)= 1/2.0*(aold(i,j) + 1/4.0*(aold(i-1,j) + aold(i+1,j) + aold(i,j-1) + aold(i,j+1)))
  enddo
enddo
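
The role of the phantom zones can be illustrated without MPI. The following is a minimal single-process sketch in Python with NumPy, not the authors' code: it splits a grid into two sub-domains, updates each one independently, and then copies the edge rows into the neighboring phantom rows, which is the step that MPI_Isend() and MPI_Irecv() perform between processes. The names step and exchange_phantom_rows, the grid size n, and the random initial data are assumptions made for illustration.

```python
import numpy as np

def step(aold):
    """One 5-point stencil update of the interior; border rows/columns are set to zero."""
    anew = np.zeros_like(aold)
    anew[1:-1, 1:-1] = 0.5 * (aold[1:-1, 1:-1] + 0.25 * (
        aold[:-2, 1:-1] + aold[2:, 1:-1] + aold[1:-1, :-2] + aold[1:-1, 2:]))
    return anew

def exchange_phantom_rows(top, bottom):
    """Copy each sub-domain's edge row into its neighbor's phantom row.
    In an MPI code this copy would be a pair of non-blocking send/receive calls."""
    top[-1, :] = bottom[1, :]    # first interior row of the bottom sub-domain
    bottom[0, :] = top[-2, :]    # last interior row of the top sub-domain

n = 8                            # hypothetical number of interior points per side
h = n // 2                       # interior rows owned by each sub-domain
rng = np.random.default_rng(0)   # arbitrary seed, illustration only
full = np.zeros((n + 2, n + 2))
full[1:-1, 1:-1] = rng.random((n, n))

# Each sub-domain stores h interior rows plus one phantom row on each side.
top = full[:h + 2, :].copy()
bottom = full[h:, :].copy()

for _ in range(3):
    full = step(full)                      # reference: undivided grid
    top, bottom = step(top), step(bottom)  # independent sub-domain updates
    exchange_phantom_rows(top, bottom)     # synchronization after each time step

print(np.allclose(top, full[:h + 2, :]), np.allclose(bottom, full[h:, :]))
```

If the exchange is omitted, the phantom rows hold stale values and the two sub-domains drift away from the undivided solution after the first step.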

3.2. Standard Python Serial and Parallel

Porting the F90 code to Python is straightforward and requires no external library other than NumPy, the library of numerical tools. Some of the Python loops can be executed transparently by NumPy routines. The structure and sequence of the original code are preserved, and the code is executed interactively by the Python interpreter in the Jupyter Notebook environment. This makes it easy to modify the code or its parameters, display and record the results, prototype to check the accuracy of the algorithm, and document the code and its description. At this stage, the user can take advantage of the modular nature of Python to optimize the code selectively, for example by porting a specific module to F90 or replacing it with a library function for performance. Parallelization is possible within the Python multiprocessing environment, which offers several forms of parallel execution depending on the chosen library. In this work, however, the mpi4py library (MPI for Python) was used.
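
As a hedged illustration of such a port, the sketch below shows the stencil update of Listing 1 written in two ways: as a direct translation of the double loop and as a NumPy slice expression that lets the library execute the loops. The function names update_loops and update_slices and the sub-domain sizes are assumptions for this example, not names taken from the original code.

```python
import numpy as np

def update_loops(anew, aold, bx, by):
    """Direct port of the F90 double loop (0-based indices, interior is 1..bx by 1..by)."""
    for i in range(1, bx + 1):
        for j in range(1, by + 1):
            anew[i, j] = 0.5 * (aold[i, j] + 0.25 * (
                aold[i - 1, j] + aold[i + 1, j] + aold[i, j - 1] + aold[i, j + 1]))

def update_slices(anew, aold):
    """Same update expressed with NumPy slices; the loops run inside NumPy."""
    anew[1:-1, 1:-1] = 0.5 * (aold[1:-1, 1:-1] + 0.25 * (
        aold[:-2, 1:-1] + aold[2:, 1:-1] + aold[1:-1, :-2] + aold[1:-1, 2:]))

bx, by = 6, 5                              # hypothetical sub-domain sizes
rng = np.random.default_rng(1)             # arbitrary seed, illustration only
aold = rng.random((bx + 2, by + 2))        # interior plus one border cell per side
a_loops = np.zeros_like(aold)
a_slices = np.zeros_like(aold)

update_loops(a_loops, aold, bx, by)
update_slices(a_slices, aold)
print(np.allclose(a_loops, a_slices))
```

Both functions write only the interior points, so the border cells of the output arrays are left untouched.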

3.3. F2Py Serial and Parallel

F2Py is a wrapper that creates a Python module by compiling F90 source code. It requires the definition of a function whose argument list is passed when the Python module is executed, such as the number of grid points, the location and heat rate of the sources, and the number of iterations. If the F90 code is already parallelized with MPI, the module can be executed in parallel by the F90 code itself. If the F90 code has not yet been parallelized, the typical alternative is to use the mpi4py library (MPI for Python), but it may then be necessary to break the original code into parts and choose the computationally intensive ones to parallelize with MPI. F2Py therefore seems more convenient when different Python modules need to be generated and called interactively. The mpi4py library resembles the F90 MPI interface and allows Python to run a job on one or more shared-memory computer nodes, each with multi-core processors.

3.4. Cython Serial and Parallel

Cython is an optimizing static compiler for the Python and Cython programming languages. It is generally used to create standard modules and to wrap C/C++ code. The Cython source code is first translated to C/C++, which is then compiled transparently by the standard compiler of the operating system, generating machine code. Each module generated by Cython requires the definition of a corresponding standard Python function, and the performance depends on the syntactic and semantic extensions used, the resources used by the Python interpreter, and the choice of libraries. The result of the compilation is a Python module that includes an API for the generated code. For the serial test problem, the complete code was compiled by Cython. For the parallel code, a separate module was created by compiling the double loop that updates the 2D domain, and the mpi4py library was then used. The time-consuming function that updates the domain grid using the 5-point stencil is shown in Listing 2.

Listing 2. Time-consuming part of the Cython code of the test problem.

cpdef stp(double[:,::1] anew, double[:,::1] aold, Py_ssize_t by, Py_ssize_t bx):
  # Typed loop indices let Cython generate plain C loops over the memoryviews.
  cdef Py_ssize_t i, j
  for i in range(1,bx+1):
    for j in range(1,by+1):
      anew[i,j] = 1/2.0*(aold[i,j] + 1/4.0*(aold[i-1,j] + aold[i+1,j] + aold[i,j-1] + aold[i,j+1]))

3.5. Numba Serial and Parallel

Numba is a just-in-time (JIT) compiler: it compiles code at run time and supports a subset of the Python language and of the NumPy module. It relies on the LLVM project, a set of modular and reusable technologies and tools for building compilers that originated at the University of Illinois at Urbana-Champaign in 2000 and that allows the generation of machine code optimized for CPU or GPU. To compile the code for the test problem, an approach similar to that used with Cython was adopted: a Python function that embeds part of the code is created and compiled by Numba. The rest of the Python code is executed by the standard Python interpreter, since it runs quickly. For the parallel code, the mpi4py library is used to run on multiple cores. GPU execution is also possible, as Numba supports part of the NVIDIA CUDA API, which requires the definition of the kernel that will run on the GPU. The processing performance of the Numba-GPU version is shown below in subsection 4.2. The time-consuming function is shown in Listing 3.

Listing 3. Time-consuming part of the Numba code of the test problem.

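from numba import jit

# Compiled by Numba in nopython mode; NumPy slices replace the explicit double loop.
@jit(nopython=True)
def kernel(anew,aold):
  anew[1:-1,1:-1] = 1/2.0*(aold[1:-1,1:-1] + 1/4.0*(aold[2:,1:-1] + aold[:-2,1:-1] + aold[1:-1,2:] + aold[1:-1,:-2]))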
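
A function such as the one in Listing 3 is called from ordinary interpreted Python. The following is a minimal sketch of a possible time-stepping driver, assuming a 32 x 32 interior, a single initial hot point, and 10 steps; these values, the function run, and the fallback decorator used when Numba is not installed are assumptions for illustration and are not taken from the original code.

```python
import numpy as np

try:
    from numba import jit                 # Numba decorator, if the package is installed
except ImportError:
    def jit(*args, **kwargs):             # assumption: no-op stand-in so the sketch still runs
        return lambda func: func

@jit(nopython=True)
def kernel(anew, aold):
    # Same stencil as Listing 3: only interior points are written.
    anew[1:-1, 1:-1] = 1/2.0*(aold[1:-1, 1:-1] + 1/4.0*(
        aold[2:, 1:-1] + aold[:-2, 1:-1] + aold[1:-1, 2:] + aold[1:-1, :-2]))

def run(a0, steps):
    """Interpreted driver: calls the compiled kernel and swaps the two buffers."""
    aold = a0.copy()
    anew = a0.copy()                      # copy so both buffers share the same border values
    for _ in range(steps):
        kernel(anew, aold)
        aold, anew = anew, aold           # swap references instead of copying data
    return aold

a0 = np.zeros((34, 34))                   # hypothetical 32 x 32 interior plus border
a0[16, 16] = 1.0                          # hypothetical initial hot point
result = run(a0, 10)
print(result.shape, float(result.max()))
```

With Numba installed, the first call to kernel includes the compilation time, so timing measurements would normally exclude it or use a warm-up call.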