
Python + MPI

Last updated: 2019-11-18

Install OpenMPI on the login node

x@master:/scratch$ sudo apt install -y openmpi-bin openmpi-common libopenmpi2 libopenmpi-dev

Install OpenMPI on the other 6 nodes, in parallel

x@master:~$ sudo su -
root@master:~# srun --nodes=6 apt install -y openmpi-bin openmpi-common libopenmpi2 libopenmpi-dev

Testing with /scratch/hello_mpi.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv){
    int node;                             /* rank of this process */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);

    printf("Hello World from Node %d!\n", node);

    MPI_Finalize();
    return 0;
}

Compile

x@master:/scratch$ mpicc -O0 hello_mpi.c -o hello_mpi

Submission script /scratch/sub_mpi.sh

#!/bin/bash

cd $SLURM_SUBMIT_DIR

# Print the node that starts the process
echo "Master node: $(hostname)"

# Run our program using OpenMPI.
# OpenMPI will automatically discover resources from SLURM.
mpirun hello_mpi

Run

x@master:/scratch$ sbatch --nodes=7 --ntasks-per-node=1 sub_mpi.sh
Submitted batch job 160

slurm-160.out

Master node: node01
Hello World from Node 0!
Hello World from Node 3!
Hello World from Node 6!
Hello World from Node 4!
Hello World from Node 5!
Hello World from Node 1!
Hello World from Node 2!

PYTHON

Install pip on all 7 nodes, in parallel

x@master:/scratch$ srun --nodes=7 sudo apt install -y python3-pip

Install the numpy and mpi4py libraries

x@master:/scratch$ srun --nodes=7 sudo -H pip3 install numpy mpi4py
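An optional sanity check (not part of the original session) is to import both libraries on every node and print their versions; a broken install would show up here as an ImportError:

srun --nodes=7 python3 -c "import numpy, mpi4py; print(numpy.__version__, mpi4py.__version__)"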

Testing

x@master:/scratch$ srun --nodes=7 python3 -c "print('Hello')"
Hello
Hello
Hello
Hello
Hello
Hello
Hello
x@master:/scratch$ srun --ntasks=30 python3 -c "print('Hello')"
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
x@master:/scratch$ srun --nodelist=node02,node03 --ntasks=16 python3 -c "print('Hello')"
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
Hello
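
With plain Python working under srun, a minimal mpi4py counterpart of hello_mpi.c makes a good first MPI-from-Python test. The sketch below is not part of the original run (the path /scratch/hello_mpi.py is hypothetical); it would be launched from a submission script with mpirun, exactly like the C version:

# /scratch/hello_mpi.py - hypothetical mpi4py version of hello_mpi.c
from mpi4py import MPI

comm = MPI.COMM_WORLD            # communicator containing all launched ranks
rank = comm.Get_rank()           # rank of this process
size = comm.Get_size()           # total number of processes
name = MPI.Get_processor_name()  # hostname of the node running this rank

print("Hello World from rank %d of %d on %s!" % (rank, size, name))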

Python MPI

Pi calculation example, /scratch/calc-pi.py (based on the example code in the mpi4py repository)

from mpi4py import MPI
from math   import pi as PI
from numpy  import array
import time

# Midpoint rule for the integral of 4/(1+x^2) on [0,1]; each rank handles
# every nprocs-th interval, starting at its own rank offset
def comp_pi(n, myrank=0, nprocs=1):
    h = 1.0 / n
    s = 0.0
    for i in range(myrank + 1, n + 1, nprocs):
        x = h * (i - 0.5)
        s += 4.0 / (1.0 + x**2)
    return s * h

def prn_pi(pi, PI):
    message = "pi is approximately %.16f, error is %.16f"
    print  (message % (pi, abs(pi - PI)))

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()
myrank = comm.Get_rank()

n    = array(0, dtype=int)
pi   = array(0, dtype=float)
mypi = array(0, dtype=float)

if myrank == 0:
    _n = 1000  # Enter the number of intervals
    n.fill(_n)
    print('--- Printing from Rank 0 ---')
    print('Number of intervals = ', _n)
    print('Processes comm.Get_size() = ', nprocs)
    print('Rank 0 MPI.Get_processor_name() = ', MPI.Get_processor_name())
    print('--- from Processes ---------')

# Broadcast the interval count from rank 0 to all ranks
comm.Bcast([n, MPI.INT], root=0)
time1 = time.time()
wt = MPI.Wtime()
_mypi = comp_pi(n, myrank, nprocs)
wt = MPI.Wtime() - wt
time2 = time.time() - time1
print('Running Rank = ', myrank, '  time.time() = ', time2, '  Wtime() = ', wt)
mypi.fill(_mypi)
# Sum the partial results of all ranks into pi on rank 0
comm.Reduce([mypi, MPI.DOUBLE], [pi, MPI.DOUBLE], op=MPI.SUM, root=0)

if myrank == 0:
    prn_pi(pi, PI)
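
For reference, comp_pi is the standard midpoint-rule approximation of the integral that defines pi; rank r sums the terms with indices r+1, r+1+nprocs, r+1+2*nprocs, ..., and the Reduce call adds the partial sums on rank 0:

\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; h \sum_{i=1}^{n} \frac{4}{1+x_i^2},
\qquad x_i = h\left(i - \tfrac{1}{2}\right), \quad h = \frac{1}{n}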

Submission script /scratch/sub_calc_pi.sh with 30 processes

#!/bin/bash
#SBATCH --ntasks=30

cd $SLURM_SUBMIT_DIR

mpiexec -n $SLURM_NTASKS python3 calc-pi.py

Running the calculation with 1000 intervals

x@master:/scratch$ sbatch sub_calc_pi.sh
Submitted batch job 166

slurm-166.out

--- Printing from Rank 0 ---
Number of intervals =  1000
Processes comm.Get_size() =  30
Rank 0 MPI.Get_processor_name() =  node01
--- from Processes ---------
Running Rank =  27   time.time() =  9.608268737792969e-05   Wtime() =  8.821599476505071e-05
Running Rank =  11   time.time() =  0.0001761913299560547   Wtime() =  0.0001629960024729371
Running Rank =  23   time.time() =  0.00015282630920410156   Wtime() =  0.0001459560007788241
Running Rank =  26   time.time() =  0.00011587142944335938   Wtime() =  0.0001111230012611486
Running Rank =  10   time.time() =  0.0001895427703857422   Wtime() =  0.00017130805645138025
Running Rank =  22   time.time() =  0.00038170814514160156   Wtime() =  0.0003614960005506873
Running Rank =  20   time.time() =  0.00014591217041015625   Wtime() =  0.0001399520260747522
Running Rank =  21   time.time() =  0.0001621246337890625   Wtime() =  0.00015257601626217365
Running Rank =  9   time.time() =  0.000179290771484375   Wtime() =  0.00017114309594035149
Running Rank =  8   time.time() =  0.00012946128845214844   Wtime() =  0.00012361095286905766
Running Rank =  24   time.time() =  0.018949508666992188   Wtime() =  0.01893674700113479
Running Rank =  25   time.time() =  0.019182443618774414   Wtime() =  0.019168964001437416
Running Rank =  3   time.time() =  0.00010800361633300781   Wtime() =  0.00010345797636546195
Running Rank =  4   time.time() =  0.0001316070556640625   Wtime() =  0.000124647980555892
Running Rank =  2   time.time() =  7.915496826171875e-05   Wtime() =  7.492199074476957e-05
Running Rank =  7   time.time() =  0.00011229515075683594   Wtime() =  0.00010660302359610796
Running Rank =  5   time.time() =  0.0001583099365234375   Wtime() =  0.00015073898248374462
Running Rank =  0   time.time() =  0.00011539459228515625   Wtime() =  0.00011255498975515366
Running Rank =  6   time.time() =  0.00016236305236816406   Wtime() =  0.0001544909318909049
Running Rank =  28   time.time() =  9.393692016601562e-05   Wtime() =  8.16650062915869e-05
Running Rank =  29   time.time() =  9.1552734375e-05   Wtime() =  8.835500193526968e-05
Running Rank =  1   time.time() =  9.775161743164062e-05   Wtime() =  9.155602310784161e-05
Running Rank =  12   time.time() =  0.014123678207397461   Wtime() =  0.014117746999545489
Running Rank =  13   time.time() =  0.014143943786621094   Wtime() =  0.014137095997284632
Running Rank =  14   time.time() =  0.020174503326416016   Wtime() =  0.020163690998742823
Running Rank =  15   time.time() =  0.020157814025878906   Wtime() =  0.020146622002357617
Running Rank =  16   time.time() =  0.019798994064331055   Wtime() =  0.01979123400087701
Running Rank =  17   time.time() =  0.01237344741821289   Wtime() =  0.012366331997327507
Running Rank =  18   time.time() =  0.019823312759399414   Wtime() =  0.01981667899963213
Running Rank =  19   time.time() =  0.02013564109802246   Wtime() =  0.02012383499823045
pi is approximately 3.1415927369231267, error is 0.0000000833333336

Now running the calculation with 1000000000 intervals
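
The interval count is hard-coded on rank 0 in calc-pi.py, so for this run the assignment would presumably have been changed to:

    _n = 1000000000  # Enter the number of intervals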

x@master:/scratch$ sbatch sub_calc_pi.sh
Submitted batch job 167
x@master:/scratch$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               167 mycluster sub_calc        x  R       0:04      7 node[01-07]
x@master:/scratch$ scontrol show node | grep Load
   CPUAlloc=0 CPUErr=0 CPUTot=2 CPULoad=0.06
   CPUAlloc=4 CPUErr=0 CPUTot=4 CPULoad=1.04
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=0.00
   CPUAlloc=8 CPUErr=0 CPUTot=8 CPULoad=0.00
   CPUAlloc=2 CPUErr=0 CPUTot=2 CPULoad=0.01
   CPUAlloc=2 CPUErr=0 CPUTot=2 CPULoad=0.00
   CPUAlloc=2 CPUErr=0 CPUTot=2 CPULoad=0.04
   CPUAlloc=4 CPUErr=0 CPUTot=4 CPULoad=0.01

sinfo

x@master:/scratch$ sinfo  
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
mycluster*    up   infinite      7  alloc node[01-07]

top

x@node01:~$ top
Tasks: 254 total,   7 running, 184 sleeping,   0 stopped,   0 zombie
%Cpu(s): 87.6 us, 11.8 sy,  0.0 ni,  0.5 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16288556 total, 13402360 free,   658440 used,  2227756 buff/cache
KiB Swap:  2097148 total,  2096368 free,      780 used. 15269808 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                  
 8891 x         20   0  715664  40428  21932 R 103.6  0.2   1439:25 python3                  
13097 x         20   0  715664  39852  21432 R  86.4  0.2   0:24.72 python3                  
13096 x         20   0  715664  40096  21708 R  78.8  0.2   0:21.18 python3                  
13098 x         20   0  715664  39780  21360 R  67.2  0.2   0:22.24 python3                  
13099 x         20   0  715664  40236  21820 R  53.3  0.2   0:21.02 python3                  
 8892 x         20   0  714388  39560  22156 R   7.6  0.2  85:08.23 python3                  
13120 x         20   0   53008   4156   3376 R   0.3  0.0   0:00.03 top  

Cancelling the job. Note that cancellation is not instantaneous: in the example below, two nodes stay in the "CG" ("Completing") state before finishing.

x@master:/scratch$ scancel 167
x@master:/scratch$ squeue    
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               167 mycluster sub_calc        x CG       2:44      2 node[02-03]
x@master:/scratch$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

The state codes shown in squeue output are documented at https://slurm.schedmd.com/squeue.html

Running only on node02, with a modified /scratch/sub_calc_pi.sh:

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --nodelist=node02

cd $SLURM_SUBMIT_DIR

mpiexec -n $SLURM_NTASKS python3 calc-pi.py

x@master:/scratch$ sbatch sub_calc_pi.sh
Submitted batch job 151

slurm-151.out

--- Printing from Rank 0 ---
Number of intervals =  1000
Processes comm.Get_size() =  8
Rank 0 MPI.Get_processor_name() =  node02
--- from Processes ---------
Running Rank =  0   time.time() =  0.0004057884216308594   Wtime() =  0.0004029989941045642
Running Rank =  1   time.time() =  0.0003402233123779297   Wtime() =  0.0003352019703015685
Running Rank =  3   time.time() =  0.00038361549377441406   Wtime() =  0.0003773680655285716
Running Rank =  5   time.time() =  0.00035691261291503906   Wtime() =  0.00034958391916006804
Running Rank =  7   time.time() =  0.0003273487091064453   Wtime() =  0.0003221730003133416
Running Rank =  2   time.time() =  0.00036454200744628906   Wtime() =  0.00035880901850759983
Running Rank =  6   time.time() =  0.0003533363342285156   Wtime() =  0.0003446540795266628
Running Rank =  4   time.time() =  0.0003407001495361328   Wtime() =  0.0003334229113534093
pi is approximately 3.1415927369231262, error is 0.0000000833333331

REFERENCES

  • https://medium.com/@glmdev/building-a-raspberry-pi-cluster-784f0df9afbd
  • https://www.linuxwave.info/2019/10/installing-slurm-workload-manager-job.html
  • https://slurm.schedmd.com/