FLEXPART VERSION 10.0 beta (MPI)

Description
-----------

This branch contains both the standard (serial) FLEXPART and a parallel
version (implemented with MPI). The latter is under development, so not
every FLEXPART option is implemented yet.

MPI-related subroutines and variables are in the file mpi_mod.f90.

Most of the source files are identical/shared between the serial and
parallel versions. Those that depend on the MPI module have '_mpi'
appended to their names, e.g. 'timemanager_mpi.f90'.


Installation
------------

An MPI library must be installed on the target platform, either as a
system library or compiled from source.

So far, we have tested the following freely available implementations:
  mpich2  -- versions 3.0.1, 3.0.4, 3.1, 3.1.3
  OpenMPI -- version 1.8.3

Based on testing so far, OpenMPI is recommended.

Compiling the parallel version (executable: FP_ecmwf_MPI) is done by

  'make [-j] ecmwf-mpi'

The makefile has resolved dependencies, so 'make -j' will compile
and link in parallel.

The included makefile must be edited to match the target platform
(location of system libraries, compiler etc.).


Usage
-----

Running the parallel version with MPI is done with the "mpirun" command
(some MPI implementations may use an "mpiexec" command instead). The
simplest case is:

  'mpirun -n [number] ./FP_ecmwf_MPI'

where 'number' is the number of processes to launch. Depending on the
target platform, useful options regarding process-to-processor binding
can be specified (for performance reasons), e.g.,

  'mpirun --bind-to l3cache -n [number] ./FP_ecmwf_MPI'


Implementation
--------------

The current parallel model is based on distributing particles equally
among the running processes. In the code, variables like 'maxpart' and
'numpart' are complemented by the variables 'maxpart_mpi' and
'numpart_mpi', which hold the run-time determined number of particles
per process, i.e. maxpart_mpi = maxpart/np, where np is the number of
processes. The variable 'numpart' is still used in the code, but is
redefined to mean 'number of particles per MPI process'.
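
As an illustration only (this is not code taken from the FLEXPART
sources), the per-process particle budget can be derived from the MPI
process count roughly as follows; the constant 'maxpart' below is a
hypothetical value:

  ! Minimal sketch of an even particle split over MPI processes,
  ! in the spirit of maxpart_mpi = maxpart/np. Illustration only.
  program particle_split
    use mpi
    implicit none
    integer, parameter :: maxpart = 1000000   ! hypothetical total budget
    integer :: ierr, np, myrank, maxpart_mpi

    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

    ! Each process handles an equal share of the particle budget
    maxpart_mpi = maxpart/np
    if (myrank == 0) write(*,*) 'particles per process:', maxpart_mpi

    call MPI_FINALIZE(ierr)
  end program particle_split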

The root MPI process writes concentrations to file, following an MPI
communication step in which each process sends its contributions to the
root process, where the individual contributions are summed.
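
As a rough sketch of this communication pattern (illustration only; the
actual FLEXPART routines and array names differ), the per-process
contributions can be summed onto the root process with MPI_REDUCE:

  ! Minimal sketch: element-wise sum of each process' local grid onto
  ! rank 0, which would then write the result to file. Illustration only.
  program sum_at_root
    use mpi
    implicit none
    integer, parameter :: nx = 4, ny = 3        ! toy grid dimensions
    real :: grid(nx,ny), gridtotal(nx,ny)
    integer :: ierr, myrank

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

    grid = real(myrank + 1)     ! this process' local contribution

    ! Sum all local grids; the result is available on rank 0 only
    call MPI_REDUCE(grid, gridtotal, nx*ny, MPI_REAL, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)

    if (myrank == 0) write(*,*) 'root holds the summed field:', &
                                 sum(gridtotal)
    call MPI_FINALIZE(ierr)
  end program sum_at_root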

In the parallel version one can choose to set aside a process dedicated
to reading and distributing meteorological data ("windfields"). This
process will thus not participate in the calculation of trajectories.
This might not be the optimal choice when running with very few
processes. As an example, running with a total of np=4 processes and
using one of them for reading windfields will normally be faster than
running with np=3 and no dedicated 'reader' process. But it is also
possible that the program will run even faster if the 4th process
participates in the calculation of particle trajectories instead. This
will largely depend on the problem size (total number of particles in
the simulation, resolution of the grids etc.) and the hardware being
used (disk speed/buffering, memory bandwidth etc.).
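
The idea can be pictured with a minimal MPI sketch (illustration only,
not the FLEXPART implementation): the highest rank is reserved as a
"reader" while the remaining ranks form their own communicator for the
particle calculations.

  ! Illustration only: reserve the last rank for reading windfields and
  ! split the remaining ranks into a separate compute communicator.
  program reader_split
    use mpi
    implicit none
    integer :: ierr, np, myrank, color, comp_comm

    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

    color = 0                       ! compute ("particle") group
    if (myrank == np-1) color = 1   ! dedicated reader group
    call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, myrank, comp_comm, ierr)

    if (myrank == np-1) then
      write(*,*) 'rank', myrank, ': would read/distribute windfields'
    else
      write(*,*) 'rank', myrank, ': would calculate trajectories'
    end if

    call MPI_FINALIZE(ierr)
  end program reader_split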

To control this behavior, edit the parameter 'read_grp_min' in the file
mpi_mod.f90. This sets the minimum total number of processes at which
one of them will be set aside for reading the fields. Experimentation
is required to find the optimum value. On typical NILU machines
(austre.nilu.no, dmz-proc01.nilu.no) with 24-32 cores, a value of 6-8
seems to be a good choice.
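
The edit amounts to changing a single integer in mpi_mod.f90; as a
hypothetical illustration (the exact declaration and default value in
the source may differ):

  ! In mpi_mod.f90 (illustrative): with read_grp_min = 6, a dedicated
  ! reader process is only used when 6 or more MPI processes are started.
  integer, parameter :: read_grp_min = 6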

An experimental feature, which is an extension of the functionality
described above, is to hold 3 fields in memory instead of the usual 2.
Here, the transfer of fields from the "reader" process to the "particle"
processes is done on the vacant field index while the "particle"
processes are simultaneously calculating trajectories. To use this
feature, set 'lmp_sync=.false.' in file mpi_mod.f90 and set 'numwfmem=3'
in file par_mod.f90. At the moment, this method does not seem to produce
faster-running code (about the same as the "2-fields" version).
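
The two settings might look roughly like this (illustrative only; the
surrounding declarations in mpi_mod.f90 and par_mod.f90 may differ):

  ! In mpi_mod.f90 (illustrative): switch to asynchronous field reading
  logical :: lmp_sync = .false.

  ! In par_mod.f90 (illustrative): keep 3 windfields in memory, not 2
  integer, parameter :: numwfmem = 3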


Performance efficiency considerations
-------------------------------------

A couple of reference runs have been set up to measure the performance of
the MPI version (as well as to check for errors in the implementation).
They are as follows:

Reference run 1 (REF1):
* Forward modelling (24h) of I2-131, variable number of particles
* Two release locations
* 360x720 global grid, no nested grid
* Species file modified to include (not realistic) values for
  scavenging/deposition


As the parallelization is based on particles, it follows that if
FLEXPART-MPI is run with no (or just a few) particles, no performance
improvement is possible. In that case, most of the processing time is
spent in the 'getfields' routine.

A) Running without dedicated reader process
   ----------------------------------------
Running REF1 with 100M particles on 16 processes (NILU machine
'dmz-proc04'), a speedup close to 8 is observed (~50% efficiency).

Running REF1 with 10M particles on 8 processes (NILU machine
'dmz-proc04'), a speedup close to 3 is observed (~40% efficiency).
Running with 16 processes gives only marginal improvements (speedup
~3.5) because of the 'getfields' bottleneck.

Running REF1 with 1M particles: here 'getfields' consumes ~70% of the
CPU time. Running with 4 processes gives a speedup of ~1.5. Running
with more processes does not help much here.

B) Running with dedicated reader process
   ----------------------------------------

Running REF1 with 40M particles on 16 processes (NILU machine
'dmz-proc04'), a speedup above 10 is observed (~63% efficiency).

:TODO: more to come...


Advice
------
From the tests referred to above, the following advice can be given:

* Do not run with too many processes.
* Do not use the parallel version when running with very few particles.


What is implemented in the MPI version
--------------------------------------

-The following should work (have been through initial testing):

* Forward runs
* OH fields
* Radioactive decay
* Particle splitting
* Dumping particle positions to file
* ECMWF data
* Wet/dry deposition
* Nested grid output
* NetCDF output
* Namelist input/output
* Domain-filling trajectory calculations
* Nested wind fields

-Implemented but untested:

* Backward runs (but not initial_cond_output.f90)

-The following will most probably not work (untested/under development):

* Calculation/output of fluxes

-This will positively NOT work yet:

* Subroutine partoutput_short (MQUASILAG = 1) will not dump particles
  correctly at the moment
* Reading particle positions from file (the tools to implement this
  are available in mpi_mod.f90, so it will be possible soon)


Please keep in mind that running the serial version (FP_ecmwf_gfortran)
should yield results identical to those of the parallel version
(FP_ecmwf_MPI) run with only one process, i.e. "mpirun -n 1 FP_ecmwf_MPI".
If not, this indicates a bug.

When running with multiple processes, statistical differences in the
results are expected.

Contact
-------

If you have questions, or wish to work with the parallel version, please
contact Espen Sollum (eso@nilu.no). Please report any errors/anomalies!