FLEXPART VERSION 10.0 beta (MPI)

Description
-----------

This branch contains both the standard (serial) FLEXPART and a parallel
version (implemented with MPI). The latter is under development, so not
every FLEXPART option is implemented yet.

MPI-related subroutines and variables are in file mpi_mod.f90.

Most of the source files are identical/shared between the serial and
parallel versions. Those that depend on the MPI module have '_mpi'
appended to their names, e.g. 'timemanager_mpi.f90'.


Installation
------------

An MPI library must be installed on the target platform, either as a
system library or compiled from source.

So far, we have tested the following freely available implementations:

  mpich2  -- versions 3.0.1, 3.0.4, 3.1, 3.1.3
  OpenMPI -- version 1.8.3

Based on testing so far, OpenMPI is recommended.

Compiling the parallel version (executable: FP_ecmwf_MPI) is done with

'make [-j] ecmwf-mpi'

Dependencies are resolved in the makefile, so 'make -j' will compile
and link in parallel.

The included makefile must be edited to match the target platform
(location of system libraries, compiler etc.).


Usage
-----

Running the parallel version with MPI is done with the "mpirun" command
(some MPI implementations may use an "mpiexec" command instead). The
simplest case is:

'mpirun -n [number] ./FP_ecmwf_MPI'

where 'number' is the number of processes to launch. Depending on the
target platform, useful options regarding process-to-processor bindings
can be specified (for performance reasons), e.g.,

'mpirun --bind-to l3cache -n [number] ./FP_ecmwf_MPI'


Implementation
--------------

The current parallel model is based on distributing particles equally
among the running processes. In the code, variables like 'maxpart' and
'numpart' are complemented by variables 'maxpart_mpi' and 'numpart_mpi',
which hold the run-time determined number of particles per process, i.e.,
maxpart_mpi = maxpart/np, where np is the number of processes. The
variable 'numpart' is still used in the code, but redefined to mean
'number of particles per MPI process'.

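The per-process particle budget can be sketched in a few lines
(illustrative Python, not FLEXPART code; the names mirror the Fortran
variables, and the integer division matches maxpart/np):

```python
def particles_per_process(maxpart, np):
    """Each of the np processes handles an equal share of the
    maxpart particles (integer division, as in maxpart/np)."""
    return maxpart // np

# With 4 processes, a 1,000,000-particle limit gives each
# process room for 250,000 particles.
print(particles_per_process(1_000_000, 4))  # -> 250000
```
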
The root MPI process writes concentrations to file, following an MPI
communication step in which each process sends its contribution to
root, where the individual contributions are summed.

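Conceptually, this communication step is a sum-to-root reduction. A
minimal sketch in plain Python (no actual MPI calls; in FLEXPART the
real step uses MPI routines in mpi_mod.f90):

```python
def reduce_to_root(contributions):
    """Sum the per-process concentration grids element-wise,
    as the root process does before writing output."""
    total = [0.0] * len(contributions[0])
    for grid in contributions:        # one grid per MPI process
        for i, value in enumerate(grid):
            total[i] += value
    return total

# Three processes, each holding a partial 4-cell grid:
parts = [[1.0, 0.0, 2.0, 0.0],
         [0.5, 1.0, 0.0, 0.0],
         [0.0, 0.5, 1.0, 3.0]]
print(reduce_to_root(parts))  # -> [1.5, 1.5, 3.0, 3.0]
```
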
In the parallel version one can choose to set aside a process dedicated
to reading and distributing meteorological data ("windfields"). This
process will thus not participate in the calculation of trajectories.
This might not be the optimal choice when running with very few
processes. As an example, running with a total number of processes np=4
and using one of them for reading windfields will normally be faster
than running with np=3 and no dedicated 'reader' process. But it is also
possible that the program will run even faster if the 4th process
participates in the calculation of particle trajectories instead. This
will largely depend on the problem size (total number of particles in
the simulation, resolution of grids etc.) and the hardware being used
(disk speed/buffering, memory bandwidth etc.).

To control this behavior, edit the parameter 'read_grp_min' in file
mpi_mod.f90. This sets the minimum total number of processes at which
one will be set aside for reading the fields. Experimentation is
required to find the optimum value. On typical NILU machines
(austre.nilu.no, dmz-proc01.nilu.no) with 24-32 cores, a value of 6-8
seems to be a good choice.

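Assuming the rule is simply a threshold on the total process count, it
can be sketched as follows (illustrative Python; the names are
hypothetical, and the real logic lives in mpi_mod.f90):

```python
READ_GRP_MIN = 4  # example value; the real parameter is read_grp_min


def process_roles(np, read_grp_min=READ_GRP_MIN):
    """Return (reader processes, particle processes) for a run
    with np total MPI processes."""
    if np >= read_grp_min:
        return 1, np - 1   # one process set aside for windfields
    return 0, np           # all processes calculate trajectories

print(process_roles(3))  # -> (0, 3): too few processes for a reader
print(process_roles(8))  # -> (1, 7): one dedicated reader
```
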
An experimental feature, which is an extension of the functionality
described above, is to hold 3 fields in memory instead of the usual 2.
Here, the transfer of fields from the "reader" process to the "particle"
processes is done on the vacant field index, simultaneously while the
"particle" processes are calculating trajectories. To use this feature,
set 'lmp_sync=.false.' in file mpi_mod.f90 and set numwfmem=3 in file
par_mod.f90. At the moment, this method does not seem to produce faster
running code (about the same as the "2-fields" version).


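The role of the third slot can be pictured as triple buffering: two
slots hold the windfields currently in use, and the remaining slot is
vacant for the incoming transfer. A sketch under that assumption
(illustrative Python; the actual index bookkeeping in mpi_mod.f90 may
differ):

```python
NUMWFMEM = 3  # number of windfield slots, as in par_mod.f90


def next_vacant(current, following):
    """With 3 slots, the slot holding neither the current nor the
    following windfield is free to receive the next transfer."""
    return (set(range(NUMWFMEM)) - {current, following}).pop()

print(next_vacant(0, 1))  # -> 2: slot 2 receives while 0 and 1 are in use
print(next_vacant(1, 2))  # -> 0: slots rotate as time advances
```
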
Performance efficiency considerations
-------------------------------------

A couple of reference runs have been set up to measure performance of
the MPI version (as well as to check for errors in the implementation).
They are as follows:

Reference run 1 (REF1):
 * Forward modelling (24h) of I2-131, variable number of particles
 * Two release locations
 * 360x720 Global grid, no nested grid
 * Species file modified to include (not realistic) values for
   scavenging/deposition


As the parallelization is based on particles, it follows that if
FLEXPART-MPI is run with no (or just a few) particles, no performance
improvement is possible. In this case, most processing time is spent
in the 'getfields' routine.

A) Running without dedicated reader process
-------------------------------------------

Running REF1 with 100M particles on 16 processes (NILU machine
'dmz-proc04'), a speedup close to 8 is observed (~50% efficiency).

Running REF1 with 10M particles on 8 processes (NILU machine
'dmz-proc04'), a speedup close to 3 is observed (~40% efficiency).
Running with 16 processes gives only marginal improvement (speedup
~3.5) because of the 'getfields' bottleneck.

Running REF1 with 1M particles: here 'getfields' consumes ~70% of the
CPU time. Running with 4 processes gives a speedup of ~1.5. Running
with more processes does not help much here.

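The efficiency figures quoted in this section are simply speedup divided
by the number of processes. A small helper illustrating the arithmetic
(the timings below are made-up numbers, chosen to reproduce the ~50%
case above):

```python
def parallel_efficiency(t_serial, t_parallel, np):
    """Speedup and parallel efficiency of a run on np processes."""
    speedup = t_serial / t_parallel
    return speedup, speedup / np

# Hypothetical timings: 1600 s serial, 200 s on 16 processes
# gives a speedup of 8, i.e. 50% efficiency.
speedup, eff = parallel_efficiency(1600.0, 200.0, 16)
print(speedup, eff)  # -> 8.0 0.5
```
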
B) Running with dedicated reader process
----------------------------------------

Running REF1 with 40M particles on 16 processes (NILU machine
'dmz-proc04'), a speedup above 10 is observed (~63% efficiency).

:TODO: more to come...


Advice
------

From the tests referred to above, the following advice can be given:

 * Do not run with too many processes; with few particles per process,
   the 'getfields' bottleneck dominates.
 * Do not use the parallel version when running with very few particles.


What is implemented in the MPI version
--------------------------------------

-The following should work (have been through initial testing):

 * Forward runs
 * OH fields
 * Radioactive decay
 * Particle splitting
 * Dumping particle positions to file
 * ECMWF data
 * Wet/dry deposition
 * Nested grid output
 * NetCDF output
 * Namelist input/output

-Implemented but untested:

 * Domain-filling trajectory calculations
 * Nested wind fields

-The following will most probably not work (untested/under development):

 * Backward runs

-This will positively NOT work yet:

 * Subroutine partoutput_short (MQUASILAG = 1) will not dump particles
   correctly at the moment
 * Reading particle positions from file (the tools to implement this
   are available in mpi_mod.f90, so it will be possible soon)


186 | |
---|
187 | Please keep in mind that running the serial version (FP_ecmwf_gfortran) |
---|
188 | should yield identical results as running the parallel version |
---|
189 | (FP_ecmwf_MPI) using only one process, i.e. "mpirun -n 1 FP_ecmwf_MPI". |
---|
190 | If not, this indicates a bug. |
---|
191 | |
---|
192 | When running with multiple processes, statistical differences are expected |
---|
193 | in the results. |
---|
194 | |
---|
Contact
-------

If you have questions, or wish to work with the parallel version, please
contact Espen Sollum (eso@nilu.no). Please report any errors/anomalies!