                FLEXPART VERSION 10.0 beta (MPI)

Description
-----------

  This branch contains both the standard (serial) FLEXPART and a parallel
  version (implemented with MPI). The latter is under development, so not
  every FLEXPART option is implemented yet.

  MPI-related subroutines and variables are in file mpi_mod.f90.

  Most of the source files are identical/shared between the serial and
  parallel versions. Those that depend on the MPI module have '_mpi'
  appended to their names, e.g. 'timemanager_mpi.f90'.


Installation
------------

  An MPI library must be installed on the target platform, either as a
  system library or compiled from source.

  So far, we have tested the following freely available implementations:
  mpich2  -- versions 3.0.1, 3.0.4, 3.1, 3.1.3
  OpenMPI -- version 1.8.3

  Based on testing so far, OpenMPI is recommended.

  Compiling the parallel version (executable: FP_ecmwf_MPI) is done with

    'make [-j] ecmwf-mpi'

  The makefile resolves the dependencies between source files, so
  'make -j' will compile and link in parallel.

  The included makefile must be edited to match the target platform
  (location of system libraries, compiler etc.).


Usage
-----

  The parallel version is run with the "mpirun" command (some MPI
  implementations use an "mpiexec" command instead). The simplest case is:

    'mpirun -n [number] ./FP_ecmwf_MPI'

  where 'number' is the number of processes to launch. Depending on the
  target platform, useful options for binding processes to processors
  can be specified (for performance reasons), e.g.,

    'mpirun --bind-to l3cache -n [number] ./FP_ecmwf_MPI'


Implementation
--------------

  The current parallel model is based on distributing particles equally
  among the running processes. In the code, variables like 'maxpart' and
  'numpart' are complemented by variables 'maxpart_mpi' and 'numpart_mpi',
  which hold the run-time determined number of particles per process, i.e.
  maxpart_mpi = maxpart/np, where np is the number of processes. The
  variable 'numpart' is still used in the code, but redefined to mean
  'number of particles per MPI process'.
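
  A minimal, self-contained sketch of this idea (illustrative only; apart
  from 'maxpart' and 'maxpart_mpi', the names and values below are not
  taken from the FLEXPART source):

    program particle_split_sketch
      use mpi
      implicit none
      integer, parameter :: maxpart = 1000000   ! global particle limit (example value)
      integer :: np, myrank, maxpart_mpi, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

      ! Each process gets an equal share of the global particle limit.
      maxpart_mpi = maxpart/np

      print '(a,i0,a,i0,a)', 'rank ', myrank, ' handles up to ', maxpart_mpi, ' particles'

      call MPI_FINALIZE(ierr)
    end program particle_split_sketch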

  The root MPI process writes concentrations to file, following an MPI
  communication step in which each process sends its contribution to root,
  where the individual contributions are summed.
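
  A minimal sketch of this gather-and-sum step (not the actual routine in
  mpi_mod.f90; the grid size and variable names are made up for
  illustration):

    program gather_grid_sketch
      use mpi
      implicit none
      integer, parameter :: nx = 360, ny = 720   ! illustrative output grid
      real :: grid_local(nx,ny), grid_total(nx,ny)
      integer :: myrank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

      grid_local = real(myrank)   ! stand-in for this process' concentration field

      ! Sum the per-process contributions onto root (rank 0), which is
      ! the only process that writes the result to file.
      call MPI_REDUCE(grid_local, grid_total, nx*ny, MPI_REAL, MPI_SUM, 0, &
                      MPI_COMM_WORLD, ierr)

      if (myrank == 0) print *, 'summed value at (1,1): ', grid_total(1,1)

      call MPI_FINALIZE(ierr)
    end program gather_grid_sketch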

  In the parallel version one can choose to set aside a process dedicated
  to reading and distributing meteorological data ("wind fields"). This
  process then does not participate in the calculation of trajectories.
  This might not be the optimal choice when running with very few
  processes. As an example, running with a total number of processes np=4
  and using one of them for reading wind fields will normally be faster
  than running with np=3 and no dedicated 'reader' process. But it is also
  possible that the program will run even faster if the 4th process
  participates in the calculation of particle trajectories instead. This
  largely depends on the problem size (total number of particles in the
  simulation, resolution of grids etc.) and the hardware being used (disk
  speed/buffering, memory bandwidth etc.).

  To control this behavior, edit the parameter 'read_grp_min' in file
  mpi_mod.f90. It sets the minimum total number of processes at which one
  process will be set aside for reading the fields. Experimentation is
  required to find the optimum value. On typical NILU machines
  (austre.nilu.no, dmz-proc01.nilu.no) with 24-32 cores, a value of 6-8
  seems to be a good choice.
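
  The setting has roughly the following form (shown schematically; the
  exact declaration and default value in your copy of mpi_mod.f90 may
  differ):

    ! mpi_mod.f90: with fewer than read_grp_min total processes, no
    ! process is dedicated to reading wind fields.
    integer, parameter :: read_grp_min = 4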

  An experimental feature, which is an extension of the functionality
  described above, is to hold 3 wind fields in memory instead of the
  usual 2. Here, the transfer of fields from the "reader" process to the
  "particle" processes is done on the vacant field index while the
  "particle" processes are calculating trajectories. To use this feature,
  set 'lmp_sync=.false.' in file mpi_mod.f90 and set numwfmem=3 in file
  par_mod.f90. At the moment, this method does not seem to produce faster
  running code (about the same as the "2-fields" version).
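
  The two settings live in different files; schematically (the exact form
  of the declarations may differ in your copy of the source):

    ! mpi_mod.f90: .false. enables the asynchronous ("3-fields") transfer
    logical :: lmp_sync = .false.

    ! par_mod.f90: number of wind fields held in memory (2 or 3)
    integer, parameter :: numwfmem = 3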


Performance efficiency considerations
-------------------------------------

  A couple of reference runs have been set up to measure the performance
  of the MPI version (as well as to check for errors in the
  implementation). They are as follows:

  Reference run 1 (REF1):
    * Forward modelling (24h) of I2-131, variable number of particles
    * Two release locations
    * 360x720 global grid, no nested grid
    * Species file modified to include (not realistic) values for
        scavenging/deposition

  As the parallelization is based on particles, it follows that if
  FLEXPART-MPI is run with no (or just a few) particles, no performance
  improvement is possible. In this case, most processing time is spent
  in the 'getfields' routine.

  A) Running without dedicated reader process
  ----------------------------------------
  Running REF1 with 100M particles on 16 processes (NILU machine
  'dmz-proc04'), a speedup close to 8 is observed (~50% efficiency).
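
  The efficiency figures quoted here and below are the observed speedup
  divided by the number of processes, e.g. for the run above:

    efficiency = speedup / np = 8 / 16 = 0.5  (~50%)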

  Running REF1 with 10M particles on 8 processes (NILU machine
  'dmz-proc04'), a speedup close to 3 is observed (~40% efficiency).
  Running with 16 processes gives only a marginal improvement
  (speedup ~3.5) because of the 'getfields' bottleneck.

  Running REF1 with 1M particles: here 'getfields' consumes ~70% of the
  CPU time. Running with 4 processes gives a speedup of ~1.5. Running
  with more processes does not help much here.

  B) Running with dedicated reader process
  ----------------------------------------

  Running REF1 with 40M particles on 16 processes (NILU machine
  'dmz-proc04'), a speedup above 10 is observed (~63% efficiency).

  :TODO: more to come...


Advice
------

  From the tests referred to above, the following advice can be given:

    * Do not run with too many processes; scaling is limited by the
      'getfields' bottleneck.
    * Do not use the parallel version when running with very few
      particles.

What is implemented in the MPI version
--------------------------------------

 -The following should work (these have been through initial testing):

    * Forward runs
    * OH fields
    * Radioactive decay
    * Particle splitting
    * Dumping particle positions to file
    * ECMWF data
    * Wet/dry deposition
    * Nested grid output
    * NetCDF output
    * Namelist input/output
    * Domain-filling trajectory calculations
    * Nested wind fields

 -Implemented but untested:

    * Backward runs (but not initial_cond_output.f90)

 -The following will most probably not work (untested/under development):

    * Calculation/output of fluxes

 -The following will definitely NOT work yet:

    * Subroutine partoutput_short (MQUASILAG = 1) will not dump particles
      correctly at the moment
    * Reading particle positions from file (the tools to implement this
      are available in mpi_mod.f90, so it will be possible soon)


  Please keep in mind that running the serial version (FP_ecmwf_gfortran)
  should yield results identical to running the parallel version
  (FP_ecmwf_MPI) with only one process, i.e. "mpirun -n 1 FP_ecmwf_MPI".
  If not, this indicates a bug.

  When running with multiple processes, statistical differences in the
  results are expected.

Contact
-------

  If you have questions, or wish to work with the parallel version, please
  contact Espen Sollum (eso@nilu.no). Please report any errors/anomalies!