FLEXPART VERSION 10.0 beta (MPI)

Description
-----------

This branch contains both the standard (serial) FLEXPART and a parallel
version (implemented with MPI). The latter is under development, so not
every FLEXPART option is implemented yet.

MPI-related subroutines and variables are in the file mpi_mod.f90.

Most of the source files are identical/shared between the serial and
parallel versions. Those that depend on the MPI module have '_mpi'
appended to their names, e.g. 'timemanager_mpi.f90'.


Installation
------------

An MPI library must be installed on the target platform, either as a
system library or compiled from source.

So far, we have tested the following freely available implementations:
  mpich2  -- versions 3.0.1, 3.0.4, 3.1, 3.1.3
  OpenMPI -- version 1.8.3

Based on testing so far, OpenMPI is recommended.

Compiling the parallel version (executable: FP_ecmwf_MPI) is done by

  'make [-j] ecmwf-mpi'

The makefile has resolved dependencies, so 'make -j' will compile
and link in parallel.

The included makefile must be edited to match the target platform
(location of system libraries, compiler etc.).


Usage
-----

Running the parallel version with MPI is done with the "mpirun" command
(some MPI implementations may use an "mpiexec" command instead). The
simplest case is:

  'mpirun -n [number] ./FP_ecmwf_MPI'

where 'number' is the number of processes to launch. Depending on the
target platform, useful options regarding process-to-processor binding
can be specified (for performance reasons), e.g.,

  'mpirun --bind-to l3cache -n [number] ./FP_ecmwf_MPI'


Implementation
--------------

The current parallel model is based on distributing particles equally
among the running processes. In the code, variables like 'maxpart' and
'numpart' are complemented by the variables 'maxpart_mpi' and
'numpart_mpi', which hold the run-time determined number of particles
per process, i.e. maxpart_mpi = maxpart/np, where np is the number of
processes. The variable 'numpart' is still used in the code, but is
redefined to mean 'number of particles per MPI process'.
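
As an illustration only (this is not code taken from the FLEXPART
sources), the per-process particle budget can be derived from the MPI
process count roughly as follows; the constant 'maxpart' below is a
hypothetical value:

  ! Minimal sketch of an even particle split over MPI processes,
  ! in the spirit of maxpart_mpi = maxpart/np. Illustration only.
  program particle_split
    use mpi
    implicit none
    integer, parameter :: maxpart = 1000000   ! hypothetical total budget
    integer :: ierr, np, myrank, maxpart_mpi

    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

    ! Each process handles an equal share of the particle budget
    maxpart_mpi = maxpart/np
    if (myrank == 0) write(*,*) 'particles per process:', maxpart_mpi

    call MPI_FINALIZE(ierr)
  end program particle_split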

The root MPI process writes concentrations to file, following an MPI
communication step in which each process sends its contributions to the
root process, where the individual contributions are summed.
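
As a rough sketch of this communication pattern (illustration only; the
actual FLEXPART routines and array names differ), the per-process
contributions can be summed onto the root process with MPI_REDUCE:

  ! Minimal sketch: element-wise sum of each process' local grid onto
  ! rank 0, which would then write the result to file. Illustration only.
  program sum_at_root
    use mpi
    implicit none
    integer, parameter :: nx = 4, ny = 3        ! toy grid dimensions
    real :: grid(nx,ny), gridtotal(nx,ny)
    integer :: ierr, myrank

    call MPI_INIT(ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

    grid = real(myrank + 1)     ! this process' local contribution

    ! Sum all local grids; the result is available on rank 0 only
    call MPI_REDUCE(grid, gridtotal, nx*ny, MPI_REAL, MPI_SUM, 0, &
                    MPI_COMM_WORLD, ierr)

    if (myrank == 0) write(*,*) 'root holds the summed field:', &
                                 sum(gridtotal)
    call MPI_FINALIZE(ierr)
  end program sum_at_root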

In the parallel version one can choose to set aside a process dedicated
to reading and distributing meteorological data ("windfields"). This
process will thus not participate in the calculation of trajectories.
This might not be the optimal choice when running with very few
processes. As an example, running with a total of np=4 processes and
using one of them for reading windfields will normally be faster than
running with np=3 and no dedicated 'reader' process. But it is also
possible that the program will run even faster if the 4th process
participates in the calculation of particle trajectories instead. This
will largely depend on the problem size (total number of particles in
the simulation, resolution of the grids etc.) and the hardware being
used (disk speed/buffering, memory bandwidth etc.).
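
The idea can be pictured with a minimal MPI sketch (illustration only,
not the FLEXPART implementation): the highest rank is reserved as a
"reader" while the remaining ranks form their own communicator for the
particle calculations.

  ! Illustration only: reserve the last rank for reading windfields and
  ! split the remaining ranks into a separate compute communicator.
  program reader_split
    use mpi
    implicit none
    integer :: ierr, np, myrank, color, comp_comm

    call MPI_INIT(ierr)
    call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
    call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)

    color = 0                       ! compute ("particle") group
    if (myrank == np-1) color = 1   ! dedicated reader group
    call MPI_COMM_SPLIT(MPI_COMM_WORLD, color, myrank, comp_comm, ierr)

    if (myrank == np-1) then
      write(*,*) 'rank', myrank, ': would read/distribute windfields'
    else
      write(*,*) 'rank', myrank, ': would calculate trajectories'
    end if

    call MPI_FINALIZE(ierr)
  end program reader_split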

To control this behavior, edit the parameter 'read_grp_min' in the file
mpi_mod.f90. This sets the minimum total number of processes at which
one of them will be set aside for reading the fields. Experimentation
is required to find the optimum value. On typical NILU machines
(austre.nilu.no, dmz-proc01.nilu.no) with 24-32 cores, a value of 6-8
seems to be a good choice.
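
The edit amounts to changing a single integer in mpi_mod.f90; as a
hypothetical illustration (the exact declaration and default value in
the source may differ):

  ! In mpi_mod.f90 (illustrative): with read_grp_min = 6, a dedicated
  ! reader process is only used when 6 or more MPI processes are started.
  integer, parameter :: read_grp_min = 6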

An experimental feature, which is an extension of the functionality
described above, is to hold 3 fields in memory instead of the usual 2.
Here, the transfer of fields from the "reader" process to the "particle"
processes is done on the vacant field index while the "particle"
processes are simultaneously calculating trajectories. To use this
feature, set 'lmp_sync=.false.' in file mpi_mod.f90 and set 'numwfmem=3'
in file par_mod.f90. At the moment, this method does not seem to produce
faster-running code (about the same as the "2-fields" version).
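
The two settings might look roughly like this (illustrative only; the
surrounding declarations in mpi_mod.f90 and par_mod.f90 may differ):

  ! In mpi_mod.f90 (illustrative): switch to asynchronous field reading
  logical :: lmp_sync = .false.

  ! In par_mod.f90 (illustrative): keep 3 windfields in memory, not 2
  integer, parameter :: numwfmem = 3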


Performance efficiency considerations
-------------------------------------

A couple of reference runs have been set up to measure the performance of
the MPI version (as well as to check for errors in the implementation).
They are as follows:

Reference run 1 (REF1):
* Forward modelling (24h) of I2-131, variable number of particles
* Two release locations
* 360x720 global grid, no nested grid
* Species file modified to include (not realistic) values for
  scavenging/deposition


As the parallelization is based on particles, it follows that if
FLEXPART-MPI is run with no (or just a few) particles, no performance
improvement is possible. In that case, most of the processing time is
spent in the 'getfields' routine.

A) Running without dedicated reader process
   ----------------------------------------
Running REF1 with 100M particles on 16 processes (NILU machine
'dmz-proc04'), a speedup close to 8 is observed (~50% efficiency).

Running REF1 with 10M particles on 8 processes (NILU machine
'dmz-proc04'), a speedup close to 3 is observed (~40% efficiency).
Running with 16 processes gives only marginal improvements (speedup
~3.5) because of the 'getfields' bottleneck.

Running REF1 with 1M particles: here 'getfields' consumes ~70% of the
CPU time. Running with 4 processes gives a speedup of ~1.5. Running
with more processes does not help much here.

B) Running with dedicated reader process
   ----------------------------------------

Running REF1 with 40M particles on 16 processes (NILU machine
'dmz-proc04'), a speedup above 10 is observed (~63% efficiency).

:TODO: more to come...


Advice
------
From the tests referred to above, the following advice can be given:

* Do not run with too many processes.
* Do not use the parallel version when running with very few particles.


What is implemented in the MPI version
--------------------------------------

-The following should work (have been through initial testing):

* Forward runs
* OH fields
* Radioactive decay
* Particle splitting
* Dumping particle positions to file
* ECMWF data
* Wet/dry deposition
* Nested grid output
* NetCDF output
* Namelist input/output
* Domain-filling trajectory calculations
* Nested wind fields

-Implemented but untested:

* Backward runs (but not initial_cond_output.f90)

-The following will most probably not work (untested/under development):

* Calculation/output of fluxes

-This will positively NOT work yet:

* Subroutine partoutput_short (MQUASILAG = 1) will not dump particles
  correctly at the moment
* Reading particle positions from file (the tools to implement this
  are available in mpi_mod.f90, so it will be possible soon)


Please keep in mind that running the serial version (FP_ecmwf_gfortran)
should yield results identical to those of the parallel version
(FP_ecmwf_MPI) run with only one process, i.e. "mpirun -n 1 FP_ecmwf_MPI".
If not, this indicates a bug.

When running with multiple processes, statistical differences in the
results are expected.

Contact
-------

If you have questions, or wish to work with the parallel version, please
contact Espen Sollum (eso@nilu.no). Please report any errors/anomalies!