J. Brioude, Sept 19 2013
**************************************************************
To compile flexwrf, choose your compiler in makefile.mom (line 23), set the
path to the NetCDF library, and then type

make -f makefile.mom mpi      for an MPI+OpenMP hybrid run
make -f makefile.mom omp      for an OpenMP parallel run
make -f makefile.mom serial   for a serial run
********************************************************************
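Putting the steps above together, a complete build session might look like the
sketch below (the hybrid target is chosen as an example; the executable name is
taken from the run example further down and may differ in your version):

```shell
# 1. Edit makefile.mom by hand: pick your compiler (around line 23)
#    and point it at your NetCDF installation.
# 2. Build the MPI+OpenMP hybrid executable:
make -f makefile.mom mpi
# 3. The resulting binary (name may vary) is then run as, e.g.:
#    ./flexwrf31_mpi /home/jbrioude/inputfile.txt
```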
To run flexwrf, you can pass the name of the input file as an argument to the
executable, for instance

./flexwrf31_mpi /home/jbrioude/inputfile.txt

Otherwise, the file flexwrf.input in the current directory is read by default.

Examples of forward and backward runs are available in the examples directory.

*****************************************************************
Versions timeline

version 3.1: - bug fix on the sign of sshf in readwind.f90
             - modifications of advance.f90 to limit the vertical velocity
               from the CBL scheme
             - bug fix in write_ncconc.f90
             - modifications of the interpol*.f90 routines to avoid crashes
               when using tke_partition_hanna.f90 and tke_partition_my.f90

version 3.0: first public version

version 2.4.1: new modifications of the wet deposition scheme from Petra Seibert

version 2.3.1: a NetCDF output format is implemented.

version 2.2.7: the CBL scheme is implemented. A new random number generator is implemented.

version 2.0.6:
 - map factors are used in advance.f90 when converting the calculated distance
   into a WRF grid distance
 - fix of the divergence-based vertical wind

version 2.0.5:
 the time over which the kernel is not used has been reduced from 10800 seconds
 to 7200 seconds. Those numbers depend on the horizontal resolution, and a more
 flexible solution may come in a future version.

version 2.0.4:
 - bug fix for the regular output grid
 - I/O problems in ASCII have been fixed
 - added the option of running flexpart with an argument that gives the name
   of the input file instead of flexwrf.input

version 2.0.3:
 - bug fix when flexpart is restarted
 - bug fix in coordtrafo.f90
 - a new option that lets the user decide whether the time for the
   time-averaged fields from WRF has to be corrected or not

version 2.0.2:
 - bug fix in sendint2_mpi_old.f90
 - all the *mpi*.f90 routines have been changed to handle memory more properly
 - timemanager_mpi.f90 has changed accordingly, with some bug fixes too
 - bug fix in writeheader
 - parallelization of calcpar.f90 and verttransform.f90, and the same for the nests

version 2.0.1:
 - an option added in flexwrf.input to define the output grid with dxout and dyout
 - fix in readinput.f90 to calculate maxpart more accurately

version 2.0: first OpenMP/MPI version

version 1.0:
This is a Fortran 90 version of FLEXPART.
Compared to PILT, the version from Jerome Fast available on the NILU flexpart
website, several bug fixes and improvements have been made (not necessarily
commented) in the subroutines. A non-exhaustive list:
 1) optimization of the Kain-Fritsch convective scheme (expensive)
 2) possibility to output the flexpart run on a regular lat/lon output grid.
    flexwrf.input has 2 options to let the model know which coordinates are
    used for the output domain and the release boxes.
 3) differences in the earth radius between WRF and WRF-Chem are handled.
 4) time-averaged wind, instantaneous omega, or a vertical velocity internally
    calculated in FLEXPART can now be used.
 5) a bug fix in pbl_profile.f due to the variable kappa.

Turb options 2 and 3 from Jerome Fast's version lose mass in the model. Those
options are not recommended.

***********************************************************************
General comments on the hybrid version of flexpart-wrf:
This version includes a parallelized hybrid version of FLEXPART that can be
used with:
 - 1 node (1 computer) with multiple threads in shared memory (using OpenMP),
 - or several nodes (computers) in distributed memory (using MPI) and several
   threads in shared memory (using OpenMP).
If an MPI library is not available with your compiler, use makefile.nompi to
compile flexwrf.

The environment variable OMP_NUM_THREADS has to be set before running the
model to define the number of threads used. It can also be fixed in
timemanager*.f90. If it is not set, flexwrf20_mpi will use 1 thread.

When submitting a job to several nodes, mpiexec or mpirun needs to know that
1 task has to be allocated per node, so that OpenMP can do the work within
each node in shared memory. See submit.sh as an example.

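submit.sh itself is not reproduced here, but a minimal batch script along the
following lines illustrates the idea. The node and thread counts and the
`-npernode 1` flag are assumptions (flag names differ between MPI
implementations); adapt them to your cluster and scheduler:

```shell
#!/bin/bash
# Hypothetical submit script: 4 nodes, 1 MPI task per node,
# 8 OpenMP threads filling each node's cores.
export OMP_NUM_THREADS=8
# -npernode 1 (Open MPI syntax; other MPI stacks use different flags)
# keeps one task per node so OpenMP can work in shared memory within it.
mpirun -np 4 -npernode 1 ./flexwrf31_mpi /home/jbrioude/inputfile.txt
```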
Compared to the single-node version, this version includes modifications of:

 - flexwrf.f90, which is renamed flexwrf_mpi.f90
 - timemanager.f90, which is renamed timemanager_mpi.f90
 - the interpol*.f90 and hanna* routines, which have been modified
 - the *mpi*.f90 routines, which are used to send or receive data between nodes

The most important modifications are in timemanager_mpi.f90, initialize.f90
and advance.f90. Search for JB in timemanager_mpi.f90 for additional comments.
In advance.f90, I modified the way the random number is picked (line 187): I
use a simple counter and the id of the thread instead of the random pick that
uses ran3. If the series of random numbers is output for a given release box
(uncomment lines 195 to 198), the distribution is quite good, and I don't see
any bigger bias than the one in the single-thread version. Of course, the
distribution is less and less random as you increase the number of nodes or
threads.

*********************************************************
Performance:
This is the performance of the loop at line 581 in timemanager_mpi.f90 that
calculates the trajectories. I use version v74 as the reference (single
thread, Fortran 77). There is a loss in performance between v74 and v90
because of the temporary variables th_* that have to be used as private
variables in timemanager_mpi.f90. Speedup relative to v74:
  v74 (reference)   1.00
  v90, 1 thread     0.96
  v90, 2 threads    1.86
  v90, 4 threads    3.57
  v90, 8 threads    6.22

Performance of the communication between nodes:
This depends on the system. The supercomputer that I use can transfer about
1 Gb in 1 second. In timemanager_mpi.f90, the output at lines 540 and 885
gives the time needed by the system to communicate between nodes. Using 100
million particles and, say, 4 nodes, it takes about 1 second.