J. Brioude, Sept 19 2013
**************************************************************
To compile flexwrf, choose your compiler in makefile.mom (line 23), set the path to the NetCDF library, and then type

 make -f makefile.mom mpi      for an MPI+OpenMP hybrid run
 make -f makefile.mom omp      for an OpenMP parallel run
 make -f makefile.mom serial   for a serial run
********************************************************************
To run flexwrf, you can pass an argument to the executable that gives the name of the input file, for instance

 ./flexwrf31_mpi /home/jbrioude/inputfile.txt

Otherwise, the file flexwrf.input in the current directory is read by default.

Examples of forward and backward runs are available in the examples directory.

*****************************************************************
Versions timeline

version 3.1: bug fix on the sign of sshf in readwind.f90
             modifications of advance.f90 to limit the vertical velocity from the CBL scheme
             bug fix in write_ncconc.f90
             modifications of the interpol*.f90 routines to avoid crashes when using tke_partition_hanna.f90 and tke_partition_my.f90

version 3.0: first public version

version 2.4.1: new modifications of the wet deposition scheme from Petra Seibert

version 2.3.1: a NetCDF output format is implemented.

version 2.2.7: the CBL scheme is implemented. A new random number generator is implemented.

version 2.0.6:
 - map factors are used in advance.f90 when converting the calculated distance into a WRF grid distance.
 - fix of the divergence-based vertical wind

version 2.0.5:
 the time over which the kernel is not used has been reduced from 10800 seconds to 7200 seconds. Those numbers depend on the horizontal resolution, and a more flexible solution might come in a future version.

version 2.0.4:
 - bug fix for the regular output grid
 - I/O problems in ASCII have been fixed
 - added the option of running flexpart with an argument that gives the name of the input file instead of flexwrf.input

version 2.0.3:
 - bug fix when flexpart is restarted
 - bug fix in coordtrafo.f90
 - a new option that lets the user decide whether the time for the time-averaged fields from WRF has to be corrected or not

version 2.0.2:
 - bug fix in sendint2_mpi_old.f90
 - all the *mpi*.f90 routines have been changed to handle memory more properly
 - timemanager_mpi.f90 has changed accordingly, with some bug fixes too
 - bug fix in writeheader
 - parallelization of calcpar.f90 and verttransform.f90, and the same for the nests

version 2.0.1:
 - an option added in flexwrf.input to define the output grid with dxout and dyout
 - fix in readinput.f90 to calculate maxpart more accurately

version 2.0: first OpenMP/MPI version

version 1.0:
This is a Fortran 90 version of FLEXPART.
Compared to PILT, the version from Jerome Fast available on the NILU flexpart website, several bug fixes and improvements have been made (not necessarily commented) in the subroutines.
Non-exhaustive list:
1) optimization of the Kain-Fritsch convective scheme (expensive)
2) possibility to output the flexpart run on a regular lat/lon output grid.
   flexwrf.input has 2 options to let the model know which coordinates are used
   for the output domain and the release boxes.
3) the difference in earth radius between WRF and WRF-Chem is handled.
4) time-averaged wind, instantaneous omega, or a vertical velocity internally calculated by FLEXPART can now be used.
5) a bug fix in pbl_profile.f due to the variable kappa.

Turb options 2 and 3 from Jerome Fast's version lose mass in the model. Those options are not recommended.

***********************************************************************
General comments on the hybrid version of flexpart wrf:
This version includes a parallelized hybrid version of FLEXPART that can be used with:
 - 1 node (1 computer) with multiple threads in shared memory (using OpenMP),
 - or several nodes (computers) in distributed memory (using MPI) with several threads per node in shared memory (using OpenMP).
If an MPI library is not available with your compiler, use makefile.nompi to compile flexwrf.

The environment variable OMP_NUM_THREADS has to be set before running the model to define the number of threads used.
It can also be fixed in timemanager*.f90.
If not set, flexwrf20_mpi will use 1 thread.
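For instance, the thread count can be set in the shell before launching the model (the executable name and input path below are only illustrative):

```shell
# Choose the number of OpenMP threads for this run (8 is just an example)
export OMP_NUM_THREADS=8
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
# then launch the model, e.g. (illustrative executable and input file names):
# ./flexwrf31_omp /home/jbrioude/inputfile.txt
```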

When submitting a job to several nodes, mpiexec or mpirun needs to be told to allocate 1 task per node, so that OpenMP can do the work within each node in shared memory.
See submit.sh as an example.
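As a sketch only (not the actual contents of submit.sh), a one-task-per-node hybrid launch could look like the following with Open MPI; the placement flags differ between MPI implementations and batch schedulers:

```shell
# Hypothetical hybrid launch: 4 MPI tasks, one per node, 8 OpenMP threads each.
# --map-by ppr:1:node is Open MPI syntax; adapt it to your MPI library/scheduler.
export OMP_NUM_THREADS=8
mpirun -np 4 --map-by ppr:1:node ./flexwrf31_mpi /home/jbrioude/inputfile.txt
```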

Compared to the single-node version, this version includes modifications of:

 - flexwrf.f90, which is renamed flexwrf_mpi.f90
 - timemanager.f90, which is renamed timemanager_mpi.f90
 - the interpol*.f90 and hanna* routines, which have been modified
 - the *mpi*.f90 routines, which are used to send or receive data between nodes

The most important modifications are in timemanager_mpi.f90, initialize.f90 and advance.f90.
Search for JB in timemanager_mpi.f90 for additional comments.
In advance.f90, I modified the way the random number is picked (line 187). I use a simple counter and the id of the thread instead of the random pick that uses ran3.
If the series of random numbers is output for a given release box (uncomment lines 195 to 198), the distribution is quite good, and I don't see any bigger bias than the one in the single-thread version.
Of course, the distribution is less and less random as you increase the number of nodes or threads.

*********************************************************
Performance:
This is the performance of the loop at line 581 in timemanager_mpi.f90 that calculates the trajectories.
I use version v74 as the reference (single thread, Fortran 77).
There is a loss in performance between v74 and v90 because of the temporary variables th_* that have to be used as private variables in timemanager_mpi.f90.

 v74              1.00 (reference)
 v90 1 thread     0.96
 v90 2 threads    1.86
 v90 4 threads    3.57
 v90 8 threads    6.22

Performance of the communication between nodes:
This depends on the system. The supercomputer that I use can transfer about 1 Gb in 1 second.
In timemanager_mpi.f90, the output at lines 540 and 885 gives the time needed by the system to communicate between nodes. Using 100 million particles and, say, 4 nodes, it takes about 1 second.