Opened 5 years ago
Last modified 4 years ago
#252 new Defect
FLEXPART-WRF crashes with array bound mismatch
Reported by: | harish | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | FP coding/compilation | Version: | FLEXPART-WRF |
Keywords: | mpi, sendint_mpi | Cc: | |
Description (last modified by pesei)
Hi
The FLEXPART-WRF version 3.3.2 crashes after about five hours of a backward-mode run when it is run using mpirun.
Using the -fbacktrace and -fbounds-check options, I found that it crashes at line 132 in subroutine sendint_mpi.f90
if (tag.eq.1) npoint(jj2+1:numpart2)=mpi_npoint(1:chunksize2)
with error
Fortran runtime error: Array bound mismatch for dimension 1 of array 'npoint' (2500/2511)
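For reference, this message is what gfortran emits with -fbounds-check whenever the two sides of an array-section assignment have different lengths. A minimal standalone sketch (illustrative sizes only, not FLEXPART-WRF code) that reproduces the same kind of failure:
program shape_demo
  implicit none
  integer :: npoint(3000), mpi_npoint(3000)
  integer :: jj2, numpart2, chunksize2
  mpi_npoint = 0
  jj2        = 0
  numpart2   = 2500      ! left-hand section has 2500 elements
  chunksize2 = 2511      ! right-hand section has 2511 elements
  npoint(jj2+1:numpart2) = mpi_npoint(1:chunksize2)   ! aborts here at run time
  print *, npoint(1)
end program shape_demo
Compiled with gfortran -fbounds-check, this aborts with an 'Array bound mismatch for dimension 1 of array ...' error of the same form as above.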
My input file is similar to flexwrf.input.backward1 provided in the flexwrf_v31_testcases.tar.gz except for the release locations and dates.
Also, the par_mod.f90 file has been modified to accommodate the larger grid size of the WRF data being used, in the following manner:
integer, parameter :: nxmax=721, nymax=361, nuvzmax=64, nwzmax=64, nzmax=64
The code was run using the LSF command bsub < mpijobfile, where mpijobfile contains
mpirun -np 12 ./flexwrf33_gnu_mpi flexwrf.input.backward1
The code runs without crashing in serial mode; however, in that case the grid_time files contain only zero values, and each file is 372 bytes in size. I am not sure whether these two issues are related or separate.
Any help/hint to resolve this issue will be highly appreciated. The input file being used is attached. I will be glad to provide more information.
Harish
Attachments (1)
Change History (8)
Changed 5 years ago by harish
comment:1 Changed 5 years ago by harish
Hi
The FLEXPART-WRF version 3.3.2 crashes after about five hours of a backward-mode run when it is run using mpirun.
Using the -fbacktrace and -fbounds-check options, I found that it crashes at line 132 in subroutine sendint_mpi.f90
if (tag.eq.1) npoint(jj2+1:numpart2)=mpi_npoint(1:chunksize2)
with error
Fortran runtime error: Array bound mismatch for dimension 1 of array 'npoint' (2500/2511)
My input file is similar to flexwrf.input.backward1 provided in the flexwrf_v31_testcases.tar.gz except for the release locations and dates.
Also, the par_mod.f90 file has been modified to accommodate the larger grid size of the WRF data being used:
integer, parameter :: nxmax=721, nymax=361, nuvzmax=64, nwzmax=64, nzmax=64
The code was run using the LSF command bsub < mpijobfile, where mpijobfile contains
mpirun -np 12 ./flexwrf33_gnu_mpi flexwrf.input.backward1
The code runs without crashing in serial mode. However, the grid_time_yyyymmddhhmmss files then contain only zero values and are each 372 bytes in size. I am not sure whether these two issues are related or not.
Any help/hint to resolve this issue will be highly appreciated. The input file being used is attached. I will be glad to provide more information.
Harish
comment:2 Changed 5 years ago by pesei
Thank you for reporting this issue. I am on leave at this moment, not in my office, and I regret that I can't investigate the issue soon. I suggest that you post this also in the FLEXPART mailing list, hoping that other users might be able to help.
comment:3 follow-up: ↓ 6 Changed 5 years ago by harish
I think the error is caused by the way the chunksize2 variable is calculated at lines 66 to 71 in the file sendint_mpi.f90. The lines are shown below.
ii=0
do jj=1, numpart2, ntasks
  ii = ii + 1
  jj2 = jj
enddo
chunksize2=ii+numpart2-jj2
The ntasks variable is the number of nodes (MPI tasks) over which the job is distributed, and numpart2 is the total number of particles being released in the simulation. When numpart2 is an exact multiple of ntasks, the value calculated for chunksize2 is wrong.
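To illustrate, here is a minimal standalone sketch (illustrative values only, not FLEXPART-WRF code) that evaluates just this loop for two neighbouring particle counts:
program chunk_demo
  implicit none
  integer :: numpart2, ntasks, ii, jj, jj2, chunksize2, k
  ntasks = 12                        ! e.g. mpirun -np 12
  do k = 0, 1
     numpart2 = 2400 + k             ! 2400 is an exact multiple of ntasks
     ii = 0
     do jj = 1, numpart2, ntasks     ! same loop as in sendint_mpi.f90
        ii = ii + 1
        jj2 = jj
     end do
     chunksize2 = ii + numpart2 - jj2
     print *, 'numpart2 =', numpart2, ' chunksize2 =', chunksize2
  end do
end program chunk_demo
It prints chunksize2 = 211 for numpart2 = 2400 but 201 for numpart2 = 2401, illustrating the inconsistency at exact multiples.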
Until this bug is fixed, one can avoid it by deliberately choosing either the total number of particles to be released or the number of nodes over which the job is distributed such that the number of nodes is not an integer factor of the total number of particles to be released.
comment:4 Changed 5 years ago by harish
If possible, this ticket should be moved from the support to the defect category.
comment:5 Changed 5 years ago by pesei
- Type changed from Support to Defect
comment:6 in reply to: ↑ 3 Changed 5 years ago by harish
Replying to harish:
Until this bug is fixed, one can avoid it by deliberately choosing either the total number of particles to be released or the number of nodes over which the job is distributed such that the number of nodes is not an integer factor of the total number of particles to be released.
Earlier I mentioned that one can avoid the crash by setting the total number of particles to be released such that it is not an integer multiple of the number of nodes over which the job is distributed.
However, when the particles are emitted over a time range, e.g. one hour or one day, the code can still crash even when the total number of particles is not an integer multiple of the number of nodes.
This is because particles emitted over a time range are released at regular intervals in small batches, and at some point the cumulative number of emitted particles can become an integer multiple of the number of nodes, making the program crash. There is no easy way to avoid the crash under this circumstance.
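To illustrate, here is a small standalone sketch (illustrative batch sizes, not taken from the attached input file): with 100 particles per release interval and 12 tasks, the running total becomes an integer multiple of ntasks at the third batch, even though 100 itself is not a multiple of 12.
program cumulative_demo
  implicit none
  integer :: ntasks, nbatch, perbatch, total, i
  ntasks   = 12
  perbatch = 100          ! particles emitted per release interval
  nbatch   = 24           ! number of release intervals
  total    = 0
  do i = 1, nbatch
     total = total + perbatch
     if (mod(total, ntasks) == 0) then
        print *, 'after batch', i, ': cumulative particles =', total
     end if
  end do
end program cumulative_demo
The condition is met at batches 3, 6, 9, ... (300, 600, 900, ... particles), so a crash can occur partway through the release period.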
I had to modify a few lines in the subroutines sendreal_mpi.f90, sendreal2d_mpi.f90, sendint2_mpi.f90, senddouble_mpi.f90 and sendint_mpi.f90 in order to avoid the crash.
The following lines, which are the same in all the subroutines mentioned above and start around line 75 in the original code, are commented out.
!ii = 0
!do jj=1, numpart2, ntasks
! ii = ii + 1
! jj2 = jj
!enddo
!chunksize2=ii+numpart2-jj2
Immediately after the commented-out block, I added the following lines.
jj2 = 1 + ntasks * int(numpart2/ntasks)
if (jj2 .gt. numpart2) jj2 = numpart2
chunksize2 = int(numpart2/ntasks) + mod(numpart2, ntasks)
This avoids the crash and works for me, but I am not sure whether the changes mentioned above have side effects elsewhere in the program, since the variable chunksize2 is a common variable provided through mpi_mod.f90. Hopefully, they don't!
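For reference, evaluating the replacement arithmetic in isolation (same illustrative numbers as in the sketch under comment 3, not FLEXPART-WRF code) shows what the patched lines produce in the exact-multiple case:
program patch_demo
  implicit none
  integer :: numpart2, ntasks, jj2, chunksize2
  ntasks   = 12
  numpart2 = 2400                                    ! exact multiple of ntasks
  jj2 = 1 + ntasks * int(numpart2/ntasks)            ! 2401, clamped below
  if (jj2 .gt. numpart2) jj2 = numpart2              ! jj2 becomes 2400
  chunksize2 = int(numpart2/ntasks) + mod(numpart2, ntasks)   ! 200
  print *, 'jj2 =', jj2, ' chunksize2 =', chunksize2
end program patch_demo
The original loop gives chunksize2 = 211 for the same inputs, so the patch removes the surplus of ntasks-1 elements that appeared only when numpart2 was an exact multiple of ntasks.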
comment:7 Changed 4 years ago by pesei
- Description modified (diff)
I am sorry, but it will be difficult to solve this issue as we don't have a real maintainer for FpWRF at the moment, and I am not very familiar with MPI.
If you have any new findings, please post them.
input file used for the run