Opened 5 years ago

Last modified 4 years ago

#252 new Defect

FLEXPART-WRF crashes with an array bound mismatch

Reported by: harish Owned by:
Priority: major Milestone:
Component: FP coding/compilation Version: FLEXPART-WRF
Keywords: mpi, sendint_mpi Cc:

Description (last modified by pesei)

Hi
FLEXPART-WRF version 3.3.2 crashes after about five hours of a backward-mode run when it is run using mpirun.

Using the -fbacktrace and -fbounds-check options, I found that it crashes at line 132 in subroutine sendint_mpi.f90:

if (tag.eq.1) npoint(jj2+1:numpart2)=mpi_npoint(1:chunksize2)

with error

Fortran runtime error: Array bound mismatch for dimension 1 of array 'npoint' (2500/2511)
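
For context, here is a minimal standalone sketch (not FLEXPART code; the array sizes are made up) that triggers the same class of runtime error when the two array sections have different lengths and the program is compiled with gfortran -fbounds-check:

program bound_mismatch_demo
  ! minimal sketch with hypothetical sizes: assigning array sections of
  ! different lengths aborts under gfortran -fbounds-check
  implicit none
  integer :: npoint(20), mpi_npoint(20)
  integer :: jj2, numpart2, chunksize2

  mpi_npoint = 0
  numpart2   = 20
  jj2        = 9     ! left-hand section npoint(10:20) has 11 elements
  chunksize2 = 12    ! right-hand section mpi_npoint(1:12) has 12 elements

  ! aborts here with "Array bound mismatch for dimension 1 of array 'npoint'"
  npoint(jj2+1:numpart2) = mpi_npoint(1:chunksize2)
  print *, npoint(numpart2)
end program bound_mismatch_demo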

My input file is similar to flexwrf.input.backward1 provided in the flexwrf_v31_testcases.tar.gz except for the release locations and dates.

Also, the par_mod.f90 file has been modified to accommodate the larger grid size of the WRF data being used, as follows:

integer, parameter :: nxmax=721, nymax=361, nuvzmax=64, nwzmax=64, nzmax=64

The code was run using the LSF command bsub < mpijobfile, where mpijobfile contains

mpirun -np 12 ./flexwrf33_gnu_mpi flexwrf.input.backward1

The code runs without crashing in serial mode; however, in that case the grid_time output files contain only zero values and are all 372 bytes in size. I am not sure whether these two issues are related or separate.

Any help/hint to resolve this issue will be highly appreciated. The input file being used is attached. I will be glad to provide more information.

Harish

Attachments (1)

flexwrf.input.backward1 (6.2 KB) - added by harish 5 years ago.
input file used for the run


Change History (8)

Changed 5 years ago by harish

input file used for the run

comment:1 Changed 5 years ago by harish

Hi
FLEXPART-WRF version 3.3.2 crashes after about five hours of a backward-mode run when it is run using mpirun.

Using the -fbacktrace and -fbounds-check options, I found that it crashes at line 132 in subroutine sendint_mpi.f90:

if (tag.eq.1) npoint(jj2+1:numpart2)=mpi_npoint(1:chunksize2)

with error

Fortran runtime error: Array bound mismatch for dimension 1 of array 'npoint' (2500/2511)

My input file is similar to flexwrf.input.backward1 provided in the flexwrf_v31_testcases.tar.gz except for the release locations and dates.

Also, the par_mod.f90 file has been modified to accommodate the larger grid size of the WRF data being used:

integer, parameter :: nxmax=721, nymax=361, nuvzmax=64, nwzmax=64, nzmax=64

The code was run using the LSF command bsub < mpijobfile, where mpijobfile contains

mpirun -np 12 ./flexwrf33_gnu_mpi flexwrf.input.backward1

The code runs without crashing in a serial-mode run. However, the grid_time_yyyymmddhhmmss output files contain only zero values and their size is only 372 bytes. I am not sure whether these two issues are related or not.

Any help/hint to resolve this issue will be highly appreciated. The input file being used is attached. I will be glad to provide more information.

Harish

comment:2 Changed 5 years ago by pesei

Thank you for reporting this issue. I am on leave at this moment, not in my office, and I regret that I can't investigate the issue soon. I suggest that you post this also in the FLEXPART mailing list, hoping that other users might be able to help.

comment:3 follow-up: Changed 4 years ago by harish

I think the error is caused by the way the chunksize2 variable is calculated at lines 66 to 71 of sendint_mpi.f90. The lines are shown below.

ii = 0
do jj = 1, numpart2, ntasks
  ii = ii + 1
  jj2 = jj
enddo
chunksize2 = ii + numpart2 - jj2

The ntasks variable is the number of nodes (MPI tasks) over which the job is distributed. When numpart2, the total number of particles released in the simulation, is an exact multiple of ntasks, the value calculated for chunksize2 is wrong.
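
For illustration, a small standalone check of that loop (with hypothetical numpart2 and ntasks values, not FLEXPART code) shows the overcount when numpart2 is an exact multiple of ntasks:

program chunksize_demo
  ! standalone sketch with made-up values: the loop quoted above yields
  ! chunksize2 = 6, whereas numpart2/ntasks would be 4
  implicit none
  integer :: ii, jj, jj2, numpart2, ntasks, chunksize2

  numpart2 = 12
  ntasks   = 3

  ii = 0
  do jj = 1, numpart2, ntasks
    ii = ii + 1
    jj2 = jj
  end do
  chunksize2 = ii + numpart2 - jj2   ! = 4 + 12 - 10 = 6

  print *, 'chunksize2 =', chunksize2, '  numpart2/ntasks =', numpart2/ntasks
end program chunksize_demo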

Until this bug is fixed, one can work around it by deliberately choosing either the total number of particles to be released or the number of nodes over which the job is distributed such that the number of nodes is not an integer factor of the total number of particles to be released.

comment:4 Changed 4 years ago by harish

If possible, this ticket should be moved from the support category to the defect category.

comment:5 Changed 4 years ago by pesei

  • Type changed from Support to Defect

comment:6 in reply to: ↑ 3 Changed 4 years ago by harish

Replying to harish:

Until this bug is fixed, one can work around it by deliberately choosing either the total number of particles to be released or the number of nodes over which the job is distributed such that the number of nodes is not an integer factor of the total number of particles to be released.

Earlier I mentioned that one can avoid the crash by setting the total number of particles to be released such that it is not an integer multiple of the number of nodes over which the job is distributed.

However, when the particles are emitted over a time range (e.g. one hour, one day, etc.), the code can still crash even when the total number of particles is not an integer multiple of the number of nodes.

This is because when particles are emitted over a time range, they are released at regular intervals in small batches. At some point, the cumulative number of emitted particles can become an integer multiple of the number of nodes and make the program crash. There is no easy way to avoid the crash under this circumstance.
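
For example (made-up numbers, not from the actual run), the following standalone sketch illustrates how the cumulative count can hit a multiple of the number of tasks even when the final total does not:

program cumulative_demo
  ! standalone sketch, hypothetical numbers: 7 batches of 100 particles with
  ! 12 tasks; the final total (700) is not a multiple of 12, but the
  ! cumulative counts 300 and 600 reached along the way are
  implicit none
  integer :: ntasks, batch, nbatches, i, cum

  ntasks   = 12
  batch    = 100
  nbatches = 7

  cum = 0
  do i = 1, nbatches
    cum = cum + batch
    if (mod(cum, ntasks) == 0) print *, 'after batch', i, 'cumulative', cum, 'is a multiple of ntasks'
  end do
end program cumulative_demo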

I had to modify a few lines in the subroutines sendreal_mpi.f90, sendreal2d_mpi.f90, sendint2_mpi.f90, senddouble_mpi.f90 and sendint_mpi.f90 in order to avoid the crash.

The following lines, which are the same in all of the subroutines mentioned above and start around line 75 of the original code, are commented out:

!ii = 0
!do jj=1, numpart2, ntasks
!       ii = ii + 1
!       jj2 = jj
!enddo
!chunksize2=ii+numpart2-jj2

Immediately after the commented-out block, I added the following lines:

jj2 = 1 + ntasks * int(numpart2/ntasks)
if (jj2 .gt. numpart2) jj2 = numpart2
chunksize2 = int(numpart2/ntasks) + mod(numpart2, ntasks)
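
As a quick standalone sanity check (not part of the FLEXPART sources; the particle and task counts below are hypothetical), the original loop and the replacement formula can be compared side by side:

program compare_chunksize
  ! standalone check with hypothetical counts: the two calculations agree
  ! unless numpart2 is an exact multiple of ntasks (here 2400 with 12 tasks)
  implicit none
  integer :: cases(3), numpart2, ntasks, k
  integer :: ii, jj, jj2_old, jj2_new, old, new

  cases  = (/ 2511, 2404, 2400 /)
  ntasks = 12

  do k = 1, 3
    numpart2 = cases(k)

    ! original calculation (the loop commented out above)
    ii = 0
    do jj = 1, numpart2, ntasks
      ii = ii + 1
      jj2_old = jj
    end do
    old = ii + numpart2 - jj2_old

    ! replacement calculation proposed in this comment
    jj2_new = 1 + ntasks * int(numpart2/ntasks)
    if (jj2_new .gt. numpart2) jj2_new = numpart2
    new = int(numpart2/ntasks) + mod(numpart2, ntasks)

    print *, 'numpart2 =', numpart2, ' old chunksize2 =', old, ' new chunksize2 =', new
  end do
end program compare_chunksize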

This avoids the crash and works for me, but I am not sure whether the changes mentioned above have side effects elsewhere in the program, since the variable chunksize2 is a shared variable provided through mpi_mod.f90. Hopefully they don't!

comment:7 Changed 4 years ago by pesei

  • Description modified (diff)

I am sorry, but it will be difficult to solve this issue as we don't have a real maintainer for FpWRF at the moment, and I am not very familiar with MPI.

If you have any new findings, please post them.
