Opened 2 years ago

Last modified 16 months ago

#169 assigned Support

Seg fault - particles moving out of ABL by hor. diffusion

Reported by: victoria.sinclair@… Owned by: jbrioude
Priority: major Milestone:
Component: FP physics/numerics Version: FLEXPART-WRF
Keywords: Cc: petra.seibert@…

Description

I am using flexpart-wrf version 3.3.1. I have it compiled in parallel (MPI) using Intel compilers. I have successfully run flexpart with 2 million particles for 4 simulations, but on the 5th simulation I get this error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred

These 5 flexpart simulations all use output from the same WRF simulation - the only thing I am changing is the release time. In the experiment that fails, if I change the number of particles from 2 million to 500000, it works OK. Any idea how I can fix this problem? I have tried giving the jobs lots of memory, so I don't think that is the problem. Thanks

Change History (21)

comment:1 Changed 2 years ago by pesei

  • Component changed from FP other to FP coding/compilation
  • Owner set to pesei
  • Status changed from new to accepted

comment:2 Changed 2 years ago by pesei

This is a sign that your stack is becoming (too) large.

  • Try compiling with -mcmodel=medium. If you are using ifort, read also the respective man page section.
  • Make sure that you allow enough stack size, e.g. unlimit stacksize (csh) or ulimit -s unlimited (bash).
  • And of course make sure that you have enough free memory on your machine.

comment:3 Changed 2 years ago by victoria.sinclair@…

I had already compiled with the -mcmodel=medium option. I tried unlimiting the stack size but still get the same segmentation fault. I am using ifort and also tried compiling with the -heap-arrays option, as suggested on this page:

https://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors

but again I get the segmentation fault.

Any other ideas of what to try?

Thanks, Victoria

comment:4 Changed 2 years ago by pesei

I assume that you have already studied the recommendations by Intel. As you say that the programme works with smaller dimensions, it is hard to imagine that the cause is anything other than exhaustion of your stack size or your physical memory. As you use parallelisation, you have to make sure that the memory available to each thread is sufficient; if necessary, use environment variables to control the distribution of threads over your CPUs. I suggest that you turn to a sysadmin and/or the supercomputing support of your site for further help - it is probably not a FLEXPART code problem. Please keep us posted.

comment:5 Changed 2 years ago by victoria.sinclair@…

So with the help of our computing support, I have run the code in a debugger. The code fails in conccalc_irreg on line 135. The print statement

if (xtra1(i).ne.xtra1(i)) print*,i,itra1(i),xtra1(i),ytra1(i),ix,jy

triggers and the output of this print statement is

1061847 10800 NaN NaN -2147483648 -2147483648

Since this print statement was already in the code, I am wondering if someone else has had this problem before. We will continue to investigate, but if anyone knows why this line was put in, it would probably help us.
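For reference, the condition can only trigger for NaN values: NaN is the only value that compares unequal to itself, so x .ne. x is a portable inline NaN test. A tiny standalone illustration (not FLEXPART code):

{{{
! Standalone demo: NaN compares unequal to itself, so x .ne. x
! is a portable inline NaN test.
program nan_test
  use ieee_arithmetic, only: ieee_value, ieee_quiet_nan, ieee_is_nan
  implicit none
  real :: x
  x = ieee_value(x, ieee_quiet_nan)  ! create a quiet NaN
  print *, x .ne. x                  ! T  (the test used in conccalc_irreg)
  print *, ieee_is_nan(x)            ! T  (Fortran 2003 equivalent)
end program nan_test
}}}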

comment:6 Changed 2 years ago by pesei

Thank you for the update. Did you compile with just -g -fbacktrace, or also with -fbounds-check? If not, please add it and repeat. And please use the `...` or {{{...}}} syntax when you quote code.

As for the print statement, I hope that jbrioude can answer.

comment:7 Changed 2 years ago by victoria.sinclair@…

I use the following compile options:

GNU_FFLAGS = -g -m64 -mcmodel=medium -fconvert=little-endian -finit-local-zero -fno-range-check

GNU_LDFLAGS = -g -m64 -mcmodel=medium -fconvert=little-endian -finit-local-zero -lnetcdff -fno-range-check

(Previously I had compiled with intel, but I get the same problem with both compilers.)

We also noticed from the debugger that just one particle "goes bad", i.e. when we check the particle positions, all particles except one have sensible values, but that one has its x, y, and z positions all as NaNs. So far we have not figured out where the first NaN appears.

comment:8 Changed 2 years ago by pesei

Please repeat your run after compiling with -fbacktrace -fbounds-check! It could be an array-bounds violation. You don't need -g or the debugger for this.

comment:9 Changed 2 years ago by victoria.sinclair@…

I have now recompiled and re-run with the following options:

GNU_FFLAGS = -fbacktrace -fbounds-check -m64 -mcmodel=medium -fconvert=little-endian -finit-local-zero -fno-range-check
GNU_LDFLAGS = -fbacktrace -fbounds-check -lnetcdff -m64 -mcmodel=medium -fconvert=little-endian -finit-local-zero -fno-range-check

The run now fails earlier, with this error:

At line 133 of file interpol_wind.f90

Fortran runtime error: Index '0' of dimension 1 of array 'height' below lower bound of 1

srun: error: c57: task 0: Exited with exit code 2

comment:10 Changed 2 years ago by pesei

  • Cc petra.seibert@… added
  • Owner changed from pesei to jbrioude
  • Status changed from accepted to assigned

OK, so now we have the right place to check. The next step would be to work with the debugger (setting breakpoints and printing variables) and/or with print statements to find out why this happens. Are you able to do that? It might be complicated if we have to do it ourselves, as this error seems not so easy to reproduce elsewhere.
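For example, a diagnostic print just before the failing statement (only a sketch; the index name indz and the variables printed are my assumptions, not the actual code around line 133 of interpol_wind.f90):

{{{
! Sketch of a diagnostic guard before the height(...) access that fails.
! 'indz' and the printed variables are assumptions; adapt to the code.
if (indz .lt. 1) then
  print *, 'interpol_wind: bad vertical index', indz, xt, yt, zt
end if
}}}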

I am re-assigning this ticket to Jerome, as I think it falls more into his realm, but I'll keep tracking it and help if I can.

Thank you for your patience.

comment:11 Changed 2 years ago by victoria.sinclair@…

Hello,

So after some extensive debugging, we have found the source of the segmentation fault. On line 90 of hanna.f90, the argument of the square root becomes negative. This line is

sigw=sqrt(1.2*wst**2*(1.-.9*zeta)*zeta**0.66666+ (1.8-1.4*zeta)*ust**2)+1.e-2

When the code fails, wst = 0.577664, ust = 0.317023 and zeta = 1.67207. A quick check shows that for these values of ust and wst (which I think look sensible), the argument of the square root becomes negative once zeta (= z/h) exceeds roughly 1.2.
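A minimal standalone program reproducing this quick check, with the values hard-coded from the debugger output:

{{{
! Standalone check: prints the argument of the sqrt in hanna.f90 line 90
! as a function of zeta, for the ust/wst reported by the debugger.
! For these values the argument changes sign between zeta = 1.15 and 1.2.
program check_sigw
  implicit none
  real :: wst, ust, zeta, arg
  integer :: k
  wst = 0.577664
  ust = 0.317023
  do k = 100, 130, 5               ! zeta from 1.00 to 1.30
    zeta = real(k) / 100.
    arg  = 1.2*wst**2*(1.-.9*zeta)*zeta**0.66666 + (1.8-1.4*zeta)*ust**2
    print *, 'zeta =', zeta, '  sqrt argument =', arg
  end do
end program check_sigw
}}}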

How should I fix this? The easy solution would be to add a check here that sets sigw to zero if the argument of the square root is negative, but this doesn't seem a very elegant solution. Any advice on how to proceed would be welcome.

Thanks,

comment:12 Changed 2 years ago by pesei

Without going into details or looking at the code, my answer is: sigw should only be calculated in the ABL, or maybe within the entrainment zone (EZ), which is often taken as 0.8 < zeta < 1.2.
Thus, we need to find out why hanna is called at all for particles higher than 1.2 h.

comment:13 Changed 2 years ago by victoria.sinclair@…

Ok, thanks. I think this problem might be because I am trying to run flexpart over the Himalayas where the terrain is very complex. I'll try and investigate more.

comment:14 Changed 2 years ago by pesei

  • Component changed from FP coding/compilation to FP physics/numerics
  • Priority changed from minor to major

Thank you for the information. At what resolution of the meteo grid does this happen, and is it in a model region with steep slopes? I suspect that horizontal diffusion might drive the particle out of the ABL. If this can happen, we should fix it.

comment:15 Changed 2 years ago by victoria.sinclair@…

The input meteo grid has 1 km grid spacing. The error occurs where the topography height is 5932 m. There are very steep slopes: if I go another 3 grid points in the x-direction, the topography is at 8270 m, so the topography height increases by 2338 m in just 3 km. So I guess it is not surprising that, if the horizontal diffusion acts on constant height levels, a particle can be mixed out of the boundary layer just by moving sideways. The boundary layer depth (h in hanna.f90) is 1660 m and the Obukhov length (ol) is -481.26 m. I guess I am just pushing the model to its limits! However, if you have any suggestions how to solve this, I'd certainly appreciate them.

Thanks

comment:16 Changed 2 years ago by pesei

If you just want a quick fix, I would suggest simply limiting the value of zeta to 1.2 (or 1.2 - tiny(1.)).
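In code, that would be something like this near the top of hanna.f90 (the exact placement is up to you):

{{{
! Quick-fix sketch: cap zeta just below the usual top of the
! entrainment zone before it enters the sigma-w formula.
zeta = min(zeta, 1.2 - tiny(1.))
}}}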

In the longer run, we probably should

  1. make sure that particles which move out of the ABL leave the section of the code that is reserved for inside-ABL particles
  2. introduce, also in the WRF version, the option of geometrically horizontal diffusion which we have put into the MM5 version of FLEXPART, and make sure it does what it should do.

comment:17 follow-up: Changed 2 years ago by pesei

Did the quick fix work, or what's the status?

comment:18 in reply to: ↑ 17 Changed 16 months ago by pesei

Replying to pesei:

Did the quick fix work, or what's the status?

Please reply, it would be of interest to others how you solved the issue.

comment:19 Changed 16 months ago by pesei

  • Summary changed from Segmentation fault to Seg fault - particles moving out of ABL by hor. diffusion

comment:20 Changed 16 months ago by victoria.sinclair@…

Sorry for not replying earlier. I managed to get this simulation to work by adding an if statement, conditional on zeta, before the call to hanna in advance.f90:

if (zeta .lt. 1.15) then

I first tried the larger value of 1.2, but this also crashed. It was trial and error to find a value that worked - there is no theoretical basis for the value 1.15.

In my version of the code (where I have added numerous print statements), this is at line 505, in the section of the code under the subheadings:

! Compute the turbulent disturbances
! Determine the sigmas and the timescales
However, I feel this was a bit of a hack really, and if others have similar problems, a more elegant and physically meaningful solution should be found.

comment:21 Changed 16 months ago by pesei

Thank you. So let's keep the ticket open as it needs to be properly fixed.
