Opened 7 years ago

Last modified 4 years ago

#169 assigned Defect

Seg fault - particles moving out of ABL by hor. diffusion

Reported by: victoria.sinclair@… Owned by: jbrioude
Priority: major Milestone: FLEXPART_WRF_3.4_FPbase_9
Component: FP physics/numerics Version: FLEXPART-WRF
Keywords: Cc: petra.seibert@…

Description

I am using FLEXPART-WRF version 3.3.1. I have it compiled in parallel (MPI) using Intel compilers. I have successfully run FLEXPART with 2 million particles for 4 simulations, but on the 5th simulation I get this error:

forrtl: severe (174): SIGSEGV, segmentation fault occurred

These 5 flexpart simulations are all using output from the same WRF simulation - the only thing I am changing is the release time. In the experiment which fails, if I change the number of particles from 2 million to 500000 it works ok. Any idea how I can fix this problem? I have tried giving the jobs lots of memory so I don't think that is the problem. Thanks

Attachments (1)

flexwrf.input (6.4 KB) - added by broozitalab 4 years ago.
input runtime file


Change History (36)

comment:1 Changed 7 years ago by pesei

  • Component changed from FP other to FP coding/compilation
  • Owner set to pesei
  • Status changed from new to accepted

comment:2 Changed 7 years ago by pesei

This is a sign that your stack is becoming (too) large.

  • Try to compile with -mcmodel=medium. If you are using ifort, read also the respective man page section.
  • Make sure that you allow a large enough stack size, e.g. unlimit stacksize (csh) or ulimit -s unlimited (bash),
  • and of course make sure that you have enough free memory on your machine.

comment:3 Changed 7 years ago by victoria.sinclair@…

I had already compiled with the -mcmodel=medium option. I tried unlimiting the stack size but still get the same segmentation fault. I am using ifort and also tried compiling with the -heap-arrays option, as suggested on this page:

https://software.intel.com/en-us/articles/determining-root-cause-of-sigsegv-or-sigbus-errors

but again I get the segmentation error.

Any other ideas of what to try?

Thanks, Victoria

comment:4 Changed 7 years ago by pesei

I assume that you have already studied the recommendations by Intel. As you say that the programme works with smaller dimensions, it is hard to imagine that the cause is anything other than exhaustion of your stack size or your physical memory. As you mention that you use parallelisation, you have to make sure that the memory available for each thread is sufficient. If necessary, use environment variables to control the distribution of threads over your CPUs. I suggest that you turn to a sysadmin and/or the supercomputing support at your site for further help - it is probably not a FLEXPART code problem. Please keep us posted.

comment:5 Changed 7 years ago by victoria.sinclair@…

So with the help of our computing support, I have run the code using a debugger. The code fails in conccalc_irreg on line 135. The print statement

if (xtra1(i).ne.xtra1(i)) print*,i,itra1(i),xtra1(i),ytra1(i),ix,jy

triggers and the output of this print statement is

1061847 10800 NaN NaN -2147483648 -2147483648

Since this print statement was already in the code, I am wondering if someone else has had this problem before. We will continue to investigate, but if anyone knows why this line was put in, it would probably help us.
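For reference, the test xtra1(i).ne.xtra1(i) can only be true when xtra1(i) is NaN, since an IEEE NaN compares unequal to itself, so the line looks like a deliberate trap for corrupted particle positions. A minimal stand-alone illustration of the idiom (not FLEXPART code):

{{{
program nan_self_test
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: x = 0.0
  x = ieee_value(x, ieee_quiet_nan)   ! create a quiet NaN
  ! NaN is the only value that compares unequal to itself,
  ! so this branch is taken exactly when x has become NaN
  if (x .ne. x) print *, 'x is NaN:', x
end program nan_self_test
}}}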

comment:6 Changed 7 years ago by pesei

Thank you for the update. Did you compile with just -g -fbacktrace, or also with -fbounds-check? If only the former, please add -fbounds-check and repeat. And please use the `...` or {{{...}}} syntax when you quote code.

As for the print statement, I hope that jbrioude can answer.

comment:7 Changed 7 years ago by victoria.sinclair@…

I use the following compile options:

GNU_FFLAGS = -g -m64 -mcmodel=medium -fconvert=little-endian -finit-local-zero -fno-range-check

GNU_LDFLAGS = -g -m64 -mcmodel=medium -fconvert=little-endian -finit-local-zero -lnetcdff -fno-range-check

(Previously I had compiled with intel, but I get the same problem with both compilers.)

We also noticed from the debugger that it is just one particle that "goes bad", i.e. when we check the particle positions, all particles except one have sensible values, but that one has NaN for its x, y and z positions. So far we have not figured out where the first NaN appears.

comment:8 Changed 7 years ago by pesei

Please repeat your run after compiling with -fbacktrace -fbounds-check! It could be an array bounds violation. You don't need -g or the debugger for this.

comment:9 Changed 7 years ago by victoria.sinclair@…

I have now recompiled and re-run with the following options:

GNU_FFLAGS = -fbacktrace -fbounds-check -m64 -mcmodel=medium -fconvert=little-endian -finit-local-zero -fno-range-check
GNU_LDFLAGS = -fbacktrace -fbounds-check -lnetcdff -m64 -mcmodel=medium -fconvert=little-endian -finit-local-zero -fno-range-check

The run now fails earlier with this error:

At line 133 of file interpol_wind.f90

Fortran runtime error: Index '0' of dimension 1 of array 'height' below lower bound of 1

srun: error: c57: task 0: Exited with exit code 2

comment:10 Changed 7 years ago by pesei

  • Cc petra.seibert@… added
  • Owner changed from pesei to jbrioude
  • Status changed from accepted to assigned

OK, so now we have the right place to check. The next steps would be to work with the debugger (setting breakpoints and printing variables) and/or with print statements to find out why this happens. Are you able to do that? It might be complicated if we have to do it, as it seems that this error is not so easy to reproduce elsewhere.

I am re-assigning this ticket to Jerome, as I think it falls more into his realm, but I'll keep tracking it and help if I can.

Thank you for your patience.

comment:11 Changed 7 years ago by victoria.sinclair@…

Hello,

So after some extensive debugging we have found the source of the segmentation fault. On line 90 of hanna.f90, the argument of the square root is negative. This line is

sigw=sqrt(1.2*wst**2*(1.-.9*zeta)*zeta**0.66666+ (1.8-1.4*zeta)*ust**2)+1.e-2

When the code fails, wst = 0.577664, ust = 0.317023 and zeta = 1.67207. A quick check shows that for these values of ust and wst (which I think look sensible?), the argument of the square root becomes negative once zeta (= z/h) grows beyond about 1.2.
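A stand-alone check with these values, using the expression quoted above (an illustration only, evaluated outside FLEXPART):

{{{
program sigw_argument_check
  implicit none
  real :: wst, ust, zeta, arg
  wst  = 0.577664
  ust  = 0.317023
  zeta = 1.67207
  ! argument of the square root in the sigw line of hanna.f90
  arg = 1.2*wst**2*(1.-.9*zeta)*zeta**0.66666 + (1.8-1.4*zeta)*ust**2
  print *, 'sqrt argument = ', arg   ! about -0.34, so sqrt(arg) gives NaN
end program sigw_argument_check
}}}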

How should I fix this? The easy solution would be to add a check here that sets sigw to zero if the argument of the square root is negative, but this doesn't seem a very elegant solution. Any advice on how to proceed would be welcome.

Thanks,

comment:12 Changed 7 years ago by pesei

Without going into details or looking into the code, my answer is: sigw should only be calculated in the ABL, or maybe within the entrainment zone (EZ), which is often taken as 0.8 < zeta < 1.2.
Thus, we need to find out why hanna is called at all for particles that are higher than 1.2*h.

comment:13 Changed 7 years ago by victoria.sinclair@…

Ok, thanks. I think this problem might be because I am trying to run flexpart over the Himalayas where the terrain is very complex. I'll try and investigate more.

comment:14 Changed 7 years ago by pesei

  • Component changed from FP coding/compilation to FP physics/numerics
  • Priority changed from minor to major

Thank you for the information. At what resolution of the meteo grid does this happen, and is it in a model region with steep slopes? I am suspecting that horizontal diffusion might drive the particle out of the ABL. If this can happen, we should fix it.

comment:15 Changed 7 years ago by victoria.sinclair@…

The input meteo grid is at 1 km grid spacing. The error occurs where the topography height is 5932 m. There are very steep slopes - if I go another 3 grid points in the x-direction the topography is 8270 m, so the topography height increases by 2338 meters in just 3 km. So I guess it is not surprising that, if the horizontal diffusion acts on constant height levels, a particle can be mixed out of the boundary layer just by moving sideways. The boundary layer depth (h in hanna.f90) is 1660 m and the Obukhov length (ol) is -481.26 m. I guess I am just pushing the model to its limits! However, if you have any suggestions on how to solve this, I'd certainly appreciate them.

Thanks

comment:16 Changed 7 years ago by pesei

If you just want a quick fix, I would suggest simply limiting the value of zeta to 1.2 (or 1.2 - tiny(1.)).
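A minimal sketch of that clamp, assuming it is applied to zeta before the sigma profiles are evaluated (e.g. near the top of hanna.f90; the variable name is as quoted in this ticket):

{{{
! quick-fix sketch only: limit zeta to the assumed top of the entrainment
! zone before the Hanna profiles (sigu, sigv, sigw, ...) are evaluated
zeta = min(zeta, 1.2)
}}}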

In the longer run, we probably should

  1. make sure that particles which move out of the ABL leave the section of the code that is reserved for inside-ABL particles
  2. introduce into the WRF version the option of geometrically horizontal diffusion that we have put into the MM5 version of FLEXPART, and make sure it does what it should do.

comment:17 follow-up: Changed 7 years ago by pesei

Did the quick fix work, or what's the status?

comment:18 in reply to: ↑ 17 Changed 6 years ago by pesei

Replying to pesei:

Did the quick fix work, or what's the status?

Please reply, it would be of interest to others how you solved the issue.

comment:19 Changed 6 years ago by pesei

  • Summary changed from Segmentation fault to Seg fault - particles moving out of ABL by hor. diffusion

comment:20 Changed 6 years ago by victoria.sinclair@…

Sorry for not replying earlier. I managed to get this simulation to work by adding an if statement conditional on zeta before the call to hanna in advance.f90:

if (zeta .lt. 1.15) then

I first tried the larger value of 1.2, but this also crashed. Finding a value that worked was trial and error - there is no theoretical basis for the value of 1.15.

In my version of the code (where I have added numerous print statements) this is at line 505, which is in the section of the code under the subheadings

! Compute the turbulent disturbances
! Determine the sigmas and the timescales
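For clarity, the change has roughly this shape (a sketch only; the argument list of hanna is a placeholder, keep whatever the original call in advance.f90 passes):

{{{
if (zeta .lt. 1.15) then
  ! the existing call to the Hanna ABL turbulence profiles stays in here;
  ! particles with zeta >= 1.15 now skip it entirely
  call hanna(...)   ! placeholder: keep the original argument list
endif
}}}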

However, I feel this was a bit of a hack, and if others have similar problems a more elegant and physically meaningful solution should be found.

comment:21 Changed 6 years ago by pesei

Thank you. So let's keep the ticket open as it needs to be properly fixed.

comment:22 Changed 4 years ago by pesei

  • Milestone set to FLEXPART_WRF_3.4_FPbase_9

comment:23 Changed 4 years ago by pesei

  • Type changed from Support to Defect

comment:24 follow-up: Changed 4 years ago by broozitalab

Hi,

I have encountered the same problem: I am doing 72-hour back-trajectory runs with FLEXPART-WRF, and they work fine for some days while giving segmentation faults for other days or release times. Using Intel, I used these flags:
INTEL_FFLAGS = -heap-arrays -g -traceback -CB -O2 -cpp -mcmodel=medium -shared-intel
INTEL_LDFLAGS = -heap-arrays -g -traceback -CB -O2 -cpp -mcmodel=medium -shared-intel -lnetcdff
I am using compiled libraries based on parallel_studio available on the university's HPC. However, I had to use mpifort instead of mpiifort, as it seems the latter is not available. (I'm not a computer scientist and I barely know what the difference is, sorry!)

Anyhow, the error is
forrtl: severe (408): fort: (2): Subscript #1 of the array ITRA1 has value 11001 which is greater than the upper bound of 11000

which basically points to timemanager_mpi.f90 line 804.

I found that this may be a memory issue; however, I still get the error after using ulimit -s unlimited and export KMP_STACKSIZE=512M.
I guess it is more of a run-time error, but I would appreciate any hints you could give me.
Thanks in advance.

comment:25 in reply to: ↑ 24 Changed 4 years ago by pesei

Replying to broozitalab:
Are you also working in complex terrain?

comment:26 Changed 4 years ago by broozitalab

I am focused on northern India, so yes, a small portion of the domain's corner covers the Himalayas.

comment:27 Changed 4 years ago by broozitalab

An update that may be useful for developers: it worked just fine after changing the wind option from 'based on divergence (-1)' to 'snapshot winds (0)'. Basically, these are the options I used:

-10 CTL
10 IFINE
5 IOUT
1 IPOUT
1 LSUBGRID
0 LCONVECTION
3600 DT_CONV (real)
0 LAGESPECTRA
0 IPIN
1 IFLUX
1 LINIT_COND
1 TURB_OPTION
1 LU_OPTION
1 CBL SCHEME
0 SFC_OPTION
0 WIND_OPTION
0 TIME_OPTION

comment:28 Changed 4 years ago by pesei

I think your problem is the same as encountered by Victoria, and you could apply the same fix.

I am wondering why you switched to snapshot winds - it's less accurate.

comment:29 Changed 4 years ago by broozitalab

I had implemented the if-condition that Victoria suggested, but it didn't solve the problem. In general I agree about avoiding simplified options, but I guess using snapshot winds wouldn't matter much, as I only want to find the path of the trajectory (mass mean), would it?

comment:30 Changed 4 years ago by pesei

Try to reduce the limiting value of zeta.

It does matter for the central mass trajectory - how much is difficult to say, but dynamically balanced wind fields should be better.

comment:31 Changed 4 years ago by broozitalab

It didn't help! I changed the limit even to 0.5; the model went further (and passed the if-condition less often) but again crashed with the divergence-based wind option. It should be mentioned that the particles' centre of mass is not close to complex terrain or to any drastic shift at the time the error occurs.

comment:32 Changed 4 years ago by pesei

Thank you for trying. I am wondering why limiting zeta to, e.g., 1.0 does not solve the problem.
Are you certain that you get exactly the same error condition? Could you please write out all the information about the particle?

It would be helpful if you could keep a kind of minimum version of this setting as a test case, to develop a proper fix.

Changed 4 years ago by broozitalab

input runtime file

comment:33 Changed 4 years ago by broozitalab

Attached is the input file I used, which worked fine. I tried using the divergence-based wind option and it crashed. Adding the zeta condition (< 1.0) before calling hanna didn't work either; basically, it stopped at a similar time. As the simplified snapshot option works for my purpose, I am fine with using it. Anyhow, thanks a lot for your time and help.

comment:34 Changed 4 years ago by pesei

After looking at your input file, I should say that the best choice for the wind setting is to use time-averaged fields, which, however, requires making WRF produce them.

Could you please keep your WRF files, and possibly make them available to us for developing the fix? (How big would the data set be?)

comment:35 Changed 4 years ago by broozitalab

You are right. I didn't have the required fields in my wrfout files to use time-averaged values.
The original files are pretty big (~800 MB hourly files), but I can extract the essential variables. Please let me know which variables are important and an email address that I can share the data with.
