Opened 4 years ago

Last modified 3 years ago

#6409 assigned defect

Simulation is not stopped when an error happens

Reported by: Simone Bosotti <simone1.bosotti@…> Owned by: Andreas Heuermann
Priority: high Milestone:
Component: Run-time Version: 1.18.0-dev
Keywords: Cc:

Description

My "Network" simulation, in the output tab, gives these errors:

Error solving nonlinear system 6721 at time 40.5601
nonlinear system 6721 fails: at t=40.5601
Fatal error on line 1118, column 11: tag mismatch

but the process Network.exe is not stopped after this, it continues running. This is not so clear in the interface (see screenshot) because the progress bar remains stuck at 0%.

OMEdit allows to start a new try of the simulation (so it does not recognize that the process is still running), but the compilation an raises an error because it can't overwrite Network.exe.

C:\PROGRA~1\OPENMO~2.0-D\tools\msys\mingw64\bin\ld: cannot open output file Network.exe: Permission denied
clang: error: unable to remove file: Permission denied
clang: error: linker command failed with exit code 1 (use -v to see invocation)
mingw32-make: *** [Network.makefile:39: omc_main_target] Error 1
mingw32-make: Leaving directory 'C:/Users/simon/AppData/Local/Temp/OPENMO~1/OMEdit/Net~1.NET'
Compilation process failed. Exited with code 2.

OpenModelica v1.18.0-dev-132-gc048acb712 (64-bit)
Windows 10 64-bit

Attachments (1)

Immagine.png (11.3 KB ) - added by Simone Bosotti <simone1.bosotti@…> 4 years ago.

Download all attachments as: .zip

Change History (26)

by Simone Bosotti <simone1.bosotti@…>, 4 years ago

Attachment: Immagine.png added

in reply to:  description comment:1 by Francesco Casella, 4 years ago

Milestone: NeedsInput1.18.0
Owner: changed from somebody to Francesco Casella
Status: newassigned

Replying to Simone Bosotti <simone1.bosotti@…>:

My "Network" simulation, in the output tab, gives these errors:

Error solving nonlinear system 6721 at time 40.5601
nonlinear system 6721 fails: at t=40.5601
Fatal error on line 1118, column 11: tag mismatch

but the process Network.exe is not stopped after this, it continues running. This is not so clear in the interface (see screenshot) because the progress bar remains stuck at 0%.

This is strange. If the simulation has already reached time = 40.560, why is the process bar still at 0%? What is the final time? Can you set it to, say 50, and see what happens?

I'll assign this to somebody when the root cause becomes clearer.

OMEdit allows to start a new try of the simulation (so it does not recognize that the process is still running), but the compilation an raises an error because it can't overwrite Network.exe.

This is a known issue, see #5974. We should eventually take care of that.

comment:2 by Simone Bosotti <simone1.bosotti@…>, 4 years ago

It's stuck at 0% because the final time is around 7000 seconds.
Setting it to 50, the simulation is stuck at 68%.

comment:3 by Francesco Casella, 4 years ago

Owner: changed from Francesco Casella to Andreas Heuermann

@AnHeuermann, this tag mismatch error looks familiar, you fixed something similar a recently. Any idea what this could be?

@simone, try selecting "kinsol" as nonlinear solver in the simulation options.

comment:4 by Simone Bosotti <simone1.bosotti@…>, 4 years ago

I tried the same model with Kinsol and, when the simulation fails, Network.exe is stopped correctly.

Just to clarify: the model that I run has some problems. I opened the ticket because, if I forget to manually kill the simulation, it continues running indefinitely, consuming an increasing amount of RAM.

I tried the today's build too (OpenModelica v1.18.0-dev-133-gabdf40bd71) and the behaviour is the same.

in reply to:  4 comment:5 by Francesco Casella, 4 years ago

Replying to Simone Bosotti <simone1.bosotti@…>:

I tried the same model with Kinsol and, when the simulation fails, Network.exe is stopped correctly.

OK, so we have a problem with the default solver. In the meantime, please use kinsol.

comment:6 by Andreas Heuermann, 4 years ago

@Simone Can you provide us with the test example?
If not can you try to run your model from command line and see if the simulation terminates after the error is thrown.
I'm not sure if it is a problem of the C runtime or with OMEdit and where the fatal error is thrown.

You can create a mos file test.mos that is similar to

loadModel(Modelica, {"3.2.3"}); getErrorString();
loadFile("pathToYourModelFile.mo"); getErrorString();
simulate(Network, simflags="-lv=LOG_STATS"); getErrorString();

and run it from CMD

>"C:\Program Files\OpenModelica1.18.0-dev-64bit\bin\omc.exe" test.mos
Last edited 4 years ago by Andreas Heuermann (previous) (diff)

comment:7 by Andreas Heuermann, 4 years ago

@Simone. And can you add the log flag LOG_NLS_V to a simulation run and add the output to this ticket?
You can add it in the Simulation Setup->Simulation Flags->Logging and choose LOG_NLS_V.

comment:8 by Andreas Heuermann, 4 years ago

It seems the error message

Error solving nonlinear system 6721 at time 40.5601

is from function void printErrorEqSyst(equationSystemError err, EQUATION_INFO eq, double time).

This function is used in the Newton and the Hybrid solver, but not in Kinsol. When the simulation is in the specific case for that error message the loop re-trying with newton should give up. I can say more when I have a working minimal example or see the logs.

in reply to:  6 comment:9 by Simone Bosotti <simone1.bosotti@…>, 4 years ago

Replying to AnHeuermann:

If not can you try to run your model from command line and see if the simulation terminates after the error is thrown.

In command line I see only

true
""
true
true
""

And the exe is not stopped. After manually killing the exe from task manager, the only warnings I get are:

"Warning: The model contains alias variables with redundant start and/or conflicting nominal values. It is recommended to resolve the conflicts, because otherwise the system could be hard to solve. To print the conflicting alias sets and the chosen candidates please use -d=aliasConflicts.
Warning: There are nonlinear iteration variables with default zero start attribute found in NLSJac150. For more information set -d=initialization. In OMEdit Tools->Options->Simulation->Show additional information from the initialization process, in OMNotebook call setCommandLineOptions(\"-d=initialization\").
Warning: There are nonlinear iteration variables with default zero start attribute found in NLSJac151. For more information set -d=initialization. In OMEdit Tools->Options->Simulation->Show additional information from the initialization process, in OMNotebook call setCommandLineOptions(\"-d=initialization\").
Warning: There are nonlinear iteration variables with default zero start attribute found in NLSJac152. For more information set -d=initialization. In OMEdit Tools->Options->Simulation->Show additional information from the initialization process, in OMNotebook call setCommandLineOptions(\"-d=initialization\").
Warning: There are nonlinear iteration variables with default zero start attribute found in NLSJac153. For more information set -d=initialization. In OMEdit Tools->Options->Simulation->Show additional information from the initialization process, in OMNotebook call setCommandLineOptions(\"-d=initialization\").
Warning: There are nonlinear iteration variables with default zero start attribute found in NLSJac154. For more information set -d=initialization. In OMEdit Tools->Options->Simulation->Show additional information from the initialization process, in OMNotebook call setCommandLineOptions(\"-d=initialization\").
Warning: There are nonlinear iteration variables with default zero start attribute found in NLSJac190. For more information set -d=initialization. In OMEdit Tools->Options->Simulation->Show additional information from the initialization process, in OMNotebook call setCommandLineOptions(\"-d=initialization\").
Warning: The initial conditions are not fully specified. For more information set -d=initialization. In OMEdit Tools->Options->Simulation->Show additional information from the initialization process, in OMNotebook call setCommandLineOptions(\"-d=initialization\").
"

It generated also a 250 MB log file in about 2 minutes of executions. It contains the same lines repeated thousands of times:

assert            | warning | The following assertion has been violated at time 35.860495
assert            | debug   | Model error: Argument of sqrt(station.TC4.compressor.b ^ 2.0 - 4.0 * station.TC4.compressor.a * station.TC4.compressor.c) was -0.0222076 should be >= 0

-0.0222076 is not always the same, but it assumes only a few different values.

And can you add the log flag LOG_NLS_V to a simulation run and add the output to this ticket?

The log is huge (120+ MB) and I don't know if I can share it.

comment:10 by Andreas Heuermann, 4 years ago

Yes, the log files are quite huge for decent models with LOG_NLS_V enabled.

It seems that the simulation is in some endless-loop. An assert is triggered, that is caught somewhere and after a few number of re-tries the simulation should stop. Apparently that is not the case.

I have some guesses where this happens but am not sure, so I would need some example to test that.

Can you search for

Error solving nonlinear system

and post what ever comes directly after that message? I'm that something like

### No Solution! ###
 after XXX restarts"

is in the logs. I would need the next few lines to see in which case the simulation is.

in reply to:  10 comment:11 by Simone Bosotti <simone1.bosotti@…>, 4 years ago

Can you search for

Error solving nonlinear system

and post what ever comes directly after that message? I'm that something like

### No Solution! ###
 after XXX restarts"

is in the logs. I would need the next few lines to see in which case the simulation is.

Sure. Here it is:

Error solving nonlinear system 17482 at time 35.8605
### No Solution! ###
 after 144 restarts
nls status
variables
[1] station.TC3.compressor.w  = 0.00000000000000000000e+00
 - scaling factor internal = 1.0002976983495444e+00
 - scaling factor external = 1.0000000000000000e+01
[2] station.TC5.compressor.w  = 0.00000000000000000000e+00
 - scaling factor internal = 1.0002976983484608e+00
 - scaling factor external = 1.0000000000000000e+01
[3] station.w  = -8.35545943060227273236e+02
 - scaling factor internal = 3.2817675810719258e+01
 - scaling factor external = 8.3554594306022727e+02
[4] station.TC3.compressor.QStar  = 0.00000000000000000000e+00
 - scaling factor internal = 2.7777777778024495e+00
 - scaling factor external = 1.0000000000000000e+00
[5] station.TC5.compressor.QStar  = 0.00000000000000000000e+00
 - scaling factor internal = 2.7777777778024495e+00
 - scaling factor external = 1.0000000000000000e+00
[6] station.TC5.compressor.NStar  = 1.00000000000000022204e+00
 - scaling factor internal = 6.5000000011373131e+03
 - scaling factor external = 1.0000000000000002e+00
[7] station.TC4.compressor.QStar  = -7.33961231823885107417e+00
 - scaling factor internal = 1.3474331859763883e+04
 - scaling factor external = 7.3396123182388511e+00
[8] station.TC4.compressor.NStar  = -1.43294356283598456336e+00
 - scaling factor internal = 7.1284539836064665e+04
 - scaling factor external = 1.4329435628359846e+00
[9] station.TC3.compressor.NStar  = 1.00000000000000022204e+00
 - scaling factor internal = 6.5000000000399677e+03
 - scaling factor external = 1.0000000000000002e+00
[10] station.TC3.NStar  = 1.00000000000000000000e+00
 - scaling factor internal = 1.0000000000000000e+00
 - scaling factor external = 1.0000000000000000e+00
[11] station.generalDistance  = 3.05191524646234704221e+01
 - scaling factor internal = 9.9999999988897770e-01
 - scaling factor external = 3.0519152464623470e+01
[12] station.TC4.NStar  = -1.43294356283598411927e+00
 - scaling factor internal = 1.0000000000000000e+00
 - scaling factor external = 1.4329435628359841e+00
[13] station.TC5.NStar  = 1.00000000000000000000e+00
 - scaling factor internal = 1.0000000000000000e+00
 - scaling factor external = 1.0000000000000000e+00
[14] station.s  = 6.37957568387626849926e-01
 - scaling factor internal = 6.4999999996034103e+06
 - scaling factor external = 1.0000000000000000e+00
[15] station.r  = -4.97265043890755364941e+00
 - scaling factor internal = 2.4999999993005596e+02
 - scaling factor external = 4.9726504389075536e+00
functions
res[1] = 2.92978726433696677235e-17 [scaling factor = 0.0000000000000000e+00]
res[2] = 4.77484718430787324905e-12 [scaling factor = 0.0000000000000000e+00]
res[3] = -2.51457095146179199219e-08 [scaling factor = 0.0000000000000000e+00]
res[4] = 1.56490872287542467836e-15 [scaling factor = 0.0000000000000000e+00]
res[5] = 1.38780075989300090413e-17 [scaling factor = 0.0000000000000000e+00]
res[6] = 0.00000000000000000000e+00 [scaling factor = 0.0000000000000000e+00]
res[7] = 0.00000000000000000000e+00 [scaling factor = 0.0000000000000000e+00]
res[8] = -7.10542735760100185871e-14 [scaling factor = 0.0000000000000000e+00]
res[9] = 2.66970347102387640916e+04 [scaling factor = 0.0000000000000000e+00]
res[10] = -3.55271367880050092936e-15 [scaling factor = 0.0000000000000000e+00]
res[11] = 3.27231844468870320952e+01 [scaling factor = 0.0000000000000000e+00]
res[12] = -7.26431608200073242188e-08 [scaling factor = 0.0000000000000000e+00]
res[13] = 1.51847372163710067374e-15 [scaling factor = 0.0000000000000000e+00]
res[14] = 0.00000000000000000000e+00 [scaling factor = 0.0000000000000000e+00]
res[15] = 0.00000000000000000000e+00 [scaling factor = 0.0000000000000000e+00]
statistics
nfunc = 51
error = 1.00000000000000000000e+00
error_scaled = 1.00000000000000000000e+00
Solution status: FAILED
 number of iterations           : 1086
 number of function evaluations : 17524
 number of jacobian evaluations : 1065
solution values:
[ 1] station.TC3.compressor.w  =                0
[ 2] station.TC5.compressor.w  =                0
[ 3]               station.w  =       -835.54594
[ 4] station.TC3.compressor.QStar  =                0
[ 5] station.TC5.compressor.QStar  =                0
[ 6] station.TC5.compressor.NStar  =                1
[ 7] station.TC4.compressor.QStar  =       -7.3396123
[ 8] station.TC4.compressor.NStar  =       -1.4329436
[ 9] station.TC3.compressor.NStar  =                1
[10]       station.TC3.NStar  =                1
[11] station.generalDistance  =        30.519152
[12]       station.TC4.NStar  =       -1.4329436
[13]       station.TC5.NStar  =                1
[14]               station.s  =       0.63795757
[15]               station.r  =       -4.9726504
nonlinear system 17482 fails: at t=35.8605
[1] Real station.TC3.compressor.w(start=0, nominal=10)
[2] Real station.TC5.compressor.w(start=0, nominal=10)
[3] Real station.w(start=0, nominal=10)
[4] Real station.TC3.compressor.QStar(start=0, nominal=1)
[5] Real station.TC5.compressor.QStar(start=0, nominal=1)
[6] Real station.TC5.compressor.NStar(start=1, nominal=1)
[7] Real station.TC4.compressor.QStar(start=0, nominal=1)
[8] Real station.TC4.compressor.NStar(start=1, nominal=1)
[9] Real station.TC3.compressor.NStar(start=1, nominal=1)
[10] Real station.TC3.NStar(start=1, nominal=1)
[11] Real station.generalDistance(start=0, nominal=1)
[12] Real station.TC4.NStar(start=1, nominal=1)
[13] Real station.TC5.NStar(start=1, nominal=1)
[14] Real station.s(start=0, nominal=1)
[15] Real station.r(start=0, nominal=1)

comment:12 by Andreas Heuermann, 4 years ago

Wow, we have a while loop spanning over 600 lines of code for solveHybrd(). That's not very nice.
We need to re-do the nonlinear hybrid solver. It has like 6 different counters for number of retries and try and catch blocks all over the place.

I added a break after the error message. But I'm not 100% sure that this enough.
We are using the function solveHybrd in the nonlinear homotopy solver as well. And there it is not clear to me what happens with the return value.
The homotopy solver has a similar issue: Giant unclear while loops.

See PR https://github.com/OpenModelica/OpenModelica/pull/7288

comment:13 by Andreas Heuermann, 4 years ago

PR 7288 is now merged. The next nightly build (tomorrow) will have this change.
@Simone Can you try with one of the next nightly builds and check if that solves your issue?

in reply to:  13 comment:14 by Simone Bosotti <simone1.bosotti@…>, 4 years ago

Replying to AnHeuermann:

PR 7288 is now merged. The next nightly build (tomorrow) will have this change.
@Simone Can you try with one of the next nightly builds and check if that solves your issue?

I'm sorry, but nothing changed. The execution of the exe continues indefinitely.
This is the new output of the LOG_NLS_V.

Error solving nonlinear system 17518 at time 35.8605
### No Solution! ###
 after 144 restarts
nls status
variables
[1] station.TC3.compressor.w  = 0.00000000000000000000e+00
 - scaling factor internal = 1.0002976983484608e+00
 - scaling factor external = 1.0000000000000000e+01
[2] station.TC5.compressor.w  = 0.00000000000000000000e+00
 - scaling factor internal = 1.0002976983484608e+00
 - scaling factor external = 1.0000000000000000e+01
[3] station.w  = -8.35545940552257093259e+02
 - scaling factor internal = 3.2817675810719258e+01
 - scaling factor external = 8.3554594055225709e+02
[4] station.TC3.compressor.QStar  = 0.00000000000000000000e+00
 - scaling factor internal = 2.7777777778024495e+00
 - scaling factor external = 1.0000000000000000e+00
[5] station.TC5.compressor.QStar  = 0.00000000000000000000e+00
 - scaling factor internal = 2.7777777778024495e+00
 - scaling factor external = 1.0000000000000000e+00
[6] station.TC5.compressor.NStar  = 1.00000000000000022204e+00
 - scaling factor internal = 6.5000000011373131e+03
 - scaling factor external = 1.0000000000000002e+00
[7] station.TC4.compressor.QStar  = -7.33961229611659415895e+00
 - scaling factor internal = 1.3474331845211967e+04
 - scaling factor external = 7.3396122961165942e+00
[8] station.TC4.compressor.NStar  = -1.43294355807376572187e+00
 - scaling factor internal = 7.1284539836064665e+04
 - scaling factor external = 1.4329435580737657e+00
[9] station.TC3.compressor.NStar  = 1.00000000000000022204e+00
 - scaling factor internal = 6.5000000000399677e+03
 - scaling factor external = 1.0000000000000002e+00
[10] station.TC3.NStar  = 1.00000000000000000000e+00
 - scaling factor internal = 1.0000000000000000e+00
 - scaling factor external = 1.0000000000000000e+00
[11] station.generalDistance  = 3.05191522650963378283e+01
 - scaling factor internal = 9.9999999988897770e-01
 - scaling factor external = 3.0519152265096338e+01
[12] station.TC4.NStar  = -1.43294355807377704615e+00
 - scaling factor internal = 1.0000000000000000e+00
 - scaling factor external = 1.4329435580737770e+00
[13] station.TC5.NStar  = 1.00000000000000000000e+00
 - scaling factor internal = 1.0000000000000000e+00
 - scaling factor external = 1.0000000000000000e+00
[14] station.s  = 6.37957568377593875475e-01
 - scaling factor internal = 6.4999999996034103e+06
 - scaling factor external = 1.0000000000000000e+00
[15] station.r  = -4.97265042887570807295e+00
 - scaling factor internal = 2.4999999993005596e+02
 - scaling factor external = 4.9726504288757081e+00
functions
res[1] = 4.93434652508476323902e-17 [scaling factor = 0.0000000000000000e+00]
res[2] = 4.88853402202948927879e-12 [scaling factor = 0.0000000000000000e+00]
res[3] = -2.51457095146179199219e-08 [scaling factor = 0.0000000000000000e+00]
res[4] = 4.62596133086742261756e-17 [scaling factor = 0.0000000000000000e+00]
res[5] = -4.63451217237824686130e-16 [scaling factor = 0.0000000000000000e+00]
res[6] = 0.00000000000000000000e+00 [scaling factor = 0.0000000000000000e+00]
res[7] = 0.00000000000000000000e+00 [scaling factor = 0.0000000000000000e+00]
res[8] = 2.66970344779274091707e+04 [scaling factor = 0.0000000000000000e+00]
res[9] = 1.42108547152020037174e-14 [scaling factor = 0.0000000000000000e+00]
res[10] = -8.88178419700125232339e-16 [scaling factor = 0.0000000000000000e+00]
res[11] = 3.27231841520902904108e+01 [scaling factor = 0.0000000000000000e+00]
res[12] = -7.26431608200073242188e-08 [scaling factor = 0.0000000000000000e+00]
res[13] = -3.04479229760059189515e-16 [scaling factor = 0.0000000000000000e+00]
res[14] = 0.00000000000000000000e+00 [scaling factor = 0.0000000000000000e+00]
res[15] = 0.00000000000000000000e+00 [scaling factor = 0.0000000000000000e+00]
statistics
nfunc = 51
error = 1.00000000000000000000e+00
error_scaled = 1.00000000000000000000e+00
Solution status: FAILED
 number of iterations           : 1086
 number of function evaluations : 17524
 number of jacobian evaluations : 1065
solution values:
[ 1] station.TC3.compressor.w  =                0
[ 2] station.TC5.compressor.w  =                0
[ 3]               station.w  =       -835.54594
[ 4] station.TC3.compressor.QStar  =                0
[ 5] station.TC5.compressor.QStar  =                0
[ 6] station.TC5.compressor.NStar  =                1
[ 7] station.TC4.compressor.QStar  =       -7.3396123
[ 8] station.TC4.compressor.NStar  =       -1.4329436
[ 9] station.TC3.compressor.NStar  =                1
[10]       station.TC3.NStar  =                1
[11] station.generalDistance  =        30.519152
[12]       station.TC4.NStar  =       -1.4329436
[13]       station.TC5.NStar  =                1
[14]               station.s  =       0.63795757
[15]               station.r  =       -4.9726504
nonlinear system 17518 fails: at t=35.8605
[1] Real station.TC3.compressor.w(start=0, nominal=10)
[2] Real station.TC5.compressor.w(start=0, nominal=10)
[3] Real station.w(start=0, nominal=10)
[4] Real station.TC3.compressor.QStar(start=0, nominal=1)
[5] Real station.TC5.compressor.QStar(start=0, nominal=1)
[6] Real station.TC5.compressor.NStar(start=1, nominal=1)
[7] Real station.TC4.compressor.QStar(start=0, nominal=1)
[8] Real station.TC4.compressor.NStar(start=1, nominal=1)
[9] Real station.TC3.compressor.NStar(start=1, nominal=1)
[10] Real station.TC3.NStar(start=1, nominal=1)
[11] Real station.generalDistance(start=0, nominal=1)
[12] Real station.TC4.NStar(start=1, nominal=1)
[13] Real station.TC5.NStar(start=1, nominal=1)
[14] Real station.s(start=0, nominal=1)
[15] Real station.r(start=0, nominal=1)

Version: OpenModelica v1.18.0-dev-146-gf50e95b736 (64-bit)

comment:15 by Andreas Heuermann, 4 years ago

Okay. You are using the correct version so I missed something.

comment:16 by Andreas Heuermann, 4 years ago

Component: *unknown*Run-time

comment:17 by Andreas Heuermann, 4 years ago

I'm afraid I need to have an example to fix the error. I don't see why it won't stop and need to jump through it with a debugger.

@Simone Is it possible for you to share the example with us?
If it can't be made public you could send it to us via mail (zip file with password or gpg encrypted with our public OpenModelica key).

in reply to:  17 comment:18 by Simone Bosotti <simone1.bosotti@…>, 4 years ago

Replying to AnHeuermann:

@Simone Is it possible for you to share the example with us?
If it can't be made public you could send it to us via mail (zip file with password or gpg encrypted with our public OpenModelica key).

Ok, I've asked and I can send the file via email.
Since I don't need encryption, it's enough to send it as attachment to openmodelica@ ida.liu.se ?

comment:19 by Adrian Pop, 4 years ago

Yes. That would be fine.

comment:20 by Andreas Heuermann, 4 years ago

@Simone I wrote you an E-Mail last week. I'll need an additional library to run your example to solve this ticket.

in reply to:  20 comment:21 by Simone Bosotti <simone1.bosotti@…>, 4 years ago

Replying to AnHeuermann:

@Simone I wrote you an E-Mail last week. I'll need an additional library to run your example to solve this ticket.

I replied to that one week ago. I retried now, let me know if it's ok.

comment:22 by Andreas Heuermann, 4 years ago

So now it's the brute force approach. Just throw an error to stop the simulation.
See https://github.com/OpenModelica/OpenModelica/pull/7321

comment:23 by Andreas Heuermann, 4 years ago

Problem is in DASSL and not the hybrid solver. I'll need to come up with a better solution.

comment:24 by Andreas Heuermann, 4 years ago

Another issue that popped up during debugging this: #6436.

comment:25 by Francesco Casella, 3 years ago

Milestone: 1.18.0

Ticket retargeted after milestone closed

Note: See TracTickets for help on using tickets.