Opened 6 years ago

Closed 4 years ago

Last modified 3 years ago

#5485 closed defect (fixed)

Large memory leak in kinsol nonlinear solver

Reported by: RSchiemann Owned by: Lennart Ochel
Priority: blocker Milestone: 1.18.0
Component: Run-time Version: v1.14.0-dev-nightly
Keywords: Cc: Karim Adbdelhak, Andreas Heuermann

Description (last modified by Martin Sjölund)

When running a simulation of a "large" model (8000 equations, 4000 of which are trivial) with -nls=kinsol, the memory allocated by the simulation process grows very large. On a machine with 16 GB of RAM, the machine runs out of memory if the simulation runs long enough. I observed memory accumulation rates of 40 MB/s, which is a lot.

I use the following setup:
OMEdit v1.14.0-dev-26461-gf6f5727072 (64-bit)
Connected to v1.14.0-dev-26461-gf6f5727072 (64-bit)
Connected to OMSimulator v2.1.0-dev-103-g9e04b31-mingw

I don't use the new NF.

The simulation is called with the following arguments (I removed unnecessary paths and replaced them by [...]):

model.exe -port=59221 -logFormat=xmltcp -override=startTime=0,stopTime=5000,stepSize=500,tolerance=0.0001,solver=dassl,outputFormat=mat,variableFilter=.* -r=[...]model_res.mat -jacobian=symbolical -nls=kinsol -w -lv=LOG_DASSL,LOG_STATS,LOG_STATS_V -inputPath=[...]/AppData/Local/Temp/OpenModelica/OMEdit/Outotec.PlantModels.PelletPlant14 -outputPath=[...]/AppData/Local/Temp/OpenModelica/OMEdit/Outotec.PlantModels.PelletPlant14

I see great similarity to this old ticket:
https://trac.openmodelica.org/OpenModelica/ticket/3646

When I use the default solver (which is much, much slower than kinsol for my model), memory usage remains practically constant.

Attachments (2)

TestKinsol.mo (1.4 KB) - added by Francesco Casella 5 years ago.
log1.txt (75.6 KB) - added by Francesco Casella 5 years ago.


Change History (28)

comment:1 by Martin Sjölund, 6 years ago

Description: modified (diff)

comment:2 by Martin Sjölund, 6 years ago

Description: modified (diff)

comment:3 by Martin Sjölund, 6 years ago

I did resolve a few leaks that were in our code, but it seems that KINSol will never free or reuse memory if you call KINSol multiple times...

==21704== 7,416,120 (4,514,160 direct, 2,901,960 indirect) bytes in 26,870 blocks are definitely lost in loss record 247 of 247
==21704==    at 0x4C2FB0F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21704==    by 0x844F3BA: klu_malloc (in /home/marsj/OpenModelica/build/lib/x86_64-linux-gnu/omc/libklu.so)
==21704==    by 0x844D9FE: klu_factor (in /home/marsj/OpenModelica/build/lib/x86_64-linux-gnu/omc/libklu.so)
==21704==    by 0x76341EB: kinKLUSetup (in /home/marsj/OpenModelica/build/lib/x86_64-linux-gnu/omc/libsundials_kinsol.so.1)
==21704==    by 0x762EC62: KINLinSolDrv (in /home/marsj/OpenModelica/build/lib/x86_64-linux-gnu/omc/libsundials_kinsol.so.1)
==21704==    by 0x762D3C8: KINSol (in /home/marsj/OpenModelica/build/lib/x86_64-linux-gnu/omc/libsundials_kinsol.so.1)
==21704==    by 0x50A4BA7: nlsKinsolSolve (in /home/marsj/OpenModelica/build/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)

I'll see if we could reset it on each iteration or every 1000 or something.

comment:4 by Martin Sjölund, 6 years ago

It's even worse than that. Despite the fact that the official FAQ says you can call KINSol again, it leaks memory if you do so. If you don't reinit the solver every timestep, it leaks.

https://github.com/OpenModelica/OpenModelica/pull/193 should fix the memory leak, but it decreased the overall performance of simulations using the KINSol solver by around 25% for me (simple non-linear system).

comment:5 by Martin Sjölund, 6 years ago

Actually, the performance degradation is even larger for actual tests (20-minute timeout was hit on the testsuite...)

comment:6 by Martin Sjölund, 6 years ago

Owner: changed from somebody to Lennart Ochel
Status: changed from new to assigned

in reply to:  4 comment:7 by Francesco Casella, 6 years ago

Replying to sjoelund.se:

It's even worse than that. Despite the fact that the official FAQ says you can call KINSol again, it leaks memory if you do so. If you don't reinit the solver every timestep, it leaks.

This looks quite weird. Are we sure this is not due to the way the runtime calls it?

https://github.com/OpenModelica/OpenModelica/pull/193 should fix the memory leak, but it decreased the overall performance of simulations using the KINSol solver by around 25% for me (simple non-linear system).

I'm not sure that reinitializing the solver each time is a good solution. I think we should further investigate why Kinsol leaks memory if called multiple times.

comment:8 by RSchiemann, 5 years ago

Has a potential fix for the problem already been identified? Do I need to supply additional information?

in reply to:  8 ; comment:9 by Francesco Casella, 5 years ago

Replying to RSchiemann:

Has a potential fix for the problem already been identified? Do I need to supply additional information?

Any further information would be welcome. Even better would be a test case that reproduces the issue.

in reply to:  9 comment:10 by RSchiemann, 5 years ago

Replying to casella:

Any further information would be welcome. Even better would be a test case that reproduces the issue.

I checked the possibility but I cannot share the model in this case. Also I think that reproducing the issue was already successful (see earlier comments by sjoelund.se). I am not sure what else I can supply.

comment:11 by Francesco Casella, 5 years ago

@RSchiemann, did you notice any improvement after PR #193? This was merged in master 25 days ago, you can test it with any recent nightly build.

I understand your system has a large sparse algebraic loop in the regular equations, so Kinsol is called at each time step, but eventually you are using an ODE solver. Have you tried simulating your system with -daeMode? This would avoid the causalization, and use IDA as a native DAE solver. Internally, it will still use Kinsol and KLU, but not through our interface.

in reply to:  11 ; comment:12 by RSchiemann, 5 years ago

I didn't realize this had already been fixed days ago; the status of this ticket says otherwise, after all.
I will try your suggestion soon and will report back. It will take at least two weeks though.

in reply to:  12 comment:13 by Francesco Casella, 5 years ago

Replying to RSchiemann:

I didn't realize this had already been fixed days ago; the status of this ticket says otherwise, after all.

As a rule, we leave it to the person that opened the ticket, because he/she is the most qualified person to assess whether the fix is good or not. Unless there is a test case attached, in which case we can test it and close the ticket outright if the problem is solved.

I will try your suggestion soon and will report back. It will take at least two weeks though.

We're not in a hurry :)

comment:14 by Martin Sjölund, 5 years ago

It is not fixed. I removed some memory leaks, but the only way to prevent KINSol from leaking memory is to reset KINSol after every single solution has been calculated. But this means it is incredibly slow and basically unusable. (Note that trying to free every 1000 or so steps to mitigate this does not work; KINSol only frees the latest allocation.)

https://github.com/OpenModelica/OpenModelica/pull/193/files#diff-4ca9f0c49eb33e9be816735ce0dda176R1007 still skips this free after every step for performance, leaking memory. I guess we could have a flag that re-enables it, but performance drops close to 100x for some systems...
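
To illustrate why freeing only every 1000 or so steps does not help, here is a toy model in Python. It is not the KINSOL API: `ToySolver`, `solve`, `free` and `leaked_blocks` are hypothetical names, and the model simply assumes the behaviour described above (one allocation per solve call, and a free that can only release the latest allocation).

```python
class ToySolver:
    """Toy stand-in for a solver that allocates a workspace on every solve call.

    NOT the KINSOL API; it only models the behaviour described in this comment.
    """

    def __init__(self):
        self.current = None    # handle to the most recent workspace
        self.live_blocks = 0   # blocks allocated and not yet released

    def solve(self):
        # Each call allocates a new workspace and overwrites the old handle,
        # so the previous block can no longer be released.
        self.current = object()
        self.live_blocks += 1

    def free(self):
        # Only the block still referenced by the handle can be released.
        if self.current is not None:
            self.current = None
            self.live_blocks -= 1


def leaked_blocks(steps, free_every):
    """Count the blocks still live after `steps` solves, freeing every `free_every` steps."""
    solver = ToySolver()
    for step in range(1, steps + 1):
        solver.solve()
        if step % free_every == 0:
            solver.free()
    return solver.live_blocks


print(leaked_blocks(10_000, 1))      # free after every solve
print(leaked_blocks(10_000, 1_000))  # free every 1000 solves
```

In this model, freeing after every solve leaves nothing behind, while freeing every 1000 solves recovers only 10 of the 10,000 blocks.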

comment:15 by RSchiemann, 5 years ago

Despite possible workarounds discussed above, is the outcome of analyses so far that the issue of leaking memory is on KINSol side?

comment:16 by RSchiemann, 5 years ago

A quick google search revealed other people having this (or a similar) problem with KINSol as well and even a possible workaround:
http://sundials.2283335.n4.nabble.com/out-of-memory-in-calling-kinsol-in-loop-td4655601.html#a4655602

Is that approach already known? Would it be a suitable strategy for OM as well?

Sorry if I am only bringing to the table what's already obvious to the development team.

comment:17 by Martin Sjölund, 5 years ago

I verified that it is KINSol leaking memory by using valgrind. The above approach seems to be for the Fortran interface. I am unsure exactly what we do inside the loop, but as far as I could tell it was only the internal kinsol routine allocating memory once per call of KINSol. And from that link, it seems they only call KINSol once, which, as I said above, removes the memory leak. We need to call it multiple times to solve the same problem with different inputs, which is either slow or leaks memory.

comment:18 by Francesco Casella, 5 years ago

I checked the SUNDIALS mailing list for "Kinsol memory leak" and I basically found two references

The first has been mentioned in comment:16, and may be potentially relevant to our issue. The second one addresses the requirement of changing the linear solver between KINSol calls, which is not relevant for us.

I can't see any fundamental reason why it shouldn't be possible to change the inputs and call KINSol multiple times, if the problem and the low-level solvers are not changed, so I think we should definitely involve the developers and try to fix the problem. The brute-force approach of reinitializing the solver at each call is not acceptable.

I am willing to help with that, but I think we should have a clear statement of our issue, which is still a bit fuzzy.

@sjoelund.se,

  • what model did you use to obtain the result reported in comment:3?
  • can I get the report by running valgrind --leak-check=yes omc test.mos?
  • is this the loop that may be causing the memory leak?
Last edited 5 years ago by Francesco Casella (previous) (diff)

comment:19 by RSchiemann, 5 years ago

I tried -daeMode (with ida as time stepper, kinsol as nonlinear solver and klu as linear solver), but a lot of errors occurred during translation. Most of them were errors about failed equation traversals, such as the following one:

[5] 13:21:14 Translation Error
[C:/dev/OM64bit/OMCompiler/Compiler/BackEnd/DAEMode.mo: 524:7-526:90]: Internal error DAEMode.traverserStrongComponents failed on equation:
1/1 (3): (Bed_WB2.bedPressureDrop.xiO2_port_a, Bed_WB2.bedPressureDrop.xiCO2_port_a, Bed_WB2.bedPressureDrop.xiH2O_port_a) = Functions.MassToMoleFractions(smooth(0, if (-Bed_WB2.bedPressureDrop.port_b.m_flow) > 0.0 then Bed_WB2.mid.port_top.xO2 else Bed_WB2.bedPressureDrop.port_a.xO2), smooth(0, if (-Bed_WB2.bedPressureDrop.port_b.m_flow) > 0.0 then Bed_WB2.mid.port_top.xCO2 else Bed_WB2.bedPressureDrop.port_a.xCO2), smooth(0, if (-Bed_WB2.bedPressureDrop.port_b.m_flow) > 0.0 then Bed_WB2.mid.port_top.xH2O else Bed_WB2.bedPressureDrop.port_a.xH2O)) Variables:
1: Bed_WB2.mid.port_top.xH2O:VARIABLE(min = 0.0 max = 1.0 start = 0.02 unit = "1" )  type: Real 
2: valve2.Vp:VARIABLE(start = 52.0 unit = "m3/s" )  type: Real 
3: valve2.dp:VARIABLE(start = 5690.0 unit = "Pa" )  type: Real  

The other ones were the following:

[7] 13:23:17 Translation Error
[C:/dev/OM64bit/OMCompiler/Compiler/BackEnd/BackendDAEUtil.mo: 9112:5-9113:77]: Internal error BackendDAEUtil.traverseEqSystemStrongComponents failed with function:
[unknown function name]


[8] 13:23:17 Translation Error
post-optimization module createDAEmodeBDAE (simulation) failed.

I am currently trying with the latest nightly build from tonight. Simulating without daeMode is currently not working for my model.

comment:20 by Francesco Casella, 5 years ago

I dug out a scalable nonlinear algebraic model that I wrote for an EOOLT paper and prepared a nice test for this issue, please find it attached. In fact, I should put it in the ScalableTestSuite as soon as I have some spare time to do so.

I ran TestKinsol.Test100 on my laptop (an Intel i7-8550 with 16 GB RAM running Windows 10 64-bit). It ran for 22 seconds and, according to the process monitor, leaked about 2.7 GB of memory. There are smaller and larger tests in the package if one wants to experiment a bit more.
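
For a rough sense of scale, the figures above can be turned into a leak rate and a time to exhaust memory. This is only back-of-the-envelope arithmetic: it assumes decimal units (1 GB = 1000 MB), linear growth, and that all 16 GB are available to the process; the function names are made up for this sketch.

```python
def leak_rate_mb_per_s(leaked_gb, seconds):
    """Average leak rate in MB/s, assuming 1 GB = 1000 MB."""
    return leaked_gb * 1000.0 / seconds


def seconds_to_exhaust(ram_gb, rate_mb_per_s):
    """Time until `ram_gb` is used up at a constant leak rate (idealised)."""
    return ram_gb * 1000.0 / rate_mb_per_s


# Test100: about 2.7 GB leaked in 22 s.
rate = leak_rate_mb_per_s(2.7, 22.0)
print(f"Test100 leak rate: {rate:.0f} MB/s")
print(f"16 GB at that rate: {seconds_to_exhaust(16, rate):.0f} s")

# Rate reported in the ticket description: 40 MB/s.
print(f"16 GB at 40 MB/s: {seconds_to_exhaust(16, 40.0):.0f} s")
```

Under these assumptions the test leaks roughly three times faster than the 40 MB/s reported in the description.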

by Francesco Casella, 5 years ago

Attachment: TestKinsol.mo added

comment:21 by Francesco Casella, 5 years ago

I compiled TestKinsol.Test10 (in fact, a version with Interval=1e-3, otherwise valgrind would take forever to run), and then ran valgrind --leak-check=yes ./TestKinsol.Test10. I report an excerpt of the output (attached):

==21869== 43,147 (216 direct, 42,931 indirect) bytes in 3 blocks are definitely lost in loss record 122 of 127
==21869==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21869==    by 0x4ED3BA7: ??? (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x68E168F: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==21869==    by 0x68E239B: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==21869==    by 0x68E671A: XML_ParseBuffer (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==21869==    by 0x4ECF5C8: read_input_xml (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4EC1414: initRuntimeAndSimulation (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4EC1DBA: _main_SimulationRuntime (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x406834: main (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==21869== 
==21869== 76,361 (72 direct, 76,289 indirect) bytes in 1 blocks are definitely lost in loss record 125 of 127
==21869==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21869==    by 0x4ED3BA7: ??? (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x68E168F: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==21869==    by 0x68E239B: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==21869==    by 0x68E3CBD: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==21869==    by 0x68E4414: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==21869==    by 0x68E671A: XML_ParseBuffer (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==21869==    by 0x4ECF5C8: read_input_xml (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4EC1414: initRuntimeAndSimulation (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4EC1DBA: _main_SimulationRuntime (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x406834: main (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==21869== 
==21869== 184,000 (24,000 direct, 160,000 indirect) bytes in 1,000 blocks are definitely lost in loss record 127 of 127
==21869==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==21869==    by 0x4E8767F: createValueElement (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4E862DE: updateInitialGuessDB (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4E86F21: solve_nonlinear_system (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x40620E: TestKinsol_Test10_eqFunction_94 (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==21869==    by 0x408B49: TestKinsol_Test10_functionAlgebraics (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==21869==    by 0x4064AF: TestKinsol_Test10_updateContinuousSystem (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==21869==    by 0x404B55: TestKinsol_Test10_performSimulation (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==21869==    by 0x4E9B0C8: solver_main (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4EC0580: ??? (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4EBF94B: startNonInteractiveSimulation (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==21869==    by 0x4EC1DE0: _main_SimulationRuntime (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)

The first two losses should probably also be investigated, but as they only happen at initialization, they are probably not a big deal.

The third one instead is the important one, because it will grow if we go from 1000 steps (as in this test) to longer simulations, as well as to larger systems of equations.

I don't see any leak in KINSOL function calls, apparently the problem is with the C runtime, function solve_nonlinear_system.
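
For comparing runs, the "definitely lost" summary lines can be pulled out of a valgrind log with a small script. This is a minimal sketch: it only matches the summary format shown in the excerpt above (with the direct/indirect breakdown), and `parse_loss_records` is a hypothetical helper name, not part of valgrind or OpenModelica.

```python
import re

# Matches summary lines such as:
# ==21869== 184,000 (24,000 direct, 160,000 indirect) bytes in 1,000 blocks
#           are definitely lost in loss record 127 of 127
LOSS_RE = re.compile(
    r"==\d+==\s+([\d,]+) \(([\d,]+) direct, ([\d,]+) indirect\) bytes in "
    r"([\d,]+) blocks are definitely lost in loss record (\d+) of (\d+)"
)


def parse_loss_records(log_text):
    """Return one dict per 'definitely lost' summary line found in log_text."""
    records = []
    for match in LOSS_RE.finditer(log_text):
        total, direct, indirect, blocks = (
            int(group.replace(",", "")) for group in match.groups()[:4]
        )
        records.append({
            "record": int(match.group(5)),
            "total": total,
            "direct": direct,
            "indirect": indirect,
            "blocks": blocks,
        })
    return records


sample = """\
==21869== 43,147 (216 direct, 42,931 indirect) bytes in 3 blocks are definitely lost in loss record 122 of 127
==21869== 76,361 (72 direct, 76,289 indirect) bytes in 1 blocks are definitely lost in loss record 125 of 127
==21869== 184,000 (24,000 direct, 160,000 indirect) bytes in 1,000 blocks are definitely lost in loss record 127 of 127
"""

# Print the largest losses first.
for rec in sorted(parse_loss_records(sample), key=lambda r: r["total"], reverse=True):
    print(rec)
```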

by Francesco Casella, 5 years ago

Attachment: log1.txt added

comment:22 by Francesco Casella, 5 years ago

Cc: Karim Adbdelhak Andreas Heuermann added

I ran another experiment with half the Interval length, and the result is:

==22109== 43,147 (216 direct, 42,931 indirect) bytes in 3 blocks are definitely lost in loss record 122 of 127
==22109==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22109==    by 0x4ED3BA7: ??? (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x68E168F: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==22109==    by 0x68E239B: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==22109==    by 0x68E671A: XML_ParseBuffer (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==22109==    by 0x4ECF5C8: read_input_xml (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4EC1414: initRuntimeAndSimulation (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4EC1DBA: _main_SimulationRuntime (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x406834: main (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==22109== 
==22109== 76,361 (72 direct, 76,289 indirect) bytes in 1 blocks are definitely lost in loss record 125 of 127
==22109==    at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22109==    by 0x4ED3BA7: ??? (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x68E168F: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==22109==    by 0x68E239B: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==22109==    by 0x68E3CBD: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==22109==    by 0x68E4414: ??? (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==22109==    by 0x68E671A: XML_ParseBuffer (in /lib/x86_64-linux-gnu/libexpat.so.1.6.0)
==22109==    by 0x4ECF5C8: read_input_xml (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4EC1414: initRuntimeAndSimulation (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4EC1DBA: _main_SimulationRuntime (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x406834: main (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==22109== 
==22109== 368,000 (48,000 direct, 320,000 indirect) bytes in 2,000 blocks are definitely lost in loss record 127 of 127
==22109==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==22109==    by 0x4E8767F: createValueElement (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4E862DE: updateInitialGuessDB (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4E86F21: solve_nonlinear_system (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x40620E: TestKinsol_Test10_eqFunction_94 (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==22109==    by 0x408B49: TestKinsol_Test10_functionAlgebraics (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==22109==    by 0x4064AF: TestKinsol_Test10_updateContinuousSystem (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==22109==    by 0x404B55: TestKinsol_Test10_performSimulation (in /home/casella/test/TestKinsol/TestKinsol.Test10)
==22109==    by 0x4E9B0C8: solver_main (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4EC0580: ??? (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4EBF94B: startNonInteractiveSimulation (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109==    by 0x4EC1DE0: _main_SimulationRuntime (in /usr/lib/x86_64-linux-gnu/omc/libSimulationRuntimeC.so)
==22109== 

The first two are the same, as expected; the last one doubled, as expected.

Karim, Andreas, I guess you should have a look at the C runtime functions which are involved in the leak.

Note that in this case I am using the euler solver, so there is only one call to Kinsol for each time step. It is quite possible that if I use an implicit solver, there are other leaks.
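
The two valgrind records for loss record 127 can be checked for linear scaling with a few lines of Python. This is only a sketch: it assumes one leaked block per nonlinear solver call, as the block counts suggest, and `bytes_per_block` and `projected_leak_bytes` are names made up for this example. The per-block size applies to Test10 only and will differ for larger systems.

```python
# Loss record 127 from the two Test10 runs above.
runs = [
    {"blocks": 1_000, "direct": 24_000, "indirect": 160_000},
    {"blocks": 2_000, "direct": 48_000, "indirect": 320_000},
]


def bytes_per_block(run):
    """Total bytes lost (direct + indirect) per leaked block."""
    return (run["direct"] + run["indirect"]) / run["blocks"]


def projected_leak_bytes(n_calls, per_block):
    """Projected leak if every solver call loses one block of the same size."""
    return n_calls * per_block


for run in runs:
    print(run["blocks"], "blocks:", bytes_per_block(run), "bytes per block")

# Extrapolate to a hypothetical run with one million solver calls.
print(projected_leak_bytes(1_000_000, bytes_per_block(runs[0])), "bytes")
```

Both runs give the same per-block size, which is consistent with a fixed amount being lost on every call to solve_nonlinear_system.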

comment:23 by Francesco Casella, 5 years ago

I ran the same experiment with s="dassl" but I got the same leak.

I understand @sjoelund.se tried to "fix" the leak in PR 193, in a way that led to severe performance degradation, see comment:4 and comment:5. I don't think this is acceptable.

I would suggest that we

  • fix the remaining issue in the runtime
  • undo PR 193
  • clearly identify the issue and contact the developers of Sundials to sort that out in the Kinsol source code
Last edited 5 years ago by Francesco Casella (previous) (diff)

comment:24 by Francesco Casella, 5 years ago

Milestone: changed from 1.14.0 to 2.0.0
Priority: changed from high to blocker

We may not be able to sort this out for 1.14.0, but we definitely should in 2.0.0. Memory leaks of this magnitude that increase over simulation time are not acceptable in production-quality code.

comment:25 by Francesco Casella, 4 years ago

Milestone: changed from 2.0.0 to 1.18.0
Resolution: fixed
Status: changed from assigned to closed

Fixed in 75437148.

comment:26 by Francesco Casella, 3 years ago

Component: changed from *unknown* to Run-time