Opened 7 years ago

Last modified 6 years ago

#4755 new enhancement

Reduce memory usage when building models in OpenModelica

Reported by: casella Owned by: sjoelund.se
Priority: high Milestone: 2.1.0
Component: *unknown* Version:
Keywords: Cc: perost

Description

Please check the results of DistributionSystemModelica_N_112_M_112 in the new front end Hudson report. The following message appears many times:

GC Warning: Failed to expand heap by 16777216 bytes

then this final message is issued:

GC Warning: Out of Memory! Heap size: 3240 MiB. Returning NULL!

If I am not mistaken this model should have a quarter million equations, so it is large but not exceedingly so; it is the largest model so far in the testsuite. The N_80_M_80 model, which is half as big, clears the NF in four seconds with about 1.6 GB of memory allocation, which is non-trivial but still pretty manageable.

Can we configure the Hudson job to also support the N_112_M_112 model? All other times should be below 300 s, save the C compilation, which will however improve a lot as soon as Willi is finished with the new daeMode backend workflow.

Attachments (1)

execstat_with_explicit_gc.txt (4.7 KB) - added by sjoelund.se 7 years ago.
Ran execstat with GC for each logged event


Change History (20)

comment:1 follow-up: Changed 7 years ago by sjoelund.se

I limited models to 4GB due to OS crashes on ripper.openmodelica.org

comment:2 follow-up: Changed 7 years ago by casella

  • Cc perost added

OK, so the next question is: do we really need 3.2 GB to process 250,000 equations? That's about 12 kB per equation, which seems to be a lot, particularly with the NF, which is supposed to perform scalarization only at very late stages.

Of course the higher-priority task now is to increase coverage with the NF, but once that is achieved, we should also aim at reducing memory usage.

BTW, under Linux I see memory usage constantly increasing during the whole building process. Can't some older data structures be discarded as the process goes on, so that they can be reclaimed by the GC and re-used?

comment:3 in reply to: ↑ 2 ; follow-up: Changed 7 years ago by sjoelund.se

Replying to casella:

BTW, under Linux I see memory usage constantly increasing during the whole building process. Can't some older data structures be discarded as the process goes on, so that they can be reclaimed by the GC and re-used?

Data does get reclaimed by the GC, but on Linux the OS limits how many times a process can give memory back to it, so if you monitor the process, the total allocated memory can only increase (we disabled giving memory back to the OS in order to prevent OMEdit from being killed).

From the log, the process dies in BackendVarTransform.replaceEquations and at this stage none of the NF data structures will be present.

For size 80, after the NF: allocations: 0.5625 kB / 1.654 GB, free: 14.27 MB / 1.153 GB, which means a total of 1.65 GB of memory allocations including the parsing (which uses 300 MB). We could even disable GC for this. Most memory is used in connection handling and in converting to the old DAE structure.

clockPartitioning (simulation) allocates a lot of memory (and is this even a clocked system?)

Otherwise it seems to be the usual backend modules (removeSimpleEquations) and templates consuming memory.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 7 years ago by casella

Replying to sjoelund.se:

Data does get reclaimed by the GC, but on Linux the OS limits how many times a process can give memory back to it, so if you monitor the process, the total allocated memory can only increase (we disabled giving memory back to the OS in order to prevent OMEdit from being killed).

I'm aware of that, but of course the idea is that memory reclaimed by the GC is reused in later stages, so this shouldn't be much of a problem, right? At least, it isn't until N separate gcc processes are started while the omc process still holds all this memory allocated from the OS, but that's another story.

For size 80, after the NF: allocations: 0.5625 kB / 1.654 GB, free: 14.27 MB / 1.153 GB, which means a total of 1.65 GB of memory allocations including the parsing (which uses 300 MB). We could even disable GC for this. Most memory is used in connection handling and in converting to the old DAE structure.

For convenience, I report an excerpt of the N_80_M_80 log here:

Performance of loadModel(ScalableTestSuite): allocations: 317.8 MB / 324.1 MB, free: 11.78 MB / 265.1 MB
Performance of FrontEnd - loaded program: allocations: 0 / 403.2 MB, free: 12.47 MB / 345.1 MB
Performance of NFInst.instantiate: allocations: 78.11 MB / 481.3 MB, free: 4.543 MB / 393.1 MB
Performance of NFInst.instExpressions: allocations: 389.5 kB / 481.7 MB, free: 4.156 MB / 393.1 MB
Performance of NFTyping.typeComponents: allocations: 26.61 kB / 481.7 MB, free: 4.133 MB / 393.1 MB
Performance of NFTyping.typeBindings: allocations: 59.92 kB / 481.8 MB, free: 4.074 MB / 393.1 MB
Performance of NFTyping.typeSections: allocations: 39.94 kB / 481.8 MB, free: 4.035 MB / 393.1 MB
Performance of NFFlatten.flatten: allocations: 277.3 MB / 0.7414 GB, free: 10.18 MB / 0.6187 GB
Performance of NFFlatten.resolveConnections: allocations: 301.7 MB / 1.036 GB, free: 11.19 MB / 0.776 GB
Performance of NFFlatten.flattenFunctions: allocations: 49.52 MB / 1.084 GB, free: 9.66 MB / 0.8228 GB
Performance of NFPackage.collectConstants: allocations: 49.52 MB / 1.133 GB, free: 8.117 MB / 0.8697 GB
Performance of NFScalarize.scalarize: allocations: 10.21 MB / 1.143 GB, free: 13.9 MB / 0.8853 GB
Performance of NFConvertDAE.convert: allocations: 0.5117 GB / 1.654 GB, free: 14.27 MB / 1.153 GB
Performance of FrontEnd: time 9.358e-06/5.039, allocations: 0.5625 kB / 1.654 GB, free: 14.27 MB / 1.153 GB
Performance of Transformations before backend: allocations: 10.22 MB / 1.664 GB, free: 4.047 MB / 1.153 GB
Performance of Generate backend data structure: allocations: 0.64 GB / 2.304 GB, free: 10.88 MB / 1.532 GB
Performance of prepare preOptimizeDAE: allocations: 11.98 kB / 2.304 GB, free: 10.87 MB / 1.532 GB
Performance of preOpt normalInlineFunction (simulation): allocations: 62.88 MB / 2.366 GB, free: 341.2 MB / 1.579 GB
Performance of preOpt evaluateParameters (simulation): allocations: 300.4 MB / 2.659 GB, free: 219.2 MB / 1.579 GB

From what I understand, the first figure is the amount allocated by the specific function or processing phase, the second figure is the cumulative value of allocated memory, the third is the amount of memory freed by the GC and available for later processing without asking the OS for more memory, and the fourth is the cumulative value of memory freed by the GC. So, the difference between the second and fourth values should be the net allocated memory, and all other memory already obtained from the OS should be available for later functions to grab. Is that correct?

From the log, the process dies in BackendVarTransform.replaceEquations and at this stage none of the NF data structures will be present.

I understand that function is called from preOpt evaluateParameters. At the end of that phase, the net memory allocation for the N_80_M_80 case is about 1.1 GB. That system has about 128k equations, so the memory allocation is about 10 kB per equation/variable. The ASCII representation of the DAE that I get from the DAE dump takes a few hundred bytes per equation/variable; why should the internal representation require two orders of magnitude more memory?

clockPartitioning (simulation) allocates a lot of memory (and is this even a clocked system?)

No, it's not. I've opened #4756 about that.

Otherwise it seems to be the usual backend modules (removeSimpleEquations) and templates consuming memory.

I guess I should have a closer look at that and try to figure out what are the worst bottlenecks.

comment:5 in reply to: ↑ 4 ; follow-up: Changed 7 years ago by sjoelund.se

Replying to casella:

From what I understand, the first figure is the amount allocated by the specific function or processing phase, the second figure is the cumulative value of allocated memory, the third is the amount of memory freed by the GC and available for later processing without asking the OS for more memory, and the fourth is the cumulative value of memory freed by the GC. So, the difference between the second and fourth values should be the net allocated memory, and all other memory already obtained from the OS should be available for later functions to grab. Is that correct?

The fourth is the memory allocated by the GC (roughly equivalent to the memory used by the process). In other words, for the evaluateParameters line: 300.4 MB allocated in that phase, 2.659 GB allocated cumulatively, 219.2 MB currently free in the heap, and a total heap of 1.579 GB obtained from the OS.

From the log, the process dies in BackendVarTransform.replaceEquations and at this stage none of the NF data structures will be present.

I understand that function is called from preOpt evaluateParameters. At the end of that phase, the net memory allocation for the N_80_M_80 case is about 1.1 GB. That system has about 128k equations, so the memory allocation is about 10 kB per equation/variable. The ASCII representation of the DAE that I get from the DAE dump takes a few hundred bytes per equation/variable; why should the internal representation require two orders of magnitude more memory?

I'm not sure. Perhaps I can debug this slightly to see the sizes of internal data structures. But part of the reason is that the GC doesn't run before producing the numbers, so you don't really know how much memory is still in use from these logs.

comment:6 Changed 7 years ago by sjoelund.se

As for the original problem that OMC just dies from running out of memory, this is a consequence of the memory limit in the Linux OS. I think I will add a scripting API that can call GC_set_max_heap_size so that Boehm does not try to allocate too much memory (and I would set this limit lower than the ulimit, since the testing script needs to handle older OMC versions as well).
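For reference, a minimal sketch (not the actual implementation) of what that would look like on the C side, using the existing Boehm entry points GC_set_max_heap_size and GC_INIT; the 3.5 GB figure is just an example value:

#include <gc.h>  /* Boehm GC public API */

int main(void)
{
    /* Cap the collected heap; example value only. In practice the
       limit would be set somewhat below the testing script's ulimit. */
    GC_set_max_heap_size((GC_word)3584 * 1024 * 1024);
    GC_INIT();

    /* ... run the compiler; once the cap is reached, allocations fail
       inside the process instead of the OS killing it ... */
    return 0;
}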

comment:7 Changed 7 years ago by sjoelund.se

That seems to work; of course Boehm simply returns a NULL pointer and we get a segmentation fault, but we could possibly catch that.
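If we want to catch it more gracefully, Boehm also lets us install an out-of-memory callback via GC_set_oom_fn; a sketch (the handler and installer names are hypothetical):

#include <gc.h>
#include <stdio.h>
#include <stdlib.h>

/* Invoked by Boehm when an allocation cannot be satisfied; exiting
   with a diagnostic avoids the NULL-pointer segmentation fault. */
static void *omcOutOfMemory(size_t bytesRequested)
{
    fprintf(stderr, "omc: out of memory requesting %zu bytes\n",
            bytesRequested);
    exit(1);
    return NULL; /* not reached */
}

static void installOomHandler(void) /* call once after GC_INIT() */
{
    GC_set_oom_fn(omcOutOfMemory);
}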

comment:8 in reply to: ↑ 5 ; follow-up: Changed 7 years ago by casella

Replying to sjoelund.se:

The fourth is the memory allocated by the GC (roughly equivalent to the memory used by the process).

That's really confusing. Everywhere else in these logs, x / y means x is the local quantity and y is the cumulative one.

Could we have something like

Performance of preOpt evaluateParameters (simulation): 
allocations: 300.4 MB / 2.659 GB, free: 219.2 MB, total: 1.579 GB

instead of

Performance of preOpt evaluateParameters (simulation): 
allocations: 300.4 MB / 2.659 GB, free: 219.2 MB / 1.579 GB

I'm not sure. Perhaps I can debug this slightly to see the sizes of internal data structures. But part of the reason is that the GC doesn't run before producing the numbers, so you don't really know how much memory is still in use from these logs.

This is also quite confusing. What do you mean by "the GC doesn't run before producing the numbers"? Do you mean the GC is fired asynchronously w.r.t. the moment the log is dumped?

comment:9 in reply to: ↑ 8 Changed 7 years ago by sjoelund.se

Replying to casella:

This is also quite confusing. What do you mean by "the GC doesn't run before producing the numbers"? Do you mean the GC is fired asynchronously w.r.t. the moment the log is dumped?

The GC runs whenever Boehm feels it is running low on memory. It might take many backend phases before garbage is collected, so you never really know how much memory is actually in use.
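That also suggests how to get trustworthy snapshots: force a collection right before sampling. A sketch using the real Boehm calls GC_gcollect, GC_get_heap_size and GC_get_free_bytes (the helper name is made up):

#include <gc.h>
#include <stdio.h>

/* Force a full collection before sampling, so "free" reflects live
   data rather than garbage that simply hasn't been collected yet. */
static void reportHeap(const char *phase)
{
    GC_gcollect(); /* expensive if done at every logged event */
    size_t heapBytes = GC_get_heap_size();
    size_t freeBytes = GC_get_free_bytes();
    printf("%s: heap %zu B, free %zu B, live ~%zu B\n",
           phase, heapBytes, freeBytes, heapBytes - freeBytes);
}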

Changed 7 years ago by sjoelund.se

Ran execstat with GC for each logged event

comment:10 Changed 7 years ago by sjoelund.se

So, I did a run with execstat on my local machine (-O0, so some things might not have been removed from the stack). Based on that, it seems that the total used memory (total minus free) does not really increase much, except during template construction, a little during SimCode, a little during removeSimpleEquations, and a lot during frontend/backend. Perhaps the DAE structure is not collected? It would be good if I could verify whether it is removed, but perhaps it is just the expressions taking this amount of memory.

comment:11 Changed 7 years ago by sjoelund.se

For DistributionSystemModelica_N_80_M_80 old inst:

Notification: FrontEnd DAE uses 407.2 MB of memory.
Notification: FCore.Graph uses 1.945 MB of memory.
Notification: FCore.Cache uses 5.494 MB of memory.
Notification: Symbol Table (Absyn and SCode) uses 259.4 MB of memory.
Notification: FrontEnd DAE after transformations uses 407.2 MB of memory.
Notification: BackendDAECreate.lower uses 439.3 MB of memory.
Notification: BackendDAE (simulation) uses 305.5 MB of memory.
Notification: BackendDAE (initialization) uses 168.6 MB of memory.
Notification: BackendDAE (simulation+initialization) uses 380.6 MB of memory.
Notification: removedInitialEquationLst uses 0 of memory.
Notification: simCode uses 0.5931 GB of memory.

New inst:

Notification: FrontEnd DAE uses 454.3 MB of memory.
Notification: FCore.Graph uses 48 of memory.
Notification: FCore.Cache uses 192 of memory.
Notification: Symbol Table (Absyn and SCode) uses 259.4 MB of memory.
Notification: FrontEnd DAE after transformations uses 454.3 MB of memory.
Notification: BackendDAECreate.lower uses 495.2 MB of memory.
Notification: BackendDAE (simulation) uses 361.8 MB of memory.
Notification: BackendDAE (initialization) uses 196.8 MB of memory.
Notification: BackendDAE (simulation+initialization) uses 444.1 MB of memory.
Notification: removedInitialEquationLst uses 0 of memory.
Notification: simCode uses 0.6516 GB of memory.

So as you can see, even though we duplicate the equations into simulation and initialization, those systems of equations together are smaller than the initial system (and share some data; probably variable attributes, etc.). So the memory consumed should be shrinking. But perhaps the initial BackendDAE is always kept on the stack?

14x14 with execstat forcing GC at each phase:

Generate backend data structure: free: 73.43 MB / 0.6217 GB
  SymbolTable,Cache,Graph: 275MB
  FrontEnd DAE: 13MB
  BackendDAECreate.lower: 16MB
  Whatever was consumed before the translateModel call (in addition to symbol table): 140MB
  ???: 60MB

comment:12 Changed 7 years ago by sjoelund.se

Note: Collecting this data on consumed memory and adding GC on every execstat comes with a 10x performance penalty...

Much of the data is actually shared, so I don't really know why the GC uses up twice as much memory as expected... A new idea would be to check the local roots to see whether the print buffer or something like that has hidden allocations. Otherwise, SimCode uses up 15x as much memory as the BackEnd simulation+init+lambda0 systems and would probably be the best place to start shaving off allocated data...

Notification: Symbol Table, DAE, Graph, OldGraph, Cache, OldCache uses 331.5 MB of memory.
Notification: Performance of Serialize FrontEnd: time 28.16/31.56, allocations: 3.688 kB / 0.8353 GB, free: 95.24 MB / 0.6529 GB

comment:13 Changed 7 years ago by sjoelund.se

I calculated the memory used by the GC a bit wrong before. It uses 16-byte granules and 1 byte for internal GC use. And it seems the GC doesn't always clean up everything; I would suspect something in a register stored somewhere messes things up. A few times it doesn't even clean up the memory if I call GC.gcollect in main.

Much of the data stored on the heap is the symbol table Absyn/SCode at 297.5 MB of memory (225.8 MB without GC overhead).
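To make the accounting concrete, here is a back-of-the-envelope sketch of that overhead model; reading the bookkeeping cost as ~1 byte per granule is my assumption, so treat the result as an approximation:

#include <stddef.h>

enum { GRANULE = 16 };

/* Approximate heap footprint of one object: payload rounded up to
   whole 16-byte granules, plus ~1 byte per granule of GC-internal
   bookkeeping (my interpretation of the numbers above). */
static size_t gcFootprint(size_t payloadBytes)
{
    size_t granules = (payloadBytes + GRANULE - 1) / GRANULE;
    return granules * GRANULE + granules;
}

/* E.g. a 40-byte record takes 3 granules = 48 bytes of storage plus
   ~3 bytes of bookkeeping; for many small cells the rounding term
   dominates, which could explain the gap between 297.5 MB with
   overhead and 225.8 MB without. */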

eqs = number of equations during sorting+matching (simulation):

                               14x14     14x14/eq   80x80     80x80/eq
Frontend DAE                   15.0 MB   12.6 kB    474 MB    12.6 kB
Backend DAE (sim+init+lambda)  16.8 MB   14.1 kB    450 MB    11.9 kB
SimCode                        245 MB    206 kB     0.67 GB   18.3 kB

The SimCode numbers/eq should be way off because I think SimCode now includes a reference to the Absyn structure. I agree 12 kB/equation seems a little excessive though.

comment:14 Changed 7 years ago by sjoelund.se

The 2000-ish alias variables consume on average 2 kB of memory each (4 kB of data without considering shared structures). If the varName is set to a dummy, the data consumed drops to about half. Of the 2.009 MB of component references, the shared structures account for only 6 kB.

Part of the problem is that we store the component references in the wrong order, so we can't share data inside them in a good way (load[3,1].T and load[3,1].R have T and R as leaf nodes respectively). The good news is that the NF internally stores them in the fast order (load[3,1] as the leaf node, with two nodes T and R pointing to it); the bad news is that it converts this structure for the backend. Changing this in the old frontend would be a major undertaking, but it might be feasible once we drop support for it and adapt the backend to a different storage of component references (we could also convert this during backend lowering, but that would need a new ComponentReference module, and crefs in DAE.Exp would use the other kind of component reference, just to make life more annoying). A sketch of the two layouts follows after the cref breakdown below.

The current memory consumption of a typical cref in this model is:

idents     48 bytes
types      0.5625 kB (lots that should be possible to share; all variables on the same level have the same type regardless of their subscripts)
subscripts 192 bytes (simple integer subscripts)
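To illustrate the ordering point, a hypothetical sketch (not the actual OMC data structures) of the two layouts:

/* Old-frontend order: the chain runs root-to-leaf, so load[3,1].T and
   load[3,1].R each carry a private copy of the load[3,1] prefix and
   nothing can be shared between them. */
struct CrefOld {
    const char       *ident;   /* "load", then "T" or "R" */
    struct Subscript *subs;    /* [3,1], duplicated per cref */
    struct CrefOld   *next;    /* points toward the leaf */
};

/* NF order: the leaf node ("T" or "R") points back at a single shared
   "load[3,1]" prefix node, so the prefix is stored only once. */
struct CrefNew {
    const char       *ident;   /* "T" or "R" */
    struct Subscript *subs;
    struct CrefNew   *prefix;  /* shared "load[3,1]" node */
};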

comment:15 Changed 7 years ago by casella

  • Milestone changed from 2.0.0 to 2.1.0

I guess this is off the critical path until we have a solid implementation of the new front-end and we have resolved all pending coverage issues.

Once that is done, I think this optimization would be worth investigating. BTW, I assume that allocating memory and writing stuff to it takes time, so we may end up speeding up the model building phase, besides reducing the memory footprint.

comment:16 follow-up: Changed 7 years ago by ceraolo

I read everything on this ticket, but I see a big memory leak issue which seems to go beyond this ticket's analysis. In particular, solving it should not, I think, be regarded as a mere optimisation.

Steps to reproduce (OM 1.13 dev 380, win 64 bit):

1) Launch OMEdit. Task Manager now reports 450 MB as allocated to OMEdit.
2) Load the Buildings library (5.0.0, which should be the latest) and double-click on Air.Systems.SingleZone.Vav.ChillerDXHeatingEconomizer. Memory reaches 3.65 GB.
3) Unload the Buildings library. The allocated memory stays above 3.6 GB whatever you do, except closing OMEdit.

Should I open a different ticket on this?

Last edited 7 years ago by ceraolo

comment:17 in reply to: ↑ 16 Changed 7 years ago by casella

Replying to ceraolo:

Should I open a different ticket on this?

From what I understand, there is no memory leak in this case.

If you unload a library, the OMC internal garbage collector reclaims the memory so it can be re-used by the OMEdit process (which incorporates OMC via DLL), e.g. if you open another model or load another library. However, the memory is not given back to the OS. I understand doing so (particularly on Linux) would be problematic; maybe others can comment on this.

As I see it, the issue at stake is that OMC/OMEdit is still too inefficient in terms of memory requirements, because the model in question is not that big (Dymola reports 874 equations). This may just be related to how the current OMEdit uses the old front-end to get all the graphical information.

My advice is:

  • please open a separate ticket which specifically addresses this test case (this issue is not really related to Hudson)
  • check the performance once the new front-end-based API (the one that supports replaceable models and that is the primary goal of release 1.13.0) is finally rolled out, as it should be a lot more efficient

comment:18 in reply to: ↑ 1 Changed 6 years ago by casella

Replying to sjoelund.se:

I limited models to 4GB due to OS crashes on ripper.openmodelica.org

@sjoelund.se, for the time being could you set this limit to, say, 6 GB?

This would allow the two largest models in the ScalableTestSuite that currently fail because of that limit, namely DistributionSystemModelicaIndividual_N_80_M_80 and DistributionSystemModelica_N_112_M_112, to run as well.

In the worst case this would take (6-4)*2=4 additional GB from the heap. I guess the Ripper server can take this, can't it?

comment:19 Changed 6 years ago by casella

  • Component changed from Testing Framework to *unknown*
  • Summary changed from Test on large models fail due to lack of memory on Hudson to Reduce memory usage when building models in OpenModelica
  • Type changed from defect to enhancement

This ticket contains useful discussion items on how to save memory, but it strayed away from the original topic. I have changed the summary and opened #4982 on the original topic.
