Opened 3 years ago

Closed 3 years ago

Last modified 3 years ago

#6348 closed defect (fixed)

Simulation works in Linux but not Windows

Reported by: christiankral Owned by: adrpo
Priority: blocker Milestone: 1.17.0
Component: Frontend Version: v1.17.0-dev
Keywords: Cc: AnHeuermann, perost, adrpo, sjoelund.se

Description

I prepared a simulation model (attachment) for my students that I developed under Linux. If my students use this model under Windows, the simulation model does not simulate. Even if they use the exact same version I was using to test my model in Linux, the Windows version fails:

OpenModelica-v1.17.0-dev-341-gd731e9e716

Problem: The translation works OK, but the simulation gets stuck at 0%.

If my students use version 1.16.0, 1.16.1 or 1.16.2, the problem remains. Switching back to the old frontend does help and provides at least a workaround for now: the simulation then runs with no problems and shows the correct results.

Attachments (1)

InductionMotor.mo (6.5 KB) - added by christiankral 3 years ago.
Simulation models work in Linux but not Windows


Change History (34)

Changed 3 years ago by christiankral

Simulation models work in Linux but not Windows

comment:1 Changed 3 years ago by perost

I use Linux myself, but I ran the compilation through valgrind and found one issue that I fixed in 60f3fc1. I have no idea if that's the actual issue though, so if you're able to, please check whether the issue is fixed in the next nightly build. Otherwise I'll try to get one of our Windows developers to take a look at it.

comment:2 Changed 3 years ago by adeas31

This doesn't fix the problem. I attached gdb to the running simulation and it is stuck in generic_array_create. Here is the backtrace:

Thread 1 (Thread 41928.0x9024):
#0  0x0000000000476658 in generic_array_create ()
No symbol table info available.
#1  0x000000000040b26c in InductionMotor_eqFunction_1605 ()
No symbol table info available.
#2  0x0000000000416480 in InductionMotor_functionInitialEquations_0 ()
No symbol table info available.
#3  0x000000000041e8c3 in InductionMotor_functionInitialEquations ()
No symbol table info available.
#4  0x00000000004a4d4b in symbolic_initialization ()
No symbol table info available.
#5  0x00000000004a6460 in initialization ()
No symbol table info available.
#6  0x000000000048d840 in initializeModel ()
No symbol table info available.
#7  0x000000000048ee10 in solver_main ()
No symbol table info available.
#8  0x00000000004aa245 in callSolver.isra.0 ()
No symbol table info available.
#9  0x00000000004aa924 in startNonInteractiveSimulation ()
No symbol table info available.
#10 0x00000000004ae496 in _main_SimulationRuntime ()
No symbol table info available.
#11 0x00000000019d5ead in main ()
No symbol table info available.

comment:3 Changed 3 years ago by AnHeuermann

  • Cc AnHeuermann Karim.Abdelhak added

comment:4 follow-up: Changed 3 years ago by casella

I report here what I found out on this topic. First of all, it looks similar to the issue that I had when the lambda0-initial equations were not generated. At some point the runtime was fixed to run the lambda = 0 step also (though with the initial equations, not with the lambda0-initial ones), but the solver showed the same behaviour, i.e. the solver hung forever without issuing any log message.

I tried @christiankral's model with OMEdit v1.17.0-dev-356-gc2c52350cb on Windows 10, with solver logging.

When I use the OF, there are lots of conditional equations depending on parameters that are not recognised as structural, hence they end up in several nonlinear equation systems. In fact, with the given value of the parameter, they are indeed linear, so the solver should have no problem solving them.

During initialization the runtime logging (LOG_NLS_V) reports solving nonlinear equation systems 134 (24 equations, 6 tearing variables), 170 (24 equations, 6 tearing variables), and 421 (316 equations, 30 tearing variables), before apparently getting stuck solving equation system 636 (24 equations, 6 tearing variables), this time with no logging at all. In fact, I have no idea what the CPU is doing, but it keeps going full steam ahead until I kill it.

The NF recognizes the structural parameters and evaluates them, so there are no nonlinear equations, only linear system 293 with 88 equations and 12 linear variables. Using LOG_LS_V I get

Residual Norm 2.01084365779198e-08 of solution x:
System 293 numVars 12.
[1] imc.ir[3].im = -199.995265638199
[2] imc.ir[3].re = -263.081292871012
[3] imc.ir[2].im = -127.837450067651
[4] imc.ir[2].re = 304.741627114803
[5] imc.rotorCage.V_m.re = -56.2612179843652
[6] imc.rotorCage.V_m.im = 442.729714380106
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048256
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736218346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738

then the simulation hangs in the (apparently) same endless loop. Notice that in this case there are no other nonlinear initialization equations to be solved, only assignments.

From what I understand, there is some potential memory management issue with the nonlinear solver (possibly related to generic_array_create) that may or may not show up randomly depending on the memory layout of the different versions of OMC and of the host OS.

I think this should be investigated with high priority before releasing 1.17.0.

Last edited 3 years ago by casella

comment:5 in reply to: ↑ 4 ; follow-up: Changed 3 years ago by perost

Replying to casella:

The NF recognizes the structural parameters and evaluates them, so there are no nonlinear equations, only linear system 293 with 88 equations and 12 linear variables. Using LOG_LS_V I get

Residual Norm 2.01084365779198e-08 of solution x:
System 293 numVars 12.
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738
[1] imc.ir[3].im = -199.995265638199
[2] imc.ir[3].re = -263.081292871012
[3] imc.ir[2].im = -127.837450067651
[4] imc.ir[2].re = 304.741627114803
[5] imc.rotorCage.V_m.re = -56.2612179843652
[6] imc.rotorCage.V_m.im = 442.729714380106
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048256
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736218346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738

then the simulation hangs in the (apparently) same endless loop.

Is this a copy-paste error or did you actually get duplicates of equation 9-12? When I run the simulation on Linux I get this:

Residual Norm 2.12444458646675e-08 of solution x:
System 293 numVars 12.
[1] imc.ir[3].im = -199.995265638374
[2] imc.ir[3].re = -263.081292874695
[3] imc.ir[2].im = -127.837450070753
[4] imc.ir[2].re = 304.741627116796
[5] imc.rotorCage.V_m.re = -56.2612179820831
[6] imc.rotorCage.V_m.im = 442.729714384532
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048752
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736217346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631770403
[10] idealCloser.idealClosingSwitch[2].s.im = 140.325387168594
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373027
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316026239

As usual there are slight differences in the numbers due to floating point differences between the platforms, but the difference in the residual norm is relatively large (2.12e-08 vs 2.01e-08). I don't know if that has any significance though.

I also ran the simulation through valgrind, I didn't think to do that earlier. But it didn't show anything other than the usual GC noise.

comment:6 in reply to: ↑ 5 ; follow-up: Changed 3 years ago by casella

Replying to perost:

Is this a copy-paste error

Yes. I tried to copy-and-paste the whole message but it didn't work, then I messed up. I'll correct the comment now.

As usual there are slight differences in the numbers due to floating point differences between the platforms, but the difference in the residual norm is relatively large (2.12e-08 vs 2.01e-08). I don't know if that has any significance though.

Probably not, though I'm not 100% sure.

I also ran the simulation through valgrind, I didn't think to do that earlier. But it didn't show anything other than the usual GC noise.

Does it run forever also in your case?

comment:7 Changed 3 years ago by casella

  • Cc perost added; Karim.Abdelhak removed
  • Owner changed from perost to Karim.Abdelhak
  • Status changed from new to assigned

comment:8 in reply to: ↑ 6 Changed 3 years ago by perost

Replying to casella:

Does it run forever also in your case?

No, on Linux it simulates successfully without any issues whatsoever.

comment:9 Changed 3 years ago by casella

  • Cc adrpo added

Aha, good to know.

@adrpo, @adeas31, any clue?

comment:10 Changed 3 years ago by dr.christian.kral@…

This issue unfortunately also breaks the HanserModelica system library: see https://github.com/christiankral/HanserModelica/issues/12

comment:11 Changed 3 years ago by casella

  • Milestone changed from NeedsInput to 1.17.0

comment:12 Changed 3 years ago by dr.christian.kral@…

I am not sure if this helps: I somehow remember my students saying that the "old frontend" option worked in OM 1.16.0 (which is the version we use at school) and 1.16.1 but not in 1.16.2. I did not double-check this information so I am not sure if it is correct. However, I do not have a Windows system where I could try to install and compare the different versions.

comment:13 Changed 3 years ago by casella

It may be relevant, though it's difficult to understand why the old frontend should work on Windows and the new frontend on Linux.

We need to figure out what the solver is doing when it is stuck in the first place. I'll discuss this tomorrow in the developers' meeting.

comment:14 Changed 3 years ago by adrpo

Something really bad happens here, I get in gdb:
nr_of_elements = 13836186197161934851
here:
https://github.com/OpenModelica/OpenModelica/blob/master/OMCompiler/SimulationRuntime/c/util/generic_array.c#L122
So this loop, for(i = 0; i < nr_of_elements; ++i), takes forever.

Last edited 3 years ago by adrpo

comment:15 Changed 3 years ago by casella

Interesting :)

13836186197161934851 = 1100000000000100000000100000101000000000000000000000000000000011 in base 2, or C004020A00000003 in base 16. It is a 64-bit number. But of course the last 32 bits contain the value '3', which is probably what should be the actual meaningful value.

I'm not sure how size_t is actually interpreted by clang on Windows. From Stack Overflow I understand size_t is an unsigned integer of "at least 16 bits". Maybe there is something wrong in alloc_base_array that blows up on Windows but not on Linux.
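As a rough check of the arithmetic above, the following C sketch splits the value seen in gdb into its two 32-bit halves. The helper names high_half, low_half and join_halves are made up for this illustration and are not part of the OpenModelica runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits of a 64-bit value. */
static inline uint32_t high_half(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Lower 32 bits of a 64-bit value. */
static inline uint32_t low_half(uint64_t v)
{
    return (uint32_t)(v & 0xFFFFFFFFu);
}

/* Reassemble a 64-bit value from its two halves. */
static inline uint64_t join_halves(uint32_t hi, uint32_t lo)
{
    uint64_t v = ((uint64_t)hi << 32) | lo;
    /* Sanity check: splitting the result must give back the inputs. */
    assert(high_half(v) == hi && low_half(v) == lo);
    return v;
}
```

For the reported value, low_half(13836186197161934851ULL) is 3 and high_half(13836186197161934851ULL) is 0xC004020A, which fits the reading that the intended size 3 sits below 32 bits of unrelated data.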

Last edited 3 years ago by casella

comment:16 Changed 3 years ago by adrpo

The problem is like this:

// call via this macro to generic_array_create
alloc_Modelica_ComplexBlocks_Interfaces_ComplexInput_array(&tmp203, 1, 3);
// which calls 
alloc_base_array(1, 3)
// which contains the line
dest->dim_size[i] = va_arg(ap, _index_t); <--- expected type on stack is _index_t but we pass int 3
// note that _index_t is an internal type defining a 64bit integer

Changing the first line to:

alloc_Modelica_ComplexBlocks_Interfaces_ComplexInput_array(&tmp203, 1, (_index_t)3);

makes Windows happy and the simulation runs fine.
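The pattern can be reproduced outside the runtime. The sketch below is a minimal stand-in, not the actual alloc_base_array: index_t and count_elements are hypothetical names, and index_t is assumed to be a 64-bit integer like _index_t. The callee reads every variadic argument as index_t, so the caller has to pass exactly that type. A bare literal 3 is passed as int, and reading it back with va_arg(ap, index_t) is undefined behaviour in C; whether the upper 32 bits happen to be zero depends on the compiler and the calling convention.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>

/* Hypothetical stand-in for the runtime's 64-bit _index_t. */
typedef int64_t index_t;

/*
 * Reads ndims dimension sizes from the variable argument list and
 * returns the total number of elements. Every variadic argument
 * MUST be passed as index_t, because that is the type read below.
 */
static inline index_t count_elements(int ndims, ...)
{
    va_list ap;
    index_t n = 1;

    assert(ndims >= 0);
    va_start(ap, ndims);
    for (int i = 0; i < ndims; ++i) {
        /* The type given here must match what the caller pushed. */
        n *= va_arg(ap, index_t);
    }
    va_end(ap);
    return n;
}
```

With the casts in place, count_elements(2, (index_t)3, (index_t)4) returns 12. The uncast call count_elements(2, 3, 4) typically compiles without any warning, because the compiler cannot know which type the callee will read, and that is what makes this class of bug hard to spot.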

Last edited 3 years ago by adrpo

comment:17 Changed 3 years ago by casella

Oh. My. God.

Do you mean we should explicitly cast all of those literal constants everywhere in the generated code?

comment:18 Changed 3 years ago by casella

Well, this is the reason why I want to do modelling in Modelica and not in C :)

comment:19 Changed 3 years ago by casella

  • Cc sjoelund.se added

Adding @sjoelund.se in cc:, maybe he's got some ideas

comment:20 Changed 3 years ago by adrpo

Not everywhere. The problem occurs *only* with variable argument lists, which is what is used here for all the dimension sizes. When passing data through a variable argument list you really need to be careful about how you pass the data in and how you read it back from the list.
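One way to sidestep this hazard, shown purely as an illustration and not as a description of what the fix in this ticket does, is to pass the dimension sizes in a typed array instead of a variable argument list. The names index_t and count_elements_from_array are hypothetical. Because the array element type is declared, the compiler converts plain integer literals to the 64-bit type when the array is initialised, so no cast can be forgotten at the call site.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the runtime's 64-bit _index_t. */
typedef int64_t index_t;

/*
 * Returns the total number of elements for the given dimension sizes.
 * The sizes arrive through a typed pointer, so the caller's values
 * are already index_t by the time they are read here.
 */
static inline index_t count_elements_from_array(size_t ndims, const index_t *dims)
{
    index_t n = 1;

    assert(ndims == 0 || dims != NULL);
    for (size_t i = 0; i < ndims; ++i) {
        n *= dims[i];
    }
    return n;
}
```

A call such as count_elements_from_array(2, (const index_t[]){3, 4}) returns 12 even though the literals are written without casts. The trade-off is that every call site has to build an array, which is more verbose in generated code.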

comment:21 Changed 3 years ago by casella

Do you know all the places in the code generator where this pattern is used?

comment:22 Changed 3 years ago by adrpo

Yes, I'm working on a PR right now.

comment:23 follow-up: Changed 3 years ago by casella

Great! This was actually quite a nasty bug...

#5799 is probably due to the same issue, the behaviour is exactly the same. We'll check when your PR is merged in.

comment:24 Changed 3 years ago by casella

  • Owner changed from Karim.Abdelhak to adrpo

comment:26 Changed 3 years ago by casella

Checked after merging PR 7207 on v1.18.0-dev-74-gc28aeab1dd (64-bit) under Windows 10, the simulation ran fine.

@adrpo, if the review of your PR is positive, can you please port it to 1.17.0 and then close this ticket?

comment:27 in reply to: ↑ 23 Changed 3 years ago by casella

Replying to casella:

#5799 is probably due to the same issue, the behaviour is exactly the same. We'll check when your PR is merged in.

Unfortunately the issue in #5799 is still present. It is less urgent to fix it and it could be postponed to 1.18.0.

comment:28 Changed 3 years ago by casella

Please port to maintenance/1.17

comment:29 Changed 3 years ago by adrpo

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:30 Changed 3 years ago by dr.christian.kral@…

@casella Do I understand correctly that this bug fix is not backported to v1.16.4 or v1.16.5?

comment:31 Changed 3 years ago by adrpo

Yes, only to 1.17.

comment:32 Changed 3 years ago by casella

@christiankral, the idea of 1.16.2 was to fix a couple of critical issues in 1.16.0 at the end of December, while 1.17.0 was in the works.

Unfortunately the Christmas break, some post-break backlog and some really bad luck meant that the fixes were not really good until 1.16.5 a few days ago. But in any case, 1.16.x was not meant to carry all the fixes that were made in January and February, those are meant to be part of 1.17.0. We kept it just in case we have trouble with 1.17.0-beta, or if someone wants to stick to 1.16 for some reason (e.g. using macOS).

You can try 1.17.0-dev.beta2, it is out already, it includes a lot of improvements and some critical fixes from 1.18.0-dev (including this one) and is expected to be far superior to 1.16.5 on many fronts.

comment:33 Changed 3 years ago by casella

Tested successfully with v1.17.0-dev.beta2 (64-bit) under Windows 10.
