Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#6348 closed defect (fixed)

Simulation works in Linux but not Windows

Reported by: Christian Kral Owned by: Adrian Pop
Priority: blocker Milestone: 1.17.0
Component: Frontend Version: v1.17.0-dev
Keywords: Cc: Andreas Heuermann, Per Östlund, Adrian Pop, Martin Sjölund

Description

I prepared a simulation model (attached) for my students, which I developed under Linux. When my students run this model under Windows, it does not simulate. Even with the exact same version I used to test the model on Linux, the Windows version fails:

OpenModelica-v1.17.0-dev-341-gd731e9e716

Problem: The translation works OK, but the simulation gets stuck at 0%.

If my students use version 1.16.0, 1.16.1 or 1.16.2, the problem remains. However, switching back to the old frontend helps and provides at least a workaround for now: the simulation then runs without problems and shows the correct results.

Attachments (1)

InductionMotor.mo (6.5 KB ) - added by Christian Kral 4 years ago.
Simulation models work in Linux but not Windows


Change History (34)

by Christian Kral, 4 years ago

Attachment: InductionMotor.mo added

Simulation models work in Linux but not Windows

comment:1 by Per Östlund, 4 years ago

I use Linux myself, but I ran the compilation through valgrind and found one issue that I fixed in 60f3fc1. I have no idea if that's the actual issue though, so if you're able to, please check whether the issue is fixed in the next nightly build. Otherwise I'll try to get one of our Windows developers to take a look at it.

comment:2 by Adeel Asghar, 4 years ago

This doesn't fix the problem. I attached gdb to the running simulation, and it is stuck in generic_array_create. Here is the backtrace:

Thread 1 (Thread 41928.0x9024):
#0  0x0000000000476658 in generic_array_create ()
No symbol table info available.
#1  0x000000000040b26c in InductionMotor_eqFunction_1605 ()
No symbol table info available.
#2  0x0000000000416480 in InductionMotor_functionInitialEquations_0 ()
No symbol table info available.
#3  0x000000000041e8c3 in InductionMotor_functionInitialEquations ()
No symbol table info available.
#4  0x00000000004a4d4b in symbolic_initialization ()
No symbol table info available.
#5  0x00000000004a6460 in initialization ()
No symbol table info available.
#6  0x000000000048d840 in initializeModel ()
No symbol table info available.
#7  0x000000000048ee10 in solver_main ()
No symbol table info available.
#8  0x00000000004aa245 in callSolver.isra.0 ()
No symbol table info available.
#9  0x00000000004aa924 in startNonInteractiveSimulation ()
No symbol table info available.
#10 0x00000000004ae496 in _main_SimulationRuntime ()
No symbol table info available.
#11 0x00000000019d5ead in main ()
No symbol table info available.

comment:3 by Andreas Heuermann, 4 years ago

Cc: Andreas Heuermann, Karim Adbdelhak added

comment:4 by Francesco Casella, 4 years ago

I report here what I found out on this topic. First of all, it looks similar to the issue that I had when the lambda0-initial equations were not generated. At some point the runtime was fixed to also run the lambda = 0 step (though with the initial equations, not with the lambda0-initial ones), but the solver showed the same behaviour, i.e. it hung forever without issuing any log message.

I tried @christiankral's model with OMEdit v1.17.0-dev-356-gc2c52350cb on Windows 10, with solver logging.

When I use the OF, there are lots of conditional equations depending on parameters that are not recognised as structural, hence they end up in several nonlinear equation systems. In fact, with the given value of the parameter, they are indeed linear, so the solver should have no problem solving them.

During initialization the runtime logging (LOG_NLS_V) reports solving nonlinear equation systems 134 (24 equations, 6 tearing variables), 170 (24 equations, 6 tearing variables), and 421 (316 equations, 30 tearing variables), before apparently getting stuck solving equation system 636 (24 equations, 6 tearing variables), this time with no logging at all. In fact, I have no idea what the CPU is doing, but it keeps going full steam ahead until I kill it.

The NF recognizes the structural parameters and evaluates them, so there are no nonlinear equations, only linear system 293 with 88 equations and 12 linear variables. Using LOG_LS_V I get

Residual Norm 2.01084365779198e-08 of solution x:
System 293 numVars 12.
[1] imc.ir[3].im = -199.995265638199
[2] imc.ir[3].re = -263.081292871012
[3] imc.ir[2].im = -127.837450067651
[4] imc.ir[2].re = 304.741627114803
[5] imc.rotorCage.V_m.re = -56.2612179843652
[6] imc.rotorCage.V_m.im = 442.729714380106
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048256
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736218346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738

then the simulation hangs in the (apparently) same endless loop. Notice that in this case there are no other nonlinear initialization equations to be solved, only assignments.

From what I understand, there is some potential memory management issue with the nonlinear solver (possibly related to generic_array_create) that may or may not show up randomly depending on the memory layout of the different versions of OMC and of the host OS.

I think this should be investigated with high priority before releasing 1.17.0.


in reply to:  4 ; comment:5 by Per Östlund, 4 years ago

Replying to casella:

The NF recognizes the structural parameters and evaluates them, so there are no nonlinear equations, only linear system 293 with 88 equations and 12 linear variables. Using LOG_LS_V I get

Residual Norm 2.01084365779198e-08 of solution x:
System 293 numVars 12.
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738
[1] imc.ir[3].im = -199.995265638199
[2] imc.ir[3].re = -263.081292871012
[3] imc.ir[2].im = -127.837450067651
[4] imc.ir[2].re = 304.741627114803
[5] imc.rotorCage.V_m.re = -56.2612179843652
[6] imc.rotorCage.V_m.im = 442.729714380106
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048256
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736218346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738

then the simulation hangs in the (apparently) same endless loop.

Is this a copy-paste error or did you actually get duplicates of equations 9-12? When I run the simulation on Linux I get this:

Residual Norm 2.12444458646675e-08 of solution x:
System 293 numVars 12.
[1] imc.ir[3].im = -199.995265638374
[2] imc.ir[3].re = -263.081292874695
[3] imc.ir[2].im = -127.837450070753
[4] imc.ir[2].re = 304.741627116796
[5] imc.rotorCage.V_m.re = -56.2612179820831
[6] imc.rotorCage.V_m.im = 442.729714384532
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048752
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736217346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631770403
[10] idealCloser.idealClosingSwitch[2].s.im = 140.325387168594
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373027
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316026239

As usual there are slight differences in the numbers due to floating point differences between the platforms, but the difference in the residual norm is relatively large (2.12e-08 vs 2.01e-08). I don't know if that has any significance though.

I also ran the simulation through valgrind, I didn't think to do that earlier. But it didn't show anything other than the usual GC noise.

in reply to:  5 ; comment:6 by Francesco Casella, 4 years ago

Replying to perost:

Is this a copy-paste error

Yes. I tried to copy-and-paste the whole message but it didn't work, then I messed up. I'll correct the comment now.

As usual there are slight differences in the numbers due to floating point differences between the platforms, but the difference in the residual norm is relatively large (2.12e-08 vs 2.01e-08). I don't know if that has any significance though.

Probably not, though I'm not 100% sure.

I also ran the simulation through valgrind, I didn't think to do that earlier. But it didn't show anything other than the usual GC noise.

Does it run forever also in your case?

comment:7 by Francesco Casella, 4 years ago

Cc: Per Östlund added; Karim Adbdelhak removed
Owner: changed from Per Östlund to Karim Adbdelhak
Status: new → assigned

in reply to:  6 comment:8 by Per Östlund, 4 years ago

Replying to casella:

Does it run forever also in your case?

No, on Linux it simulates successfully without any issues whatsoever.

comment:9 by Francesco Casella, 4 years ago

Cc: Adrian Pop added

Aha, good to know.

@adrpo, @adeas31, any clue?

comment:10 by dr.christian.kral@…, 4 years ago

This issue unfortunately also breaks the HanserModelica system library: see https://github.com/christiankral/HanserModelica/issues/12

comment:11 by Francesco Casella, 4 years ago

Milestone: NeedsInput → 1.17.0

comment:12 by dr.christian.kral@…, 4 years ago

I am not sure if this helps: I vaguely remember my students saying that the "old frontend" option worked in OM 1.16.0 (which is the version we use at school) and 1.16.1 but not in 1.16.2. I did not double-check this information, so I am not sure if it is correct. However, I do not have a Windows system where I could install and compare the different versions.

comment:13 by Francesco Casella, 4 years ago

It may be relevant, though it's difficult to understand why the old frontend should work on Windows and the new frontend on Linux.

We need to figure out what the solver is doing when it is stuck in the first place. I'll discuss this tomorrow in the developers' meeting.

comment:14 by Adrian Pop, 4 years ago

Something really bad happens here, I get in gdb:
nr_of_elements = 13836186197161934851
here:
https://github.com/OpenModelica/OpenModelica/blob/master/OMCompiler/SimulationRuntime/c/util/generic_array.c#L122
So this loop, for(i = 0; i < nr_of_elements; ++i), takes forever.


comment:15 by Francesco Casella, 4 years ago

Interesting :)

13836186197161934851 = 1100000000000100000000100000101000000000000000000000000000000011 in base 2, or C004020A00000003 in base 16. It is a 64-bit number. But of course the last 32 bits contain the value '3', which is probably what should be the actual meaningful value.

I'm not sure how size_t is actually interpreted by clang on Windows. From Stack Overflow I understand size_t is an unsigned integer of "at least 16 bits". Maybe there is something wrong in alloc_base_array that blows up on Windows but not on Linux.
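
The decomposition above can be checked with a few lines of C. This is only an illustrative sketch: the helper names high_word, low_word and self_check are made up for this example and are not part of the OpenModelica runtime. It splits the value seen in gdb into its two 32-bit halves, which shows the expected dimension size 3 in the low half and unrelated bits in the high half.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits of a 64-bit value. */
static uint32_t high_word(uint64_t v)
{
    return (uint32_t)(v >> 32);
}

/* Lower 32 bits of a 64-bit value. */
static uint32_t low_word(uint64_t v)
{
    return (uint32_t)(v & 0xFFFFFFFFu);
}

/* Checks the value reported by gdb in comment:14. */
static void self_check(void)
{
    const uint64_t observed = 13836186197161934851ULL;
    assert(low_word(observed) == 3u);           /* the intended size */
    assert(high_word(observed) == 0xC004020Au); /* stray upper bits  */
}
```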


comment:16 by Adrian Pop, 4 years ago

The problem is like this:

// call via this macro to generic_array_create
alloc_Modelica_ComplexBlocks_Interfaces_ComplexInput_array(&tmp203, 1, 3);
// which calls 
alloc_base_array(1, 3)
// which contains the line
dest->dim_size[i] = va_arg(ap, _index_t); // <--- va_arg expects an _index_t on the stack, but we pass the int 3

Changing the first line to:

alloc_Modelica_ComplexBlocks_Interfaces_ComplexInput_array(&tmp203, 1, (_index_t)3);

makes Windows happy and the simulation runs fine.
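
A minimal sketch of the pattern, assuming a 64-bit index type. The names index_t and count_elements are hypothetical stand-ins for this example, not the runtime's _index_t or generic_array_create. In C, reading a variadic argument with va_arg using a type that does not match what the caller passed is undefined behaviour, so when an int is read as a wider type the upper bits can be whatever happened to be in that slot. Casting at the call site keeps both sides in agreement.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime's index type (assumed 64-bit). */
typedef long long index_t;

/* Multiplies ndims dimension sizes read from the variable argument list.
   Every variadic argument MUST be passed as index_t, because va_arg
   reads exactly that type. */
static size_t count_elements(int ndims, ...)
{
    va_list ap;
    size_t n = 1;
    int i;

    assert(ndims >= 0);
    va_start(ap, ndims);
    for (i = 0; i < ndims; ++i) {
        index_t dim = va_arg(ap, index_t);
        assert(dim >= 0);
        n *= (size_t)dim;
    }
    va_end(ap);
    return n;
}
```

A correct call looks like count_elements(1, (index_t)3); writing count_elements(1, 3) instead passes an int and reproduces the mismatch described above.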


comment:17 by Francesco Casella, 4 years ago

Oh. My. God.

Do you mean we should explicitly cast all of those literal constants everywhere in the generated code?

comment:18 by Francesco Casella, 4 years ago

Well, this is the reason why I want to do modelling in Modelica and not in C :)

comment:19 by Francesco Casella, 4 years ago

Cc: Martin Sjölund added

Adding @sjoelund.se in cc:, maybe he's got some ideas

comment:20 by Adrian Pop, 4 years ago

Not everywhere. The problem is *only* with variable argument lists, which are what is used here for all the dimension sizes. When passing data through a variable argument list you really need to be careful that the type you send in matches the type you read back from the list.
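
For comparison, one way to avoid this class of mismatch altogether is to pass the sizes in a typed array, so that the compiler converts int literals for you. This is a hedged sketch of that alternative and not what the fix for this ticket does; index_t, count_elements_from_array and COUNT_ELEMENTS are hypothetical names, and the macro relies on C99 compound literals and variadic macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the runtime's index type (assumed 64-bit). */
typedef long long index_t;

/* Multiplies the ndims sizes stored in dims. The parameter types are
   checked by the compiler, unlike variadic arguments. */
static size_t count_elements_from_array(size_t ndims, const index_t *dims)
{
    size_t n = 1;
    size_t i;

    assert(ndims == 0 || dims != NULL);
    for (i = 0; i < ndims; ++i) {
        assert(dims[i] >= 0);
        n *= (size_t)dims[i];
    }
    return n;
}

/* Builds a temporary index_t array from the listed sizes; int literals
   are converted to index_t by the array initializer. */
#define COUNT_ELEMENTS(...)                                  \
    count_elements_from_array(                               \
        sizeof((index_t[]){__VA_ARGS__}) / sizeof(index_t),  \
        (index_t[]){__VA_ARGS__})
```

With this shape, COUNT_ELEMENTS(3) and COUNT_ELEMENTS(3, 4) need no casts at the call site, at the cost of requiring a C99 compiler for the generated code.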

comment:21 by Francesco Casella, 4 years ago

Do you know all the places in the code generator where this pattern is used?

comment:22 by Adrian Pop, 4 years ago

Yes, I'm working on a PR right now.

comment:23 by Francesco Casella, 4 years ago

Great! This was actually quite a nasty bug...

#5799 is probably due to the same issue, the behaviour is exactly the same. We'll check when your PR is merged in.

comment:24 by Francesco Casella, 4 years ago

Owner: changed from Karim Adbdelhak to Adrian Pop

comment:26 by Francesco Casella, 4 years ago

Checked after merging PR 7207 on v1.18.0-dev-74-gc28aeab1dd (64-bit) under Windows 10, the simulation ran fine.

@adrpo, if the review of your PR is positive, can you please port it to 1.17.0 and then close this ticket?

in reply to:  23 comment:27 by Francesco Casella, 4 years ago

Replying to casella:

#5799 is probably due to the same issue, the behaviour is exactly the same. We'll check when your PR is merged in.

Unfortunately the issue in #5799 is still present. It is less urgent to fix it and it could be postponed to 1.18.0.

comment:28 by Francesco Casella, 4 years ago

Please port to maintenance/1.17

comment:29 by Adrian Pop, 4 years ago

Resolution: fixed
Status: assigned → closed

comment:30 by dr.christian.kral@…, 4 years ago

@casella Do I understand correctly that this bug fix is not backported to v1.16.4 and v1.16.5 respectively?

comment:31 by Adrian Pop, 4 years ago

Yes, only to 1.17.

comment:32 by Francesco Casella, 4 years ago

@christiankral, the idea of 1.16.2 was to fix a couple of critical issues in 1.16.0 at the end of December, while 1.17.0 was in the works.

Unfortunately the Christmas break, some post-break backlog and some really bad luck meant that the fixes were not really good until 1.16.5 a few days ago. But in any case, 1.16.x was not meant to carry all the fixes that were made in January and February; those are meant to be part of 1.17.0. We kept it just in case we have trouble with 1.17.0-beta, or if someone wants to stick to 1.16 for some reason (e.g. using macOS).

You can try 1.17.0-dev.beta2, it is out already, it includes a lot of improvements and some critical fixes from 1.18.0-dev (including this one) and is expected to be far superior to 1.16.5 on many fronts.

comment:33 by Francesco Casella, 4 years ago

Tested successfully with v1.17.0-dev.beta2 (64-bit) under Windows 10.
