#6348 closed defect (fixed)
Simulation works in Linux but not Windows
Reported by: | Christian Kral | Owned by: | Adrian Pop |
---|---|---|---|
Priority: | blocker | Milestone: | 1.17.0 |
Component: | Frontend | Version: | v1.17.0-dev |
Keywords: | Cc: | Andreas Heuermann, Per Östlund, Adrian Pop, Martin Sjölund |
Description
I prepared a simulation model (attachment) for my students that I developed under Linux. If my students use this model under Windows, the simulation model does not simulate. Even if they use the exact same version I was using to test my model in Linux, the Windows version fails:
OpenModelica-v1.17.0-dev-341-gd731e9e716
Problem: The translation works OK, but the simulation gets stuck at 0%.
If my students use the 1.16.0, 1.16.1 or 1.16.2 version the problem remains. But: Switching back to the old frontend does help and provides at least a workaround for now.
If I switch back to the old frontend, the simulation runs with no problems and shows the correct results.
Attachments (1)
Change History (34)
by , 4 years ago
Attachment: | InductionMotor.mo added |
---|
comment:1 by , 4 years ago
I use Linux myself, but I ran the compilation through valgrind and found one issue that I fixed in 60f3fc1. I have no idea if that's the actual issue though, so if you're able to, please check whether the issue is fixed in the next nightly build. Otherwise I'll try to get one of our Windows developers to take a look at it.
comment:2 by , 4 years ago
This doesn't fix the problem. I attached gdb to the running simulation and it is stuck in generic_array_create. Here is the backtrace:
Thread 1 (Thread 41928.0x9024):
#0 0x0000000000476658 in generic_array_create ()
No symbol table info available.
#1 0x000000000040b26c in InductionMotor_eqFunction_1605 ()
No symbol table info available.
#2 0x0000000000416480 in InductionMotor_functionInitialEquations_0 ()
No symbol table info available.
#3 0x000000000041e8c3 in InductionMotor_functionInitialEquations ()
No symbol table info available.
#4 0x00000000004a4d4b in symbolic_initialization ()
No symbol table info available.
#5 0x00000000004a6460 in initialization ()
No symbol table info available.
#6 0x000000000048d840 in initializeModel ()
No symbol table info available.
#7 0x000000000048ee10 in solver_main ()
No symbol table info available.
#8 0x00000000004aa245 in callSolver.isra.0 ()
No symbol table info available.
#9 0x00000000004aa924 in startNonInteractiveSimulation ()
No symbol table info available.
#10 0x00000000004ae496 in _main_SimulationRuntime ()
No symbol table info available.
#11 0x00000000019d5ead in main ()
No symbol table info available.
comment:3 by , 4 years ago
Cc: | added |
---|
follow-up: 5 comment:4 by , 4 years ago
I report here what I found out on this topic. First of all, it looks similar to the issue that I had when the lambda0-initial equations were not generated. At some point the runtime was fixed to run the lambda = 0 step also (though with the initial equations, not with the lambda0-initial ones), but the solver showed the same behaviour, i.e. the solver hung forever without issuing any log message.
I tried @christiankral's model with OMEdit v1.17.0-dev-356-gc2c52350cb on Windows 10, with solver logging.
When I use the OF, there are lots of conditional equations depending on parameters that are not recognised as structural, hence they end up in several nonlinear equation systems. In fact, with the given value of the parameter, they are indeed linear, so the solver should have no problem solving them.
During initialization the runtime logging (LOG_NLS_V) reports solving nonlinear equation systems 134 (24 equations, 6 tearing variables), 170 (24 equations, 6 tearing variables), and 421 (316 equations, 30 tearing variables), before apparently getting stuck solving equation system 636 (24 equations, 6 tearing variables), this time with no logging at all. In fact, I have no idea what the CPU is doing, but it keeps going full steam ahead until I kill it.
The NF recognizes the structural parameters and evaluates them, so there are no nonlinear equations, only linear system 293 with 88 equations and 12 linear variables. Using LOG_LS_V I get
Residual Norm 2.01084365779198e-08 of solution x: System 293 numVars 12.
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738
[1] imc.ir[3].im = -199.995265638199
[2] imc.ir[3].re = -263.081292871012
[3] imc.ir[2].im = -127.837450067651
[4] imc.ir[2].re = 304.741627114803
[5] imc.rotorCage.V_m.re = -56.2612179843652
[6] imc.rotorCage.V_m.im = 442.729714380106
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048256
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736218346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738
then the simulation hangs in the (apparently) same endless loop. Notice that in this case there are no other nonlinear initialization equations to be solved, only assignments.
From what I understand, there is some potential memory management issue with the nonlinear solver (possibly related to generic_array_create) that may or may not show up randomly depending on the memory layout of the different versions of OMC and of the host OS.
I think this should be investigated with high priority before releasing 1.17.0.
follow-up: 6 comment:5 by , 4 years ago
Replying to casella:
The NF recognizes the structural parameters and evaluates them, so there are no nonlinear equations, only linear system 293 with 88 equations and 12 linear variables. Using LOG_LS_V I get
Residual Norm 2.01084365779198e-08 of solution x: System 293 numVars 12.
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738
[1] imc.ir[3].im = -199.995265638199
[2] imc.ir[3].re = -263.081292871012
[3] imc.ir[2].im = -127.837450067651
[4] imc.ir[2].re = 304.741627114803
[5] imc.rotorCage.V_m.re = -56.2612179843652
[6] imc.rotorCage.V_m.im = 442.729714380106
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048256
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736218346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631771844
[10] idealCloser.idealClosingSwitch[2].s.im = 140.32538717009
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373318
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316014738
then the simulation hangs in the (apparently) same endless loop.
Is this a copy-paste error or did you actually get duplicates of equations 9-12? When I run the simulation on Linux I get this:
Residual Norm 2.12444458646675e-08 of solution x: System 293 numVars 12.
[1] imc.ir[3].im = -199.995265638374
[2] imc.ir[3].re = -263.081292874695
[3] imc.ir[2].im = -127.837450070753
[4] imc.ir[2].re = 304.741627116796
[5] imc.rotorCage.V_m.re = -56.2612179820831
[6] imc.rotorCage.V_m.im = 442.729714384532
[7] idealCloser.idealClosingSwitch[3].s.im = 212.063349048752
[8] idealCloser.idealClosingSwitch[1].s.im = -352.388736217346
[9] idealCloser.idealClosingSwitch[3].s.re = 284.468631770403
[10] idealCloser.idealClosingSwitch[2].s.im = 140.325387168594
[11] idealCloser.idealClosingSwitch[2].s.re = -325.886563373027
[12] idealCloser.idealClosingSwitch[1].s.re = 41.4179316026239
As usual there are slight differences in the numbers due to floating point differences between the platforms, but the difference in the residual norm is relatively large (2.12e-08 vs 2.01e-08). I don't know if that has any significance though.
I also ran the simulation through valgrind, I didn't think to do that earlier. But it didn't show anything other than the usual GC noise.
follow-up: 8 comment:6 by , 4 years ago
Replying to perost:
Is this a copy-paste error
Yes. I tried to copy-and-paste the whole message but it didn't work, then I messed up. I'll correct the comment now.
As usual there are slight differences in the numbers due to floating point differences between the platforms, but the difference in the residual norm is relatively large (2.12e-08 vs 2.01e-08). I don't know if that has any significance though.
Probably not, though I'm not 100% sure.
I also ran the simulation through valgrind, I didn't think to do that earlier. But it didn't show anything other than the usual GC noise.
Does it run forever also in your case?
comment:7 by , 4 years ago
Cc: | added; removed |
---|---|
Owner: | changed from | to
Status: | new → assigned |
comment:8 by , 4 years ago
Replying to casella:
Does it run forever also in your case?
No, on Linux it simulates successfully without any issues whatsoever.
comment:10 by , 4 years ago
This issue unfortunately also breaks the HanserModelica system library: see https://github.com/christiankral/HanserModelica/issues/12
comment:11 by , 4 years ago
Milestone: | NeedsInput → 1.17.0 |
---|
comment:12 by , 4 years ago
I am not sure if this helps: I vaguely remember my students saying that the "old frontend" option worked in OM 1.16.0 (which is the version we use at school) and 1.16.1 but not in 1.16.2. I did not double-check this information, so I am not sure if it is correct. However, I do not have a Windows system where I could try to install and compare the different versions.
comment:13 by , 4 years ago
It may be relevant, though it's difficult to understand why the old frontend should work in Windows and the new frontend in Linux.
We need to figure out what the solver is doing when it is stuck in the first place. I'll discuss this tomorrow in the developers' meeting.
comment:14 by , 4 years ago
Something really bad happens here, I get in gdb:
nr_of_elements = 13836186197161934851
here:
https://github.com/OpenModelica/OpenModelica/blob/master/OMCompiler/SimulationRuntime/c/util/generic_array.c#L122
So this loop, for(i = 0; i < nr_of_elements; ++i), takes forever.
comment:15 by , 4 years ago
Interesting :)
13836186197161934851 = 1100000000000100000000100000101000000000000000000000000000000011 in base 2, or C004020A00000003 in base 16. It is a 64-bit number. But of course the last 32 bits contain the value '3', which is probably what should be the actual meaningful value.
I'm not sure how size_t is actually interpreted by clang on Windows. From stackoverflow I understand size_t is an unsigned integer of "at least 16 bits". Maybe there is something wrong in alloc_base_array that blows up on Windows but not on Linux.
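The bit pattern above can be double-checked with a few lines of C. This is only a sketch: the helper names (high32, low32, check_observed_value) are made up for illustration and are not part of the OpenModelica runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Upper 32 bits of a 64-bit value. */
uint32_t high32(uint64_t v) { return (uint32_t)(v >> 32); }

/* Lower 32 bits of a 64-bit value. */
uint32_t low32(uint64_t v) { return (uint32_t)(v & 0xFFFFFFFFu); }

/* The element count reported by gdb in comment:14. */
const uint64_t observed_nr_of_elements = 13836186197161934851ULL;

/* Checks that the value is 0xC004020A00000003: the meaningful '3' sits in
   the lower half, the upper half holds unrelated bits. */
void check_observed_value(void)
{
    assert(observed_nr_of_elements == 0xC004020A00000003ULL);
    assert(low32(observed_nr_of_elements) == 3u);
    assert(high32(observed_nr_of_elements) == 0xC004020Au);
}
```

A clean lower half next to a non-zero upper half is what one would expect if a 32-bit value had been read back as a 64-bit one.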
comment:16 by , 4 years ago
The problem is like this:
// call via this macro to generic_array_create
alloc_Modelica_ComplexBlocks_Interfaces_ComplexInput_array(&tmp203, 1, 3);
// which calls alloc_base_array(1, 3)
// which contains the line
dest->dim_size[i] = va_arg(ap, _index_t); // <--- expected type on stack is _index_t but we pass int 3
// note that _index_t is an internal type defining a 64bit integer
Changing the first line to:
alloc_Modelica_ComplexBlocks_Interfaces_ComplexInput_array(&tmp203, 1, (_index_t)3);
makes Windows happy and the simulation runs fine.
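The same pattern in isolation, as a minimal sketch: index_t_demo is a hypothetical stand-in for the runtime's _index_t, and product_of_dims is an illustrative function, not the actual alloc_base_array. The point is that va_arg must be given exactly the type the caller passed, and an unsuffixed literal such as 3 is passed as int.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the runtime's 64-bit _index_t. */
typedef int64_t index_t_demo;

#define MAX_DIMS 8

/* Reads ndims dimension sizes from the variable argument list and returns
   their product. Every variadic argument MUST be passed as index_t_demo. */
size_t product_of_dims(int ndims, ...)
{
    va_list ap;
    size_t n = 1;
    assert(ndims >= 0 && ndims <= MAX_DIMS);
    va_start(ap, ndims);
    for (int i = 0; i < ndims; ++i) {
        index_t_demo d = va_arg(ap, index_t_demo); /* reads 64 bits */
        n *= (size_t)d;
    }
    va_end(ap);
    return n;
}

/* WRONG: product_of_dims(1, 3) passes a 32-bit int where 64 bits are read.
   That is undefined behaviour, so it may appear to work on one platform
   and return garbage on another. */

/* RIGHT: cast the literal so that the passed type matches va_arg. */
size_t three_elements(void)
{
    return product_of_dims(1, (index_t_demo)3);
}
```

With the cast, three_elements() returns 3 regardless of platform; the compiler cannot diagnose the uncast call because the types of variadic arguments are not checked.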
comment:17 by , 4 years ago
Oh. My. God.
Do you mean we should explicitly cast all of those literal constants everywhere in the generated code?
comment:18 by , 4 years ago
Well, this is the reason why I want to do modelling in Modelica and not in C :)
comment:20 by , 4 years ago
Not everywhere. The problem is *only* with variable argument lists, which is what is used here for all the dimension sizes. When passing data through a variable argument list you really need to be careful about the type you pass in and the type you read back from the list.
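For comparison, one way to avoid this class of error altogether is to pass the sizes through a typed array instead of a variable argument list, so that the compiler converts each literal. This is a hypothetical alternative for illustration only, not what the runtime or the fix in this ticket does; product_of_dim_array, DIMS and example_3_by_4 are invented names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the product of ndims dimension sizes taken from a typed array. */
size_t product_of_dim_array(size_t ndims, const int64_t *dims)
{
    size_t n = 1;
    assert(dims != NULL || ndims == 0);
    for (size_t i = 0; i < ndims; ++i) {
        assert(dims[i] >= 0);
        n *= (size_t)dims[i];
    }
    return n;
}

/* C99 compound literal: each initializer is converted to int64_t by the
   compiler, so a plain 3 is safe here. */
#define DIMS(...) ((const int64_t[]){ __VA_ARGS__ })

/* Example call site: a 3 x 4 array has 12 elements. */
size_t example_3_by_4(void)
{
    return product_of_dim_array(2, DIMS(3, 4));
}
```

The trade-off is that the call sites in generated code change shape, whereas the explicit cast keeps the existing variadic interface.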
comment:21 by , 4 years ago
Do you know all the places in the code generator where this pattern is used?
follow-up: 27 comment:23 by , 4 years ago
Great! This was actually a quite nasty bug...
#5799 is probably due to the same issue, the behaviour is exactly the same. We'll check when your PR is merged in.
comment:24 by , 4 years ago
Owner: | changed from | to
---|
comment:26 by , 4 years ago
Checked after merging PR 7207 on v1.18.0-dev-74-gc28aeab1dd (64-bit) under Windows 10, the simulation ran fine.
@adrpo, if the review of your PR is positive, can you please port it to 1.17.0 and then close this ticket?
comment:27 by , 4 years ago
comment:29 by , 4 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
comment:30 by , 4 years ago
@casella Do I understand correctly that this bug fix has not been backported to v1.16.4 and v1.16.5?
comment:32 by , 4 years ago
@christiankral, the idea of 1.16.2 was to fix a couple of critical issues in 1.16.0 at the end of December, while 1.17.0 was in the works.
Unfortunately the Christmas break, some post-break backlog and some really bad luck meant that the fixes were not really good until 1.16.5 a few days ago. But in any case, 1.16.x was not meant to carry all the fixes that were made in January and February; those are meant to be part of 1.17.0. We kept it just in case we have trouble with 1.17.0-beta, or if someone wants to stick to 1.16 for some reason (e.g. using macOS).
You can try 1.17.0-dev.beta2, which is already out. It includes a lot of improvements and some critical fixes from 1.18.0-dev (including this one) and is expected to be far superior to 1.16.5 on many fronts.