Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#3804 closed defect (invalid)

Is there something wrong with UTF-8?

Reported by: massimo ceraolo
Owned by: Adeel Asghar
Priority: normal
Milestone:
Component: OMEdit
Version: v1.9.4-v1.9.x
Keywords:
Cc:

Description

I enclose two files:
1) TestUTF.mo. Contains characters outside Latin-1. When saved by OM, it is interpreted correctly as UTF-8 by Dymola and my text editor (SuperEdi).
2) UTFPkg.mo. This too contains characters outside Latin-1. When this file is saved by OM, it is NOT interpreted as UTF-8 by Dymola and SuperEdi.

What is the difference between the two?
Can something be done to make OM output fully compatible with Dymola? I.e., make OM create files that are recognised as UTF-8 by Dymola and other text editors?

Attachments (5)

TestUTF.mo (138 bytes ) - added by massimo ceraolo 9 years ago.
UTFPkg.mo (1.8 KB ) - added by massimo ceraolo 9 years ago.
TestUTfByDym.mo (158 bytes ) - added by massimo ceraolo 9 years ago.
TestUTFByOM.mo (146 bytes ) - added by massimo ceraolo 9 years ago.
UTF report.pdf (200.4 KB ) - added by massimo ceraolo 9 years ago.


Change History (29)

by massimo ceraolo, 9 years ago

Attachment: TestUTF.mo added

by massimo ceraolo, 9 years ago

Attachment: UTFPkg.mo added

comment:1 by massimo ceraolo, 9 years ago

Summary: Is there something wronng with TF-8? → Is there something wrong with UTF-8?

comment:2 by Martin Sjölund, 9 years ago

I tested both on Linux and OMEdit does not change any of the characters (adding a space, saving, removing the space and re-saving ends up in a file identical to the previous file). My installed version of Dymola (Dymola 2016, Linux) handles neither of the attached files. Does the Windows version of Dymola open the attached files properly?

comment:3 by Martin Sjölund, 9 years ago

Oh, and could you attach the file that you get after OMEdit saves it?

comment:4 by massimo ceraolo, 9 years ago

I made all my tests under Windows. I will send more details in the afternoon, because I want to send them exactly from the same computer and programs that I've used this morning (from my work place, now I'm at home).

BTW, I received this message this morning from Modelon:

The default behavior of Dymola is to only use UTF-8 when needed. In Dymola 2016 FD01 a flag was introduced to enforce UTF-8, I have attached the text from the release notes below.
Using this flag the customer model should be stored in UTF-8.
Improved handling of UniCode (UTF-8) versus iso-latin-1 coding

A flag Advanced.PreferWritingLatin1 is now available to set what coding to use. The
default value of the flag is true, meaning that iso-latin-1 will be preferred instead of UTF-8
when writing files.
However, if any non iso-latin-1 character is present, UTF-8 will always be used
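
The rule described in these release notes can be sketched in Python as follows. This is only an illustration of the quoted behaviour, not Dymola's code: the function name `choose_encoding` and its `prefer_latin1` parameter are hypothetical stand-ins for the Advanced.PreferWritingLatin1 flag.

```python
def choose_encoding(text: str, prefer_latin1: bool = True) -> str:
    """Pick the encoding a writer following the quoted rule would use.

    prefer_latin1 is a hypothetical stand-in for Advanced.PreferWritingLatin1.
    """
    if prefer_latin1:
        try:
            # Succeeds only if every character fits in iso-latin-1.
            text.encode("latin-1")
            return "latin-1"
        except UnicodeEncodeError:
            # At least one character is outside iso-latin-1: fall back to UTF-8.
            pass
    return "utf-8"
```

Under this reading, a model containing only characters such as "ö" would be written as iso-latin-1 by default, while one containing, say, a Greek letter would be written as UTF-8 regardless of the flag.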

comment:5 by massimo ceraolo, 9 years ago

Well, Martin, now I get results similar to yours.

In the enclosed UTF report.pdf file I've put several screenshots with the corresponding legends.
All my tests were made in a Win 10 system using Dymola 2016 FD01 (64 bit) and OM 1.9.5-dev20

About the enclosed files:

  • the file TestUTfByDym.mo was saved by Dymola and then left unchanged.
  • the file TestUTFByOM.mo was saved by OM and then left unchanged.

Now it seems clear that something goes wrong in the OM UTF-8 encoding.
At least both Dymola and SuperEdi (that is advertised as a "unicode editor") believe so.

Last edited 9 years ago by massimo ceraolo

by massimo ceraolo, 9 years ago

Attachment: TestUTfByDym.mo added

by massimo ceraolo, 9 years ago

Attachment: TestUTFByOM.mo added

by massimo ceraolo, 9 years ago

Attachment: UTF report.pdf added

comment:6 by Martin Sjölund, 9 years ago

The difference between the three files is:

  1. Original file uses LF for line endings
  2. Dymola uses CRLF for line endings and uses a UTF-8 byte-order mark
  3. OMEdit uses LF for line endings

All of the files are valid UTF-8. OMEdit strips the byte-order mark when saving, since putting one in UTF-8 files is not recommended (UTF-8 does not need a byte-order mark, unlike UTF-16, whose encoding differs depending on endianness). Since this is all OMEdit does, I can only conclude that the Dymola implementation of UTF-8 reading is broken.
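
A comparison like the one above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the files are read as raw bytes; the helper name `describe` is hypothetical.

```python
import codecs


def describe(data: bytes) -> dict:
    """Report BOM presence, line-ending counts, and UTF-8 validity of raw file bytes."""
    has_bom = data.startswith(codecs.BOM_UTF8)  # the bytes EF BB BF
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # LF not preceded by CR
    try:
        data.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        valid = False
    return {"bom": has_bom, "crlf": crlf, "lf_only": lf_only, "valid_utf8": valid}
```

For example, `describe(open("TestUTFByOM.mo", "rb").read())` and the same call on TestUTfByDym.mo should differ only in the `bom`, `crlf`, and `lf_only` fields if the analysis above holds.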

comment:7 by Martin Sjölund, 9 years ago

Milestone: 1.10.0
Resolution: invalid
Status: new → closed

I'll close this as invalid as it is a problem with another tool.

comment:8 by massimo ceraolo, 9 years ago

BTW,
I could not find any statement about line endings in Modelica specifications.
Maybe I'm wrong. Do you know whether in some place it is specified that Unix-style (only LF) and DOS-style (CRLF) are both correct under both operating systems?

comment:9 by Martin Sjölund, 9 years ago

Spec says:

The whitespace characters are space, tabulator, and line separators (carriage return and line feed)

comment:10 by Adeel Asghar, 9 years ago (in reply to: 9)

Replying to sjoelund.se:

Spec says:

The whitespace characters are space, tabulator, and line separators (carriage return and line feed)

We are only doing line feed which is against the specs then.

comment:11 by massimo ceraolo, 9 years ago

Yeah,
we never learn enough from the specs!

I think that if switching to saving with CRLF makes our .mo files Dymola compliant (and possibly compliant with all tools), it is worthwhile doing so.
First of all we must be sure that this would solve the observed problem. I will make some checks myself.

Last edited 9 years ago by massimo ceraolo

comment:12 by Martin Sjölund, 9 years ago

No, it says line separators are whitespace. This means CR or LF. Which means LFCR and CRLF are both allowed. Or only one of them.
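
To illustrate this reading, here is a naive Python sketch of a splitter that treats space, tab, CR, and LF as interchangeable whitespace. It is not a Modelica lexer, and the name `tokens` is hypothetical; it only shows that the line-ending style does not change what a tool sees.

```python
import re

# Whitespace per the quoted sentence: space, tabulator, carriage return, line feed.
WHITESPACE = re.compile(r"[ \t\r\n]+")


def tokens(source):
    """Split source text on runs of whitespace, dropping empty pieces."""
    return [t for t in WHITESPACE.split(source) if t]
```

With this definition, `tokens("model A\nend A;")`, `tokens("model A\r\nend A;")`, and `tokens("model A\rend A;")` all give the same list.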

comment:13 by Christoph <buchner@…>, 9 years ago

I also recently discovered that OM stores files with non-native line endings (i.e. LF on Windows, instead of the native CRLF). This is somewhat annoying (e.g. when using version control) if you also edit/create mo files with other editors, which typically write platform-native line endings. It would be great if OM would follow the platform convention (LF on non-windows, CRLF on windows).

comment:14 by Martin Sjölund, 9 years ago

OM does follow the convention on all its platforms: the "Windows" platform is MinGW - a Unix platform which of course uses Unix line endings. If you use git, it is trivial to set it to store files as LF (and of course, to tell the repository to normalize line endings; setting .mo-files to text mode is essential).

comment:15 by Christoph <buchner@…>, 9 years ago

OM does follow the convention on all its platforms: the "Windows" platform is MinGW - a Unix platform which of course uses Unix line endings.

Ah I was not aware of this.

If you use git, it is trivial to set it to store files as LF (and of course, to tell the repository to normalize line endings; setting .mo-files to text mode is essential).

This is not the whole story, though - if you use .gitattributes to do this, a simple

* text=auto

should normally be sufficient (and is sufficient for files created with other editors). If you want to be sure that Modelica files get recognized as text, you can also add

*.mo text
*.mos text
*.order text

but this does not work as expected, because OM creates LF files, which Git recognizes as not being native (and warns you about that). You have to set

# Use LF as eol to keep things uniform (OMEdit creates LF files)
*.mo text eol=lf
*.mos text eol=lf
*.order text eol=lf

for the warnings to not appear, and then instruct all your other editors to *also* create LF files (contrary to Windows convention).
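
For files written by an editor that cannot be switched to LF, the conversion could also be done with a small script. The following Python sketch is one possible approach, not something OM or Git provides; the helper names `to_lf` and `normalize_tree` are hypothetical, and lone CR characters are deliberately left untouched.

```python
from pathlib import Path


def to_lf(data: bytes) -> bytes:
    """Convert CRLF line endings to LF; other bytes are left as they are."""
    return data.replace(b"\r\n", b"\n")


def normalize_tree(root: str, patterns=("*.mo", "*.mos", "*.order")) -> int:
    """Rewrite matching files under root with LF endings; return how many changed."""
    changed = 0
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            original = path.read_bytes()
            converted = to_lf(original)
            if converted != original:
                path.write_bytes(converted)
                changed += 1
    return changed
```

Working on bytes rather than decoded text means the script does not need to know or change the file's character encoding.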

comment:16 by Martin Sjölund, 9 years ago

You can use *.mo text eol=native as well. Or change the per-repository crlf settings, etc. Git knows how to deal with these things even if you get warnings when the line endings change.

comment:17 by Christoph <buchner@…>, 9 years ago

You can use *.mo text eol=native as well.

If I do that, and create a new model with OM, and stage the new file, Git warns me with

Updating the Git index failed.  A rescan will be automatically started to resynchronize git-gui.

warning: LF will be replaced by CRLF in MH_FC_system/CRLF_test.mo.
The file will have its original line endings in your working directory.

Or change the per-repository crlf settings, etc.

Setting this via .gitattributes as described above is a per-repo setting.

comment:18 by Martin Sjölund, 9 years ago (in reply to: 17)

Replying to Christoph <buchner@…>:

Or change the per-repository crlf settings, etc.

Setting this via .gitattributes as described above is a per-repo setting.

That affects everyone else checking out the same code (who might be working on CRLF).

If I do that, and create a new model with OM, and stage the new file, Git warns me with

But it works as expected.

comment:19 by Christoph <buchner@…>, 9 years ago

That affects everyone else checking out the same code (who might be working on CRLF).

Sure, that's the idea -- to have harmonious line-ending handling even if people using different OSes work together on a repo.
I previously handled the transition of a big-ish FOSS repo from the config hell of having to rely on individual devs (Win, Mac, Linux) to *all* have the respective correct setting in git config autocrlf, to having this specified in the repo, and it solved nearly all line-ending related problems that kept cropping up regularly. * text=auto was the correct (and sufficient for most files) setting in .gitattributes.

But it works as expected.

All right, if you consider source control issuing a warning for every newly added file as "works as expected", I don't have any arguments left to say.

comment:20 by massimo ceraolo, 9 years ago

I've checked.
If OM added the UTF-8 BOM (EF BB BF) as the first three bytes, then the resulting files are correctly read by Dymola!

I think it is highly advisable to add them, since we would gain good compatibility with very little effort.
At least the option to add them should be given (a checkbox in OMEdit, which I recommend be checked by default).
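
The check described here can be reproduced outside OMEdit with a short Python sketch. It assumes the file is handled as raw bytes; the helper name `with_bom` is hypothetical. The UTF-8 BOM is the three-byte sequence EF BB BF, available in Python as `codecs.BOM_UTF8`.

```python
import codecs


def with_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM (EF BB BF) unless the data already starts with it."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data
```

A file saved by OM could be rewritten with `path.write_bytes(with_bom(path.read_bytes()))` before opening it in Dymola. Readers that do not want the mark can decode with Python's "utf-8-sig" codec, which drops a leading BOM if present.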

comment:21 by Adrian Pop, 9 years ago

@Christoph: I opened new enhancement ticket #3807 for the LF vs CRLF discussion. I guess this could be a setting in OMEdit.

Personally I prefer LF even in Windows but yes, there are always issues with the line endings and version control so anything we can do to make life easier for modelers we should investigate.

comment:22 by Christoph <buchner@…>, 9 years ago

@adrpo: thank you!

comment:23 by massimo ceraolo, 9 years ago

Well, I will take up the BOM issue again in another ticket that I'm going to open.

comment:24 by Martin Sjölund, 9 years ago (in reply to: 20)

Replying to ceraolo:

I've checked.
If OM added the UTF-8 BOM (EF BB BF) as the first three bytes, then the resulting files are correctly read by Dymola!

I think it is highly advisable to add them, since we would gain good compatibility with very little effort.
At least the option to add them should be given (a checkbox in OMEdit, which I recommend be checked by default).

From the almighty wikipedia:

The Unicode Standard permits the BOM in UTF-8 but does not require or recommend its use. Byte order has no meaning in UTF-8 ... The IETF recommends that if a protocol either (a) always uses UTF-8 [like Modelica] or [...]
