Problem and BUG with comparing two different encoded text files

Posted: Sun Sep 27, 2020 2:30 pm
by Mark Ginger
I needed to compare two CSV files that I knew contained differences.
Examdiff Pro (V11.0.1.0) detected the differences correctly. So far so good.

But because of Examdiff's automatic file format recognition, it didn't tell me that the two files are encoded differently: one as ANSI (stored in the wrong format by accident) and the other as UTF-8, which led to an incorrect data import into my database.
Only after switching to binary comparison did I detect the real differences.

Even in directory comparison mode with the "Perform Full File Comparison" option activated, Examdiff treats differently encoded text files as the same when their modification times are equal and they have the same "textual"/"logical" content. Worse still, Examdiff doesn't compare the file sizes, which differ when the files contain non-ANSI characters.
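
To illustrate the size difference mentioned above, here is a minimal Python sketch. It assumes that "ANSI" means Windows-1252 (`cp1252`), and the sample row is made up for illustration, not taken from the actual CSV files.

```python
# Hypothetical sample row; any text with non-ASCII characters behaves the same way.
text = "id;name\n1;Müller\n"

ansi_bytes = text.encode("cp1252")   # assumption: "ANSI" means Windows-1252
utf8_bytes = text.encode("utf-8")

# Decoded with the right encoding, both byte strings yield the same text ...
same_text = ansi_bytes.decode("cp1252") == utf8_bytes.decode("utf-8")
# ... but the raw bytes, and therefore the file sizes, differ.
same_bytes = ansi_bytes == utf8_bytes
size_delta = len(utf8_bytes) - len(ansi_bytes)

print(same_text, same_bytes, size_delta)  # prints: True False 1
```

The single "ü" takes one byte in Windows-1252 and two bytes in UTF-8, so a size check alone would already flag the two files as different.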

Yes, I know Examdiff shows me the file format (the one it thinks the text file has) in the lower right corner of the file pane.
But I normally do not look there; I focus on the popup saying "different" or "not different".

I didn't find any option to disable the automatic file format recognition in normal file comparison mode.
Is there one?

If not, I would like to request a new feature for Examdiff:
a) a check box in the "Text Comparison Options" section, saying "treat differently encoded files as different" and
b) the same logic in directory comparison mode when the "Perform Full File Comparison" option is enabled.

Regards
Mark

Re: Problem and BUG with comparing two different encoded text files

Posted: Mon Sep 28, 2020 9:03 am
by psguru
You can enforce the encoding from the Compare dialog by using the File Open dialog's option at the bottom.

Re: Problem and BUG with comparing two different encoded text files

Posted: Mon Sep 28, 2020 3:11 pm
by Mark Ginger
Yes, I see. But I (at least) don't use this dialog very often, and it implies that I (the user) know that the two files have different encodings before doing the comparison.

For me, the normal use case is either to send one (or both) files to Examdiff via the Explorer context menu, or to type (or paste) the file names directly into the "First" and "Second" fields of the Compare dialog and start the comparison.
There is no option to force the file encoding in these cases. Going back to the file selection dialog and pressing the file chooser button only to select the encoding Examdiff has to use is a really silly, non-intuitive and counterproductive way of comparing files. And again, this assumes that I know beforehand that the two files may differ only because of their encodings (e.g. umlauts).
When Examdiff compares two files and finds no difference (for the sake of simplicity, I assume no ignore options are set), my expectation is that the two files are really identical, not only in their logical content, i.e. that they are binary identical.

I would really like to see at least a warning message from Examdiff when it finds no differences between the two files, like "The two files have identical content, but differ in encoding (which can still lead to different files). Please check by switching to binary comparison.", and/or a menu entry/check box in the toolbar for "compare both files regardless of their file encoding" (which is much the same as a binary comparison, but with more ignore options: line ranges, excluded columns, case sensitivity, ignoring white space or line endings).
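
The three outcomes behind the requested warning could be sketched roughly as follows. This is only an illustration in Python, not how Examdiff works internally; the function name `classify` is hypothetical, and the encodings are assumed to be known already rather than auto-detected.

```python
def classify(first: bytes, second: bytes, enc_first: str, enc_second: str) -> str:
    """Classify two file contents; the encodings are assumed to be known already."""
    if first == second:
        return "identical"                    # binary identical
    if first.decode(enc_first) == second.decode(enc_second):
        return "same text, different bytes"   # the case that would deserve a warning
    return "different"


# Hypothetical sample content with non-ASCII characters.
ansi = "Grüße\n".encode("cp1252")
utf8 = "Grüße\n".encode("utf-8")

print(classify(ansi, ansi, "cp1252", "cp1252"))
print(classify(ansi, utf8, "cp1252", "utf-8"))
print(classify(ansi, "Gruss\n".encode("utf-8"), "cp1252", "utf-8"))
```

Only the middle case is the one discussed here: the text comparison finds nothing, while a binary comparison would.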

Second, your hint doesn't work in directory comparison mode. When I check "Full File Comparison", then I expect two files in two different folders not to be treated as identical when they differ in file encoding.

All these examples are logically the same case as comparing two files (in file comparison mode or directory comparison mode) which differ only in line endings (e.g. LF vs. CRLF). In this case Examdiff always marks the files as different (in file comparison mode when no ignore options are used, and in directory comparison mode always, regardless of the state of the "Full File Comparison" option).

Re: Problem and BUG with comparing two different encoded text files

Posted: Tue Sep 29, 2020 10:14 am
by MSpagni
FWIW I haven't had that problem often, but I agree with Mark.

Re: Problem and BUG with comparing two different encoded text files

Posted: Tue Sep 29, 2020 2:05 pm
by psguru
The proper analogy for file encoding would be other file metadata, such as file size or timestamp. Note that for file comparison, when encodings don't match, the status bar boxes with encoding are colored with the Changed color, denoting the difference. Showing a message during file comparison is akin to showing a message about different timestamps.

For directory comparison, you are right, there's no option to ignore encodings under the "Perform file metadata comparison" section of Directory Compare options, although you could use the "different sizes" option. I suppose we could add an option for ignoring file encodings in the next version.

Re: Problem and BUG with comparing two different encoded text files

Posted: Wed Sep 30, 2020 1:13 am
by Mark Ginger
Hello psguru,

first, thank you for considering a solution for the directory comparison mode.

Second, I agree with you about the colored box for different encodings.
But for me as a really long-term user of Examdiff (since V4.5, I think), this wasn't obvious; in fact, I have always overlooked it.
Most of the time I ignore the status bar in my daily work and rely on the ignore/non-ignore settings I apply during comparison and on Examdiff's answer, given either by a popup or by other visual feedback (colors, bold face in directory comparison).

The behavior/meaning of the (color of the) ninth pane of the status bar is only mentioned in the help file under the section "Status Bars" and in one little line in the build history section on your website for V7.0.0.0. It is not mentioned in the Unicode Support section of the help file, where it could be covered as well. Not really prominent.

I can accept that you treat file encoding as a kind of metadata, but then more like "case" and not like "timestamps".
For both you have ignore options in the settings, but not for file encodings (for me a good place would be under the [Text Compare] Advanced Tab in the "Force text/binary file comparison" or "Misc" section).

Regarding your reply
"Showing a message during file comparison is akin to showing a message about different timestamps."
I don't agree with you.
You (Examdiff) have a lot of messages which can be opted out of, like "Message about identical text files", "Message about identical binary files" and so on. A message like "Message about identical text files, but with different encodings" falls into the same use case, or rather the same user experience. So there is room for introducing such a message and configuration setting.

I know the use case behind my wish is not a very common one. But as a database admin and sometimes app programmer, I'm often confronted with it.
I give out some app UI text for review and get back a corrected version, but with the wrong encoding, only because the reviewer uses a different editor which saves the modified file (by mistake) in its default encoding (i.e. ANSI instead of UTF-8, or UTF-8 instead of Unicode/UTF-16 LE). Or I get a reviewed/corrected/updated file for a database import which is now wrongly encoded, either by the other person or by some other means during file transfer.

Kind regards
Mark

Re: Problem and BUG with comparing two different encoded text files

Posted: Wed Sep 30, 2020 12:10 pm
by JeremyNicoll
psguru said

"The proper analogy for file encoding would be other file metadata, such as file size or timestamp."


I don't see why. File metadata (like its size or timestamp) is information kept outside the file, by the file-system.

Encoding information is inside the file, is it not?
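
On this question: the encoding is recorded inside the file only when the file starts with a byte order mark (BOM). ANSI files and UTF-8 files saved without a BOM carry no such marker, so a tool has to infer the encoding from the bytes. A minimal Python sketch of a BOM check, using the standard `codecs` constants (the helper name `sniff_bom` is hypothetical):

```python
import codecs

# Checked longest-first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


print(sniff_bom(codecs.BOM_UTF8 + "Müller".encode("utf-8")))  # file with a UTF-8 BOM
print(sniff_bom("Müller".encode("utf-8")))                    # BOM-less UTF-8
print(sniff_bom("Müller".encode("cp1252")))                   # "ANSI", assumed cp1252
```

Only the first call finds a marker; for the other two the function returns `None`, and the encoding would have to be guessed from the content.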

Re: Problem and BUG with comparing two different encoded text files

Posted: Wed Sep 30, 2020 1:09 pm
by psguru
The reason I compared file encoding to other metadata (as opposed to "case") is that encoding is a file attribute, not a line/character attribute. It's one-per-file, like size or timestamp.

In any case, this particular scenario is not something that comes up a lot (it's the first time we've been asked about it), and, given that the OP has been a user since version 4.5, it is probably not so common. Adding this feature to EDP feels like bloat, especially since it involves GUI changes (a new message, a new option for directory comparison).