Binary comparison of PDF files

rgr09
New Member
Posts: 5
Joined: Sat Sep 28, 2013 6:20 am

Binary comparison of PDF files

Post by rgr09 » Mon Jan 22, 2018 6:39 am

I am using version 7 of examdiff pro with the plug-ins installed, and when I compare two pdfs what I get is a comparison of the output of pdftotext for each file. If neither file has text, even though one file has 1 page and the other 1000 pages, they are still identified as the same. How can I force a binary comparison of pdf files in ver 7?

A more general question: when is the approximate date for release of EDP version 10? If it is within a year, I will probably upgrade now. Thanks for your regular updates to EDP, it is an indispensable tool.

psguru
Site Admin
Posts: 1759
Joined: Sat May 15, 2004 4:23 pm
Location: California

Re: Binary comparison of PDF files

Post by psguru » Mon Jan 22, 2018 7:46 am

"I am using version 7 of examdiff pro with the plug-ins installed, and when I compare two pdfs what I get is a comparison of the output of pdftotext for each file. If neither file has text, even though one file has 1 page and the other 1000 pages, they are still identified as the same. How can I force a binary comparison of pdf files in ver 7?"

You can do so by disabling the PDF plug-in (or all plug-ins) under Options | Plug-ins.

"A more general question: when is the approximate date for release of EDP version 10? If it is within a year, I will probably upgrade now. Thanks for your regular updates to EDP, it is an indispensable tool."

We expect 10.0 to be in Beta in a couple of months, and released before summer.
psguru
PrestoSoft

JeremyNicoll
Expert Member
Posts: 58
Joined: Sun May 02, 2010 12:00 pm
Location: Edinburgh

Re: Binary comparison of PDF files

Post by JeremyNicoll » Tue Jan 23, 2018 9:13 am

Instead of a binary compare, which is unlikely to tell you anything meaningful, you might be better off simply getting a more useful result from the plugin.

I'm using v9 of EDP but I assume v7 is similar... If you go to the edp \Plugins folder and look in \Xpdf you'll find the pdftotext.exe that does the conversion, plus documentation saying what version it is and where it came from, and what command line arguments it can take. My quick look (at my probably-more-recent version) suggests that none are much use for your problem but I've not tried them out. However some of the documentation says that there are other Xpdf tools as well. I went to: http://www.foolabs.com/xpdf to find out what they are.

Go to the 'Download' tab and pick the 'Xpdf tools'. Unpack the zip somewhere and you will find a set of tools, including the most up-to-date pdftotext.exe. In particular there's one called pdfinfo.exe. I copied that into the same edp \Plugins folder as the existing pdftotext.exe, though I'm not sure that's all that sensible since it'll get lost when you update edp. It'd be better to put it elsewhere on your system according to however you structure your files.


Then I experimented a bit, and wrote a small .bat file. I added a new plugin definition so that when pdf files are being compared my batch file is called instead of the pdftotext.exe program. My plugin definition read as follows:

plug-in: PDF experiments
name filter *.pdf
application JNPDFBAT.BAT
arguments $INPUTFILEPATH $OUTPUTFILEPATH

I did also tick the option 'use exit code' and set success to 0, however in the batch file itself I didn't bother to do any returncode checking.

I created a batch file: JNPDFBAT.bat in a folder that's on my system's PATH.

Inside the batch file I ended up with these lines (between the dashes):


-------------------------------------------------------------------------------------------
@echo off
rem Experimenting with the PDFtoTEXT plugin used in EDP, which produces no output if files
rem only contain images, thus making two such files appear identical even when not.
rem Try to put file-info in text file before output from pdftotext:
rem
rem args are pdffilepath and outputfilepath

echo Meta-data reported by 'pdfinfo.exe' is: >> %2
echo. >> %2
"C:\Program Files\~P-folder\PrestoSoft\ExamDiff Pro\Plug-Ins\Xpdf\pdfinfo.exe" -meta %1 >> %2
echo. >> %2
echo. >> %2

"C:\Program Files\~P-folder\PrestoSoft\ExamDiff Pro\Plug-Ins\Xpdf\pdftotext.exe" -table %1 %TEMP%\PDFtoTEXT.txt

echo Text extracted by 'pdftotext.exe' is: >> %2
echo. >> %2
type %TEMP%\PDFtoTEXT.txt >> %2
echo. >> %2

exit /b 0
-------------------------------------------------------------------------------------------



The batch file is called by EDP, with two arguments - the full paths of a PDF file and that of a temporary file. The first thing my code does:

echo Meta-data reported by 'pdfinfo.exe' is: >> %2
echo. >> %2

is write "Meta-data reported by 'pdfinfo.exe' is:" and a blank line into the second file. It then calls 'pdfinfo' requesting that the PDF file's metadata be written out. That'd normally go to the terminal so is redirected to the second file. Then I write a couple more blank lines.

The pdftotext command unfortunately seems only able to be called with the name of a text file. If I call it and give it the name of the file I've just put other stuff into, the new output will overwrite what's there already. Instead I tell it to put pdftotext's output into: %TEMP%\PDFtoTEXT.txt. (This is not a good solution; a better .bat file would take care to find the name of an unused temporary file at this point, but I'm only showing you what's possible. If you can code in anything else, use it instead of .bat, I'd say, which might make it easier to do something robust.)

After that I write "Text extracted by 'pdftotext.exe' is:" and another blank line to the second file, then 'type' whatever the output from pdftotext was, redirecting that so it is also appended to the temporary file, then follow that with another blank line.

The result should be that each pdf file is now represented by a file containing information about the pdf file followed by whatever pdftotext.exe produces.
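
As a rough illustration of the "use something other than .bat" suggestion above, here is a minimal Python sketch of the same wrapper. It asks the system for a unique temporary file via tempfile.mkstemp instead of reusing a fixed %TEMP%\PDFtoTEXT.txt. It assumes Python 3.7 or later is installed and that pdfinfo and pdftotext can be found on PATH (otherwise put the full paths in the two constants). The function names are made up for illustration, and the script has not been tried as an EDP plug-in.

```python
import os
import subprocess
import sys
import tempfile

# Assumption: the Xpdf tools are on PATH; otherwise use full paths here.
PDFINFO = "pdfinfo"
PDFTOTEXT = "pdftotext"


def build_report(meta: bytes, text: bytes) -> bytes:
    """Lay out the two sections in the same order as the batch file."""
    return b"".join([
        b"Meta-data reported by 'pdfinfo.exe' is:\n\n",
        meta,
        b"\n\n",
        b"Text extracted by 'pdftotext.exe' is:\n\n",
        text,
        b"\n",
    ])


def extract_text(pdf_path: str) -> bytes:
    """Run pdftotext into a uniquely named temp file and return its bytes."""
    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)  # pdftotext reopens the file by name
    try:
        subprocess.run([PDFTOTEXT, "-table", pdf_path, tmp_path], check=True)
        with open(tmp_path, "rb") as fh:
            return fh.read()
    finally:
        os.remove(tmp_path)  # clean up even if pdftotext failed


def main(pdf_path: str, out_path: str) -> int:
    # check=True raises if pdfinfo fails, so the script exits non-zero.
    meta = subprocess.run(
        [PDFINFO, "-meta", pdf_path], capture_output=True, check=True
    ).stdout
    with open(out_path, "wb") as out:
        out.write(build_report(meta, extract_text(pdf_path)))
    return 0


# Only runs when called with exactly two arguments: PDF path, output path.
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Working in bytes throughout avoids having to guess which text encoding pdftotext used. The plug-in definition would presumably need the Python interpreter as the application and the script's path ahead of $INPUTFILEPATH $OUTPUTFILEPATH, but that part is a guess.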

Hope that helps.

JeremyNicoll
Expert Member
Posts: 58
Joined: Sun May 02, 2010 12:00 pm
Location: Edinburgh

Re: Binary comparison of PDF files

Post by JeremyNicoll » Tue Jan 23, 2018 9:18 am

A screenshot of such a compare:
20180123 EDP PDF compare.jpg

psguru
Site Admin
Posts: 1759
Joined: Sat May 15, 2004 4:23 pm
Location: California

Re: Binary comparison of PDF files

Post by psguru » Tue Jan 23, 2018 9:23 am

This is actually pretty cool. Thanks for posting it.
psguru
PrestoSoft

JeremyNicoll
Expert Member
Posts: 58
Joined: Sun May 02, 2010 12:00 pm
Location: Edinburgh

Re: Binary comparison of PDF files

Post by JeremyNicoll » Tue Jan 23, 2018 9:46 am

Thank you!

MSpagni
Expert Member
Posts: 333
Joined: Mon Mar 30, 2009 12:53 am
Location: Italy

Re: Binary comparison of PDF files

Post by MSpagni » Wed Jan 24, 2018 11:40 am

"Released Xpdf 4.00 and moved the Xpdf web site to www.xpdfreader.com." (2017 Aug 10)

So you should find it here: http://www.xpdfreader.com/download.html

Good hint. I'll think about it, thanks.

JeremyNicoll
Expert Member
Posts: 58
Joined: Sun May 02, 2010 12:00 pm
Location: Edinburgh

Re: Binary comparison of PDF files

Post by JeremyNicoll » Wed Jan 24, 2018 12:15 pm

Hmm. If I enter: http://www.foolabs.com/xpdf/ into the URL bar in Firefox, I end up on the newer website - I'd not noticed that. There must be a redirect defined at the foolabs.com/xpdf address to take one there. If however one goes to http://www.foolabs.com then there's no automatic redirect but there is, as you say, a small sentence saying XPDF has gone elsewhere.

MSpagni
Expert Member
Posts: 333
Joined: Mon Mar 30, 2009 12:53 am
Location: Italy

Re: Binary comparison of PDF files

Post by MSpagni » Tue Jan 30, 2018 1:03 pm

"The pdftotext command unfortunately seems only able to be called with the name of a text file."

Nope. The documentation says:

SYNOPSIS
pdftotext [options] [PDF-file [text-file]]
[...]
If text-file is '-', the text is sent to stdout.

So my batch is simply:

@echo off
"C:\Programmi\ExamDiff Pro\Plug-Ins\Xpdf\pdfinfo.exe" -meta "%1"
echo ---------------------------------------------------
"C:\Programmi\ExamDiff Pro\Plug-Ins\Xpdf\pdftotext.exe" -table "%1" -

N.B. I prefer to omit the -meta option.
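
For anyone scripting this outside of batch, the '-' trick carries over directly. Below is a minimal Python sketch, assuming Python 3.7 or later and that pdfinfo and pdftotext are on PATH; the helper names are hypothetical. Because pdftotext writes to stdout, no temporary file is needed at all.

```python
import subprocess
import sys

# Assumption: the Xpdf tools are on PATH; otherwise use full paths here.
PDFINFO = "pdfinfo"
PDFTOTEXT = "pdftotext"
SEPARATOR = b"-" * 51 + b"\n"  # separator line between the two sections


def pdftotext_args(pdf_path: str) -> list:
    """Command line that makes pdftotext write to stdout ('-' as text-file)."""
    return [PDFTOTEXT, "-table", pdf_path, "-"]


def dump(pdf_path: str, with_info: bool = True) -> bytes:
    """Return pdfinfo output (optional), a separator, then the extracted text."""
    parts = []
    if with_info:
        # No -meta here, matching the preference stated above.
        parts.append(
            subprocess.run([PDFINFO, pdf_path], capture_output=True, check=True).stdout
        )
        parts.append(SEPARATOR)
    parts.append(
        subprocess.run(pdftotext_args(pdf_path), capture_output=True, check=True).stdout
    )
    return b"".join(parts)


# Only runs when called with exactly one argument: the PDF path.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.buffer.write(dump(sys.argv[1]))
```

Like the batch file, this writes everything to stdout, so whatever calls it still has to capture that output.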
