weird behavior explained

Posts: 10
Joined: February 10, 2017 - 14:41

I noticed a strange behavior when updating metadata in Acrobat: sometimes, after the update, the tags were displayed in Explorer, and sometimes the new values were not displayed. I had to reload the PDF and update the metadata again, and then the values from the _previous_ update were displayed in Explorer.

So I was wondering if pdf property extension was using some cached data, and where.

Then I thought of a problem with the DocInfo -> XMP crosswalk (https://www.pdfa.org/pdfa-metadata-xmp-rdf-dublin-core/), wondering how PDFPropertyExtension could deal with these two datasets.

Now, the problem is _how_ Acrobat (in my case, Acrobat X) updates the metadata.

I wrote a script that adds two buttons to the Acrobat toolbar, in order to update the title, author, and keywords with specific values.

Initially, I saw the warning from Thom Parker (although he provided these examples) but decided to use the doc.info object.

And this is the problem.

The JavaScript for Acrobat API Reference V10 says (p. 248) that the doc.info object is R/W. It even provides (p. 249) this example showing how to _set_ metadata with the doc.info object:

Set three authors for this document.
this.info.Authors=["Robat, A. C.", "Obe, A. D.","Torys, D. P."];

Now, to see what happens:

Take a PDF and sanitize it (Protection > Remove Hidden Information > select Metadata, and remove). When the PDF is opened with Notepad++, it can be seen that the XMP object and the info dictionary are no longer there.

Then open the PDF with Acrobat, highlight a word (for example), and save it. When the PDF is opened with Notepad++, the XMP object and the info dictionary are there again (blank).

Then open the PDF with Acrobat, and run the doc.info example provided by the JavaScript for Acrobat API Reference to update the authors metadata with this.info.Authors. Save and close the PDF.

Open the PDF with Notepad++: there are now two XMP objects, and their values are different.

And each time you use the doc.info object to update metadata, Acrobat will "add" a new XMP object to the PDF. And these XMP objects have different values.
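
Instead of searching by eye in Notepad++, the number of XMP objects can be counted with a few lines of script. This is only a minimal sketch: the function name countXmpPackets is made up for illustration, and it assumes the metadata streams are stored uncompressed, so that the "<?xpacket begin=" header of each packet is visible as plain text in the file.

```javascript
// Count XMP packet headers in the raw text of a PDF.
// Assumption: metadata streams are uncompressed, so the
// "<?xpacket begin=" marker appears literally in the source.
function countXmpPackets(pdfSource) {
  const marker = "<?xpacket begin=";
  let count = 0;
  let pos = pdfSource.indexOf(marker);
  while (pos !== -1) {
    count += 1;
    // Continue searching after the marker that was just found.
    pos = pdfSource.indexOf(marker, pos + marker.length);
  }
  return count;
}
```

Outside Acrobat (for example in Node.js), the file could be read with require("fs").readFileSync(path, "latin1") and the resulting string passed to the function.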

So, I am wondering how PDFPropertyExtension deals with many XMP objects.

If I am right, the advice could be: do not trust the JavaScript for Acrobat API Reference and NEVER use the doc.info object to set metadata.

So now I have to modify my scripts and use E4X, hoping that this way the XMP object won't be duplicated in the same PDF with inconsistent values...

Has anyone already noticed this weird behavior?

--

pc

Posts: 10
Joined: February 10, 2017 - 14:41
Re: weird behavior explained

I did some more digging: the problem goes much deeper. Multiple XMP packets in a PDF turn out to be normal behavior (called "incremental save"), but Acrobat itself does very strange things with metadata, depending on what is updated and from where, and this leads to weird discrepancies. Moreover, Acrobat (Acrobat Pro X) is sometimes "lying" about the actual content of the metadata in the PDF. So I begin to realise how painful it must be to write PDFPropertyExtension...

At the moment, what I understand is that there are two ways to store metadata in a PDF: in an XMP packet, and in an "information dictionary" (i.e. one line near the end of the PDF source).

Each time a PDF is saved, Acrobat does indeed perform an incremental save: the existing XMP packet is not modified, but a new one is created. So the PDF keeps the history of the metadata modifications. But the information dictionary is not saved incrementally: as far as I can see, there is always one and only one info dictionary in the PDF.

An important behavior of Acrobat (but you have to dig through hundreds of pages of technical documentation to eventually find it) is that if a PDF is saved with "Save As", the metadata history is cleaned, so only one XMP packet (the last one) and the info dictionary remain in the PDF.
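
A rough way to see whether a file carries incremental saves is to count the "%%EOF" markers in its source, since each incremental save appends a new section that ends with one. The sketch below is only a heuristic under that assumption, and the function name countEofMarkers is hypothetical: linearized files or embedded PDFs can contain extra markers, so the count is an upper bound rather than an exact number of saves.

```javascript
// Heuristic: count "%%EOF" markers in the raw text of a PDF.
// A file rewritten by "Save As" is expected to have fewer markers
// than the same file after several incremental saves.
function countEofMarkers(pdfSource) {
  const matches = pdfSource.match(/%%EOF/g);
  return matches === null ? 0 : matches.length;
}
```

Comparing the count before and after a "Save As" is a quick check that the history was really dropped.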

First, there is a problem when updating metadata with the properties popup window in Acrobat (i.e. without running JavaScript to access the doc.info object, and without using E4X to set an XMP packet). If you take a minimal PDF (e.g. create a PDF from a blank.txt) and sanitize it, there is no XMP packet and no info dictionary in the PDF source. Then you add "ptitre" as title with the properties popup: when the PDF is saved, an XMP packet is created with "ptitre" in the title element of the Dublin Core, and an info dictionary is created with the title "ptitre". And PDFPropertyExtension displays "ptitre", as expected.

Then, if the title "ptitre" is removed with the popup and the PDF saved, "ptitre" is removed from the info dictionary, and a new XMP packet is created with an empty title element. And PDFPropertyExtension no longer displays any title.

Then, if we apply the same procedure to the Author metadata, it produces a similar, appropriate behavior: the "pauteur" created is displayed by PDFPropertyExtension, then blanked when removed with the properties popup.

But if we now add "pkey" as keyword with the properties popup and save the PDF, PDFPropertyExtension displays "pkey". In the PDF source, "pkey" is added in three places: in a newly created subject element of the Dublin Core, in a newly created pdf keywords element, and in the info dictionary.

Now, if we remove "pkey" with the properties popup, PDFPropertyExtension still displays "pkey". Is this a PDFPropertyExtension bug? It does not seem so. This is weird Acrobat behavior: when "pkey" is removed with the popup and the PDF saved, in the PDF source the subject element is removed from the Dublin Core and the pdf keywords element is removed, but "pkey" is actually not removed from the info dictionary.

So it seems that PDFPropertyExtension reads the info dictionary first (knowing that the modification dates of the XMP packet and of the info dictionary are exactly identical).

The question is: would it be possible to have an option in PDFPropertyExtension to take the last XMP packet into account first, instead of the info dictionary content?

Moreover: we sanitize the PDF again. The XMP packet and info dictionary are removed from the source. Then we use the console to run this.info.Author="cauteur"; The properties popup then displays "cauteur" as author. And more: the advanced panel of the popup displays "cauteur" in a creator element of the Dublin Core. If we save the PDF, PDFPropertyExtension does not display "cauteur". Is this a PDFPropertyExtension bug? No. If we look at the PDF source, there is no XMP packet and no info dictionary. In this case, when the PDF is saved, Acrobat does not write to the PDF source the modifications that it deceptively displays in its own user interface. So it seems obvious that nothing can be done in PDFPropertyExtension to circumvent the problem.

Then we sanitize again, then add "ptitre" as title and "pauteur" as author with the popup, and save the PDF. PDFPropertyExtension displays "ptitre" and "pauteur". We save the PDF with "Save As". In the PDF source, the creator element contains "pauteur" and the title element contains "ptitre", but the info dictionary is removed. PDFPropertyExtension still displays "pauteur" and "ptitre": that is, it is taking the XMP into account (since the info dictionary does not exist anymore). Then we run this.info.Keywords="ckey"; in the console and save the PDF. PDFPropertyExtension displays "ckey". In the PDF source, the last XMP packet contains a pdf keywords element with "ckey", but no Dublin Core subject element, and there is no info dictionary. Again, PDFPropertyExtension reads the XMP packet since there is no info dictionary.

Then we open the properties popup, which displays "ckey", and replace this value with "pkey", then save. PDFPropertyExtension displays pkey, pauteur, and ptitre. In the PDF source, a subject element with "pkey" is added to the Dublin Core, the pdf keywords element now contains "pkey", and an info dictionary is created with pauteur, ptitre, and pkey. Then we run this.info.Keywords="ckey"; in the console, and save. PDFPropertyExtension displays "ckey" and does not take into account the "pkey" value which was set with the properties popup. In the PDF source, the info dictionary and the pdf keywords element contain "ckey", but the subject element in the Dublin Core contains "pkey". The popup displays ckey;pkey in the keywords field.

And again Acrobat is lying: the advanced panel displays a Dublin Core subject element with [1] ckey and [2] pkey, and a pdf keywords element set to ckey;pkey. Actually, what the PDF source contains is a subject element with only "pkey" and a pdf keywords element with only "ckey". The info dictionary contains "ckey". PDFPropertyExtension only displays "ckey".

Here again, it would be cool if PDFPropertyExtension could take the last XMP packet into account first, rather than the info dictionary.

There are actually some more problems, especially when adding multiple keywords with JavaScript (if what is needed is really separate keywords, and not a single string within double quotes that contains pseudo-keywords separated by commas or semicolons).

I'm digging into this issue too...

Posts: 10
Joined: February 10, 2017 - 14:41
Re: weird behavior explained

As promised, more on the keywords problems:

---------------

First, an erratum: I double-checked this behavior. When a PDF is saved with "Save As", only the last XMP packet remains, and the info dictionary is actually deleted. Moreover, there can indeed be multiple incrementally saved info dictionaries (maybe some more digging is needed there).

---------------

A summary of the metadata update behavior in Acrobat X:

1 creating a title "ptitre" with the properties popup sets "ptitre" in the dc title of the XMP packet, and in the Title() of the info dictionary.

2 creating an author "pauteur" with the properties popup sets "pauteur" in the dc creator of the XMP packet, and in the Author() of the info dictionary.

3 creating a title "ctitre" by running this.info.Title sets "ctitre" in the dc title of the XMP packet, and in the Title() of the info dictionary.

4 creating an author "cauteur" by running this.info.Author sets "cauteur" in the dc creator of the XMP packet, and in the Author() of the info dictionary.

5 creating a pair of keywords pk1;pk2 with the properties popup creates (if needed) a dc subject with two li, pk1 and pk2; creates a pdf keywords pk1;pk2; and, in the info dictionary, sets Keywords() to pk1;pk2.

At this point PDFPropertyExtension displays pk1;pk2.

6 creating a pair of keywords ck1;ck2 by running this.info.Keywords sets pdf keywords to ck1;ck2 in the XMP packet, but does not update the two dc subject li. In the info dictionary, it sets Keywords() to ck1;ck2.

At this point, there are indeed 4 keywords: pk1 and pk2 in the dc subject, and ck1;ck2 in pdf keywords and in Keywords() in the info dictionary. PDFPropertyExtension displays ck1;ck2. It is assumed that PDFPropertyExtension takes its data from the info dictionary.

The properties displayed in the popup are title: ptitre, author: pauteur, and keywords: "ck1;ck2"; pk1; pk2. The XMP shown in the advanced panel displays "ck1;ck2"; pk1; pk2 as pdf keywords and, in the Dublin Core, 3 li: [1] ck1;ck2, [2] pk1, and [3] pk2. Once again Acrobat is lying: the actual content of the PDF source, seen e.g. with Notepad++, is only two li in dc subject (pk1 and pk2) and only ck1;ck2 in pdf keywords (moreover, there are no double quotes around ck1;ck2). (We also note that in the info dictionary Keywords() is set to ck1;ck2 without double quotes.)
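
To compare what is really in the file with what the popup shows, the three keyword locations can be pulled out of the source with a short script. This is a sketch under several assumptions: the function name readKeywordSources is made up, the file is read as plain text, the streams are uncompressed and unencrypted, the XMP values are written as elements (not attributes), and the Keywords() string contains no nested parentheses.

```javascript
// Sketch: read the keywords from the three places discussed above.
// Assumptions: uncompressed, unencrypted source; XMP values written
// as elements; Info string without nested parentheses.
function readKeywordSources(pdfSource) {
  // Keep only the last XMP packet, i.e. the one written by the latest save.
  const start = pdfSource.lastIndexOf("<?xpacket begin=");
  let packet = "";
  if (start !== -1) {
    const end = pdfSource.indexOf("<?xpacket end", start);
    packet = end === -1 ? pdfSource.slice(start) : pdfSource.slice(start, end);
  }

  // Dublin Core subject: one rdf:li per keyword.
  const subject = packet.match(/<dc:subject>([\s\S]*?)<\/dc:subject>/);
  const dcSubject = subject
    ? Array.from(subject[1].matchAll(/<rdf:li[^>]*>([\s\S]*?)<\/rdf:li>/g), (m) => m[1])
    : [];

  // pdf keywords element: a single string.
  const pdfKw = packet.match(/<pdf:Keywords>([\s\S]*?)<\/pdf:Keywords>/);

  // Info dictionary: take the last /Keywords(...) entry found in the file.
  const infoAll = Array.from(pdfSource.matchAll(/\/Keywords\s*\(((?:\\.|[^\\()])*)\)/g));
  const infoKw = infoAll.length > 0 ? infoAll[infoAll.length - 1][1] : null;

  return {
    dcSubject: dcSubject,
    pdfKeywords: pdfKw ? pdfKw[1] : null,
    infoKeywords: infoKw,
  };
}
```

If the three values differ, the discrepancy is in the file itself and not in whatever tool displays the metadata.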

At this point we see that Thom Parker was wrong, because he was misled by the deceitful Acrobat properties popup, and trusted the advanced panel instead of checking the real PDF content with an external low-level editor. Indeed it seems that there is no double quotes problem: these double quotes exist only in the Acrobat properties popup, they do not exist in the PDF.

This means that we can actually use the doc.info object to set the metadata: the only problem is that if we only do that for keywords, there will be discrepancies since the dc subject li could be different. From the PDFPropertyExtension point of view, it seems that this would not be a problem, since PDFPropertyExtension (if needed*, that is, if the info dictionary was deleted by a "Save As") apparently does not read the dc subject, but takes its data from pdf keywords.

But to be clean, if we set the keywords with the doc.info object, we should also set the same values in dc subject with E4X.
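
To keep both places in step, one option is to derive the dc subject items from the very same string that is assigned to this.info.Keywords. The sketch below only prepares the data: splitKeywords and buildDcSubject are hypothetical helper names, the semicolon separator is assumed from the examples above, and merging the resulting fragment into the document's XMP (with E4X or otherwise) is not shown.

```javascript
// Split a keywords string such as "ck1;ck2" into separate keywords.
// Assumption: semicolons are the separator, as in the examples above.
function splitKeywords(value) {
  return value
    .split(";")
    .map((k) => k.trim())
    .filter((k) => k.length > 0);
}

// Escape the characters that are not allowed as-is in XML text.
function escapeXml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Build a dc subject fragment with one li per keyword.
function buildDcSubject(keywords) {
  const items = keywords
    .map((k) => "<rdf:li>" + escapeXml(k) + "</rdf:li>")
    .join("");
  return "<dc:subject><rdf:Bag>" + items + "</rdf:Bag></dc:subject>";
}
```

With these helpers, the string given to this.info.Keywords and the list passed to buildDcSubject come from one source, so the two cannot drift apart.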

*If we then save the PDF with "Save As", the info dictionary is deleted, but PDFPropertyExtension still displays ck1;ck2; that is, it reads its data from the pdf keywords in the XMP packet (and not from the dc subject).

Now, there are still more problems when trying to clean keywords:

If we take the PDF in state 6 above (that is, the info dictionary has not been deleted by a "Save As") and delete "ck1;ck2"; pk1; pk2 in the properties popup, PDFPropertyExtension still displays ck1;ck2. Is this a PDFPropertyExtension bug? No. At this point, the properties popup displays an empty keywords field (and no more keywords in the advanced panel). But what actually happens in the PDF source is that the dc subject and pdf keywords are deleted from the last XMP packet, while the Keywords() in the info dictionary is still set to ck1;ck2. The Acrobat properties popup does not clean the info dictionary keywords, which is another weird Acrobat behavior since, on the other hand, it does set that value when asked.

Now, to clean this, we try to run this.info.Keywords=""; this does not do anything. But if we run this.info.Keywords=" "; (i.e. with a space between the double quotes), then PDFPropertyExtension does not display ck1;ck2 anymore. If we check the PDF source, we see that the Keywords() in the info dictionary has been set to ( ) (that is, the value is a space; it is not empty).

If there is an existing pdf keywords value in the last XMP packet and we run this.info.Keywords=" "; the Keywords() in the info dictionary is set to ( ) (with a space), and in the XMP packet the pdf keywords element is not deleted and contains a space.

So some more digging is needed there too...

Posts: 10
Joined: February 10, 2017 - 14:41
Re: weird behavior explained

One last post:

There is a way to clean the Keywords() of the information dictionary: instead of running this.info.Keywords="" , which has no effect, or running this.info.Keywords=" ", which sets a space in the Keywords() of the information dictionary and a space in the pdf keywords in the XMP packet, it is possible to run:

var bleachIt=""; this.info.Keywords= bleachIt;

This way, the space value is removed from the Keywords() of the information dictionary, and from inside the pdf keywords element of the XMP packet. Thus, we have two ways to remove a keyword that Acrobat cannot clean with the properties popup (the other one being to delete the information dictionary by saving the PDF with "Save As").

At this point, I now think that PDFPropertyExtension should stay as it is, that is, taking the information dictionary into account first, and then the XMP packet (if the information dictionary has been wiped). The main reason I was asking for PDFPropertyExtension to take the XMP packet into account first was precisely the keywords double quotes problem... which turned out not to be a problem at all, as we have seen that these double quotes do not exist at all in the PDF source, and are only weirdly displayed by the Acrobat properties popup.

I am not a specialist, my job is absolutely not software programming, so it took me some time to find the appropriate documentation, and to understand what was going on, and why I was seeing such weird behaviors. I wish I could have found the above explanations somewhere on the internet, or at least in the Acrobat developer guide... This was not the case, and knowing the paramount importance of keywords, especially for scholars, this is beyond stunning...

Maybe if all this can be reported to Adobe support at the correct level (e.g. 3...) it could be useful... Although I didn't try the latest Acrobat version (and with what I've seen, I will certainly not give one euro to Adobe in order to check that).

Hope this helps...

--

pc

 

Posts: 1121
Joined: March 25, 2012 - 01:19
Re: weird behavior explained
komelensoso wrote:
I am not a specialist, my job is absolutely not software programming, so it took me some time to find the appropriate documentation

Well, this is a ton of information for a non-technical user; you're the perfect kind of user for a developer.
Way better than the ones reporting "It doesn't work..." ;).

komelensoso wrote:
The main reason I was asking for PDFPropertyExtension to take the XMP packet into account first was precisely the keywords double quotes problem... which turned out not to be a problem at all, as we have seen that these double quotes do not exist at all in the PDF source, and are only weirdly displayed by the Acrobat properties popup.

These are the steps PDFPropertyExtension follows to retrieve document info:

  1. load all of the XRef tables, starting from "startxref" offset (at the end of file) and following PrevXRef offsets contained in each XRef table
  2. if Root object number is found in XRef table, load it and search for a "Metadata" key that should point to an Info dictionary
  3. if the key exists, check if that Info dictionary also exists (a lot of bad PDFs have wrong or expired pointers...), load and parse it; the Info dictionary could be an XML fragment or a plain dictionary
  4. after the Info dictionary has been parsed check if it contains a link to a previous version (incremental updates, such a bad thing) and, if yes, load it and jump back to step 3) but parse only new data:
    if Author data was already found, the Author present in a previous dictionary is supposed to be older and outdated
  5. after the last Info dictionary has been parsed, get back to 2) and search the XRef table for an Info object: it was the object containing references to Info dictionaries before the introduction of Root one

So there's no real "precedence" in reading XMP over plain info, but looking for the Root object first and then for the Info object makes XML parsing happen before plain parsing, because newer documents containing a Root object are also supposed to contain an XMP info dictionary.
That said, the two info dictionaries should be aligned and should not contain different info.
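
The effect of steps 3 to 5 ("parse only new data") can be pictured as a merge in which the first value found for a field wins. The sketch below is not PDFPropertyExtension's code, only an illustration of that rule: mergeDocumentInfo is a made-up name, each source is assumed to be an already parsed set of fields, and any field that is present (even with an empty value) is assumed to count as "found".

```javascript
// Merge metadata sources given in reading order (newest XMP first,
// older XMP versions next, plain Info dictionary last).
// A field already found in an earlier source is never overwritten.
function mergeDocumentInfo(sourcesInReadingOrder) {
  const merged = {};
  for (const source of sourcesInReadingOrder) {
    for (const [field, value] of Object.entries(source)) {
      if (!(field in merged)) {
        merged[field] = value;
      }
    }
  }
  return merged;
}
```

Under this rule a field missing from every XMP version is still filled in from the plain Info dictionary, which is one way a stale Keywords() value could remain visible.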

komelensoso wrote:
I did some more digging: the problem goes much deeper. Multiple XMP packets in a PDF turn out to be normal behavior (called "incremental save"), but Acrobat itself does very strange things with metadata, depending on what is updated and from where, and this leads to weird discrepancies. Moreover, Acrobat (Acrobat Pro X) is sometimes "lying" about the actual content of the metadata in the PDF. So I begin to realise how painful it must be to write PDFPropertyExtension...

PDF parsing is such a pain because the PDF format is more than 25 years old, so it has had a lot of changes and additions leading to hard-to-read specifications.
PDFPropertyExtension doesn't use any external PDF parsing library, because of size and speed concerns: most of them contain a lot of features for content parsing and rendering that are completely useless here; that's why I started writing a custom parser, both for efficiency and... to learn something new ;)

Incremental saving had a reason to exist years ago, when PCs were not so powerful and RAM cost so much; nowadays there should be no reason to use it.
Anyway, maybe Acrobat saves incrementally during a normal work session and has a feature such as "Finalize document" that only writes a single, final version of the XMP table.
I don't know Acrobat, I'm just thinking out loud...

komelensoso wrote:
So I was wondering if pdf property extension was using some cached data, and where.

PDFPropertyExtension is only a "metadata" provider for Windows Explorer, which caches the received information as it likes.
It must cache it, otherwise metadata providers would be huge CPU hogs.
Think about a folder full of MP3, PDF, ODT, DOC, XLS: each of them has its own provider and some files are much harder to parse (ODT files must be uncompressed to extract info).
Things get more complicated if you open a network share with Windows Explorer.

So yes, caching exists but it's out of PDFPropertyExtension's control. I can only decide to enable it or not. I already tried disabling it, and it became sooo slow.

komelensoso wrote:
Once again Acrobat is lying: the actual content of the PDF source, seen e.g. with Notepad++, is only two li in dc subject (pk1 and pk2) and only ck1;ck2 in pdf keywords (moreover, there are no double quotes around ck1;ck2). (We also note that in the info dictionary Keywords() is set to ck1;ck2 without double quotes.)

Well, you should investigate this with Acrobat and/or dig into its configuration options.
Again, I don't know Acrobat at all, but if it does something weird I don't think PDFPropertyExtension should work around it.
