Encoding problem after "save as"

Posts: 10
Joined: February 10, 2017 - 14:41
Encoding problem after "save as"

Hi,

I have a problem with some metadata containing Swedish or French names: they are correct once updated with Acrobat, but if I save the PDF with "Save As" (which seems to remove the Info dictionary), the encoding is broken in Explorer.

E.g.: if I update the title with the value "accentué" and save the PDF, Explorer displays "accentué", but it displays "accentué" after a "Save As". In the PDF source, viewed with Hex Editor Neo, the hex value of the "é" is e9 in the Info dictionary, and c3 a9 in the XMP packet (which seems to be the right UTF-8 value in such a packet?).
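The byte values quoted above can be reproduced with a short Python sketch. It assumes the Info dictionary string uses a Latin-1-compatible single-byte encoding for "é", and that the broken display comes from UTF-8 bytes being read with a Windows single-byte code page (cp1252 is used here as an illustration, not as a confirmed diagnosis).

```python
# The sample title used in the post above.
text = "accentué"

# Single-byte encoding: "é" is stored as the one byte e9.
latin1_bytes = text.encode("latin-1")
# UTF-8 encoding (as found in the XMP packet): "é" is stored as c3 a9.
utf8_bytes = text.encode("utf-8")

print(latin1_bytes.hex(" "))
print(utf8_bytes.hex(" "))

# Reading the UTF-8 bytes with a single-byte code page gives the broken form:
# each of the two bytes c3 and a9 becomes its own character.
broken = utf8_bytes.decode("cp1252")
print(broken)

# Reading the same bytes as UTF-8 restores the original text.
fixed = utf8_bytes.decode("utf-8")
print(fixed)
```

So both byte sequences are valid; the display only breaks when the reader picks the wrong decoding for the bytes it was given.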

--

pc

Posts: 1121
Joined: March 25, 2012 - 01:19
Re: Encoding problem after "save as"

Could you please post a small sample PDF file, before and after the SaveAs?
The smaller the better...

Posts: 10
Joined: February 10, 2017 - 14:41
Re: Encoding problem after "save as"

Hi,

Here you are... two PDFs; the second is a copy of the first after the Info dictionary has been deleted by a "Save As" (so only the UTF-8 XMP metadata remains).

(I observed the same problem with hundreds of PDFs.)

--

pc

Attachments (Only registered users)
BeforeAndAfterSaveAs.rar
Posts: 1121
Joined: March 25, 2012 - 01:19
Re: Encoding problem after "save as"

I've found an issue when reading XMP objects: PDFPropertyExtension doesn't respect the content encoding.

Attached here you'll find an updated version with the fix.
Please test it and report any issues here.
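For readers curious about what "respecting the content encoding" can involve, here is a minimal Python sketch. It is not the actual PDFPropertyExtension fix; `decode_xmp_packet` is a hypothetical helper that assumes the packet starts with "<" (as in "<?xpacket") and guesses the encoding from how that first character is stored.

```python
def decode_xmp_packet(raw: bytes) -> str:
    """Decode an XMP packet, guessing its encoding from the first bytes.

    Assumption: the packet begins with the '<' character, so the position of
    the zero bytes around it tells UTF-8, UTF-16 and UTF-32 apart.
    """
    # Check the 4-byte patterns first: b"<\x00\x00\x00" also starts with b"<\x00".
    if raw.startswith(b"\x00\x00\x00<"):
        encoding = "utf-32-be"
    elif raw.startswith(b"<\x00\x00\x00"):
        encoding = "utf-32-le"
    elif raw.startswith(b"\x00<"):
        encoding = "utf-16-be"
    elif raw.startswith(b"<\x00"):
        encoding = "utf-16-le"
    else:
        # No zero bytes around '<': treat the packet as UTF-8.
        encoding = "utf-8"
    return raw.decode(encoding)


# Hypothetical packet content, for illustration only.
packet = '<?xpacket begin="\ufeff"?><title>accentué</title>'
print(decode_xmp_packet(packet.encode("utf-8")))
print(decode_xmp_packet(packet.encode("utf-16-le")))
```

A reader that skips this step and applies a fixed single-byte code page would produce the "accentué" form described earlier in the thread.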

Attachments (Only registered users)
PdfPropertyExtension_1.8.3-beta1.zip
Posts: 10
Joined: February 10, 2017 - 14:41
Re: Encoding problem after "save as"

Hi,

Thank you, it's all right now: I checked many files, and the XMP content is now accurately displayed. This way I can now batch "Save As" my folders.

This seems important for several reasons: first, I don't need to keep the metadata update history in the PDF (the history increases the size and creates a security/privacy risk); second, and this is very important, it could help solve a remaining problem with displaying the metadata in Explorer in two cases:

1 when performing a search like *.pdf in a folder

2 when using any arrangement other than "by folder" in a library: in any arrangement where PDFs from different folders must be displayed in the same list, the metadata are not displayed.

I suspect that this is linked to the index (Windows.edb) update (according to Microsoft, any folder that is part of a library is gathered by the indexing process).

What I don't understand is: why does Explorer display the correct metadata as soon as I save or "Save As" a PDF, yet need to access Windows.edb to display these metadata when I explore a library or perform a search (with the delay imposed by the indexing process before the right values appear)?

Finally: as I understand it, though I'm not sure, the PDF property handler is used by Windows to update the index with the new metadata, so:

1 where is that property handler? Which one should I use, and which program could have installed one on my computer?

2 more basically: what are the metadata "officially", or exactly? What I mean is: when a modified PDF is gathered, what is a keyword? The dc:subject? The PDF keywords? Or the keywords in the Info dictionary? If none of these three places is blank and they have different values, which value will be set in the index? What are the rules? Are they the reason why I don't see any metadata when exploring a library or performing a search in Explorer?
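To make the question about the competing keyword locations concrete, here is a small Python sketch that lists two of the candidates from an XMP packet. It assumes "the PDF keywords" refers to the pdf:Keywords XMP property; the sample packet and the `keyword_candidates` helper are made up for illustration, and the sketch says nothing about which value Windows actually puts in the index.

```python
import xml.etree.ElementTree as ET

# Standard XMP namespace URIs for the RDF, Dublin Core and Adobe PDF schemas.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
}

# Hypothetical XMP content with deliberately different values in each place.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xmlns:pdf="http://ns.adobe.com/pdf/1.3/">
   <dc:subject>
    <rdf:Bag><rdf:li>alpha</rdf:li><rdf:li>beta</rdf:li></rdf:Bag>
   </dc:subject>
   <pdf:Keywords>gamma; delta</pdf:Keywords>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>"""


def keyword_candidates(xmp_text):
    """Return the keyword values found in each XMP location (element form only)."""
    root = ET.fromstring(xmp_text)
    # dc:subject is an unordered bag of individual keywords.
    dc_subject = [li.text for li in root.findall(".//dc:subject/rdf:Bag/rdf:li", NS)]
    # pdf:Keywords is a single free-text string (None if absent).
    pdf_keywords = root.findtext(".//pdf:Keywords", default=None, namespaces=NS)
    return {"dc:subject": dc_subject, "pdf:Keywords": pdf_keywords}


print(keyword_candidates(SAMPLE_XMP))
```

The third candidate, the Keywords entry of the Info dictionary, lives outside the XMP packet and would need a PDF parser to read; after a "Save As" that drops the Info dictionary, only the XMP values remain.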

--

pc

Posts: 10
Joined: February 10, 2017 - 14:41
Re: Encoding problem after "save as"

I did some more testing, and there are some tricky issues:

I want to be able to see the metadata in Explorer in several ways:

-A) when I search for *.pdf in a folder, in order to get all the PDFs in that folder and its subfolders (sorted by a specific key/metadata)

-B) when I use a library not arranged by folder. I arrange by type, and then explore the PDF type: this way I can see all the PDFs in the library sorted by my specific key.

Now, I had problems because most of the files' metadata were not displayed in these two cases.

What I did: I uninstalled NirSoft ShellExView. Then I saw that the advanced attributes of the PDFs (in Explorer: right-click, Properties, Advanced) had been set _not_ to "allow this file to have contents indexed in addition to file properties". So I selected all the PDFs in all my PDF folders and forced that option on. (I also did it at the folder level, with the folders' properties.)

Then I also discovered that two bunches of PDFs were "blocked" (Explorer, right-click, Properties). I downloaded Sysinternals Streams and unblocked them all from the command line.

After a while, the indexing process ended, and most of the metadata were displayed in cases A) and B): "most of them", because for 17 PDF files (out of a total of 1677) they were still not displayed.

What I saw during the indexing process is SearchProtocolHost calling PDFPropertyExtension.dll.

For the 17 faulty files, in case A), the modified date was incorrect, that is: both the modified date in the Explorer column and the modified date in the Details tab after right-click/Properties on the file.

These 17 files had file attachments, as seen in Acrobat. I ran "Remove Hidden Information", removed the metadata and the file attachments, then rebuilt the metadata and saved the files with "Save As". Nothing happened in Explorer (case A)) until I closed Acrobat. What happened next is that the System process (not SearchProtocolHost, which called PDFPropertyExtension.dll during the indexing process) accessed the saved files, and then, after a few seconds, the metadata were accurately displayed in Explorer (case A)).

I did another test before cleaning all the faulty PDFs: I copied some PDF folders to another location (a new folder not marked for indexing). When I searched for *.pdf in that location, all the metadata were accurately displayed.

What I think I understand at the moment is: metadata can be displayed both in a folder that is indexed and in one that is not. If the folder is indexed, some file attachments inside a PDF may prevent the indexing process from getting the correct modified date.

And it seems that Windows treats libraries as a set of indexed folders. That is, if you want to see the metadata in a library, the indexing must have completed correctly (and in that case too, you have to clean the PDFs by removing the troublesome embedded file attachments).

If you think it could be interesting to take a look at the faulty PDFs, I am uploading a few of them (the smallest ones, from before I removed the attachments).

Thank you for your help.

--

pc

Attachments (Only registered users)
5faulty.rar
Posts: 1121
Joined: March 25, 2012 - 01:19
Re: Encoding problem after "save as"

Thanks for the sample files.
I can confirm that PDF file attachments don't cause any issue in PDFPropertyExtension, so I suppose that Windows Search is at fault with them.

The attached files displayed correctly on my development machine (Win7-x64), in both Search and the default Explorer view, but I disabled Windows Search a long time ago because I don't trust it and I consider it a big CPU hog.
I often need to search inside file content ("search content" feature) and Windows Search is simply not reliable.
The issue you're describing is just another reason to disable it ;)
