My validation server loads all the XML documents it can from the file system. I have “skipcorrupt” set to true so that non-well-formed documents don’t fail the database creation or update attempt.
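For context, a minimal BaseX command sketch of that setup (the database name "docs" and the source directory /data/docs are assumptions, not from the original setup):

```
SET SKIPCORRUPT true
CREATE DB docs /data/docs
```

With SKIPCORRUPT enabled, non-well-formed files are silently skipped rather than aborting the CREATE or ADD operation.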
However, as part of my validation services, I need to be able to report those documents that are not well formed and therefore didn’t make it into the database.
I’m wondering what the easiest/most efficient way to do that would be within BaseX?
I’m working with on the order of 36K files. The files as stored in BaseX have the same path and filename as the files on the file system relative to the directory I import from, so the correlation between files and docs in BaseX is direct and simple.
One easy solution would be to compute the difference between the list of files on the file system and the list of docs in the database, and then attempt to load each missing file to verify that it is in fact unparseable and not just not-yet-imported.
But maybe there’s a more direct way that I’ve overlooked?
Thanks,
E.
_____________________________________________
Eliot Kimber
Sr Staff Content Engineer
O: 512 554 9368 | M: 512 554 9368
servicenow.com: https://www.servicenow.com
LinkedIn: https://www.linkedin.com/company/servicenow | Twitter: https://twitter.com/servicenow | YouTube: https://www.youtube.com/user/servicenowinc | Facebook: https://www.facebook.com/servicenow
On Sat, Feb 26, 2022 at 02:53:46PM +0000, Eliot Kimber scripsit:
> But maybe there’s a more direct way that I’ve overlooked?
If you trust the load process, you can get what's on disk with file:list(), and you can get what's in the system with some variation on collection()/document-uri(). You would then have to adjust the path names a little so they've got the same notional root.
Once you've done that, $disk[not(. = $system)] tells you which files aren't well-formed.
I'd expect this to be pretty brisk, and you don't have to try to parse anything a second time.
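A sketch of that suggestion in XQuery, assuming a database named "docs" imported from /data/docs (both names are placeholders; in BaseX, document-uri() on a database node returns the path prefixed with the database name, which is stripped here to match the relative paths from file:list()):

```xquery
(: Files on disk that never made it into the database. :)
let $db   := "docs"
let $root := "/data/docs/"

(: Relative paths of all XML files on disk (recursive listing);
   normalize Windows backslashes to forward slashes. :)
let $disk :=
  file:list($root, true())[matches(., '\.xml$', 'i')]
  ! replace(., '\\', '/')

(: Relative paths of all documents that made it into the database. :)
let $indb :=
  collection($db) ! substring-after(document-uri(.), $db || "/")

(: Anything on disk with no corresponding doc was skipped,
   presumably because it was not well formed. :)
return $disk[not(. = $indb)]
```

For large sets, `$disk[not(. = $indb)]` can be slow as a nested loop; if 36K files turns out to be sluggish, a map built from `$indb` would give constant-time lookups.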
Graydon,
That seems like a good solution. I will pursue it.
My only practical wrinkle is that I’m reading from local git clones, so I have to make sure I’ve attempted to load any files pulled since the last load before checking for failed-to-load files — but that’s doable, of course.
Cheers,
E.
From: Graydon <graydonish@gmail.com>
Date: Saturday, February 26, 2022 at 9:05 AM
To: Eliot Kimber <eliot.kimber@servicenow.com>
Cc: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Identify Unparseable XML Files in File System
--
Graydon Saunders | graydonish@gmail.com
Þæs oferéode, ðisses swá mæg. -- Deor ("That passed, so may this.")