Hi,
While we were discussing possible use cases for BaseX, a colleague asked me whether it is also possible to load LibreOffice and Word documents into BaseX and then perform full-text analysis on them. In essence, these are both XML files, so it should be possible.
Does anybody have experience with this?
Ben
Hi Ben,
Yes, that’s possible. Office files are simple ZIP archives, so you can create a database with ZIP parsing turned on.
If you supply a Word file to the collection() function, the document will be parsed on-the-fly. Just run the following query on the attached document:
collection('HelloWorld.docx')//text()[. contains text 'hello']
In practice, you’ll surely have to invest some more time, as an Office text string may be distributed across multiple nodes.
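A minimal sketch of one way to handle the split text, assuming a standard WordprocessingML file in which each paragraph is a w:p element and its text sits in w:t descendants (the namespace URI below is the standard WordprocessingML one; the file name is the one from the query above). The idea is to join the text of each paragraph first and run the full-text test on the joined string:

```xquery
declare namespace w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

(: Join all text runs of a paragraph before matching, so that a word
   split across several w:t elements is still found. :)
for $p in collection('HelloWorld.docx')//w:p
let $text := string-join($p//w:t, '')
where $text contains text 'hello'
return $text
```

Because the match runs on a computed string rather than on stored text nodes, this will most likely scan the paragraphs sequentially instead of using a full-text index, so it is better suited to single documents than to large collections.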
Best, Christian
On Tue, Jan 28, 2020 at 2:01 PM Ben Engbers Ben.Engbers@be-logical.nl wrote:
[...]
Hi Ben,
This will be problematic, at least with real-world docx files. Depending on the edit history of the document, the text in there can be split across numerous elements with no regard for word boundaries. As BaseX has no means of ignoring inline elements in the index, this will always be a rather slow process, and formulating a suitable XQuery will be a complicated task, unless you clean up the docx XML beforehand.
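A rough sketch of what such a cleanup could look like, assuming the file HelloWorld.docx from earlier in this thread. The database name docxclean and the element names doc and p are made up for illustration. Each Word paragraph is reduced to a single text node, and the result is stored in a new database with a full-text index:

```xquery
declare namespace w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

(: Reduce the document to one <p> element per Word paragraph,
   each holding a single text node. The names doc and p are hypothetical. :)
let $clean :=
  <doc>{
    for $p in collection('HelloWorld.docx')//w:p
    return <p>{ string-join($p//w:t, '') }</p>
  }</doc>
(: Store the simplified document in a new database (hypothetical name)
   and request a full-text index for it. :)
return db:create('docxclean', $clean, 'HelloWorld.xml', map { 'ftindex': true() })
```

The cleaned database could then be queried along these lines:
collection('docxclean')//p[text() contains text 'hello']
Note that this sketch drops everything except w:t content, so tabs, line breaks, and all formatting are lost.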
Omar
basex-talk@mailman.uni-konstanz.de