Hi Ben
This will be problematic, at least with real-world docx files. Depending on the edit history of the document, the text in there can be split across numerous tags without any regard for word boundaries. As BaseX has no means to ignore inline elements in the index, this will always be a rather slow process, and formulating an XQuery for it will be a complicated task. Unless you clean up the docx XML beforehand, that is.
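To illustrate what such a cleanup could look like, here is a minimal sketch in Python (standard library only), not a tested production approach. It assumes the main text lives in word/document.xml inside the docx archive and that joining all w:t elements of a paragraph is good enough; tabs, line breaks, footnotes, headers and tracked deletions are ignored. The function names paragraph_texts and flatten_docx and the output element names doc and p are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraph_texts(document_xml):
    """Return one plain string per w:p, joining the text of all its w:t runs."""
    root = ET.fromstring(document_xml)
    texts = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Runs may split a word anywhere, so join without a separator.
        texts.append("".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t")))
    return texts


def flatten_docx(path):
    """Read a docx file and return a simplified <doc><p>...</p></doc> string."""
    with zipfile.ZipFile(path) as archive:
        paragraphs = paragraph_texts(archive.read("word/document.xml"))
    out = ET.Element("doc")
    for text in paragraphs:
        ET.SubElement(out, "p").text = text
    return ET.tostring(out, encoding="unicode")


# Hand-written sample: one paragraph whose words are split across three runs.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body><w:p>'
    "<w:r><w:t>Full-te</w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>xt sea</w:t></w:r>"
    "<w:r><w:t>rch</w:t></w:r>"
    "</w:p></w:body></w:document>" % W_NS
)

print(paragraph_texts(SAMPLE))  # ['Full-text search']
```

The simplified output has each paragraph as a single text node, so it can be added to a database and indexed without the run boundaries getting in the way.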
Omar
On 28.01.2020 at 14:01, Ben Engbers wrote:
Hi,
While we were discussing possible use cases for BaseX, a colleague asked me whether it is also possible to load LibreOffice and Word documents into BaseX and then perform full-text analysis on them. In essence, these are both XML files, so it should be possible.
Does anybody have experience with this?
Ben