Load LibreOffice- and Word-documents? - BaseX-Talk - mailman.uni-konstanz.de


Ben Engbers

28 Jan 2020

8:01 a.m.

Hi,

While we were discussing possible use cases for BaseX, a colleague asked me whether it is also possible to load LibreOffice and Word documents into BaseX and then perform full-text analysis on them. In essence, these are both XML files, so it should be possible.

Does anybody have experience with this?

Ben


Christian Grün

28 Jan 2020

8:08 a.m.

Hi Ben,

Yes, that’s possible. Office files are simple ZIP archives, so you can create a database with ZIP parsing turned on.
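A minimal sketch of the database approach, assuming BaseX's `ADDARCHIVES` option (archive parsing, which the BaseX documentation lists as enabled by default) and its `FTINDEX` option (full-text index). The database name `office-docs` and the input path are hypothetical placeholders:

```xquery
(: Create a database from a Word file; with archive parsing enabled,
   the XML entries inside the ZIP container are added as documents. :)
db:create(
  'office-docs',                 (: hypothetical database name :)
  '/path/to/HelloWorld.docx',    (: hypothetical input path :)
  (),
  map { 'addarchives': true(), 'ftindex': true() }
)
```

Once such a database exists, it can be addressed with `collection('office-docs')` instead of the file name.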

If you supply a Word file to the collection() function, the document will be parsed on-the-fly. Just run the following query on the attached document:

collection('HelloWorld.docx')//text()[. contains text 'hello']

In practice, you’ll surely have to invest some more time, as an Office text string may be distributed across multiple nodes.
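Because a text string may be distributed across multiple nodes, a test on individual text nodes can miss a word that spans several of them. One possible workaround, sketched here under the assumption that the file uses the standard WordprocessingML namespace, is to join the `w:t` text of each paragraph (`w:p`) before searching:

```xquery
declare namespace w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

(: Search whole paragraphs instead of single text nodes :)
for $p in collection('HelloWorld.docx')//w:p
(: Concatenate all text runs of the paragraph without separators :)
let $text := string-join($p//w:t ! string(), '')
where $text contains text 'hello'
return $text
```

This sketch ignores tabs, line breaks and other non-text run content, and since the string is assembled at query time it is evaluated without a full-text index.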

Best, Christian


Omar Siam

8:27 a.m.

Hi Ben,

This will be problematic, at least with real-world docx files. Depending on the edit history of the document, the text can be split across numerous elements without regard to word boundaries. As BaseX has no means of ignoring inline elements in the index, searching will always be rather slow, and formulating the XQuery will be complicated, unless you clean up the docx XML beforehand.
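One way such a cleanup step might look, as a rough sketch rather than a tested recipe: each Word paragraph is flattened into a plain `<p>` element with a single text node, and the result is stored in a separate database with a full-text index. The function `local:flatten`, the database name `office-clean` and the resource path `HelloWorld.xml` are hypothetical, and the query assumes the archive contains exactly one main document part:

```xquery
declare namespace w = 'http://schemas.openxmlformats.org/wordprocessingml/2006/main';

(: Hypothetical helper: one <p> per non-empty Word paragraph,
   with all text runs merged into a single string. :)
declare function local:flatten($doc as document-node()) as element(doc) {
  <doc>{
    for $p in $doc//w:body//w:p
    let $text := string-join($p//w:t ! string(), '')
    where $text != ''
    return <p>{ $text }</p>
  }</doc>
};

(: Keep only the main document part (root element w:document)
   and store the flattened copy in a new, indexed database. :)
db:create(
  'office-clean',
  collection('HelloWorld.docx')[w:document] ! local:flatten(.),
  'HelloWorld.xml',
  map { 'ftindex': true() }
)
```

The cleaned database could then be searched with a simple expression such as `collection('office-clean')//p[text() contains text 'hello']`, where each paragraph is a single text node.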

Omar

