Hello,
Thank you for your software, whose GUI has been my savior every time I needed to deal with XML.
I would like to know whether I can stream XML transforms, to pipe Wikimedia XML dumps into a format acceptable to Postgres COPY. I know SQL very well, but nothing about XPath or XQuery.
I managed to mock up an XPath (or is it XQuery? :/) snippet from Postgres itself, but obviously it would need rewriting for the BaseX CLI: https://stackoverflow.com/questions/60361030/how-to-transform-and-stream-lar...
Best regards, Maxime Chambonnet
What do you mean by “stream xml transforms”?
Do you mean streaming a single large XML file? A series of XML files? Or streaming a file through a series of XQuery/XSLT/XPath transforms?
It depends on what you mean by “stream”.
I don’t believe BaseX uses a streaming XML parser, so it probably can’t stream a single large XML file and produce output before it has parsed the complete file.
But it looks like, from the link in your Stack Overflow post, that the data is already sharded into a collection of separate XML files that each contain multiple <page> elements.
— Steve M.
> Do you mean streaming a single large XML file? A series of XML files? Or streaming a file through a series of XQuery/XSLT/XPath transforms?
Possibly poor wording: I meant reading a large XML file and producing, e.g., a CSV file.
> I don’t believe BaseX uses a streaming XML parser, so it probably can’t stream a single large XML file and produce output before it has parsed the complete file.
Do you know of a streaming XML library other than StAX (no Java here :<)?
> But it looks like, from the link in your Stack Overflow post, that the data is already sharded into a collection of separate XML files that each contain multiple <page> elements.
That is the alternative: instead of processing the monolithic multistream file, I could crawl over the ~150 MB bz2-compressed chunks.
Regards, Maxime
Hi Maxime,
BaseX provides no streaming facilities for large XML instances.
However, if you have enough disk space left, you can create a database instance from your XML dump. We have already done this for Wiki dumps of up to 420 GB [1]. You should disable the text and attribute indexes; database creation will then consume constant memory.
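For example, a minimal, untested sketch of that first step via the Database Module; the database name 'wiki' and the dump file path are placeholders, and the indexes can just as well be switched off in the GUI or with the SET command before CREATE DB:

  (: Hypothetical sketch: create the database with the text and attribute
     indexes disabled, so that database creation runs in constant memory. :)
  db:create(
    'wiki',                                 (: database name (placeholder) :)
    'enwiki-latest-pages-articles.xml',     (: decompressed dump (placeholder path) :)
    'dump.xml',                             (: target path inside the database :)
    map { 'textindex': false(), 'attrindex': false() }
  )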
In the next step, you can write a query that writes out CSV entries for all page elements; the File Module and file:append can be helpful for that [2]. If this approach turns out not to be fast enough, you can use the FLWOR window clause to write out chunks of CSV entries [3]. If your output is projected to be much smaller than your input, you don’t need any window clause, and you could use our CSV Module with the 'xquery' format to serialise your CSV result in one go [4].
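A rough, untested sketch of such an export query follows; it assumes a database named 'wiki', the MediaWiki export namespace of the dump (the exact URI depends on the dump version), and two illustrative columns (page id and title):

  declare default element namespace "http://www.mediawiki.org/xml/export-0.10/";

  (: Hypothetical sketch: append one CSV line per <page> element to a file
     that Postgres COPY could load; quotes in titles are CSV-escaped. :)
  let $out := 'pages.csv'
  return (
    file:write-text-lines($out, 'id,title'),
    for $page in db:open('wiki')//page
    let $title := '"' || replace($page/title, '"', '""') || '"'
    return file:append-text-lines($out,
      string-join(($page/id/string(), $title), ','))
  )

If the per-page appends turn out to be too slow, the window clause from [3] could be used to pass, say, 10000 lines to each file:append-text-lines call instead.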
Hope this helps, Christian
[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/File_Module
[3] http://docs.basex.org/wiki/XQuery_3.0#window
[4] http://docs.basex.org/wiki/CSV_Module
If you really want to read all of the data as a single stream, I would suggest writing a preprocessor using a SAX library (from Python, Java, or whatever language you want to use) to break the Wikimedia stream into separate XML files, one per page element, or else using the same language to do the streaming CSV conversion.
However, for a file that large, you may have issues if there is a network interruption.
Depending on how reliable your connection is, you might be better off downloading the separate chunks. That gives you easily recognizable restart points.
Otherwise: Saxon can do streaming XSLT, but only with one of the paid-license Enterprise versions. No idea whether Saxon XQuery can also handle streaming input, or whether any of the non-Java versions of Saxon handle streaming.
If all that is needed is to convert the XML stream into CSV records to dump into Postgres, I would probably use Python/SAX, but I wonder whether Postgres is really a requirement, or whether you could do your final queries in BaseX. If dumping everything into a BaseX database is just an intermediate step, then it’s probably not the most efficient way to go.
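For instance, a final query of that kind in BaseX could be as simple as the following hypothetical sketch (reusing the 'wiki' database and namespace assumptions from the sketch above):

  declare default element namespace "http://www.mediawiki.org/xml/export-0.10/";

  (: Hypothetical example: titles of the ten longest articles, assuming one
     revision per page, as in the pages-articles dumps. :)
  (for $page in db:open('wiki')//page
   order by string-length($page/revision[1]/text) descending
   return $page/title/string())[position() = 1 to 10]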
— Steve M.
You write:
"I would like to preprocess the xml before entering postgres, and stream it with the copy command." But why? I'm inferring that you want to dynamically generate XML as it's queried by Postgres?
Just curious,
Thufir