Oh, sometimes I may be too euphoric about BaseX - after all, it is only a fantastic, mind-boggling product. Thank you for the explanation, Christian, which amounts to a hidden textbook example of non-linearity. When things are correlated linearly, a repetition on the left (here: the external file) means a repetition on the right (here: the internal representation). The use of a string dictionary means that a repetition on the left is not necessarily repeated (fully) on the right. The memory consumption of the internal representation of the document is thus non-linearly correlated with the memory consumption of the external representation. Here the non-linearity was so drastic that the external and internal representations looked as if they were uncorrelated, which is exactly what streaming would look like. But then we should not forget that we are sitting in Plato's cave (also when waiting to see whether or not program execution crashes), staring at flickering shadows on the wall, while behind us, in front of a burning fire, barefoot servants carry objects to and fro, including documents.
On Monday, April 12, 2021, 15:39:25 CEST, Christian Grün christian.gruen@gmail.com wrote:
Hi Hans-Jürgen,
Here’s why your 8 GB document can be opened successfully without streaming: In the main memory representation of parsed documents, all distinct strings will only be kept once in memory. As a result, the 8 GB document (which I created, following your instructions in a personal mail) will consume less than 1 GB of main memory.
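The effect described above can be illustrated with a small sketch. This is not BaseX's actual implementation; the class StringDictionary, the function build_document and the sample values are hypothetical names chosen for illustration only. The sketch assumes a document whose text values repeat heavily: each distinct string is stored once, and every occurrence is represented by a small integer reference.

```python
class StringDictionary:
    """Hypothetical string dictionary: stores each distinct string once."""

    def __init__(self):
        self._ids = {}      # string -> integer id
        self._strings = []  # integer id -> string

    def intern(self, s):
        """Return the id of s, adding s only if it has not been seen before."""
        idx = self._ids.get(s)
        if idx is None:
            idx = len(self._strings)
            self._ids[s] = idx
            self._strings.append(s)
        return idx

    def lookup(self, idx):
        """Return the string stored under the given id."""
        return self._strings[idx]

    def __len__(self):
        return len(self._strings)


def build_document(values):
    """Return (dictionary, refs) with one integer reference per input value."""
    dictionary = StringDictionary()
    refs = [dictionary.intern(v) for v in values]
    return dictionary, refs


# A highly repetitive "document": 300,000 text values, only 3 distinct ones.
values = ["Main Page", "Talk", "User"] * 100_000
dictionary, refs = build_document(values)

# Characters in the external form versus characters actually stored.
external_chars = sum(len(v) for v in values)
internal_chars = sum(len(dictionary.lookup(i)) for i in range(len(dictionary)))

print(len(refs), len(dictionary), external_chars, internal_chars)
```

In this sketch every occurrence still costs one reference, so memory grows with the number of values, but the string data itself grows only with the number of distinct strings.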
Hope this helps, Christian
On Mon, Apr 12, 2021 at 12:56 PM Hans-Juergen Rennau hrennau@yahoo.de wrote:
Hi Christian, I had wondered myself whether a database was secretly involved, so I checked: there is no such database.
Today I have already put the feature to use in earnest, writing the query below, which does a little more than counting: it extracts data and composes a result document from it.
Kind regards, Hans-Jürgen
PS: Here comes the next trick:
declare namespace f = "http://www.parsqube.de/ns/xquery-functions";
declare variable $infile external := 'wiki-sample.xml';

declare function f:filePath($n) {
  (
    let $fullNumber := format-number($n, '0000000')
    for $i in 1 to 7
    return substring($fullNumber, $i, 1)
  ) => string-join('/') || '.xml'
};

prof:time(
  <wiki>{
    for $page at $pos in doc($infile)/*/page
    return <page>{
      <url>{f:filePath($pos)}</url>,
      $page/title
    }</page>
  }</wiki>
)
On Monday, April 12, 2021, 12:22:59 CEST, Christian Grün christian.gruen@gmail.com wrote:
Hi Hans-Jürgen,
A streaming fn:doc?
I need to check the code ;)
Have you possibly created a database for that document in the past? If so, the database will be opened instead of the local file.
Best, Christian
On Fri, Apr 9, 2021 at 8:54 PM Hans-Juergen Rennau hrennau@yahoo.de wrote:
Hello, I would like to let you know that just as there is matter and antimatter, there are bugs and anti-bugs. An anti-bug is when something works which cannot possibly work. I am not surprised that the first anti-bug detected (by me) is a BaseX one.
The following invocation
basex "doc('result.xml')/*/page/id => distinct-values() => sort()"
processes a file which is more than 8 GB in size. It cannot be processed on my machine unless processing is streaming. A streaming fn:doc? Amazing. (At least to me.)
Hans-Jürgen