Hi Tim, hi Bridger,

Some time ago, we decided to put atomicity first and to synchronize all concurrent updating file operations, whether they run in parallel or are invoked by different clients. This way, we prevent parallel transactions from being interrupted by one another when they write to the same target.

Perhaps we are being overly cautious. We could choose a more fine-grained approach and synchronize I/O access to individual files. That would be straightforward for operations on single files (file:write and its variants), but it gets more complex for recursive operations like file:copy or file:delete, which may affect all files in a directory. We'll give this some more thought.
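To illustrate the tricky case (a hypothetical sketch, not how BaseX currently behaves; the directory and file names are made up): under per-file synchronization, a single-file write and a recursive delete on the enclosing directory could race, because a lock on the one new file alone would not protect it from the recursive operation.

xquery:fork-join((
  (: writes one new file into the directory :)
  fn() { file:write-text('/tmp/locking-demo/new.txt', 'hello') },
  (: recursively deletes the whole directory, including files created in the meantime :)
  fn() { file:delete('/tmp/locking-demo/', true()) }
))

So file:copy and file:delete would effectively need directory-level locks, which is where the extra complexity comes in.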

Best,
Christian



On Fri, Oct 4, 2024 at 11:03 PM Bridger Dyson-Smith <bdysonsmith@gmail.com> wrote:
Hey Tim -

On Fri, Oct 4, 2024 at 5:53 PM Thompson, Timothy <timothy.thompson@yale.edu> wrote:

Thanks, Bridger! `file:write-text-lines` seems to be the issue. For example, this query doesn’t run in parallel.

 

You're right - apologies for missing this key point in your initial email. 

Is this expected behavior?

declare variable $PATH := "";

xquery:fork-join(
  for $_ in (1 to 8)
  return fn() {
    file:write-text-lines(
      $PATH || $_ || ".json",
      for $i in (1 to 1000000)
      return
        serialize(
          <fn:map>
            <fn:string key="n">{$i}</fn:string>
          </fn:map>,
          { "method": "json", "escape-solidus": "no",
            "json": { "format": "basic", "indent": "no" }
          }
        )
    )
  },
  { "parallel": "8" }
)

 

It does seem that the writes in `file:write-text-lines` are *not* parallel compared with a sequential run of the same query. I did the following comparison:

using your example,
ls -l --time-style=full-iso /tmp/fork-test
total 130860
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:39:57.926518544 +0000 1.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:02.849576119 +0000 2.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:07.799634010 +0000 3.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:28.652877890 +0000 4.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:12.892693574 +0000 5.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:18.140754950 +0000 6.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:23.569818443 +0000 7.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:39.098000046 +0000 8.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:40:33.779937851 +0000 9.json

vs

using a sequential write:
declare variable $PATH := "/tmp/fork-test/sequential/";

for $i in (1 to 9)
return
  file:write-text-lines(
    $PATH || $i || ".json",
    for $n in (1 to 1000000)
    return
      serialize(
        <fn:map>
          <fn:string key="n">{$n}</fn:string>
        </fn:map>,
        { "method": "json", "escape-solidus": "no",
          "json": { "format": "basic", "indent": "no" }
        }
      )
  )

ls -l --time-style=full-iso /tmp/fork-test/sequential
total 130860
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:19.841259435 +0000 1.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:24.820319704 +0000 2.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:29.838380446 +0000 3.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:35.041443427 +0000 4.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:40.182505657 +0000 5.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:45.305567669 +0000 6.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:50.535630977 +0000 7.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:49:55.703693534 +0000 8.json
-rw-r--r-- 1 bridger bridger 14888896 2024-10-04 20:50:00.948757024 +0000 9.json

Judging by the timestamps, each file in both attempts appears about five seconds after the previous one; the only difference is that the fork-join run finishes the files out of numerical order. I wonder if it's due to the appending in `file:write-text-lines`?
Maybe Christian can chime in and let us know :)
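One way to narrow it down (an untested sketch; it assumes prof:void is available to discard the output): run the same serialization work in the forked functions without writing any file. If that version scales with the number of threads, the bottleneck is the synchronized write rather than the serialization.

xquery:fork-join(
  for $_ in (1 to 8)
  return fn() {
    (: same per-file workload as before, but nothing is written to disk :)
    prof:void(
      for $i in (1 to 1000000)
      return serialize(
        <fn:map><fn:string key="n">{$i}</fn:string></fn:map>,
        { "method": "json", "json": { "format": "basic", "indent": "no" } }
      )
    )
  },
  { "parallel": "8" }
)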

Have a nice weekend!
Best,
Bridger

 

-- 
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research

Interim Manager, Metadata Services Unit

www.linkedin.com/in/timathompson

 

 

From: Bridger Dyson-Smith <bdysonsmith@gmail.com>
Date: Wednesday, October 2, 2024 at 1:05 PM
To: Thompson, Timothy <timothy.thompson@yale.edu>
Cc: BaseX <basex-talk@mailman.uni-konstanz.de>
Subject: Re: [basex-talk] Write files in parallel?

hi Tim - hope you are well.

In the past (I don't remember exactly whether this was perfectly parallel; it was just "parallel enough"), I have used something like the following for web requests:

xquery:fork-join(
  for $xml in ('calq.xqm','factbook.xml','filesystem.xml','locations.xml','wiki1.zip', 'wiki2.zip','xmark.xml')
  let $url := 'https://files.basex.org/xml/'
  return fn() {
    file:write(
      '/tmp/fork-test/' || $xml,
      http:send-request(
        <http:request method='get'/>,
        $url || $xml
      )
    )
  },
  map { 'parallel': '3'}
)

Hopefully that's helpful (and apologies to the BaseX team's file server)!

Best,

Bridger

 

ls -l --time-style=full-iso
total 11640
-rw-r--r-- 1 bridger bridger    1593 2024-10-02 17:02:51.321251082 +0000 calq.xqm
-rw-r--r-- 1 bridger bridger 1763070 2024-10-02 17:02:52.301261520 +0000 factbook.xml
-rw-r--r-- 1 bridger bridger 2770290 2024-10-02 17:02:53.331272491 +0000 filesystem.xml
-rw-r--r-- 1 bridger bridger 1566322 2024-10-02 17:02:52.497263608 +0000 locations.xml
-rw-r--r-- 1 bridger bridger  512686 2024-10-02 17:02:52.670265451 +0000 wiki1.zip
-rw-r--r-- 1 bridger bridger 5133340 2024-10-02 17:02:54.046280106 +0000 wiki2.zip
-rw-r--r-- 1 bridger bridger  155448 2024-10-02 17:02:52.859267464 +0000 xmark.xml

 

 

On Tue, Oct 1, 2024 at 5:32 PM Thompson, Timothy <timothy.thompson@yale.edu> wrote:

Hello,

 

Is it possible to call file:write-text-lines in parallel inside a fork-join operation? I have multiple databases that I would like to run a query over, in parallel, and write the results as JSON Lines to a file per database. When I try this, it doesn’t seem to parallelize.
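For context, the pattern I have in mind looks roughly like this (the database names, the elements I'm selecting, and the output path are just placeholders):

xquery:fork-join(
  for $db in ('db-1', 'db-2', 'db-3')   (: placeholder database names :)
  return fn() {
    file:write-text-lines(
      '/tmp/out/' || $db || '.jsonl',   (: placeholder output path :)
      for $rec in db:get($db)//record   (: placeholder query per database :)
      return serialize(
        <fn:map>
          <fn:string key="id">{ $rec/@id/string() }</fn:string>
        </fn:map>,
        { "method": "json" }
      )
    )
  },
  { "parallel": "3" }
)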

 

Thanks in advance,

Tim

 

 

-- 
Tim A. Thompson (he, him)
Librarian for Applied Metadata Research

Interim Manager, Metadata Services Unit

Yale University Library

www.linkedin.com/in/timathompson