Re: [basex-talk] BaseX-Talk Digest, Vol 104, Issue 15 - BaseX-Talk - mailman.uni-konstanz.de

9 Aug 2018

      On Thu, Aug 9, 2018 at 2:29 AM, basex-talk-request@mailman.uni-konstanz.de
wrote:
...
Today's Topics:

Re: Different interpretation of regex in eXist, Saxon and
BaseX (Omar Siam)
Re: BaseX insert/delete node performance (Christian Grün)
Transaction management in BaseX 8.6.4 (Marc Coenegracht)
Re: Transaction management in BaseX 8.6.4 (Christian Grün)
Re: Transaction management in BaseX 8.6.4 (Christian Grün)

Message: 1
Date: Wed, 8 Aug 2018 12:58:39 +0200
From: Omar Siam Omar.Siam@oeaw.ac.at
To: basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Different interpretation of regex in eXist,
        Saxon and BaseX
Message-ID: 91b47b0e-70a1-1336-ced6-e12eaa804cde@oeaw.ac.at
Content-Type: text/plain; charset=utf-8; format=flowed
Hi
I think the problem is this: there are numerous implementations of regular
expressions which have a common subset but differ in the more
advanced features.
Using the Java regular expression implementation you can use greedy
quantifiers and some other things. The XSL and XQuery implementations,
following the standards, do not allow this and so misinterpret the regular
expression. See here: https://www.w3.org/TR/xpath-functions-31/#regex-syntax
You can tell Saxon to use a different regexp engine such as the standard
Java one:
https://www.saxonica.com/html/documentation/functions/fn/matches.html
Best regards
Omar
On 07.08.2018 at 21:38, Andreas Mixich wrote:
...
Hi
[rfc3986](https://tools.ietf.org/html/rfc3986#appendix-B) defines a nice
regular expression, which groups any URI, including URN, by URI
component.
...
Interesting about this regex is the use of the '?' quantifier, which
makes every preceding group/component optional, thus matching either a
URI or any other(!) string, since anything that does not match one of
the special groups goes into a catch-all group (no. 5), which keeps
either the path or the full, arbitrary string. This is negligible,
since the input to this regex is guaranteed to be of the right type
(a/@href/string()).
Here is the relevant part from the RFC.
Appendix B
  ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
   12            3  4          5       6  7        8 9
  The numbers in the second line above are only to assist
  readability; they indicate the reference points for each
  subexpression (i.e., each paired parenthesis).  We refer to the
  value matched for subexpression <n> as $<n>.  For example, matching
  the above expression to

     http://www.ics.uci.edu/pub/ietf/uri/#Related

  results in the following subexpression matches:

     $1 = http:
     $2 = http
     $3 = //www.ics.uci.edu
     $4 = www.ics.uci.edu
     $5 = /pub/ietf/uri/
     $6 = <undefined>
     $7 = <undefined>
     $8 = #Related
     $9 = Related

  where <undefined> indicates that the component is not present,
  as is the case for the query component in the above example.
  Therefore, we can determine the value of the five components as

     scheme    = $2
     authority = $4
     path      = $5
     query     = $7
     fragment  = $9

  Going in the opposite direction, we can recreate a URI reference
  from its components by using the algorithm of Section 5.3.

I tested this regex with Saxon, eXist and BaseX. eXist successfully
parsed all the test cases I threw at it into the right groups; Saxon
and BaseX did not. The failure is:
 [FORX0003] Pattern matches empty string.

And that got me baffled, since all three processors use Java underneath
and since the definition of the '?' quantifier, when used like this,
seems to be:
 Makes the preceding item optional. Greedy, so the optional item
 is included in the match if possible.

Which means that *if* any of the group's contents match, they should be
included, rather than producing an empty string.
Why is it like that? And what can I do about it? I found no other URI
parsing regex that splits a URI into components this way and would be
compatible with XQuery.
See the attached test case.
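
As a point of comparison, here is a minimal Java sketch that runs the Appendix B pattern through java.util.regex directly. It only illustrates two things: the plain Java engine accepts the pattern and splits the RFC's example URI into the expected components, and the same pattern also matches the empty string, which appears to be the condition the FORX0003 message above reports. The class and method names are hypothetical, and nothing here reproduces how Saxon, eXist or BaseX evaluate the expression internally.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: the RFC 3986 Appendix B pattern in java.util.regex.
final class RfcUriRegex {

    // Pattern as printed in RFC 3986, Appendix B (the '?' before the query is escaped).
    static final Pattern URI = Pattern.compile(
        "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    // Returns { scheme, authority, path, query, fragment }, i.e. groups 2, 4, 5, 7, 9.
    // A component that is not present is returned as null.
    static String[] components(String uri) {
        Matcher m = URI.matcher(uri);
        if (!m.matches()) {
            // Can happen, e.g. for input containing a line break ('.' does not match it).
            throw new IllegalArgumentException("no match: " + uri);
        }
        return new String[] {
            m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)
        };
    }

    // Every group is optional or may be empty, so the pattern matches "" as well.
    static boolean matchesEmptyString() {
        return URI.matcher("").matches();
    }

    public static void main(String[] args) {
        String[] c = components("http://www.ics.uci.edu/pub/ietf/uri/#Related");
        // prints [http, www.ics.uci.edu, /pub/ietf/uri/, null, Related]
        System.out.println(Arrays.toString(c));
        // prints true
        System.out.println(matchesEmptyString());
    }
}
```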

Message: 2
Date: Wed, 8 Aug 2018 19:16:51 +0200
From: Christian Grün christian.gruen@gmail.com
To: BIRKNER Michael Michael.BIRKNER@akwien.at
Cc: BaseX basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] BaseX insert/delete node performance
Message-ID:
        <CAP94bnPj9-qHXKu6bbv_6FAiX=JQ28R8etd_oH31j=-=tPL+UQ@mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Michael,
Welcome to the list.
One thing you could try immediately is to call OPTIMIZE (possibly followed
by the ALL flag) or db:optimize(..., true()) and see if performance
improves. Obviously, this doesn't make sense after each single update
operation, but it could be called before a larger number of updates is
performed.
...
The problem is that in my case, I have to do about 150000 inserts and
deletes, so it would take too much time.
If you define all the insert expressions (or a bigger number than just 1 or
10) in a single XQuery expression (via a FLWOR expression), you will
benefit from various bulk optimizations. Did you try that already?
Best,
Christian
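
A minimal Java sketch of the single-query idea: instead of sending one query per node, a client can assemble one FLWOR expression that covers a whole batch of node ids. The class name, method names and the ids in main are hypothetical; how the resulting string is sent to BaseX (client session, GUI editor, or otherwise) is left out, and the database name is assumed to contain no quote characters.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: builds ONE updating XQuery for a batch of node ids.
final class BulkQueryBuilder {

    // Renders the ids as an XQuery sequence, e.g. "(1, 2, 3)"; an empty list gives "()".
    static String idSequence(List<Long> ids) {
        return ids.stream()
                  .map(String::valueOf)
                  .collect(Collectors.joining(", ", "(", ")"));
    }

    // One FLWOR expression that deletes every node addressed by the given ids.
    static String bulkDelete(String db, List<Long> ids) {
        return "for $id in " + idSequence(ids) + "\n"
             + "return delete node db:open-id('" + db + "', $id)";
    }

    // One FLWOR expression that inserts the same XML fragment into every target node.
    static String bulkInsert(String db, List<Long> targetIds, String xmlFragment) {
        return "for $id in " + idSequence(targetIds) + "\n"
             + "return insert node " + xmlFragment
             + " into db:open-id('" + db + "', $id)";
    }

    public static void main(String[] args) {
        // Hypothetical ids, for illustration only.
        System.out.println(bulkDelete("Database_Name", List.of(101L, 102L, 103L)));
    }
}
```

With 150000 updates, the id list could also be split into a handful of larger batches rather than one giant query; which batch size works best would have to be measured.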
Hi
I did the same and the results are good.
I took the db:node-pre() suggestion from the website and from Christian and
did a one-time bulk delete first and then one bulk insert (try to do the
insert last, for speed), each as a bulk operation in a loop. Optimize the
database before and after each of the bulk operations.
I use the text index, and OPTIMIZE ALL takes a long time, so:
1. session->"create index text"
2. for all delete pre-nodes-array
      session->query->"delete statement"
3. for all insert xml
    session->query->"insert statement"
4. session->"create index text"
This solved my bulk update issue.
Christian suggested replace instead of delete and insert, but replace was
taking a little extra time.
Thanks,
...
BIRKNER Michael Michael.BIRKNER@akwien.at wrote on Wed, 8 Aug 2018,
08:36:
...
Hello,
I asked this question in StackOverflow concerning some performance
problems I experienced when inserting nodes into a BaseX database:
https://stackoverflow.com/questions/51595210/basex-inserting-nodes-performance-problems
...
I already made some progress, especially when it comes to querying all
data I need for the updates. I work a lot with the indexes now.
But I still have problems with inserting, and also deleting, nodes. It
doesn't matter if I insert/delete nodes via a Java program or in the
editor of the BaseX GUI: both are quite slow. Inserting just one node in
the GUI with an XQuery like this one takes up to 3 seconds:
insert node <related_record><title>Test title</title><author>Joe
Lastname</author></related_record> into db:open-id('Database_Name',

...
Deleting a node with the following command takes up to 7 seconds:
delete node db:open-id('Database_Name', 88085737)
The problem is that in my case, I have to do about 150000 inserts and
deletes, so it would take too much time.
Maybe my database is just too big to be performant? Or some settings are
wrong? I'm very new to BaseX (and XML databases in general), so maybe
there are just some errors I don't see. I also give you some information
on my database that I copied from the info screen of the BaseX GUI:
Database Properties
 NAME: Database_Name
 SIZE: 2568 MB
 NODES: 135607105
 DOCUMENTS: 1
 BINARIES: 0
 TIMESTAMP: 2018-08-07T07:05:56.000Z
 UPTODATE: true
Resource Properties
 INPUTPATH: /path/to/file.xml
 INPUTSIZE: 1774 MB
 INPUTDATE: 2018-07-24T14:32:58.000Z
Indexes
 TEXTINDEX: true
 ATTRINDEX: true
 TOKENINDEX: false
 FTINDEX: false
 TEXTINCLUDE:
 ATTRINCLUDE:
 TOKENINCLUDE:
 FTINCLUDE:
 LANGUAGE: English
 STEMMING: false
 CASESENS: false
 DIACRITICS: false
 STOPWORDS:
 UPDINDEX: true
 AUTOOPTIMIZE: false
 MAXCATS: 100
 MAXLEN: 96
 SPLITSIZE: 0
Best regards,
Michael