Hi,
I'm making use of the Java bindings in BaseX, with some of the functions returning List<String> and Set<String> types.
For List<String> I can adapt that to a sequence using:
declare namespace List = "java:java.util.List";
declare function util:list-to-sequence($list) {
  for $n in 0 to List:size($list) - 1
  return List:get($list, $n cast as xs:int)
};
however, I'm not sure how to do the equivalent for Set<String> (or more generally, any Iterator<T>) without converting it to a list or array first, as Set only has size() and iterator() methods. Has anyone done this before?
The best I can come up with is the following, which relies on the size of the set and the number of next calls on the iterator being the same (where it should be checking hasNext):
declare namespace Set = "java:java.util.Set";
declare namespace Iterator = "java:java.util.Iterator";
declare function util:set-to-sequence($set) {
  let $iterator := Set:iterator($set)
  for $n in 0 to Set:size($set) - 1
  return Iterator:next($iterator)
};
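For comparison, here is a minimal sketch of the hasNext-driven loop on the Java side, assuming a small helper class can be put on the classpath. `IteratorAdapter` and `toList` are hypothetical names, not BaseX functions. It does copy into a list, which is the step I would like to avoid, but it shows the check that the size-based loop above is missing:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class for illustration; not part of BaseX.
final class IteratorAdapter {
    private IteratorAdapter() {}

    // Drains any Iterator into a List, checking hasNext() before each next().
    static <T> List<T> toList(Iterator<T> iterator) {
        List<T> items = new ArrayList<>();
        while (iterator.hasNext()) {
            items.add(iterator.next());
        }
        return items;
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order, so the output is deterministic.
        Set<String> tags = new LinkedHashSet<>(List.of("NN", "VB"));
        System.out.println(toList(tags.iterator())); // prints [NN, VB]
    }
}
```

The resulting List could then be passed through util:list-to-sequence as above.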
More generally, it would be helpful for BaseX to have adapters for Java arrays, Lists, Sets, Maps, and Iterables/Iterators to XQuery (XDM) types and functions to construct them in XQuery (like my util:list-to-sequence function above).
Kind regards, Reece
Hi Reece,
Interesting thoughts. All I can say is that your iterator approach for sets looks pretty similar to something that I tried in the past.
More generally, it would be helpful for BaseX to have adapters for Java arrays, Lists, Sets, Maps, and Iterables/Iterators to XQuery (XDM) types and functions to construct them in XQuery (like my util:list-to-sequence function above).
One way would be to add new built-in functions to BaseX (in the Conversion Module, or in a new Java Module) that provide custom conversions for Java data structures. I guess it might be cleaner to convert lists and sets to arrays, as those data structures can also contain null references.
The main reason why we didn’t push this any further was that we didn’t want to give users additional incentives to resort to Java code. Many things can also be done in XQuery, and as the XQuery-Java mapping for data types can never be perfect, we found that users often stumbled over these things in the beginning. However, quite obviously, there are always use cases in which a direct data exchange between XQuery and Java is helpful, and less cumbersome than writing custom Java functions with custom entry points for XQuery function calls (as e.g. documented in [2]).
Maybe it would be good indeed to realize the set of additional functions as an XQuery module. We still haven’t defined a canonical way to promote and document external BaseX XQuery Modules – some users may remember that we have assembled existing modules on our server some time ago [1]; other modules, such as Leo’s algorithms and data structures, can be found on private repositories [3] – so ideas on how to get this better organized are welcome.
Cheers, Christian
[1] https://files.basex.org/modules/ [2] https://docs.basex.org/wiki/Repository#Combined [3] https://github.com/LeoWoerteler/xq-modules
On Tue, 29 Jun 2021 at 10:36, Christian Grün christian.gruen@gmail.com wrote:
Hi Reece,
Interesting thoughts. All I can say is that your iterator approach for sets looks pretty similar to something that I tried in the past.
More generally, it would be helpful for BaseX to have adapters for Java arrays, Lists, Sets, Maps, and Iterables/Iterators to XQuery (XDM) types and functions to construct them in XQuery (like my util:list-to-sequence function above).
One way would be to add new built-in functions to BaseX (in the Conversion Module, or in a new Java Module) that provide custom conversions for Java data structures. I guess it might be cleaner to convert lists and sets to arrays, as those data structures can also contain null references.
It would be useful to have Java Collection to sequence, Java Collection to array(*) and Java Map to map(*) converters. Either the conversion module or a Java helper module would be useful. Saxon does the Collection to sequence automatically in its Java bindings - https://www.saxonica.com/documentation10/index.html#!extensibility/functions... .
My rationale for not converting them to arrays is to avoid a performance overhead when dealing with a large number of items, but I can see how null values could be complicated to manage if the BaseX sequence interface doesn't do flattening itself (otherwise, you could map null to the empty sequence instance like with the general Java mapping).
Additionally, I'm working in Kotlin and have the list values as non-nullable types, so that won't be an issue for my particular use case.
The main reason why we didn’t push this any further was that we didn’t want to give users additional incentives to resort to Java code. Many things can also be done in XQuery, and as the XQuery-Java mapping for data types can never be perfect, we found that users often stumbled over these things in the beginning. However, quite obviously, there are always use cases in which a direct data exchange between XQuery and Java is helpful, and less cumbersome than writing custom Java functions with custom entry points for XQuery function calls (as e.g. documented in [2]).
Yeah. I'm experimenting with NLP and am passing the text through a tokenization, stemming/lemmatization, part of speech, etc. pipeline which looks something like this:
let $tokens := nlp:tokenize($node) => nlp:lemmatize() => nlp:pos-tag() => util:list-to-sequence()
for $token in $tokens
let $text := Token:get-text($token)
let $part-of-speech := util:set-to-sequence(Token:get-part-of-speech($token))
return <span class="token" title="{string-join($part-of-speech, ",")}">{$text}</span>
I'm using Java (Kotlin more accurately) to do the logic that needs state to implement (and possibly share with other projects), and tying it together in XQuery.
Maybe it would be good indeed to realize the set of additional functions as an XQuery module. We still haven’t defined a canonical way to promote and document external BaseX XQuery Modules – some users may remember that we have assembled existing modules on our server some time ago [1]; other modules, such as Leo’s algorithms and data structures, can be found on private repositories [3] – so ideas on how to get this better organized are welcome.
There is http://cxan.org/ but I don't know how active it currently is.
Kind regards, Reece
It would be useful to have Java Collection to sequence, Java Collection to array(*) and Java Map to map(*) converters. Either the conversion module or a Java helper module would be useful. Saxon does the Collection to sequence automatically in its Java bindings - https://www.saxonica.com/documentation10/index.html#!extensibility/functions....
Thanks for the link. I didn’t know that Saxon treats values differently depending on whether they are returned by a constructor or a method. This looks like a pragmatic solution, but I’m not completely convinced that it’s a good idea. From my point of view, we shouldn’t distinguish between a data structure returned by a constructor and one returned by a (possibly static generator) function.
My rationale for not converting them to arrays is to avoid a performance overhead when dealing with a large number of items, but I can see how null values could be complicated to manage if the BaseX sequence interface doesn't do flattening itself (otherwise, you could map null to the empty sequence instance like with the general Java mapping).
I agree, I would also prefer sequences to arrays whenever possible, as they are supported much better in XQuery.
From a performance point of view, arrays and sequences are internally based on the same data structure, so it shouldn’t make a difference if you create arrays or sequences. I was mostly thinking about cases in which null values are deliberately added to a list. If such a list is converted to a sequence, the positions of the resulting sequence entries wouldn’t reflect the original positions anymore.
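To illustrate the position shift, here is a small Java sketch, assuming a conversion that simply drops null entries. `NullPositions` and `withoutNulls` are hypothetical names used only for this example, not BaseX code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical demo class; it only mimics a null-dropping conversion.
final class NullPositions {
    private NullPositions() {}

    // Returns a copy of the list without its null entries.
    static <T> List<T> withoutNulls(List<T> items) {
        return items.stream().filter(Objects::nonNull).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("a", null, "c");
        List<String> flattened = withoutNulls(original);
        System.out.println(original.indexOf("c"));  // prints 2
        System.out.println(flattened.indexOf("c")); // prints 1
    }
}
```

The entry "c" moves from index 2 to index 1 once the null is gone, which is the mismatch I had in mind.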
Having said that, I checked the existing code and I noticed two things:
1. If Java arrays are returned, null entries are already ignored by our mapper (so it would just be consistent to do the same for other data structures).
2. If a String array contains null values, an unexpected runtime error is currently raised because we didn’t think of this case [1]…
Saxon does better, it returns an XQuery error – and I think we should do the same:
SXJE0051 Returned array contains null values: cannot convert to items
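A minimal sketch of such a check, under the assumption that the converter inspects the array before mapping it. The class name, method names, exception type and message are placeholders, not the actual BaseX or Saxon code:

```java
// Hypothetical sketch of a null check performed before converting an array.
final class ArrayNullCheck {
    private ArrayNullCheck() {}

    // Returns the index of the first null entry, or -1 if there is none.
    static int firstNullIndex(Object[] values) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == null) {
                return i;
            }
        }
        return -1;
    }

    // Rejects arrays that contain null with a descriptive error
    // instead of failing later with an unexpected runtime exception.
    static String[] requireNoNulls(String[] values) {
        int index = firstNullIndex(values);
        if (index >= 0) {
            throw new IllegalArgumentException(
                "Returned array contains a null value at index " + index);
        }
        return values;
    }

    public static void main(String[] args) {
        try {
            requireNoNulls(new String[] {"a", null, "c"});
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```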
Out of interest, and as you seem to have worked with both the Saxon and BaseX Java mapping: Did you encounter other mapping details that you believe are handled better in one of the processors?
When the Java mapping was introduced, there were no maps and arrays visible on the horizon. I’ll definitely have more on that now.
Cheers, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
On Tue, 29 Jun 2021 at 12:31, Christian Grün christian.gruen@gmail.com wrote:
Out of interest, and as you seem to have worked with both the Saxon and BaseX Java mapping: Did you encounter other mapping details that you believe are handled better in one of the processors?
I've not actually used Saxon's Java bindings, so I can't go into more details other than what the documentation says. This is currently the only project I'm using Java bindings for.
I'm only aware of the Saxon logic as I plan at some point to have Java integration in my XQuery plugin so you can navigate to the Java class/method/etc., have it auto-complete methods, and perform some static analysis like checking the number of arguments.
Kind regards, Reece
Hi Reece,
I implemented an initial version of convert:from-java [1, 2].
Looking forward to your feedback and further suggestions, Christian
[1] https://github.com/BaseXdb/basex/issues/2017 [2] https://files.basex.org/releases/latest/
On Tue, 29 Jun 2021 at 15:49, Christian Grün christian.gruen@gmail.com wrote:
Hi Reece,
I implemented an initial version of convert:from-java [1, 2].
Great, thanks.
Looking forward to your feedback and further suggestions,
Trying to use convert:from-java on a list of a custom Java object, I get:
[convert:java] Java object cannot be converted: "Word(text=test, normalized=test)".
It should just marshal the Java object as is done with the Java interop in this case.
My initial testing on other cases (set-to-sequence) indicates that it is slightly faster than the XQuery code I had, although I haven't measured them in isolation, just on a test example I have.
Kind regards, Reece
Thanks for testing.
Trying to use convert:from-java on a list of a custom Java object, I get: [convert:java] Java object cannot be converted: "Word(text=test, normalized=test)".
It should just marshal the Java object as is done with the Java interop in this case.
As the function triggers an explicit conversion, I’d like to inform the caller that no conversion is possible (at least for now; we may find other Java types that could be converted). I’ll certainly need to add that to the documentation.
I’m still not sure if I will keep the recursive conversion (maybe that should be controlled via an additional argument). For maps, it’s certainly helpful.
I have finalized the convert:from-java function [1], and I have added new conversion rules for XQuery arrays [2]. As it was about time, I have also revised our documentation on Java Bindings [3].
Cheers Christian
[1] https://docs.basex.org/wiki/Conversion_Module#convert:from-java [2] https://github.com/BaseXdb/basex/issues/2020 [3] https://docs.basex.org/wiki/Java_Bindings#Data_Types
Hi all, hi Reece,
I have remastered the conversion of Java values: Objects of unknown type are now returned as function items, and the conversion of the contained value can be enforced by invoking the function item:
declare namespace Scanner = 'java:java.util.Scanner';
let $scanner := Scanner:new("A B C") => Scanner:useDelimiter(" ")
return $scanner()
If no conversion rule is defined for a Java object type, the string representation (i.e., the one that’s returned by Object.toString) is returned. The convert:from-java() function is obsolete now, so it was kicked out again.
Furthermore, the middle dot extension was enhanced to support array arguments (such as byte[]): As square brackets are illegal QName characters, array types can now be addressed with three dots:
Q{java.lang.String}new·byte...(xs:hexBinary('414243'))
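For readers less familiar with the Java side, the call above is presumably equivalent to invoking the String constructor that takes a byte[]. Here is a minimal sketch in plain Java. `ByteArrayConstructor` is a hypothetical class name, and the explicit UTF-8 charset is an assumption made to keep the output deterministic (the one-argument constructor uses the platform default):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing the Java constructor addressed by the byte... syntax.
final class ByteArrayConstructor {
    private ByteArrayConstructor() {}

    // Decodes the bytes as UTF-8 text.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0x41 0x42 0x43 are the bytes behind xs:hexBinary('414243').
        byte[] bytes = {0x41, 0x42, 0x43};
        System.out.println(decode(bytes)); // prints ABC
    }
}
```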
More details can be found in the updated documentation [1]. Everyone’s feedback is welcome.
Best, Christian
[1] https://docs.basex.org/wiki/Java_Bindings
On Fri, 9 Jul 2021 at 13:01, Christian Grün christian.gruen@gmail.com wrote:
Hi all, hi Reece,
I have remastered the conversion of Java values: Objects of unknown type are now returned as function items, and the conversion of the contained value can be enforced by invoking the function item:
Thanks. I've ported my code over to the latest 9.6 dev release, which is working aside from a strange caching issue.
The following produces the correct output in the BaseX GUI:
---
declare namespace String = "java:java.lang.String";

declare function local:tokenize($text as xs:string) {
  String:split($text, " ")
};

local:tokenize("Lorem ipsum dolor"),
local:tokenize("sed emit consecutor")
---
Note: I'm using `String:split($text, " ")` here as a demonstration of the issue.
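For reference, here is a minimal Java sketch of what the underlying String.split call returns for the two inputs. `SplitDemo` is a hypothetical class name used only for illustration:

```java
import java.util.Arrays;

// Hypothetical demo class; it calls the same java.lang.String method directly.
final class SplitDemo {
    private SplitDemo() {}

    // Splits the text on single spaces, like String:split($text, " ").
    static String[] tokenize(String text) {
        return text.split(" ");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tokenize("Lorem ipsum dolor")));   // prints [Lorem, ipsum, dolor]
        System.out.println(Arrays.toString(tokenize("sed emit consecutor"))); // prints [sed, emit, consecutor]
    }
}
```

Each call depends only on its own argument, so a repeated result would suggest the problem is in the binding layer rather than in the Java method.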
However, if I take my https://github.com/rhdunn/document-viewer code running on the BaseX HTTP server (via bin/basexhttp on AdoptOpenJDK 11.0.7+10), and in src/modules/html.xqy add:
---
declare namespace String = "java:java.lang.String";

declare function html:tokenize($text as xs:string) {
  String:split($text, " ")
};
---
and then modify the text() case in html:simplify from:
---
if (contains($node, "margin-bottom: ")) then () else $node
---
to
---
if (contains($node, "margin-bottom: ")) then () else text { html:tokenize($node) }
---
I just see whitespace (as if it is caching the first $node value). Changing it to:
---
if (contains($node, "margin-bottom: ")) then ()
else if (normalize-space($node) eq "") then $node
else text { html:tokenize($node) }
---
then I see the first non-whitespace text node repeated.
If I then replace the `String:split($text, " ")` call with `tokenize($text)` I don't see the issue, so it seems to be related to the Java interop being cached.
Kind regards, Reece
Hi Reece,
That was a helpful hint. Some caching was going on indeed; it was introduced in a much older version of BaseX, and I noticed it did strange things in more recent versions. A new snapshot is available.
Best, Christian
On Fri, 23 Jul 2021 at 13:34, Christian Grün christian.gruen@gmail.com wrote:
Hi Reece,
That was a helpful hint. Some caching was going on indeed; it was introduced in a much older version of BaseX, and I noticed it did strange things in more recent versions. A new snapshot is available.
Thanks for the fast turnaround. I can confirm that the snapshot fixes that issue.
Kind regards, Reece
basex-talk@mailman.uni-konstanz.de