Hi everybody.
I'm back with more questions. Thank you for your patience. ^^
I have a database of trademarks with a full-text index on two elements: *:mark-identification,*:party-name [1].
Here, "mark-identification" contains the name of the trademark, and "party-name" contains the name of the trademark's owner.
I use the full-text index to search trademarks by name, for example:
    for $results in //case-file[case-file-header/mark-identification/text() contains text {'basex'}]
    return $results//mark-identification
This returns all trademarks with "basex" in their name. It is lightning fast: 15 ms to get 3 records among 2,134,434,598 nodes. Really a dream. [2]
But if I replace "basex" with a word that is common in "party-name", for example "corporation" (1096187x occurrences in the full-text index, as shown in [1]; it is a very common word in the names of trademark owners):
    for $results in //case-file[case-file-header/mark-identification/text() contains text {'corporation'}]
    return $results//mark-identification
It takes a long time to get 6,715 records: 62,000 ms. [3]
If I search for "live" (a common word in trademark names, but not in owner names), I get 5,875 records in 2,773 ms, which is out of proportion to the 62,000 ms needed to get the roughly 6k records for "corporation". [4]
So...
- Is this expected behaviour?
- Is there a way to specify which "section" of the full-text index should be used for the search? (I don't know... maybe something similar to "using stemming", but "using index 'mark-identification'".)
Please excuse me if I'm asking something illogical.
Best regards, Sebastian
[1] https://imgur.com/uLla1Xt
[2] https://imgur.com/Fkcvv2O
[3] https://imgur.com/Hk71CNe
[4] https://imgur.com/P72k574
Hi Sebastian,
Yes, I think your search on mark-identification suffers from the huge number of party-names.
From what I remember, the inverted index (from full-text tokens to node ids) is shared across all element names, so filtering on the element name is done last.
When I was using BaseX to handle the DOCDB patent database, I used to split each document into sub-documents containing only the keys and the text to be indexed, one per language and XML element, and then build separate databases. That way I could create a dedicated full-text index for a single (element name, language) combination.
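The effect described above can be modelled in a few lines. The following Python sketch is only a toy model, not BaseX's actual implementation: the sample data, the function names and the posting layout are all assumptions made for illustration. It compares one token index shared by all elements, where the element name is checked last, with a dedicated index built from a single element.

```python
from collections import defaultdict

# Toy records: (node id, element name, text). Hypothetical data, not taken from the real database.
NODES = [
    (1, "mark-identification", "basex"),
    (2, "party-name", "acme corporation"),
    (3, "party-name", "globex corporation"),
    (4, "mark-identification", "corporation one"),
    (5, "party-name", "initech corporation"),
]

def build_shared_index(nodes):
    """One inverted index for all elements: token -> [(node id, element name)]."""
    index = defaultdict(list)
    for node_id, element, text in nodes:
        for token in text.split():
            index[token].append((node_id, element))
    return index

def search_shared(index, token, element):
    """Walk every posting of the token, then filter on the element name last."""
    postings = index.get(token, [])
    hits = [node_id for node_id, name in postings if name == element]
    return hits, len(postings)

def build_dedicated_index(nodes, element):
    """Separate index built from a single element only: token -> [node id]."""
    index = defaultdict(list)
    for node_id, name, text in nodes:
        if name == element:
            for token in text.split():
                index[token].append(node_id)
    return index

def search_dedicated(index, token):
    """Every posting is already a hit, so nothing has to be filtered out."""
    postings = index.get(token, [])
    return list(postings), len(postings)

shared = build_shared_index(NODES)
dedicated = build_dedicated_index(NODES, "mark-identification")

shared_hits, shared_scanned = search_shared(shared, "corporation", "mark-identification")
dedicated_hits, dedicated_scanned = search_dedicated(dedicated, "corporation")

# Both searches find the same node; they differ in how many postings were visited.
print(shared_hits, shared_scanned)
print(dedicated_hits, dedicated_scanned)
```

In this model the cost of the shared search grows with the total number of occurrences of the token in all indexed elements, which would match the observation that "corporation" is slow even though few trademark names contain it.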
Does that help?
I really appreciated working with BaseX at that time, because others were in a kind of Java/relational mapping hell... Me, I just had to add XML documents, reindex, and sometimes purge deleted items.
Best, Fabrice
________________________________
From: BaseX-Talk basex-talk-bounces@mailman.uni-konstanz.de on behalf of Sebastian Guerrero chapeti@gmail.com
Sent: Monday, 18 May 2020 17:23
To: BaseX basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Full-text index: searches for common words in another node. Does it take a lot of time?
Hi Fabrice!
Thanks a lot for your advice. Yes, it's a good idea. And yes, it works.
I created a separate index (a new database) for 'mark-identification':
    for $db in ('US00','US01','US02')
    let $index := <index>{
      for $cases in db:open($db)/trademark-applications-daily/application-information/file-segments/action-keys/case-file
      group by $text := $cases/case-file-header/mark-identification
      return
        <text>
          <value>{$text}</value>
          <nodes>
            {for $node in $cases return <id>{ db:node-id($node) }</id>}
          </nodes>
        </text>
    }</index>
    return db:create($db || '-mark-text', $index, $db || '-mark-text.xml')
Of course, with a full-text index on 'value'.
To search, I use this piece of code:
    let $text := 'corporation'
    for $db in ('US00','US01','US02')
    for $id in ft:search($db || '-mark-text', $text)/ancestor::text/nodes/id
    let $case-file := db:open-id($db, $id)
    return $case-file
And now it takes only 185 ms to get the results, and there is no scan of the 'party-name' values.
"I really appreciated working with basex that time, because others were in a kind of java/relational mapping hell... Me, I just had to add xml documents, reindex, and sometimes purge deleted items.":
Oh dear, I can't tell you how much I'm in love with BaseX right now. Yes, trying to manage this volume of data and translate it into a SQL database is a Kafkaesque nightmare, not a healthy idea.
Thank you very much! Cheers, Sebastian.
On Mon, May 18, 2020 at 12:43 PM ETANCHAUD Fabrice < fabrice.etanchaud@maif.fr> wrote:
Thank you, Sebastian!
Yes, BaseX is an incredible piece of software, reducing development time by orders of magnitude.
The problems I faced were elsewhere:
* disruptive technology: pure XML technology is not widely known among IT people, and few engineers have even a beginner's level of XPath, XSLT or XQuery. I found that most colleagues did not want to improve their skills in that domain, sticking with Java/JAXB/SAX/SQL (and sometimes even Hibernate...) and finding all kinds of reasons not to embrace this solution.
* not hype (not 'big' data): except for MarkLogic (but in an ashamed fashion, in my opinion), XML is sadly absent from the 'big' data landscape, even though we did not wait for big data tools (map/reduce, JSON...) to handle lots of data!
* management, by the way, was reluctant to give that solution a try...
* who's that guy who does the entire team's job in one week with a solution no one else can maintain?
So yes, I fell in love with BaseX and XML too, but even though it gave me great pleasure, it was (and still is) a kind of secret love, a team and management breaker. I certainly have my part in that situation, with my visceral aversion to things like governance, mediocracy...
All the best from the French west coast, Fabrice Etanchaud
________________________________
From: Sebastian Guerrero chapeti@gmail.com
Sent: Monday, 18 May 2020 20:32
To: ETANCHAUD Fabrice fabrice.etanchaud@maif.fr
Cc: basex-talk@mailman.uni-konstanz.de basex-talk@mailman.uni-konstanz.de
Subject: Re: [basex-talk] Full-text index: searches for common words in another node. Does it take a lot of time?