Hello,
I will have hundreds of databases, each containing about 100 XML documents. I want to devise an algorithm where, given part of an XML file name, I can find out which database(s) contain it, or get an empty result if the document is not currently present in any database. Based on that, I would add the current document to the database. The goal is to always keep the latest version of a document in the DB, removing the older version when adding the newer one.
So far, the only way I could come up with is:
    for $db in all-databases:
        open $db
        $fileNames = list $db
        for $eachFileName in $fileNames:
            if $eachFileName.contains(sub-xml-filename):
                add $db to ret-list-db
    return ret-list-db
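In BaseX XQuery, the scan described above might look roughly like this (a sketch; $part stands for the sub-file-name fragment and is an assumed placeholder):

```xquery
(: Scan every database for documents whose path contains $part. :)
(: This is linear in the total number of documents across all   :)
(: databases, which is what makes it slow.                      :)
let $part := 'some-name-fragment'
for $db in db:list()
where some $path in db:list($db) satisfies contains($path, $part)
return $db
```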
The above algorithm seems highly inefficient. Is there any indexing that can be done? Would you suggest that, for each document insert, I maintain a separate XML document listing each file inserted?
Once I get hold of the above list of databases, I would eventually delete that file and insert the latest version of it (which would have the same sub-XML file name). So constantly updating this external document also seems painful (a map, maybe?).
Also, would it be faster to use XQuery script files invoked from Java code, or the Java API, for such operations?
How do you all deal with such operations?
- Mansi
Hi Mansi,

I have a similar situation. I don't think there's a fast way to get documents by knowing only part of their names; it seems you need to know the exact name. In my case, we might be able to group documents by a common id, so we might create subfolders inside the DB and store/get the contents of a subfolder directly, which is pretty fast. I've also tried indexing, but insertions got really slow (I assume because indexing is not granular: it indexes all values), and we need performance.

Oh, I've also tried using starts-with() instead of contains(), but it seems it does not pick up indexes.

Martín.
Date: Fri, 28 Aug 2015 16:52:37 -0400
From: mansi.sheth@gmail.com
To: basex-talk@mailman.uni-konstanz.de
Subject: [basex-talk] Finding document based on filename
How about something like this:
    let $docs := collection('/mydir')/*
    for $doc in $docs
    return
      if (matches(document-uri(root($doc)), '^.+somestring$'))
      then $doc
      else ()
Cheers,
Eliot
----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com
Hello Martin,
I would like to add that Christian just implemented selective indexes, so if you want to index in a more granular fashion, this should now be possible. See https://github.com/BaseXdb/basex/issues/59 for more details. Of course, this is not stable software yet, so use it with care. But we are happy about feedback, as always.
Regarding the initial issue: I do think using subfolders in the database should be the easiest and fastest way. Is there any reason not to use distinct directories instead of encoding the grouping into the file name with some pattern?
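The subfolder layout suggested here could be used like this in BaseX (a sketch; the database and path names are made-up placeholders):

```xquery
(: Store a document under a group-specific path. :)
db:add('mydb', doc('doc-1.xml'), 'group-4711/doc-1.xml')
```

and later fetch the whole group by path prefix, without scanning any file names:

```xquery
(: Returns all documents whose path starts with 'group-4711'. :)
db:open('mydb', 'group-4711')
```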
Cheers,
Dirk
I forgot one thing: I got much better performance by just calling replace rather than delete and insert, but this is a DB with more than one million records. If performance is not important, I believe either way will do.

Martín.
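In BaseX XQuery, the replace Martín mentions is a single call that overwrites the old version (or adds the document if it is not there yet). The database and path names below are made-up placeholders:

```xquery
(: Atomically replace the stored document with the new version. :)
db:replace('mydb', 'group-4711/doc-1.xml', doc('new-version.xml'))
```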
Thanks, guys, for all the expert comments. Currently, I am experimenting with the performance of just deleting and inserting using the Java API. If this process takes a tiny bit longer, I don't really care, is what I figured :) If it becomes unacceptable, I will use one of these suggestions.
Thanks once again.
    // List all databases known to this BaseX context.
    StringList databases = List.list(context);
    for (String database : databases) {
        try {
            // List the resources in this database and check each path.
            String[] fileNames = query("db:list('" + database + "')").split(" ");
            for (String fileName : fileNames) {
                if (fileName.contains(XMLFileName.split("_")[1])) {
                    // Matching document found: delete it so the newer
                    // version can be inserted afterwards.
                    query("db:delete('" + database + "','" + fileName + "')");
                    logger.info("Deleted " + fileName + " from " + database);
                    retVal = true;
                    break;
                }
            }
        } catch (BaseXException e) {
            e.printStackTrace();
        }
    }
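An alternative to the Java loop above would be to push the whole find-and-delete into a single updating XQuery, so only one query is sent per run instead of one per database. This is a sketch under the same assumptions as the Java code ($part plays the role of XMLFileName.split("_")[1]):

```xquery
(: Delete, across all databases, every resource whose path :)
(: contains the given name fragment.                       :)
let $part := 'some-name-fragment'
for $db in db:list()
for $path in db:list($db)
where contains($path, $part)
return db:delete($db, $path)
```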