Hi Christian,
Thank you for coming back so quickly.
One way out (until this has been fixed) is to optimize these databases in regular time intervals.
I’ve been doing this on one of my databases and it does work - it’s just another thing to remember to do! It’s a large database and the index speeds up the queries I need by so much (my workload is query, replace, query, replace) that UPDINDEX makes a huge difference; running db:optimize() after each replace was too slow.
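For reference, what I’ve ended up with looks roughly like the sketch below (the database and file names are just placeholders for my real ones): each replace runs as its own query, and db:optimize() runs as a separate query once every so many batches rather than after every replace.

    (: replace step, run for each incoming document; "mydb" and the paths
       are placeholder names :)
    db:replace("mydb", "docs/item-1.xml", doc("/incoming/item-1.xml"))

    (: optimise step, run as its own query every N batches; this rebuilds
       the indexes and brings atvl.basex back down to its proper size :)
    db:optimize("mydb")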
I’ve spent some time pulling apart the index files to understand what’s going on inside and provide this as much for reference as anything:
I don't know the format for the index files but I've looked at atvl.basex just in a text editor. It looks like for each update to the index around 40k blank lines are being added. I don't know that they are truly blank lines - but that's how they're rendering in the editor.
This sounds surprising, but it could be an interesting hint. If you manage to compress this file to a reasonable size, feel free to send it to me.
I do know the format for the files now and can confirm that the blank lines were just a red herring. The difference between the attribute IDs in the repeating test data I was using happened to be 12, and my editor rendered that byte value as a new line.
Instead, newly created ID lists will always be appended to the end of this file, resulting in a continuous increase of the file size.
This is absolutely true for db:add(). If a new attribute is added, say with value 1, then a new list of all the IDs with value 1 is appended to the end of the index file and the old list is left orphaned.
However, the behaviour is different when using db:replace(), which I think effectively does a db:delete() followed by a db:add(). First, the ID list for that attribute value is rewritten in place in the index file: the count is reduced (from 2048 to 2047, for example) and only the IDs that remain once the replaced document is removed are kept, while the now unused bytes at the end of the list keep their previous values. Then, as the replacement attribute is added, a completely new ID list (with the count back up to 2048) is appended to the end of the file.
In short then: ID lists are updated in place if they get shorter but appended to the end of the file if they get longer.
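To make that concrete, the minimal test I’ve been using looks something like this (names and paths are placeholders). The first query replaces a stored document with a copy of itself, so there is no new content; the second, read-only query just reports the size of atvl.basex, which still grows after each replace because the new ID lists are appended rather than reusing the orphaned space.

    (: query 1, updating: replace a stored document with a copy of itself :)
    let $old := db:open("test", "doc-1.xml")
    return db:replace("test", "doc-1.xml", $old)

    (: query 2, read-only: check the index file size on disk; adjust the
       path to wherever your data directory lives :)
    file:size("/path/to/BaseXData/test/atvl.basex")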
[As a note: there seems to be a small bug when UPDINDEX is true, in that an index file is always at least 4096 bytes. When an empty database is created the index file is 4096 zero bytes, with updates appended to the end; even if you optimize, the file is padded to 4096 bytes with zeros.]
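You can see that floor with the same file:size() check as above (again, the path is just a placeholder for your data directory):

    (: reports 4096 for a freshly created, empty database with UPDINDEX true;
       the file stays padded to at least 4096 bytes even after an optimize :)
    file:size("/path/to/BaseXData/empty-db/atvl.basex")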
I can see that there are ways to work around the ever-growing index, but if there is a way to prevent it happening I think it would be very beneficial. BaseX is so easy to get started with that I push all sorts of things into it because I can do things quickly - I’m sure others do too - but the indexes make such a difference to speed in my use cases that I’d love to be able to do everything with UPDINDEX set to true and just forget about it. I think the whole file is rewritten each time too, which means there is more and more to write to disk on every update (I was doing an optimise every 1000 replaces, so it was still getting to be a big file!), and that must come with a time overhead.
How fixed is the index file format? I ask because I’ve spent some time understanding how it works so I can read the files and see exactly what’s in them. If it would be useful then I’m happy to put the information into the wiki somewhere to make it quicker for anyone else who’s interested. However if you want to keep the structure obscure for any reason then I won’t publish anything. Let me know.
Many thanks, James
On 15 Jul 2014, at 12:14, Christian Grün christian.gruen@gmail.com wrote:
Hi James,
The issue I'm seeing is that the size of the index grows by approximately 1MB with every updating 'transaction' (snapshot?) even if there is no new data for the index. For example if I have a database with 100,000 files and I replace one of those files (with itself so there's no new data) then the size of the index will go up by around 1MB. If I replace 1000 files in the same transaction (again with themselves) the size of the index will go up again by around 1MB. Dropping and recreating the index returns it to its original size. I have a current project where I'm expecting thousands of files a few at a time that need to be added/replaced - I completely ran out of disk space before I spotted what was happening when testing.
I can confirm that this is a known issue of the UPDINDEX option. We haven't had time to dive into this yet (and it doesn't seem to cause trouble in all scenarios we know). I assume the reason is that obsolete ID lists in atvl.basex will not be overwritten by newer data, but instead are orphaned. Instead, newly created ID lists will always be appended to the end of this file, resulting in a continuous increase of the file size.
I don't know the format for the index files but I've looked at atvl.basex just in a text editor. It looks like for each update to the index around 40k blank lines are being added. I don't know that they are truly blank lines - but that's how they're rendering in the editor.
This sounds surprising, but it could be an interesting hint. If you manage to compress this file to a reasonable size, feel free to send it to me.
Best, Christian