Hi Martin,
sorry for keeping you waiting, and thanks for the summary of your project.
Storing, indexing and querying gigabytes of XML data should not be a major problem (some outdated statistics can be found here [1]; please note that the created databases did not include any index structures). I assume you have already stumbled upon XQuery Full Text, which also allows you to do text-based search [2].
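To give an idea of what such a full-text query looks like, here is a minimal sketch. The database name 'texts', the element name 'p' and the search term are made up for illustration; db:open is the function name used in the Database Module linked in [3] and may be named differently in other BaseX versions:

```xquery
(: Hypothetical names: database 'texts', element 'p'. :)
for $p in db:open('texts')//p[text() contains text 'annotation']
return $p
```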
Talking about scalability, do you have a rough estimate of the total byte size of the XML documents to be managed? Maybe the easiest thing would be to simply run BaseX and create a first database from an initial collection.
It surely gets more interesting and challenging when the original data is to be changed, i.e. if texts are annotated. In this case, I would recommend keeping the original documents untouched and well-indexed, and storing changes in an additional database. Node IDs could be used as back references [3], and the updates could be merged back into the original data at regular intervals. As more than one database can be addressed by a single query, original and updated nodes can also be merged on the fly, using XQuery Update [4].
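A rough sketch of the on-the-fly variant, assuming a second database 'annotations' whose <annotation> elements carry the node ID of their target in a 'ref' attribute. All database, element and attribute names are invented for this example; db:node-id and db:open-id are taken from the Database Module [3]:

```xquery
(: Hypothetical layout: database 'texts' holds the untouched originals,
   database 'annotations' holds entries such as
   <annotation ref="42">some remark</annotation>,
   where @ref was obtained via db:node-id($node) on a node of 'texts'. :)
for $a in db:open('annotations')//annotation
(: resolve the back reference to the original node :)
let $target := db:open-id('texts', xs:integer($a/@ref))
return <annotated>{ $target, <note>{ $a/node() }</note> }</annotated>
```

This query only reads from both databases and leaves them unchanged; writing the merged result back to the original data would be a job for XQuery Update [4].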
Feel free to ask for more details,
Christian
[1] http://docs.basex.org/wiki/Statistics
[2] http://docs.basex.org/wiki/Full-Text
[3] http://docs.basex.org/wiki/Database_Module#db:open-pre
[4] http://docs.basex.org/wiki/XQuery_Update