Hi E. Wray,
I have attached a small XQuery example that adds files, archives and archive contents to a database. It’s probably not the most efficient solution, so feel free to improve it or to ask more questions.
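As a rough illustration of the approach (a sketch, not the attached file), the query below walks a folder, adds well-formed XML as documents, stores everything else as raw, and does the same for the entries of .zip and .docx archives. The database name 'mirror', the folder '/data/docs/' and the '.contents/' path suffix are hypothetical; the database is assumed to exist already, and db:store is assumed to be available under that name (as in BaseX 8 and 9).

```xquery
(: Hypothetical names: adjust the database and the folder to your setup :)
declare variable $db   := 'mirror';
declare variable $root := '/data/docs/';  (: must end with a slash :)

(: Adds one item as XML if it is well-formed, otherwise as a raw file :)
declare %updating function local:add(
  $path   as xs:string,
  $binary as xs:base64Binary
) {
  let $doc := try { parse-xml(convert:binary-to-string($binary)) } catch * { () }
  return if (exists($doc)) then db:add($db, $doc, $path)
         else db:store($db, $path, $binary)
};

for $rel in file:list($root, true())
let $abs := $root || $rel
where file:is-file($abs)
let $binary := file:read-binary($abs)
let $target := replace($rel, '\\', '/')  (: normalize Windows separators :)
return if (matches($rel, '\.(zip|docx)$', 'i')) then (
  (: keep the archive itself as raw ... :)
  db:store($db, $target, $binary),
  (: ... and add each of its file entries below a separate path :)
  for $entry in archive:entries($binary)
  where not(ends-with($entry, '/'))
  return local:add($target || '.contents/' || $entry,
                   archive:extract-binary($binary, $entry))
) else (
  local:add($target, $binary)
)
```

With this layout, the XML parts of a .docx file (such as word/document.xml) end up as regular documents, so their text is covered by the database indexes. Every file is read into memory, so very large folders may need to be processed in batches.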
I agree that your use case is an enticing one: We also use BaseX to process office files, and Rositsa Shadura wrote an interesting thesis on that topic [1]. As Dirk pointed out, it turned out that we didn’t want to choose one particular solution, and the XQuery approach is currently the most flexible one.
Hope this helps, Christian
[1] http://basex.org/about-us/publications
On Wed, Nov 25, 2015 at 5:43 PM, Dirk Kirsten dk@basex.org wrote:
Hello,
which problems did you encounter? This should be solvable with a small XQuery script: basically, you express what you describe in natural language as XQuery, so that our processor understands it.
I don't think it would make sense to add such a specific format. There are simply way too many possible combinations: you want archive files extracted, others might not. In the end we would end up with a very complex definition language. And what's the point if we already have a standardized query language like XQuery, which can achieve the same thing?
Cheers Dirk
On 11/25/2015 05:38 PM, E. Wray Johnson wrote:
Here is what I want to do: For a given folder and all its subfolders on my physical drive, mirror its contents, including the contents of archives, parsing xml, json, html, text, etc. with their respective parsers, skipping invalid files, and adding all other files as raw. I want archive files (*.zip, *.docx) to be added as raw; however, I want the text inside archive files like docx (ms-word) to be indexed, and any files in the archive files that match a filter to be indexed.
Note: It would be nice if there was a single db:add method that allowed me to specify a map of filters to parsers with options, where all files that do not match a filter (or are invalid) will be optionally added as raw.
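No such built-in option is described in this thread, but the idea can be approximated with a user-defined function. The following is a minimal sketch: $parsers, local:add-filtered and the regular-expression filters are hypothetical names chosen for illustration, and db:store is assumed to be available under that name (as in BaseX 8 and 9).

```xquery
(: Hypothetical map of filters (regular expressions) to parser functions :)
declare variable $parsers := map {
  '\.xml$'   : function($bin) { parse-xml(convert:binary-to-string($bin)) },
  '\.json$'  : function($bin) { json:parse(convert:binary-to-string($bin)) },
  '\.html?$' : function($bin) { html:parse($bin) }
};

(: Parses the input with the first matching parser; files that match
   no filter, or that cannot be parsed, are stored as raw :)
declare %updating function local:add-filtered(
  $db   as xs:string,
  $path as xs:string,
  $bin  as xs:base64Binary
) {
  let $pattern := map:keys($parsers)[matches($path, ., 'i')][1]
  let $doc := if (exists($pattern))
              then try { $parsers($pattern)($bin) } catch * { () }
              else ()
  return if (exists($doc)) then db:add($db, $doc, $path)
         else db:store($db, $path, $bin)
};

(: Usage with a hypothetical database and file :)
local:add-filtered('mirror', 'notes/todo.json',
                   file:read-binary('/data/docs/notes/todo.json'))
```

Keeping the parsers in a map means that a new format only needs one more entry, and the raw fallback stays in a single place.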