Hello!
Just wanted to report back that it works really well. It takes about twice as long as running the md5 command on the command line of my Mac: a 4.15 GB file takes around 20 seconds in BaseX, compared to 10 seconds with the native command.
Not sure if this is a limitation in Java or if performance could be tweaked further. But at the moment it feels unimportant for our case.
Thank you again for your swift reply and delivery!
Regards, Johan Mörén
On Sun Jan 25 2015 at 1:56:21 PM Johan Mörén johan.moren@gmail.com wrote:
Great news Christian. I'll try it out tomorrow at work!
/Johan
On Sun, Jan 25, 2015 at 1:22 PM, Christian Grün <christian.gruen@gmail.com> wrote:
Hi Johan,
A new snapshot is available [1]. In the course of rewriting the hashing code, I further improved our streamlining architecture [2, 3].
Your testing feedback is welcome, Christian
[1] http://files.basex.org/releases/latest/
[2] https://github.com/BaseXdb/basex/commit/b39b7
[3] https://github.com/BaseXdb/basex/commit/28139
On Sat, Jan 24, 2015 at 8:39 PM, Christian Grün christian.gruen@gmail.com wrote:
Thanks, this makes it much easier. I'll probably go for this one:
MessageDigest md = MessageDigest.getInstance(algo);
try(InputStream is = ...) {
  try(DigestInputStream dis = new DigestInputStream(is, md)) {
    while(dis.read() != -1);
  }
  return md.digest();
}
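For readers who want to try the DigestInputStream idea outside BaseX, here is a minimal, self-contained sketch. It is not the code that ended up in BaseX. The class name DigestStreamSketch and the 8 KiB buffer size are assumptions made for illustration. The only change to the approach above is that it reads in blocks rather than one byte at a time, which may help when the underlying stream is unbuffered.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical class name, used for illustration only.
public class DigestStreamSketch {

    // Reads the stream to its end through a DigestInputStream and returns the hash.
    // The 8 KiB buffer size is an assumption, not a measured optimum.
    public static byte[] hash(InputStream is, String algo)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algo);
        byte[] buffer = new byte[8192];
        try (DigestInputStream dis = new DigestInputStream(is, md)) {
            // Every block read through dis is also fed to md as a side effect.
            while (dis.read(buffer) != -1) {
                // nothing to do: the digest is updated by the read itself
            }
        }
        return md.digest();
    }
}
```

The stream is closed by the try-with-resources block, so the caller does not need to close it again.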
Keeping you updated, Christian
On Sat, Jan 24, 2015 at 7:39 PM, Johan Mörén johan.moren@gmail.com wrote:
Hi Christian
I think you can go with Java's implementation all the way, like this:
MessageDigest md = MessageDigest.getInstance("MD5");
InputStream is = new FileInputStream("C:\\Temp\\Small\\Movie.mp4"); // Size 700 MB
// blockSize is the read buffer size in bytes, defined elsewhere
byte[] buffer = new byte[blockSize];
int numRead;
do {
  numRead = is.read(buffer);
  if (numRead > 0) {
    md.update(buffer, 0, numRead);
  }
} while (numRead != -1);
byte[] digest = md.digest();
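The loop above can be wrapped into a small runnable program. The following is a sketch under assumptions, not BaseX code: the class name StreamingHash, the method names and the 8 KiB block size are hypothetical. It hashes a file block by block, so memory use should stay constant regardless of file size, and prints the result as lowercase hex, like the XQuery snippet in the original report.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class, used for illustration only.
public class StreamingHash {

    // Assumed block size; any positive value works.
    private static final int BLOCK_SIZE = 8192;

    // Feeds the file to the digest one block at a time.
    public static byte[] hash(Path file, String algorithm)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance(algorithm);
        byte[] buffer = new byte[BLOCK_SIZE];
        try (InputStream is = Files.newInputStream(file)) {
            int numRead;
            while ((numRead = is.read(buffer)) != -1) {
                md.update(buffer, 0, numRead);
            }
        }
        return md.digest();
    }

    // Converts the digest bytes to a lowercase hex string.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Usage: java StreamingHash <file>
    public static void main(String[] args) throws Exception {
        System.out.println(toHex(hash(Paths.get(args[0]), "MD5")));
    }
}
```

Passing "SHA-1" or "SHA-256" instead of "MD5" selects another algorithm supported by MessageDigest.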
On Sat Jan 24 2015 at 6:49:18 PM Christian Grün <christian.gruen@gmail.com> wrote:
Hi Johan,
looks like a useful feature! Currently, we use Java's default implementation for computing hashes [1]. If you want to help us, you could look for existing Java MD5 hashing source code, which we could then adopt in BaseX!
Best, Christian
[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/ba...
On Sat, Jan 24, 2015 at 11:37 AM, Johan Mörén johan.moren@gmail.com wrote:
Hello!
We have been successfully using the hashing module to calculate MD5 checksums on binary files for a while. But last week we received our first really large file (4.3 GB) and our script threw a
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
We are currently using version 7.8 of BaseX. I suspect that BaseX materializes the stream returned by file:read-binary as a byte array when we call the hash:md5 function.
This is a snippet of our script where the problem arises:
...
let $binary := file:read-binary($filePath)
let $checksum := lower-case(xs:string(xs:hexBinary(hash:md5($binary))))
...
I think a nice feature to add to BaseX would be either a new function in the file module called file-checksum($algorithm) that calculates checksums on files in a streaming fashion, or perhaps an option for the hashing functions that indicates that you want them to use streaming.
Regards, Johan Mörén