Hi Christian

I think you can go with Java's implementation all the way, like this:

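// needs java.io.FileInputStream, java.io.InputStream, java.security.MessageDigest
MessageDigest md = MessageDigest.getInstance("MD5");
int blockSize = 8192; // assumed buffer size, any reasonable value should do

// try-with-resources closes the stream even if reading fails
try (InputStream is = new FileInputStream("C:\\Temp\\Small\\Movie.mp4")) // Size 700 MB
{
 byte[] buffer = new byte[blockSize];
 int numRead;
 // feed the file to the digest block by block; read() returns -1 at end of file
 while ((numRead = is.read(buffer)) != -1)
 {
  md.update(buffer, 0, numRead);
 }
}

byte[] digest = md.digest();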
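
If it helps, the same loop could be wrapped in a small helper that also returns the lowercase hex string our script produces today. This is only a sketch under my own assumptions: the class and method names (StreamingChecksum, hex) and the 8192-byte buffer are made up for illustration and are not part of BaseX.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not BaseX code: hashes a file without loading it into memory.
final class StreamingChecksum {

  // Returns the lowercase hex digest of the file at 'path' for the given
  // algorithm name (for example "MD5"), reading the file in fixed-size blocks.
  static String hex(Path path, String algorithm)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance(algorithm);
    byte[] buffer = new byte[8192]; // assumed block size
    try (InputStream is = Files.newInputStream(path)) {
      int numRead;
      while ((numRead = is.read(buffer)) != -1) {
        md.update(buffer, 0, numRead);
      }
    }
    // Convert the raw digest bytes to two hex characters each.
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest()) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }
}
```

Calling StreamingChecksum.hex(path, "MD5") only ever holds one buffer in memory, so the file size should not matter.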

On Sat Jan 24 2015 at 6:49:18 PM Christian Grün <christian.gruen@gmail.com> wrote:
Hi Johan,

looks like a useful feature! Currently, we use Java's default
implementation for computing hashes [1]. If you want to help us, you
could look for existing Java source code for MD5 hashing, which we
could then adopt in BaseX!

Best,
Christian

[1] https://github.com/BaseXdb/basex/blob/master/basex-core/src/main/java/org/basex/query/func/hash/HashFn.java


On Sat, Jan 24, 2015 at 11:37 AM, Johan Mörén <johan.moren@gmail.com> wrote:
> Hello!
>
> We have been using the hashing module to calculate MD5 checksums on binary
> files successfully for a while. But last week we received our first really
> large file (4.3 GB) and our script threw a
>
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>
> We are currently using version 7.8 of BaseX. I suspect that BaseX
> materializes the stream returned by file:read-binary as a byte array when we
> call the hash:md5 function.
>
> This is the snippet of our script where the problem arises:
> ...
> let $binary := file:read-binary($filePath)
> let $checksum := lower-case(xs:string(xs:hexBinary(hash:md5($binary))))
> ...
>
> I think a nice feature to add to BaseX could be either a new function in the
> file module called file-checksum($algorithm) that calculates checksums of
> files in a streaming fashion, or perhaps an option to the hashing functions
> that indicates that you want them to use streaming.
>
> Regards,
> Johan Mörén