Hello all,
As you might know, epub files and ODF files are zip files with specific contents. BaseX supports the expath zip module and could in theory be used for creating these files if it were not for a missing simple feature.
There is one rule for epub and ODF files that cannot be followed by BaseX at the moment: the first file in the zip container must be named 'mimetype' and is a plain text file that contains the mimetype string. This allows applications to read the mimetype at a fixed offset in the file, without decompression.
In unzip -vl it looks like this:
  Length  Method    Size  Cmpr       Date   Time    CRC-32  Name
--------  ------  ------- ----  ---------- -----  --------  ----
      20  Stored       20   0%  10-14-2018 05:57  2cab616f  mimetype
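For a concrete picture of the required layout, here is a small sketch (in Python's standard `zipfile` module rather than the XQuery used in this thread) that writes 'mimetype' as the first, uncompressed entry and checks that its name and contents sit at the fixed offsets:

```python
import io
import zipfile

# Build a minimal epub-style container in memory. The 'mimetype' entry is
# written first and STORED (no compression), so its contents appear at a
# fixed offset in the raw bytes.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/epub+zip",
               compress_type=zipfile.ZIP_STORED)
data = buf.getvalue()

# A ZIP local file header is 30 bytes; the 8-byte name follows, then
# (because the entry is STORED) the raw contents.
assert data[30:38] == b"mimetype"
assert data[38:58] == b"application/epub+zip"
```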
Here is an XQuery to create a file with just that entry:
```xquery
declare namespace zip = "http://expath.org/ns/zip";

let $zip :=
  <zip:file href="new.epub">
    <zip:entry name="mimetype" compressed="no" method="text">
      {"application/epub+zip"}
    </zip:entry>
  </zip:file>
return zip:zip-file($zip)
```
BaseX does not support the 'compressed' option. Without that option the file 'mimetype' is stored in compressed form and cannot be used by applications to quickly determine the mimetype of the file.
Modifying the XML in an existing epub or ODF with zip:update-entries is also not possible, because the mimetype file is still compressed.
An additional issue: when reading a zip file, the entries in zip:file are not in the same order as they are in the zip file. So when modifying an existing file, the mimetype entry has to be moved to the front of the list explicitly.
In short: to make BaseX support the creation of epub and ODF files it should:
- support the 'compressed' attribute
- retain the order of files in the zip file in the zip:file element.
Best regards, Jos
Hi Jos,
While the ZIP Module is still part of our distribution, it’s not actively maintained anymore, and we generally recommend our users to switch to the Archive Module [1]. Providing custom compression levels for each archive entry is one of the features that is provided by this newer module.
Hope this helps, Christian
[1] https://docs.basex.org/wiki/Archive_Module
On Tue, Sep 8, 2020 at 9:29 AM Jos van den Oever jos@vandenoever.info wrote:
On dinsdag 8 september 2020 09:57:50 CEST Christian Grün wrote:
Hi Jos,
While the ZIP Module is still part of our distribution, it’s not actively maintained anymore, and we generally recommend our users to switch to the Archive Module [1]. Providing custom compression levels for each archive entry is one of the features that is provided by this newer module.
Oh, a shame that the cross-implementation module is not maintained.
The archive module also compresses the 'mimetype' file with this code:
```xquery
let $file := "test.ods"
let $archive := file:read-binary($file)
let $content := parse-xml(archive:extract-text($archive, "content.xml"))
let $content := local:change($content, local:add_number_value_type#1)
let $updated := archive:update($archive, "content.xml", $content)
return file:write-binary($file, $updated)
```
Cheers, Jos
Oh, a shame that the cross-implementation module is not maintained.
The Archive Module was supposed to become the new EXPath standard. Unfortunately, different versions of that module were specified one after another such that the spec that’s currently publicly available doesn’t reflect our implementation anymore [1].
I didn’t know that the ZIP Module is still maintained in other implementations of XQuery. Is it still popular e.g. in eXist-db?
The archive module also compresses the 'mimetype' file with this code:
When calling archive:update, you can supply more properties with an archive:entry element:
<archive:entry last-modified='2011-11-11T11:11:11' compression-level='8' encoding='US-ASCII'>hello.txt</archive:entry>
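The per-entry control that archive:entry offers can be cross-checked with other zip APIs; as an illustration (Python's `zipfile`, not the Archive Module itself), one can mix a stored and a deflated entry in a single archive and verify which method each entry ends up with:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    # Per-entry compression choice, analogous to per-entry archive:entry
    # attributes: keep 'mimetype' stored, deflate the XML payload.
    z.writestr("mimetype", "application/epub+zip",
               compress_type=zipfile.ZIP_STORED)
    z.writestr("content.xml", "<office:document/>" * 50,
               compress_type=zipfile.ZIP_DEFLATED)

# Read the archive back and inspect each entry's compression method.
with zipfile.ZipFile(io.BytesIO(buf.getvalue())) as z:
    methods = {i.filename: i.compress_type for i in z.infolist()}

assert methods["mimetype"] == zipfile.ZIP_STORED       # method 0
assert methods["content.xml"] == zipfile.ZIP_DEFLATED  # method 8
```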
Best, Christian
[1] http://expath.org/spec/archive/20130930
On dinsdag 8 september 2020 10:59:37 CEST Christian Grün wrote:
I didn’t know that the ZIP Module is still maintained in other implementations of XQuery. Is it still popular e.g. in eXist-db?
I've used it in production to create government epub files (law bundles).
The archive module also compresses the 'mimetype' file with this code:
When calling archive:update, you can supply more properties with an archive:entry element:
<archive:entry last-modified='2011-11-11T11:11:11' compression-level='8' encoding='US-ASCII'>hello.txt</archive:entry>
I assumed that files that are not mentioned in the archive:update call or zip:update-entries call would not be touched.
I'll see if this way works.
Cheers, Jos
On dinsdag 8 september 2020 11:05:45 CEST Jos van den Oever wrote:
I assumed that files that are not mentioned in the archive:update call or zip:update-entries call would not be touched.
I'll see if this way works.
Calling it with compression-level="0" still compresses the file. And because an update call is used, the entire zip needs to be rewritten while taking care that 'mimetype' stays the first entry, even though the archive spec says "The relative order of all the existing and replaced entries within the archive is preserved." This example demonstrates that compression-level="0" does not do what the API promises:
```xquery
let $file := "test.ods"
let $archive := file:read-binary($file)
let $mimetype := archive:extract-text($archive, "mimetype")
let $content_xml := fn:parse-xml(archive:extract-text($archive, "content.xml"))
let $content_xml := local:change($content_xml, local:add_number_value_type#1)
let $entries := (
  <archive:entry compression-level='0'>{"mimetype"}</archive:entry>,
  <archive:entry>{"content.xml"}</archive:entry>
)
let $contents := ($mimetype, fn:serialize($content_xml))
let $updated := archive:update($archive, $entries, $contents)
return file:write-binary($file, $updated)
```
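The order-preserving update that the spec promises can be sketched independently of the Archive Module. This Python `zipfile` version (an illustration, not BaseX's implementation) rewrites an archive while keeping the original entry order and each entry's compression method:

```python
import io
import zipfile

def update_zip(data: bytes, name: str, new_content: bytes) -> bytes:
    """Rewrite an archive, replacing one entry's contents while preserving
    entry order and each entry's compression method, so a STORED
    'mimetype' stays first and stays STORED."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
         zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():  # infolist() yields entries in archive order
            payload = new_content if info.filename == name else src.read(info)
            dst.writestr(info.filename, payload,
                         compress_type=info.compress_type)
    return out.getvalue()

# Build a small ODF-like container, then update content.xml.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/vnd.oasis.opendocument.spreadsheet",
               compress_type=zipfile.ZIP_STORED)
    z.writestr("content.xml", "<old/>", compress_type=zipfile.ZIP_DEFLATED)
updated = update_zip(buf.getvalue(), "content.xml", b"<new/>")

with zipfile.ZipFile(io.BytesIO(updated)) as z:
    names = [i.filename for i in z.infolist()]
    first_method = z.infolist()[0].compress_type
    new_payload = z.read("content.xml")
assert names == ["mimetype", "content.xml"]   # order preserved
assert first_method == zipfile.ZIP_STORED     # mimetype still stored
assert new_payload == b"<new/>"
```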
On the archive spec: the example in '3.1 Creating a simple EPUB document' is not valid XQuery and does not match the description of the function.
Best regards, Jos
On dinsdag 8 september 2020 11:57:16 CEST Christian Grün wrote:
This example demonstrates that compression-level="0" does not do what the API promises:
I can have a closer look into that. Could you possibly provide me with a little self-contained example that I can run out of the box?
Here is an example that creates a new archive that uses compression-level="0" and algorithm="stored" and still compresses that entry.
Note that the archive-level option 'algorithm' is unfortunate, because often only single entries such as 'mimetype' or images should not be compressed. The algorithm should be 'stored' for every entry that has compression-level="0".
```xquery
declare namespace file = "http://expath.org/ns/file";
declare namespace archive = "http://basex.org/modules/archive";

(: Create a zip file with one uncompressed file :)
let $file := "test.epub"
let $mimetype := "application/epub+zip"
let $entries := (
  <archive:entry compression-level='0'>{"mimetype"}</archive:entry>
)
let $contents := ($mimetype)
let $zip := archive:create($entries, $contents,
  map { "format": "zip", "algorithm": "stored" })
return file:write-binary($file, $zip)
```
Best regards, Jos
Here is an example that creates a new archive that uses compression-level="0" and algorithm="stored" and still compresses that entry.
Note that the archive-level option 'algorithm' is unfortunate, because often only single entries such as 'mimetype' or images should not be compressed.
Thanks for the example. – My observation is that the entry is indeed archived uncompressed if you choose compression-level="0"; but I think what you are saying is that an uncompressed DEFLATE entry is not the same as an uncompressed STORED entry, and that ODS and ePub files require certain files to be archived with the STORED method, is that right?
The Archive Module has a long history, and was initially based on a proposal for the Zorba XQuery Processor back in 2012. I don’t actually remember why the algorithm option was not adopted for the single archive entries; maybe that would have been more reasonable. As we seem to be the only implementation left today, we could think about changing that. I doubt anyway that people will use different compression levels for single archive entries (apart from archiving them uncompressed), so it might be a better solution to define one global compression level for the whole archive.
On dinsdag 8 september 2020 13:06:19 CEST Christian Grün wrote:
Thanks for the example. – My observation is that the entry is indeed archived uncompressed if you choose compression-level="0"; but I think what you are saying is that an uncompressed DEFLATE entry is not the same as an uncompressed STORED entry, and that ODS and ePub files require certain files to be archived with the STORED method, is that right?
The thing that counts is that you can read the mimetype entry name and contents without decompression, starting from byte 30. That way tools such as 'file' can report the mimetype.
The file generated with the attached script in BaseX 9.4.3 beta gives this:
$ file -i test.epub
test.epub: application/octet-stream; charset=binary
$ unzip -vl test.epub
Archive:  test.epub
  Length  Method    Size  Cmpr       Date   Time    CRC-32  Name
--------  ------  ------- ----  ---------- -----  --------  ----
      20  Defl:N       25 -25%  09-08-2020 13:54  2cab616f  mimetype
--------          -------  ---                              -------
      20               25 -25%                               1 file
$ hexdump -C test.epub | head -4
00000000  50 4b 03 04 14 00 08 08  08 00 d9 6e 28 51 00 00  |PK.........n(Q..|
00000010  00 00 00 00 00 00 00 00  00 00 08 00 00 00 6d 69  |..............mi|
00000020  6d 65 74 79 70 65 01 14  00 eb ff 61 70 70 6c 69  |metype.....appli|
00000030  63 61 74 69 6f 6e 2f 65  70 75 62 2b 7a 69 70 50  |cation/epub+zipP|
There are 5 bytes between 'mimetype' and 'application/epub+zip'. These are deflate information. If the entry is 'stored', there are no bytes between the entry name and the contents; the zip will then be recognized by the epub and ODF applications (and uses less space than when it is deflated with compression-level 0).
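The header fields discussed above can be checked programmatically. A small sketch (Python, following the ZIP local-file-header layout) reads the compression method field at offset 8 — 0 means STORED, 8 means DEFLATE — and the entry name starting at offset 30:

```python
import io
import struct
import zipfile

def first_entry(data: bytes):
    """Parse the first ZIP local file header: return (name, method).
    Method 0 = STORED, 8 = DEFLATE."""
    assert data[:4] == b"PK\x03\x04"
    (method,) = struct.unpack_from("<H", data, 8)     # compression method
    (name_len,) = struct.unpack_from("<H", data, 26)  # file name length
    name = data[30:30 + name_len].decode("ascii")
    return name, method

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("mimetype", "application/epub+zip",
               compress_type=zipfile.ZIP_STORED)
assert first_entry(buf.getvalue()) == ("mimetype", 0)
```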
The Archive Module has a long history, and was initially based on a proposal for the Zorba XQuery Processor back in 2012. I don’t actually remember why the algorithm option was not adopted for the single archive entries; maybe that would have been more reasonable. As we seem to be the only implementation left today, we could think about changing that. I doubt anyway that people will use different compression levels for single archive entries (apart from archiving them uncompressed), so it might be a better solution to define one global compression level for the whole archive.
From a practical point of view (regardless of what is in the specification) it makes sense to store 'mimetype' uncompressed, and also to store files that are already compressed, such as PNG and JPG, with the 'stored' method. If that can be achieved easily: great, but at least it should be possible. I think the simplest solution is to save compression-level=0 entries as stored.
Best regards, Jos
To be complete, here is an example to create a file that is recognized as epub:
$ echo -n application/epub+zip > mimetype
$ zip -D -X -0 test.epub mimetype
$ file -i test.epub
test.epub: application/epub+zip; charset=binary
$ hexdump -C test.epub | head -4
00000000  50 4b 03 04 0a 00 00 00  00 00 3d 2f 4e 4d 6f 61  |PK........=/NMoa|
00000010  ab 2c 14 00 00 00 14 00  00 00 08 00 00 00 6d 69  |.,............mi|
00000020  6d 65 74 79 70 65 61 70  70 6c 69 63 61 74 69 6f  |metypeapplicatio|
00000030  6e 2f 65 70 75 62 2b 7a  69 70 50 4b 01 02 1e 03  |n/epub+zipPK....|
On dinsdag 8 september 2020 14:06:20 CEST Jos van den Oever wrote:
Hi Jos,
There are 5 bytes between 'mimetype' and 'application/epub+zip'. These are deflate information. If the entry is 'stored' there are no bytes between the entry name […]
Great, so we are talking about the same thing.
I think the simplest solution is to save compression-level=0 as stored.
That was also my thought. A quick fix caused the following error message (similar to what is described here [1])…
Operation failed: STORED entry missing size, compressed size, or crc-32.
…which means we’ll probably need to set additional values before writing the actual byte array. I’ll see what we can do.
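The error reflects a rule of the format: a STORED local file header must carry the CRC-32 and the sizes up front, because there is no compressed stream (and no data descriptor) from which to derive them later. As an illustration of the fields involved — this hand-packs only the 30-byte local header plus one entry, not a complete archive — here is a Python sketch:

```python
import struct
import zlib

content = b"application/epub+zip"
name = b"mimetype"
crc = zlib.crc32(content)  # must be known before the header is written

# Local file header layout (ZIP APPNOTE): signature, version, flags,
# method, mod time, mod date, crc-32, compressed size, uncompressed size,
# name length, extra length = 30 bytes in total.
header = struct.pack(
    "<4sHHHHHIIIHH",
    b"PK\x03\x04",
    10,            # version needed to extract
    0,             # no flags: sizes/crc are in the header, no data descriptor
    0,             # method 0 = STORED
    0, 0,          # mod time/date zeroed for this sketch
    crc,
    len(content),  # compressed size == uncompressed size when STORED
    len(content),
    len(name),
    0,             # no extra field
)
record = header + name + content
assert len(header) == 30
assert record[30:38] == b"mimetype"
assert record[38:] == b"application/epub+zip"
```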
I was surprised to learn more about the deficiencies of the Archive Module. The module was already used many times in the past to create ePub files, so my guess would be that these files could be opened by many readers, but were not 100% valid. How do you usually proceed to check the validity of ePub files?
Best, Christian
[1] https://stackoverflow.com/questions/1206970/how-to-create-uncompressed-zip-a...
I’ve updated the code; if the compression level is set to 0, entries will be STORED [1]. Feel free to check out the latest snapshot [2].
[1] https://github.com/BaseXdb/basex/commit/67ad584a85e0848432e19b4f587fbabfc2fc... [2] https://files.basex.org/releases/latest/
On Tue, Sep 8, 2020 at 2:27 PM Christian Grün christian.gruen@gmail.com wrote:
Hi Jos,
There are 5 bytes between 'mimetype' and 'applicatino/epub+zip'. These are deflate information. If the entry is 'stored' there are no bytes between the entry name […]
Great, so we are talking about the same thing.
I think the simplest solution is to save compression-level=0 as stored.
That was also my thought. A quick fix caused the following error message (similar to what is described here [1])…
Operation failed: STORED entry missing size, compressed size, or crc-32.
…which means we’ll probably need to set additional values before writing the actual byte array. I’ll see what we can do.
I was surprised to learn more about the deficiencies of the Archive Module. The module was already used many times in the past to create ePub files, so my guess would be that these files could be opened by many readers, but were not 100% valid. How do you usually proceed to check the validity of ePub files?
Best, Christian
[1] https://stackoverflow.com/questions/1206970/how-to-create-uncompressed-zip-a...
On Tue, Sep 8, 2020 at 2:06 PM Jos van den Oever jos@vandenoever.info wrote:
On dinsdag 8 september 2020 13:06:19 CEST Christian Grün wrote:
Here is an example that creates a new archive that uses compression-level="0" and algorithm="stored" and still compresses that entry.
Note that the archive level option 'algorithm' is unfortumate because often it is only single entries such as 'mimetype' or images that should not be compressed.
Thanks for the example. – My observation is that the entry is indeed archived uncompressed if you choose compression-level="0"; but I think what you are saying is that an uncompressed DEFLATE entry is not the same as an uncompressed STORED entry, right, and that ODS and ePub files require certain files to be stored with the STORED algorithm, is that right?
The thing that counts is that you can read the mimetype entry name and contents without decompression, starting from byte 30. That way tools such as 'file' can report the mimetype.
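The fixed-offset trick can be illustrated with a short Python sketch (an editor's addition; the helper name is made up): a zip local file header is 30 bytes, so with the 8-byte name 'mimetype' the contents themselves start at byte 38.

```python
import io
import zipfile
from typing import Optional

def sniff_epub_odf_mimetype(data: bytes) -> Optional[str]:
    """Return the mimetype of an epub/ODF container without decompression.

    Relies on the convention that the first entry is a STORED file named
    'mimetype': the 30-byte local file header is followed by the 8-byte
    name, so the mimetype string itself starts at byte 38.
    """
    if data[:4] != b"PK\x03\x04":       # zip local file header signature
        return None
    if data[30:38] != b"mimetype":      # first entry must be named 'mimetype'
        return None
    # the extra-field length (bytes 28-29) must be 0 for the offset to hold
    if int.from_bytes(data[28:30], "little") != 0:
        return None
    # uncompressed size of the first entry (bytes 22-25, little-endian)
    size = int.from_bytes(data[22:26], "little")
    return data[38:38 + size].decode("ascii")

# Demo: build a minimal epub-like container with a STORED 'mimetype' first.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    # ZipInfo defaults to ZIP_STORED, so no DEFLATE bytes are inserted
    z.writestr(zipfile.ZipInfo("mimetype"), b"application/epub+zip")
mimetype = sniff_epub_odf_mimetype(buf.getvalue())
print(mimetype)  # application/epub+zip
```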
The file generated with the attached script in BaseX 9.4.3 beta gives this:
$ file -i test.epub
test.epub: application/octet-stream; charset=binary
$ unzip -vl test.epub
Archive:  test.epub
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
      20  Defl:N       25 -25% 09-08-2020 13:54 2cab616f  mimetype
--------          -------  ---                            -------
      20               25 -25%                            1 file
$ hexdump -C test.epub | head -4
00000000  50 4b 03 04 14 00 08 08  08 00 d9 6e 28 51 00 00  |PK.........n(Q..|
00000010  00 00 00 00 00 00 00 00  00 00 08 00 00 00 6d 69  |..............mi|
00000020  6d 65 74 79 70 65 01 14  00 eb ff 61 70 70 6c 69  |metype.....appli|
00000030  63 61 74 69 6f 6e 2f 65  70 75 62 2b 7a 69 70 50  |cation/epub+zipP|
There are 5 bytes between 'mimetype' and 'application/epub+zip'. These bytes are DEFLATE header information. If the entry is 'stored' there are no bytes between the entry name and the contents, and the zip will be recognized by the epub and ODF applications (and will use less space than when it is deflated with compression-level 0).
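Those 5 bytes (01 14 00 eb ff in the hexdump) are the header of a single final "stored" DEFLATE block: one BFINAL/BTYPE byte, the two-byte length 0x0014 = 20 (little-endian), and its two-byte one's complement. A quick Python check (an editor's sketch) reproduces the 25-byte, "-25%" result from the unzip listing:

```python
import zlib

data = b"application/epub+zip"   # 20 bytes

# Raw DEFLATE (negative wbits = no zlib wrapper) at compression level 0:
# the data is wrapped in a single "stored" DEFLATE block, which still
# costs a 5-byte block header -- hence 25 bytes, a "-25%" ratio.
comp = zlib.compressobj(0, zlib.DEFLATED, -15)
deflated = comp.compress(data) + comp.flush()

print(len(deflated))        # 25: five header bytes plus the payload
print(deflated[:5].hex())   # 011400ebff, exactly as in the hexdump
```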
The Archive Module has a long history, and was initially based on a proposal for the Zorba XQuery Processor back in 2012. I don’t actually remember why the algorithm option was not adopted for the single archive entries; maybe that would have been more reasonable. As we seem to be the only implementation left today, we could think about changing that. I doubt anyway that people will use different compression levels for single archive entries (apart from archiving them uncompressed), so it might be a better solution to define one global compression level for the whole archive.
From a practical point of view (regardless of what is in the specification) it makes sense to store 'mimetype' uncompressed, and also to store files such as PNG and JPG, which are already compressed, in the 'stored' way. If that can be […]
Best regards, Jos
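Jos's rule of thumb above (store 'mimetype' and already-compressed media, deflate the rest) could be sketched like this in Python; this is an editor's illustration, and the helper name and extension set are made up, not BaseX or Archive Module code:

```python
import zipfile

# File types that are already compressed and gain nothing from DEFLATE.
ALREADY_COMPRESSED = {".png", ".jpg", ".jpeg", ".gif", ".woff", ".woff2"}

def zip_method(name: str) -> int:
    """Pick the zip method for an epub/ODF entry, per the rule of thumb."""
    if name == "mimetype":
        return zipfile.ZIP_STORED    # must stay readable at a fixed offset
    dot = name.rfind(".")
    if dot != -1 and name[dot:].lower() in ALREADY_COMPRESSED:
        return zipfile.ZIP_STORED    # recompressing would only add overhead
    return zipfile.ZIP_DEFLATED

print(zip_method("mimetype"), zip_method("images/cover.jpg"),
      zip_method("content.opf"))
```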
On dinsdag 8 september 2020 14:40:34 CEST Christian Grün wrote:
I’ve updated the code; if the compression level is set to 0, entries will be STORED [1]. Feel free to check out the latest snapshot [2].
Creating a new epub or ODF file works correctly now, but archive:update() does not retain the 'stored' property for the 'mimetype' file. (It does retain the order of the entries.) Here is an example script that takes 'test.epub' as input.
```xquery
declare namespace file = "http://expath.org/ns/file";
declare namespace archive = "http://basex.org/modules/archive";

(: Update a zip file.
   Currently, this will change the 'stored' entries to 'deflate',
   breaking mimetype recognition. :)
let $file := "test.epub"
let $archive := file:read-binary($file)
let $updated := archive:update($archive, (), ())
return file:write-binary($file, $updated)
```
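For comparison, here is what a method- and order-preserving update looks like with Python's standard zipfile module (an editor's sketch, not BaseX code): each entry's original ZipInfo, including its compression method, is reused when rewriting, so a STORED 'mimetype' stays STORED.

```python
import io
import zipfile

def update_zip_preserving_methods(archive: bytes, replacements: dict) -> bytes:
    """Rewrite a zip archive, replacing some entries by name while keeping
    the original entry order and each entry's compression method."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive)) as src, \
         zipfile.ZipFile(out, "w") as dst:
        for info in src.infolist():          # infolist() keeps archive order
            data = replacements.get(info.filename, src.read(info.filename))
            dst.writestr(info, data)         # reuses the entry's original method
    return out.getvalue()

# Demo: an epub-like archive with a STORED mimetype and a DEFLATED manifest.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr(zipfile.ZipInfo("mimetype"), b"application/epub+zip")  # STORED
    z.writestr("content.opf", b"<package/>", zipfile.ZIP_DEFLATED)
updated = update_zip_preserving_methods(
    buf.getvalue(), {"content.opf": b"<package version='3.0'/>"})

# The first 'mimetype' entry is still uncompressed at its fixed offset:
print(updated[30:38], updated[38:58])
```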
Best regards, Jos
[1] https://github.com/BaseXdb/basex/commit/67ad584a85e0848432e19b4f587fbabfc2fc38e5 [2] https://files.basex.org/releases/latest/
Creating a new epub or ODF file works correctly now, but archive:update() does not retain the 'stored' property for the 'mimetype' file. (It does retain the order of the entries.)
The stored property will now be retained if an archive is updated.
Up to now, archive:update removed existing update candidates from the archive and added new entries at the end. I changed this as well: If existing files are updated, the original order will be preserved, and new files will be added in the order in which they were supplied by the user.
EPub actually adopted the practice of a stored mimetype from ODF.
I didn’t know that. And thanks for the link to the W3C epub checker (which I actually used myself a long time ago) and the ODF checker.
BaseX 9.4.3 is scheduled to be released later this week.
Thank you for making the improvements. This is much cleaner imho than bash + zip + xsltproc. :-)
Since both checkers are Java, they even fit in the BaseX environment.
On Tue, 2020-09-08 at 17:07 +0200, Jos van den Oever wrote:
Thank you for making the improvements. This is much cleaner imho than bash + zip + xsltproc. :-)
A minor addition - I've sometimes started with a base zip file with the uncompressed "mimetype" entry in it, and just added the rest to that file from XQuery or XSLT, without problems.
On dinsdag 8 september 2020 19:31:41 CEST Liam R. E. Quin wrote:
That's a fine solution, but in this case it did not work because it 'upgraded' the mimetype file to be compressed.
On dinsdag 8 september 2020 14:27:55 CEST Christian Grün wrote:
I was surprised to learn more about the deficiencies of the Archive Module. The module was already used many times in the past to create ePub files, so my guess would be that these files could be opened by many readers, but were not 100% valid. How do you usually proceed to check the validity of ePub files?
I think many, but not all, tools are forgiving. Especially tools that scan many files, such as file explorers, rely on 'magic bytes'. Most epub files comply with having a 'stored' mimetype as the first file: from a local collection from various sources, 138 out of 159 files comply.
For ODF, compliance with this rule is almost universal. EPub actually adopted the practice of a stored mimetype from ODF.
For validation of epub files you can use the W3C epubcheck validator (https://github.com/w3c/epubcheck); for ODF you can use the ODF validator (https://odftoolkit.org/conformance/ODFValidator.html).
Best regards, Jos
basex-talk@mailman.uni-konstanz.de