Hi all,

again:

curl -v -X PUT -T some.pdf http://localhost:9998/tika --header "Content-Type: application/pdf"

... and Tika returns plain text, as it should. So 'application/pdf' is a MIME type that works.


Now off to BaseX:

let $request :=
  <http:request method='PUT'>
    <http:body media-type="application/pdf" src="some.pdf"/>
  </http:request>
return
  http:send-request($request, "http://localhost:9998/tika")

For this, Tika returns 415 (Unsupported Media Type). Although the MIME type is specified this time, the content that BaseX sends does not look like what Tika expects.

let
  $file := "some.pdf",
  $request :=
    <http:request method='PUT'>
      <http:body media-type="application/pdf">{
        fetch:binary($file)
      }</http:body>
    </http:request>
return
  http:send-request($request, "http://localhost:9998/tika")

For this, Tika returns 500 (processing error). The media type is set to 'application/pdf', which works with curl (see above) but not with BaseX. As expected, the tcpdump output also differs for the BaseX requests. So either we're doing something really wrong, or BaseX sends the content in a way it's not supposed to. In the latter case I'm not the right person to look into this, and we'll have to wait for someone to take a proper look at it.
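
To narrow down where the two requests differ, here is a rough sketch that lists header names present in one request but absent from the other. It assumes the headers of each request have been copied by hand into a text file, one "Name: value" per line; the function names (header_names, missing_headers) and the file names in the usage comment are placeholders of mine, not anything from BaseX or Tika.

```shell
#!/bin/sh
# Print the lower-cased header names found in a file that holds
# one "Name: value" header per line (assumed input format).
header_names() {
  tr -d '\r' < "$1" \
    | sed -n 's/^\([A-Za-z0-9-][A-Za-z0-9-]*\):.*$/\1/p' \
    | tr '[:upper:]' '[:lower:]' \
    | LC_ALL=C sort -u
}

# Print header names that occur in the first file but not in the second.
missing_headers() {
  a=$(mktemp) && b=$(mktemp) || return 1
  header_names "$1" > "$a"
  header_names "$2" > "$b"
  # comm -23 keeps only the lines unique to the first (sorted) input.
  LC_ALL=C comm -23 "$a" "$b"
  rm -f "$a" "$b"
}

# Hypothetical usage, with hand-made header files:
#   missing_headers curl-headers.txt basex-headers.txt
```

Only header names are compared here, not values, so a header that both clients send with different content (Content-Type, for instance) would still have to be checked by eye.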

Regards,
Lukas


On Sun, Jan 5, 2014 at 5:06 PM, Andy Bunce <bunce.andy@gmail.com> wrote:
Hi Dirk,
The Tika documentation is not very clear [1]. tika-app has a simple server mode. tika-server, which I am using, is a different jar [2].

[1] http://stackoverflow.com/questions/12231630/how-to-use-tika-in-server-mode
[2] http://mvnrepository.com/artifact/org.apache.tika/tika-server/1.4


On Sun, Jan 5, 2014 at 3:39 PM, Dirk Kirsten <dk@basex.org> wrote:
Hello,

You can also simply get all the request headers using the -v flag when
running curl. Or you could use Wireshark, which (at least to me) seems
easier than using tcpdump.

I'd like to reproduce your problem, but I seem to be too stupid to get
the Tika server up and running.
When running
  java -jar tika-app-1.4.jar -s 9999

(or even with the verbose flag) I simply don't get anything (except a
running process), and the server does not seem to have started
properly, e.g. if I do
  curl -X GET http://localhost:9998/tika

I simply get nothing (the server does not seem to send any response
at all).

However, I would suggest looking at the request sent by curl, as curl
sets some headers automatically and I have run into similar problems
before (i.e. for some servers, a missing obscure header seems to be
fatal...)
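
A minimal sketch of how the headers that curl adds could be pulled out of its verbose output. It relies on curl -v prefixing every outgoing request line with "> " on stderr; the function name request_headers and the file name curl.log are placeholders of mine.

```shell
#!/bin/sh
# Print the request line and request headers from a saved `curl -v` log.
# curl -v marks each line it sends with a leading "> ".
request_headers() {
  # $1: path to a file containing curl's verbose (stderr) output
  tr -d '\r' < "$1" \
    | sed -n 's/^> \(..*\)$/\1/p' \
    | LC_ALL=C sort
}

# Hypothetical usage (curl.log is an assumed file name):
#   curl -v -X PUT -T some.pdf http://localhost:9998/tika \
#        --header "Content-Type: application/pdf" 2> curl.log
#   request_headers curl.log
```

The sorted list can then be checked against whatever the BaseX request contains, to see which headers only curl sends.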

Cheers,
Dirk


On 05/01/14 15:00, Florent Georges wrote:
> On 5 January 2014 00:57, Andy Bunce wrote:
>
>   Hi,
>
>> curl -X PUT -T aa.pdf http://localhost:9998/tika
>> [...]
>> I have tried:
>> let $file:="C:\tmp\aa.pdf"
>> let $request :=
>>   <http:request  method='PUT'    >
>>     <http:body media-type="application/octet-stream">{
>>       fetch:binary($file)
>>     }</http:body>
>>     </http:request>
>
>   I do not know Tika, I do not have BaseX on this machine, and you did
> not give a lot of details about what is not working nor error messages,
> so it is a bit difficult to help here.  All I can say is that I would
> use the following as the EXPath HTTP Client equivalent to the above
> CURL command:
>
>     <http:request method="put">
>        <http:body media-type="application/pdf" src="file:/c:/tmp/aa.pdf"/>
>     </http:request>
>
>   The @media-type is mandatory.  You do not set any explicitly with
> CURL, so you should probably find which MIME type works with CURL in
> the first place.  The @src lets the processor handle the details of
> accessing the binary file, which makes things easier and then you are
> sure the problem is not with fetch:binary() or with the analysis of
> the binary content of http:body.
>
>   If you find a MIME type that works with CURL (you can use the -H
> option like the following: -H "Content-Type: application/pdf"), and it
> is still failing, tcpdump can help as well.  Open a terminal window,
> and execute the following:
>
>     sudo tcpdump -s 0 -A -i any tcp and host localhost and port 9998
>
>   This will dump all traffic to localhost:9998.  Then go to another
> terminal window (because tcpdump is still running) and execute the
> CURL command.  After the completion, go back to the first window and
> press Ctrl-C (to kill tcpdump).  In between, tcpdump has output to the
> console a dump of the request.  It will as well if you keep it running
> when you test your query in BaseX.  So you can compare both requests
> and see what is different (or post it here so we can see what is
> happening).
>
>   Regards,
>

--
Dirk Kirsten, BaseX GmbH, http://basex.org
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk

