Hi,
I want to use the Tika server [2] to extract text from PDFs. This works using curl as the client:
curl -X PUT -T aa.pdf http://localhost:9998/tika
However, I want to use the HTTP module [1]. I have tried:

  let $file := "C:\tmp\aa.pdf"
  let $request :=
    <http:request method='PUT'>
      <http:body media-type="application/octet-stream">{
        fetch:binary($file)
      }</http:body>
    </http:request>
  let $r := http:send-request($request, $tika)

I have tried this with various values for http:body/@method with no success. The Content-Length header of the resulting request does not match the one sent by curl. This did not work either (no body?):

  let $file := "C:\tmp\aa.pdf"
  let $request :=
    <http:request method='PUT'>
      <http:body media-type="text/plain" src="{$file}"/>
    </http:request>
  let $r := http:send-request($request, $tika)
Any ideas?
Regards
/Andy

[1] http://docs.basex.org/wiki/HTTP_Module
[2] http://wiki.apache.org/tika/TikaJAXRS
Hi Andy,
just a quick report, as I haven't been able to solve the problem so far.

> This works using curl as the client
> curl -X PUT -T aa.pdf http://localhost:9998/tika

If I add '--header "Content-Type: application/pdf"', it works fine for me, too. If I don't specify the content type, I get a "415: Unsupported Media Type". Just a note for others.
If I run the following:
  let $file := "some.pdf",
      $request :=
        <http:request method='PUT'>
          <http:body media-type="application/octet-stream">{
            fetch:binary($file)
          }</http:body>
        </http:request>
  return http:send-request($request, "http://localhost:9998/tika")
I get from BaseX (running in debug mode):
*java.lang.IllegalArgumentException: object is not an instance of declaring class*
and (from Tika):
*INFO: tika (autodetecting type)*
It looks like something is already going wrong at the BaseX level. I still get a response from Tika, but not the one I expected. If I change the media type to 'application/pdf', I no longer get the BaseX error, but a document processing error (500) from Tika. 'application/pdf' is also the media type that 'fetch:content-type()' returns.
So if the media type is not specified further, Tika tries to guess the content type but cannot find one. If it is specified, Tika returns a processing error. As you said, this may be a problem with the content (as the Content-Length headers differ).
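One way a Content-Length mismatch like this can arise is if the binary item ends up in the request body as base64 text instead of raw bytes. Whether that is what happened here is only an assumption; nothing in the reports so far confirms it. The sketch below is in Python purely because it runs standalone, and the sample bytes are made up rather than taken from a real PDF. It shows how the two lengths would differ:

```python
import base64

# Made-up stand-in for the bytes of a PDF file (not a real document).
raw = b"%PDF-1.2\n" + bytes(range(256)) * 4

# Length of the body if the client uploads the file bytes unchanged,
# which is what "curl -T" does.
raw_length = len(raw)

# Length of the body if the client serialized the binary item as
# base64 text instead (assumption: this may or may not apply to BaseX).
encoded = base64.b64encode(raw)
encoded_length = len(encoded)

print("raw Content-Length:   ", raw_length)
print("base64 Content-Length:", encoded_length)
```

A base64 body is roughly one third longer than the raw file and no longer starts with the %PDF marker, so a server expecting a PDF would plausibly reject it.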
Sorry for not being of much help but maybe someone else has an idea?
Cheers, Lukas
On 5 January 2014 00:57, Andy Bunce wrote:
> Hi,
>
> curl -X PUT -T aa.pdf http://localhost:9998/tika
> [...]
> I have tried:
> let $file:="C:\tmp\aa.pdf"
> let $request :=
>   <http:request method='PUT' >
>     <http:body media-type="application/octet-stream">{
>       fetch:binary($file)
>     }</http:body>
>   </http:request>
I do not know Tika, I do not have BaseX on this machine, and you did not give a lot of details about what is not working nor error messages, so it is a bit difficult to help here. All I can say is that I would use the following as the EXPath HTTP Client equivalent to the above CURL command:
  <http:request method="put">
    <http:body media-type="application/pdf"
               src="file:/c:/tmp/aa.pdf"/>
  </http:request>
The @media-type is mandatory. You do not set any explicitly with CURL, so you should probably find which MIME type works with CURL in the first place. The @src lets the processor handle the details of accessing the binary file, which makes things easier and then you are sure the problem is not with fetch:binary() or with the analysis of the binary content of http:body.
If you find a MIME type that works with CURL (you can use the -H option like the following: -H "Content-Type: application/pdf"), and it is still failing, tcpdump can help as well. Open a terminal window, and execute the following:
sudo tcpdump -s 0 -A -i any tcp and host localhost and port 9998
This will dump all traffic to localhost:9998. Then go to another terminal window (because tcpdump is still running) and execute the curl command. After it completes, go back to the first window and press Ctrl-C (to kill tcpdump). In the meantime, tcpdump will have written a dump of the request to the console. It will do the same if you keep it running while you test your query in BaseX. So you can compare both requests and see what is different (or post them here so we can see what is happening).
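On systems where tcpdump is not available (the original query uses a Windows path), a throwaway local server that records whatever it receives can serve a similar purpose. This is a swapped-in technique, not tcpdump itself, sketched in Python with only the standard library. The names DumpHandler, start_dump_server and captured, as well as the sample request at the end, are made up for illustration:

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = []  # one dict per PUT request received


class DumpHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        # Read exactly as many body bytes as the client announced.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        captured.append({
            "headers": dict(self.headers.items()),
            "body_length": len(body),
            "body_start": body[:8],
        })
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the console quiet; the captured list is the output


def start_dump_server(port=0):
    # Port 0 lets the operating system pick a free port.
    server = HTTPServer(("127.0.0.1", port), DumpHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


server = start_dump_server()
port = server.server_address[1]

# Stand-in client request; in practice, curl and the BaseX query
# would be pointed at this port instead.
conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("PUT", "/tika", body=b"%PDF-1.2 sample",
             headers={"Content-Type": "application/pdf"})
conn.getresponse().read()
conn.close()
server.shutdown()
server.server_close()

for entry in captured:
    print(entry)
```

For real use, one would start the server on a fixed port, send the same file once with curl and once from BaseX, and compare the two captured entries.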
Regards,
Hello,
You can also simply get all the request headers using the -v flag when running curl. Or you could use Wireshark, which (at least to me) seems easier than using tcpdump.

I'd like to reproduce your problem, but I seem to be too stupid to get the Tika server up and running. When running

  java -jar tika-app-1.4.jar -s 9999

(or even with the verbose flag) I simply don't get anything (but a running process), and the server does not seem to have started properly. For example, if I do

  curl -X GET http://localhost:9998/tika

I simply get nothing (the server does not seem to send any response).

However, I would suggest looking at the request sent by curl, as curl sets some headers automatically, and I have experienced similar problems before (i.e., for some servers, not setting some obscure header seems to be fatal...).
Cheers, Dirk
On 5 January 2014 16:39, Dirk Kirsten wrote:
> However, I would suggest to try to look at the request sent by curl, as curl sets some headers automatically and I also experienced similar problems before (i.e. for some servers not setting some obscure headers seems to be fatal...)
If it is of any help, here is what I got with tcpdump, by using a random PDF file (I did not install Tika, but sent the request to a server of mine; the request sent by CURL should be the same though):
  PUT /tools/dump HTTP/1.1
  User-Agent: curl/7.30.0
  Host: h2oconsulting.be
  Content-Length: 18108
  Expect: 100-continue

  %PDF-1.2 1 0 obj << ... [the rest of the PDF content] ...
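Once two such dumps are at hand, comparing them header by header can be automated. The following is a minimal Python sketch, not part of any tool mentioned in this thread. The first sample is the head of the curl request shown above; the second is invented purely so that there is something to compare against, and is not a capture of what BaseX sends:

```python
def parse_request_head(raw):
    """Split a raw HTTP request into (request_line, {lowercased header: value})."""
    head = raw.replace("\r\n", "\n").split("\n\n", 1)[0]
    lines = [line for line in head.split("\n") if line.strip()]
    request_line = lines[0].strip()
    headers = {}
    for line in lines[1:]:
        # Split on the first colon only, so values may contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return request_line, headers


def diff_headers(a, b):
    """Return {header: (value_in_a, value_in_b)} for headers that differ."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Head of the curl request from the dump above.
curl_dump = """PUT /tools/dump HTTP/1.1
User-Agent: curl/7.30.0
Host: h2oconsulting.be
Content-Length: 18108
Expect: 100-continue

%PDF-1.2 ..."""

# HYPOTHETICAL second capture, invented for illustration only.
other_dump = """PUT /tools/dump HTTP/1.1
User-Agent: example-client
Host: h2oconsulting.be
Content-Length: 12345
Content-Type: application/pdf

..."""

_, curl_headers = parse_request_head(curl_dump)
_, other_headers = parse_request_head(other_dump)
for name, (left, right) in diff_headers(curl_headers, other_headers).items():
    print(f"{name}: curl={left!r} other={right!r}")
```

A header that appears on one side only shows up with None on the other, which makes missing headers easy to spot.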
Regards,
Hi Dirk,
The Tika documentation is not very clear [1]. tika-app has a simple server mode. tika-server, which I am using, is a different jar [2].

[1] http://stackoverflow.com/questions/12231630/how-to-use-tika-in-server-mode
[2] http://mvnrepository.com/artifact/org.apache.tika/tika-server/1.4
_______________________________________________
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
Hi all,
again:
curl -v -X PUT -T some.pdf http://localhost:9998/tika --header "Content-Type: application/pdf"
... and tika returns plain text as it should - so a working MIME type would be 'application/pdf'.
*Now off to BaseX:*
  let $request :=
    <http:request method='PUT'>
      <http:body media-type="application/pdf" src="some.pdf"/>
    </http:request>
  return http:send-request($request, "http://localhost:9998/tika")

*For this, tika returns 415* - unsupported media type. Although the MIME type is specified this time, the content that BaseX sends does not look like what Tika expects.

  let $file := "some.pdf",
      $request :=
        <http:request method='PUT'>
          <http:body media-type="application/pdf">{
            fetch:binary($file)
          }</http:body>
        </http:request>
  return http:send-request($request, "http://localhost:9998/tika")

*For this, tika returns 500* - processing error. The media type is set to 'application/pdf', which works with curl (see above) but not with BaseX. The tcpdump output also differs for the BaseX requests, as expected. So either we're doing something really wrong, or BaseX sends the content in a way it's not supposed to. In the latter case I'm not the one to look into this issue, and we have to wait for someone to take a proper look at it.
Regards, Lukas
Florent: Thanks for the tcpdump and @src tips.

Lukas: That matches my experience. I wonder if this is relevant:
http://stackoverflow.com/questions/18728100/file-upload-via-http-put-request
/Andy
Using *http:send-request#3*, i.e. passing an explicit body, works and seems to make sense because the type information is available in this form.

  let $request :=
    <http:request method='PUT'>
      <http:body media-type="application/octet-stream" method="raw"/>
    </http:request>
  let $r := http:send-request($request, $tika, fetch:binary($file))
I think the @src is not working because no base64 encoding is done. I think some review of the code [1] [2] might help.
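For reference, the request that curl -T produces, and that the explicit-body form above presumably matches since it works, is a plain PUT whose body is the file's bytes unchanged. The Python sketch below builds such a request without sending it. The helper name build_raw_put and the sample bytes are made up for illustration, and 'application/pdf' is used because that is the media type reported to work with curl earlier in the thread:

```python
import urllib.request


def build_raw_put(url, data, media_type="application/pdf"):
    """Build (but do not send) a PUT request whose body is the raw bytes."""
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": media_type},
    )


# Made-up bytes standing in for the content of a PDF file.
sample = b"%PDF-1.2\n..."
request = build_raw_put("http://localhost:9998/tika", sample)
print(request.get_method(), request.full_url, len(request.data), "bytes")

# To actually send it to a running Tika server:
#     with urllib.request.urlopen(request) as response:
#         text = response.read().decode("utf-8")
```

The body is passed through as bytes, with no text serialization step in between, which is the property the failing variants are suspected of lacking.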
Regards /Andy
[1] https://github.com/BaseXdb/basex/blob/5b4bcc3272b0611bc99b7d9eebbb432b22a4dc...
[2] https://github.com/BaseXdb/basex/blob/next/basex-core/src/main/java/org/base...
Hi Andy,
> I think the @src is not working because no base64 encoding is done. I think some review of the code [1] [2] might help.
I must confess I haven’t followed the full conversation, but based on your hints I have revised the BaseX code revolving around the src attribute [1]; hopefully, the latest snapshot [2] does what it’s supposed to do.
@Florent: I’m not quite sure how to properly handle the input linked by @src, which is why I have sent you another e-mail via the expath mailing list. This is what I currently do [3].
Looking forward to your feedback, Christian
[1] https://github.com/BaseXdb/basex/commit/ef4f7c
[2] http://files.basex.org/releases/latest
[3] https://github.com/BaseXdb/basex/blob/next/basex-core/src/main/java/org/base...
On Thu, Jan 9, 2014 at 1:50 PM, Andy Bunce bunce.andy@gmail.com wrote:
Using http:send-request#3, i.e using an explicit body works and seems to make sense because the type information is available in this form.
let $request := <http:request method='PUT' > <http:body media-type="application/octet-stream" method="raw"/> </http:request> let $r:= http:send-request($request,$tika,fetch:binary($file))
Regards /Andy
[1] https://github.com/BaseXdb/basex/blob/5b4bcc3272b0611bc99b7d9eebbb432b22a4dc...
[2] https://github.com/BaseXdb/basex/blob/next/basex-core/src/main/java/org/base...
On Mon, Jan 6, 2014 at 12:09 PM, Andy Bunce bunce.andy@gmail.com wrote:
Florent:
Thanks for the tcpdump and @src tips
Lukas That matches my experience. I wonder if this is relevant:
http://stackoverflow.com/questions/18728100/file-upload-via-http-put-request
/Andy
On Mon, Jan 6, 2014 at 7:49 AM, Lukas Kircher lukas.kircher@uni-konstanz.de wrote:
Hi all,
again:
curl -v -X PUT -T some.pdf http://localhost:9998/tika --header "Content-Type: application/pdf"
... and tika returns plain text as it should - so a working MIME type would be 'application/pdf'.
Now off to BaseX:
let $request := <http:request method='PUT' > <http:body media-type="application/pdf" src="some.pdf"/> </http:request> return http:send-request($request,"http://localhost:9998/tika")
For this, tika returns 415 - unsupported media type. Although specifying the MIME type this time, the content that BaseX sends does not look like what tika expects.
let $file:="some.pdf", $request := <http:request method='PUT'> <http:body media-type="application/pdf">{ fetch:binary($file) }</http:body> </http:request> return http:send-request($request,"http://localhost:9998/tika")
For this, tika returns 500 - processing error. Media type is specified to 'application/pdf' which works with curl (see above) but not with BaseX. Also the tcpdump differs for the BaseX requests, as expected. So either we're doing something really wrong, or BaseX sends the content in a way it's not supposed to. In the latter case I'm not the one to look into this issue and we have to wait for someone to take a proper look at it.
Regards, Lukas
On Sun, Jan 5, 2014 at 5:06 PM, Andy Bunce bunce.andy@gmail.com wrote:
Hi Dirk, The Tika documentation is not very clear[1]. tika-app has a simple server mode. tika-server, which I am using, is a different jar [2]
[1] http://stackoverflow.com/questions/12231630/how-to-use-tika-in-server-mode [2] http://mvnrepository.com/artifact/org.apache.tika/tika-server/1.4
On Sun, Jan 5, 2014 at 3:39 PM, Dirk Kirsten dk@basex.org wrote:
Hello,
You can also simple get all the request headers using the -v flag when running curl. Or you could use wireshark, which (at least to me) seems easier than using tcpdump.
I'd like to reproduce your problem, but I seem to be too stupid to get the Tika server up and running. When running java -jar tika-app-1.4.jar -s 9999
(or even with the verbose flag) I simply don't get any thing (but a running process) and the server seems to me not properly started, e.g. if I do curl -X GET http://localhost:9998/tika
I simply get nothing (I don't get any response, servers seems not to send any response).
However, I would suggest to try to look at the request sent by curl, as curl sets some headers automatically and I also experienced similar problems before (i.e. for some servers not setting some obscure headers seems to be fatal...)
Cheers, Dirk
On 05/01/14 15:00, Florent Georges wrote:
On 5 January 2014 00:57, Andy Bunce wrote:
Hi,
> curl -X PUT -T aa.pdf http://localhost:9998/tika > [...] > I have tried: > let $file:="C:\tmp\aa.pdf" > let $request := > <http:request method='PUT' > > <http:body media-type="application/octet-stream">{ > fetch:binary($file) > }</http:body> > </http:request>
I do not know Tika, I do not have BaseX on this machine, and you did not give a lot of details about what is not working nor error messages, so it is a bit difficult to help here. All I can say is that I would use the following as the EXPath HTTP Client equivalent to the above CURL command:
<http:request method="put">
  <http:body media-type="application/pdf" src="file:/c:/tmp/aa.pdf"/>
</http:request>
The @media-type is mandatory. You do not set any explicitly with CURL, so you should probably find which MIME type works with CURL in the first place. The @src lets the processor handle the details of accessing the binary file, which makes things easier and then you are sure the problem is not with fetch:binary() or with the analysis of the binary content of http:body.
If you find a MIME type that works with CURL (you can use the -H option like the following: -H "Content-Type: application/pdf"), and it is still failing, tcpdump can help as well. Open a terminal window, and execute the following:
sudo tcpdump -s 0 -A -i any tcp and host localhost and port 9998
This will dump all traffic to localhost:9998. Then go to another terminal window (because tcpdump is still running) and execute the CURL command. After the completion, go back to the first window and press Ctrl-C (to kill tcpdump). In between, tcpdump has output to the console a dump of the request. It will as well if you keep it running when you test your query in BaseX. So you can compare both requests and see what is different (or post it here so we can see what is happening).
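The comparison step described above could be sketched roughly as follows. This is only an illustration: the two capture files are hypothetical placeholders, filled here with the request line for /tika and the two Content-Type values mentioned in this thread, not with real tcpdump output. In practice you would paste the header lines from each dump into the files instead.

```shell
#!/bin/sh
# Hypothetical capture files (placeholders, NOT real captures).
# Replace their contents with the header lines taken from tcpdump or `curl -v`.
work=$(mktemp -d)
printf '%s\n' 'PUT /tika HTTP/1.1' 'Content-Type: application/pdf' > "$work/curl.txt"
printf '%s\n' 'PUT /tika HTTP/1.1' 'Content-Type: application/octet-stream' > "$work/basex.txt"

# Sort both files so that a different header order alone does not show up as a difference.
sort "$work/curl.txt" > "$work/curl.sorted"
sort "$work/basex.txt" > "$work/basex.sorted"

# diff exits with status 1 when the files differ, so guard it with `|| true`.
header_diff=$(diff "$work/curl.sorted" "$work/basex.sorted" || true)
printf '%s\n' "$header_diff"
```

Lines starting with `<` come from the curl capture and lines starting with `>` from the BaseX capture, so any header that only one client sends (or sends with a different value) stands out.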
Regards,
-- Dirk Kirsten, BaseX GmbH, http://basex.org |-- Firmensitz: Blarerstrasse 56, 78462 Konstanz |-- Registergericht Freiburg, HRB: 708285, Geschäftsführer: | Dr. Christian Grün, Dr. Alexander Holupirek, Michael Seiferle `-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22 _______________________________________________ BaseX-Talk mailing list BaseX-Talk@mailman.uni-konstanz.de https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
On 11 January 2014 17:02, Christian Grün wrote:
@Florent: I’m not quite sure how to properly handle the input linked by @src, which is why I have sent you another e-mail via the expath mailing list. This is what I currently do [3].
I am going to have a look at your email on the EXPath mailing list, but I already had a quick look at the code. Unfortunately, I do not understand all the details (e.g. I do not know the class IOUrl).
The idea of the @src attribute is that if the source of the body is on the filesystem, there is no point in buffering it in memory, parsing it (if it is XML for instance, but even for text you have to decode the encoding), and then serialising it again. The serialisation already exists, and it is possible to stream it straight to the HTTP layer (hopefully, depending on the library used, it should be possible to plug the library straight into the file).
Of course, if the description claims that the file is, say, text in UTF-8, then it has to be, or the request is malformed.
But there is maybe a specific issue I do not see?
Regards,
Hi Florent,
thanks for your quick feedback. I’m simply quoting my EXPath feedback below: __________________________________________
I had yet another look at the HTTP Spec, and I was wondering how binary data is to be treated. The paragraph on content serialization (3.2.) says:
“This spec defines in addition the method 'binary'; in this case the body content must be either an xs:hexBinary or an xs:base64Binary item, and no other serialization parameter can be set besides media-type.“
As the body content results from an XML fragment, how can it be hex or base64? My assumption is that resources linked via the @src attribute are to be treated as either base64 or hex? In that case, it could make sense to extend Section 3.1. Here it says:
“The src attribute can be used in a request to set the body content as the content of the linked resource instead of using the children of the http:body element.”
…but it’s not clear what’s meant by setting the body content (what is e.g. expected to happen if the media type is text?) __________________________________________
Just ask if you need more details, Christian
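As a small illustration of the two item types named in the quoted spec paragraph, the same bytes can be written in the lexical form of either xs:base64Binary or xs:hexBinary. The sketch below assumes a shell with the common `base64`, `od` and `tr` utilities, and uses a PDF header string ("%PDF-1.4") as a tiny stand-in payload rather than a real file. It says nothing about how BaseX or the EXPath HTTP Client handle such values internally.

```shell
#!/bin/sh
# Stand-in payload: a PDF file starts with a header such as "%PDF-1.4".
payload='%PDF-1.4'

# Lexical form of an xs:base64Binary value holding these bytes.
b64=$(printf '%s' "$payload" | base64)

# Lexical form of an xs:hexBinary value holding the same bytes
# (upper-case digits, as in the XML Schema canonical form).
hex=$(printf '%s' "$payload" | od -An -tx1 | tr -d ' \n' | tr 'a-f' 'A-F')

printf 'base64: %s\nhex:    %s\n' "$b64" "$hex"
```

Both strings describe the same eight bytes; only the textual encoding differs, which is why the binary method can accept either type.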
On 12 January 2014 02:11, Christian Grün wrote:
thanks for your quick feedback. I’m simply quoting my EXPath feedback below:
I've already responded to your original email on the EXPath mailing list, here is the link for the archives:
https://groups.google.com/forum/#!topic/expath/PKl27uQndng
Regards,
basex-talk@mailman.uni-konstanz.de