Hi,
I am trying to work with a huge CSV file (about 380 MB), but if I build a database from it, it seems that even simple operations cannot be evaluated. Is splitting the CSV file the only option, or am I missing something here? Thanks.
Giuseppe
As there are many different ways to process large CSV data with BaseX… What did you try so far?
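One of them, just as a rough sketch, would be to parse the file on the fly with the CSV Module and query the result directly (the path and the column name below are placeholders):

  (: parse the CSV on the fly and run a simple query on the result;
     '/data/huge.csv' and 'some-column' are placeholders :)
  let $csv := csv:parse(
    file:read-text('/data/huge.csv'),
    map { 'header': true() }
  )
  return count($csv//record[some-column = 'some value'])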
I uploaded the file, as it is, in the database, but this does not help. The idea was to first transform the file into XML and then query it, but this cannot be done on the fly. So the only thing I can think of is to split the original CSV file into multiple CSV files, transform each of them into XML, and then query those. Are there alternatives? Thanks.
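Roughly, the splitting I have in mind would look like this (just a sketch; the path, the database name, and the chunk size are placeholders):

  (: split the rows into chunks, convert each chunk to XML, and add it to the
     database as a separate document; all names are placeholders :)
  let $lines  := file:read-text-lines('/data/huge.csv')
  let $header := head($lines)
  let $rows   := tail($lines)
  let $size   := 100000
  for $i in 1 to (count($rows) idiv $size) + 1
  let $chunk  := subsequence($rows, ($i - 1) * $size + 1, $size)
  where exists($chunk)
  let $xml    := csv:parse(string-join(($header, $chunk), '&#10;'),
                   map { 'header': true() })
  return db:add('mydb', $xml, 'chunk-' || $i || '.xml')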
Giuseppe
Universität Leipzig
Institute of Computer Science, NLP
Augustusplatz 10
04109 Leipzig
Deutschland
E-mail: celano@informatik.uni-leipzig.de
E-mail: giuseppegacelano@gmail.com
Web site 1: http://asv.informatik.uni-leipzig.de/en/staff/Giuseppe_Celano
Web site 2: https://sites.google.com/site/giuseppegacelano/
I uploaded it as CSV (it is CSV) via the GUI, and it is then converted into XML (this conversion probably makes it too big).
On Aug 10, 2018, at 1:50 PM, Christian Grün christian.gruen@gmail.com wrote:
I uploaded the file, as it is, in the database
So you uploaded the file as binary? Did you try to import it as XML, too? Does »upload« mean that you used the simple REST API?
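If it ended up as a raw file, importing it as XML via the CSV parser could look like this (just a sketch; the database name and the path are placeholders):

  (: create the database with the CSV parser, so the file is stored as XML
     rather than as a raw file; 'mydb' and the path are placeholders :)
  db:create(
    'mydb',
    '/data/huge.csv',
    'huge.xml',
    map { 'parser': 'csv', 'csvparser': 'header=true' }
  )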
On Fri, 2018-08-10 at 13:43 +0200, Giuseppe Celano wrote:
I uploaded the file, as it is, in the database,
I'd probably look for an XSLT transformation to turn it into XML - or there are Python and Perl scripts or other programs that can do it - and then load the result into a database.
It's not all that large a file, so maybe it'd help if you described the exact problems you were having -- what did you try, what did you expect to happen, what actually happened, what steps did you take to investigate...
Liam
Hi Liam,
Thanks for answering. The problem is not only the XML transformation per se, but also querying the documents afterwards. I see that if I split the big CSV into smaller (XML) documents and query them sequentially, I have no performance problems. This also seems to be the case in the database, as far as I can see: accessing more, smaller documents sequentially works better than one big file.
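What I mean by querying them sequentially is roughly this (a sketch; the database and element names are made up):

  (: go through the chunk documents one after the other;
     'mydb', 'record' and 'token1' are placeholders :)
  for $doc in db:open('mydb')
  return $doc//record[token1 = 'some value']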
Ciao, Giuseppe
On Sun, 2018-08-12 at 23:58 +0200, Giuseppe Celano wrote:
accessing more, smaller documents sequentially works better than one big file.
Are you building indexes in the database? Do your queries make use of them?
You may find using the full text extensions useful.
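For example (just a sketch; the database and element names are made up):

  (: a value comparison that can be rewritten to use the text index, and a
     full-text query that can use the full-text index; names are placeholders :)
  db:open('mydb')//record[token1 = 'some value'],
  db:open('mydb')//record[token1 contains text 'some value']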
Liam
Yes, I build them, but I do not use them explicitly all the time.