Recently, a colleague of mine had to write an ABAP application with an import function for Excel files. I recommended that he use the abap2xlsx package, which serves as a full-featured mediator between the ABAP and Excel worlds. In particular, abap2xlsx contains a reader class, ZCL_EXCEL_READER_2007, designed for importing an .xlsx file into the ABAP data structures on which the package is based.
To my astonishment, the reader dumped with a memory overflow error for a sample Excel file.
- Yes - with about 28'000 rows and 50 columns it was a large file!
- And, yes, it is crazy to have a business process with files of these dimensions! No discussion about that.
On the other hand, the file had a size of 6.5 MB, which was moderate compared to the heap memory limit of 2 GB that had been reached according to the short dump.
I was curious how memory consumption of well over 100 times the size of the original data could be explained.
So I wrote a little test report for analyzing the situation with the memory inspector.
report zz_test_abap2xlsx_reader.

parameters: p_file type string memory id fil.

at selection-screen on value-request for p_file.
  perform get_file(zz_file_io_forms) using p_file.

start-of-selection.
  perform start.

* ---
form start.
  data: lo_excel_reader type ref to zif_excel_reader,
        lo_excel        type ref to zcl_excel.
  create object lo_excel_reader type zcl_excel_reader_2007.
  lo_excel = lo_excel_reader->load_file( p_file ).
  break-point.
endform. "start
Starting the report with one of those large Excel files and stepping through the code with the debugger, I arrive at the place where the worksheet is to be parsed. Everything looks normal. There are some 9 MB allocated for the XSTRING containing the file. At this point, this is the largest object in memory. No problem. The second largest object is the "table of X", the result of the file read process (more or less a redundant copy of the same data, but negligible compared to the 2 GB we are talking about).
Peanuts!
At that point in the debugger, I am just about to step into the method get_ixml_from_zip_archive( ). This is a general-purpose method in the reader class, which loads an XML document contained in the zip archive; after loading, the document is parsed with the methods of the IF_IXML family, and a reference to the resulting XML DOM object is passed back to the caller.
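For readers who have not worked with iXML, here is a minimal sketch of what such a load-and-parse step typically looks like with the IF_IXML interfaces. It is not the actual abap2xlsx code: the form name and the variable names are made up for illustration, and iv_content is assumed to hold the already unzipped XML document as an XSTRING.

```abap
" Hypothetical helper, not taken from abap2xlsx:
" parse an XML document given as XSTRING into an iXML DOM
form parse_to_dom using    iv_content  type xstring
                  changing co_document type ref to if_ixml_document.

  data: lo_ixml          type ref to if_ixml,
        lo_streamfactory type ref to if_ixml_stream_factory,
        lo_istream       type ref to if_ixml_istream,
        lo_parser        type ref to if_ixml_parser.

  lo_ixml          = cl_ixml=>create( ).
  lo_streamfactory = lo_ixml->create_stream_factory( ).
  lo_istream       = lo_streamfactory->create_istream_xstring( string = iv_content ).
  co_document      = lo_ixml->create_document( ).
  lo_parser        = lo_ixml->create_parser( stream_factory = lo_streamfactory
                                             istream        = lo_istream
                                             document       = co_document ).

  " parse( ) builds the complete DOM tree in memory before it returns;
  " a non-zero return code signals a parse error
  if lo_parser->parse( ) <> 0.
    clear co_document.
  endif.

endform.
```

The point to note is that the whole document tree exists in memory once parse( ) returns, whether or not the caller needs all of it.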
Here, I am immediately before the call of the parse() method of the if_ixml_parser object. Memory consumption still looks more than modest:
Now, with a single step, I am triggering the parse() method. Processing takes a few minutes until control is given back to the debugger. From a memory perspective, the result is amazing:
With this single method call (which is implemented "by kernel module" and thus cannot be analyzed further with ABAP means), the overall memory consumption rises to more than 2 GB!
Even more amazing: The 2 GB cannot be explained from the memory inspector's individual object view:
Here, the top object only requires some 100 MB. This is the decompressed XML worksheet, as produced by the zip loader. Of the remaining 2 GB, we see absolutely nothing.
So we have to face the fact that the iXML DOM parser in this case requires about 20 times as much memory as the raw (decompressed) document that is to be parsed.
Since nothing could be done at this point (after all, iXML is a kernel component), I looked for alternatives. My plan was: If we skip the DOM parsing and are content with grabbing the element contents from the stream while we are reading it, memory consumption and performance could improve considerably.
I could prove that this really is the case by implementing an alternative parser class, ZCL_XLSX_PARSER, on the basis of the sXML family, in particular of the IF_SXML_READER implementation provided by this family. If you are happy with simply reading the raw data from the file (as we were in our particular case), omitting further functionality like macros, cell styles etc., the class ZCL_XLSX_PARSER will be fully sufficient.
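To illustrate the streaming idea in isolation, here is a minimal sketch (not the ZCL_XLSX_PARSER code itself) that unpacks a worksheet with CL_ABAP_ZIP and merely counts the cell elements with a token-based sXML reader. The form name, the variable names and the worksheet path xl/worksheets/sheet1.xml are assumptions for illustration; the actual path depends on the workbook.

```abap
" Hypothetical helper: count the <c> elements of one worksheet
" without ever building a DOM
form count_cells using    iv_xlsx  type xstring
                 changing cv_cells type i.

  data: lo_zip    type ref to cl_abap_zip,
        lv_sheet  type xstring,
        lo_reader type ref to if_sxml_reader.

  clear cv_cells.

  " Unpack the worksheet XML from the .xlsx zip archive
  create object lo_zip.
  lo_zip->load( exporting  zip    = iv_xlsx
                exceptions others = 1 ).
  check sy-subrc = 0.
  lo_zip->get( exporting  name    = 'xl/worksheets/sheet1.xml' " assumed path
               importing  content = lv_sheet
               exceptions others  = 1 ).
  check sy-subrc = 0.

  " The reader only scans; nothing is kept unless we keep it ourselves
  lo_reader = cl_sxml_string_reader=>create( lv_sheet ).
  try.
      do.
        lo_reader->next_node( ).
        if lo_reader->node_type = if_sxml_node=>co_nt_final.
          exit.
        endif.
        if lo_reader->node_type = if_sxml_node=>co_nt_element_open
           and lo_reader->name = 'c'.
          add 1 to cv_cells.
        endif.
      enddo.
    catch cx_sxml_parse_error.
      clear cv_cells. " malformed XML: report nothing
  endtry.

endform.
```

Apart from the decompressed XSTRING, the only state held here is a single counter, which is why memory consumption stays close to the size of the input.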
At the same place as above, immediately before parsing, the memory in the debugger looks similar: On top, we see the decompressed XML file, allocating about 100 MB as above:
Now I execute the parse_cell_data( ) method.
For the same file that I tested above, the memory consumption increased only moderately through the parsing: It was 117,707,688 bytes before the call of parse_cell_data( ), and is 141,485,696 bytes afterwards. The difference - about 23 MB - is needed for storing the result (the deep structure ES_EXCEL above, containing internal tables with the cell data). All the figures can be explained by looking at the consumption of the ABAP data alone. We are still far away from the 2 GB limit.
The execution speed is better, too. With the sXML reader, the parsing needs about 52 seconds. With the iXML reader, the parsing of the same worksheet alone took 340 seconds.
The main difference between IF_IXML and IF_SXML is that the latter is only a scanner. It detects the beginning, the end, the attributes and the contents of XML elements while reading the stream. That's it. With IF_SXML, I have to instrument the reader myself if I want to extract data from the reading process. Here is the implementation of the parse_cell_data( ) method, just to give you the idea. If you want to study the whole class, have a look into it here.
method parse_cell_data.

  data: ls_cell   type ty_excel_cell,
        lv_number type decfloat34,
        lv_time   type t,
        lv_date   type d.

* Main parse loop
  while io_reader->node_type ne if_sxml_node=>co_nt_final.
    io_reader->next_node( ).
    case io_reader->name.
      when 'row'. " A row
        check io_reader->node_type = if_sxml_node=>co_nt_element_open.
        add 1 to ls_cell-row.
        clear: ls_cell-col, ls_cell-ref, ls_cell-type.
      when 'c'. " A column, child of a row
        case io_reader->node_type.
          when if_sxml_node=>co_nt_element_open.
            add 1 to ls_cell-col.
            ls_cell-type = get_cell_type( io_reader ).
          when if_sxml_node=>co_nt_element_close.
            insert ls_cell into table cs_excel-cells.
        endcase.
      when 'v'. " A value - may be a reference to the string table
        if io_reader->node_type eq if_sxml_node=>co_nt_element_close.
          case ls_cell-type.
            when gc_cell_datatype-string. " Reference index into the string table
              ls_cell-ref = io_reader->value + 1. " is 0-based in xlsx
            when gc_cell_datatype-number.
              lv_number = io_reader->value.
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
            when gc_cell_datatype-date.
* Convert to the ABAP-conformal internal date representation
              lv_number = io_reader->value + 693595.
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
            when gc_cell_datatype-time.
* Convert to the ABAP-conformal internal time representation
* Work with milliseconds, however, for further improvements
              lv_number = round( val = 86400 * io_reader->value dec = 3 ).
              append lv_number to cs_excel-numbers.
              ls_cell-ref = sy-tabix.
          endcase.
        endif.
      when 'is'. " Inline Strings: Just append them to the stringtab
        if io_reader->node_type eq if_sxml_node=>co_nt_element_close.
          append io_reader->value to cs_excel-strings.
          ls_cell-ref = sy-tabix.
          ls_cell-type = gc_cell_datatype-string.
        endif.
    endcase.
  endwhile.

endmethod.
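Since dates and times end up as plain numbers in cs_excel-numbers, a consumer has to turn them back into ABAP date and time fields. The following is a minimal sketch of how that might look, assuming the conventions of the method above (a day count for dates, seconds since midnight for times). The variable names and sample values are illustrative only, and the sketch relies on the standard ABAP conversion rules for assigning numeric values to fields of type d and t.

```abap
data: lv_fraction type decfloat34 value '0.5', " Excel time value: 0.5 = noon
      lv_number   type decfloat34,
      lv_date     type d,
      lv_time     type t.

" Date: the parser stored (Excel serial + 693595). Assigning a number
" to a field of type d interprets it as a day count and yields yyyymmdd.
lv_number = 42000 + 693595.   " 42000 is an arbitrary Excel date serial
lv_date   = lv_number.

" Time: the parser stored seconds since midnight, rounded to milliseconds.
" Assigning a number to a field of type t interprets it as seconds
" since midnight and yields hhmmss.
lv_number = round( val = 86400 * lv_fraction dec = 3 ).  " 43200 seconds
lv_time   = lv_number.        " 120000, i.e. 12:00:00
```

Keeping the numbers in one table and converting only on access means the parser itself never has to create date or time fields per cell.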
By the way: The code has been written with support of unit tests. The above method, like the rest of the core methods, is covered to 100%. There are only a few unprocessed boundary cases in the unit tests: For example the case of a corrupt zip file (which cannot be unpacked properly). See the ABAP unit test section in the code repository for details.