Quantcast
Channel: SCN : All Content - ABAP Development
Viewing all articles
Browse latest Browse all 8332

Seek the most efficient way to detect whether there are table row with duplicate key

$
0
0

The requirement is: there is an internal table with a large number of table row.

 

If all rows have the identical recipient_id, that id( 30273 ) must be returned.

 

UUID

Phone_number

Recipient_id

0412ASFDSFDSFXCVS

138XXXXX1

30273

0412ASFDSFDSFXCVD

138XXXXX2

30273

0412ASFDSFDSFXCVF

138XXXXX3

30273

30273

 

If not, it must return empty.

UUID

Phone_number

Recipient_id

0412ASFDSFDSFXCVS

138XXXXX1

30273

0412ASFDSFDSFXCVD

138XXXXX2

30273

0412ASFDSFDSFXCVF

138XXXXX3

30273

30272

 

The table line type structure in the project looks like below:

clipboard1.png

Three different solutions have been made.

 

Approach1

the idea is a temporary table lt_sms_status is used to hold all the content of the internal table to be checked, and then SORT on the temporary table and delete adjacent entries. If all the table row have the same recipient id, after the operation there must be only one entry left.

    DATA: lt_sms_status LIKE it_tab.    lt_sms_status = it_tab.
SORT lt_sms_status BY recipient_id.
DELETE ADJACENT DUPLICATES FROM lt_sms_status COMPARING recipient_id.
IF lines( lt_sms_status ) = 1.
READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.        ev_rec_id = <line>-recipient_id.
ENDIF.

The drawback of approach1 is it could lead to the unnecessary high memory assumption. when lt_sms_status = it_tab is executed, no new memory allocation will not occur, until the write operation on the copied content. This behavior is documented as "Delayed Copy".

We also have concern regarding the performance of SORT and DELETE keyword when they are executed on a big internal table.

clipboard2.png

Approach2

Now we fetch the recipient id of the first row, and compare it with the left rows in the table. If most of the table rows have different recipient id, the execution has the chance to quit early. However if unfortunately all the table rows have exactly the same recipient id, this approach has to loop until last table row.

  

 DATA: lv_diff_found TYPE abap_bool VALUE abap_false.
READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.
DATA(lv_account_id) = <line>-recipient_id.
LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>).
IF lv_account_id <> <ls_line>-recipient_id.          lv_diff_found = abap_true.
EXIT.
ENDIF.
ENDLOOP.
IF lv_diff_found = abap_false.       ev_rec_id = lv_account_id.
ENDIF.

Approach3

the idea is similar as approach2, now instead of manual comparison inside each LOOP, we leverage "LOOP AT XXX WHERE condition".

  

 READ TABLE it_tab ASSIGNING FIELD-SYMBOL(<line>) INDEX 1.    LOOP AT it_tab ASSIGNING FIELD-SYMBOL(<ls_line>) WHERE recipient_id <> <line>-recipient_id.    ENDLOOP.
IF sy-subrc <> 0.       ev_rec_id = <line>-recipient_id.
ENDIF.

In order to measure the perfomance, we construct two kinds of test case. In the first one, we generate the internal table with N rows, each has exactly the same recipient id. And for the second, each one has different. Both are extreme kinds of scenarios. We may consider to measure the case between these two, for example for a N rows table there are 50% table rows have the same id and another 50% have difference one.

 

Performance test result

The time spent is measured in microsecond.

N = 1000

For the first test case, approach3 is most efficient. For the second test case, approach2 is the winner, as we expected.

clipboard4.png


N = 10000

clipboard5.png

N = 100000

clipboard6.png

N = 1000000

clipboard7.png

N = 5000000

clipboard8.png

Based on the performance result, we do not consider approach1 any more. For the choice between approach2 and 3, we need to investigate on the distriction of recipient id in the real world.

 

Maybe you can also share if you have better solutions?


Viewing all articles
Browse latest Browse all 8332

Trending Articles