Sunday, April 15, 2018

Benchmark of Regular Expression Engines

I did some benchmarking of regular expression engines and the results are quite interesting.  The benchmark involved finding all matches of different patterns in a 15 Mb text file.  It is a similar benchmark to that found in other comparisons of regular expression engines (here and here).  The following engines were compared.
  • Delphi’s built-in engine (PCRE)
  • A modified version of Jcl’s regular expression engine, also PCRE based.  Three versions of this set up were tested:
    • with Study and JIT enabled
    • with Study and no JIT
    • without Study and JIT
  • Python’s built-in regular expression engine and
  • A relative new comer FLRE a pascal-based regular expression engine.
Here follow the results:
                                                        Time     | Match count
==============================================================================
Delphi's own TRegEx:
                                         /Twain/ :       35.00 ms |         811
                                     /(?i)Twain/ :       69.00 ms |         965
                                    /[a-z]shing/ :      510.00 ms |        1540
                    /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :      581.00 ms |         262
                                     /\b\w+nn\b/ :      548.00 ms |         262
                              /[a-q][^u-z]{13}x/ :      566.00 ms |        4094
                   /Tom|Sawyer|Huckleberry|Finn/ :      743.00 ms |        2598
               /(?i)Tom|Sawyer|Huckleberry|Finn/ :      875.00 ms |        4152
           /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :     2617.00 ms |        2598
           /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :     2814.00 ms |        1976
             /Tom.{10,25}river|river.{10,25}Tom/ :      421.00 ms |           2
                                  /[a-zA-Z]+ing/ :      763.00 ms |       78423
                         /\s[a-zA-Z]{0,12}ing\s/ :      485.00 ms |       55201
                 /([A-Za-z]awyer|[A-Za-z]inn)\s/ :     1468.00 ms |         209
                     /["'][^"']{0,30}[?!\.]["']/ :      302.00 ms |        8885
Total Time:    12812.00 ms
==============================================================================
Modified TJclWideRegEx with Study and JIT:
                                         /Twain/ :        9.00 ms |         811
                                     /(?i)Twain/ :       30.00 ms |         965
                                    /[a-z]shing/ :       74.00 ms |        1540
                    /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :       17.00 ms |         262
                                     /\b\w+nn\b/ :      115.00 ms |         262
                              /[a-q][^u-z]{13}x/ :      181.00 ms |        4094
                   /Tom|Sawyer|Huckleberry|Finn/ :       18.00 ms |        2598
               /(?i)Tom|Sawyer|Huckleberry|Finn/ :       50.00 ms |        4152
           /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :      194.00 ms |        2598
           /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :      216.00 ms |        1976
             /Tom.{10,25}river|river.{10,25}Tom/ :       24.00 ms |           2
                                  /[a-zA-Z]+ing/ :      168.00 ms |       78423
                         /\s[a-zA-Z]{0,12}ing\s/ :       97.00 ms |       55248
                 /([A-Za-z]awyer|[A-Za-z]inn)\s/ :       95.00 ms |         209
                     /["'][^"']{0,30}[?!\.]["']/ :       25.00 ms |        8885
Total Time:     1336.00 ms
==============================================================================
Modified TJclWideRegEx with Study no JIT:
                                         /Twain/ :       11.00 ms |         811
                                     /(?i)Twain/ :       42.00 ms |         965
                                    /[a-z]shing/ :      272.00 ms |        1540
                    /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :       19.00 ms |         262
                                     /\b\w+nn\b/ :      418.00 ms |         262
                              /[a-q][^u-z]{13}x/ :      384.00 ms |        4094
                   /Tom|Sawyer|Huckleberry|Finn/ :       23.00 ms |        2598
               /(?i)Tom|Sawyer|Huckleberry|Finn/ :      209.00 ms |        4152
           /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :     2664.00 ms |        2598
           /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :     2730.00 ms |        1976
             /Tom.{10,25}river|river.{10,25}Tom/ :       45.00 ms |           2
                                  /[a-zA-Z]+ing/ :      627.00 ms |       78423
                         /\s[a-zA-Z]{0,12}ing\s/ :      279.00 ms |       55248
                 /([A-Za-z]awyer|[A-Za-z]inn)\s/ :      593.00 ms |         209
                     /["'][^"']{0,30}[?!\.]["']/ :       40.00 ms |        8885
Total Time:     8389.00 ms
==============================================================================
Modified TJclWideRegEx no Study no JIT:
                                         /Twain/ :       10.00 ms |         811
                                     /(?i)Twain/ :       43.00 ms |         965
                                    /[a-z]shing/ :      341.00 ms |        1540
                    /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :      383.00 ms |         262
                                     /\b\w+nn\b/ :      500.00 ms |         262
                              /[a-q][^u-z]{13}x/ :      659.00 ms |        4094
                   /Tom|Sawyer|Huckleberry|Finn/ :      716.00 ms |        2598
               /(?i)Tom|Sawyer|Huckleberry|Finn/ :      984.00 ms |        4152
           /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :     2769.00 ms |        2598
           /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :     3130.00 ms |        1976
             /Tom.{10,25}river|river.{10,25}Tom/ :      409.00 ms |           2
                                  /[a-zA-Z]+ing/ :      845.00 ms |       78423
                         /\s[a-zA-Z]{0,12}ing\s/ :      501.00 ms |       55248
                 /([A-Za-z]awyer|[A-Za-z]inn)\s/ :      815.00 ms |         209
                     /["'][^"']{0,30}[?!\.]["']/ :      281.00 ms |        8885
Total Time:    12411.00 ms

==============================================================================           Python Regular Expressions
Twain                                    time:    0.00505 found mathces:    811
(?i)Twain                                time:      0.209 found mathces:    965
[a-z]shing                               time:       0.24 found mathces:   1540
Huck[a-zA-Z]+|Saw[a-zA-Z]+               time:     0.0807 found mathces:    262
\b\w+nn\b                                time:      0.513 found mathces:    262
[a-q][^u-z]{13}x                         time:      0.685 found mathces:   4081
Tom|Sawyer|Huckleberry|Finn              time:     0.0715 found mathces:   2598
(?i)Tom|Sawyer|Huckleberry|Finn          time:       0.93 found mathces:   4152
.{0,2}(Tom|Sawyer|Huckleberry|Finn)      time:      0.957 found mathces:   2598
.{2,4}(Tom|Sawyer|Huckleberry|Finn)      time:      0.935 found mathces:   1976
Tom.{10,25}river|river.{10,25}Tom        time:     0.0922 found mathces:      2
[a-zA-Z]+ing                             time:       0.52 found mathces:  78423
\s[a-zA-Z]{0,12}ing\s                    time:      0.299 found mathces:  55201
([A-Za-z]awyer|[A-Za-z]inn)\s            time:      0.358 found mathces:    209
["][^"]{0,30}[?!\.]["]                   time:     0.0175 found mathces:   5261
total time:       5910 ms

                                                        Time     | Match count
==============================================================================
FLRE:
                                         /Twain/ :        7.83 ms |         811
                                     /(?i)Twain/ :        4.78 ms |         965
                                    /[a-z]shing/ :        8.95 ms |        1540
                    /Huck[a-zA-Z]+|Saw[a-zA-Z]+/ :        8.14 ms |         262
                                     /\b\w+nn\b/ :       50.53 ms |         262
                              /[a-q][^u-z]{13}x/ :       94.96 ms |        4094
                   /Tom|Sawyer|Huckleberry|Finn/ :       12.24 ms |        2598
               /(?i)Tom|Sawyer|Huckleberry|Finn/ :       21.34 ms |        4152
           /.{0,2}(Tom|Sawyer|Huckleberry|Finn)/ :       47.25 ms |        2598
           /.{2,4}(Tom|Sawyer|Huckleberry|Finn)/ :       47.37 ms |        1976
             /Tom.{10,25}river|river.{10,25}Tom/ :       46.06 ms |           2
                                  /[a-zA-Z]+ing/ :       60.95 ms |       78423
                         /\s[a-zA-Z]{0,12}ing\s/ :       55.11 ms |       55248
                 /([A-Za-z]awyer|[A-Za-z]inn)\s/ :       10.76 ms |         209
                     /["'][^"']{0,30}[?!\.]["']/ :       49.21 ms |        8885
Total time:      542.93 ms



And the winner by a big margin is FLRE! 
Notes:
  • The time to read the file is not included in the timings.
  • PCRE with JIT is also quite impressive.  In other test it was found to compete well with the engines released by Intel and Google.  However I found the JIT is very buggy (crashes often) and limited (works only with very few options).

No comments: