This file has advice to help you merge newer versions of PCRE. JavaScriptCore's PCRE is currently based on: PCRE 6.5 With the following differences. 1) We added a PCRE_UTF16 define that makes a library that works on UTF-16 strings rather than on ASCII or UTF-8. We introduced the public typedef pcre_char and the internal typedef pcre_uchar. We changed access to the digitab and ctypes arrays to range check and work only on values in the 0-255 range. We changed GETCHAR, GETCHRATEST, GETCHARINC, GETCHARINCTEST, and GETCHARLEN so they work on UTF-16. We added ISMIDCHAR to abstract the notion of characters to skip over, and handle it right regardless of UTF-16 or UTF-8, and changed code to call it when appropriate. We added GETUTF8CHARLEN and GETUTF8CHARINC, to be used in cases where we always process UTF-8, even if the subject string is UTF-16, and changed code to call them when appropriate. 2) We added a JAVASCRIPT define that turns off and alters various features to match the requirements of the JavaScript language specification. We removed these: \C \E \G \L \N \P \Q \U \X \Z \e \l \p \u \z [::] [..] [==] (?#) (?<=) (?) (?C) (?P) (?R) (?0) (and 1-9) (?imsxUX) And we added these: \u \v And we changed the semantics for \1-style backreferences to parentheses that are not included in a match to match the empty string instead of not matching anything: This is a difference between the JavaScript language specification and the perl script. And we include ASCII 0x0B as a space. 3) We made a more-efficient version of the NO_RECURSE mode that uses goto or computed goto statements instead of setjmp/longjmp, since it's so much faster that way. We also allocated the first 16 stack frames on the stack instead of using malloc every time; we use malloc for deeper nesting. This included adding a numeric parameter to the RMATCH macro. 4) The original PCRE relied on having the input be a null-terminated string, even though pcre_exec takes a length parameter. We removed that restriction, passing additional parameters internally to make sure the code does not read off the end of the input buffer. We added the macro GETCHARLENEND to be used in some places where GETCHARLEN might otherwise walk off the end of the buffer. 5) We added code to forbid values that are not Unicode characters from being used in \x and \u escape sequences in regular expressions. 6) We changed the names of the public entry points to have a kjs prefix so they don't collide with a "real" copy of PCRE at link or load time. 7) We added a hand-edited pcre-config.h, which is used instead of a configure-generated config.h file. Note, this is made from the config.h.in from the PCRE distribution. 8) We eliminated non-ASCII characters from the source files (they were used only in one or two places). 9) We removed many unused source files. 10) We marked some additional global data tables const. 11) We fixed Unicode support for negative special classes (bug 10370). 12) And we fixed some compiler warnings. For easy merging: 1) We look for approaches that minimize changes to the base PCRE code. 2) When making global changes we leave code alone that we're not compiling. So code that's inside #if !JAVASCRIPT need not have the other changes above. This can be a bit strange. For example, there's a choice about what to do with the code to handle an end of pattern pointer or length rather than a trailing zero. Our strategy is to not make enhancements to the code that we're not compiling, so if you turned off the JAVASCRIPT flag, you'd find that the range checking changes are incomplete. This is solely to aid merging. 3) We are willing to format code strangely to minimize the differences from the base PCRE code. Differences from the base PCRE code should be viewed with these comments in mind.