Support Unicode property escapes in JavaScript

hubprince
26 Aug 2020

I was hoping to run a script like this, which uses a Unicode property escape in a regular expression to wrap every run of Greek characters in a span:

$("p").html(function(index, html) {
    return html.replace(/(\p{Script=Greek}+)/gu, '<span class="greek">$1</span>');
});

Unfortunately the regular expression doesn't match anything, as it seems that \p{Script=Greek} isn't supported by Prince's JavaScript engine.

hubprince
26 Aug 2020

For reference, I was able to work around this by using regexpu to expand the Unicode property escape to a series of Unicode codepoints:

  $("p").html(function(index, html) {
    return html.replace(/((?:[\u0370-\u0373\u0375-\u0377\u037A-\u037D\u037F\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2126\uAB65]|\uD800[\uDD40-\uDD8E\uDDA0]|\uD834[\uDE00-\uDE45])+)/g, '<span class="greek">$1</span>');
  });

pjrm
7 Sep 2020

\p{Script=...} was added in ECMAScript 2018; Prince does not yet support it.
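
A script can check for this at run time before choosing between the native escape and an expanded character class. The following is only a sketch: supportsScriptEscapes is a made-up name, and it assumes that an engine without support either throws when the pattern is compiled or simply fails to match, as reported above.

```javascript
// Hypothetical helper (not part of Prince or jQuery): reports whether
// the engine understands \p{Script=...} in a "u"-flagged regex.
function supportsScriptEscapes() {
  try {
    // Built from a string so that an engine which rejects the pattern
    // or the "u" flag throws here instead of failing to parse the script.
    var re = new RegExp("\\p{Script=Greek}", "u");
    // U+03B1 GREEK SMALL LETTER ALPHA must match if the escape works.
    return re.test("\u03B1");
  } catch (e) {
    return false;
  }
}
```

If this returns false, the expanded pattern from the workaround above can be used instead.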

Regarding the trick of marking up spans of a particular script, I would add that it's useful to allow for diacritics expressed as combining marks, i.e. changing from

(\p{Script=Greek}+)

to

((?:\p{Script=Greek}\p{Script=Inherited}*)+)

and thus

((?:(?:[\u0370-\u0373\u0375-\u0377\u037A-\u037D\u037F\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2126\uAB65]|\uD800[\uDD40-\uDD8E\uDDA0]|\uD834[\uDE00-\uDE45])(?:[\u0300-\u036F\u0485-\u0486\u064B-\u0655\u0670\u0951-\u0954\u1AB0-\u1ABE\u1CD0-\u1CD2\u1CD4-\u1CE0\u1CE2-\u1CE8\u1CED\u1CF4\u1CF8-\u1CF9\u1DC0-\u1DF9\u1DFB-\u1DFF\u200C-\u200D\u20D0-\u20F0\u302A-\u302D\u3099-\u309A\uFE00-\uFE0F\uFE20-\uFE2D]|\uD800[\uDDFD\uDEE0]|\uD804\uDF3B|\uD834[\uDD67-\uDD69\uDD7B-\uDD82\uDD85-\uDD8B\uDDAA-\uDDAD]|\uDB40[\uDD00-\uDDEF])*)+)

between the two slashes of the regular expression literal.
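
To make the structure of that long pattern easier to see, the two character classes can be given names and the pattern assembled from them. This is a sketch only: GREEK, INHERITED, GREEK_RUN and wrapGreek are illustrative names, the classes are copied from the expansions above, and the function works on a plain string rather than through jQuery.

```javascript
// Expansion of \p{Script=Greek}, copied from the pattern above.
var GREEK = /(?:[\u0370-\u0373\u0375-\u0377\u037A-\u037D\u037F\u0384\u0386\u0388-\u038A\u038C\u038E-\u03A1\u03A3-\u03E1\u03F0-\u03FF\u1D26-\u1D2A\u1D5D-\u1D61\u1D66-\u1D6A\u1DBF\u1F00-\u1F15\u1F18-\u1F1D\u1F20-\u1F45\u1F48-\u1F4D\u1F50-\u1F57\u1F59\u1F5B\u1F5D\u1F5F-\u1F7D\u1F80-\u1FB4\u1FB6-\u1FC4\u1FC6-\u1FD3\u1FD6-\u1FDB\u1FDD-\u1FEF\u1FF2-\u1FF4\u1FF6-\u1FFE\u2126\uAB65]|\uD800[\uDD40-\uDD8E\uDDA0]|\uD834[\uDE00-\uDE45])/.source;

// Expansion of \p{Script=Inherited} (combining marks), also copied from above.
var INHERITED = /(?:[\u0300-\u036F\u0485-\u0486\u064B-\u0655\u0670\u0951-\u0954\u1AB0-\u1ABE\u1CD0-\u1CD2\u1CD4-\u1CE0\u1CE2-\u1CE8\u1CED\u1CF4\u1CF8-\u1CF9\u1DC0-\u1DF9\u1DFB-\u1DFF\u200C-\u200D\u20D0-\u20F0\u302A-\u302D\u3099-\u309A\uFE00-\uFE0F\uFE20-\uFE2D]|\uD800[\uDDFD\uDEE0]|\uD804\uDF3B|\uD834[\uDD67-\uDD69\uDD7B-\uDD82\uDD85-\uDD8B\uDDAA-\uDDAD]|\uDB40[\uDD00-\uDDEF])/.source;

// Same shape as ((?:\p{Script=Greek}\p{Script=Inherited}*)+):
// one or more Greek characters, each optionally followed by combining marks.
var GREEK_RUN = new RegExp("((?:" + GREEK + INHERITED + "*)+)", "g");

// Wraps each Greek run in the given string in a span with class "greek".
function wrapGreek(html) {
  return html.replace(GREEK_RUN, '<span class="greek">$1</span>');
}
```

With this pattern, a word such as "\u03BB\u03BF\u0301\u03B3\u03BF\u03C2", where the accent is a separate combining mark (U+0301), ends up in a single span; with the simpler (\p{Script=Greek}+) form the combining mark would split it into two.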

For some uses, it can also be useful to include codepoints of script Unknown or Common within the Greek spans, especially when they are adjacent to or surrounded by Greek codepoints. This typically requires human judgement, so it might be best done in a text editor that supports regexps. An editor that supports \p{...} regexps is preferable, especially if any codepoints at U+10000 or above occur in the document, as that avoids the complications with surrogate pairs in the expansions given here.
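
The surrogate-pair complication mentioned above comes from how JavaScript strings store codepoints at or above U+10000: as two 16-bit code units. As a rough illustration (toSurrogatePair and hex are hypothetical helpers, not something from regexpu or Prince), the standard UTF-16 arithmetic reproduces the \uD800[...] and \uD834[...] alternatives seen in the expansions:

```javascript
// Splits a codepoint in the range U+10000..U+10FFFF into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

// Formats a code unit the way it appears in a regex escape.
function hex(unit) {
  return "\\u" + unit.toString(16).toUpperCase();
}

// U+10140 is the first codepoint above U+FFFF in the Greek expansion above.
var pair = toSurrogatePair(0x10140);
console.log(hex(pair[0]) + hex(pair[1])); // \uD800\uDD40
```

Because a regex without the u flag sees these as two separate code units, a range of such codepoints cannot be written inside a single character class, which is why the expansions fall back to alternatives of the form \uD800[\uDD40-\uDD8E].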
