Safe regex engine to prevent ReDoS attacks

Currently, the JavaScript regex engine lacks atomic groups and other features for preventing ReDoS attacks. This makes these kinds of attacks very complicated to defend against; e.g., coming up with a safe regex for a URL such as the one below can be very daunting:
/^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/

It would therefore be very good for JS to handle this the way Java and other languages do, to prevent ReDoS attacks; e.g., Java 8 had this problem, but in Java 9 it was addressed.
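To illustrate the scale of the problem (a minimal sketch; the exact slowdown depends on the engine), the overlapping quantifiers in [\w.-]+(?:\.[\w\.-]+)+ give a backtracking engine exponentially many ways to re-split a long run of "a." segments once the overall match fails:

const url = /^(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&'\(\)\*\+,;=.]+$/

// ' ' appears in none of the character classes, so the match must fail,
// and the engine retries every partition of the dots on the way out:
const evil = 'http://' + 'a.'.repeat(30) + ' '
// url.test(evil) // can take a very long time on a backtracking engine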

......
You can find some patterns and their problems here

Hi Seyyed,

See the prior discussion in the "Possessive RegExp matching" thread.

I actually started a proposal for this, but when I spoke with implementers, they said they would rather make implementation-only changes (and such changes are being pursued!). They are not interested in more advanced features like atomic groups or possessive quantifiers. They would possibly be interested in a "regular mode" flag that disables all backtracking for a particular regex.


Also, there have been some efforts to identify regexps vulnerable to catastrophic backtracking, like vuln-regex-detector. However, checking for this in the general case is very expensive (such tools only use heuristics; in the general case it's an exponential problem).
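For context, the patterns such heuristics flag are typically ambiguous nested quantifiers, where the inner and outer quantifier can split the same input in exponentially many ways (a textbook illustration, not tied to any particular detector):

// On a failing input, every partition of the 'a's between the inner
// and outer '+' is retried, so matching time grows exponentially:
const nested = /^(a+)+$/
// nested.test('a'.repeat(40) + '!') // effectively never returns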


I assume this more or less tells them to always construct a DFA where possible, even when it's expensive to construct?


Is that really the right approach? If it prevents the attack, that's good, but I don't think it's enough for a language: not all developers are security experts, so it won't stop bugs of this kind from spreading.

I'll revive this: having a flag that disables backtracking would make it trivial to avoid ReDoS, and would in general provide RegExps that are easier to reason about.

Is there interest among delegates to push this?

V8 has implemented an experimental engine that provides guaranteed linear-time matching, so it's more an implementation choice. (They do offer an additional optional /l flag, but the main goal is to use it as a fallback engine until they can get performance on par with Irregexp for simple cases.)
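For reference, here is roughly how that looks today (a sketch assuming a V8 build with the experimental engine enabled, e.g. node --enable-experimental-regexp-engine; the 'l' flag is non-standard and may change):

// The non-standard 'l' ("linear") flag requests the non-backtracking engine.
const linear = new RegExp('(a+)+$', 'l')
linear.test('a'.repeat(100) + '!') // false, in linear time
// Construction should throw for features that engine can't support,
// such as backreferences:
// new RegExp('(a)\\1', 'l') // SyntaxError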


There is a proposal for possessive quantifiers, yes: https://github.com/rbuckton/proposal-regexp-atomic-operators


There is also now @iter-tools/regex, which uses a non-backtracking algorithm and should (in theory?) be safe from ReDoS.

I'm actually working on a compose-regexp update that makes the /(?=(...))\1/ "polyfill" usable (and composable). I'll keep you posted when I publish a new version.

I've just released compose-regexp@0.6.1, which now comes with an atomic() helper that uses the lookahead / back-reference trick to emulate atomic groups (using a back-reference and look-behind when matching backwards).
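For readers unfamiliar with the trick (a generic illustration, not compose-regexp's exact output): a lookahead matches the sub-pattern without consuming input and captures it; the back-reference then consumes exactly the captured text. Since lookarounds are atomic in JavaScript, the engine can never backtrack into the captured match.

// (?=(\d+)) captures the greedy match of \d+ without consuming input;
// \1 then consumes that text verbatim. Net effect: the atomic group
// (?>\d+) that JS doesn't support natively.
const atomicDigits = /(?=(\d+))\1/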

The first example in the README is about ReDoS protection.

import {atomic, sequence} from 'compose-regexp'

// classic ReDOS-vulnerable RegExp:
const ReDOS = /^(([a-z])+.)+[A-Z]([a-z])+$/

// fixed with compose-regexp, this does not backtrack
const fixed = sequence(/^/, atomic(/(([a-z])+.)+/), /[A-Z]([a-z])+$/)
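
// A quick way to see the difference (timings vary by engine; the evil
// input below is the classic one for this pattern):
const evil = 'a'.repeat(40) + '!'
// ReDOS.test(evil) // exponential backtracking; may hang for a very long time
fixed.test(evil)    // false, returns immediately: the engine never
                    // re-partitions what the atomic group already matched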

The second deals with character class operations (difference, intersection, etc.) and arbitrary bounds:

import {bound, charSet, flags, suffix} from 'compose-regexp'

const LcGrekLetter = charSet.intersection(/\p{Lowercase}/u, /\p{Script=Greek}/u)
LcGrekLetter.test("Γ") // false
LcGrekLetter.test("γ") // true
LcGrekLetter.test("x") // false

// like /\b/ but for Greek
const b = bound(/\p{Script=Greek}/u)

const LcGrekWords = flags.add('g', [b, suffix("+", LcGrekLetter), b])
for (
  const lc of `Θεωρείται ως ο σημαντικότερος θεμελιωτής ...`.matchAll(LcGrekWords)
) {
  console.log(lc[0]) // 'ως', 'ο', 'σημαντικότερος', 'θεμελιωτής'
}
}

I really think a simple, solid fix for preventing catastrophic backtracking would be for RegExp methods, like RegExp.prototype.test and String.prototype.match, to accept some kind of optional argument for limiting processing time or operations.

Execution could be limited in one of two ways (that I can think of off-hand):

  1. By indicating a maximum timeframe (perhaps in milliseconds) that the regex may run for.
  2. Alternatively, by indicating the maximum number of backtracking steps allowed. (Unless I’m mistaken, backtracking is the only unsafe aspect of regexes.)

If the limit is exceeded, the match should fail and return something unique (e.g., a symbol, an object, or, say, null for test, which normally returns only true / false). At that point, it is simply up to the developer to decide what to do if the test runs into potential catastrophic backtracking.
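In the meantime, the timeframe variant can be approximated in userland (a sketch assuming Node.js worker_threads; testWithTimeout and its null sentinel are made-up names, not a proposed API):

import { Worker } from 'node:worker_threads'

// Run re.test(input) in a worker; terminate it and resolve null past the deadline.
function testWithTimeout(source, flags, input, ms) {
  return new Promise((resolve) => {
    const worker = new Worker(
      `const { workerData, parentPort } = require('node:worker_threads')
       const re = new RegExp(workerData.source, workerData.flags)
       parentPort.postMessage(re.test(workerData.input))`,
      { eval: true, workerData: { source, flags, input } }
    )
    const timer = setTimeout(() => { worker.terminate(); resolve(null) }, ms)
    worker.once('message', (ok) => { clearTimeout(timer); resolve(ok) })
  })
}

// await testWithTimeout('^(a+)+$', '', 'a'.repeat(40) + '!', 50) // -> null (timed out)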

I think this approach would be suitable for most cases (if not all). Very often, we know the general size we expect the input to be, which can give a general idea of the timeframe or backtracking steps that should be required.

I think just having one of these options would resolve the issue, but having both would give some flexibility, since each limit has drawbacks:

  1. A maximum timeframe might work sometimes and fail other times, since a heavy load of other tasks might cause even a simple regex test to take longer than normal.
  2. A backtracking-step limit, by contrast, guarantees the results are always the same, but it doesn’t guarantee a fixed timeframe, which could be problematic if the input is excessively larger than expected. (It could also be a problem if there are other concerns with regexes beyond catastrophic backtracking.)

The greatest benefit to this approach versus others is that it allows for safe use of arbitrary regex input. (I think? Do correct me if I’m wrong.)