Array Shapes For Preg Match Matches
In August 2023, I started into an adventure which in the end took me 10 months to figure out. It’s another part about my ongoing efforts to close blind spots in PHPStan’s type inference.
I did similar things before with phpstan-dba
, which implements SQL based static analysis and type inference for the database access layer.
The journey to precise array-shapes for preg_match
$matches
In its most basic form, we search for the answer to the following question:
How does the $matches
array look like after a preg_match
call?
function doFoo(string $s): void {
if (preg_match('/(?:(a)(\d))?(c)(\s)*/', $s, $matches)) {
// how can $matches look like at this line?
} else {
// how can $matches look like at this line?
}
// how can $matches look like at this line?
}
I am not aware of any static analysis tool which is able to figure this out, so it kind of was clear that this will take a few experiments and time-consuming research. Play with the example in the PHPStan playground.
To explore a possible solution, I had to answer a few questions:
- Which capturing groups (named vs. unnamed) are contained in the used pattern?
- Which capturing groups are optional/conditional?
- How do the capturing groups relate to the array-shape of
$matches
? - How can the
$flags
parameter influence the array-shape of$matches
? - How do the resulting array-shapes flow through the branches of the if-else construct?
- How to implement this type-inference improving mechanism in a way, that other
preg_match
wrapping libraries could benefit from it (e.g.composer/pcre
,nette/utils
)?
Thanks to the great PHPStan community a few other people stopped by and helped me with some super special corner cases. Also adding more test-cases to the initial prototype was really helpful to get a high quality implementation in the end.
TL;DR: The feature is already merged into PHPStan starting with 1.11.6 and can be enabled via Bleeding Edge.
Update: Starting with PHPStan 1.12.x precise type inference for regular expressions is enabled by default.
Most relevant pull requests along the road were…
- the very first iteration
- implementation of ParameterOutTypeExtensions
- improvements in the type-specifier to handle non-strict comparisons
- drop the regex pattern hack and do everything in the regex AST
- handling of optional top level groups
- handling of top level alternation groups
Figuring this one out was a joy, sometimes frustrating, and a time-consuming task.
It’s a thing no other static analyzer I am aware of can handle and it will save any PHPStan user fiddling with preg_match
a lot of time and effort.
Please considering sponsoring my open-source efforts 💕.
TL;DR aside, lets dive into it…
Which capturing groups are contained in the used pattern?
One of the easier questions at first sight, since the initial requester of the above feature provided a regex pattern hack which obviously provided this information. I went with this hack for a few months and moved along.
While adding more and more test-cases with different patterns, we realized that the hack was not reliable. It needed a few tweaks to also work with named capturing. It does not work consistently across PHP versions.
As an alternative I started playing around with Hoa\Regex
, a library already contained in PHPStan to build a abstract syntax tree (AST) for regex patterns.
It’s the only library I could find in the PHP ecosystem suitable for this task. An additional complication is, that this library is not maintained anymore and has a few bugs.
To get the AST parsing up to speed, I had to backport a few yet unreleased fixes from the upstream repository and with the support of Michael Voříšek we were able to fix the grammar file so named capturing groups were properly recognized.
In the end we decided to go with the AST parsing, since it was more reliable and also was the only solution which would work consistently for all php versions PHPStan 1.x supports (PHP 7.2+).
Which capturing groups are optional/conditional? How do the capturing groups relate to the array-shape of $matches
?
In early prototype stage I had implemented a hybrid approach between the regex pattern hack and the AST parsing.
We used the AST to identify which capturing groups would be contained and the pattern hack with PREG_UNMATCHED_AS_NULL
to get an idea of the optional/conditional groups.
PREG_UNMATCHED_AS_NULL
started working properly in PHP 7.4, so making this work consistently across php-versions was another problem to solve.
Later I re-implemented the optional/conditional capturing group detection with plain AST based logic, which was a hell of a ride on its own.
The main quest was to figure out when preg_match
would leave out a capturing group from $matches
(trailing optional groups) and how to properly structure the shape,
when optional capturing groups are involved before mandatory capturing groups.
Additionally, it’s not that easy to figure out, when a capturing group is optional or conditional.
A group might be part of an alternation like (?:(\d)|(\w))
or (?:(\d)|(\w)|no-group)
.
An alternation element might be optional on its own - as in (?:(\d)*|(\w))
- or the whole alternation might be optional like in (?:(\d)|(\w))?
- or a mix of all that.
As you might already imagine the field is pretty complex and doing the regex AST dance properly is quite a challenge.
You can find what was needed to get this working in the related classes: RegexArrayShapeMatcher, RegexCapturingGroup, RegexNonCapturingGroup.
At this point the implementation got simpler because we no longer had this hybrid thing.
Ondřej was also pretty happy about that:
How can the $flags
parameter influence the array-shape of $matches
?
That one was easier than the others. Bonus points were in because possible flags are php-version specific.
Flags like PREG_UNMATCHED_AS_NULL
can also lead to $matches
to contain null
values.
PREG_OFFSET_CAPTURE
will lead to a different array-shape, since values will be accompanied by their offset in the input string.
How do the resulting array-shapes flow through the branches of the if-else construct?
Let’s have a look back at our initial example:
function doFoo(string $s): void {
if (preg_match('/(?:(a)(\d))?(c)(\s)*/', $s, $matches)) {
// (a) how can $matches look like at this line?
} else {
// (b) how can $matches look like at this line?
}
// (c) how can $matches look like at this line?
}
One might think getting it resolved should be some kind of already solved puzzle. preg_match
needs some special treatment though,
because of the by-ref $matches
arg is changing the variable outside the if-branch scope.
See the following example which asserts the expected PHPStan type-inference within the given branches:
use function PHPStan\Testing\assertType;
function doFoo(string $s): void {
if (preg_match('/(?:(a)(\d))?(c)(\s)*/', $s, $matches)) {
// (a)
assertType('array{0: string, 1: string, 2: string, 3: string, 4?: string}', $matches);
} else {
// (b)
assertType('array{}', $matches);
}
// (c)
assertType('array{}|array{0: string, 1: string, 2: string, 3: string, 4?: string}', $matches);
}
- In the
(a)
branch, the pattern surely matches, so the array-shape consists of a mix of always-matched and sometimes-matched offsets - In the
(b)
branch, the pattern surely does not match, so the array-shape is empty - In the
(c)
branch, we don’t know whether the pattern matched, therefore the array-shape could be empty or a match
If you are interested in other test-cases and the types PHPStan can understand in these situations, please consult the test-suite. Alternatively copy the example code, drop it into the PHPStan online playground (don’t forget to enable the ‘Bleeding Edge’ checkbox) and see the expected types.
In an early prototype I was using only a TypeSpecifyingExtension to override the type of $matches
. This lead to some consequential problems though.
TypeSpecifyingExtension are meant to narrow an existing type for the if-branch and/or the else-branch. It will not change the types after the if/else construct though.
We had to come up with a new type of PHPStan extension to properly handle the by-ref $matches
argument.
Up to this point in time a param-out
type could only be defined using phpDoc.
So we implemented ParameterOutTypeExtensions which allow to define param-out
types programmatically and in a context-sensitive way.
The idea is, to use a FunctionParameterOutTypeExtension
to type $matches
the way the outer scope expects it to be (see (c)
).
On top, we use a FunctionTypeSpecifyingExtension
to narrow this type for the if-branch (a)
and/or the else-branch (b)
.
How to implement this type-inference improving mechanism in a way, that other preg_match
wrapping libraries could benefit from it?
In the previous chapter we learned what PHPStan needs to ship in its core to support $matches type-inference for the preg_match
function.
The mentioned FunctionParameterOutTypeExtension
and FunctionTypeSpecifyingExtension
both rely on the magic which happens in RegexArrayShapeMatcher
, which is doing the heavy lifting.
This RegexArrayShapeMatcher
-class is declared as @api
which means it is meant for use by other extensions outside the phpstan-src repository.
We use it to implement the same type inference capabilities in nette/utils or composer/pcre.
You might also use this class to build custom extensions for your very own preg_match
-wrapping API.
Future work
For the future is planned to
- stabilize the implementation to make it general available (without Bleeding Edge)
- finalize the
composer/pcre
integration - finalize the PHP-CS-Fixer
Preg::match
integration - use similar type narrowing for
preg_match_all
and maybe other functions - use more precise types when possible
Support my open source work
In case this article was useful, or you want to honor the effort I put into one of the hundreds of pull-requests to PHPStan, please considering sponsoring my open-source efforts 💕.
Found a bug? Please help improve this article.