Metadata-Version: 2.1
Name: regex
Version: 2019.6.8
Summary: Alternative regular expression module, to replace re.
Home-page: https://bitbucket.org/mrabarnett/mrab-regex
Author: Matthew Barnett
Author-email: regex@mrabarnett.plus.com
License: Python Software Foundation License
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Python Software Foundation License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: General
Introduction
------------
This regex implementation is backwards-compatible with the standard 're' module, but offers additional functionality.
Note
----
The re module's behaviour with zero-width matches changed in Python 3.7, and this module will follow that behaviour when compiled for Python 3.7.
Old vs new behaviour
--------------------
In order to be compatible with the re module, this module has 2 behaviours:
* **Version 0** behaviour (old behaviour, compatible with the re module):

  Please note that the re module's behaviour may change over time, and I'll endeavour to match that behaviour in version 0.

  * Indicated by the ``VERSION0`` or ``V0`` flag, or ``(?V0)`` in the pattern.
  * Zero-width matches are not handled correctly in the re module before Python 3.7. The behaviour in those earlier versions is:

    * ``.split`` won't split a string at a zero-width match.
    * ``.sub`` will advance by one character after a zero-width match.

  * Inline flags apply to the entire pattern, and they can't be turned off.
  * Only simple sets are supported.
  * Case-insensitive matches in Unicode use simple case-folding by default.

* **Version 1** behaviour (new behaviour, possibly different from the re module):

  * Indicated by the ``VERSION1`` or ``V1`` flag, or ``(?V1)`` in the pattern.
  * Zero-width matches are handled correctly.
  * Inline flags apply to the end of the group or pattern, and they can be turned off.
  * Nested sets and set operations are supported.
  * Case-insensitive matches in Unicode use full case-folding by default.

If no version is specified, the regex module will default to ``regex.DEFAULT_VERSION``.
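A minimal sketch of how the version choice changes the scope of an inline flag. It assumes the regex module is installed; the variable names are only illustrative.

```python
import regex

# Version 1 selected inline: (?i) applies only from where it appears
# to the end of the pattern, so "a" stays case-sensitive.
scoped = regex.compile(r'(?V1)a(?i)b')
scoped_hit = scoped.match('aB')
scoped_miss = scoped.match('AB')

# The same version selected with a flag argument instead of (?V1).
scoped_flag = regex.compile(r'a(?i)b', regex.VERSION1)
flag_hit = scoped_flag.match('aB')
flag_miss = scoped_flag.match('AB')

# Under version 1 an inline flag can be turned off again with (?-i).
toggled = regex.compile(r'(?V1)(?i)a(?-i)b')
toggled_hit = toggled.match('Ab')
toggled_miss = toggled.match('AB')

print(scoped_hit, scoped_miss)
print(toggled_hit, toggled_miss)
```

Passing the version explicitly, as above, avoids depending on whatever ``regex.DEFAULT_VERSION`` happens to be.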
Case-insensitive matches in Unicode
-----------------------------------
The regex module supports both simple and full case-folding for case-insensitive matches in Unicode. Use of full case-folding can be turned on using the ``FULLCASE`` or ``F`` flag, or ``(?f)`` in the pattern. Please note that this flag affects how the ``IGNORECASE`` flag works; the ``FULLCASE`` flag itself does not turn on case-insensitive matching.
In the version 0 behaviour, the flag is off by default.
In the version 1 behaviour, the flag is on by default.
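As a rough illustration of the difference, the sketch below matches ``strasse`` against a string containing a sharp s. It assumes the regex module is installed; the variable names are illustrative.

```python
import regex

text = 'stra\N{LATIN SMALL LETTER SHARP S}e'

# Version 1: full case-folding is on by default with IGNORECASE,
# so "ss" in the pattern can match the single sharp s.
v1_match = regex.match(r'(?iV1)strasse', text)

# Version 0: simple case-folding by default, so there is no match.
v0_match = regex.match(r'(?iV0)strasse', text)

# Version 0 with FULLCASE turned on explicitly alongside IGNORECASE.
v0_full = regex.match(r'strasse', text,
                      regex.V0 | regex.IGNORECASE | regex.FULLCASE)

print(v1_match.span())
print(v0_match)
print(v0_full.span())
```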
Nested sets and set operations
------------------------------
It's not possible to support both simple sets, as used in the re module, and nested sets at the same time because of a difference in the meaning of an unescaped ``"["`` in a set.
For example, the pattern ``[[a-z]--[aeiou]]`` is treated in the version 0 behaviour (simple sets, compatible with the re module) as:
* Set containing "[" and the letters "a" to "z"
* Literal "--"
* Set containing letters "a", "e", "i", "o", "u"
* Literal "]"
but in the version 1 behaviour (nested sets, enhanced behaviour) as:
* Set which is:

  * Set containing the letters "a" to "z"

* but excluding:

  * Set containing the letters "a", "e", "i", "o", "u"

Version 0 behaviour: only simple sets are supported.
Version 1 behaviour: nested sets and set operations are supported.
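The two readings of the same pattern can be checked directly. This is a small sketch that assumes the regex module is installed; the sample strings are made up for illustration.

```python
import regex

pattern = r'[[a-z]--[aeiou]]'

# Version 1: a set difference, i.e. lowercase letters that are not vowels.
consonants = regex.findall(pattern, 'regex', regex.V1)

# Version 0: a set, then a literal "--", then a set, then a literal "]".
literal = regex.fullmatch(pattern, 'x--a]', regex.V0)
single = regex.fullmatch(pattern, 'x', regex.V0)

print(consonants)
print(literal)
print(single)
```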
Flags
-----
There are 2 kinds of flag: scoped and global. Scoped flags can apply to only part of a pattern and can be turned on or off; global flags apply to the entire pattern and can only be turned on.
The scoped flags are: ``FULLCASE``, ``IGNORECASE``, ``MULTILINE``, ``DOTALL``, ``VERBOSE``, ``WORD``.
The global flags are: ``ASCII``, ``BESTMATCH``, ``ENHANCEMATCH``, ``LOCALE``, ``POSIX``, ``REVERSE``, ``UNICODE``, ``VERSION0``, ``VERSION1``.
If neither the ``ASCII``, ``LOCALE`` nor ``UNICODE`` flag is specified, it will default to ``UNICODE`` if the regex pattern is a Unicode string and ``ASCII`` if it's a bytestring.
The ``ENHANCEMATCH`` flag makes fuzzy matching attempt to improve the fit of the next match that it finds.
The ``BESTMATCH`` flag makes fuzzy matching search for the best match instead of the next match.
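The sketch below contrasts a scoped flag with a global one and shows the default choice between ``UNICODE`` and ``ASCII``. It assumes the regex module is installed; the sample strings are illustrative.

```python
import regex

# Scoped flag: IGNORECASE applies only inside the (?i:...) group.
scoped_ok = regex.fullmatch(r'(?i:hello) world', 'HELLO world')
scoped_fail = regex.fullmatch(r'(?i:hello) world', 'HELLO WORLD')

word = 'caf\u00e9'  # ends with a non-ASCII letter

# A str pattern defaults to UNICODE, so \w matches the accented letter.
unicode_default = regex.fullmatch(r'\w+', word)

# ASCII is a global flag: it applies to the entire pattern.
ascii_forced = regex.fullmatch(r'\w+', word, regex.ASCII)

# A bytes pattern defaults to ASCII.
bytes_default = regex.fullmatch(rb'\w+', word.encode('latin-1'))

print(scoped_ok, scoped_fail)
print(unicode_default, ascii_forced, bytes_default)
```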
Notes on named capture groups
-----------------------------
All capture groups have a group number, starting from 1.
Groups with the same group name will have the same group number, and groups with a different group name will have a different group number.
The same name can be used by more than one group, with later captures 'overwriting' earlier captures. All of the captures of the group will be available from the ``captures`` method of the match object.
Group numbers will be reused across different branches of a branch reset, e.g. ``(?|(first)|(second))`` has only group 1. If capture groups have different group names then they will, of course, have different group numbers, e.g. ``(?|(?P<foo>first)|(?P<bar>second))`` has group 1 ("foo") and group 2 ("bar").
In the regex ``(\s+)(?|(?P<foo>[A-Z]+)|(\w+) (?P<foo>[0-9]+))`` there are 2 groups:
* ``(\s+)`` is group 1.
* ``(?P<foo>[A-Z]+)`` is group 2, also called "foo".
* ``(\w+)`` is group 2 because of the branch reset.
* ``(?P<foo>[0-9]+)`` is group 2 because it's called "foo".
If you want to prevent ``(\w+)`` from being group 2, you need to name it (different name, different group number).
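The numbering rules above can be inspected through the ``groups`` and ``groupindex`` attributes of a compiled pattern. This is a sketch that assumes the regex module is installed; the variable names and the sample string are illustrative.

```python
import regex

reset = regex.compile(r'(?|(first)|(second))')
named = regex.compile(r'(?|(?P<foo>first)|(?P<bar>second))')
mixed = regex.compile(r'(\s+)(?|(?P<foo>[A-Z]+)|(\w+) (?P<foo>[0-9]+))')

print(reset.groups)                          # number of capture groups
print(dict(named.groupindex))                # name -> group number
print(mixed.groups, dict(mixed.groupindex))

# The second branch matches here; "foo" reports its last capture.
m = mixed.match(' abc 123')
print(m.group('foo'))

# A repeated group keeps every capture, not only the last one.
repeated = regex.match(r'(?P<letter>\w)+', 'abc')
print(repeated.group('letter'), repeated.captures('letter'))
```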
Multithreading
--------------
The regex module releases the GIL during matching on instances of the built-in (immutable) string classes, enabling other Python threads to run concurrently. It is also possible to force the regex module to release the GIL during matching by calling the matching methods with the keyword argument ``concurrent=True``. The behaviour is undefined if the string changes during matching, so use it *only* when it is guaranteed that that won't happen.
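A minimal sketch of passing ``concurrent=True`` from worker threads. It assumes the regex module is installed; ``count_words`` and the sample texts are hypothetical names used only for illustration.

```python
import regex
from concurrent.futures import ThreadPoolExecutor

pattern = regex.compile(r'\b\w+\b')
texts = ['alpha beta', 'gamma delta epsilon', 'zeta']

def count_words(text):
    # The inputs are immutable str objects, so they cannot change
    # during matching and releasing the GIL is safe here.
    return len(pattern.findall(text, concurrent=True))

with ThreadPoolExecutor(max_workers=3) as pool:
    counts = list(pool.map(count_words, texts))

print(counts)
```

For built-in strings the keyword is redundant, since the GIL is released anyway; it matters for other string-like objects that you know will not be modified.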
Unicode
-------
This module supports Unicode 12.1.0.
Full Unicode case-folding is supported.
Additional features
-------------------
The issue numbers relate to the Python bug tracker, except where listed as "Hg issue".
* Added support for lookaround in conditional pattern (`Hg issue 163 <https://bitbucket.org/mrabarnett/mrab-regex/issues/163>`_)
The test of a conditional pattern can now be a lookaround.
Examples:
.. sourcecode:: python
>>> regex.match(r'(?(?=\d)\d+|\w+)', '123abc')
<regex.Match object; span=(0, 3), match='123'>
>>> regex.match(r'(?(?=\d)\d+|\w+)', 'abc123')
<regex.Match object; span=(0, 6), match='abc123'>
This is not quite the same as putting a lookaround in the first branch of a pair of alternatives.
Examples:
.. sourcecode:: python
>>> print(regex.match(r'(?:(?=\d)\d+\b|\w+)', '123abc'))
<regex.Match object; span=(0, 6), match='123abc'>
>>> print(regex.match(r'(?(?=\d)\d+\b|\w+)', '123abc'))
None
In the first example, the lookaround matched but the remainder of the first branch failed to match, so the second branch was attempted. In the second example, the lookaround matched and the first branch failed to match, but the second branch was **not** attempted.
* Added POSIX matching (leftmost longest) (`Hg issue 150 <https://bitbucket.org/mrabarnett/mrab-regex/issues/150>`_)
The POSIX standard for regex is to return the leftmost longest match. This can be turned on using the ``POSIX`` flag (``(?p)``).
Examples:
.. sourcecode:: python
>>> # Normal matching.
>>> regex.search(r'Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 2), match='Mr'>
>>> regex.search(r'one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 7), match='oneself'>
>>> # POSIX matching.
>>> regex.search(r'(?p)Mr|Mrs', 'Mrs')
<regex.Match object; span=(0, 3), match='Mrs'>
>>> regex.search(r'(?p)one(self)?(selfsufficient)?', 'oneselfsufficient')
<regex.Match object; span=(0, 17), match='oneselfsufficient'>
Note that it will take longer to find matches because when it finds a match at a certain position, it won't return that immediately, but will keep looking to see if there's another longer match there.
* Added ``(?(DEFINE)...)`` (`Hg issue 152 <https://bitbucket.org/mrabarnett/mrab-regex/issues/152>`_)
If there's no group called "DEFINE", then ``...`` will be ignored, but any group definitions within it will be available.
Examples:
.. sourcecode:: python
>>> regex.search(r'(?(DEFINE)(?P<quant>\d+)(?P<item>\w+))(?&quant) (?&item)', '5 elephants')
<regex.Match object; span=(0, 11), match='5 elephants'>
* Added ``(*PRUNE)``, ``(*SKIP)`` and ``(*FAIL)`` (`Hg issue 153 <https://bitbucket.org/mrabarnett/mrab-regex/issues/153>`_)
``(*PRUNE)`` discards the backtracking info up to that point. When used in an atomic group or a lookaround, it won't affect the enclosing pattern.
``(*SKIP)`` is similar to ``(*PRUNE)``, except that it also sets where in the text the next attempt to match will start. When used in an atomic group or a lookaround, it won't affect the enclosing pattern.
``(*FAIL)`` causes immediate backtracking. ``(*F)`` is a permitted abbreviation.
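A short sketch of two of these verbs, assuming the regex module is installed. The ``(*SKIP)(*FAIL)`` combination is a commonly used idiom for excluding parts of the text; the sample strings are illustrative.

```python
import regex

# (*FAIL) forces backtracking, so the first alternative can never
# succeed and the match comes from the second one.
fail_match = regex.search(r'a(*FAIL)|b', 'ab')
abbrev_match = regex.search(r'a(*F)|b', 'ab')

# (*SKIP)(*FAIL): match a quoted section, then reject it and resume
# searching after it, so only words outside the quotes are returned.
words = regex.findall(r'"[^"]*"(*SKIP)(*FAIL)|\w+',
                      'keep "drop this" keep2')

print(fail_match.group(), abbrev_match.group())
print(words)
```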
* Added ``\K`` (`Hg issue 151 <https://bitbucket.org/mrabarnett/mrab-regex/issues/151>`_)
Keeps the part of the entire match after the position where ``\K`` occurred; the part before it is discarded.
It does not affect what capture groups return.
Examples:
.. sourcecode:: python
>>> m = regex.search(r'(\w\w\K\w\w\w)', 'abcdef')
>>> m[0]
'cde'
>>> m[1]
'abcde'
>>>
>>> m = regex.search(r'(?r)(\w\w\K\w\w\w)', 'abcdef')
>>> m[0]
'bc'
>>> m[1]
'bcdef'
* Added capture subscripting for ``expandf`` and ``subf``/``subfn`` (`Hg issue 133 <https://bitbucket.org/mrabarnett/mrab-regex/issues/133>`_)
You can now use subscripting to get the captures of a repeated capture group.
Examples:
.. sourcecode:: python
>>> m = regex.match(r"(\w)+", "abc")
>>> m.expandf("{1}")
'c'
>>> m.expandf("{1[0]} {1[1]} {1[2]}")
'a b c'
>>> m.expandf("{1[-1]} {1[-2]} {1[-3]}")
'c b a'
>>>
>>> m = regex.match(r"(?P<letter>\w)+", "abc")
>>> m.expandf("{letter}")
'c'
>>> m.expandf("{letter[0]} {letter[1]} {letter[2]}")
'a b c'
>>> m.expandf("{letter[-1]} {letter[-2]} {letter[-3]}")
'c b a'
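The same subscripting should carry over to ``subf`` and ``subfn``. A small sketch, assuming the regex module is installed; the sample strings are illustrative.

```python
import regex

# Plain format-style replacement: swap the two captured words.
swapped = regex.subf(r'(\w+) (\w+)', '{2} {1}', 'foo bar')

# Subscripting a repeated group: first and last capture of each word.
ends = regex.subf(r'(\w)+', '{1[0]}-{1[-1]}', 'abc def')

# subfn also returns the number of substitutions made.
ends_n = regex.subfn(r'(\w)+', '{1[0]}-{1[-1]}', 'abc def')

print(swapped)
print(ends)
print(ends_n)
```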
* Added support for referring to a group by number using ``(?P=...)``.
This is in addition to the existing ``\g<...>``.
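A brief sketch of both spellings of a numbered backreference, assuming the regex module is installed; the variable names are illustrative.

```python
import regex

# Both patterns look for a word character followed by itself.
by_p = regex.compile(r'(\w)(?P=1)')
by_g = regex.compile(r'(\w)\g<1>')

p_hit = by_p.search('hello')
g_hit = by_g.search('hello')
p_miss = by_p.search('abc')

print(p_hit.group(), g_hit.group(), p_miss)
```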
* Fixed the handling of locale-sensitive regexes.
The ``LOCALE`` flag is intended for legacy code and has limited support. Using Unicode instead is still recommended.
* Added partial matches (`Hg issue 102 <https://bitbucket.org/mrabarnett/mrab-regex/issues/102>`_)
A partial match is one that matches up to the end of the string, where the string may have been truncated and you want to know whether a complete match would have been possible had it not been truncated.
Partial matches are supported by ``match``, ``search``, ``fullmatch`` and ``finditer`` with the ``partial`` keyword argument.
Match objects have a ``partial`` attribute, which is ``True`` if it's a partial match.
For example, if you wanted a user to enter a 4-digit number and check it character by character as it was being entered:
.. sourcecode:: python
>>> pattern = regex.compile(r'\d{4}')
>>> # Initially, nothing has been entered:
>>> print(pattern.fullmatch('', partial=True))
<regex.Match object; span=(0, 0), match='', partial=True>
>>> # An empty string is OK, but it's only a partial match.
>>> # The user enters a letter:
>>> print(pattern.fullmatch('a', partial=True))
None
>>> # It'll never match.
>>> # The user deletes that and enters a digit:
>>> print(pattern.fullmatch('1', partial=True))
<regex.Match object; span=(0, 1), match='1', partial=True>
>>> # It matches this far, but it's only a partial match.
>>> # The user enters 2 more digits:
>>> print(pattern.fullmatch('123', partial=True))
<regex.Match object; span=(0, 3), match='123', partial=True>
>>> # It matches this far, but it's only a partial match.
>>> # The user enters another digit:
>>> print(pattern.fullmatch('1234', partial=True))
<regex.Match object; span=(0, 4), match='1234'>
>>> # It's a complete match.
>>> # If the user enters another digit:
>>> print(pattern.fullmatch('12345', partial=True))
None
>>> # It's no longer a match.
>>> # This is a partial match:
>>> pattern.match('123', partial=True).partial
True
>>> # This is a complete match:
>>> pattern.match('1233', partial=True).partial
False
* ``*`` operator not working correctly with sub() (`Hg issue 106 <https://bitbucket.org/mrabarnett/mrab-regex/issues/106>`_)
Sometimes it's not clear how zero-width matches should be handled. For example, should ``.*`` match 0 characters directly after matching >0 characters?
Examples:
.. sourcecode:: python
# Python 3.7 and later
>>> regex.sub('.*', 'x', 'test')