RegEx

General

WikidPad supports the use of RegEx or RE (= Regular Expressions) as a way to specify search criteria for wiki-wide or page-specific searches. Different dialects of RE exist; WikidPad uses that of Python's built-in regular expressions. For a description of its syntax see: Python: RE syntax. For a gentler introduction see: Python: RE HowTo. It gives good examples of how to use RE, but you will have to read past a few programming instructions.
A more elaborate description of RE can be found on Wikipedia: "Regular Expression".

Remarks

In some of the descriptions below, extra backslashes had to be added to prevent certain characters/phrases from being interpreted as WikidPad elements. The text is only completely correct as it is seen in preview.

Simple search

The basic principle is simple: to find the word "circumflex", just type it and hit the "search" button. A regex search, however, defaults to a character search (partial word), not a whole word search. Searching for "wiki" would find every word containing the characters "wiki", so also wikiwords, wikipedia, linkswikis, etc. Some of WikidPad's search dialogs, however, have a "Whole word" checkbox with which the search can be switched from character mode to word mode.
WikidPad's regex search is also case insensitive by default. Check the "case sensitive" checkbox if it has to be otherwise.
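
As a rough illustration outside WikidPad, the difference between character search, whole word search and case sensitivity can be reproduced with Python's re module. The sample text is made up, and using re.IGNORECASE to mimic an unchecked "case sensitive" box is an assumption, not a description of WikidPad's internals.

```python
import re

# Hypothetical sample text, not taken from a real wiki page.
text = "Wiki words, Wikipedia and linkswikis all mention a wiki."

# Character (partial word) search, ignoring case.
partial = re.findall(r"wiki", text, re.IGNORECASE)

# Whole word search: \b anchors the pattern at word boundaries.
whole = re.findall(r"\bwiki\b", text, re.IGNORECASE)

# Case sensitive character search: only the lowercase occurrences.
sensitive = re.findall(r"wiki", text)

print(len(partial), whole, len(sensitive))
```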

Metacharacters

Some characters are "metacharacters" and have a special meaning in RE. For instance specifying "regex*" as the text to search for, would not result in what you might think it would. The metacharacters are:

. ^ $ * + ? { } [ ] \ | ( )

To use one of them as a regular character in a search criterion, prefix it with a backslash "\". So in order to find "regex*", specify: "regex\*". Metacharacters can be combined in every conceivable way to specify a "meta search".
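
The effect of escaping can be checked directly in Python; this is only a sketch with an invented sample sentence. The re.escape() helper from the standard library adds the backslashes automatically, which may be handy when a search string is built by a script.

```python
import re

# Hypothetical sample text.
text = "Use regex* with care; regexxx is something else."

# Unescaped: "*" repeats the preceding "x" zero or more times.
unescaped = re.findall(r"regex*", text)

# Escaped: "\*" stands for a literal asterisk.
escaped = re.findall(r"regex\*", text)

# re.escape() puts a backslash in front of every metacharacter.
auto = re.escape("regex*")

print(unescaped, escaped, auto)
```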

Examples

The vertical bar "|" is the "or" operator; to find:

"characters/" ór "/phrases", specify: "characters/|/phrases"

The "\b" combination is the "at word boundary" specifier. To find all words that:

start with "meta", specify: "\bmeta".
end with "flex", specify: "flex\b".
are "RE" and nothing else, specify "\bRE\b".

The "\B" is its opposite: "nót at word boundary".
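
A minimal sketch of the word boundary specifiers in Python. The sample sentence is invented, and the "\w*" parts are added here only so that the whole word is returned instead of just the "meta" or "flex" part.

```python
import re

# Hypothetical sample text.
text = ("A metacharacter such as the circumflex is part of RE, "
        "not of REGEX or metadata-free prose.")

# Words that start with "meta"; \w* picks up the rest of the word.
starts = re.findall(r"\bmeta\w*", text)

# Words that end with "flex"; \w* picks up the start of the word.
ends = re.findall(r"\w*flex\b", text)

# "RE" and nothing else: "REGEX" is not matched.
exact = re.findall(r"\bRE\b", text)

# \B: "flex" only where it does not start a word.
inside = re.findall(r"\Bflex", text)

print(starts, ends, exact, inside)
```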

The circumflex accent "^" specifies: "at start of line". To find all lines that:

start with "start with", specify: "^start with".

The dollar sign "$" specifies: "at end of line". To find all lines that:

end with "lines that:" or with "words that:", specify: "lines that:$|words that:$".
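
In plain Python, "^" and "$" only match at line starts and line ends when the re.MULTILINE flag is set; the sketch below assumes that flag to mimic the line-based behaviour described here. The page text is invented.

```python
import re

# Hypothetical page text consisting of four lines.
page = ("start with a plan\n"
        "To find all lines that:\n"
        "we start with nothing\n"
        "To find all words that:")

# Only the first line starts with "start with".
at_start = re.findall(r"^start with", page, re.MULTILINE)

# Lines two and four end with one of the two phrases.
at_end = re.findall(r"lines that:$|words that:$", page, re.MULTILINE)

# Without re.MULTILINE, "^" matches only at the very start of the text.
without_flag = re.findall(r"^To find", page)

print(at_start, at_end, without_flag)
```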

The asterisk "*", plus sign "+" and question mark "?" are repeat specifiers for the preceding character, where:

* matches zero or more occurrences: "ca*t" will match "ct", "cat", "caat", "caaat", etc.
+ matches one or more occurrences: "ca+t" will match "cat", "caat", "caaat", etc., but not "ct".
? matches zero or one occurrence: "built-?in" will match "builtin" and "built-in", but not "built--in" (or "built in").
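
The three repeat specifiers can be tried out with a small helper; matches() is a hypothetical function written for this sketch, and it uses re.fullmatch so that a candidate only counts when the pattern covers it completely.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches completely."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# "*": zero or more "a" characters.
star = matches(r"ca*t", ["ct", "cat", "caaat", "cot"])

# "+": one or more "a" characters, so "ct" drops out.
plus = matches(r"ca+t", ["ct", "cat", "caaat"])

# "?": the hyphen is optional, but may occur only once.
optional = matches(r"built-?in", ["builtin", "built-in", "built--in", "built in"])

print(star, plus, optional)
```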

Examples:

"^\+* *Find" will find any line starting with "find", whether it is a heading or not.
"^\++ Find" will find any heading starting with "find", independent of the heading level.
"page ?link" will find "pagelink" and "page link".

The period "." is the "match any character" specifier. To find:

"built in" ánd "built-in", specify: "built.in".
"builtin", "built in" ánd "built-in", specify: "built.?in".

A pair of curly brackets "{ }" also forms a repeat qualifier; its format is: "{m,n}", where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. To find:

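headings only of level 4 and 5, specify: "^\+{4,5} " (the "^" and the trailing space keep the pattern from matching part of a longer run of plus signs).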

The pair can also be used with a single integer, like: "{3}"; it then means exactly that number of repetitions.
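
The sketch below shows how the curly bracket qualifier behaves on heading lines in Python. The heading lines are invented and assume the "plus signs, space, title" heading form used in the examples above. It also shows that an unanchored "\+{4,5}" matches inside longer runs of plus signs.

```python
import re

# Hypothetical heading lines of level 3 to 6.
headings = ["+++ Level 3", "++++ Level 4", "+++++ Level 5", "++++++ Level 6"]

# Unanchored: any run containing four or five plus signs matches,
# so the level 6 heading is found as well.
loose = [h for h in headings if re.search(r"\+{4,5}", h)]

# Anchored at line start and followed by a space: only levels 4 and 5.
strict = [h for h in headings if re.search(r"^\+{4,5} ", h)]

# A single number means exactly that many repetitions.
exactly_three = [h for h in headings if re.search(r"^\+{3} ", h)]

print(loose, strict, exactly_three)
```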

A pair of parentheses "( )" forms a "group" specifier, with which a gróup of characters, instead of only one character at a time, can be qualified with a metacharacter.

"text(html)?elements" will find "TextElements" and "TextHtmlElements".

Groups can be nested; i.e. you can use groups within groups.
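
A short Python sketch of optional and nested groups. The re.IGNORECASE flag is an assumption that stands in for a case insensitive search, and the nested pattern "((ab)+c)+" is a made-up example.

```python
import re

# The group "(html)" as a whole is made optional by the "?".
pattern = re.compile(r"text(html)?elements", re.IGNORECASE)
names = ["TextElements", "TextHtmlElements", "TextHtmlHtmlElements"]
found = [n for n in names if pattern.fullmatch(n)]

# Nested groups: one or more "ab" followed by "c", and that whole
# unit repeated one or more times.
nested = re.compile(r"((ab)+c)+")
nested_match = nested.fullmatch("ababcabc")

print(found, nested_match is not None)
```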

A pair of square brackets "[ ]" forms a "character class"; a "choice of characters". Specifying "[1234]" would mean that any one of the digits 1,2,3 or 4 would cause a match. To find:

"Ctrl-T" ánd "Ctrl+T", specify "Ctrl[-+]T".

Ranges of characters can also be specified using a hyphen "-"; so "[1234]" can also be written as "[1-4]".

Metacharacters lose their special nature inside classes; an asterisk inside a class, e.g. "[*_]", represents an asterisk and not a repeat specifier. To find:

any bold or italics formatted "Remarks", specify: "[*_]Remarks"

The circumflex accent "^" has a special meaning as the first character of a class; it then complements the class's character set, turning it into an "anything bút" specification. To find:

any nót formatted "Remarks", specify: "[^*_]Remarks".
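
The character class examples can be verified in Python as follows; the sample strings are invented. Note that a class always consumes one character, so with "[^*_]Remarks" the character in front of "Remarks" is part of the match.

```python
import re

# A choice between "-" and "+".
keys = re.findall(r"Ctrl[-+]T", "Press Ctrl-T or Ctrl+T, not Ctrl T.")

# A range: same as "[1234]".
digits = re.findall(r"[1-4]", "0123456")

sample = "*Remarks* _Remarks_ (Remarks)"

# "*" and "_" are ordinary characters inside the class.
formatted = re.findall(r"[*_]Remarks", sample)

# Leading "^": any character except "*" and "_".
plain = re.findall(r"[^*_]Remarks", sample)

print(keys, digits, formatted, plain)
```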

Greedy qualifiers

*?, +?, ??, {m,n}?
The "*", "+", "?" and "{m,n}" qualifiers are all greedy; they match as much text as possible. This means that matching "\++" against "++++" results in one single match covering all four plus signs. When this behaviour is not desired, it can be prevented by adding a question mark "?" áfter the qualifier. This makes it perform the match in a non-greedy or minimal fashion; as few characters as possible will be matched. So using "\++?" in the match against "++++" would match only a single "+" at a time.
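
A small sketch of the difference between greedy and non-greedy matching, using Python's re module directly on a run of four plus signs.

```python
import re

text = "++++"

# Greedy: "+" takes as many plus signs as it can get.
greedy = re.findall(r"\++", text)

# Non-greedy: "+?" is satisfied with a single plus sign each time.
lazy = re.findall(r"\++?", text)

# The same idea with a bounded repeat of two to four plus signs.
bounded_greedy = re.findall(r"\+{2,4}", text)
bounded_lazy = re.findall(r"\+{2,4}?", text)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```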

Extended notation

(?...)
Is an extension notation and not a group specifier. The first character after the "?" determines what the meaning and further syntax of the construct is. Following are some currently supported extensions.

(?=...)
Matches if ... matches next. This is called a lookahead assertion. For example:

"Page (?=link)" will match "Page " only if it's followed by "link".

(?!...)
Matches if ... doesn't match next. This is a negative lookahead assertion. For example:

"Page (?!link)" will match "Page " only if it's not followed by "link".

(?<=...)
Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a lookbehind assertion. For example:

"(?<=Url )Link" will match "Link" only if it's preceded by "Url ".
"(?<=-)\w+" looks for a word following a hyphen.

(?<!...)
Matches if the current position in the string is not preceded by a match for ... This is a negative lookbehind assertion. For example:

"(?<!Page )link" will match "link" only if it's nót preceded by "Page ".

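
The lookahead and lookbehind assertions described above can be sketched in Python as follows. The sample text is invented. Because an assertion does not consume any text, the positions of the matches are collected to show which occurrence was found.

```python
import re

# Hypothetical sample text.
text = "Page link, Page title, Url Link, Web Link, well-known"

def starts(pattern):
    """Return the start position of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

# Lookahead: "Page " only when "link" follows.
ahead = starts(r"Page (?=link)")

# Negative lookahead: "Page " only when "link" does not follow.
not_ahead = starts(r"Page (?!link)")

# Lookbehind: "Link" only when "Url " comes right before it.
behind = starts(r"(?<=Url )Link")

# Negative lookbehind: "Link" only when "Url " does not come before it.
not_behind = starts(r"(?<!Url )Link")

# A word following a hyphen.
after_hyphen = re.findall(r"(?<=-)\w+", text)

print(ahead, not_ahead, behind, not_behind, after_hyphen)
```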
Special sequences

Regex supports a number of special sequences that start with a backslash "\", like "\b" and "\B" mentioned above. The list below gives the default description; there might be differences in behaviour between ASCII and Unicode and between locales. For detailed specifications see: Python: RE syntax

\A
Matches only at the start of the string.

\b
Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.

\B
Matches the empty string, but only when it is nót at the beginning or end of a word.

\d
Matches any decimal digit; this is equivalent to the set [0-9].

\D
Matches any non-digit character; this is equivalent to the set [^0-9].

\s
Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].

\S
Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].

\w
Matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_].

\W
Matches any non-alphanumeric character; this is equivalent to the set [^a-zA-Z0-9_].

\Z
Matches only at the end of the string.
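
A brief Python sketch of some of these sequences on an invented sample string. The sets listed above describe ASCII text; with Unicode strings in Python 3, "\d", "\s" and "\w" also match non-ASCII digits, whitespace and letters unless the re.ASCII flag is given.

```python
import re

# Hypothetical sample text containing a tab character.
text = "Page 12,\tupdated 2024_01"

digits = re.findall(r"\d+", text)    # runs of decimal digits
words = re.findall(r"\w+", text)     # runs of alphanumerics and "_"
spaces = re.findall(r"\s", text)     # single whitespace characters

# \A and \Z anchor at the very start and end of the string.
at_start = re.findall(r"\APage", text)
at_end = re.findall(r"01\Z", text)

print(digits, words, spaces, at_start, at_end)
```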

Escape sequences

Python regex supports the use of escape sequences, to specify special characters in the search pattern, that can't be typed with the keyboard; these sequences are:

\a	ASCII Bell (BEL)	
\b	ASCII Backspace (BS), only inside a character class like "[\b]"; elsewhere "\b" is the word boundary
\f	ASCII Formfeed (FF)	
\n	ASCII Linefeed (LF)	
\r	ASCII Carriage Return (CR)	
\t	ASCII Horizontal Tab (TAB)	
\v	ASCII Vertical Tab (VT)
\xhh	Character with hex value hh
\\	Backslash (\)

Sequences like BEL, BS and VT won't be of much use in WikidPad, but the others might be in certain cases.

Examples

• To find a string with a tab, like in the next line:

	* Find this

you could use the pattern: "\t\* Find". This will however only work if the whitespace in front of the asterisk really ís a tab, and that is only the case when the setting "tabs to spaces" in the Editor menu was set to "false" when the page text was created.

• To find a string split over two lines, like "line 1 this" in:

this is line 1
this is line 2

you could use the pattern: "line 1\nthis" [1].

Find & replace

The sequences have special use in Find & Replace when it is used in regex mode. They can then be used as (part of) the replace value. For example: "this is line 2" in the previous example could be changed into "that is line 2", using the replace value: "line 1\nthat" [1].
A special use of the sequences can be found in the print dialog, where they can be used as part of the "page separator" string.
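
The replace example can be approximated in Python with re.sub; this is only a sketch of the same idea, not WikidPad's own Find & Replace code. The "\n" in the pattern matches the line feed, and the "\n" in the replace value is turned into a line feed again.

```python
import re

page = "this is line 1\nthis is line 2"

# Search across the line break and put the line break back
# as part of the replace value.
result = re.sub(r"line 1\nthis", r"line 1\nthat", page)

print(result)
```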

[1] WikidPad does not use the CR/LF pair as a new line indicator, but the single LF character.
