REGEX is a sequence of characters that forms a search patterns. In data analysis, Regex used for matching patterns, data extraction and replace strings within a large string of data. In Tableau, Regular Expressions are used for powerful text processing tasks. These expressions are flexible. Regular expressions follows certain patterns. Regular expressions abbreviated as REGEX or REGEXP.
Below are some of the Regular expressions meta characters classes:
\b : Match if the current position is a word boundary. Boundaries occur at the transitions between word (\w) and non-word (\W) characters, with combining marks ignored.
\B : Match if the current position is not a word boundary.
\d : Match any character with the Unicode General Category of Nd (Number, Decimal Digit.)
\D : Match any character that is not a decimal digit.
\w : Match a word character.
\W : Match a non-word character.
. : Match any character.
^ : Match at the beginning of a line.
$ : Match at the end of a line.
\ : Quotes the following character. Characters that must be quoted to be treated as literals are * ? + [ ( ) { } ^ $ | \ .
Here are some regular expression Operations & Quantifiers:
| : Alternation. A|B matches either A or B.
* : Match 0 or more times. Match as many times as possible.
+ : Match 1 or more times. Match as many times as possible.
? : Match zero or one times. Prefer one.
*? : Match 0 or more times. Match as few times as possible.
+? : Match 1 or more times. Match as few times as possible.
?? : Match zero or one times. Prefer zero.
{n} : Match exactly n times.
{n}? : Match exactly n times.
{n,} : Match at least n times. Match as many times as possible.
{n,}? : Match at least n times, but no more than required for an overall pattern match.
{n,m} : Match between n and m times. Match as many times as possible, but not more than m.
{n,m}? : Match between n and m times. Match as few times as possible, but not less than n
Here are some Set expressions(character classes):
[abc] : Match any of the characters a, b or c.
[^abc] : Negation - match any character except a, b, or c.
[A-M] : Range - match any character from A to M.
Tableau has four different regular expression functions which are match, extract or replace strings within a larger string.
REGEXP_MATCH(string, pattern)
REGEXP_EXTRACT(string, pattern)
REGEXP_EXTRACT_NTH(string, pattern, index)
REGEXP_REPLACE(string, pattern ,replacement)
We are going to use below data set ( Customer Order Data ) to learn more about how these functions can be used in Tableau.
REGEXP_MATCH ( string, pattern ) :
This function is used to validate whether the given input matches the required structure of the data field or not. It returns 'TRUE' , if the substring of the provided string matches the regular expression pattern, otherwise returns 'FALSE'.
e.g., In the "customer order data" data set, we want to check whether the order-IDs follows a particular pattern ([A-z]-digits,ex.,B-2897) or not.
For this we need to create a calculated field named "Order-IDMatch" and added a formula as below.
Order-IDMatch: REGEXP_MATCH([order-ID],'^[A-Z]-\d+$')
1. ^ : Asserts position at the start of the string
2. [A-z] : Matches any single upper case letter
3. - : Matches hyphen character
4. \d+ : Matches one or more digits
5. $ : Asserts position at the end of the string
If I put the order-ID and the Order-IDMatch in the column, I got 'True' for fields which are matching and I got 'False' for the fields, which are not matching with the given pattern.
REGEXP_EXTRACT (string, pattern ):
The primary use of this function is to extract a particular data from a string.
e.g., In the "customer order data" data set, we want to extract the price from "Prodname-Price" column.
For this I created a calculated field named 'PriceExtract' and added formula as below.
PriceExtract: REGEXP_EXTRACT([Prodname-Price],'[a-z]+\$+(\d+)')
1. [a-z]+ : Matches one or more lower case letters
2. \d+ : Matches one or more digits
The function returns below result.
REGEXP_EXTRACT_NTH (string, pattern, index ):
This is mainly used to look for a particular pattern within a string or substring, starting at the nth position in string and extract the data elements. We need to pass the index along with string and pattern to this function.
In the "customer order data" data set, we want to extract the price of each product along with '$' symbol. For this, I created a calculated field named "Price" and added below formula.
Price: REGEXP_EXTRACT_NTH([Prodname-Price],'(\$[0-9]+)',1)
1.\$ : Matches the $ sign('$')
2.[0-9]+ : Matches one or more digits
3.(\$[0-9]+) : Captures the price part
4. 1 : Specifies that the first capturing group to be extracted
The result is produced as below:
REGEXP_REPLACE( string, pattern, replacement ):
Using this we can replaces a particular matched pattern with the specified replacement text. Interestingly, Tableau regular expressions can also be nested. We can use REGEXP_REPLACE() with another REPLACE() function.
In the "customer order data" data set, to replace 'Prodname-Price' with 'Product-Description', I created a calculated field 'Product-Description ' and added below formula.
Product-Description: REGEXP_REPLACE([Prodname-Price],'\s?\$[0-9]+','')
\s? : Matches the optional white space
\$ : Matches the dollar sign
[0-9]+ : Matches one or more digits
When I applied the calculated field, I got the output as below.
Conclusion:
Regular expressions are an extremely powerful tool for parsing out information from text. Using regular expressions we can find and return phone numbers, emails, websites and much more. Regular expressions are effective for cleaning and transforming data. We can use them to remove unwanted character, replace pattern, reformat strings. REGEX provides a consistent way to handle text processing task across different programming languages and tools. A single regex pattern can replace several lines of code or multiple steps in data preparation, making the process much easier to manage. By using regex into tableau workflows, we can enhance our data analysis and visualization capabilities.
Comments