Regular Expressions for Beginners: How to Get Started Discovering Sensitive Data
May 29, 2019
Any data discovery and classification solution relies heavily on regular expressions (sometimes called RegExes, REs or RegEx patterns) to identify sensitive data. But what are RegExes, and how can they be used to discover sensitive data? Let's find out.
Regular expressions are a small but highly specialized programming language; they are essentially wildcards on steroids. Using this little language, you specify the rules that define the strings you want to match. For example, you can define a RegEx that will match email addresses, PII, PHI or credit card numbers.
RegEx components
A RegEx can include literals and metacharacters.
Literals
Any single character, except those reserved as metacharacters, is already a regular expression in its own right. For example, www matches www.Netwrix.com but wwz does not. Note that regular expressions are case-sensitive by default, so www will not match WWW or wWw.
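Literal matching and case sensitivity are easy to try out. The following is a minimal sketch that assumes Python's built-in `re` module (the article itself does not prescribe an engine); the `found` helper is an illustrative name, and `re.IGNORECASE` is Python's way of switching case sensitivity off.

```python
import re

def found(pattern: str, text: str, flags: int = 0) -> bool:
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text, flags) is not None

# A literal pattern matches exactly the characters it spells out.
print(found("www", "www.Netwrix.com"))   # True
print(found("wwz", "www.Netwrix.com"))   # False

# Matching is case-sensitive by default.
print(found("www", "WWW.Netwrix.com"))   # False

# Most engines offer a flag to ignore case; in Python it is re.IGNORECASE.
print(found("www", "WWW.Netwrix.com", re.IGNORECASE))  # True
```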
Metacharacters
The following characters are not interpreted as literals but have special meanings:
- . ^ $ * + ? { } [ ] \ | ( )
The following table describes how each of these metacharacters works.
| Type | Metacharacter | Description | Examples |
|---|---|---|---|
| Dot | `.` | Matches any single character. | `net.rix` will match both www.netwrix.com and www.netfrix.com. |
| Character class | `[]` | Matches any one of the characters listed inside the square brackets. | You can list characters individually; for example, `net[wrx]` will match netw, netr and netx but not netz. |
| Anchors | `^` | Matches characters at the beginning of a string. | `^https` will match https://netwrix.com but not www.netwrix.com or http://netwrix.com. |
| | `$` | Matches characters at the end of a string. | `com$` will match www.netwrix.com and telecom but not computer. |
| Iteration / quantifiers | `?` | Matches the preceding element zero or one time, so the match succeeds whether or not the element is present. It is useful for finding optional characters. | `colou?r` will match both color and colour. |
| | `*` | Matches the preceding element zero or more times, rather than zero or one time. It is useful for finding optional runs of characters. | `ne*t` will match nt (zero e characters), net (one e), neeet (three e characters), and so forth. |
| | `+` | Matches the preceding element one or more times. | `ne+t` will match net and neeet but not nt. |
| | `\|` | The choice operator matches either the expression before or the expression after the operator. | `net\|wrix` will match net and wrix. |
| | `{}` | `{x}` matches if the element that precedes it is found exactly x times in a row. | `n{3}` will match nnn, nnnn and nnnd (because each contains n three times in a row), but it will not match nnw. |
| Grouping and capturing | `()` | Defines a subexpression that can be recalled later using shorthand: the first subexpression in parentheses can be recalled by `\1`, the second by `\2`, and so on. | `Gr(a\|e)y` will match Gray or Grey. |
| Escape sequence | `\` | The metacharacter that follows the backslash is treated as a literal. | `www\.netwrix\.com` will match www.netwrix.com but not www,netwrix,com. |
| Special metacharacters | `\s` | Matches any single whitespace character (a space, a tab, a line break or a form feed). | `Netwrix\sAuditor` will match Netwrix and Auditor separated by one space or one tab, but not NetwrixAuditor or the two words separated by five spaces. |
| | `\S` | Matches any non-whitespace character. | `\Snetwrix` will match Xnetwrix and 1netwrix. |
| | `\w` | Matches any word character: a letter, a digit or an underscore. | `\w\w\w` will match net and dfw, and it will also match within Netwrix. |
| | `\W` | Matches any non-word character. | `netwrix\W` will match netwrix! and netwrix?. |
| | `\d` | Matches any decimal digit. | `Netwrix\d\d` will match Netwrix80 and Netwrix90. |
| | `\D` | Matches any non-digit character. | `Netwrix\D` will match Netwrix) and Netwrix-. |
| | `\a` | Matches any single alphabetic character, either uppercase or lowercase, in tools that support this shorthand. In many common engines, including PCRE, Python and .NET, `\a` matches the bell character instead, and `[A-Za-z]` is the portable equivalent. | `net\arix` will match netWrix, netfrix and netarix. |
| | `\b` | Defines a word boundary. | `\brix` will match rix and rixon but not netwrix. |
| | `\B` | Defines a non-word boundary. | `\Brix` will match Netwrix and trix but not rixon. |
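The behavior of these metacharacters can be checked mechanically. The sketch below assumes Python's built-in `re` module, which is a choice made for this illustration; other engines can differ in details. The `matches` helper and the `CASES` list are illustrative names, and because Python treats `\a` as the bell character, the single-letter case is written with `[A-Za-z]`.

```python
import re

# (pattern, sample text, whether a match is expected)
CASES = [
    (r"net.rix", "www.netfrix.com", True),      # dot: any single character
    (r"net[wrx]", "netz", False),               # character class
    (r"^https", "http://netwrix.com", False),   # start-of-string anchor
    (r"com$", "telecom", True),                 # end-of-string anchor
    (r"com$", "computer", False),
    (r"colou?r", "color", True),                # zero or one "u"
    (r"ne*t", "nt", True),                      # zero or more "e"
    (r"ne+t", "nt", False),                     # one or more "e"
    (r"n{3}", "nnw", False),                    # exactly three "n" in a row
    (r"Gr(a|e)y", "Grey", True),                # alternation inside a group
    (r"www\.netwrix\.com", "www,netwrix,com", False),  # escaped dots
    (r"net[A-Za-z]rix", "netWrix", True),       # one letter, portable form
    (r"\brix", "rixon", True),                  # word boundary
    (r"\brix", "netwrix", False),
    (r"\Brix", "Netwrix", True),                # non-word boundary
    (r"\Brix", "rixon", False),
]

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None

for pattern, text, expected in CASES:
    result = matches(pattern, text)
    status = "ok" if result == expected else "MISMATCH"
    print(f"{pattern!r:24} {text!r:22} -> {result} ({status})")
```

Every row should print `ok`; a `MISMATCH` would indicate that the engine in use interprets a metacharacter differently.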
Metacharacter combinations
Now we know almost all the metacharacters and are ready to combine them.
Example: Looking for license plate numbers
Suppose we need to find a license number in the format aaa-nnnn: the first three characters must be alphanumeric and the last four must be numeric. The hyphen can be replaced by another delimiter character or omitted altogether.
The RegEx for this will be:
- \b[0-9A-Z]{3}([^ 0-9A-Z]|\s)?[0-9]{4}\b
Let’s dissect this RegEx:
- \b requires a word boundary, so matching strings cannot be part of a larger string.
- [0-9A-Z]{3} means that the first three characters must be digits or uppercase letters.
- ([^ 0-9A-Z]|\s)? means the next part of the string must be either a single delimiter or nothing at all. The delimiter can be a whitespace character or any character other than a space, a digit or an uppercase letter.
- [0-9]{4} means the next part of the string must be 4 digits.
- \b specifies another word boundary.
This RegEx will match the following license numbers: NT5-6345, GH3 9452, XS83289
However, it will not match these license numbers: ZNT49371, HG3-29347, nt4-9371
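The two lists above can be verified with a short script. This is a sketch that assumes Python's `re` module and assumes the pattern is meant to be read with `\b` and `\s` as escape sequences; `LICENSE_RE`, `should_match` and `should_not_match` are illustrative names.

```python
import re

# The license-number pattern with the backslash escapes written out.
LICENSE_RE = re.compile(r"\b[0-9A-Z]{3}([^ 0-9A-Z]|\s)?[0-9]{4}\b")

should_match = ["NT5-6345", "GH3 9452", "XS83289"]
should_not_match = ["ZNT49371", "HG3-29347", "nt4-9371"]

for text in should_match + should_not_match:
    hit = LICENSE_RE.search(text) is not None
    print(f"{text:10} -> {hit}")
```

ZNT49371 and HG3-29347 fail because of the word boundaries: the candidate would have to be cut out of a longer run of alphanumeric characters. nt4-9371 fails because the pattern only accepts uppercase letters.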
Example: Looking for Social Security numbers
Another good example is the U.S. Social Security number (SSN), which is conventionally written in the form nnn-nn-nnnn.
The simplest RegEx is the following:
- [0-9]{3}-[0-9]{2}-[0-9]{4}
However, this will generate false positives, since not all numbers that have this form are legitimate SSNs. Moreover, it will miss some actual SSNs, including any that are written without the hyphens. To get more accurate results, we should build a more complex one. We know that:
- No digit group can be all zeroes.
- The first block cannot be 666 or 900-999.
- SSNs can be written with whitespace characters instead of hyphens, or without any delimiters at all.
- If the first block starts with a 7, it must be followed by a number between 0 and 6 and then any third digit.
Therefore, the advanced RegEx will look like this:
- \b(?!000|666|9\d{2})([0-8]\d{2}|7[0-6]\d)([-]?|\s{1})(?!00)\d\d\2(?!0000)\d{4}\b
As before, \b at the beginning and end specifies a word boundary. Let’s look more closely at each number block in between.
The first block
- (?!000|666|9\d{2}) is a negative look-ahead that specifies the number must not begin with 000, 666, or 9 followed by any two digits.
- ([0-8]\d{2} specifies that the string has to start with a digit between 0 and 8 and have two more digits (0-9) after it.
- |7[0-6]\d) says that if the string begins with 7, the next digit must be between 0 and 6, followed by any digit. Because the first alternative already accepts every value from 700 to 899, this alternative does not narrow the match as written. The parentheses around the two alternatives form the first capturing group.
- ([-]?|\s{1}) is the second capturing group. It specifies that after the three digits there should be either a hyphen, a whitespace character or nothing at all to mark the end of the first block.
The second block
- (?!00) is another negative look-ahead that specifies there must not be 00 in the second block.
- \d\d specifies that there must be any two digits in the second block.
- \2 is a backreference that matches the same text as the second capturing group, ([-]?|\s{1}), so the second block must end with the same delimiter that ended the first block: a hyphen, a whitespace character or no additional character at all.
The third block
- (?!0000) is another negative look-ahead that specifies there cannot be four zeroes in the third block.
- \d{4} requires any four digits in the third SSN block.
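To see how the simple and the advanced patterns differ in practice, the following sketch runs both against a few sample strings. It assumes Python's `re` module and assumes the advanced pattern is meant to be read with `\b`, `\d` and `\s` as escapes and with the delimiter as the second capturing group, so that `\2` refers to it. `SIMPLE_SSN`, `STRICT_SSN`, `SAMPLES` and `check` are illustrative names, and the sample numbers are made up.

```python
import re

SIMPLE_SSN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

STRICT_SSN = re.compile(
    r"\b(?!000|666|9\d{2})"     # first block must not be 000, 666 or 9xx
    r"([0-8]\d{2}|7[0-6]\d)"    # group 1: the first block
    r"([-]?|\s{1})"             # group 2: hyphen, whitespace or nothing
    r"(?!00)\d\d"               # second block, not 00
    r"\2"                       # same delimiter as after the first block
    r"(?!0000)\d{4}\b"          # third block, not 0000
)

SAMPLES = [
    "513-84-7329",   # conventional form
    "513 84 7329",   # whitespace as delimiter
    "513847329",     # no delimiters
    "000-12-3456",   # first block all zeroes
    "666-12-3456",   # first block 666
    "900-12-3456",   # first block in the 900-999 range
    "513-00-7329",   # second block all zeroes
    "513-84-0000",   # third block all zeroes
    "513-84 7329",   # mixed delimiters
]

def check(text: str) -> tuple:
    """Return (simple pattern matched, strict pattern matched)."""
    return (SIMPLE_SSN.search(text) is not None,
            STRICT_SSN.search(text) is not None)

for text in SAMPLES:
    simple, strict = check(text)
    print(f"{text:12} simple={simple!s:5} strict={strict}")
```

The simple pattern accepts every hyphenated sample, including the invalid ones, and misses the two forms without hyphens. The strict pattern accepts only the first three samples.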
Examples of popular RegExes
| To find | Use this RegEx | Example of match |
|---|---|---|
| Email addresses | `^[\w\.=-]+@[\w\.-]+\.[\w]{2,3}$` | T.Simpson@netwrix.com |
| U.S. Social Security numbers | `\b(?!000\|666\|9\d{2})([0-8]\d{2}\|7[0-6]\d)([-]?\|\s{1})(?!00)\d\d\2(?!0000)\d{4}\b` | 513-84-7329 |
| IPv4 addresses | `^\d{1,3}[.]\d{1,3}[.]\d{1,3}[.]\d{1,3}$` | 192.168.1.1 |
| Dates in MM/DD/YYYY format | `^([1][0-2]\|[0]?[1-9])[\/-]([3][01]\|[12]\d\|[0]?[1-9])[\/-](\d{4}\|\d{2})$` | 05/05/2018 |
| MasterCard numbers | `^(?:5[1-5][0-9]{2}\|222[1-9]\|22[3-9][0-9]\|2[3-6][0-9]{2}\|27[01][0-9]\|2720)[0-9]{12}$` | 5258704108753590 |
| Visa card numbers | `\b([4]\d{3}[\s]\d{4}[\s]\d{4}[\s]\d{4}\|[4]\d{3}[-]\d{4}[-]\d{4}[-` | 4563-7568-5698-4587 |
| American Express card numbers | `^3[47][0-9]{13}$` | 345835478586821 |
| U.S. ZIP codes (the last alternative also matches Canadian postal codes) | `^((\d{5}-\d{4})\|(\d{5})\|([A-Z]\d[A-Z]\s\d[A-Z]\d))$` | 97589 |
| File paths | `\\[^\\]+$` | `\\fs1\shared` |
| URLs | `(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}\|[a-z0-9%])\|www\d{0,3}[.]\|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+\|\(([^\s()<>]+\|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+` | www.netwrix.com |
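Patterns like these are typically applied to candidate values one at a time. The sketch below shows one possible way to do that, assuming Python's `re` module; the `PATTERNS` dictionary and the `classify` function are hypothetical names introduced for this illustration, and only a few of the patterns are included.

```python
import re

# A few of the anchored patterns, keyed by the kind of data they describe.
PATTERNS = {
    "email": re.compile(r"^[\w\.=-]+@[\w\.-]+\.[\w]{2,3}$"),
    "ipv4": re.compile(r"^\d{1,3}[.]\d{1,3}[.]\d{1,3}[.]\d{1,3}$"),
    "mastercard": re.compile(
        r"^(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}"
        r"|27[01][0-9]|2720)[0-9]{12}$"
    ),
    "amex": re.compile(r"^3[47][0-9]{13}$"),
}

def classify(value: str) -> list:
    """Return the names of all patterns that the whole value matches."""
    return [name for name, rx in PATTERNS.items() if rx.search(value)]

for value in [
    "T.Simpson@netwrix.com",
    "192.168.1.1",
    "999.999.999.999",      # not a valid address, but it has the right shape
    "5258704108753590",
    "345835478586821",
    "hello",
]:
    print(f"{value:24} -> {classify(value)}")
```

Note that 999.999.999.999 is reported as an IPv4 address: the pattern checks only the shape of the value, not whether each number is in the range 0-255, so a match is best treated as a candidate that may need further validation.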
Helpful Regex web resources
- https://regexr.com and https://regex101.com help you check your RegExes with syntax highlighting and tooltips.
- https://regexcrossword.com is a crossword puzzle game in which the clues are defined using regular expressions.
- https://www.regular-expressions.info is a great site with information about regular expressions.
- In addition, the Notepad++ tool has a RegEx helper extension that will serve you well while you’re working with regular expressions.
About the author
Jeff Melnick
Director of Systems Engineering
Jeff is a former Director of Global Solutions Engineering at Netwrix. He is a long-time Netwrix blogger, speaker and presenter. In the Netwrix blog, Jeff shares lifehacks, tips and tricks that can dramatically improve your system administration experience.