Regular Expressions Mastery Across Languages

Introduction

Regular expressions (regex) provide powerful pattern matching for text processing. This guide covers regex syntax (character classes, quantifiers, anchors, capturing and non-capturing groups, lookaheads, and lookbehinds) with practical examples for validation, extraction, and replacement in JavaScript, Python, and C#.

Basic Patterns

Literal Characters and Metacharacters

Simple matching:

// JavaScript
const text = "Hello World";

// Literal match
/Hello/.test(text);  // true
/hello/.test(text);  // false (case-sensitive by default)

// Case-insensitive flag
/hello/i.test(text);  // true

// Match any single character (.)
/H.llo/.test("Hello");   // true
/H.llo/.test("Hallo");   // true
/H.llo/.test("H123lo");  // false (. matches one char)

// Escape metacharacters
/example\.com/.test("example.com");  // true
/\$19\.99/.test("$19.99");           // true

// Metacharacters requiring escape: . ^ $ * + ? { } [ ] \ | ( )
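
When the text to match comes from user input or a variable, escaping each metacharacter by hand is error-prone. The sketch below shows one way to do it in Python with re.escape; the helper name contains_literal is made up for this example.

```python
import re

def contains_literal(needle: str, haystack: str) -> bool:
    """Return True if `needle` occurs in `haystack` as literal text.

    `contains_literal` is a hypothetical helper for this example.
    """
    # re.escape() backslash-escapes the regex metacharacters in `needle`.
    return re.search(re.escape(needle), haystack) is not None

print(contains_literal("$19.99", "Total: $19.99"))  # True
print(contains_literal("$19.99", "Total: $19x99"))  # False ('.' no longer matches any character)

# Without escaping, '$' is an end-of-string anchor, so nothing can match.
print(re.search(r"$19.99", "Total: $19.99"))        # None
```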

Character Classes

Predefined classes:

# Python
import re

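# \d = digit ([0-9]; on str patterns also other Unicode digits unless re.ASCII is set)
re.search(r'\d+', 'Order 12345')  # Matches '12345'

# \w = word character ([a-zA-Z0-9_]; on str patterns also Unicode letters and digits)
re.search(r'\w+', 'hello_world')  # Matches 'hello_world'

# \s = whitespace ([ \t\n\r\f\v]; on str patterns also other Unicode whitespace)
re.search(r'\s+', 'hello   world')  # Matches '   '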

# Negated classes:
# \D = non-digit [^0-9]
# \W = non-word character [^a-zA-Z0-9_]
# \S = non-whitespace

# Custom character class
re.search(r'[aeiou]', 'hello')  # Matches 'e' (first vowel)
re.search(r'[0-9]', 'abc123')   # Matches '1'
re.search(r'[^0-9]', '123abc')  # Matches 'a' (first non-digit)

# Ranges
re.search(r'[a-z]+', 'Hello')       # Matches 'ello'
re.search(r'[A-Z]+', 'Hello')       # Matches 'H'
re.search(r'[a-zA-Z]+', 'Hello123') # Matches 'Hello'
re.search(r'[0-9a-fA-F]+', 'FF00AA')  # Matches 'FF00AA' (hex)

C# examples:

using System.Text.RegularExpressions;

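// Character class matching (IsMatch succeeds if any substring matches)
Regex.IsMatch("Hello123", @"[a-z]+");      // true (matches "ello")
Regex.IsMatch("Hello123", @"^[a-z]+$");    // false (whole string must be lowercase letters)
Regex.IsMatch("Hello123", @"[a-zA-Z]+");   // true
Regex.IsMatch("user@example.com", @"[\w@.]+");  // true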

// Extract digits
var match = Regex.Match("Price: $199.99", @"\d+\.\d+");
Console.WriteLine(match.Value);  // "199.99"
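
Two details of character classes are easy to miss: a class matches if any substring fits, and in Python 3 the shorthand \d is Unicode-aware on str patterns. The following is a small sketch of both points, using re.fullmatch for whole-string checks and re.ASCII to restrict \d to 0-9; the sample strings are illustrative only.

```python
import re

text = "Hello123"

# search() succeeds if the class matches anywhere in the string.
partial = re.search(r"[a-z]+", text)
print(partial.group())                                # ello

# fullmatch() requires the whole string to consist of the class.
print(re.fullmatch(r"[a-z]+", text) is not None)      # False
print(re.fullmatch(r"[a-zA-Z0-9]+", text) is not None)  # True

# On str patterns, \d also matches non-ASCII decimal digits.
arabic_digits = "\u0661\u0662\u0663"  # Arabic-Indic digits 1, 2, 3
print(re.fullmatch(r"\d+", arabic_digits) is not None)            # True
print(re.fullmatch(r"\d+", arabic_digits, re.ASCII) is not None)  # False
```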

Quantifiers

Repetition Patterns

Basic quantifiers:

// JavaScript
const patterns = {
    '*': 'Zero or more',
    '+': 'One or more',
    '?': 'Zero or one (optional)',
    '{n}': 'Exactly n times',
    '{n,}': 'At least n times',
    '{n,m}': 'Between n and m times'
};

// Examples
/\d+/.test('123');       // true - one or more digits
/\d*/.test('');          // true - zero or more digits
/colou?r/.test('color'); // true - 'u' is optional
/colou?r/.test('colour');// true

// Specific counts
/\d{4}/.test('2025');         // true - exactly 4 digits
/\d{2,4}/.test('99');         // true - 2 to 4 digits
/\d{2,4}/.test('12345');      // true - matches first 4
/\w{3,}/.test('hello');       // true - at least 3 word chars

// Phone number pattern
/\d{3}-\d{3}-\d{4}/.test('555-123-4567');  // true

Greedy vs lazy (non-greedy):

# Python
import re

text = "<div>Content</div><div>More</div>"

# Greedy (default) - matches as much as possible
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group())  # '<div>Content</div><div>More</div>'

# Lazy (non-greedy) - matches as little as possible
lazy = re.search(r'<div>.*?</div>', text)
print(lazy.group())  # '<div>Content</div>'

# Password validation (8-20 chars)
pattern = r'^.{8,20}$'
re.match(pattern, 'password123')  # Valid
re.match(pattern, 'short')        # None (too short)
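
A lazy quantifier is not the only way to stop a match early. As a rough comparison, the sketch below runs a greedy pattern, a lazy pattern, and a negated character class over the same input; the sample text is invented for illustration.

```python
import re

text = '<b>one</b> and <b>two</b>'

# Greedy: .* runs to the last </b> in the string.
greedy = re.findall(r'<b>.*</b>', text)

# Lazy: .*? stops at the first </b> it can reach.
lazy = re.findall(r'<b>.*?</b>', text)

# Negated class: [^<]* cannot cross a '<', so it never overshoots.
negated = re.findall(r'<b>[^<]*</b>', text)

print(greedy)   # ['<b>one</b> and <b>two</b>']
print(lazy)     # ['<b>one</b>', '<b>two</b>']
print(negated)  # ['<b>one</b>', '<b>two</b>']
```

The negated class states the intent directly (anything except the delimiter), which usually means less backtracking than the lazy form.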

Anchors and Boundaries

Position Matching

Start and end anchors:

// JavaScript
// ^ = start of string
// $ = end of string

/^Hello/.test('Hello World');   // true
/^Hello/.test('Say Hello');     // false

/World$/.test('Hello World');   // true
/World$/.test('World is big');  // false

// Exact match (start + end)
/^Hello World$/.test('Hello World');      // true
/^Hello World$/.test('Hello World!');     // false
/^Hello World$/.test('Say Hello World');  // false

// Validate format exactly
const emailPattern = /^[\w.-]+@[\w.-]+\.\w{2,}$/;
emailPattern.test('user@example.com');  // true
emailPattern.test('invalid email');     // false

Word boundaries:

# Python
import re

# \b = word boundary (between \w and \W, or at the start/end of the string next to \w)
# \B = non-word boundary

text = "The cat in the cathedral"

# Match whole word 'cat'
re.search(r'\bcat\b', text)  # Matches 'cat' (standalone)
re.search(r'\bcat\b', 'cathedral')  # None (part of word)

# Find all whole words
words = re.findall(r'\b\w+\b', "Hello, world! How are you?")
print(words)  # ['Hello', 'world', 'How', 'are', 'you']

# Replace whole word only
result = re.sub(r'\bcat\b', 'dog', text)
print(result)  # "The dog in the cathedral"
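
One subtlety with anchors in Python: $ also matches just before a trailing newline, which matters when validating input read from files or forms. The sketch below compares $, \Z, and re.fullmatch on the same value; the value itself is just an example.

```python
import re

value = "12345\n"  # e.g. a line read from a file, newline still attached

dollar = re.match(r'^\d+$', value)      # $ also matches just before a final newline
strict_z = re.match(r'^\d+\Z', value)   # \Z matches only at the very end
full = re.fullmatch(r'\d+', value)      # the whole string must match

print(dollar is not None)    # True
print(strict_z is not None)  # False
print(full is not None)      # False

# Stripping the newline first makes all three agree.
print(re.fullmatch(r'\d+', value.strip()) is not None)  # True
```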

Groups and Capturing

Parentheses for Grouping

Capturing groups:

// JavaScript
// ( ) = capturing group

const text = "John Doe (555-1234)";
const pattern = /(\w+) (\w+) \((\d{3}-\d{4})\)/;
const match = text.match(pattern);

console.log(match[0]);  // "John Doe (555-1234)" - full match
console.log(match[1]);  // "John" - first capture group
console.log(match[2]);  // "Doe" - second capture group
console.log(match[3]);  // "555-1234" - third capture group

// Named capturing groups (ES2018)
const namedPattern = /(?<firstName>\w+) (?<lastName>\w+) \((?<phone>[\d-]+)\)/;
const namedMatch = text.match(namedPattern);

console.log(namedMatch.groups.firstName);  // "John"
console.log(namedMatch.groups.lastName);   // "Doe"
console.log(namedMatch.groups.phone);      // "555-1234"

Python named groups:

# Python
import re

# Named groups with ?P<name>
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, 'Date: 2025-05-15')

print(match.group('year'))   # '2025'
print(match.group('month'))  # '05'
print(match.group('day'))    # '15'

# Access as dictionary
print(match.groupdict())
# {'year': '2025', 'month': '05', 'day': '15'}

# Extract email components
email_pattern = r'(?P<user>[\w.-]+)@(?P<domain>[\w.-]+)\.(?P<tld>\w+)'
email_match = re.search(email_pattern, 'user@example.com')

print(email_match.group('user'))    # 'user'
print(email_match.group('domain'))  # 'example'
print(email_match.group('tld'))     # 'com'

C# named groups:

// C#
using System.Text.RegularExpressions;

var pattern = @"(?<area>\d{3})-(?<exchange>\d{3})-(?<number>\d{4})";
var match = Regex.Match("555-123-4567", pattern);

if (match.Success)
{
    Console.WriteLine(match.Groups["area"].Value);      // "555"
    Console.WriteLine(match.Groups["exchange"].Value);  // "123"
    Console.WriteLine(match.Groups["number"].Value);    // "4567"
}

Non-capturing groups:

// (?: ) = non-capturing group (for grouping without capturing)

// Without non-capturing group
const withCapture = /(\d{3})-(\d{3})-(\d{4})/.exec('555-123-4567');
console.log(withCapture);  // ['555-123-4567', '555', '123', '4567']

// With non-capturing group
const withoutCapture = /(?:\d{3})-(\d{3})-(\d{4})/.exec('555-123-4567');
console.log(withoutCapture);  // ['555-123-4567', '123', '4567']

// Useful for alternation
/(https?|ftp):\/\//.test('https://example.com');  // true
/(?:https?|ftp):\/\//.test('ftp://files.com');    // true
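
In Python, the choice between capturing and non-capturing groups also changes what re.findall returns. A brief sketch, with a made-up input string:

```python
import re

text = "Call 555-1234 or 555-9876"

# No groups: findall returns the full matches.
full = re.findall(r'\d{3}-\d{4}', text)

# Two capturing groups: findall returns one tuple per match.
pairs = re.findall(r'(\d{3})-(\d{4})', text)

# One capturing group (the other is non-capturing): findall returns
# only the captured part of each match.
last_four = re.findall(r'(?:\d{3})-(\d{4})', text)

print(full)       # ['555-1234', '555-9876']
print(pairs)      # [('555', '1234'), ('555', '9876')]
print(last_four)  # ['1234', '9876']
```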

Lookaheads and Lookbehinds

Zero-Width Assertions

Positive lookahead (?=):

// JavaScript
// (?= ) = positive lookahead (match if followed by pattern)

// Password must contain digit
/^(?=.*\d).{8,}$/.test('password123');  // true
/^(?=.*\d).{8,}$/.test('password');     // false

// Password must contain uppercase AND lowercase AND digit
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/.test('Pass1234');  // true
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/.test('password1'); // false

// Extract word before comma
/\w+(?=,)/.exec('apple,banana,orange');  // ['apple']

Negative lookahead (?!):

# Python
import re

# (?! ) = negative lookahead (match if NOT followed by pattern)

# Find 'q' not followed by 'u'
pattern = r'q(?!u)'
re.findall(pattern, 'Iraq Qatar queue')  # ['q'] (only in Iraq)

# Username: 3-16 letters, digits, or underscores, but cannot start with a digit
username_pattern = r'^(?!\d)[a-zA-Z0-9_]{3,16}$'
re.match(username_pattern, 'user123')   # Valid
re.match(username_pattern, '123user')   # None (starts with digit)

Positive lookbehind (?<=):

# Python
# (?<= ) = positive lookbehind (match if preceded by pattern)

# Find price (digits after $)
pattern = r'(?<=\$)\d+(?:\.\d{2})?'
re.findall(pattern, 'Items: $19.99, $5, $150.00')
# ['19.99', '5', '150.00']

# Extract @mentions (alphanumeric after @)
mentions_pattern = r'(?<=@)\w+'
text = "Hello @alice and @bob_123!"
re.findall(mentions_pattern, text)  # ['alice', 'bob_123']

Negative lookbehind (?<!):

// C#
using System.Text.RegularExpressions;

// (?<! ) = negative lookbehind (match if NOT preceded by pattern)

// Find numbers not preceded by $
// \d in the lookbehind keeps a match from starting in the middle of a number
var pattern = @"(?<![$\d])\d+";
var matches = Regex.Matches("Price: $100 and 50 items", pattern);
// "$100" is skipped entirely, "50" is matched

foreach (Match match in matches)
{
    Console.WriteLine(match.Value);  // "50"
}
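
A negative lookbehind only inspects the characters immediately before the position being tried, so a pattern like (?<!\$)\d+ can still match the tail of a number that follows a dollar sign. The Python sketch below shows the difference that adding \d to the lookbehind makes, on the same sample sentence.

```python
import re

text = "Price: $100 and 50 items"

# The '1' is rejected (preceded by '$'), but the engine then retries at the
# next position, where '0' is preceded by '1', and matches '00'.
naive = re.findall(r'(?<!\$)\d+', text)

# Also rejecting a preceding digit prevents matches from starting mid-number.
# Inside a character class, '$' is a literal and needs no escape.
strict = re.findall(r'(?<![$\d])\d+', text)

print(naive)   # ['00', '50']
print(strict)  # ['50']
```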

Practical Examples

Email Validation

Basic email pattern:

// JavaScript
const emailPattern = /^[\w.-]+@[\w.-]+\.\w{2,}$/;

// Valid emails
emailPattern.test('user@example.com');      // true
emailPattern.test('john.doe@company.org');  // true
emailPattern.test('test_123@sub.domain.co.uk');  // true

// Invalid emails
emailPattern.test('invalid');               // false
emailPattern.test('@example.com');          // false
emailPattern.test('user@');                 // false

// More comprehensive email validation
const strictEmail = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;
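
The basic pattern is intentionally permissive. As a rough illustration, here is the same pattern in Python wrapped in a helper (is_plausible_email and EMAIL_RE are names invented for this sketch), including one input it accepts even though it is not a valid address.

```python
import re

# Same shape as the basic pattern above; fullmatch() supplies the anchoring.
EMAIL_RE = re.compile(r'[\w.-]+@[\w.-]+\.\w{2,}')

def is_plausible_email(value: str) -> bool:
    """Cheap syntactic check; hypothetical helper, not a full validator."""
    return EMAIL_RE.fullmatch(value) is not None

print(is_plausible_email('user@example.com'))    # True
print(is_plausible_email('user@'))               # False
print(is_plausible_email('user@example.com\n'))  # False (trailing newline rejected)

# Limitation: consecutive dots in the domain still pass.
print(is_plausible_email('user@example..com'))   # True
```

A pattern like this is best treated as a first filter; sending a confirmation message is the only reliable way to check that an address works.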

Phone Number Formats

Multiple formats:

# Python
import re

def validate_phone(phone):
    """Validate US phone number in various formats."""
    patterns = [
        r'^\d{3}-\d{3}-\d{4}$',           # 555-123-4567
        r'^\(\d{3}\) \d{3}-\d{4}$',       # (555) 123-4567
        r'^\d{10}$',                       # 5551234567
        r'^\+1-\d{3}-\d{3}-\d{4}$',       # +1-555-123-4567
    ]
    
    return any(re.match(pattern, phone) for pattern in patterns)

# Test
print(validate_phone('555-123-4567'))   # True
print(validate_phone('(555) 123-4567')) # True
print(validate_phone('5551234567'))     # True
print(validate_phone('invalid'))        # False

# Extract and normalize phone numbers
def extract_phone(text):
    """Extract phone number and normalize to XXX-XXX-XXXX format."""
    pattern = r'(?:\+1[-.]?)?\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})'
    match = re.search(pattern, text)
    if match:
        return f'{match.group(1)}-{match.group(2)}-{match.group(3)}'
    return None

print(extract_phone('Call me at (555) 123-4567'))  # '555-123-4567'
print(extract_phone('Phone: 555.123.4567'))        # '555-123-4567'
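
extract_phone returns only the first number it finds. A possible variation, sketched below with re.finditer, collects every number in a text; extract_all_phones and PHONE_RE are hypothetical names, and the pattern is the same one used above.

```python
import re

# Same pattern as extract_phone, compiled once for reuse.
PHONE_RE = re.compile(r'(?:\+1[-.]?)?\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})')

def extract_all_phones(text):
    """Return every phone number in `text`, normalized to XXX-XXX-XXXX."""
    # m.groups() is the (area, exchange, number) triple of each match.
    return ['-'.join(m.groups()) for m in PHONE_RE.finditer(text)]

print(extract_all_phones('Office: (555) 123-4567, cell 555.987.6543'))
# ['555-123-4567', '555-987-6543']
print(extract_all_phones('no numbers here'))  # []
```

Because the pattern is unanchored, it can also match inside longer digit runs; adding boundaries such as (?<!\d) and (?!\d) is one way to tighten it.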

URL Parsing

Extract URL components:

// JavaScript
const urlPattern = /^(https?):\/\/([^:\/\s]+)(?::(\d+))?(\/[^\s]*)?$/;

const url = 'https://example.com:8080/path/to/page?query=value';
const match = url.match(urlPattern);

if (match) {
    console.log('Protocol:', match[1]);  // 'https'
    console.log('Domain:', match[2]);    // 'example.com'
    console.log('Port:', match[3]);      // '8080'
    console.log('Path:', match[4]);      // '/path/to/page?query=value' (path plus query string)
}

// Extract all URLs from text
const text = "Visit https://example.com or http://test.org for more info";
const urls = text.match(/https?:\/\/[^\s]+/g);
console.log(urls);  // ['https://example.com', 'http://test.org']
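
The same URL pattern can be written in Python with named groups, which makes optional parts easier to handle because unmatched groups come back as None. This is a sketch only: split_url and URL_RE are illustrative names, and for production parsing the standard library's urllib.parse is usually the safer choice.

```python
import re

# Port of the JavaScript pattern above, split across lines for readability.
URL_RE = re.compile(
    r'^(?P<scheme>https?)://'   # http or https
    r'(?P<host>[^:/\s]+)'       # host: anything up to ':' or '/'
    r'(?::(?P<port>\d+))?'      # optional :port
    r'(?P<path>/\S*)?$'         # optional path (query string included)
)

def split_url(url):
    """Return a dict of URL parts, or None if the URL does not match."""
    m = URL_RE.match(url)
    return m.groupdict() if m else None

print(split_url('https://example.com:8080/path/to/page?query=value'))
# {'scheme': 'https', 'host': 'example.com', 'port': '8080', 'path': '/path/to/page?query=value'}
print(split_url('http://test.org'))
# {'scheme': 'http', 'host': 'test.org', 'port': None, 'path': None}
```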

Data Extraction

Parse log files:

# Python
import re
from datetime import datetime

log_pattern = r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.*)'

log_lines = [
    '2025-05-05 14:30:00 [INFO] Application started',
    '2025-05-05 14:30:15 [ERROR] Database connection failed',
    '2025-05-05 14:30:20 [WARN] Retrying connection',
]

for line in log_lines:
    match = re.match(log_pattern, line)
    if match:
        timestamp = datetime.strptime(match.group('timestamp'), '%Y-%m-%d %H:%M:%S')
        level = match.group('level')
        message = match.group('message')
        print(f'{level}: {message} at {timestamp}')
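
Once each line is parsed into named groups, aggregation is straightforward. The sketch below counts entries per level and skips lines that do not fit the format; count_levels and the sample lines are assumptions made for this example.

```python
import re
from collections import Counter

LOG_RE = re.compile(
    r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) '
    r'\[(?P<level>\w+)\] (?P<message>.*)'
)

def count_levels(lines):
    """Count log entries per level, ignoring lines that do not parse."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue  # e.g. stack-trace continuation lines
        counts[m.group('level')] += 1
    return counts

sample = [
    '2025-05-05 14:30:00 [INFO] Application started',
    '2025-05-05 14:30:15 [ERROR] Database connection failed',
    '    at db.connect (line 42)',
    '2025-05-05 14:30:20 [ERROR] Retry failed',
]
print(sorted(count_levels(sample).items()))  # [('ERROR', 2), ('INFO', 1)]
```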

Extract data from HTML:

// C#
using System.Text.RegularExpressions;

// Extract all links from HTML
var html = @"
    <a href='/home'>Home</a>
    <a href='https://example.com'>Example</a>
    <a href='/contact'>Contact</a>
";

var linkPattern = @"<a\s+href=['""]([^'""]+)['""]>([^<]+)</a>";
var matches = Regex.Matches(html, linkPattern);

foreach (Match match in matches)
{
    var url = match.Groups[1].Value;
    var text = match.Groups[2].Value;
    Console.WriteLine($"{text}: {url}");
}
// Output:
// Home: /home
// Example: https://example.com
// Contact: /contact

String Replacement

Find and replace:

// JavaScript
// Simple replacement
'hello world'.replace(/world/, 'JavaScript');  // 'hello JavaScript'

// Global replacement (all occurrences)
'foo bar foo'.replace(/foo/g, 'baz');  // 'baz bar baz'

// Case-insensitive replacement
'Hello WORLD'.replace(/world/gi, 'JavaScript');  // 'Hello JavaScript'

// Replacement with capturing groups
const date = '2025-05-15';
const formatted = date.replace(/(\d{4})-(\d{2})-(\d{2})/, '$2/$3/$1');
console.log(formatted);  // '05/15/2025'

// Replacement with function
const text = 'Total: $100, Tax: $8, Shipping: $5';
const doubled = text.replace(/\$(\d+)/g, (match, amount) => {
    return '$' + (parseInt(amount, 10) * 2);
});
console.log(doubled);  // 'Total: $200, Tax: $16, Shipping: $10'

Python substitution:

# Python
import re

# Simple substitution
re.sub(r'apple', 'orange', 'I like apple pie')  # 'I like orange pie'

# Using captured groups
text = 'Name: John Doe, Age: 30'
result = re.sub(r'Name: (\w+) (\w+)', r'Name: \2, \1', text)
print(result)  # 'Name: Doe, John, Age: 30'

# Substitution with function
def uppercase_match(match):
    return match.group().upper()

text = 'hello world from python'
result = re.sub(r'\b\w+\b', uppercase_match, text)
print(result)  # 'HELLO WORLD FROM PYTHON'

# Remove HTML tags
html = '<p>Hello <b>world</b>!</p>'
clean = re.sub(r'<[^>]+>', '', html)
print(clean)  # 'Hello world!'
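
Python's re.sub supports the same techniques as the JavaScript replacement examples. The following sketch shows named-group references with \g<name>, a function (here a lambda) as the replacement, and the count argument for limiting replacements; the sample strings mirror the JavaScript ones.

```python
import re

# Reorder a date using named groups in the replacement string.
date_pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
formatted = re.sub(date_pattern, r'\g<month>/\g<day>/\g<year>', '2025-05-15')
print(formatted)  # 05/15/2025

# Compute each replacement from the match.
prices = 'Total: $100, Tax: $8, Shipping: $5'
doubled = re.sub(r'\$(\d+)', lambda m: f'${int(m.group(1)) * 2}', prices)
print(doubled)  # Total: $200, Tax: $16, Shipping: $10

# count=1 replaces only the first occurrence (re.sub replaces all by default).
first_only = re.sub(r'foo', 'baz', 'foo bar foo', count=1)
print(first_only)  # baz bar foo
```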

Language-Specific Features

JavaScript Flags

// i = case-insensitive
/hello/i.test('HELLO');  // true

// g = global (find all matches)
'foo bar foo'.match(/foo/g);  // ['foo', 'foo']

// m = multiline (^ and $ match line boundaries)
const text = 'Line 1\nLine 2';
text.match(/^Line/gm);  // ['Line', 'Line']

// s = dotAll (. matches newlines)
/hello.world/s.test('hello\nworld');  // true

// u = unicode
/\u{1F600}/u.test('😀');  // true

// y = sticky (matches at exact position)
const pattern = /foo/y;
pattern.lastIndex = 4;
pattern.test('foo foo');  // true (matches at position 4)

Python re Module

import re

# Compile pattern for reuse
pattern = re.compile(r'\d+')
pattern.findall('123 abc 456')  # ['123', '456']

# Verbose mode (comments and whitespace ignored)
email_pattern = re.compile(r'''
    [\w.-]+    # username
    @          # at symbol
    [\w.-]+    # domain
    \.         # dot
    \w{2,}     # TLD
''', re.VERBOSE)

# Module-level functions, shown on a sample string
text = 'a1b22c333'
re.search(r'\d+', text)    # First match anywhere: '1'
re.match(r'\d+', text)     # Match only at the start: None here
re.findall(r'\d+', text)   # All matches as a list: ['1', '22', '333']
re.finditer(r'\d+', text)  # All matches as an iterator of match objects
re.sub(r'\d+', '#', text)  # Replace: 'a#b#c#'
re.split(r'\d+', text)     # Split by pattern: ['a', 'b', 'c', '']
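
Python expresses the JavaScript i, m, and s flags as re.IGNORECASE, re.MULTILINE, and re.DOTALL. A short sketch of each, plus combining flags with | and setting one inline:

```python
import re

# i -> re.IGNORECASE
ignore_case = re.search(r'hello', 'HELLO', re.IGNORECASE)

# m -> re.MULTILINE (^ and $ match at line boundaries)
line_starts = re.findall(r'^Line', 'Line 1\nLine 2', re.MULTILINE)

# s -> re.DOTALL (. also matches newlines)
dot_all = re.search(r'hello.world', 'hello\nworld', re.DOTALL)

# Flags combine with |
combined = re.findall(r'^line', 'Line 1\nLINE 2', re.IGNORECASE | re.MULTILINE)

# A flag can also be set inside the pattern itself.
inline = re.search(r'(?i)hello', 'HELLO')

print(ignore_case is not None)  # True
print(line_starts)              # ['Line', 'Line']
print(dot_all is not None)      # True
print(combined)                 # ['Line', 'LINE']
print(inline is not None)       # True
```

There is no g flag in Python: re.search finds the first match, while re.findall and re.finditer find all of them.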

C# Regex Options

using System.Text.RegularExpressions;

// RegexOptions enumeration
var pattern = @"hello";

// Case-insensitive
Regex.IsMatch("HELLO", pattern, RegexOptions.IgnoreCase);

// Multiline
var text = "Line 1\nLine 2";
Regex.Matches(text, @"^Line", RegexOptions.Multiline);

// Compiled (faster for repeated use)
var compiled = new Regex(@"\d+", RegexOptions.Compiled);

// Timeout (aborts runaway backtracking by throwing RegexMatchTimeoutException)
var regex = new Regex(@"a+b+c+", RegexOptions.None, TimeSpan.FromSeconds(1));

Best Practices

  1. Start Simple: Begin with basic patterns, add complexity gradually
  2. Test Thoroughly: Use regex testers (regex101.com, regexr.com)
  3. Use Non-Capturing Groups: (?:) when you don't need to capture
  4. Watch Greedy Quantifiers: Prefer lazy quantifiers (.*?) or negated classes ([^<]*) when matching delimited text such as HTML/XML tags
  5. Escape Metacharacters: Escape . $ ^ * + ? { } [ ] \ | ( ) whenever they should match literally
  6. Comment Complex Patterns: Use verbose mode in Python, comments in code
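
Alongside online testers, a small table of expected results kept next to the pattern catches regressions when the pattern changes. The sketch below applies that idea to the username pattern from the negative lookahead section; failing_cases and the case list are hypothetical.

```python
import re

# Username pattern from the lookahead section; fullmatch() supplies the anchors.
USERNAME_RE = re.compile(r'(?!\d)[a-zA-Z0-9_]{3,16}')

# (input, should_match) pairs covering typical and edge cases.
CASES = [
    ('user123', True),
    ('123user', False),    # starts with a digit
    ('ab', False),         # too short
    ('a' * 17, False),     # too long
    ('user name', False),  # contains a space
]

def failing_cases(pattern, cases):
    """Return the cases whose actual result differs from the expected one."""
    return [
        (text, expected)
        for text, expected in cases
        if (pattern.fullmatch(text) is not None) != expected
    ]

print(failing_cases(USERNAME_RE, CASES))  # []
```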

Key Takeaways

  • Character classes ([a-z], \d, \w) match specific character sets
  • Quantifiers (*, +, ?, {n,m}) control repetition
  • Anchors (^, $, \b) match positions, not characters
  • Groups () capture submatches, (?:) groups without capturing
  • Lookaheads/lookbehinds (?=, ?!, ?<=, ?<!) are zero-width assertions: they test context without consuming characters
  • Named groups improve readability and maintenance

Next Steps

  • Learn atomic groups (?>...) for performance optimization
  • Explore Unicode properties (\p{L}, \p{N}) for international text
  • Master conditional patterns (?(condition)yes|no)
  • Study catastrophic backtracking and prevention strategies

Match patterns, not headaches.