Regular Expressions Mastery Across Languages

Introduction

Regular expressions (regex) provide powerful pattern matching for text processing. This guide covers regex syntax (character classes, quantifiers, anchors, groups, capturing, and lookaheads/lookbehinds) with practical examples for validation, extraction, and replacement in JavaScript, Python, and C#.

Basic Patterns

Literal Characters and Metacharacters

Simple matching:

// JavaScript
const text = "Hello World";

// Literal match
/Hello/.test(text);  // true
/hello/.test(text);  // false (case-sensitive by default)

// Case-insensitive flag
/hello/i.test(text);  // true

// Match any single character (.)
/H.llo/.test("Hello");   // true
/H.llo/.test("Hallo");   // true
/H.llo/.test("H123lo");  // false (. matches one char)

// Escape metacharacters
/example\.com/.test("example.com");  // true
/\$19\.99/.test("$19.99");           // true

// Metacharacters requiring escape: . ^ $ * + ? { } [ ] \ | ( )
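
When the text to match comes from a variable rather than a hand-written pattern, escaping each metacharacter by hand is easy to get wrong. The following is a minimal Python sketch, assuming the standard re module, that uses re.escape to turn arbitrary text into a literal pattern. The names literal, haystack, and pattern are illustrative only.

```python
import re

# Text that should be matched literally, metacharacters and all
literal = "$19.99 (sale)"
haystack = "Now $19.99 (sale) only"

# re.escape backslash-escapes every character that is special in a pattern
pattern = re.escape(literal)

unescaped_hit = re.search(literal, haystack)
escaped_hit = re.search(pattern, haystack)

print(unescaped_hit)        # None: '$' and '(' are read as metacharacters
print(escaped_hit.group())  # $19.99 (sale)
```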

Character Classes

Predefined classes:

# Python
import re

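# \d = digit ([0-9]; for str patterns Python also accepts other Unicode digits unless re.ASCII is set)
re.search(r'\d+', 'Order 12345')  # Matches '12345'

# \w = word character ([a-zA-Z0-9_], plus Unicode letters and digits unless re.ASCII is set)
re.search(r'\w+', 'hello_world')  # Matches 'hello_world'

# \s = whitespace ([ \t\n\r\f\v], plus other Unicode whitespace unless re.ASCII is set)
re.search(r'\s+', 'hello   world')  # Matches '   '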

# Negated classes:
# \D = non-digit [^0-9]
# \W = non-word character [^a-zA-Z0-9_]
# \S = non-whitespace

# Custom character class
re.search(r'[aeiou]', 'hello')  # Matches 'e' (first vowel)
re.search(r'[0-9]', 'abc123')   # Matches '1'
re.search(r'[^0-9]', '123abc')  # Matches 'a' (first non-digit)

# Ranges
re.search(r'[a-z]+', 'Hello')       # Matches 'ello'
re.search(r'[A-Z]+', 'Hello')       # Matches 'H'
re.search(r'[a-zA-Z]+', 'Hello123') # Matches 'Hello'
re.search(r'[0-9a-fA-F]+', 'FF00AA')  # Matches 'FF00AA' (hex)

C# examples:

using System.Text.RegularExpressions;

// Character class matching
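Regex.IsMatch("Hello123", @"[a-z]+");      // true ("ello" matches; IsMatch looks for a match anywhere)
Regex.IsMatch("Hello123", @"^[a-z]+$");    // false (whole string must be lowercase)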
Regex.IsMatch("Hello123", @"[a-zA-Z]+");   // true
Regex.IsMatch("user@example.com", @"[\w@.]+");  // true

// Extract digits
var match = Regex.Match("Price: $199.99", @"\d+\.\d+");
Console.WriteLine(match.Value);  // "199.99"
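
A character class on its own says nothing about where in the string the match must occur. As a small Python sketch of that distinction, re.search accepts a match anywhere while re.fullmatch requires the entire string to fit. The variable names are illustrative.

```python
import re

text = "Hello123"

# search succeeds if the pattern matches anywhere in the string
partial = re.search(r'[a-z]+', text)
print(partial.group())  # ello

# fullmatch succeeds only if the entire string fits the pattern
whole = re.fullmatch(r'[a-z]+', text)
print(whole)  # None

letters_then_digits = re.fullmatch(r'[A-Za-z]+\d+', text)
print(letters_then_digits.group())  # Hello123
```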

Quantifiers

Repetition Patterns

Basic quantifiers:

// JavaScript
const patterns = {
    '*': 'Zero or more',
    '+': 'One or more',
    '?': 'Zero or one (optional)',
    '{n}': 'Exactly n times',
    '{n,}': 'At least n times',
    '{n,m}': 'Between n and m times'
};

// Examples
/\d+/.test('123');       // true - one or more digits
/\d*/.test('');          // true - zero or more digits
/colou?r/.test('color'); // true - 'u' is optional
/colou?r/.test('colour');// true

// Specific counts
/\d{4}/.test('2025');         // true - exactly 4 digits
/\d{2,4}/.test('99');         // true - 2 to 4 digits
/\d{2,4}/.test('12345');      // true - matches first 4
/\w{3,}/.test('hello');       // true - at least 3 word chars

// Phone number pattern
/\d{3}-\d{3}-\d{4}/.test('555-123-4567');  // true
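
One detail that the table of quantifiers leaves implicit is scope: a quantifier repeats only the single token before it. The Python sketch below illustrates this, using a non-capturing group (?:...) to repeat a multi-character unit. The result names are hypothetical.

```python
import re

# A quantifier applies only to the token immediately before it
single_token = bool(re.fullmatch(r'ab+', 'abbb'))      # 'a' then one or more 'b'
not_a_unit = bool(re.fullmatch(r'ab+', 'abab'))        # 'ab' is not repeated as a unit

# Wrapping the tokens in a group makes the quantifier repeat the whole unit
grouped = bool(re.fullmatch(r'(?:ab)+', 'abab'))
too_few = bool(re.fullmatch(r'(?:ab){2,3}', 'ab'))     # fewer than 2 repetitions

print(single_token, not_a_unit, grouped, too_few)  # True False True False
```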

Greedy vs lazy (non-greedy):

# Python
import re

text = "<div>Content</div><div>More</div>"

# Greedy (default) - matches as much as possible
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group())  # '<div>Content</div><div>More</div>'

# Lazy (non-greedy) - matches as little as possible
lazy = re.search(r'<div>.*?</div>', text)
print(lazy.group())  # '<div>Content</div>'

# Password validation (8-20 chars)
pattern = r'^.{8,20}$'
re.match(pattern, 'password123')  # Valid
re.match(pattern, 'short')        # None (too short)
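
A lazy quantifier is one way to stop at the first closing delimiter. A negated character class is another, and it often backtracks less. The sketch below, which reuses the sample text from the example above, compares the two with re.findall.

```python
import re

text = "<div>Content</div><div>More</div>"

# Lazy quantifier: each match stops at the first closing tag
lazy_items = re.findall(r'<div>(.*?)</div>', text)

# Negated class: the content may not contain '<', so it cannot run past a tag
negated_items = re.findall(r'<div>([^<]*)</div>', text)

print(lazy_items)     # ['Content', 'More']
print(negated_items)  # ['Content', 'More']
```

The negated class only works when the content is known not to contain the excluded character, so it suits flat, non-nested text.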

Anchors and Boundaries

Position Matching

Start and end anchors:

// JavaScript
// ^ = start of string
// $ = end of string

/^Hello/.test('Hello World');   // true
/^Hello/.test('Say Hello');     // false

/World$/.test('Hello World');   // true
/World$/.test('World is big');  // false

// Exact match (start + end)
/^Hello World$/.test('Hello World');      // true
/^Hello World$/.test('Hello World!');     // false
/^Hello World$/.test('Say Hello World');  // false

// Validate format exactly
const emailPattern = /^[\w.-]+@[\w.-]+\.\w{2,}$/;
emailPattern.test('user@example.com');  // true
emailPattern.test('invalid email');     // false
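
Anchor behavior differs slightly between engines. As a hedged illustration for Python's re module: ^ matches only at the start of the string unless re.MULTILINE is set, and $ tolerates a single trailing newline, which re.fullmatch does not. The sample strings are made up for the demonstration.

```python
import re

text = "first line\nsecond line\n"

# Without MULTILINE, ^ matches only at the very start of the string
start_only = re.findall(r'^\w+', text)

# With MULTILINE, ^ also matches right after each newline
each_line = re.findall(r'^\w+', text, re.MULTILINE)

# $ tolerates one trailing newline; fullmatch does not
dollar_ok = bool(re.search(r'^\d+$', "123\n"))
full_ok = bool(re.fullmatch(r'\d+', "123\n"))

print(start_only)          # ['first']
print(each_line)           # ['first', 'second']
print(dollar_ok, full_ok)  # True False
```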

Word boundaries:

# Python
import re

# \b = word boundary (between \w and \W)
# \B = non-word boundary

text = "The cat in the cathedral"

# Match whole word 'cat'
re.search(r'\bcat\b', text)  # Matches 'cat' (standalone)
re.search(r'\bcat\b', 'cathedral')  # None (part of word)

# Find all whole words
words = re.findall(r'\b\w+\b', "Hello, world! How are you?")
print(words)  # ['Hello', 'world', 'How', 'are', 'you']

# Replace whole word only
result = re.sub(r'\bcat\b', 'dog', text)
print(result)  # "The dog in the cathedral"

Groups and Capturing

Parentheses for Grouping

Capturing groups:

// JavaScript
// ( ) = capturing group

const text = "John Doe (555-1234)";
const pattern = /(\w+) (\w+) \((\d{3}-\d{4})\)/;
const match = text.match(pattern);

console.log(match[0]);  // "John Doe (555-1234)" - full match
console.log(match[1]);  // "John" - first capture group
console.log(match[2]);  // "Doe" - second capture group
console.log(match[3]);  // "555-1234" - third capture group

// Named capturing groups (ES2018)
const namedPattern = /(?<firstName>\w+) (?<lastName>\w+) \((?<phone>[\d-]+)\)/;
const namedMatch = text.match(namedPattern);

console.log(namedMatch.groups.firstName);  // "John"
console.log(namedMatch.groups.lastName);   // "Doe"
console.log(namedMatch.groups.phone);      // "555-1234"

Python named groups:

# Python
import re

# Named groups with ?P<name>
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, 'Date: 2025-05-15')

print(match.group('year'))   # '2025'
print(match.group('month'))  # '05'
print(match.group('day'))    # '15'

# Access as dictionary
print(match.groupdict())
# {'year': '2025', 'month': '05', 'day': '15'}

# Extract email components
email_pattern = r'(?P<user>[\w.-]+)@(?P<domain>[\w.-]+)\.(?P<tld>\w+)'
email_match = re.search(email_pattern, 'user@example.com')

print(email_match.group('user'))    # 'user'
print(email_match.group('domain'))  # 'example'
print(email_match.group('tld'))     # 'com'

C# named groups:

// C#
using System.Text.RegularExpressions;

var pattern = @"(?<area>\d{3})-(?<exchange>\d{3})-(?<number>\d{4})";
var match = Regex.Match("555-123-4567", pattern);

if (match.Success)
{
    Console.WriteLine(match.Groups["area"].Value);      // "555"
    Console.WriteLine(match.Groups["exchange"].Value);  // "123"
    Console.WriteLine(match.Groups["number"].Value);    // "4567"
}

Non-capturing groups:

// (?: ) = non-capturing group (for grouping without capturing)

// Without non-capturing group
const withCapture = /(\d{3})-(\d{3})-(\d{4})/.exec('555-123-4567');
console.log(withCapture);  // ['555-123-4567', '555', '123', '4567']

// With non-capturing group
const withoutCapture = /(?:\d{3})-(\d{3})-(\d{4})/.exec('555-123-4567');
console.log(withoutCapture);  // ['555-123-4567', '123', '4567']

// Useful for alternation
/(https?|ftp):\/\//.test('https://example.com');  // true
/(?:https?|ftp):\/\//.test('ftp://files.com');    // true
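
The choice between capturing and non-capturing groups changes results in some APIs, not just memory use. In Python, for instance, re.findall returns the group contents when the pattern has a capturing group and the whole match otherwise. A short sketch with an invented sample string:

```python
import re

text = "Mirrors: https://example.com and ftp://files.example.org"

# With a capturing group, findall returns only the group's text
schemes = re.findall(r'(https?|ftp)://\S+', text)

# With a non-capturing group, findall returns each whole match
urls = re.findall(r'(?:https?|ftp)://\S+', text)

print(schemes)  # ['https', 'ftp']
print(urls)     # ['https://example.com', 'ftp://files.example.org']
```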

Lookaheads and Lookbehinds

Zero-Width Assertions

Positive lookahead (?=):

// JavaScript
// (?= ) = positive lookahead (match if followed by pattern)

// Password must contain digit
/^(?=.*\d).{8,}$/.test('password123');  // true
/^(?=.*\d).{8,}$/.test('password');     // false

// Password must contain uppercase AND lowercase AND digit
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/.test('Pass1234');  // true
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/.test('password1'); // false

// Extract word before comma
/\w+(?=,)/.exec('apple,banana,orange');  // ['apple']
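
Stacked lookaheads answer only "valid or not". When the caller needs to know which requirement failed, one alternative is to test each rule separately. The Python sketch below does that. The RULES table and the missing_rules helper are hypothetical names introduced for illustration.

```python
import re

# Hypothetical rule table: label -> pattern that must match somewhere
RULES = {
    "lowercase letter": r'[a-z]',
    "uppercase letter": r'[A-Z]',
    "digit": r'\d',
    "at least 8 characters": r'^.{8,}$',
}

def missing_rules(password):
    """Return the labels of every rule the password fails."""
    return [label for label, pattern in RULES.items()
            if not re.search(pattern, password)]

print(missing_rules('Pass1234'))   # []
print(missing_rules('password1'))  # ['uppercase letter']
print(missing_rules('abc'))        # ['uppercase letter', 'digit', 'at least 8 characters']
```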

Negative lookahead (?!):

# Python
import re

# (?! ) = negative lookahead (match if NOT followed by pattern)

# Find 'q' not followed by 'u'
pattern = r'q(?!u)'
re.findall(pattern, 'Iraq Qatar queue')  # ['q'] (only in Iraq)

# Username: 3-16 letters, digits, or underscores, but cannot start with a digit
username_pattern = r'^(?!\d)[a-zA-Z0-9_]{3,16}$'
re.match(username_pattern, 'user123')   # Valid
re.match(username_pattern, '123user')   # None (starts with digit)

Positive lookbehind (?<=):

# Python
# (?<= ) = positive lookbehind (match if preceded by pattern)

# Find price (digits after $)
pattern = r'(?<=\$)\d+(?:\.\d{2})?'
re.findall(pattern, 'Items: $19.99, $5, $150.00')
# ['19.99', '5', '150.00']

# Extract @mentions (word characters after @, including underscores)
mentions_pattern = r'(?<=@)\w+'
text = "Hello @alice and @bob_123!"
re.findall(mentions_pattern, text)  # ['alice', 'bob_123']

Negative lookbehind (?<!):

// C#
using System.Text.RegularExpressions;

// (?<! ) = negative lookbehind (match if NOT preceded by pattern)

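// Find numbers not preceded by $
// The \d in the lookbehind stops a match from starting inside "100";
// with (?<!\$) alone, the engine would still match the trailing "00"
var pattern = @"(?<![$\d])\d+";
var matches = Regex.Matches("Price: $100 and 50 items", pattern);
// "100" is skipped because it follows $; "50" is matched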

foreach (Match match in matches)
{
    Console.WriteLine(match.Value);  // "50"
}
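
A lookbehind inspects only the position immediately before the candidate match, which has a subtle consequence for multi-digit numbers. The Python sketch below compares a lookbehind that excludes only $ with one that also excludes a preceding digit, on the same sample sentence.

```python
import re

text = "Price: $100 and 50 items"

# Only the position before the match is checked, so the engine can still
# start a match at the second digit of "100"
naive = re.findall(r'(?<!\$)\d+', text)

# Also forbidding a preceding digit keeps matches from starting mid-number
strict = re.findall(r'(?<![$\d])\d+', text)

print(naive)   # ['00', '50']
print(strict)  # ['50']
```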

Practical Examples

Email Validation

Basic email pattern:

// JavaScript
const emailPattern = /^[\w.-]+@[\w.-]+\.\w{2,}$/;

// Valid emails
emailPattern.test('user@example.com');      // true
emailPattern.test('john.doe@company.org');  // true
emailPattern.test('test_123@sub.domain.co.uk');  // true

// Invalid emails
emailPattern.test('invalid');               // false
emailPattern.test('@example.com');          // false
emailPattern.test('user@');                 // false

// More comprehensive email validation
const strictEmail = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;
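
To see why a stricter pattern may be worth its length, it helps to probe what the basic one lets through. The following Python sketch applies the same basic pattern with re.fullmatch. The looks_like_email helper is a hypothetical name, and the odd-looking inputs are invented test cases.

```python
import re

# Same shape as the basic pattern above; fullmatch supplies the anchoring
BASIC_EMAIL = re.compile(r'[\w.-]+@[\w.-]+\.\w{2,}')

def looks_like_email(address):
    """Return True if the address fits the basic pattern end to end."""
    return BASIC_EMAIL.fullmatch(address) is not None

print(looks_like_email('user@example.com'))  # True
print(looks_like_email('user@example'))      # False (no dot after the domain)
print(looks_like_email('..@-.com'))          # True (permissive: dots and hyphens anywhere)
```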

Phone Number Formats

Multiple formats:

# Python
import re

def validate_phone(phone):
    """Validate US phone number in various formats."""
    patterns = [
        r'^\d{3}-\d{3}-\d{4}$',           # 555-123-4567
        r'^\(\d{3}\) \d{3}-\d{4}$',       # (555) 123-4567
        r'^\d{10}$',                       # 5551234567
        r'^\+1-\d{3}-\d{3}-\d{4}$',       # +1-555-123-4567
    ]
    
    return any(re.match(pattern, phone) for pattern in patterns)

# Test
print(validate_phone('555-123-4567'))   # True
print(validate_phone('(555) 123-4567')) # True
print(validate_phone('5551234567'))     # True
print(validate_phone('invalid'))        # False

# Extract and normalize phone numbers
def extract_phone(text):
    """Extract phone number and normalize to XXX-XXX-XXXX format."""
    pattern = r'(?:\+1[-.]?)?\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})'
    match = re.search(pattern, text)
    if match:
        return f'{match.group(1)}-{match.group(2)}-{match.group(3)}'
    return None

print(extract_phone('Call me at (555) 123-4567'))  # '555-123-4567'
print(extract_phone('Phone: 555.123.4567'))        # '555-123-4567'
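
The extraction function above stops at the first number. A possible extension, sketched here with re.finditer and the same pattern, collects every number in the text. The extract_all_phones name and the sample sentence are assumptions made for the example.

```python
import re

PHONE = re.compile(r'(?:\+1[-.]?)?\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})')

def extract_all_phones(text):
    """Return every phone number in text, normalized to XXX-XXX-XXXX."""
    return ['-'.join(m.groups()) for m in PHONE.finditer(text)]

found = extract_all_phones('Office: (555) 123-4567, cell: 555.987.6543')
print(found)  # ['555-123-4567', '555-987-6543']
```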

URL Parsing

Extract URL components:

// JavaScript
const urlPattern = /^(https?):\/\/([^:\/\s]+)(?::(\d+))?(\/[^\s]*)?$/;

const url = 'https://example.com:8080/path/to/page?query=value';
const match = url.match(urlPattern);

if (match) {
    console.log('Protocol:', match[1]);  // 'https'
    console.log('Domain:', match[2]);    // 'example.com'
    console.log('Port:', match[3]);      // '8080'
    console.log('Path:', match[4]);      // '/path/to/page?query=value'
}

// Extract all URLs from text
const text = "Visit https://example.com or http://test.org for more info";
const urls = text.match(/https?:\/\/[^\s]+/g);
console.log(urls);  // ['https://example.com', 'http://test.org']
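
The component pattern above leaves the query string attached to the path. One way to separate them, shown here as a Python sketch with named groups, is to stop the path at ? and capture the rest as an optional group. The group names and the URL constant are illustrative, and the pattern still covers only simple http(s) URLs.

```python
import re

# Pattern split across adjacent raw strings for readability
URL = re.compile(
    r'^(?P<scheme>https?)://'
    r'(?P<host>[^:/?\s]+)'
    r'(?::(?P<port>\d+))?'
    r'(?P<path>/[^\s?]*)?'
    r'(?:\?(?P<query>\S*))?$'
)

full = URL.match('https://example.com:8080/path/to/page?query=value').groupdict()
bare = URL.match('http://test.org').groupdict()

print(full['host'], full['port'], full['path'], full['query'])
# example.com 8080 /path/to/page query=value
print(bare['port'], bare['path'], bare['query'])
# None None None
```

Optional groups that do not take part in the match come back as None, so callers should check before using them.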

Data Extraction

Parse log files:

# Python
import re
from datetime import datetime

log_pattern = r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.*)'

log_lines = [
    '2025-05-05 14:30:00 [INFO] Application started',
    '2025-05-05 14:30:15 [ERROR] Database connection failed',
    '2025-05-05 14:30:20 [WARN] Retrying connection',
]

for line in log_lines:
    match = re.match(log_pattern, line)
    if match:
        timestamp = datetime.strptime(match.group('timestamp'), '%Y-%m-%d %H:%M:%S')
        level = match.group('level')
        message = match.group('message')
        print(f'{level}: {message} at {timestamp}')
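
Real log files usually contain lines that do not fit the expected format. The sketch below, an illustration rather than a complete parser, counts entries per level with collections.Counter and tallies the lines it had to skip. The extra sample lines are invented for the demonstration.

```python
import re
from collections import Counter

LOG_LINE = re.compile(
    r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[(?P<level>\w+)\] (?P<message>.*)'
)

sample_lines = [
    '2025-05-05 14:30:00 [INFO] Application started',
    '2025-05-05 14:30:15 [ERROR] Database connection failed',
    'not a log line',
    '2025-05-05 14:30:20 [WARN] Retrying connection',
    '2025-05-05 14:30:25 [ERROR] Giving up',
]

level_counts = Counter()
skipped = 0
for line in sample_lines:
    m = LOG_LINE.match(line)
    if m is None:
        skipped += 1  # line does not follow the expected layout
        continue
    level_counts[m.group('level')] += 1

print(dict(level_counts))  # {'INFO': 1, 'ERROR': 2, 'WARN': 1}
print(skipped)             # 1
```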

Extract data from HTML:

// C#
using System.Text.RegularExpressions;

// Extract all links from HTML
var html = @"
    <a href='/home'>Home</a>
    <a href='https://example.com'>Example</a>
    <a href='/contact'>Contact</a>
";

var linkPattern = @"<a\s+href=['""]([^'""]+)['""]>([^<]+)</a>";
var matches = Regex.Matches(html, linkPattern);

foreach (Match match in matches)
{
    var url = match.Groups[1].Value;
    var text = match.Groups[2].Value;
    Console.WriteLine($"{text}: {url}");
}
// Output:
// Home: /home
// Example: https://example.com
// Contact: /contact

String Replacement

Find and replace:

// JavaScript
// Simple replacement
'hello world'.replace(/world/, 'JavaScript');  // 'hello JavaScript'

// Global replacement (all occurrences)
'foo bar foo'.replace(/foo/g, 'baz');  // 'baz bar baz'

// Case-insensitive replacement
'Hello WORLD'.replace(/world/gi, 'JavaScript');  // 'Hello JavaScript'

// Replacement with capturing groups
const date = '2025-05-15';
const formatted = date.replace(/(\d{4})-(\d{2})-(\d{2})/, '$2/$3/$1');
console.log(formatted);  // '05/15/2025'

// Replacement with function
const text = 'Total: $100, Tax: $8, Shipping: $5';
const doubled = text.replace(/\$(\d+)/g, (match, amount) => {
    return '$' + (parseInt(amount, 10) * 2);  // explicit radix avoids surprises
});
console.log(doubled);  // 'Total: $200, Tax: $16, Shipping: $10'

Python substitution:

# Python
import re

# Simple substitution
re.sub(r'apple', 'orange', 'I like apple pie')  # 'I like orange pie'

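# Using captured groups (the literal "Name: " is repeated in the replacement
# because re.sub replaces the entire match, not just the groups)
text = 'Name: John Doe, Age: 30'
result = re.sub(r'Name: (\w+) (\w+)', r'Name: \2, \1', text)
print(result)  # 'Name: Doe, John, Age: 30'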

# Substitution with function
def uppercase_match(match):
    return match.group().upper()

text = 'hello world from python'
result = re.sub(r'\b\w+\b', uppercase_match, text)
print(result)  # 'HELLO WORLD FROM PYTHON'

# Remove HTML tags
html = '<p>Hello <b>world</b>!</p>'
clean = re.sub(r'<[^>]+>', '', html)
print(clean)  # 'Hello world!'
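
Numbered backreferences become hard to read once a pattern has more than a few groups. As a brief sketch, Python's replacement strings can refer to named groups with \g<name>, and re.subn additionally reports how many replacements were made. The sample text is invented.

```python
import re

date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
text = 'Released 2025-05-15, patched 2025-06-01'

# \g<name> refers to a named group inside the replacement string
us_style = date_pattern.sub(r'\g<month>/\g<day>/\g<year>', text)
print(us_style)  # Released 05/15/2025, patched 06/01/2025

# subn returns the new string together with the replacement count
redacted, count = date_pattern.subn('<date>', text)
print(redacted)  # Released <date>, patched <date>
print(count)     # 2
```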

Language-Specific Features

JavaScript Flags

// i = case-insensitive
/hello/i.test('HELLO');  // true

// g = global (find all matches)
'foo bar foo'.match(/foo/g);  // ['foo', 'foo']

// m = multiline (^ and $ match line boundaries)
const text = 'Line 1\nLine 2';
text.match(/^Line/gm);  // ['Line', 'Line']

// s = dotAll (. matches newlines)
/hello.world/s.test('hello\nworld');  // true

// u = unicode
/\u{1F600}/u.test('😀');  // true

// y = sticky (matches at exact position)
const pattern = /foo/y;
pattern.lastIndex = 4;
pattern.test('foo foo');  // true (matches at position 4)
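
For comparison, Python expresses most of the same options as constants passed to the re functions, or inline at the start of the pattern. The sketch below shows rough equivalents of the i, m, and s flags. There is no g flag because re.findall and re.finditer already return all matches.

```python
import re

text = 'Line 1\nline 2'

# i -> re.IGNORECASE
ignore_case = bool(re.search(r'hello', 'HELLO', re.IGNORECASE))

# m -> re.MULTILINE (flags combine with |)
line_starts = re.findall(r'^line', text, re.IGNORECASE | re.MULTILINE)

# s -> re.DOTALL
dot_default = bool(re.search(r'hello.world', 'hello\nworld'))
dot_all = bool(re.search(r'hello.world', 'hello\nworld', re.DOTALL))

# The same flags written inline at the start of the pattern
inline = re.findall(r'(?im)^line', text)

print(ignore_case)           # True
print(line_starts)           # ['Line', 'line']
print(dot_default, dot_all)  # False True
print(inline)                # ['Line', 'line']
```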

Python re Module

import re

# Compile pattern for reuse
pattern = re.compile(r'\d+')
pattern.findall('123 abc 456')  # ['123', '456']

# Verbose mode (comments and whitespace ignored)
email_pattern = re.compile(r'''
    [\w.-]+    # username
    @          # at symbol
    [\w.-]+    # domain
    \.         # dot
    \w{2,}     # TLD
''', re.VERBOSE)

# Module-level functions (signatures only: string and repl are placeholders)
re.search(pattern, string)   # Find first match
re.match(pattern, string)    # Match at start
re.findall(pattern, string)  # Find all matches (list)
re.finditer(pattern, string) # Find all matches (iterator)
re.sub(pattern, repl, string)  # Replace
re.split(pattern, string)    # Split by pattern
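
The practical difference between findall and finditer is what each hands back: strings versus match objects. A short sketch of why that matters, using the compiled pattern from the start of this section:

```python
import re

pattern = re.compile(r'\d+')
text = '123 abc 456'

# finditer yields match objects, so positions are available alongside the text
positions = [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]
print(positions)  # [('123', 0, 3), ('456', 8, 11)]

# split keeps the empty strings produced by matches at the edges
pieces = pattern.split(text)
print(pieces)  # ['', ' abc ', '']
```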

C# Regex Options

using System.Text.RegularExpressions;

// RegexOptions enumeration
var pattern = @"hello";

// Case-insensitive
Regex.IsMatch("HELLO", pattern, RegexOptions.IgnoreCase);

// Multiline
var text = "Line 1\nLine 2";
Regex.Matches(text, @"^Line", RegexOptions.Multiline);

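// Compiled (faster matching for repeated use, at the cost of slower construction)
var compiled = new Regex(@"\d+", RegexOptions.Compiled);

// Timeout (bounds the cost of catastrophic backtracking; a match that runs
// longer throws RegexMatchTimeoutException)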
var regex = new Regex(@"a+b+c+", RegexOptions.None, TimeSpan.FromSeconds(1));
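
Python's built-in re module has no timeout parameter, so there the usual defense is to rewrite the pattern. The sketch below is a hedged illustration of the problem: a nested quantifier and an equivalent single quantifier are run against a short input that almost matches. The input is deliberately small so the risky pattern still finishes quickly. Timings vary by machine, so none are shown.

```python
import re
import time

# Nested quantifiers: (a+)+ can split a run of 'a's in exponentially many ways
risky = re.compile(r'^(a+)+$')
# Same set of accepted strings with a single quantifier: only one way to match
safe = re.compile(r'^a+$')

subject = 'a' * 18 + 'b'  # almost matches, which forces the engine to backtrack

for name, pattern in (('risky', risky), ('safe', safe)):
    start = time.perf_counter()
    result = pattern.match(subject)
    elapsed = time.perf_counter() - start
    print(f'{name}: matched={result is not None}, seconds={elapsed:.4f}')
```

Each additional 'a' in the subject roughly doubles the work for the risky pattern, which is why inputs of only a few dozen characters can stall a program.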

Best Practices

Start Simple: Begin with basic patterns and add complexity gradually
Test Thoroughly: Use regex testers (regex101.com, regexr.com) and test against both matching and non-matching inputs
Use Non-Capturing Groups: Use (?:) when you don't need the captured text
Prefer Lazy Quantifiers or Negated Classes: Use .*? or [^<]* where a greedy .* would run past the intended delimiter, as with HTML/XML tags
Escape Metacharacters: Escape . $ ^ * + ? { } [ ] \ | ( ) whenever you mean them literally
Comment Complex Patterns: Use verbose mode in Python, comments in code

Key Takeaways

Character classes ([a-z], \d, \w) match specific character sets
Quantifiers (*, +, ?, {n,m}) control repetition
Anchors (^, $, \b) match positions, not characters
Groups () capture submatches, (?:) groups without capturing
Lookaheads/lookbehinds (?=, ?!, ?<=, ?<!) enable zero-width assertions
Named groups improve readability and maintenance

Next Steps

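Learn atomic groups (?>...) for performance optimization (available in .NET and in Python 3.11+, not in JavaScript)
Explore Unicode properties (\p{L}, \p{N}) for international text (available in .NET and in JavaScript with the u flag, not in Python's built-in re module)
Master conditional patterns (?(condition)yes|no) (available in Python and .NET, not in JavaScript)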
Study catastrophic backtracking and prevention strategies

Match patterns, not headaches.