Regular Expressions Mastery Across Languages
Introduction
Regular expressions (regex) provide powerful pattern matching for text processing. This guide covers regex syntax (character classes, quantifiers, anchors, groups, capturing, and lookaheads/lookbehinds) with practical examples for validation, extraction, and replacement in JavaScript, Python, and C#.
Basic Patterns
Literal Characters and Metacharacters
Simple matching:
// JavaScript
const text = "Hello World";
// Literal match
/Hello/.test(text); // true
/hello/.test(text); // false (case-sensitive by default)
// Case-insensitive flag
/hello/i.test(text); // true
// Match any single character (.)
/H.llo/.test("Hello"); // true
/H.llo/.test("Hallo"); // true
/H.llo/.test("H123lo"); // false (. matches one char)
// Escape metacharacters
/example\.com/.test("example.com"); // true
/\$19\.99/.test("$19.99"); // true
// Metacharacters requiring escape: . ^ $ * + ? { } [ ] \ | ( )
Character Classes
Predefined classes:
# Python
import re
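# \d = digit ([0-9] with re.ASCII; any Unicode decimal digit by default for str patterns)
re.search(r'\d+', 'Order 12345') # Matches '12345'
# \w = word character ([a-zA-Z0-9_] with re.ASCII; also Unicode letters and digits by default)
re.search(r'\w+', 'hello_world') # Matches 'hello_world'
# \s = whitespace ([ \t\n\r\f\v] with re.ASCII; also other Unicode whitespace by default)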
re.search(r'\s+', 'hello world') # Matches ' '
# Negated classes:
# \D = non-digit [^0-9]
# \W = non-word character [^a-zA-Z0-9_]
# \S = non-whitespace
# Custom character class
re.search(r'[aeiou]', 'hello') # Matches 'e' (first vowel)
re.search(r'[0-9]', 'abc123') # Matches '1'
re.search(r'[^0-9]', '123abc') # Matches 'a' (first non-digit)
# Ranges
re.search(r'[a-z]+', 'Hello') # Matches 'ello'
re.search(r'[A-Z]+', 'Hello') # Matches 'H'
re.search(r'[a-zA-Z]+', 'Hello123') # Matches 'Hello'
re.search(r'[0-9a-fA-F]+', 'FF00AA') # Matches 'FF00AA' (hex)
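Python's shorthand classes are Unicode-aware for str patterns, which can surprise code that expects ASCII-only input. The following is a minimal sketch of the difference, assuming Python 3; the sample strings and variable names are illustrative and not part of the examples above.

```python
import re

# "\u0661\u0662\u0663" are Arabic-Indic digits, which Unicode classifies as decimal digits.
sample = "Order \u0661\u0662\u0663 / 456"

unicode_digits = re.findall(r'\d+', sample)          # default: Unicode-aware \d
ascii_digits = re.findall(r'\d+', sample, re.ASCII)  # re.ASCII restricts \d to [0-9]

# \w behaves the same way: "\u00e9" (e with acute accent) is a word character by default.
phrase = "caf\u00e9 au lait"
unicode_words = re.findall(r'\w+', phrase)
ascii_words = re.findall(r'\w+', phrase, re.ASCII)

print(len(unicode_digits))  # 2 (the Arabic-Indic run and '456')
print(ascii_digits)         # ['456']
print(len(unicode_words))   # 3
print(ascii_words)          # ['caf', 'au', 'lait']
```

Passing re.ASCII (or writing the explicit range, such as [0-9]) is one way to keep validation limited to ASCII input.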
C# examples:
using System.Text.RegularExpressions;
// Character class matching
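Regex.IsMatch("Hello123", @"[a-z]+"); // true (IsMatch searches anywhere; "ello" matches)
Regex.IsMatch("Hello123", @"^[a-z]+$"); // false (anchors require the whole string to be lowercase)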
Regex.IsMatch("Hello123", @"[a-zA-Z]+"); // true
Regex.IsMatch("user@example.com", @"[\w@.]+"); // true
// Extract digits
var match = Regex.Match("Price: $199.99", @"\d+\.\d+");
Console.WriteLine(match.Value); // "199.99"
Quantifiers
Repetition Patterns
Basic quantifiers:
// JavaScript
const patterns = {
  '*': 'Zero or more',
  '+': 'One or more',
  '?': 'Zero or one (optional)',
  '{n}': 'Exactly n times',
  '{n,}': 'At least n times',
  '{n,m}': 'Between n and m times'
};
// Examples
/\d+/.test('123'); // true - one or more digits
/\d*/.test(''); // true - zero or more digits
/colou?r/.test('color'); // true - 'u' is optional
/colou?r/.test('colour');// true
// Specific counts
/\d{4}/.test('2025'); // true - exactly 4 digits
/\d{2,4}/.test('99'); // true - 2 to 4 digits
/\d{2,4}/.test('12345'); // true - matches first 4
/\w{3,}/.test('hello'); // true - at least 3 word chars
// Phone number pattern
/\d{3}-\d{3}-\d{4}/.test('555-123-4567'); // true
Greedy vs lazy (non-greedy):
# Python
import re
text = "<div>Content</div><div>More</div>"
# Greedy (default) - matches as much as possible
greedy = re.search(r'<div>.*</div>', text)
print(greedy.group()) # '<div>Content</div><div>More</div>'
# Lazy (non-greedy) - matches as little as possible
lazy = re.search(r'<div>.*?</div>', text)
print(lazy.group()) # '<div>Content</div>'
# Password validation (8-20 chars)
pattern = r'^.{8,20}$'
re.match(pattern, 'password123') # Valid
re.match(pattern, 'short') # None (too short)
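The same contrast shows up when collecting every match rather than only the first. The sketch below reuses the sample markup from the greedy/lazy example and adds a negated character class as a third option; the variable names are illustrative.

```python
import re

text = "<div>Content</div><div>More</div>"

# Greedy: a single match running from the first <div> to the last </div>.
greedy_all = re.findall(r'<div>(.*)</div>', text)

# Lazy: stops at the nearest </div>, so both elements are found.
lazy_all = re.findall(r'<div>(.*?)</div>', text)

# Negated class: "anything except <" cannot run past a tag at all.
# This only works when the element content contains no nested tags.
negated_all = re.findall(r'<div>([^<]*)</div>', text)

print(greedy_all)   # ['Content</div><div>More']
print(lazy_all)     # ['Content', 'More']
print(negated_all)  # ['Content', 'More']
```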
Anchors and Boundaries
Position Matching
Start and end anchors:
// JavaScript
// ^ = start of string
// $ = end of string
/^Hello/.test('Hello World'); // true
/^Hello/.test('Say Hello'); // false
/World$/.test('Hello World'); // true
/World$/.test('World is big'); // false
// Exact match (start + end)
/^Hello World$/.test('Hello World'); // true
/^Hello World$/.test('Hello World!'); // false
/^Hello World$/.test('Say Hello World'); // false
// Validate format exactly
const emailPattern = /^[\w.-]+@[\w.-]+\.\w{2,}$/;
emailPattern.test('user@example.com'); // true
emailPattern.test('invalid email'); // false
Word boundaries:
# Python
import re
# \b = word boundary (between \w and \W, or between \w and the start/end of the string)
# \B = non-word boundary
text = "The cat in the cathedral"
# Match whole word 'cat'
re.search(r'\bcat\b', text) # Matches 'cat' (standalone)
re.search(r'\bcat\b', 'cathedral') # None (part of word)
# Find all whole words
words = re.findall(r'\b\w+\b', "Hello, world! How are you?")
print(words) # ['Hello', 'world', 'How', 'are', 'you']
# Replace whole word only
result = re.sub(r'\bcat\b', 'dog', text)
print(result) # "The dog in the cathedral"
Groups and Capturing
Parentheses for Grouping
Capturing groups:
// JavaScript
// ( ) = capturing group
const text = "John Doe (555-1234)";
const pattern = /(\w+) (\w+) \((\d{3}-\d{4})\)/;
const match = text.match(pattern);
console.log(match[0]); // "John Doe (555-1234)" - full match
console.log(match[1]); // "John" - first capture group
console.log(match[2]); // "Doe" - second capture group
console.log(match[3]); // "555-1234" - third capture group
// Named capturing groups (ES2018)
const namedPattern = /(?<firstName>\w+) (?<lastName>\w+) \((?<phone>[\d-]+)\)/;
const namedMatch = text.match(namedPattern);
console.log(namedMatch.groups.firstName); // "John"
console.log(namedMatch.groups.lastName); // "Doe"
console.log(namedMatch.groups.phone); // "555-1234"
Python named groups:
# Python
import re
# Named groups with ?P<name>
pattern = r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})'
match = re.search(pattern, 'Date: 2025-05-15')
print(match.group('year')) # '2025'
print(match.group('month')) # '05'
print(match.group('day')) # '15'
# Access as dictionary
print(match.groupdict())
# {'year': '2025', 'month': '05', 'day': '15'}
# Extract email components
email_pattern = r'(?P<user>[\w.-]+)@(?P<domain>[\w.-]+)\.(?P<tld>\w+)'
email_match = re.search(email_pattern, 'user@example.com')
print(email_match.group('user')) # 'user'
print(email_match.group('domain')) # 'example'
print(email_match.group('tld')) # 'com'
C# named groups:
// C#
using System.Text.RegularExpressions;
var pattern = @"(?<area>\d{3})-(?<exchange>\d{3})-(?<number>\d{4})";
var match = Regex.Match("555-123-4567", pattern);
if (match.Success)
{
    Console.WriteLine(match.Groups["area"].Value); // "555"
    Console.WriteLine(match.Groups["exchange"].Value); // "123"
    Console.WriteLine(match.Groups["number"].Value); // "4567"
}
Non-capturing groups:
// (?: ) = non-capturing group (for grouping without capturing)
// Without non-capturing group
const withCapture = /(\d{3})-(\d{3})-(\d{4})/.exec('555-123-4567');
console.log(withCapture); // ['555-123-4567', '555', '123', '4567']
// With non-capturing group
const withoutCapture = /(?:\d{3})-(\d{3})-(\d{4})/.exec('555-123-4567');
console.log(withoutCapture); // ['555-123-4567', '123', '4567']
// Useful for alternation
/(https?|ftp):\/\//.test('https://example.com'); // true
/(?:https?|ftp):\/\//.test('ftp://files.com'); // true
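The choice between capturing and non-capturing groups also changes what some APIs return. As a small sketch in Python (the sample text is illustrative), re.findall returns only the group contents when the pattern has a capturing group, and the whole match when it has none:

```python
import re

text = "Mirrors: https://example.com, ftp://files.com"

# Capturing group: findall returns just the text of the group.
captured = re.findall(r'(https?|ftp)://[\w.]+', text)

# Non-capturing group: findall returns each complete match.
whole = re.findall(r'(?:https?|ftp)://[\w.]+', text)

print(captured)  # ['https', 'ftp']
print(whole)     # ['https://example.com', 'ftp://files.com']
```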
Lookaheads and Lookbehinds
Zero-Width Assertions
Positive lookahead (?=):
// JavaScript
// (?= ) = positive lookahead (match if followed by pattern)
// Password must contain digit
/^(?=.*\d).{8,}$/.test('password123'); // true
/^(?=.*\d).{8,}$/.test('password'); // false
// Password must contain uppercase AND lowercase AND digit
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/.test('Pass1234'); // true
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$/.test('password1'); // false
// Extract word before comma
/\w+(?=,)/.exec('apple,banana,orange'); // ['apple']
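Each lookahead is an independent check made from the same position, so the combined password pattern accepts a string only when every individual requirement holds. The Python sketch below places the combined pattern next to equivalent separate checks; the RULES table and the failed_rules helper are hypothetical names introduced for illustration.

```python
import re

# The combined lookahead pattern from the password examples above.
COMBINED = re.compile(r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$')

# The same requirements as standalone patterns (names are illustrative).
RULES = {
    'lowercase letter': r'[a-z]',
    'uppercase letter': r'[A-Z]',
    'digit': r'\d',
    'at least 8 characters': r'^.{8,}$',
}

def failed_rules(password):
    """Return the names of the rules that the password does not satisfy."""
    return [name for name, pattern in RULES.items()
            if not re.search(pattern, password)]

print(failed_rules('Pass1234'))           # []
print(failed_rules('password1'))          # ['uppercase letter']
print(bool(COMBINED.match('Pass1234')))   # True
print(bool(COMBINED.match('password1')))  # False
```

Separate checks make it possible to report which rule failed, while the single pattern is more convenient where only one expression is allowed.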
Negative lookahead (?!):
# Python
import re
# (?! ) = negative lookahead (match if NOT followed by pattern)
# Find 'q' not followed by 'u'
pattern = r'q(?!u)'
re.findall(pattern, 'Iraq Qatar queue') # ['q'] (only in Iraq)
# Username: 3-16 letters, digits, or underscores, but cannot start with a digit
username_pattern = r'^(?!\d)[a-zA-Z0-9_]{3,16}$'
re.match(username_pattern, 'user123') # Valid
re.match(username_pattern, '123user') # None (starts with digit)
Positive lookbehind (?<=):
# Python
# (?<= ) = positive lookbehind (match if preceded by pattern)
# Find price (digits after $)
pattern = r'(?<=\$)\d+(?:\.\d{2})?'
re.findall(pattern, 'Items: $19.99, $5, $150.00')
# ['19.99', '5', '150.00']
# Extract @mentions (alphanumeric after @)
mentions_pattern = r'(?<=@)\w+'
text = "Hello @alice and @bob_123!"
re.findall(mentions_pattern, text) # ['alice', 'bob_123']
Negative lookbehind (?<!):
// C#
using System.Text.RegularExpressions;
// (?<! ) = negative lookbehind (match if NOT preceded by pattern)
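// Find numbers not preceded by $
// The lookbehind also excludes a preceding digit: with (?<!\$) alone, the engine
// rejects a match starting at "1" in "$100" but still matches "00" from the next digit
var pattern = @"(?<![\$\d])\d+";
var matches = Regex.Matches("Price: $100 and 50 items", pattern);
// "100" in "$100" is skipped, "50" is matched
foreach (Match match in matches)
{
    Console.WriteLine(match.Value); // "50"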
}
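A lookbehind is tested at every position where a match could start, which matters for runs of digits. The following Python sketch runs on the same sample sentence and compares a lookbehind that excludes only "$" with one that also excludes a preceding digit; the variable names are illustrative.

```python
import re

text = "Price: $100 and 50 items"

# Only "$" is excluded: a match cannot start at "1", but it can start at the
# following "0", because the character before that "0" is "1", not "$".
dollar_only = re.findall(r'(?<!\$)\d+', text)

# Excluding a preceding digit as well forces every match to begin at the
# start of a digit run, so the whole of "100" is skipped.
dollar_or_digit = re.findall(r'(?<![$\d])\d+', text)

print(dollar_only)      # ['00', '50']
print(dollar_or_digit)  # ['50']
```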
Practical Examples
Email Validation
Basic email pattern:
// JavaScript
const emailPattern = /^[\w.-]+@[\w.-]+\.\w{2,}$/;
// Valid emails
emailPattern.test('user@example.com'); // true
emailPattern.test('john.doe@company.org'); // true
emailPattern.test('test_123@sub.domain.co.uk'); // true
// Invalid emails
emailPattern.test('invalid'); // false
emailPattern.test('@example.com'); // false
emailPattern.test('user@'); // false
// More comprehensive email validation
const strictEmail = /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/;
Phone Number Formats
Multiple formats:
# Python
import re
def validate_phone(phone):
    """Validate US phone number in various formats."""
    patterns = [
        r'^\d{3}-\d{3}-\d{4}$', # 555-123-4567
        r'^\(\d{3}\) \d{3}-\d{4}$', # (555) 123-4567
        r'^\d{10}$', # 5551234567
        r'^\+1-\d{3}-\d{3}-\d{4}$', # +1-555-123-4567
    ]
    return any(re.match(pattern, phone) for pattern in patterns)
# Test
print(validate_phone('555-123-4567')) # True
print(validate_phone('(555) 123-4567')) # True
print(validate_phone('5551234567')) # True
print(validate_phone('invalid')) # False
# Extract and normalize phone numbers
def extract_phone(text):
    """Extract phone number and normalize to XXX-XXX-XXXX format."""
    pattern = r'(?:\+1[-.]?)?\(?(\d{3})\)?[-. ]?(\d{3})[-. ]?(\d{4})'
    match = re.search(pattern, text)
    if match:
        return f'{match.group(1)}-{match.group(2)}-{match.group(3)}'
    return None
print(extract_phone('Call me at (555) 123-4567')) # '555-123-4567'
print(extract_phone('Phone: 555.123.4567')) # '555-123-4567'
URL Parsing
Extract URL components:
// JavaScript
const urlPattern = /^(https?):\/\/([^:\/\s]+)(?::(\d+))?(\/[^\s]*)?$/;
const url = 'https://example.com:8080/path/to/page?query=value';
const match = url.match(urlPattern);
if (match) {
  console.log('Protocol:', match[1]); // 'https'
  console.log('Domain:', match[2]); // 'example.com'
  console.log('Port:', match[3]); // '8080'
  console.log('Path:', match[4]); // '/path/to/page?query=value'
}
// Extract all URLs from text
const text = "Visit https://example.com or http://test.org for more info";
const urls = text.match(/https?:\/\/[^\s]+/g);
console.log(urls); // ['https://example.com', 'http://test.org']
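For comparison, the same URL pattern can be written in Python with named groups, so components are read by name rather than by index. This is a sketch only: the group names are illustrative choices, and for production parsing the standard library's urllib.parse.urlsplit is usually a safer option than a hand-written pattern.

```python
import re

URL_PATTERN = re.compile(
    r'^(?P<scheme>https?)://'  # http or https
    r'(?P<host>[^:/\s]+)'      # host: everything up to ":" or "/"
    r'(?::(?P<port>\d+))?'     # optional port
    r'(?P<path>/\S*)?$'        # optional path, including any query string
)

def split_url(url):
    """Return the URL components as a dict, or None if the URL does not match."""
    match = URL_PATTERN.match(url)
    return match.groupdict() if match else None

full = split_url('https://example.com:8080/path/to/page?query=value')
bare = split_url('http://test.org')

print(full['scheme'], full['host'], full['port'])  # https example.com 8080
print(full['path'])                                # /path/to/page?query=value
print(bare['port'], bare['path'])                  # None None
```

Groups that do not participate in the match, such as the port in the second call, come back as None.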
Data Extraction
Parse log files:
# Python
import re
from datetime import datetime
log_pattern = r'(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.*)'
log_lines = [
'2025-05-05 14:30:00 [INFO] Application started',
'2025-05-05 14:30:15 [ERROR] Database connection failed',
'2025-05-05 14:30:20 [WARN] Retrying connection',
]
for line in log_lines:
    match = re.match(log_pattern, line)
    if match:
        timestamp = datetime.strptime(match.group('timestamp'), '%Y-%m-%d %H:%M:%S')
        level = match.group('level')
        message = match.group('message')
        print(f'{level}: {message} at {timestamp}')
Extract data from HTML:
// C#
using System.Text.RegularExpressions;
// Extract all links from HTML
var html = @"
<a href='/home'>Home</a>
<a href='https://example.com'>Example</a>
<a href='/contact'>Contact</a>
";
var linkPattern = @"<a\s+href=['""]([^'""]+)['""]>([^<]+)</a>";
var matches = Regex.Matches(html, linkPattern);
foreach (Match match in matches)
{
    var url = match.Groups[1].Value;
    var text = match.Groups[2].Value;
    Console.WriteLine($"{text}: {url}");
}
// Output:
// Home: /home
// Example: https://example.com
// Contact: /contact
String Replacement
Find and replace:
// JavaScript
// Simple replacement
'hello world'.replace(/world/, 'JavaScript'); // 'hello JavaScript'
// Global replacement (all occurrences)
'foo bar foo'.replace(/foo/g, 'baz'); // 'baz bar baz'
// Case-insensitive replacement
'Hello WORLD'.replace(/world/gi, 'JavaScript'); // 'Hello JavaScript'
// Replacement with capturing groups
const date = '2025-05-15';
const formatted = date.replace(/(\d{4})-(\d{2})-(\d{2})/, '$2/$3/$1');
console.log(formatted); // '05/15/2025'
// Replacement with function
const text = 'Total: $100, Tax: $8, Shipping: $5';
const doubled = text.replace(/\$(\d+)/g, (match, amount) => {
  // Pass the radix explicitly so the amount is always parsed as base 10
  return '$' + (parseInt(amount, 10) * 2);
});
console.log(doubled); // 'Total: $200, Tax: $16, Shipping: $10'
Python substitution:
# Python
import re
# Simple substitution
re.sub(r'apple', 'orange', 'I like apple pie') # 'I like orange pie'
# Using captured groups
text = 'Name: John Doe, Age: 30'
# The pattern consumes the 'Name: ' prefix, so the replacement must restore it
result = re.sub(r'Name: (\w+) (\w+)', r'Name: \2, \1', text)
print(result) # 'Name: Doe, John, Age: 30'
# Substitution with function
def uppercase_match(match):
    return match.group().upper()
text = 'hello world from python'
result = re.sub(r'\b\w+\b', uppercase_match, text)
print(result) # 'HELLO WORLD FROM PYTHON'
# Remove HTML tags
html = '<p>Hello <b>world</b>!</p>'
clean = re.sub(r'<[^>]+>', '', html)
print(clean) # 'Hello world!'
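Python replacement strings can also refer to groups by name. The sketch below shows the \g<name> and \g<number> forms together with re.subn, which reports how many replacements were made; the sample strings are illustrative.

```python
import re

date_pattern = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

# \g<name> inserts the text of a named group.
us_style = date_pattern.sub(r'\g<month>/\g<day>/\g<year>', 'Due 2025-05-15')

# \g<1> is the unambiguous form of \1, needed when a literal digit follows
# (r'\10' would be read as a reference to group 10).
padded = re.sub(r'(\d+)', r'\g<1>0', 'scale 5')

# re.subn returns a (new_string, replacement_count) tuple.
cleaned, count = re.subn(r'\s+', ' ', 'too   many    spaces')

print(us_style)        # Due 05/15/2025
print(padded)          # scale 50
print(cleaned, count)  # too many spaces 2
```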
Language-Specific Features
JavaScript Flags
// i = case-insensitive
/hello/i.test('HELLO'); // true
// g = global (find all matches)
'foo bar foo'.match(/foo/g); // ['foo', 'foo']
// m = multiline (^ and $ match line boundaries)
const text = 'Line 1\nLine 2';
text.match(/^Line/gm); // ['Line', 'Line']
// s = dotAll (. matches newlines)
/hello.world/s.test('hello\nworld'); // true
// u = unicode
/\u{1F600}/u.test('😀'); // true
// y = sticky (matches at exact position)
const pattern = /foo/y;
pattern.lastIndex = 4;
pattern.test('foo foo'); // true (matches at position 4)
Python re Module
import re
# Compile pattern for reuse
pattern = re.compile(r'\d+')
pattern.findall('123 abc 456') # ['123', '456']
# Verbose mode (comments and whitespace ignored)
email_pattern = re.compile(r'''
[\w.-]+ # username
@ # at symbol
[\w.-]+ # domain
\. # dot
\w{2,} # TLD
''', re.VERBOSE)
# Methods (string and repl are placeholders for your own values)
re.search(pattern, string) # Find first match
re.match(pattern, string) # Match at start
re.findall(pattern, string) # Find all matches (list)
re.finditer(pattern, string) # Find all matches (iterator)
re.sub(pattern, repl, string) # Replace
re.split(pattern, string) # Split by pattern
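To make the differences between these calls concrete, here is a short runnable sketch with sample values filled in. It also includes re.fullmatch, which is not in the list above but is often the right choice for validation because it requires the whole string to match.

```python
import re

pattern = re.compile(r'\d+')
string = 'abc 123 def 456'

first = pattern.search(string)     # first match anywhere in the string
at_start = pattern.match(string)   # None: the string starts with 'abc'
whole = pattern.fullmatch('2025')  # the entire string must match

# finditer yields match objects, which carry positions as well as text.
spans = [(m.group(), m.start()) for m in pattern.finditer(string)]

replaced = pattern.sub('#', string)
pieces = re.split(r'\s*,\s*', 'a, b ,c')

print(first.group())  # 123
print(at_start)       # None
print(whole.group())  # 2025
print(spans)          # [('123', 4), ('456', 12)]
print(replaced)       # abc # def #
print(pieces)         # ['a', 'b', 'c']
```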
C# Regex Options
using System.Text.RegularExpressions;
// RegexOptions enumeration
var pattern = @"hello";
// Case-insensitive
Regex.IsMatch("HELLO", pattern, RegexOptions.IgnoreCase);
// Multiline
var text = "Line 1\nLine 2";
Regex.Matches(text, @"^Line", RegexOptions.Multiline);
// Compiled (faster for repeated use)
var compiled = new Regex(@"\d+", RegexOptions.Compiled);
// Timeout (prevent catastrophic backtracking)
var regex = new Regex(@"a+b+c+", RegexOptions.None, TimeSpan.FromSeconds(1));
Best Practices
- Start Simple: Begin with basic patterns, add complexity gradually
- Test Thoroughly: Use regex testers (regex101.com, regexr.com)
- Use Non-Capturing Groups: (?:) when you don't need to capture
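- Watch Greedy Quantifiers: Greedy .* can run past the nearest closing tag in HTML/XML; use a lazy quantifier (.*?) or a negated class ([^<]*) there
- Escape Metacharacters: Escape . $ ^ * + ? { } [ ] \ | ( ) whenever you need to match them literally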
- Comment Complex Patterns: Use verbose mode in Python, comments in code
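For the escaping advice in particular, Python can do the work when the literal text comes from a variable or from user input. The sketch below uses re.escape; the find_literal helper and the sample string are hypothetical, introduced only for illustration.

```python
import re

def find_literal(needle, haystack):
    """Find every literal occurrence of needle (hypothetical helper)."""
    return re.findall(re.escape(needle), haystack)

prices = 'Was $19.99, now $19x99?'

# Unescaped, "$" is an end-of-string anchor, so this pattern can never match here.
unescaped_dollar = re.findall('$19.99', prices)

# Unescaped, "." matches any character, so '19x99' is found as well.
unescaped_dot = re.findall('19.99', prices)

# Escaped, every metacharacter is treated as plain text.
escaped = find_literal('$19.99', prices)

print(unescaped_dollar)  # []
print(unescaped_dot)     # ['19.99', '19x99']
print(escaped)           # ['$19.99']
```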
Key Takeaways
- Character classes ([a-z], \d, \w) match specific character sets
- Quantifiers (*, +, ?, {n,m}) control repetition
- Anchors (^, $, \b) match positions, not characters
- Groups () capture submatches, (?:) groups without capturing
- Lookaheads/lookbehinds (?=, ?!, ?<=, ?<!) enable zero-width assertions
- Named groups improve readability and maintenance
Next Steps
- Learn atomic groups (?>...) for performance optimization
- Explore Unicode properties (\p{L}, \p{N}) for international text
- Master conditional patterns (?(condition)yes|no)
- Study catastrophic backtracking and prevention strategies
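As a first taste of conditional patterns, Python's re module supports the (?(group)yes|no) form. The sketch below follows the well-known example from the Python re documentation: an address may be wrapped in angle brackets, but only as a matching pair. The sample strings and variable names are illustrative.

```python
import re

# Group 1 optionally matches "<". The conditional (?(1)>|$) then requires ">"
# if group 1 matched, and the end of the string otherwise.
bracketed = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

samples = ['<user@host.com>', 'user@host.com', '<user@host.com', 'user@host.com>']
results = {s: bool(bracketed.match(s)) for s in samples}

for sample, ok in results.items():
    print(f'{sample!r}: {ok}')
# '<user@host.com>': True
# 'user@host.com': True
# '<user@host.com': False
# 'user@host.com>': False
```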
Match patterns, not headaches.