Introducing libtld: A Lightweight TLD Parsing Library
Parsing domain names correctly is harder than it looks. Public suffix lists, internationalized domain names, and edge cases like co.uk or city.kawasaki.jp make naïve splitting brittle. libtld is a small, dependency-light library designed to reliably identify top-level domains (TLDs) and extract registrable domains and subdomains with minimal configuration and runtime cost.
Why libtld exists
- Correctness: Handles public suffixes (including multi-label TLDs) so you can determine the registrable domain (e.g., example.co.uk → example.co.uk; sub.example.co.uk → example.co.uk).
- Lightweight: Minimal dependencies and a small binary/VM footprint — suitable for client-side, edge, and embedded environments.
- Performance: Optimized lookups with a compact in-memory representation of the suffix list for fast parsing at scale.
- Ease of use: Straightforward API for common tasks: detect TLD, get registrable domain, split subdomain, validate hostnames, and normalize internationalized domains (IDNA).
Core features
- Public Suffix Support: Uses a compacted copy of the Public Suffix List (PSL), with tooling to update it safely.
- IDN/Unicode support: Converts and normalizes Unicode domain labels to/from ACE (Punycode) so libtld works with internationalized domains.
- Registrable domain extraction: Returns the shortest domain that can be registered, that is, the public suffix plus one more label (the “effective TLD plus one”, or eTLD+1).
- Subdomain splitting: Separates subdomain(s) from registrable domain reliably.
- Validation utilities: Quick checks for syntactic validity of hostnames and disallowed characters.
- Configurable updates: Optionally refresh the suffix data from a trusted source, or load a frozen snapshot for deterministic builds.
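The validation utilities can be pictured with a minimal sketch in Python. This is not libtld's code: the name is_valid_hostname and the exact rules (letter-digit-hyphen labels of 1 to 63 characters, at most 253 characters overall) are assumptions based on the usual hostname conventions.

```python
import re

# One label: 1-63 letters, digits, or hyphens, with no leading or trailing hyphen.
_LABEL_RE = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(hostname: str) -> bool:
    """Syntactic check only; it says nothing about whether the name resolves."""
    if hostname.endswith("."):
        hostname = hostname[:-1]  # tolerate a single trailing root dot
    if not hostname or len(hostname) > 253:
        return False
    # Every dot-separated label must match the pattern in full.
    return all(_LABEL_RE.fullmatch(label) for label in hostname.split("."))
```

A check like this is meant to run on the ACE (Punycode) form of a name, so Unicode input would need to be converted first.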
Typical API (conceptual)
- parse(hostname) → { tld, sld, registrable, subdomain, isValid, punycode }
- getRegistrableDomain(hostname) → string | null
- isPublicSuffix(label) → boolean
- normalize(hostname) → normalizedHostname
Example (pseudocode):
result = libtld.parse("mail.shop.example.co.uk")
result.registrable  // "example.co.uk"
result.subdomain    // "mail.shop"
result.tld          // "co.uk"
Implementation highlights
- Trie-based lookup: A compact trie or radix tree stores public suffix rules for O(L) lookup (L = label count).
- Rule precedence: Correctly applies exceptions and wildcard rules from the PSL.
- Memory vs. speed tradeoffs: Provides presets (tiny, default, full) so you can choose between minimal memory and maximal coverage.
- Safe updates: Update tooling validates and compacts PSL updates into a deterministic artifact to avoid runtime parsing overhead.
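To make the trie lookup and rule precedence concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not libtld's implementation: the names Node, build_trie, public_suffix_length, and registrable_domain are hypothetical, and SAMPLE_RULES is a handful of rules in PSL syntax picked for the example. It follows the published PSL matching rules: an exception rule wins, otherwise the longest match wins, and a name matching no rule falls back to an implicit "*".

```python
from typing import Dict, List, Optional

# A handful of rules in PSL syntax, chosen for illustration only.
SAMPLE_RULES = ["com", "uk", "co.uk", "jp", "*.kawasaki.jp", "!city.kawasaki.jp"]


class Node:
    def __init__(self) -> None:
        self.children: Dict[str, "Node"] = {}
        self.is_rule = False
        self.is_exception = False


def build_trie(rules: List[str]) -> Node:
    """Store each rule label by label, right to left (jp -> kawasaki -> *)."""
    root = Node()
    for rule in rules:
        exception = rule.startswith("!")
        labels = (rule[1:] if exception else rule).split(".")
        node = root
        for label in reversed(labels):
            node = node.children.setdefault(label, Node())
        node.is_rule = True
        node.is_exception = exception
    return root


def public_suffix_length(root: Node, labels: List[str]) -> int:
    """Return how many trailing labels form the public suffix."""
    best = 1  # implicit "*" rule when nothing else matches
    node = root
    for depth, label in enumerate(reversed(labels), start=1):
        if "*" in node.children:
            best = max(best, depth)  # a wildcard matches any single label here
        child = node.children.get(label)
        if child is None:
            break
        if child.is_exception:
            return depth - 1  # exception: the rule minus its leftmost label
        if child.is_rule:
            best = max(best, depth)
        node = child
    return best


def registrable_domain(root: Node, hostname: str) -> Optional[str]:
    """Public suffix plus one label, or None if the name is itself a suffix."""
    labels = hostname.lower().rstrip(".").split(".")
    n = public_suffix_length(root, labels)
    if len(labels) <= n:
        return None
    return ".".join(labels[-(n + 1):])


TRIE = build_trie(SAMPLE_RULES)
print(registrable_domain(TRIE, "mail.shop.example.co.uk"))  # example.co.uk
print(registrable_domain(TRIE, "www.city.kawasaki.jp"))     # city.kawasaki.jp
```

The walk visits at most one node per label, which is where the O(L) bound comes from. A production build would serialize the trie into a compact artifact rather than rebuild it at startup.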
Use cases
- Cookie scoping and security: ensure cookies are not set at public suffixes.
- Analytics and reporting: aggregate traffic by registrable domain.
- Security tooling: flag suspicious subdomain patterns, and use IDNA normalization to surface look-alike (homoglyph) labels for further checks.
- URL normalization and canonicalization in crawlers and search engines.
- Client-side libraries and edge functions where small binary size matters.
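As an illustration of the cookie-scoping use case, the following Python sketch rejects a cookie Domain attribute that is a public suffix or that does not cover the requesting host. The function names are hypothetical, and PUBLIC_SUFFIXES is a tiny hand-picked stand-in for a real suffix lookup, so any suffix missing from it would be wrongly allowed.

```python
# Stand-in for a real PSL lookup: a tiny, hand-picked set of public suffixes.
PUBLIC_SUFFIXES = {"com", "uk", "co.uk"}


def is_public_suffix(domain: str) -> bool:
    return domain.lower().rstrip(".") in PUBLIC_SUFFIXES


def cookie_domain_allowed(request_host: str, cookie_domain: str) -> bool:
    """True if a cookie with this Domain attribute may be set by request_host."""
    host = request_host.lower().rstrip(".")
    domain = cookie_domain.lower().strip(".")  # a leading dot is ignored
    if not domain or is_public_suffix(domain):
        return False  # never scope a cookie to an entire public suffix
    # Domain-match: identical, or the host is a subdomain of the cookie domain.
    return host == domain or host.endswith("." + domain)
```

Browsers following RFC 6265 treat one case differently: when the Domain attribute is a public suffix identical to the request host, the cookie is kept as host-only. The sketch simply rejects it.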
Best practices
- Prefer frozen snapshots in build artifacts for deterministic behavior; schedule periodic updates in CI.
- Normalize hostnames to ACE (Punycode) before parsing when handling user input.
- Use the “tiny” preset in constrained environments, and “full” on servers that need maximum coverage.
- Combine libtld with domain reputation or WHOIS lookups when making security-critical decisions.
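The advice to normalize to ACE before parsing can be sketched with Python's built-in "idna" codec. This is an assumption-laden illustration rather than libtld's own normalizer: to_ace is a hypothetical helper, and the built-in codec implements the older IDNA 2003 rules, which differ from IDNA 2008 for some characters.

```python
from typing import Optional


def to_ace(hostname: str) -> Optional[str]:
    """Lowercase a hostname and convert Unicode labels to Punycode (ACE)."""
    # Lowercase explicitly: the codec leaves pure-ASCII labels untouched.
    name = hostname.strip().rstrip(".").lower()
    if not name:
        return None
    try:
        return name.encode("idna").decode("ascii")
    except UnicodeError:
        return None  # empty or over-long label, or a disallowed character


print(to_ace("Bücher.Example"))  # xn--bcher-kva.example
```

Returning None instead of raising keeps untrusted input from turning into an exception path. Callers that need IDNA 2008 behavior would have to swap in a different converter.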
Getting started
- Install via your package manager (for example npm, pip, or Cargo), or use the single-file drop-in for browsers.
- Load the appropriate suffix preset for your environment.
- Call getRegistrableDomain() to derive the domain for grouping, or parse() for full splitting.
Conclusion
libtld fills a focused but essential role: reliably determining TLDs and registrable domains without heavy dependencies or runtime cost. Its small footprint, PSL-correct behavior, and IDN support make it a practical choice for developers building web tooling, analytics, security, and edge applications that need dependable domain parsing.