Unicode email addresses under RFC 6530 discussion

Eleven years ago, in Core-31992, someone proposed adding support for non-US-ASCII email addresses to WordPress. The software world has changed considerably since then: internationalized domain names and paths are uniformly handled in browsers, email systems support the wide range of Unicode characters as raw UTF-8, and UTF-8 is the only recommended text encoding for interchange between systems. This means that people are free to use their own names when communicating with others, whether they are Jake, Klára, আরিয়া, അമൽ, or any other name containing letters outside the A-Z range. Unfortunately, WordPress has not kept up with these changes, and that’s what this post is all about.

This post is a request for comment on adding that support. There are a number of complications with potentially far-reaching implications.

TL;DR

  • WordPress’ email sanitization is based on US-ASCII characters and needs to be relaxed to allow for valid UTF-8, but this introduces new risks, including but not limited to: confusable characters, equivalence through normalization, and non-visible characters.
  • Sites whose databases cannot store full UTF-8 may fail to save valid email addresses. This could be confusing to the site owner and to people attempting to sign up on the site unless properly communicated.
  • Any code that assumes emails are encoded as single-byte US-ASCII will need updating, because until now it was an invariant that emails would not contain multi-byte Unicode characters. Filters may start seeing characters they believed were impossible to receive.
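
To make two of the risks in the first bullet concrete, here is a minimal Python sketch (Python is used purely for illustration; none of this is WordPress code). The addresses and the `invisible_characters()` helper are made up for the example, and checking only the Unicode "Cf" (format) category is a simplifying assumption, not a complete test for non-visible characters.

```python
import unicodedata

# Two addresses that look the same in most fonts (hypothetical examples).
latin = "jake@example.com"
cyrillic = "j\u0430ke@example.com"  # U+0430 CYRILLIC SMALL LETTER A


def invisible_characters(address):
    """Return the names of format characters (category Cf) in the address."""
    return [
        unicodedata.name(ch, hex(ord(ch)))
        for ch in address
        if unicodedata.category(ch) == "Cf"
    ]


# Confusable characters: visually alike, but different code points.
print(latin == cyrillic)              # False
print(unicodedata.name(cyrillic[1]))  # CYRILLIC SMALL LETTER A

# Non-visible characters: a zero-width space hides inside the local part.
print(invisible_characters("jake\u200b@example.com"))  # ['ZERO WIDTH SPACE']
```

A real check would likely need the Unicode confusables data rather than a single category test.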

If you have experience with email issues, deploy email services, or know about certain critical aspects of this proposal, please share your thoughts here or in Core-31992.

Unicode in email addresses was historically more complicated.

When email sprang up, servers were passing US-ASCII as a 7-bit encoding. The need to send text with characters beyond that range appeared shortly afterwards, and MIME text encoding was standardized in RFC 2047. This is what WordPress refers to in its wp_iso_descrambler() function: a mechanism for transmitting non-ASCII characters using only ASCII bytes. Critically, it only applied to certain headers and could not be applied to email addresses.

The funny-looking string below indicates that it is encoded…
  • with the ISO-8859-2 character set.
  • using the "Q" (quoted-printable-like) form, with hex escapes for non-ASCII bytes.

=?ISO-8859-2?Q?=A3=F3d=BC?=

It encodes the Latin-2 (ISO-8859-2) string "Łódź".
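
As a quick check of the example above, the encoded word can be decoded with Python's standard `email.header` module. This is only an illustrative sketch in Python; it is not how wp_iso_descrambler() is implemented.

```python
from email.header import decode_header

encoded = "=?ISO-8859-2?Q?=A3=F3d=BC?="

# decode_header() returns (payload, charset) pairs. For an encoded word the
# payload is the raw bytes, still in the declared character set.
raw, charset = decode_header(encoded)[0]

# Interpreting those bytes in the declared charset recovers the text.
text = raw.decode(charset)
print(text)  # Łódź
```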

While MIME encoding alleviated the problem of sending non-English content, it did nothing to remove the need for people to romanize or ASCIIize their names and institutions. Punycode opened the door for internationalized domain names, again by encoding non-US-ASCII bytes through all-ASCII characters, but this applied only to domain names and remained unrecognizable when not parsed.

This indecipherable string encodes instructions for a state machine which, when decoded, produces the original Unicode text.

xn--l8je6s7a45b.com

It encodes the Japanese domain "あーるいん.com"
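
The same round trip can be sketched with Python's built-in "idna" codec. One caveat: that codec implements the older IDNA 2003 rules, which is enough for this illustration but is an assumption to revisit for production use.

```python
ascii_domain = "xn--l8je6s7a45b.com"

# Each "xn--" label is Punycode; the idna codec decodes label by label.
unicode_domain = ascii_domain.encode("ascii").decode("idna")
print(unicode_domain)  # あーるいん.com

# Encoding the Unicode form again yields the all-ASCII form.
round_trip = unicode_domain.encode("idna").decode("ascii")
print(round_trip)  # xn--l8je6s7a45b.com
```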

As protocols gained more functionality for unescaped UTF-8, such as in IMAP’s UTF-8 extension, more and more servers started allowing non-US-ASCII bytes as long as they were valid UTF-8. Even so, this did not change the situation for email addresses, because the old restrictions on address headers still applied.

Eventually, major email providers started allowing and passing valid UTF-8 sequences as email addresses, making them a de-facto supported feature. A comprehensive take is standardized in RFC 6530. See last year’s talk at FOSDEM for more information.

What is the proposal for WordPress?

Allow storing Unicode email addresses. (Core-31992)

Functions like is_email(), sanitize_email() and antispambot() need to be extended to support non-ASCII addresses. PHPMailer updates in WordPress 6.9 already made it possible for WordPress to send to Unicode addresses, but it’s not possible for users to use or store them on their account.

PR#5237 unlocks saving Unicode email addresses by modifying these functions, as long as the database permits it. Its validation is locked to the behaviors of <input type=email> elements to ensure compatibility with the browser and a predictable experience.

Back in April, during WordCamp Vienna, geoTLD.group and ICANN sponsored a contributor challenge to work on this very problem. @agulbra, @akirk, @benniledl, and @dmsnell worked on it together and proposed a new WP_Email_Address class which can parse email addresses and return the decoded local and domain parts. This class is then used by a filter to replace the decisions from is_email() and sanitize_email() with their new counterparts: wp_is_unicode_email() and wp_sanitize_unicode_email(). This approach provides a path for interoperability with modern standards while preserving the ability to maintain the legacy behaviors, and it provides a helpful new class for structurally working with email addresses in various forms and places.
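
To illustrate what "return the decoded local and domain parts" could mean, here is a rough Python sketch. It is not the WP_Email_Address implementation: `split_address()` is a hypothetical helper, it performs no real validation, and it relies on Python's IDNA 2003 codec to decode Punycode labels.

```python
from typing import Tuple


def split_address(address: str) -> Tuple[str, str]:
    """Split at the last "@" and decode any Punycode labels in the domain."""
    local, at, domain = address.rpartition("@")
    if not (at and local and domain):
        raise ValueError(f"not an email address: {address!r}")
    # An all-ASCII domain may carry "xn--" labels; decode them for display.
    if domain.isascii():
        domain = domain.encode("ascii").decode("idna")
    return local, domain


print(split_address("kl\u00e1ra@xn--l8je6s7a45b.com"))  # ('klára', 'あーるいん.com')
print(split_address("noreply@mysite.in"))               # ('noreply', 'mysite.in')
```

Keeping the two parts separate matters because the domain has an all-ASCII fallback form (Punycode) while the local part does not.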

While Unicode email addresses should be supported, it’s still necessary to be able to apply legacy restrictions in some cases, such as for WordPress’ own sender address/RETURN FROM address, which must remain US-ASCII-only (see footnote 1). This proposal is exclusively about supporting Unicode email addresses for WordPress user accounts.

What could go wrong with storing Unicode email addresses?

If the database or site doesn’t support UTF-8 then there is a problem, because there is no guarantee that the email addresses will be able to be stored and retrieved without corruption. The linked pull request includes a new filter which restricts Unicode email support to sites with utf8mb4 databases. That’s a solid and simple restriction that nevertheless allows the overwhelming majority of sites to support the addresses. But this restriction needs to be communicated to site owners in a clear way.
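
The storage problem can be approximated in Python by using text codecs as stand-ins for database character sets. This is a sketch under assumptions: both helpers are hypothetical, the pull request's actual filter is not shown here, and a real database may truncate, substitute, or reject data depending on its configuration. The three-byte check mirrors MySQL's legacy utf8 (utf8mb3) character set, which stores at most three bytes per character.

```python
def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text exists in the given charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def fits_in_three_byte_utf8(text: str) -> bool:
    """Return True if no character needs more than three UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# A Latin-1 column copes with "á" but not with Bengali letters.
print(can_encode("kl\u00e1ra@example.com", "latin-1"))  # True
print(can_encode("আরিয়া@example.com", "latin-1"))        # False

# Three-byte UTF-8 handles Bengali, but not characters beyond U+FFFF.
print(fits_in_three_byte_utf8("আরিয়া@example.com"))          # True
print(fits_in_three_byte_utf8("\U00010348@example.com"))  # False
```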

Existing filters and plugin or theme code expecting all-US-ASCII email addresses might start receiving data that was never expected. Things as simple as calls to strlen() will return byte counts rather than character counts when applied to UTF-8 strings containing multi-byte characters, and validation and sanitization scripts need to be aware of the changes. For example, antispambot() needs updating because it assumes every byte is representable as a hex escape sequence, which is not the case for multi-byte strings. Further, Unicode normalization means that two essentially equivalent strings may be treated as distinct by PHP, and various functions need to agree on how to handle these to avoid conflating addresses.
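
These three pitfalls can be reproduced in a few lines of Python, used here only as a stand-in for the PHP behavior described above. The address is a made-up example, and the `entities()` helper is a hypothetical per-value escaper for illustration, not the code in antispambot().

```python
import html
import unicodedata

address = "kl\u00e1ra@example.com"  # "klára", with a precomposed á

# 1. Byte length vs. character length (PHP's strlen() counts bytes).
print(len(address))                  # 17
print(len(address.encode("utf-8")))  # 18


# 2. Escaping byte by byte splits "á" into two unrelated characters.
def entities(values):
    """Render each integer as a decimal HTML character reference."""
    return "".join(f"&#{value};" for value in values)


per_byte = entities(address.encode("utf-8"))
per_char = entities(ord(ch) for ch in address)
print(html.unescape(per_byte))  # klÃ¡ra@example.com
print(html.unescape(per_char))  # klára@example.com

# 3. Normalization: same visible text, different code point sequences.
composed = unicodedata.normalize("NFC", address)
decomposed = unicodedata.normalize("NFD", address)
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

Normalizing to a single form before comparing or storing is one way for different functions to agree on which addresses are the same.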

Summary

The task of adding full Unicode support to identifiers in WordPress is worthwhile, despite being a broad and fuzzy challenge.

  • WordPress can start parsing addresses on supporting sites using modern standards.
  • Plugins can disable the modern email parsing.
  • An audit of Core and plugins is necessary to uncover where assumptions about US-ASCII email characters will be broken when WordPress starts allowing Unicode email addresses.
  • Your feedback will help make this process smooth and successful.

Props to Dennis Snell for help with this blog posting, as well as to Manuel Camargo, Dovid Levine, Tushar Bharti, Mukesh Panchal, and Dennis for help with the code.

  1. Sender addresses may use non-US-ASCII characters as an email alias, but the actual address portion should remain US-ASCII compatible, for example From: "मेरी साइट" <noreply@mysite.in>, which most software displays as From: मेरी साइट.

#charset, #email, #unicode

Fetched May 22, 2026