XSLT Reference
string-to-codepoints()
Returns a sequence of integers representing the Unicode codepoints of each character in a string.
string-to-codepoints(string)
Description
string-to-codepoints() decomposes a string into its individual Unicode characters and returns their integer codepoints as a sequence of xs:integer values. The sequence length equals the number of Unicode characters (codepoints) in the string, which may differ from the byte length in UTF-8 or UTF-16 encodings.
It is the inverse of codepoints-to-string() and enables character-level manipulation — inspecting, filtering, or transforming individual characters by their numeric values.
If the argument is an empty sequence or an empty string, the function returns an empty sequence.
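The behaviour described above can be checked with a small stylesheet. This is a minimal sketch, assuming an XSLT 2.0 processor and any well-formed input document; the element names `checks`, `round-trip`, `empty-string`, and `empty-sequence` are illustrative and not part of the function's definition.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <!-- Element names below are hypothetical, chosen for this illustration -->
    <checks>
      <!-- Decomposing and then recomposing should give back the original string -->
      <round-trip>
        <xsl:value-of select="codepoints-to-string(string-to-codepoints('Héllo')) eq 'Héllo'"/>
      </round-trip>
      <!-- An empty string yields an empty sequence, so the count is 0 -->
      <empty-string>
        <xsl:value-of select="count(string-to-codepoints(''))"/>
      </empty-string>
      <!-- An empty sequence argument also yields an empty sequence -->
      <empty-sequence>
        <xsl:value-of select="count(string-to-codepoints(()))"/>
      </empty-sequence>
    </checks>
  </xsl:template>
</xsl:stylesheet>
```

Under these assumptions, `round-trip` should contain `true` and both empty cases should contain `0`.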
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| string | xs:string? | Yes | The string to decompose into codepoints. |
Return value
xs:integer* — a sequence of Unicode codepoint integers, one per character.
Examples
Inspecting character codepoints
Stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <codepoints>
      <xsl:for-each select="string-to-codepoints('Hello!')">
        <cp value="{.}"/>
      </xsl:for-each>
    </codepoints>
  </xsl:template>
</xsl:stylesheet>
Output:
<codepoints>
  <cp value="72"/>
  <cp value="101"/>
  <cp value="108"/>
  <cp value="108"/>
  <cp value="111"/>
  <cp value="33"/>
</codepoints>
Detecting non-ASCII characters
Input XML:
<?xml version="1.0" encoding="UTF-8"?>
<texts>
  <text>Héllo Wörld</text>
  <text>Plain ASCII only</text>
</texts>
Stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/texts">
    <analysis>
      <xsl:for-each select="text">
        <xsl:variable name="cps" select="string-to-codepoints(.)"/>
        <text ascii-only="{if (every $cp in $cps satisfies $cp le 127) then 'yes' else 'no'}">
          <xsl:value-of select="."/>
        </text>
      </xsl:for-each>
    </analysis>
  </xsl:template>
</xsl:stylesheet>
Output:
<analysis>
  <text ascii-only="no">Héllo Wörld</text>
  <text ascii-only="yes">Plain ASCII only</text>
</analysis>
Notes
- Each item in the returned sequence is the codepoint of one Unicode character, not one byte. For multi-byte UTF-8 characters (e.g., © is 2 bytes) the function still returns one integer.
- Surrogate pairs as used in UTF-16 are presented as their actual codepoint (e.g., the U+1F600 emoji returns 128512, not two surrogate integers).
- Combining with codepoints-to-string() allows lossless character-by-character transformations.
- count(string-to-codepoints($s)) gives the number of Unicode characters, equivalent to string-length($s).
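
The notes above can be illustrated with one short stylesheet. This is a sketch under stated assumptions: an XSLT 2.0 processor, any well-formed input document, and illustrative element names (`result`, `codepoints`, `ascii`, `count`, `length`). The test string is written with character references so that é (U+00E9) and the emoji (U+1F600) are unambiguous.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <!-- Hypothetical test string: "Héllo" followed by the U+1F600 emoji -->
    <xsl:variable name="s" select="'H&#xE9;llo&#x1F600;'"/>
    <result>
      <!-- One integer per character, including the supplementary-plane emoji -->
      <codepoints>
        <xsl:value-of select="string-to-codepoints($s)"/>
      </codepoints>
      <!-- Keep only codepoints in the ASCII range, then rebuild the string -->
      <ascii>
        <xsl:value-of select="codepoints-to-string(string-to-codepoints($s)[. le 127])"/>
      </ascii>
      <!-- Both values count characters, not bytes or UTF-16 code units -->
      <count>
        <xsl:value-of select="count(string-to-codepoints($s))"/>
      </count>
      <length>
        <xsl:value-of select="string-length($s)"/>
      </length>
    </result>
  </xsl:template>
</xsl:stylesheet>
```

With a conforming processor, `codepoints` should contain `72 233 108 108 111 128512`, `ascii` should contain `Hllo`, and both `count` and `length` should contain `6`.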