XSLT Text Processing: Fun with Anagrams

After so many nice experiences with XSLT text processing: dictionary lookup, spelling checking and suggesting candidate words, text justifying. text search and replacement, concordance over large corpora, even parsing LR-1 languages, I almost had decided that there was nothing much left to do in this area.

These days, while some people, who don’t know what they are talking about are discussing in the XSL-List "Is XSLT dead?", it suddenly came to me that yes, there was something left: finding anagrams

It turns out that finding anagrams with XSLT is as easy and elegant as shown in the following code:

 

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:f="http://fxsl.sf.net/"
 >
 <xsl:import href="../f/func-qsort.xsl"/> 

 <xsl:key name="kAnagram" match="w
   use="codepoints-to-string(f:qsort(string-to-codepoints(.)))"
  />

 <xsl:template match="/">
   <xsl:copy-of select="key(‘kAnagram’, ‘acert’)"/>
 </xsl:template>

</xsl:stylesheet>  

Here, I am reusing the same 46379 English wordforms dictionary, I was using for the spelling checking tasks. In fact, the transformation is applied on it — the document dictEnglish.xml. Each word has its Anagram Key, which is the string of the sorted sequence of characters comprising this word. It is easy to see that words, which are anagrams to each other do have the same anagram key.

The Anagram Key of a word is specified as the following XPath expression:

  codepoints-to-string(f:qsort(string-to-codepoints(.)))

where string-to-codepoints()  and codepoints-to-string()  are standard XPath 2.0 functions and  f:qsort()  is a function from FXSL.

In the code above the English dictionary is indexed using as key (yes!) the Anagram Key for each word. Then we get all words that have the same Anagram Key as the word "trace". The result is:  

<

w>caret</w><w>cater</w><w>crate</w>
<
w>react</w><w>recta</w><w>trace</w>

While it is nice to be able to implement such functionality just with a few lines of code, it is not wise to index the huge English dictionary each time we need to get some anagrams. In fact this indexing operation takes a lot of time — in the example above it took about 280 seconds.

The next logical step is to persist the indexed document into an Anagram Dictionary and reuse it from here on. Here is the XSLT transformation, which creates the Anagram Dictionary:

<xsl:stylesheet version="2.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
 xmlns:xs="http://www.w3.org/2001/XMLSchema" 
 xmlns:f="http://fxsl.sf.net/" 
 exclude-result-prefixes="f xs" 
 > 
 <xsl:import href="../f/func-standardXSLTXpathFunctions.xsl"/>  

 <!– To be applied on dictEnglish.xml –>  

 <xsl:template match="/"> 
   <anagrams> 
     <xsl:for-each-group select="/*/w" group-by= 
       "codepoints-to-string(f:xsltSort(string-to-codepoints(.), 
                                        (f:_auxInt2Str()) 
                                       ) 
                             )"> 
          <xsl:sort select="current-grouping-key()"/>

          <xsl:if test="current-group()[2]"> 
              <aChain key="{current-grouping-key()}"> 
                <xsl:sequence select="current-group()"/> 
              </aChain> 
              <xsl:sequence select=" "/> 
           </xsl:if> 
         </xsl:for-each-group
   </anagrams> 
 </xsl:template>  

 <xsl:variable name="vZeroes" select="‘0000000000’
                              as="xs:string"/>

 <xsl:function name="f:_auxInt2Str" as="xs:string"> 
   <xsl:param name="arg1" as="xs:integer"/>   

   <xsl:variable name="vstrParm" as="xs:string" 
      select="xs:string($arg1)"/>      

   <xsl:sequence select= 
    "concat(substring($vZeroes,1, 
                      10 – string-length($vstrParm) 
                      ), 
            $vstrParm 
            )" 
    /> 
 </xsl:function>  

 <xsl:function name="f:_auxInt2Str" as="element()"> 
   <f:_auxInt2Str/> 
 </xsl:function>  

 <xsl:template match="f:_auxInt2Str" as="xs:string" 
   mode="f:FXSL"> 
   <xsl:param name="arg1" as="xs:integer"/>   

   <xsl:sequence select="f:_auxInt2Str($arg1)"/> 
 </xsl:template>

</xsl:stylesheet>

  

This creates an xml document with 2396 anagram sets (chains) and below is the start of this document:

 

<anagrams>

  <aChain key="‘aacorsttu"><w>actuator’s</w><w>autocrat’s</w></aChain>

  <aChain key="‘aainprsst"><w>aspirant’s</w><w>partisan’s</w></aChain>

  <aChain key="‘aalmnsu"><w>alumna’s</w><w>manual’s</w></aChain>

There longest chains consist of 8 anagrams. There are some pretty interesting ones, like:

<

aChain key="abimpst"><w>baptism</w><w>bitmaps</w></aChain>
<aChain key="abinost"><w>bastion</w><w>obtains</w></aChain>
<
aChain key="abinry"><w>binary</w><w>brainy</w></aChain>
<aChain key="acdeelrs"><w>declares</w><w>rescaled</w></aChain>
<
aChain key="acdeenrtu"><w>uncreated</w><w>unreacted</w></aChain>

<aChain key="acdeerrt"><w>cratered</w><w>retraced</w><w>terraced</w></aChain>
<
aChain key="acdeert"><w>catered</w><w>created</w><w>reacted</w></aChain>
<
aChain key="acdegln"><w>clanged</w><w>glanced</w></aChain>
<
aChain key="acdehmr"><w>charmed</w><w>marched</w></aChain>
<
aChain key="acdehs"><w>cashed</w><w>chased</w></aChain>
<aChain key="acdeilm"><w>claimed</w><w>decimal</w><w>medical</w></aChain>
<aChain key="acdeinotu"><w>auctioned</w><w>cautioned</w><w>education</w></aChain>



The code to get an anagram using the Anagram Dictionary now looks like this:

 

<xsl:stylesheet version="2.0" 
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
 xmlns:xs="http://www.w3.org/2001/XMLSchema" 
 xmlns:f="http://fxsl.sf.net/" 
 > 
 <xsl:import href="../f/func-qsort.xsl"/>  

 <!– To be applied on: dictAnagrams.xml –>  

 <xsl:key name="kAngrmChain" match="aChain
   use="@key" 
  />

 <xsl:template match="/"> 
   <xsl:sequence select="f:anagrams(‘trace’, .)"/> 
 </xsl:template>  

  <xsl:function name="f:anagrams" as="xs:string"> 
   <xsl:param name="pWord" as="xs:string"/> 
   <xsl:param name="pContext" as="node()"/>   

   <xsl:sequence select= 
     "key(‘kAngrmChain’,f:anagramKey($pWord),$pContext)"/> 
  </xsl:function>  

 <xsl:function name="f:anagramKey" as="xs:string"> 
   <xsl:param name="pWord" as="xs:string"/>   

   <xsl:sequence select= 
    "codepoints-to-string(f:qsort(string-to-codepoints($pWord)))"/> 
 </xsl:function>
</xsl:stylesheet>

And obtaining all anagrams of "trace" now takes only 78 milliseconds.

All files shown and discussed here can be accessed at the CVS of FXSL.  

This entry was posted in Uncategorized. Bookmark the permalink.

Leave a comment