Tuesday, February 26, 2013

Converting XeLaTeX into ODT or MS Word

TeX4ht can do much of the work of converting LaTeX to a word-processor format.  But once UTF-8 characters, multiple scripts, and XeLaTeX enter the picture, things get complicated.

C. V. Radhakrishnan today pointed me to this discussion on the TeX4ht mailing list:
What Radhakrishnan says is:
As far as I understand, TeX4ht won't support fontspec or XeLaTeX
technologies of using system fonts that do not have *.tfm's. In effect, by
adopting TeX4ht, one is likely to lose the features brought in by XeTeX.
However, here is another approach.

   1. We translate all the Unicode character representations in the
   document to Unicode code points in 7bit ascii which is very much palatable
   to TeX4ht. A simple perl script, utf2ent.pl in the attached archive does
   the job.
   2. We run TeX4ht on the output of step 1.
   3. Open the *html in a browser, I believe, we get what you wanted. See
   the attached screen shot as it appeared in Firefox in my Linux box.

Here is what I did with your specimen document.

   1. commented out lines that related to fontspec package from your
   sources named as alex.tex.
   2. added four lines of macro code to digest the converted TeX sources
   3. ran the command: perl utf2ent.pl alex.tex > alex-ent.tex
   4. ran the command: htlatex alex-ent "xhtml,charset=utf-8,fn-in" -utf8
   (fn-in option is to keep the footnotes in the same document). I have used a
   local bib file, mn.bib as I didn't have your bib database. biber was also
   run in the meantime to process the bibliography database.
   5. open the output, alex-ent.html in a browser. I got it as you see in
   the attached alex.png.
Radhakrishnan's Perl script, utf2ent.pl, is:
#!/usr/bin/perl

use strict;
use warnings;

for my $file ( @ARGV ){
  open my $fh, '<:utf8', $file or die "cannot open file: $file";
  while( <$fh> ){
      s/([\x7f-\x{ffffff}])/'\\entity{'.ord($1).'}'/ge;
      print;
  }
}
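
To make the substitution concrete, here is a minimal Python sketch of the same idea, not Radhakrishnan's code. The function name to_entities is my own invention for illustration. It assumes, as the Perl regex does, that every character from U+007F upward should become \entity{N}, where N is the decimal code point.

```python
import re

# Every character from U+007F upward, mirroring the Perl class [\x7f-\x{ffffff}]
# (capped here at U+10FFFF, the highest Unicode code point).
NON_ASCII = re.compile(r"[\x7f-\U0010ffff]")


def to_entities(text: str) -> str:
    """Replace each matched character with \\entity{<decimal code point>}.

    Hypothetical helper for illustration; plain ASCII passes through unchanged.
    """
    return NON_ASCII.sub(lambda m: "\\entity{%d}" % ord(m.group(0)), text)


if __name__ == "__main__":
    # U+00EF (i with diaeresis) is 239 and U+03B1 (Greek alpha) is 945.
    print(to_entities("na\u00efve \u03b1"))  # prints: na\entity{239}ve \entity{945}
```

The output is pure 7-bit ASCII, which is the point of step 1 in the email: TeX4ht only ever sees \entity{...} calls, and the macro lines added to the TeX source (not reproduced in the email) are what turn them back into characters.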


For Radhakrishnan's continuing comments on TeX4ht development, see
TeX4ht's homepage: