(Better?) HTML to Text conversion

Roundcube has a class that handles HTML to text conversion, based on the old html2text class. It has improved over the years, but it still has its issues, e.g. table support is really poor. Can we do better?

First, let’s see where such conversion is used in Roundcube:

  • Displaying an HTML-only message when HTML preview is disabled,
  • Creating an HTML message (for the alternative text/plain part),
  • Switching the compose editor from HTML to text,
  • Creating a plain text version of an HTML signature,
  • Spellchecking HTML content,
  • In Kolab plugins it’s used to display event/task descriptions (as we do not yet support HTML for these).

Now, I feel our current conversion code is pretty good in general, but indeed table handling sucks. The problem is, I found no PHP library that provides decent table support, probably because the matter is not simple. Consider text alignment, column widths, colspan/rowspan, nested tables, CSS support, yeah! For example, someone noticed that the code for table handling alone is around 3000 lines in w3m (a well-known text-based HTML browser).

So, what options do we have? I think it is either “write the code yourself” or use a text-based HTML browser like w3m or lynx. I usually choose the simplest available solution first, so I investigated what I can get from text-based browsers. I tested w3m, links and lynx. They support tables much better than html2text. I did only simple tests (not only for tables) and chose lynx, as I had some issues with the others (these probably could be solved, but I didn’t want to spend much time on this).

Here’s the html_converter plugin that replaces Roundcube’s html2text converter with its own that uses lynx.
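The general idea of handing the conversion to lynx can be sketched like this. This is not the plugin's code (the plugin is PHP); it is a rough Python illustration, and the function names build_lynx_command and html_to_text, the default width and the timeout are assumptions of mine. It only relies on standard lynx options: -dump, -stdin, -force_html, -nolist, -width, -assume_charset and -display_charset.

```python
import subprocess


def build_lynx_command(width=75):
    """Build the lynx command line (hypothetical helper, not from the plugin)."""
    return [
        "lynx",
        "-dump",                    # print the rendered page and exit
        "-stdin",                   # read the document from standard input
        "-force_html",              # treat the input as HTML
        "-nolist",                  # do not append the list of links
        "-assume_charset=utf-8",    # input encoding (an assumption of this sketch)
        "-display_charset=utf-8",   # output encoding
        f"-width={width}",          # line width of the text output
    ]


def html_to_text(html, width=75, timeout=10):
    """Convert an HTML string to plain text by piping it through lynx.

    Requires lynx to be installed and available on PATH.
    """
    result = subprocess.run(
        build_lynx_command(width),
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
        check=True,  # raise CalledProcessError if lynx fails
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    sample = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Apple</td><td>3</td></tr></table>"
    print(html_to_text(sample))
```

Passing the command as a list (no shell) and feeding the HTML on standard input means no message content ever ends up on a command line, which matters when the input is untrusted mail.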
