Efficient Recursive Levenshtein (Edit) Distance Algorithm

2016-04-14 by occen

Efficient Recursive Levenshtein (Edit) Distance Algorithm

Background

Levenshtein Distance is a way to ascribe a numeric distance between two sequences (often characters in a string or word) by counting the minimum number of insertion, deletion and substitution operations required to transform one sequence to the other.

As documented in Wikipedia (and elsewhere) there is an elegant recursive equation to determine the Levenshtein distance as follows:

If you think of transforming from sequence ‘a’ to sequence ‘b’ the first recursion represents a deletion from the ‘a’ sequence. The second recursion represents an insertion into the ‘b’ sequence, and the last recursion is either a substitution or a match (and skip over) depending on the equality check. The length is incremented in all cases except where there is a match.

When one sequence is exhausted (i.e. length=0) the length of the remaining sequence is added to the result (as either deletions or insertions depending which sequence is left).

Basic Computer Algorithm

The mathematical equation is fairly easily translated into a recursive computing function. The Clojure equivalent if translated literally would be something like:

(defn ld
  "Calculate the Levenshtein distance - basic literal code."
  [va vb i j]
  (cond
    (zero? (min i j)) (max i j)
    :else (let [k (if (= (va i) (vb j)) 0 1)]
            (min (+ (ld va vb (dec i) j) 1)
                 (+ (ld va vb i (dec j)) 1)
                 (+ (ld va vb (dec i) (dec j)) k)))))

(ld (vec " lawn") (vec " flaw") 4 4)  ; => 2

Literal translation of the Levenshtein equation

Note that each sequence is padded with an initial dummy character (blank in this case) to account for the difference between the one-based indexing used in the mathematical equation and the zero-based indexing used in computing. Without this adjustment the indexing gets messy and obscures the relationship with the equation.

A simpler and more idiomatic approach is to work directly with the sequences and avoid the indexing completely with code like this:

(defn ld
  "Calculate the Levenshtein distance - basic idiomatic code."
  [va vb]
  (cond
    (empty? va) (count vb)
    (empty? vb) (count va)
    :else (let [k (if (= (peek va) (peek vb)) 0 1)]
            (min (+ (ld (pop va) vb) 1)
                 (+ (ld va (pop vb)) 1)
                 (+ (ld (pop va) (pop vb)) k)))))

(ld (vec "lawn") (vec "flaw"))  ; => 2

Idiomatic (Clojure) representation of the Levenshtein equation

Both these implementations work and it’s not hard to find many examples of them on the Web (e.g. Rosetta Code) but they scale very poorly.

The solution tree grows by a factor of three for each level and the number of levels varies between the length of the shortest sequence and the sum of the lengths of both sequences. The function has a computing complexity of O(3ⁿ) which is essentially useless for anything but the shortest strings.

Some versions of this code employ ‘memorisation’ techniques in an attempt to improve performance. This helps because many nodes in the solution tree are duplicates which would otherwise be calculated over and over again. However memorisation introduces state and detracts for the functional purity of the solution. It also doesn’t address the underlying scaling problem.

Another approach is to use an iterative method based on a matrix representation. This is much more efficient (see below) but it masks the elegance of the original function.

Can we improve performance while maintaining the pure functional style of the equation? Two areas for improvement suggest themselves.

The first is the fact that as you recurse down the solution tree, the distance returned never reduces (i.e. it only gets larger or remains the same). So processing can stop once the result exceeds a previously calculated minimum.

The second is to note that two of the three recursion function calls use the third recursion result as input one recursion later. By passing this result down to the next recursive level, it doesn’t need to be recalculated.

We will look at these two optimisation methods below.

Upper Bound Optimisation

The idea is to pass an upper bound to the lower levels so that further processing can be avoided once that bound is reached. This is similar (but simpler) to alpha-beta pruning in two-person game trees. In this case, we need only a single alpha (or beta) value. It turns out that this approach provides significant pruning of the solution tree while vastly improves performance.

First let’s restructure the code above to pass a ‘running’ result down the tree. The new code looks like this:

(defn ld
  "Calculate the Levenshtein distance - basic with running result."
  [va vb res]
  (cond
    (empty? va) (+ res (count vb))
    (empty? vb) (+ res (count va))
    :else (let [k (if (= (peek va) (peek vb)) 0 1)]
            (min (ld (pop va) vb (+ res 1))
                 (ld va (pop vb) (+ res 1))
                 (ld (pop va) (pop vb) (+ res k))))))

(ld (vec "lawn") (vec "flaw") 0)  ; => 2

Idiomatic Levenshtein distance with a ‘running’ result

Here the initial call needs to pass the starting length of zero. This code doesn’t improve performance but it allows the result to be tested prior to further calls to lower levels.

Because the lower level calls can only ever increase (or keep the same) length, we can stop processing when the result equals (or exceeds) the current known minimum.

The following code introduces a fourth argument which holds the current known minimum. This value is updated after each branch is processed to let the next branch know the current best (minimum) value. Also the ‘substitution and match’ branch has been evaluated first because this is the only branch that can return a zero length increment which makes the ‘pruning’ minimum value more likely to be the lowest. This is equivalent to the heuristics applied is alpha-beta pruning where the player’s moves should be ordered with the likely strongest occurring first as this improves the pruning efficiency.

(defn ld
  "Calculate the Levenshtein distance - upper bound constraint."
  [va vb res minval]
  (cond
    (>= res minval) minval
    (empty? va) (+ res (count vb))
    (empty? vb) (+ res (count va))
    :else (let [k (if (= (peek va) (peek vb)) 0 1)
                r1 (ld (pop va) (pop vb) (+ res k) minval)
                r2 (ld va (pop vb) (+ res 1) (min r1 minval))
                r3 (ld (pop va) vb (+ res 1) (min r1 r2 minval))]
            (min r1 r2 r3))))

(ld (vec "lawn") (vec "flaw") 0 10000)  ; => 2

Levenshtein distance with an upper bound

The code introduces another termination case that tests whether the result (res) equals (or exceeds) the current best minimum (minval). If so, no further processing is performed as it will never improve on the current known minimum. Otherwise the next branch is evaluated with the lowest upper bound from the previous branches being passed on as the new upper bound.

Slightly more polished version

By taking advantage of Clojure’s multiple arity function definitions, we can simplify the call to the function by making the 2-arity function call the 4-arity version behind the scenes. This doesn’t change the functionality at all, it just hides some of the plumbing. The revised version is shown below.

(defn ld
  "Calculate the Levenshtein distance - upper bound constraint."
  ; ------ 2-arity
  ([seqa seqb]
   (let [minval (max (count seqa) (count seqb))]
     (ld (vec seqa) (vec seqb) 0 minval)))
  ; ------ 4-arity
  ([seqa seqb res minval]
   (cond
     (>= res minval) minval
     (empty? seqa) (+ res (count seqb))
     (empty? seqb) (+ res (count seqa))
     :else (let [k (if (= (peek seqa) (peek seqb)) 0 1)
                 r1 (ld (pop seqa) (pop seqb) (+ res k) minval)
                 r2 (ld seqa (pop seqb) (+ res 1) (min r1 minval))
                 r3 (ld (pop seqa) seqb (+ res 1) (min r1 r2 minval))]
             (min r1 r2 r3)))))

(ld "lawn" "flaw")  ; => 2

Upper bound Levenshtein distance

Note that we can set the minval initial value to the longest sequence as this is the worst case length (requiring every character to be replaced).

Performance Improvements

The table below shows the number of nodes processed by the original and bounded versions of the code for some sample sequences of increasing length. Indicative times are also provided but these are not rigorous benchmarks.

	Original		Bound
Sample Type	nodes/sample	msec/sample	nodes/sample	msec/sample
lawn->flaw	1	0.10	13	0.01
Mixed strings (len=5)	841	0.49	32	0.01
Mixed strings (len=6)	4,494	2.51	64	0.02
Mixed strings (len=7)	24,319	13.48	134	0.03
Mixeds strings (len=8)	132,864	73.08	254	0.06
abcdefghi->123456789	731,281	448.00	9841	2.35
a12345678->123456789	731,281	418.00	74	0.02
levenshtein->einstein	1,242,912	763.00	264	0.07
levenshtein->meilenstein	22,523,360	12,992.00	365	0.10

It is clear that the bounded version is a substantial improvement over the original version. This is most obvious as the length of the strings increase. However, the bounded version itself can fluctuate significantly depending on the input values as seen by the two highlighted lines in the table above. Here the input strings are the same length but the nodes processed vary by a factor of over 100.

Reuse of the ‘substitution’ value

The second optimisation approach is revealed by considering the following diagram showing two levels of recursion for the original Levenshtein equation.

The (i-1,j-1) term, also known at the substitution term, in the first recursion is also used by the other two terms at the next level down. So some additional optimisation can be achieved by preserving this value and passing it down to the next level. As only four new terms are calculated, instead of six, there should be a modest but noticeable improvement.

The modified code adds two new parameters that represent the ‘left’ (i-1,j) and ‘right’ (i,j-1) substitution value previously calculated. The code starts to get a little more opaque but the main theme is still relatively clear.

(defn ld
  "Calculate the Levenshtein distance - upper bound constraint
  and reuse of the 'substitute' term value."
  ([seqa seqb]
   (let [minval (max (count seqa) (count seqb))]
     (ld (vec seqa) (vec seqb) 0 minval nil nil)))
  ;
  ([va vb res minval r2 r3]
  (cond
    (>= res minval) minval
    (empty? va) (+ res (count vb))
    (empty? vb) (+ res (count va))
    :else
    (let [k (if (= (peek va) (peek vb)) 0 1)
          r1 (ld (pop va) (pop vb) (+ res k) minval nil nil)
          r2 (if r2 r2 (ld (pop va) vb (+ res 1) (min r1 minval) nil r1))
          r3 (if r3 r3 (ld va (pop vb) (+ res 1) (min r1 r2 minval) r1 nil))]
      (min r1 r2 r3)))))

(ld "lawn" "flaw")  ; => 2

Performance

The table below compares the bounded, and the bounded and reuse versions.

	Bound		Bound and reuse
Sample Type	nodes/sample	msec/sample	nodes/sample	msec/sample
lawn->flaw	13	0.01	13	0.01
Mixed strings (len=5)	32	0.02	24	0.01
Mixed strings (len=6)	64	0.02	42	0.01
Mixed strings (len=7)	134	0.05	76	0.03
Mixed strings (len=8)	254	0.10	126	0.04
123456789->abcdefghi	9,841	3.15	2377	0.78
123456789->a12345678	53	0.02	53	0.02
levenshtein->einstein	264	0.11	136	0.05
levenshtein->meilenstein	365	0.13	205	0.06

There is a small additional improvement by reusing the ‘substitution’ value. This is most noticeable where the bounded version works less well (highlighted rows).

Conclusion

The ‘standard’ recursive function to compute the Levenshtein distance is elegant but scales very badly, O(3ⁿ), which makes it impractical (and dangerous) for most applications. By adding an upper-bound constraint the recursive function can be substantially improved. Some further improvement is achieved by reusing the ‘substitution’ value. These improvements arise by significantly reducing the number of nodes in the solution tree that need to be processed.

Iterative version

Levenshtein distance can be represented in a matrix or grid format. One sequence is laid out at the top, horizontally, and the other is laid out to the left, vertically. Each is prepended with a notional ‘null’ entry and the initial distance (0, 1, 2, …) from this null entry is entered in the first row and first column of the grid.

From this starting position the grid is filled in cell by cell, left to right, top to bottom. The cell values represent the Levenshtein distance between the two sub-sequences – one from the left to the current cell’s horizontal location and the other from the top to the current cells vertical location. The ‘rule’ for determining the value of the next cell is to take the minimum of:

The cell immediately above it plus one
The cell immediately left of it plus one
The cell immediate above and left of it plus one if the corresponding sequence entries are different, or zero if they are the same

When the grid is completed, the bottom right value is the Levenshtein distance between the two sequences. An example is shown in the following diagram (“?” shows the cell to be determined, blue cells are the ones used as input, along with the highlighted sequence entries which are tested for equality).

There are clearly strong parallels here between the recursive equation and the grid construction.

The big difference between the methods is that the grid construction is at worst O(n²). The grid method also lends itself to an iterative implementation. A simple Clojure iterative version is shown below. It uses two row variables rx and ry that progressively move down the grid. The ry row is built up from the rx row above it, and, its own last lefthand value. Once completed, the rx row is replaced with the ry row and the process is repeated.

(defn ld-grid-iter
  "Calculate the Levenshtein (edit) distance iteratively using
  a grid representation"
  [seqa seqb]
  (let [va (vec seqa)
        vb (vec seqb)
        clen (inc (count va))
        rlen (inc (count vb))
        rx (vec (range clen))]
    (loop [r 1, c 1, rx rx, ry (assoc rx 0 1)]
      (if (= r rlen)
        (peek rx)
        (if (= c clen)
          (recur (inc r) 1 ry (assoc ry 0 (inc r)))
          (let [k (if (= (va (dec c)) (vb (dec r))) 0 1)
                ry (assoc ry c (min (+ (rx c) 1)             ; above
                                    (+ (ry (dec c)) 1)       ; left
                                    (+ (rx (dec c)) k)))]    ; diagonal
            (recur r (inc c) rx ry)))))))

(ld-grid-iter "lawn" "flaw")  ; => 2

Iterative grid version of Levenshtein distance

The performance comparison of the ‘bounded and reuse’ recursive version developed above and the iterative grid version for the same sample set is shown in the following table.

	Bound and reuse		Iterative grid
Sample Type	nodes/sample	msec/sample	nodes/sample	msec/sample
lawn->flaw	13	0.01	16	0.01
Mixed strings (len=5)	24	0.01	25	0.01
Mixed strings (len=6)	42	0.02	36	0.01
Mixed strings (len=7)	76	0.02	49	0.01
Mixed strings (len=8)	126	0.04	64	0.01
123456789->abcdefghi	2,377	0.75	81	0.01
123456789->a12345678	53	0.01	81	0.02
levenshtein->einstein	136	0.11	88	0.03
levenshtein->meilenstein	205	0.08	121	0.03

In general, the iterative grid version outperforms the recursive version, as you would expect, but the recursive version holds up surprising well at least for typical word-length sequences (less than around a length of 12).

There is scope to optimise the iterative version using upper bounds in a similar way to the recursive version although this is not explored here. And of course it is possible to turn the iterative grid version into a recursive grid version for the functional purist with code like:

(defn ld-grid-recursive
  "Calculate the Levenshtein (edit) distance recursively using
  a grid representation"
  ([seqa seqb]
   (ld-iter-recursive seqa seqb 1 (vec (range (inc (count seqa))))))
  ;
  ([seqa seqb r vx]
  (if (empty? seqb)
    (peek vx)
    (let [r-fn (fn [v [a x0 x1]]
                 (let [k (if (= a (first seqb)) 0 1)]
                   (conj v (min (+ x1 1)           ; above
                                (+ (peek v) 1)     ; left
                                (+ x0 k)))))       ; diag
          vx (reduce r-fn [r] (map vector seqa vx (rest vx)))]
      (ld-grid-recursive seqa (rest seqb) (inc r) vx)))))

(ld-grid-recursive "lawn" "flaw")  ; => 2

Recursive grid version of Levenshtein distance

High Quality HTML Tables using Text Only Layouts

2016-04-06 by occen

High Quality HTML Tables using Text Only Layouts

Or … How to create a perfect table

Tables are a powerful way to present information. HTML provides a comprehensive set of elements to construct them, and CSS provides extensive formatting capability. However, in reality, tables are fiddly to get right. It is rare that common or generic styling works for all tables. Usually each table needs some tweaking.

There are many minor irritations that blight the perfect table:

column or row font styles, sizes or weights are incorrect
table or column widths cause unwanted wrapping or whitespace
cell data is too crowded or wrongly aligned
colours or borders need adjustment
column or row spanning is required
cell numeric format is inconsistent

These problems are exacerbated when you don’t have full control over the table layout, such as when it is rendered in a framework with its own defaults (e.g. themes), or when it is difficult to modify the associated CSS directly.

The following is a proposed texttab specification to define and style a table using a text only format. A basic Clojure implementation is also provided.

The specification uses inline CSS styling to provide a high degree of flexility and precision over the table’s appearance, albeit at the cost of additional styling text overhead. Ideally it is used to adjust an existing table’s base styles to make small adjustments often required by particular tables.

It is designed to be preprocessed into inline HTML that is suitable for Markdown or similar post processing.

The main features include:

simple data representation
control of base HTML element styles table, th, td, and tr
setting styles for entire table rows and columns
setting styles for individual cells
support for colspan and rowspan
support for printf-style number format strings
support for simple column calculations like sum and average

Two quick examples show some of the capabilities available to construct and style tables. The plain text code used to define the tables is shown first, followed by the resulting table as rendered by the browser.

	table {width "auto" font-size "6pt"}
	td {height "20px" padding "1px" vertical-align "top"}
	^b {background-color "#808080"}
	td-width | "20px" | *
	td | 1  |    | 2  |
	td |    | ^b |    | ^b
	td | 3  |    |    | 4
	td |    | ^b | 5  |

1	2

3		4
	5

The following is a more complex and somewhat contrived table that shows many of the styling techniques. The details are explained in the Specification section below.

	table {width "auto"}
	^p-r {padding "4px 16px 4px 4px"}
	^p-l {padding "4px 4px 4px 16px"}
	| Longer style definitions, like ^1 below, can be split
	| by repeating the definition as the styles are merged
	^1 {background-color "#dddddd" font-weight "bold" format "%.1f"}
	^1 {color black}
	^bt {border-top "2px solid #808080"}
	^lb {background-color "#ddddff"}
	^db {background-color "#bbbbff"}
	^r {background-color "#ffcccc"}
	^c2 {colspan "2"}
	^r2 {rowspan "2" vertical-align "middle"}
	td-text-align | "left" | "right" | *
	td-color | "blue"
	td-^ |    ^p-r   | ^p-l | *
	th   |Location^r2| Day^c2^lb | ^ |Night^c2^db| ^
	th^1 |    ^      |low^lb |high^lb|low^db|high^db
	td^bt| Site #147 |  17.5 |  24.0 |  11.6 | 13.1 
	td   | Site #179 |  15.9 |  25.4 |  4.1^r| 11.7
	td   | Site #204 |  18.2 |  25.7 |  10.6 | 12.9
	td^1^bt| Average |^^col-avg|^^col-avg|^^col-avg^r|^^col-avg

Location	Day		Night
Location	low	high	low	high
Site #147	17.5	24.0	11.6	13.1
Site #179	15.9	25.4	4.1	11.7
Site #204	18.2	25.7	10.6	12.9
Average	17.2	25.0	8.8	12.6

Being able to selectively apply the full range of CSS styling to rows, columns and cells makes it easy to produce visually appealing tables.

Texttab Specification (Version 1.1)

The texttab specification describes the data and styling for a web-based (HTML and CSS) table definition in a textual format. The specification is intended to be converted from text to HTML table elements and inline CSS styles suitable for rendering by a web browser.

This specification is not formal in a technical sense, but hopefully clear and concise enough to allow any interested party to implement it fairly consistently.

The tables constructed from the specification only use four standard HTML elements: table, tr, th and td. Standard CSS styles are added to these elements using the style attribute. Styles are derived from four possible sources:

Base element style
Column style
Row style
Cell style

The specification uses two characters with special meaning:

Vertical bar (|) as a column separator
Caret (^) to specify and apply reference styles

These characters can be “escaped” using a backslash (\) if they are required as table content.

The specification assumes each data row and style definition is on a seperate line. As style definitions are aggregated, long style definitions can be split and repeated on separate lines.

There are two data definitions and three styling definitions. The data definitions define the actual cell values while the styling affects their appearance.

Data definition

The data definitions specify the row type followed by the cell values that make up the columns. The row type is either ‘th‘ or ‘td‘ reflecting the HTML element that will be used to construct the cells in the row.

Row data is define as:

	th | data1 | data2 ...
	td | data1 | data2 ...

For example:

	th | Team      | Played | Won | Score
	td | Blue Mist | 4      | 1   | 34.6
	td | Wayne's Warriors| 3 | 3 | 103.5
	td | Rabbits   | 3      | 0    | 0.0

There is no requirement to align the data columns other than to aid readability.

Blank cells

Blank cells are represented by zero or more spaces. A row ending with a vertical bar (|) has an implied blank cell to its right.

Some examples of blank cells are shown below (the first two rows are identical, and all rows have four table columns).

	th | a |   | c  | d
	td |a|| c|d
	td |   | b | c |

Blank cells can contain one or more reference styles (see below). The following four cells are all blank but two have associated styles.

	td |   | ^1 | ^1^bg |

Inline HTML elements

The specification allows inline (non-escaped) HTML as cell content. Inline HTML provides further formatting options within a table cell including:

images <img src="...">
links <a href="...">...</a>
markup (e.g. like superscript <sup>1</sup>).

Example:

	table {width "40%"}
	td {vertical-align "top"}
	th | Type | Item
	td | Image | <img src="https://www.occasionalenthusiast.com/wp-content/uploads/2016/04/ttimage.jpg">
	td | Link | <a href="http://www.google.com">Google</a>
	td | Superscript | Hello World<sup>1</sup>
	td | Escaped | Hello World\<sup\>1\</sup\>

Type	Item
Image
Link	Google
Superscript	Hello World¹
Escaped	Hello World<sup>1</sup>

The implementation may provide an implementation option (see below) to disable inline HTML.

Where inline HTML is enabled, the ‘<‘ and ‘>‘ characters can be escaped using a backslash ‘\<‘ and ‘\>‘ respectively.

Style definitions

There are three styling definitions:

Element styling (i.e. table, tr, th or td)
Named column styling (applies to all the cells in a particular table column)
Reference styling (applies to specific rows, columns and cells)

All styles aggregate and later style definitions replace (override) earlier styles with the same row type and style name.

Element styles

Element styles apply to the standard HTML table elements and are global in the sense they apply across the whole table. They are defined as:

	<element> {style1 "value1" style2 "value2" ...}
	<element> {style1 "value1", style2 "value2", ...}

<element> is one of the following four HTML table element names: table, tr, th and td.

Commas between the style/value pairs are optional (as in Clojure).

For example:

	table {font-family "sans-serif" font-size "9pt"}
	th {background-color "#cceeee"}

Reference styling

Reference styling defines a style that can be used to style rows, columns and cells. A reference style is defined by:

	^<style-ref> {style1 "value1" style2 "value2" ... }
	^<style-ref> {style1 "value1", style2 "value2", ... }

<style-ref> can be any case-sensitive string of characters (at least a-z, A-Z, 0-9, _ and -) excluding whitespace and the caret (^) character. Generally it is a short string, or a single character, as it is applied by appending ^<style-ref> to the styling target.

To style an entire data row with a reference style, append it to the row type at the start of the row:

	th^<style-ref> | data1 | data2 ...
	td^<style-ref1>^<style-ref2> | data1 | data2 ...

Cell styling is similar but the style reference is appended to the cell data:

	th | data1^<style-ref> | data2 ...
	td | data1 | data2^<style-ref1>^<style-ref2> ...

In both cases, multiple styles can be applied to the same target by appending additional reference styles as required.

Examples:

	^1 {background-color "#eeeeee"}
	^2 {color "red", font-weight "bold"}
	^bd {border "2px solid #cc0000"}
	
	th^1^2| data1 | data2 | data3
	td^1 | data1 | data2^2 | data3
	td   | data1^2^bd | data2 | data3^bd

Column styling using reference styles is discussed in the next section.

Column styles

Column styling applies to all the cells in a particular table column for the row type specified (either th or td). There are two varieties of column styling, named and reference.

Named column styles are defined by:

	th-<style-name> | "value1" | "value2" ...
	td-<style-name> | "value1" | "value2" ...
	t*-<style-name> | "value1" | "value2" ...

The styles starting with “th-” only apply to th data rows, while those starting with “td-” only apply to td data rows. Column styles starting with “t*” apply to all rows regardless of their type.

Named styles are particularly useful when many columns require a different value.

Reference column styles are defined in a similar way only using defined reference styles:

	th-^ | ^<style-ref1> | ^<style-ref2> ...
	td-^ | ^<style-ref1> | ^<style-ref2> ...
	t*-^ | ^<style-ref1> | ^<style-ref2> ...

As with named styles the reference styles apply to th, td or both respectively. Multiple reference styles can be used in one column definition:

	th-^ | ^<style-ref1>^<style-ref2> | ^<style-ref2> ...

For both column styles varieties, the last style value can be an asterisk character (*) to denote that the previous style value/reference will be propagated across all the remaining columns.

Style column values can be left blank where no style is required and there is no need to specify values for all remaining columns. Remaining columns without a style value are assumed to have no column style.

Some examples of column styles:

	th-text-align | "left" | "center" | *
	td-text-align | "left" | | | "right"
	t*-color | "blue"
	td-^ | ^1 |  |  | ^2 | ^1^3
	td-^ | ^left^bold | ^right | *

Style rows can be intermingled with data rows to change the column styling throughout the table.

For example:

	table {font-size "20px"}
	td {width "24px" height "24px" text-align "center" vertical-align "middle"}
	td {padding "0px" border "none"}
	^1 {background-color "#cceecc"}
	^2 {background-color "#aaccaa"}
	td-^|^1|^2|^1|^2|^1|^2|^1|^2
	td  |&#9814;|  |  |  |&#9818;|  |  |
	td-^|^2|^1|^2|^1|^2|^1|^2|^1
	td  |  |  |  |  |  |  |  |
	td-^|^1|^2|^1|^2|^1|^2|^1|^2
	td  |  |  |  |  |&#9812;|  |  |
	td-^|^2|^1|^2|^1|^2|^1|^2|^1
	td  |  |  |  |  |  |  |  |
	td-^|^1|^2|^1|^2|^1|^2|^1|^2
	td  |  |  |  |  |  |  |  |
	td-^|^2|^1|^2|^1|^2|^1|^2|^1
	td  |  |  |  |  |  |  |  |
	td-^|^1|^2|^1|^2|^1|^2|^1|^2
	td  |  |  |  |  |  |  |  |
	td-^|^2|^1|^2|^1|^2|^1|^2|^1
	td  |  |  |  |  |  |  |  |

♖				♚

				♔

Colspan and Rowspan

CSS does not directly support colspan and rowspan but for this specification they are simply treated the same as other styles. The implementation strips them out and applies them as separate attributes.

For example:

	^cs2 {colspan "2" background-color "LightSkyBlue"}
	
	t*-background-color | | | | "Thistle"
	td-width | "50px" | *
	th | head1 | head2^cs2 | ^ | head4
	td | data1 | data2 | data3 | data4

head1	head2		head4
data1	data2	data3	data4

A single carat (^) character is used to indicate a “placeholder” for cells that will be overlaid by a row or column span. These cells are deleted from the row once styles are assigned to make space for the span. The placeholder cells maintain the integrity of the table grid structure so that column styles and calculations work as expected, as in the above example.

Numeric Formatting

In a similar manner to colspan and rowspan, the non-CSS style format can be used to render numeric data formatted by the standard printf style format string.

For example:

	td {font-family "monospace"}
	^m {format "$%,.2f" text-align "right"}
	td-^|       | ^m    | *
	th  | Item | Q1 | Q2
	td  | #0001 | 130 | 141.2
	td  | #0002 | 1550.50 | 1661.236

Item	Q1	Q2
#0001	$130.00	$141.20
#0002	$1,550.50	$1,661.24

Styling Order

The style for each cell is an aggregation of the cell’s element, column, row and cell styles. If the same style is present in more than one of the style definitions, the following order determines the precedent (highest to lowest):

Cell style
Row style
Column style
Element style
Generic styles from the environment

Comment lines

Lines starting with a vertical bar (|) are treated as comments and are ignored.

	| I'm a comment
	|me too ....

Implementation options

Lines starting with two caret characters (^^) introduce implementation specific options or flags. The format is similar to the element style definitions – key and value pairs within braces – but without the element name.

	^^ {option1 "value1" option2 "value2" ... }
	^^ {option1 "value1", option2 "value2", ... }

The options and values follow the CSS style syntax but they are entirely specific to the implementation.

Examples:

	^^ {debug "true", type "error"}
	^^ {html-escape "yes"}

Special cell variables (optional)

The specification optionally includes special cell variables that define basic calculations on a row or column. They start with two caret characters (^^). The following variables are defined:

^^row-sum – sum all values to the left that are numbers
^^row-avg – average all values to the left that are numbers
^^col-sum – sum all values above that are numbers
^^col-avg – average all values above that are numbers

All variables return “NaN” (Not a Number) if there are no numeric cells to process.

Row variables are evaluated first, so column variable will include the row results in the column results.

Care needs to be taken if the data contains values that ‘look’ numeric but are not meant to be treated as a number (e.g. the year ‘2016’). These need to be ‘de-numerated’ by adding a non-numeric character (e.g. ‘Y2016’), otherwise the variables will return incorrect values.

Example:

	td {font-family "monospace" format "%.0f"}
	^1 {format "%.2f"}
	^b {background-color "#dddddd"}
	td-text-align|"left"|"right"|*
	t*-^  |          |      |        | ^b
	th    | Item     | Male | Female | Total
	td    | Saturday |  104 |  126   | ^^row-sum
	td    | Sunday   |   87 |   62   | ^^row-sum
	td^b  | Total    |^^col-sum|^^col-sum|^^col-sum
	td^1^b| Average  |^^col-avg|^^col-avg|^^col-avg

Item	Male	Female	Total
Saturday	104	126	230
Sunday	87	62	149
Total	191	188	379
Average	95.50	94.00	189.50

Change Log

List of texttab specification changes:

Version	Change
1.0	Initial version
1.1	Added single “`^`” as row/colspan cell placeholder to maintain the table “grid” integrity

Implementation

The following is a basic Clojure implementation of the texttab specification above. It is not particularly polished or proven, but it implements the full specification and was used for the tables in this article.

To simplify the implementation, and in the spirit of web browsers, it has no error reporting. It fails silently, but usually gracefully, by ignoring any offending items.

The implementation uses the wonderful hiccup library for HTML generation.

One public function texttab-html is available that takes two arguments:

texttab text
Base style map that holds any initial element styles

An example of the base style map is shown below, but it can simply be an empty map ({}) if no initial element styling is needed.

	{:table {:border-collapse "collapse"}
	 :th {:text-transform "none"}}

The function returns an HTML table definition.

	(texttab-html "td|a|b|c" {})
	;=> <table><tr><td>a</td><td>b</td><td>c</td></tr></table>
	
	(texttab-html "^1 {x \"xyz\"}\ntd|a|b^1|c" {})
	;=> <table><tr><td>a</td><td style="x:xyz">b</td><td>c</td></tr></table>

Usage

One usage approach is to ‘bracket’ the texttab definitions with a pseudo-HTML tag <texttab> and </texttab> inside a Markdown document as follows:

	## My Markdown document
	:              (Markdown code)
	<texttab>
	:              (texttab definitions)
	</texttab>
	:
	<texttab>
	:
	</texttab>
	:

Then preprocess the Markdown with code like the following:

(let [styles {}
      content (slurp "mydocument.md")
      content (clojure.string/replace
                content
                #"(?s)<texttab>(.*?)</texttab>"
                #(texttab-html (% 1) styles))
      doc-html (md-to-html-string content)]
  (spit "mydocument.html" doc-html))

As Markdown allows inline HTML, the resulting document is Markdown ‘compliant’ and can be processed with standard Markdown tools. In this case, the md-to-html-string function from the excellent markdown-clj library.

Code

The code uses a “state” data structure, state, that is updated by a reduce function for each line of the texttab input. It accumulates the HTML and style information for each line. The hiccup html function is then used to generate the HTML output.

The main texttab-html function and its reduce “dispatching” function are shown below. The term line and row are used interchangeable.

(defn- texttab-row
  "Reduce function for texttab-html. Dispatch different row types and
  return 'state' updated by the row data."
  [state row]
  ;; Note cond order IS important
  (cond
    ;;--- Blank line -------------------------------------------
    (empty? row)
    state
    ;;--- Comment line -----------------------------------------
    (re-find #"^\|" row)
    state
    ;;--- Implementation options -------------------------------
    (re-find #"^\^\^" row)
    (texttab-options state row)
    ;;--- Element style ----------------------------------------
    (re-find #"^(table|tr|th|td)\s+\{.*\}$" row)
    (texttab-elem-style state row)
    ;;--- Reference style --------------------------------------
    (re-find #"^\^[^\^\s]+\s+(\{.*\})$" row)
    (texttab-ref-style state row)
    ;;--- Column reference styles ------------------------------
    (re-find #"^th-\^" row)
    (texttab-ref-col-style state row :th)
    ;
    (re-find #"^td-\^" row)
    (texttab-ref-col-style state row :td)
    ;
    (re-find #"^t\*-\^" row)
    (let [state (texttab-ref-col-style state row :th)
          state (texttab-ref-col-style state row :td)]
      state)
    ;;--- Column named style -----------------------------------
    (re-find #"^th-" row)
    (texttab-named-col-style state row :th)
    ;
    (re-find #"^td-" row)
    (texttab-named-col-style state row :td)
    ;
    (re-find #"^t\*-" row)
    (let [state (texttab-named-col-style state row :th)
          state (texttab-named-col-style state row :td)]
      state)
    ;;--- Data rows --------------------------------------------
    (re-find #"^th" row)
    (texttab-row-data state row :th)
    ;
    (re-find #"^td" row)
    (texttab-row-data state row :td)
    ;;--- Non matching row - ERROR - ignore entire row ---------
    :else
    state))


(defn texttab-html
  "Convert line-based texttab content into <table> html."
  [text styles]
  (let [text (texttab-escape text)
        rows (map s/trim (s/split-lines (s/trim text)))
        col-cnt (texttab-col-cnt rows)   ; no. of table data columns
        state {:elem-styles styles
               :ref-styles {}
               :col-styles {:th (repeat col-cnt {})
                            :td (repeat col-cnt {})}
               :col-calcs {:cnts (repeat col-cnt 0)
                           :sums (repeat col-cnt 0.0)}
               :col-cnt col-cnt
               :options {}
               :html []}
        state (reduce texttab-row state rows)
        table-style (get-in state [:elem-styles :table] {})
        table-attr (style-to-attr table-style)
        table-html (html [:table table-attr (seq (:html state))])]
    (texttab-unescape table-html)))

The full code is available here.

Limitations and issues

The implementation uses read-string to parse style definitions into maps. This function can execute code so it is important to only use input from trusted sources.

Conclusion

Although various markup schemes have support for tables, they are often difficult to use and typically don’t provide the fine control to produce polished tables.

By tapping directly into CSS styles, it is possible to enhance the base styling to suit individual tables with a high degree of flexibly and precision.

The specification is designed to simplify the application of CSS styles to tables by providing ways to target table HTML elements, rows, columns and cells as individual items. This allows the generation of high quality tables with simple and minimal textual input.

Occasional Enthusiast

but infrequent expert

Monthly Archives: April 2016

Efficient Recursive Levenshtein (Edit) Distance Algorithm