Overdozing Rx

2021-01-10 00:00:04

Do you ever write regular expressions with 71 capture groups, then toss them out? A conversation earlier this evening on IRC reminded me of a thing in want of "talking" about. If you know a bit about dungeon-mode, or aren't that interested in it, you might want to skip ahead or just grab the tests.

So, there's this game..

Dungeon is a role-playing game I learned as a kid. It features uncomplicated mechanics and potentially infinite flexibility for dungeon masters to do rude and innovative things to player characters. As of late 2019 I'm working with friends and comrades to create a game engine for Emacs, along with a sample game testing, at long last, the theory that technology has evolved to the point where this game could be more fun to play using computers – a point hotly debated across decades, perhaps generations.

.. as I may have mentioned.

As we talked about at most opportunities during EmacsConf, we want as much as possible in "human first" formats. Specifically, we plan to store just about everything in org-mode, one way or another. If we have our way in detail:

The build process will eventually create the deployable el files, including packaging up "default dungeon" (the working title for the sample game), maybe from a separate repository, but all from org formatted original source files.

The documentation will eventually be interwoven into the "buildable" program sources, docstrings and manuals therefrom, so also stored originally in org formatted files.

Already, all or nearly all game content is stored this way, including e.g. map tiles and "cell" (layout) data as SVG path or XML fragments. It will be nice eventually to be able to put especially longer SVG XML fragments into e.g. babel blocks, but baby steps: read on.

You can't fool me. It's turtles all the way down.

Using plain text files with a minimum of human readable "mark-up" helps to side-step proprietary formats often associated with game content. Even better, it leaves users in direct possession of all of their own original content as plain-text!

Let's put this in more concrete terms:

We want to let users rearrange and add and remove stuff from their copies of the game files, such as character sheets. The "master" copies need to live with the dungeon master, as they could contain secrets even from the players. Game designers should be able to adjust the general layout and structure of game source material, creating and removing files, and changing the design of even integral tables, e.g. as used by the sample game, etc.

When put in abstract terms this has the look of a nasty open-wound^Dended problem. A couple of handfuls of "design concessions" later and voilà: a potentially implementable design that may not hopelessly confuse us.

The game will only create org files and will only remotely change .org tables.
Each system lists the available paths and files we can read or write locally.
The remote (dungeon-master) can create local files in response to our interactive commands.
The remote can over-write certain tables (e.g. possessions) in files it created.
Deal with sync errors by replacing entire tables from dungeon master's versions.
Changes are not distributed until persisted by the dungeon-master.
Separate tables isolate display concerns, e.g. sorting and localization.
Consolidate across possible sources before acting on new information.
Verify entire tables when acting from a remotely managed source (e.g. using a possession).

Our hope is that this may encourage players to augment our character sheets, etc., creating additional files and/or surrounding the requisite dungeon-master controlled tabular information with stories and other narratives; that seems fun.

So, Anyway

Perhaps obviously: I told you that so I could tell you this.

Ever more likely to be obvious: there was some process to evolve our present conceptions of "literate game-sources", or whatever. One of the failed experiments, in fact, lives on in the git repo. The theory, in this case, was to render down an entire org-mode document looking for clues in the form of custom properties that would transform each file into a set of Emacs Lisp expressions. These lisp expressions would function as "load scripts" once evaluated, populating data structures in memory.
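To make the "looking for clues in custom properties" step a little more concrete, here is a minimal sketch of that kind of scan. It is not the code from the abandoned experiment: the helper name `my:collect-ox-forms` is made up for illustration, and it assumes each clue is a single readable Lisp form stored in an :OX: property on a headline.

```emacs-lisp
(require 'org)
(require 'org-element)

(defun my:collect-ox-forms (org-text)
  "Return the Lisp forms read from each :OX: property in ORG-TEXT.
Hypothetical helper: assumes one readable form per property."
  (with-temp-buffer
    (insert org-text)
    ;; Parse in an Org buffer, but skip user hooks for this throwaway.
    (delay-mode-hooks (org-mode))
    (org-element-map (org-element-parse-buffer) 'headline
      (lambda (headline)
        ;; Property drawer entries appear as upper-case keywords.
        (let ((value (org-element-property :OX headline)))
          ;; Returning nil makes `org-element-map' skip the headline.
          (and value (car (read-from-string value))))))))
```

Headlines without the property are skipped, so the result is just the list of forms found, in document order. Evaluating those forms as a "load script" is left out here on purpose.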

I called this idea ox-ox, and I still sort of like it. Let's get a little closer to org's export functionality to give it context.

The generalized solution for exporting from org-mode to some other format (for example HTML, or perhaps PDF via a pit-stop in LaTeX) is called ox. Each specific exporter is ox-something, where something is a file format that we can generate starting with org files, e.g. ox-html provides HTML export features.

Behind the scenes, ox gives programmers an API including a registry (org-export-backends) for plugging into the interface exposed to users by org-export-dispatch, and an inheritance system (org-export-define-derived-backend) similar in principle to define-derived-mode. Individual exporters provide functions that each create the "rendered" output for a given "element" found in the org document being exported, in a form suitable to the destination format. Thus a given "element exporter function" is responsible for transforming one list item or table cell per invocation into, e.g., an <li>, or what-have-you.

Providing new export functionality can be as simple as selecting an existing ox module that does almost what we want, and then writing a couple of custom functions to get that output just right.
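As a rough sketch of that "derive and tweak" pattern, the following derives a backend from ox-html and overrides a single translator. The backend name `my-demo-html`, the function names, and the "loud" class are all invented for illustration; bold text is used instead of a list item only because it makes for a shorter example.

```emacs-lisp
(require 'ox-html)

(defun my:ox-demo-bold (_bold contents _info)
  "Render bold CONTENTS with a made-up \"loud\" class."
  (format "<b class=\"loud\">%s</b>" contents))

;; Inherit everything from the HTML exporter except bold handling.
(org-export-define-derived-backend 'my-demo-html 'html
  :translate-alist '((bold . my:ox-demo-bold)))

(defun my:ox-demo-export (org-text)
  "Export ORG-TEXT, body only, through the derived backend."
  (org-export-string-as org-text 'my-demo-html t))
```

Calling `(my:ox-demo-export "Say *hi* now")` should return the usual HTML paragraph with the bold span rendered by the custom function rather than by ox-html's default.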

With ox-ox I thought to expose the generalized solution to markup. Aside from involved (but useful!) possibilities such as creating one-off exporters inline, this might have enabled syntax like:

#+TITLE: Clifford The Warrior

* Clifford's Character Sheet
:PROPERTIES:
:OX: '(character (name @II$1) (class @II$2) (player @II$3))
:END:
|----------------+---------+-------------|
| Character Name | Class   | Player Name |
|----------------+---------+-------------|
| Clifford       | Warrior | Corwin      |
|----------------+---------+-------------|

Which, in theory, might have rendered into this elisp:

'(character
   (name "Clifford")
   (class "Warrior")
   (player "Corwin"))

Except, of course, what we really want is to pull name based on, in this case, the table header under which we found it, which got pretty ugly to generalize. This experiment involved an interesting expression composition approach which got some way toward working; it seemed genuinely nifty.
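The header-based lookup on its own, without any new syntax, might be sketched like this. The helper name `my:table-first-row-alist` is hypothetical, and the sketch assumes a table with one header row and reads only the first data row; it relies on `org-table-to-lisp` accepting the table as a string.

```emacs-lisp
(require 'org-table)
(require 'seq)

(defun my:table-first-row-alist (table-text)
  "Pair each header cell in TABLE-TEXT with the first data row.
Hypothetical helper: assumes a single header row."
  (let* ((rows (seq-remove (lambda (row) (eq row 'hline))
                           (org-table-to-lisp table-text)))
         (header (car rows))
         (data (cadr rows)))
    ;; Walk both rows in step, consing header onto value.
    (seq-mapn #'cons header data)))
```

Given the character sheet table above, this would pair "Character Name" with "Clifford", and so on. It covers only the lookup, not the expression composition that the experiment attempted.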

But it didn't work out.

After decades of working on this game I -I dare say we- have developed some strong biases against creating new languages, syntax, or "comprehensive" models of the potential set of information and functionality that might exist within a given game environment. For reasons about which I may eventually say enough, my sense is that a declarative/discovery model is the only way to get the requisite arbitrariness from intersecting game concepts. Suppose we are designing a "curse". While in effect, all of the staircases between levels will (with no warning) go the opposite direction from usual until after they are next used. Well, let's see: we need a new counter per staircase.. in declarative context we create what we need as we need it and trust the parts that affect each other to take mutual care. An abstract system ("Countable" type) or otherwise exhaustive models have come to look like needless complexity in terms of making a computer game that is essentially replacing a tablet of graph-paper and a pair of six-sided dice.

I eventually found the ox-ox approach imposed too much convention or required too much configuration. For this approach to work each copy of each file must either duplicate the necessary formatting or else identify with some archetype or "type system" to associate the file's contents with expression templates. And custom syntax too.

Neither of those looked great, and we did eventually find something we liked better, but that's not the interesting part today. As seems often the case, the "nugget" from this experiment may have been the journey more than the destination.

What About TBLFM?

In the examples above you may have noticed the use of "TBLFM" syntax, the potentially cryptic seeming means org provides for referencing tables, table cells, and cell ranges. The examples aren't incidental: reusing org's existing facility for addressing table (and other document) content had helped motivate the idea.
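For readers who have not met `rx-let`, here is a deliberately tiny sketch of the technique: named sub-expressions composed into a pattern for one absolute cell reference such as @3$12. It handles none of the relative, hline, named, range, or remote forms, and the function name `my:parse-cell-ref` is invented for this illustration. It assumes Emacs 27 or later, where `rx-let` is available.

```emacs-lisp
(require 'rx)

(defun my:parse-cell-ref (ref)
  "Return (ROW . COL) for an absolute REF like \"@3$12\", else nil."
  (rx-let ((numbered (group (1+ digit)))
           (cell (and ?@ numbered ?$ numbered)))
    ;; Anchor at both ends so partial references do not match.
    (when (string-match (rx bos cell eos) ref)
      (cons (string-to-number (match-string 1 ref))
            (string-to-number (match-string 2 ref))))))
```

Each use of `numbered` contributes its own capture group, which is exactly how the group count climbs so quickly once rows, columns, ranges and remote references are all composed from the same pieces.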

The below "test program" defines and then tests an Rx formula which appears able to parse TBLFM style table, cell and range references, including remote table references. Running M-x my:do-rx-test creates (or replaces the content of) a "ox re test result" buffer, which shows the rendered regular expression, and maps a series of test cases against the expression showing the capture groups populated by each as an org-table.

Note, several of the first few test cases represent syntax "invented" for the ox-ox experiment. Specifically, none of these test cases are valid for org-mode but are allowed by the Rx formula. In the ox-ox vision such as was developed, these would variously have allowed introspection and access to the arguments which are customarily passed to the per-element exporter functions.

"$_" "$_:kw" "$*" "$*:kw" "$:" "$::kw"

Here's a sample of the output:

Regular Expression

"\\$\\([#*:_]\\)\\(?::\\([A-Z_a-z][_[:alnum:]]*\\)\\)?\\|@\\(#\\)\\|\\(?:@\\(?:\\([<>]+\\)\\|\\([+-]\\)?\\(I+\\)\\(?:\\([+-]\\)\\([[:digit:]]+\\)\\)?\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\(?:\\$\\(?:\\([A-Z_a-z][_[:alnum:]]*\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\)?\\|\\$\\(?:\\([A-Z_a-z][_[:alnum:]]*\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\)\\(?:\\.\\.\\(?:@\\(?:\\([<>]+\\)\\|\\([+-]\\)?\\(I+\\)\\(?:\\([+-]\\)\\([[:digit:]]+\\)\\)?\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\(?:\\$\\(?:\\([A-Z_a-z][_[:alnum:]]*\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\)?\\|\\$\\(?:\\([A-Z_a-z][_[:alnum:]]*\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\(I+\\)\\(?:\\([+-]\\)\\([[:digit:]]+\\)\\)?\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\)?\\|remote([[:blank:]]*\\([^,]+\\)[[:blank:]]*,[[:blank:]]*\\(?:@\\(?:\\([<>]+\\)\\|\\([+-]\\)?\\(I+\\)\\(?:\\([+-]\\)\\([[:digit:]]+\\)\\)?\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\(?:\\$\\(?:\\([A-Z_a-z][_[:alnum:]]*\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\)?\\|\\$\\(?:\\([A-Z_a-z][_[:alnum:]]*\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\)\\.\\.\\(?:@\\(?:\\([<>]+\\)\\|\\([+-]\\)?\\(I+\\)\\(?:\\([+-]\\)\\([[:digit:]]+\\)\\)?\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\(?:\\$\\(?:\\([A-Z_a-z][_[:alnum:]]*\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\)?\\|\\$\\(?:\\([A-Z_a-z][_[:alnum:]]*\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\([[:digit:]]+\\)\\)\\|\\([<>]+\\)\\|\\([+-]\\)?\\(I+\\)\\(?:\\([+-]\\)\\([[:digit:]]+\\)\\)?\\|\\([+-]\\)?\\([[:digit:]]+\\)\\))"

Test Strings

@#

n	match	group
0	@#	full-match
3	#	row-count

$#

n	match	group
0	$#	full-match
1	#	special-name

$1

n	match	group
0	$1	full-match
18	1	col-number

$+2

n	match	group
0	$+2	full-match
17	+	col-sign
18	2	col-number

$>>>

n	match	group
0	$>>>	full-match
16	>>>	col-boundry

@1$1

n	match	group
0	@1$1	full-match
10	1	row-number
14	1	field-number

@1$1..@-2$+1

n	match	group
0	@1$1..@-2$+1	full-match
10	1	row-number
14	1	field-number
24	-	right-row-sign
25	2	right-row-number
28	+	right-field-sign
29	1	right-field-number

remote(Bar,@1$1..@-2$+1)

n	match	group
0	remote(Bar,@1$1..@-2$+1)	full-match
41	Bar	remote-name
48	1	remote-row-number
52	1	remote-field-number
62	-	remote-right-row-sign
63	2	remote-right-row-number
66	+	remote-right-field-sign
67	1	remote-right-field-number
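The pairing of group numbers with labels shown in these tables can also be sketched as a plain function returning an alist. This is an illustrative variant, not part of the test program; the name `my:labeled-matches` and the three-label example in the usage note are assumptions.

```emacs-lisp
(defun my:labeled-matches (regexp labels string)
  "Match REGEXP against STRING; return an alist of (LABEL . MATCH).
LABELS lists one symbol per group, starting with group 0.
Groups that did not match, or whose label is nil, are skipped."
  (when (string-match regexp string)
    (let ((n 0)
          result)
      (dolist (label labels (nreverse result))
        (let ((m (match-string n string)))
          (when (and label m)
            (push (cons label m) result)))
        (setq n (1+ n))))))
```

For example, with the regexp "\\$\\([+-]\\)?\\([0-9]+\\)" and the labels (full-match col-sign col-number), the string "$+2" yields all three entries while "$1" yields only full-match and col-number.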

Test Program

And here's the full program source. This contains code generating a buffer in org-mode, so we'll see how hugo deals :)

;;; ox-ox-test.el --- tests for ox-ox                -*- lexical-binding: t; -*-

;; Copyright (C) 2020  Corwin Brust

;; Author: Corwin Brust <corwin@bru.st>
;; Keywords:

;; This program is free software; you can redistribute it and/or modify
;; it under the terms of the GNU General Public License as published by
;; the Free Software Foundation, either version 3 of the License, or
;; (at your option) any later version.

;; This program is distributed in the hope that it will be useful,
;; but WITHOUT ANY WARRANTY; without even the implied warranty of
;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
;; GNU General Public License for more details.

;; You should have received a copy of the GNU General Public License
;; along with this program.  If not, see <https://www.gnu.org/licenses/>.

;;; Commentary:

;; tests, mostly of rx stuff

;; this initial version proves out some regexes used to support TBLFM
;; like interpolation syntax for @n$x, $#

;;; Code:

(require 'rx)
(require 'seq)     ; `seq-map-indexed'
(require 'subr-x)  ; `when-let'

;; Mostly taken from `org-table' then converted for `rx'
(defvar my:rx)
(setq my:rx
      '((var-sigel ?$)
        (col-sigel ?$)
        (row-sigel ?@)
        (key-sigel ?:)
        (comma (and (0+ blank) ?, (0+ blank)))
        (sign (group (any ?+ ?-)))
        (boundry (group (1+ (or ?< ?>))))
        (numbered (group (1+ digit)))
        (named (group (and (any "a-z" "A-Z" "_")
                           (0+ (any alnum ?_)))))
        (key (and key-sigel named))
        (special (or (and var-sigel (group (any ?* ?_ ?: ?#)) (opt key))
                     (and ?@ (group ?#))))
        (remote (item) (and "remote(" (0+ blank) (group (1+ (not ?,)))
                            comma item ")"))
        (col (or named boundry (and (opt sign) numbered)))
        (row (or boundry
                 (and (opt sign) (group (1+ ?I))
                      (opt (and sign numbered)))
                 (and (opt sign) numbered)))
        (cell (or (and row-sigel row (opt (and col-sigel col)))
                  (and col-sigel col)))
        (range (and cell ".." (or cell row)))
        (range? (and cell (opt (and ".." (or cell row)))))
        (var (or special range? (remote range)))))
;; (eval (macroexpand `(rx-let ,my:rx (string-match-p (rx var) "$foo"))))

(defvar my:rx-labels nil "Labels for the groups of var in `my:rx'.")
(setq my:rx-labels
      '(full-match ; 0
 special-name ; 1
 special-keyword ; 2
 row-count ; 3
 row-boundry ; 4
 row-hline-sign ; 5
 row-hline ; 6
 row-hline-adj-sign ; 7
 row-hline-adj ; 8
 row-sign ; 9
 row-number ; 10
 field-name ; 11
 field-boundry ; 12
 field-sign ; 13
 field-number ; 14
 col-name ; 15
 col-boundry ; 16
 col-sign ; 17
 col-number ; 18
 right-row-boundry ; 19
 right-row-hline-sign ; 20
 right-row-hline ; 21
 right-row-hline-adj-sign ; 22
 right-row-hline-adj ; 23
 right-row-sign ; 24
 right-row-number ; 25
 right-field-name ; 26
 right-field-boundry ; 27
 right-field-sign ; 28
 right-field-number ; 29
 right-col-name ; 30
 right-col-boundry ; 31
 right-col-sign ; 32
 right-col-number ; 33
 ;; 34-40 are the bare `row' alternative on the right of ".." in a
 ;; local range (7 groups); none of the test strings fill them.
 nil nil nil nil nil nil nil
 remote-name ; 41
 remote-row-boundry ; 42
 remote-row-hline-sign ; 43
 remote-row-hline ; 44
 remote-row-hline-adj-sign ; 45
 remote-row-hline-adj ; 46
 remote-row-sign ; 47
 remote-row-number ; 48
 remote-field-name ; 49
 remote-field-boundry ; 50
 remote-field-sign ; 51
 remote-field-number ; 52
 remote-col-name ; 53
 remote-col-boundry ; 54
 remote-col-sign ; 55
 remote-col-number ; 56
 remote-right-row-boundry ; 57
 remote-right-row-hline-sign ; 58
 remote-right-row-hline ; 59
 remote-right-row-hline-adj-sign ; 60
 remote-right-row-hline-adj ; 61
 remote-right-row-sign ; 62
 remote-right-row-number ; 63
 remote-right-field-name ; 64
 remote-right-field-boundry ; 65
 remote-right-field-sign ; 66
 remote-right-field-number ; 67
 remote-right-col-name ; 68
 remote-right-col-boundry ; 69
 remote-right-col-sign ; 70
 remote-right-col-number ; 71
 ))

(defvar my:rx-test-strings nil "These are test strings that should all match.")
(setq my:rx-test-strings
      '("$foo" ;; "remote(Bar,$foo)" "$foo:kw" ;;ZZZ: ?
 "$_" "$_:kw" "$*" "$*:kw" "$:" "$::kw"
 "@#" "$#" "$1" "$+2" "$>>>"
 "@1$1"  "@1$1..@-2$+1" "remote(Bar,@1$1..@-2$+1)"
 "@-2$+1"  "@-2$+1..@3$>>>" "remote(Bar,@-2$+1..@3$>>>)"
 "@3$>>>"  "@3$>>>..@+4$foo" "remote(Bar,@3$>>>..@+4$foo)"
 "@+4$foo"
 "@+4$foo..@<$1" "remote(Bar,@+4$foo..@<$1)"
 "@<$1"  "@<$1..@<<$-2" "remote(Bar,@<$1..@<<$-2)"
 "@<<$-2"  "@<<$-2..@>>>$<<<" "remote(Bar,@<<$-2..@>>>$<<<)"
 "@>>>$<<<"  "@>>>$<<<..@>>>>$foo" "remote(Bar,@>>>$<<<..@>>>>$foo)"
 "@>>>>$foo"
 "@>>>>$foo..@I$1" "remote(Bar,@>>>>$foo..@I$1)"
 "@I$1"  "@I$1..@I+2$>>" "remote(Bar,@I$1..@I+2$>>)"
 "@I+2$>>"  "@I+2$>>..@III$foo" "remote(Bar,@I+2$>>..@III$foo)"
 "@III$foo"  "@III$foo..@+IIII-44$+4" "remote(Bar,@III$foo..@+IIII-44$+4)"
 "@+IIII-44$+4"
 "@+IIII-44$+4..@III-3$11" "remote(Bar,@+IIII-44$+4..@III-3$11)"
 "@III-3$11"  "@III-3$11..@+I$12" "remote(Bar,@III-3$11..@+I$12)"
 "@+I$12"  "@+I$12..@-II$13" "remote(Bar,@+I$12..@-II$13)"
 "@-II$13"  "@-II$13..@+III-17$14" "remote(Bar,@-II$13..@+III-17$14)"
 "@>$<..@+III-17$foo" "remote(Bar,@>$<..@+III-17$foo)"
 "@+III-17$foo..$<" "remote(Bar,@+III-17$foo..$<)"
 "@+2$foo..@<<<$+14" "remote(Bar,@+2$foo..@<<<$+14)"
 "@>>$foo..$+42"	"remote(Bar,@>>$foo..$+42)"))

(defmacro my:rx-test (str &optional result)
  "Match STR against `var' from `my:rx'; return org table rows.
Each row lists a group number, its match, and its label from
`my:rx-labels'.  RESULT is currently ignored."
  (declare (indent 2))
  (ignore result)
  `(rx-let ,my:rx
     (when (string-match (rx var) ,str)
       (mapconcat
        #'identity
        (delq nil (seq-map-indexed
                   (lambda (_ n)
                     ;; Keep only groups that matched and have a label.
                     (when-let ((m (match-string n ,str))
                                (g (nth n my:rx-labels)))
                       (format "| %s | %s | %s |" n m g)))
                   (make-list 72 nil)))
        "\n"))))

(defun my:do-rx-test ()
  "Run `my:rx-test-strings' against `my:rx'."
  (interactive)
  (with-current-buffer (get-buffer-create "**ox re test result")
    (erase-buffer)
    (goto-char (point-min))
    (insert "#+title: Test Results: ~ox-ox rx~\n
Try ~occur~ where ~n~ is a group number in:
#+begin_example
| n | [^ ]
#+end_example
* Regular Expression\n
#+name: regex
#+begin_src emacs-lisp")
    ;; `print' writes the rendered regexp string surrounded by newlines.
    (print (eval `(rx-let ,my:rx (rx var)) t)
           (current-buffer))
    (insert "#+end_src\n\n* Test Strings")
    (dolist (test-form my:rx-test-strings)
      (insert (format "\n\n** %s\n\n| n  | match | group |\n" test-form))
      (insert (my:rx-test test-form)))
    (pop-to-buffer (current-buffer))
    (org-mode)))

(provide 'ox-ox-test)
;;; ox-ox-test.el ends here

Overall the current character sheet sample feels like a big improvement, but I still think there will end up being a use for a stand-alone parser for decoding TBLFM style table, cell and range references.