Rust's Macros 2.0 are intuitive: demonstrate a pattern, and the compiler can insert the pattern into your program wherever you want it. Inspired by this syntax, I wondered: Could you “run a macro backwards”—use the same by-example language to describe patterns to search for?
Searching by syntax has some interesting uses:
These goals present some challenging requirements: support searching for new patterns at runtime, but also be fast enough to search a large codebase for tens of thousands of patterns at once.
To work out the key design decisions, I built a prototype, comacro. The library is still in an early research phase, and lacks a lot of necessary functionality, like optional arguments, argument types other than `ident` and `expr`, and useful error messages. But the core functionality is working and I thought it might be enough for a demo, so you can try it out below!
Pattern:

```
let $t = $x;
$x = $y;
$y = $t;
```

Pattern Trace:

```
Local{ Ident{ $1 } $2 }
Semi{ Assign{ $2 $3 } }
Semi{ Assign{ $3 Path{ $1 } } }
```

Input:

```
let mut foo = Some(23);
let mut bar = mem::replace(&mut foo, None);
let temp = foo;
foo = bar;
bar = temp;
let nothing = Vec<Option>::new();
let nothing = nothing.map(|x| x).flatten();
let zero = nothing.collect().len();
```

Input Trace (in the matched statements, each captured subtree is tagged with the metavariable it binds):

```
Local{ Ident{ foo } Call{ Path{ Some } Lit{ 23 } } }
Local{ Ident{ bar } Call{ Path{ mem replace } Reference{ Path{ foo } } Path{ None } } }
Local{ Ident{ Ident{ temp }1 } Expr{ Path{ foo } }2 }
Semi{ Assign{ Expr{ Path{ foo } }2 Expr{ Path{ bar } }3 } }
Semi{ Assign{ Expr{ Path{ bar } }3 Path{ Ident{ temp }1 } } }
Local{ Ident{ nothing } Binary{ Binary{ Path{ Vec } Path{ Option } } Call{ Path{ new } } } }
Local{ Ident{ nothing } MethodCall{ MethodCall{ Path{ nothing } map Closure{ Ident{ x } Path{ x } } } flatten } }
Local{ Ident{ zero } MethodCall{ MethodCall{ Path{ nothing } collect } len } }
```
The general idea is to compile patterns (and the code to be searched) into a form suitable for regex-like matching.
Pattern definitions need to be parsed as a Rust AST so that metavariables will match the right AST types, optional nodes will work right, etc. Syntax that can contain metavariables of various AST types is naturally a superset of Rust's syntax. Early on I faced a choice: do it “properly” by extending a Rust parser with new node possibilities, or go with “the delta-value hack”: replace the metavariables in the token stream with placeholders, twice, using different placeholders each time. Parse both versions as standard syntax, linearize, and then compare: the differences between the streams encode the metavars. (See why I call it a hack?)
So I thought, I'll do the work upfront to get it right and have less technical debt later. I soon discovered, though, that extending a Rust parser is the technical-debt equivalent of taking out a private student loan: it requires maintaining a substantially-modified version of a complex library; there's a lot of room for bugs like precedence issues with the new expression types; and it would require far-reaching changes to modify its handling of `Ident` to be an enum, since it uses `proc_macro2`'s `Ident` unencapsulated. And this is all just working with `syn`, the easy middle-end. I don't even want to think about what redoing all this for `rustc::hir` would involve!
As bad as “do it twice with magic values” may sound, encoding all
the metadata into syntactically-normal Rust preserves the abstraction boundary
that all the complexity of parsing Rust is on the other side of.
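Here's a rough sketch of the hack, assuming `syn` with the `full` and `visit` features. The placeholder scheme (`__mv*`), the `substitute` helper, and the ident-only "linearization" are all made up for illustration; comacro's real Trace carries more structure than this.

```rust
// Cargo.toml (assumed): syn = { version = "2", features = ["full", "visit"] }
use std::collections::HashMap;
use syn::visit::Visit;

/// Replace each `$name` metavariable with an ordinary identifier so the
/// pattern parses as plain Rust. The same metavariable always maps to the
/// same placeholder within one pass; `base` distinguishes the two passes.
fn substitute(pattern: &str, base: usize) -> String {
    let mut ids: HashMap<String, usize> = HashMap::new();
    let mut out = String::new();
    let mut chars = pattern.chars().peekable();
    while let Some(c) = chars.next() {
        if c != '$' {
            out.push(c);
            continue;
        }
        let mut name = String::new();
        while chars.peek().map_or(false, |ch| ch.is_alphanumeric() || *ch == '_') {
            name.push(chars.next().unwrap());
        }
        let next = ids.len();
        let id = *ids.entry(name).or_insert(next);
        out.push_str(&format!("__mv{}", base + id));
    }
    out
}

/// A toy "linearization": walk the AST in order and record every identifier.
struct Idents(Vec<String>);

impl<'ast> Visit<'ast> for Idents {
    fn visit_ident(&mut self, i: &'ast syn::Ident) {
        self.0.push(i.to_string());
    }
}

fn main() {
    let pattern = "{ let $t = $x; $x = $y; $y = $t; }";

    // Parse two placeholder-substituted copies with different bases; both are
    // syntactically normal Rust, so the stock parser handles them untouched.
    let a: syn::Block = syn::parse_str(&substitute(pattern, 0)).unwrap();
    let b: syn::Block = syn::parse_str(&substitute(pattern, 100)).unwrap();

    let (mut ta, mut tb) = (Idents(Vec::new()), Idents(Vec::new()));
    ta.visit_block(&a);
    tb.visit_block(&b);

    // Positions where the two linearizations differ are exactly the
    // metavariable slots, and the placeholder's numeric suffix says which one.
    for (x, y) in ta.0.iter().zip(&tb.0) {
        if x != y {
            println!("metavariable #{} here ({} vs {})", &x[4..], x, y);
        }
    }
}
```

Running this prints one line per metavariable occurrence, which is roughly the information the real implementation folds into the Trace serialization described next.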
Once the parser determines the AST structure, a visitor serializes the tree to the regex format: a compact linear format I'm calling Trace, with just enough information to compare two trees and identify any subtrees that differ. (In the demo above, the serialized forms are a readable representation of this binary format.) Trace supports `memcmp`-like comparison without building a stack; when a comparison fails, if the current position in the pattern is a wildcard, matching resumes after consuming one subtree by counting delimiters. To match against many patterns at once, both simply and fast, it would be possible to build an NFA from a collection of Trace patterns. The initial comparison (or NFA run) identifies patterns that match the input structurally; structural matches can be further checked by extracting subtrees to verify that every occurrence of a metavariable binds an equal subtree. (There's a subtree-hashing/partition-refinement approach to doing these checks fast, but the naive approach is probably fine, since structural matching will filter out the majority of the candidates.)
Trace is specialized for matching; deserializing it back to an AST would be technically possible, but complex. Instead, operations that need to compare two trees use a differential-execution approach: visit the input's tree with a retracer that compares the Trace the input would emit against the pattern's Trace. When the retracer finds a difference, the visitor has access to the AST objects that produced the differing Trace, so it can capture them to bind metavariables, emit the debugging representations seen above, or do other fun things with them.
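To make both halves concrete, here's a toy model of the `memcmp`-style structural scan (a metavariable consumes exactly one subtree by counting delimiters) plus the capture-and-equality check on whatever each metavariable bound. Everything here, including the `Tok` type and the names, is invented for illustration; comacro's Trace is a compact binary format, and its retracer hands back real `syn` AST nodes rather than trace slices.

```rust
use std::collections::HashMap;

// A linearized tree token: this toy representation stands in for Trace.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Tok<'a> {
    Open(&'a str),  // start of a subtree, e.g. "Local{"
    Close,          // "}"
    Atom(&'a str),  // identifiers, literals
    Meta(u32),      // a metavariable slot ($1, $2, ...), patterns only
}

/// Match a pattern trace against an input trace in one left-to-right scan.
/// On a metavariable, consume exactly one input subtree by counting
/// delimiters and record the slice it covered; repeated occurrences of the
/// same metavariable must bind equal slices. Assumes well-formed traces.
fn match_trace<'a>(
    pattern: &[Tok<'a>],
    input: &'a [Tok<'a>],
) -> Option<HashMap<u32, &'a [Tok<'a>]>> {
    let mut bindings = HashMap::new();
    let (mut p, mut i) = (0, 0);
    while p < pattern.len() && i < input.len() {
        match pattern[p] {
            Tok::Meta(n) => {
                // Skip one whole subtree: one atom, or a balanced Open..Close run.
                let start = i;
                let mut depth = 0i32;
                loop {
                    match input[i] {
                        Tok::Open(_) => depth += 1,
                        Tok::Close => depth -= 1,
                        _ => {}
                    }
                    i += 1;
                    if depth == 0 {
                        break;
                    }
                }
                let subtree = &input[start..i];
                // Equality check: every occurrence of $n must bind the same subtree.
                if let Some(prev) = bindings.insert(n, subtree) {
                    if prev != subtree {
                        return None;
                    }
                }
                p += 1;
            }
            tok if tok == input[i] => {
                p += 1;
                i += 1;
            }
            _ => return None,
        }
    }
    (p == pattern.len() && i == input.len()).then_some(bindings)
}

fn main() {
    use Tok::*;
    // Pattern: Local{ Ident{ $1 } $2 }   (roughly `let $t = $x;`)
    let pattern = [Open("Local"), Open("Ident"), Meta(1), Close, Meta(2), Close];
    // Input:   Local{ Ident{ temp } Path{ foo } }   (from `let temp = foo;`)
    let input = [
        Open("Local"), Open("Ident"), Atom("temp"), Close,
        Open("Path"), Atom("foo"), Close, Close,
    ];
    let bindings = match_trace(&pattern, &input).expect("should match");
    println!("$1 = {:?}", bindings[&1]); // [Atom("temp")]
    println!("$2 = {:?}", bindings[&2]); // [Open("Path"), Atom("foo"), Close]
}
```

The NFA idea above would fold many such pattern traces into one scan, and the hashing trick would speed up the equality check, but neither changes the basic shape of the matcher.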
This library isn't really intended to run in a webpage, but I've been looking for an excuse to use WASM for something, and with Rust and wasm-pack Rusting once and running anywhere is easier done than said. Put your next project in a web page, just because you can!
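If you're curious what the bridge looks like, the exported surface can be as small as a single function. The `find_matches` name and signature below are invented for illustration, not comacro's actual interface.

```rust
// A minimal wasm-bindgen export; this is a stub, not comacro's real API.
use wasm_bindgen::prelude::*;

/// Entry point callable from the page's JavaScript. Build with
/// `wasm-pack build --target web` and import the generated module.
#[wasm_bindgen]
pub fn find_matches(pattern: &str, input: &str) -> String {
    // A real implementation would parse both arguments and return match data;
    // this just echoes them to show the FFI surface.
    format!("pattern: {pattern}\ninput: {input}")
}
```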
I also had to remember how to JavaScript and HTML (everything is reasonably standardized and the tooling is awesome now!? I suddenly feel like my childhood was full of unnecessary suffering!). Once I figure out how to compose two libraries that both want to be used via `npm init` (?), I'll probably end up rewriting most of the JavaScript in OCaml with BuckleScript, but I'll shave that yak later.
I think Clippy could benefit from an implementation of comacro with a `hir` middle-end and some simple extensions beyond standard macro syntax (definitely optional metavars, probably a capture syntax to match a node but also bind it for further inspection), keeping the pattern syntax minimal and using it to describe the overall structure simply. A pattern's associated match handler can then query additional properties about its match bindings.
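As a purely hypothetical sketch of that division of labor (neither the pattern-as-a-string convention nor the handler below is a real comacro or Clippy API), using the `mem::replace(.., None)` lint from the table below: the pattern carries the one-line structural description, and an ordinary Rust handler does everything else.

```rust
// Hypothetical sketch only: not a real comacro or Clippy API.

/// The structural half of a `mem::replace(.., None)` lint as a one-liner.
const MEM_REPLACE_NONE: &str = "let $dest = mem::replace($src, None);";

/// The handler receives whatever the metavariables bound and does the
/// non-structural work (type checks, spans, suggestions) in ordinary Rust.
fn check_mem_replace(dest: &str, src: &str) {
    // Placeholder: a real handler would inspect HIR nodes, not strings.
    println!("`{dest}` comes from mem::replace({src}, None); consider Option::take");
}

fn main() {
    println!("pattern: {MEM_REPLACE_NONE}");
    check_mem_replace("bar", "&mut foo");
}
```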
I looked at 10 recently added lints to try to estimate how often this might be useful:
| lint | comacro applicable? |
|---|---|
| #3223: unnecessary filter_map | no |
| #3203: map(..).flatten() | yes, but method_chain_args handles this type of case |
| #3195: mem::replace(.., None) | yes, reduces 7-if chain to a one-liner pattern |
| #3129: mistyped literals | no |
| #3104: $ptr.offset($usize as isize) | yes, replace about 50 LOC with 2 one-liner patterns |
| #3109: needless .collect() | yes, replace 5-if chain with 3 one-liner patterns |
| #2975: identity into_iter | no, already simple |
| #2924: Copy+Iterator | no |
| #2895: missing_inline | no |
| #2844: Default::default | yes, replaces short if-chain |
Some tentative conclusions from this small sample: