Declarative Rust static analysis

an interactive exploration

Rust's Macros 2.0 are intuitive: demonstrate a pattern, and the compiler can insert the pattern into your program wherever you want it. Inspired by this syntax, I wondered: Could you “run a macro backwards”—use the same by-example language to describe patterns to search for?

Searching by syntax has some interesting uses:

An IDE integration like IntelliJ's Structural Search and Replace, but with Rust type awareness
Semantic patching: a library making an API-breaking change could include a patch file that attempts to update the calling code (similar to Coccinelle for C)
Simpler, faster Clippy lints. More on that below.
A mega-lint that searches for code that could be simplified by using any utility function from the standard library.

These goals present some challenging requirements: support searching for new patterns at runtime, but also be fast enough to search a large codebase for tens of thousands of patterns at once.

To work out the key design decisions, I built a prototype, comacro. The library is still in an early research phase, and lacks a lot of necessary functionality, like optional arguments, arguments types other than ident and expr, and useful error messages. But the core functionality is working and I thought it might be enough for a demo, so you can try it out below!

Try it out

Try some examples: [manual swap] [map/flatten] [replace w/ None] [useless collect]

Or edit the yellow-highlighted text to write your own.

macro pattern1(

$t:ident, $x:expr, $y:expr

) {

let $t = $x;
$x = $y;
$y = $t;

}

↓Parse↓

StmtSeq
Local		Semi		Semi
Ident	$2	Assign		Assign
$1		$2	$3	$3	Path
					$1

↓Serialize↓

Local{ Ident{ $1 } $2 }
Semi{ Assign{ $2 $3 } }
Semi{ Assign{ $3 Path{ $1 } } }

↓Match↓

Local{ Ident{ foo } Call{ Path{ Some } Lit{ 23 } } }
Local{ Ident{ bar } Call{ Path{ mem replace } Reference{ Path{ foo } } Path{ None } } }
Local{ Ident{ Ident{ temp }¹ } Expr{ Path{ foo } }² }
Semi{ Assign{ Expr{ Path{ foo } }² Expr{ Path{ bar } }³ } }
Semi{ Assign{ Expr{ Path{ bar } }³ Path{ Ident{ temp }¹ } } }
Local{ Ident{ nothing } Binary{ Binary{ Path{ Vec } Path{ Option } } Call{ Path{ new } } } }
Local{ Ident{ nothing } MethodCall{ MethodCall{ Path{ nothing } map Closure{ Ident{ x } Path{ x } } } flatten } }
Local{ Ident{ zero } MethodCall{ MethodCall{ Path{ nothing } collect } len } }

↑Parse + Serialize + Match↑

fn code_to_search() {

let mut foo = Some(23);
let mut bar = mem::replace(&mut foo, None);
let temp = foo;
foo = bar;
bar = temp;
let nothing = Vec<Option>::new();
let nothing = nothing.map(|x| x).flatten();
let zero = nothing.collect().len();

}

Implementation details

The general idea is to compile patterns (and the code to be searched) into a form suitable for regex-like matching.

Extend the AST, or slip metavariables under the parser's radar?

Pattern definitions need to be parsed as a Rust AST so that metavariables will match the right AST types, optional nodes will work right, etc. Syntax that can contain metavariables of various AST types is naturally a superset syntax of Rust. Early on I considered a choice: do it “properly” by extending a Rust parser with new node possibilities, or “the delta-value hack”: replace metavariables in the token stream with placeholders, twice—with different placeholders. Parse both versions as standard syntax, linearize, and then compare: differences between the streams encode metavars. (See why I call it a hack?)

So I thought, I'll do the work upfront to get it right and have less technical debt later. I soon discovered though, that extending a Rust parser is the technical debt equivalent of taking out a private student loan: it requires maintaining a substantially-modified version of a complex library; there's a lot of room for bugs like precedence issues with the new expression types, and it would require far-reaching changes to modify its handling of Ident to be an enum since it uses proc_macro2's Ident unencapsulated. And this is all just working with syn—the easy middle-end. I don't even want to think about what redoing all this for rustc::hir would involve! As bad as “do it twice with magic values” may sound, encoding all the metadata into syntactically-normal Rust preserves the abstraction boundary that all the complexity of parsing Rust is on the other side of.

Trace: language of the trees

Once the parser determines the AST structure, a visitor serializes the tree to the regex format—a compact linear format I'm calling Trace, with just enough information to compare two trees and identify any subtrees that differ (In the demo above, the serialized forms are a readable representation of this binary format.) Trace supports memcmp-like comparison without building a stack; when a comparison fails, if the current position in the pattern is a wildcard it resumes after consuming one subtree by counting delimiters. To match against many patterns at once simply and fast, it would be possible to build an NFA from a collection of Trace patterns. The initial comparison (or NFA run) identifies patterns that match the input structurally; structural matches can be further compared by extracting subtrees to check that each metavariable binds an equivalence set of subtrees. (There's a subtree-hashing/partition-refinement approach to doing these checks fast, but the naive approach is probably fine since structural matching will filter out the majority of the candidates.)

Differential execution: easier than reconstructing the eviscerated

Trace is specialized for matching; deserializing it back to an AST would be technically possible, but complex. Instead, operations that need to compare two trees use a differential execution approach: visit the input data's tree with a retracer that compares the Trace the second tree would emit with the pattern's Trace. When the retracer finds a difference, the visitor has access to the AST objects that produced different Trace, so it can capture them to bind metavariables, or emit the debugging representations seen above, or do other fun things with them.

Building the demo

This library isn't really intended to run in a webpage, but I've been looking for an excuse to use WASM for something, and with Rust and wasm-pack Rusting once and running anywhere is easier done than said. Put your next project in a web page, just because you can!

I also had to remember how to JavaScript and HTML (everything is reasonably standardized and the tooling is awesome now!? I suddenly feel like my childhood was full of unnecessary suffering!). Once I figure out how to compose two libraries that both want to be used via npm init (?), I'll probably end up rewriting most of the JavaScript in OCaml with BuckleScript, but I'll shave that yak later.

comacro for Clippy?

I think Clippy could benefit from an implementation of comacro with a hir middle-end and some simple extensions beyond standard macro syntax (definitely optional metavars, probably a capture syntax to match a node but also bind it for further inspection)—keeping the pattern syntax minimal, and using it to describe the overall structure simply. A pattern's associated match handler can then query additional properties about its match bindings.

I looked at 10 recently added lints to try to estimate how often this might be useful:

lint	comacro applicable?
#3223: unnecessary filter_map	no
#3203: map(..).flatten()	yes, but `method_chain_args` handles this type of case
#3195: mem::replace(.., None)	yes, reduces 7-if chain to a one-liner pattern
#3129: mistyped literals	no
#3104: $ptr.offset($usize as isize)	yes, replace about 50 LOC with 2 one-liner patterns
#3109: needless .collect()	yes, replace 5-if chain with 3 one-liner patterns
#2975: identity into_iter	no, already simple
#2924: Copy+Iterator	no
#2895: missing_inline	no
#2844: Default::default	yes, replaces short if-chain

Some tentative conclusions from this small sample:

many lints look for simple patterns and could be easier to read with comacro
some lints look for complex structures and could be greatly simplified with comacro