Semgrep: Stop grepping code

by Isaac Evans

tl;dr: Semgrep is an open-source tool that is like a code-aware grep: you can easily match all calls to a certain function, match all specific function arguments regardless of order, or find all cases where a call like A() occurs after B(). semgrep.dev for source + more.

Who is this post for?

If you have ever:

  • written a custom linter plugin
  • wanted to write a custom linter plugin
  • tried to set up a grep expression to flag issues during code review
  • cobbled together open-source tools for security scanning
  • paid for an expensive SAST (static analysis security testing) product and been disappointed with it
  • wished it was easier to get started with pragmatic program analysis

Then read on!

Part 1: Why Do We Need Static Analysis If Everyone Writes Rust?

Programmers in modern languages (Python, Rust, Golang, JavaScript, etc.) mostly don't have to deal with many of the bug classes that have plagued memory-unsafe languages like C or C++ for decades [0]. Most of the bugs in modern programming projects are framework- or project-specific, not language-specific.

Imagine a developer writing in a Python web framework like Flask. The security implications of these two lines of code are entirely different, yet not obvious at all to someone who has not read the framework documentation:

# example 1
return render_template_string(template.format(request.referrer))
# example 2
return render_template("index.html", req=request.referrer)

In example 1, the call to .format() will bypass the autoescaping protections provided by the Flask framework, an example of server-side template injection, while example 2 is totally safe.
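To see why example 1 is risky without running Flask at all, here is a minimal sketch that uses only plain Python string formatting. The `template` and `referrer` values are made up for illustration. The point is that after `.format()` the attacker's text is already part of the template source that would be handed to render_template_string.

```python
# Hypothetical values for illustration; not taken from a real application.
template = "<p>You came from: {}</p>"
referrer = "{{ 7 * 7 }}"  # attacker-controlled Referer header

# Example 1: the attacker's text is spliced into the template *source*,
# so a template engine would later treat "{{ 7 * 7 }}" as an expression.
unsafe_template_source = template.format(referrer)
print(unsafe_template_source)

# Example 2 keeps the template source fixed and passes the value as data,
# which is what lets the framework's autoescaping apply to it.
safe_template_source = "<p>You came from: {{ req }}</p>"
safe_context = {"req": referrer}
```

In the second form the attacker can only influence `safe_context`, never the template text itself.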

Or imagine a bespoke authorization framework that's not used outside this developer's company. Our developer might write some code like this:

// oops, used make_super_user which mutates instead of check_super_user…
if (user.isadmin && make_super_user(user.id)) {

Traditional static analysis tools won't find one-off issues specific to a framework or project because they focus on finding a few types of extremely valuable defects in massive projects (typically language-level defects, though they also find common defect classes in the most popular frameworks [1]). The kinds of defects they look for require complex analysis, which makes the tools slow: they often take hours to days to scan a codebase.

But newer projects (in part tautologically) contain orders of magnitude fewer lines of code, are probably in new languages that eradicate traditional bug types, and are likely deployed at a faster pace than the days it may take to run traditional tooling.

Part 2: Worse is Better: Why Grep is Good

Many developers have turned to extensible, simple tools like grep or linters to find what are essentially mistakes of ignorance, like using render_template_string unintentionally because you saw it in a Stack Overflow answer and missed the tiny notes in the docs, or misusing make_super_user. Ideally, we would flag the code for review by a senior developer or security engineer whenever these functions are invoked.

Using grep[5] to find these snippets is not a bad choice (infinite language support! super fast! FOSS!), but it's not an ideal choice either. Consider these examples where we grep for the make_super_user function mentioned earlier:

grep -RE "make_super_user\(.*\)" *

Grep will miss or mishandle several cases:

if (user.isadmin && make_super_user(
user.id)): # oops multiline

if (user.isadmin && make_super_user(log=get_log(), user=user.id))
# ^ nested parens, won't report range correctly

from core import make_super_user as msu
msu(userid) # imported with alias

log.warn("calling make_super_user() now ")
# ^ not code, it's a string

# Remember to call make_super_user() below
# ^ not code, it's a comment

And grep will match examples like these, which are actually OK:

make_super_user(user.id, mutate=False) # this case is actually ok

dont_make_super_user(user.id) # oops prefix is not considered

Unfortunately, using grep made us miss a number of the cases we care about, due to syntactic changes that don't impact the behavior of the code, like whitespace or additional function calls. Further, we find matches we don't care about, like comments or hard-coded strings. If we're just grepping for a function name in a small project, these are probably ok tradeoffs. But if we wanted to search a larger project with some precision, or set something up to trigger on pull requests that used this function, these problems would probably be a blocker.
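The behavior described above can be reproduced with a rough sketch that applies the same regular expression line by line, the way grep does. The `SOURCE` text is a hypothetical file assembled from the cases listed earlier, and `grep_lines` is a helper invented for this illustration.

```python
import re

# Same idea as the grep expression above: literal parentheses around anything.
PATTERN = re.compile(r"make_super_user\(.*\)")

# Hypothetical file contents, assembled from the cases listed above.
SOURCE = '''\
if user.isadmin and make_super_user(
        user.id):
    pass
msu(userid)
log.warn("calling make_super_user() now ")
# Remember to call make_super_user() below
dont_make_super_user(user.id)
'''

def grep_lines(pattern, text):
    """Return (line_number, line) pairs whose line matches, like grep -n."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]

hits = grep_lines(PATTERN, SOURCE)
for number, line in hits:
    print(f"{number}: {line}")
```

The only lines reported are the string, the comment, and the unrelated dont_make_super_user call. The multiline call on line 1 and the aliased call on line 4 are missed entirely.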

Looks like it's time to plug this problem into a Real Parser and come up with a solution that is AST-aware [2]! Using the AST means we're going to be considering the code as the compiler or interpreter sees it: a tree, rather than a string of characters.

How can we extend a parser to do this? If we're writing Python, we could write a Flake8 plugin or use the Python ast module directly. Other languages have similar options: a framework like Brakeman for Ruby, PMD for Java, maybe ESLint for JavaScript, Clippy for Rust, and Go has a built-in AST package.
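As a minimal sketch of the "use the Python ast module directly" option, the snippet below finds direct calls to a named function. The `SOURCE` text and the `find_calls` helper are hypothetical; the code is only parsed, never executed, so the undefined names in it don't matter.

```python
import ast

# Hypothetical module contents; `core`, `user`, and `log` are never imported
# or run, the text is only parsed.
SOURCE = '''\
from core import make_super_user
if user.isadmin and make_super_user(
        user.id):
    pass
log.warn("calling make_super_user() now ")
# Remember to call make_super_user() below
dont_make_super_user(user.id)
'''

def find_calls(source, function_name):
    """Return the line numbers of direct calls to `function_name`."""
    tree = ast.parse(source)
    return sorted(
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == function_name
    )

print(find_calls(SOURCE, "make_super_user"))
```

The multiline call is found, and the string, comment, and prefixed name are ignored. This sketch does not resolve import aliases such as `msu`; handling those would need extra bookkeeping on top of the tree walk.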

Indeed, at r2c we wrote a lot of Flake8 and ESLint plugins internally before we started using Semgrep. One of my co-workers, Ulzii, wrote a comparison of the experience, and another of us, Matt, previously published a security linter based on Flake8. The pros of these linter frameworks: they are much more robust and precise than grep. But the cons are considerable.

Grep

  • Pro: fast, infinite languages, readable expressions (arguable)
  • Cons: line-oriented, mismatch with program structure (trees, ASTs)

Real Parsers

  • Pro: Robust and precise against whitespace or other noise
  • Con: Need a parser for each language: babel, go-ast, python-ast, etc.
  • Con: Each parser represents ASTs differently; have to learn and write differently
  • Con: Languages have “more than one way to do it”: need deep language expertise to capture all possible variations of the patterns.

Part 3: Semgrep

Semgrep is a novel solution to this problem that tries to capture the ease of grep syntax and the robustness and precision of a real parser. For our previous example, the Semgrep invocation looks like this:

semgrep --lang python -e 'make_super_user(..., mutate=True)' /path/to/project

Will match:

from core import make_super_user as msu
u = msu("admin", mutate=True)

Here's a live example you can edit: https://semgrep.live/v60

Semgrep already handles Python, JavaScript, Java, Golang, and C (coming soon: Ruby, TypeScript, PHP, others). The first version of Semgrep (then sgrep) was written at Facebook to enforce thousands of these grep-like rules on the Facebook codebase. Yoann Padioleau, original Semgrep author and first program analysis hire at Facebook, joined us at r2c last year [3]. The Semgrep operator syntax choices come from an even older tool named Coccinelle that he worked on during a PhD at Inria.

What exactly is Semgrep doing? Semgrep has parsers for all these languages, and uses the same parser on our query as on the source code. It converts both the query and the source code into ASTs, and then uses the query AST as a tree-matching query on the source code.
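The idea of parsing the query with the same parser as the source can be sketched in a few lines of Python using the standard ast module. This is only an illustration of the concept, not how Semgrep itself is implemented; `parse_pattern`, `find_matches`, and the `SOURCE` text are hypothetical names for this example.

```python
import ast

def parse_pattern(pattern):
    """Parse a one-expression pattern with the same parser used for the code."""
    return ast.parse(pattern, mode="eval").body

def find_matches(pattern, source):
    """Return line numbers of expressions structurally identical to `pattern`."""
    # ast.dump() omits line and column information by default, so two nodes
    # dump to the same string exactly when their tree shapes are the same.
    wanted = ast.dump(parse_pattern(pattern))
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.expr) and ast.dump(node) == wanted
    )

SOURCE = '''\
ok = make_super_user( user.id )
text = "make_super_user(user.id)"
other = make_super_user(admin.id)
'''

print(find_matches("make_super_user(user.id)", SOURCE))
```

Whitespace differences disappear because both sides are compared as trees, and the string literal on line 2 is a different kind of node, so it never matches.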

Part 4: Semgrep Patterns

We saw that the motivation behind Semgrep is a desire to combine the speed and iteration time of grep with code-awareness. Ideally, our code search patterns look similar to the code itself and we don't have to write complicated tree visitors (see ESLint rule docs for an example). Semgrep provides three abstractions to make the search patterns look as similar as possible to the source code you want to find:

  1. Equivalences: Matching code that means the same thing even though it looks different.
  2. Ellipsis (...): Abstracting away or ignoring statements, expressions, or function arguments you don't care about.
  3. Metavariables ($X): Matching expressions (like function calls, arguments, assignments, etc.) when you don't know exactly what they'll look like ahead of time, like capture groups in regular expressions.

Let's look at these in more detail.

(1) Equivalences

Many languages have "more than one way to do it." Let's think about all the ways we can create a variable in JavaScript:

let x = 1;
const x = 1;
var x = 1;

With Semgrep, we just write x = 1 and all three of these match—even though those examples are actually different at the AST level. We can write rules for a language without having to become an expert in the TMTOWTDI philosophy of languages like Perl [4].

If Semgrep just performed the direct transformation from AST to AST matcher, that would be a big improvement over alternatives. But Semgrep is a semantic (not just syntactic) grep-esque tool. The semantic part of the name refers to the fact that Semgrep has built-in logic to understand which code patterns are equivalent, which saves us the work of specifying the thousands of combinations we would need if we worked only at the AST level.

In Semgrep, these are called equivalences. For instance, Semgrep knows that in Python foo(a = 1, b = 2) is equivalent to foo(b = 2, a = 1) because the order of keyword arguments doesn't matter. Not only does Semgrep ship with a ton of equivalences, but we’re also working on adding support for defining your own!
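One way to picture the keyword-argument equivalence is to normalize both sides before comparing them. The sketch below does this with Python's ast module by sorting keyword arguments by name. It is an assumption-laden illustration of the idea, not Semgrep's actual mechanism, and `normalized_dump` and `equivalent` are hypothetical helpers.

```python
import ast

def normalized_dump(expression):
    """Dump an expression after sorting each call's keyword arguments by name."""
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # `**kwargs` entries have arg=None, so fall back to "" for sorting.
            node.keywords.sort(key=lambda keyword: keyword.arg or "")
    return ast.dump(tree)

def equivalent(left, right):
    """True when two expressions differ only in keyword-argument order."""
    return normalized_dump(left) == normalized_dump(right)

print(equivalent("foo(a = 1, b = 2)", "foo(b = 2, a = 1)"))  # True
print(equivalent("foo(a = 1, b = 2)", "foo(a = 2, b = 1)"))  # False
```

Positional arguments are left alone, since their order does change the meaning of a call.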

(2) "..." Ellipsis operator

Sometimes there's a bunch of stuff in the code that you don't want to specify the names of. The ellipsis operator (...) makes that easy:

Arguments: live example

pattern        |  matching code
---------------+----------------
foo(...,5)     |  foo(1,2,3,4,5)
               |  foo(5)
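The argument case in the table above can be sketched as a small recursive matcher in which the ellipsis stands for zero or more arguments. Arguments are represented as plain strings here to keep the example short; `args_match` is a hypothetical helper, not part of Semgrep.

```python
ELLIPSIS = "..."

def args_match(pattern, args):
    """Match a list of pattern arguments against actual arguments.

    An ELLIPSIS entry in `pattern` stands for zero or more arguments.
    """
    if not pattern:
        return not args
    head, rest = pattern[0], pattern[1:]
    if head == ELLIPSIS:
        # Try letting the ellipsis absorb 0, 1, 2, ... leading arguments.
        return any(args_match(rest, args[skip:]) for skip in range(len(args) + 1))
    return bool(args) and args[0] == head and args_match(rest, args[1:])

print(args_match(["...", "5"], ["1", "2", "3", "4", "5"]))  # True
print(args_match(["...", "5"], ["5"]))                      # True
print(args_match(["...", "5"], ["5", "6"]))                 # False
```

The last call fails because the pattern requires 5 to be the final argument.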

Characters: live example

pattern        |  matching code
---------------+----------------
foo("...")     | foo("whatever sequence of chars")

Statements: live example

pattern            matching code
------------------------------------
$V = get()         user_data = get()
...                print("do stuff")
eval($V)           foobar()
                   eval(user_data)

(3) $X Metavariable operator

What if we had a common idiom in our code that looked like this:

const transaction_id = 42;
verify_transaction(foobar);
make_transaction(foobar);

Sometimes we want to match a variable in the code, like x = get_input(), and then refer to the same variable later on, e.g. to find process_input(x). We can do this for variables named x, but we’d like to find this pattern for any variable name. Enter metavariables!

pattern            matching code
------------------------------------
$X = 42;           transaction_id = 42;
                   x = 42;

If we refer to the same metavariable more than once, Semgrep will enforce an equality constraint on those references. So in our snippet from above, we can match only places where verify_transaction and make_transaction are being called with the same variable.

pattern                    matching code
------------------------------------------
verify_transaction($X)     verify_transaction(foobar)
...                        make_transaction(foobar)
make_transaction($X)
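The equality constraint in the table above can be sketched as a binding step followed by a search for a later statement that uses the same binding. The sketch uses Python's ast module, looks only at top-level statements, and only handles calls with a single plain-name argument; `call_parts` and `find_pairs` are hypothetical helpers and this is not how Semgrep is implemented.

```python
import ast

def call_parts(statement):
    """Return (function_name, argument_name) for a statement like `f(x)`, else None."""
    if (
        isinstance(statement, ast.Expr)
        and isinstance(statement.value, ast.Call)
        and isinstance(statement.value.func, ast.Name)
        and len(statement.value.args) == 1
        and isinstance(statement.value.args[0], ast.Name)
    ):
        return statement.value.func.id, statement.value.args[0].id
    return None

def find_pairs(source, first, second):
    """Find `first($X) ... second($X)` where both calls share the same $X.

    Returns (line_of_first, line_of_second, bound_name) tuples.
    """
    statements = ast.parse(source).body
    matches = []
    for index, statement in enumerate(statements):
        parts = call_parts(statement)
        if parts is None or parts[0] != first:
            continue
        bound = parts[1]  # the value captured by $X
        # The ellipsis: skip any number of statements after the first call.
        for later in statements[index + 1:]:
            if call_parts(later) == (second, bound):
                matches.append((statement.lineno, later.lineno, bound))
                break
    return matches

SOURCE = '''\
verify_transaction(foobar)
log("checked")
make_transaction(foobar)
verify_transaction(a)
make_transaction(b)
'''

print(find_pairs(SOURCE, "verify_transaction", "make_transaction"))
```

Only the first pair is reported. The calls on lines 4 and 5 use different variables, so the binding for $X does not carry over.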

Part 5: Now What?

Run Semgrep Patterns on Your Code!

If you read this and just want to get started with Semgrep patterns that others have already written, we have a lot of examples: check out the community registry! You can easily fork the rules and share them back. If you just want to run the r2c-curated rules on your project, install Semgrep and then, in a directory with some code, run:

semgrep --config r2c

Mailing List

If you're a program analysis guru, you must have a lot of questions: what about stateful queries, dataflow like constant propagation, taint-tracking, and all the other good stuff? Good news: there's a lot more under the hood than we had time to talk about in this blog post. Join our Slack (link at semgrep.dev) or add yourself to the mailing list to keep up to date.

Commercial

r2c is the company behind Semgrep. We're in the early stages of building a paid next-generation static analysis product you can use to fully replace your legacy tooling, with Semgrep at the core. If this is interesting to you, drop your email and we'll ping you to learn more about your use-case.


Notes & References

  • [0] Specifically buffer overflows have a very long history: although Wikipedia cites the Morris worm as the first example, other sources suggest it goes back much further.
  • [1] Don't traditional SAST tools have detection for issues like XSS, SQLi, CSRF, etc.? Yes, and they can find those issues in the most popular frameworks.

    Many modern frameworks, however, have evolved to largely solve these security issues for the developer at the framework level. This means that framework-specific issues today are more likely specific gotchas that are non-obvious framework behavior, not necessarily how a generic vulnerability class manifests.

    Semgrep's thesis is that while prior approaches focus on vulnerability identification (which requires expensive, slow, imprecise interprocedural dataflow analysis), the way forward is good frameworks with strong secure defaults and automatic lightweight enforcement to make sure those defaults are used.

    More fundamentally, there is a never-ending struggle for the tools to be able to understand the bespoke sanitizers that a developer employs to "clean" data on taint-tracked flows from potentially-attacker-controlled data to dangerous sinks. Unfortunately, relying on a SAST tool to find these flows means that we require a whole-program interprocedural analysis, rather than insisting that the type system or a changelog reviewer be able to prove to themselves in the local context that these changes are safe. Consider the Google approach to preventing SQL injection, which doesn't rely on traditional program analysis at all (excerpt from Google's Building Secure And Reliable Systems):

    The database engine can help you prevent SQL injection vulnerabilities by providing bound parameters and prepared statements...However, merely establishing a guideline to use prepared statements does not result in a scalable security process. You would need to educate every developer about this rule, and security reviewers would have to review all application code to ensure consistent use of prepared statements. Instead, you can design the database API so that mixing user input and SQL becomes impossible by design. For example, you can create a separate type called TrustedSqlString and enforce by construction that all SQL query strings are created from developer-controlled input.

  • [2] AST = Abstract syntax tree
  • [3] This explains the OCaml question and why some comments in the source are still en français
  • [4] Caveat: Semgrep doesn't support Perl (yet). (Did you know Perl's syntax is worse for novices than that of a randomly-generated language? Source!)
  • [5] Really, do yourself a favor and use a modern re-implementation like ripgrep
