Cybersecurity and cryptanalysis are fields filled with logic puzzles, math, and numerical techniques. One of the most interesting technical areas I’ve worked in goes by the name of satisfiability modulo theories (SMT) and their associated solvers. This post provides a layman’s introduction to SMT and its applications to computer science and the modern practice of learning how to break code.
Some background theory
Satisfiability modulo theories problems are a class of constraint-satisfaction problems that arise in many places, from software and hardware verification to static program analysis and graph problems. They apply wherever logical formulas can describe a system’s states and their associated transformations. If you look under the hood of most tools used today for computer security, you will find they are based on mathematical logic as the calculus of computation. The most common constraint-satisfaction problem is propositional satisfiability (commonly called SAT), which aims to decide whether a formula composed of Boolean variables, formed using logical connectives, can be made true by choosing true/false values for its variables. In this sense, those familiar with integer programming will find a lot of similarities with SAT. SAT has been widely used in verification, artificial intelligence, and many other areas.
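To make the idea of SAT concrete, here is a minimal sketch, assuming the z3-solver Python package (my choice of solver, not something mandated by the theory), that asks whether a small propositional formula can be made true:

# A minimal SAT sketch: can (p or q) and (not p or q) and (not q or p) be made true?
from z3 import Bools, Or, Not, Solver, sat

p, q = Bools('p q')
s = Solver()
s.add(Or(p, q), Or(Not(p), q), Or(Not(q), p))
print(s.check())    # sat
print(s.model())    # one satisfying assignment, e.g. [p = True, q = True]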
As powerful as SAT problems are, what if, instead of Boolean constraints, we could use arithmetic more generally to build our constraints? Often constraints are best described as linear relationships among integer or real variables. In order to understand and rigorously treat the sets involved, a domain theory is combined with propositional satisfiability to arrive at satisfiability modulo theories (SMT).
The satisfiability modulo theories problem is a decision problem for logical formulas with respect to combinations of background theories expressed in classical first-order logic with equality. An SMT solver decides the satisfiability of propositionally complex formulas in theories such as arithmetic and uninterpreted functions with equality. SMT solving has numerous applications in automated theorem proving, in hardware and software verification, and in scheduling and planning problems. SMT can be thought of as a form of the constraint satisfaction problem and thus a certain formalized approach to constraint programming.
The solvers developed under SMT have proven very useful in situations where linear constraints and other types of constraints are required, with artificial intelligence and verification often presented as exemplars. An SMT solver can solve a SAT problem, but not vice versa. SMT solvers draw on some of the most fundamental areas of computer science, as well as a century of symbolic logic. They combine the problem of Boolean satisfiability with domains such as those studied in convex optimization and term-manipulating symbolic systems. Implementing SMT solvers requires grappling with decision procedures, the completeness and incompleteness of logical theories, and complexity theory.
The process of SMT solving is a procedure for finding a satisfying assignment for a quantifier-free formula $F$ with predicates over a certain background theory $T$; alternatively, the SMT solving process can show that no such assignment exists. An assignment to all variables that satisfies these constraints is the model, $M$. $M$ satisfies $F$ when $F$ evaluates to $\text{true}$ under the given background theory $T$. In this sense, $M$ entails $F$ under theory $T$, which is commonly expressed as $M \vDash_T F$. If theory $T$ is not decidable, then the underlying SMT problem is undecidable and no solver can exist. Decidability means that, given a conjunction of constraints in $T$, there exists a procedure of finitely many steps that can test the existence of a satisfying assignment for these constraints.
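As a tiny illustration of $M \vDash_T F$, the sketch below (again assuming the z3-solver Python package) asks for a model of a quantifier-free formula over the theory of linear integer arithmetic:

# A minimal SMT sketch: find a model M of F in the theory of linear integer arithmetic.
from z3 import Ints, Or, Solver, sat

x, y = Ints('x y')
F = [x + 2 * y == 7, x > 0, Or(y > 0, x > 5)]

s = Solver()
s.add(*F)
if s.check() == sat:
    print(s.model())    # a satisfying assignment, e.g. [x = 1, y = 3]
else:
    print('no satisfying assignment exists')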
Ok, that is a lot of jargon. What is this good for?
SMT solvers have been used since the 1970s, albeit in very specific contexts, most commonly theorem proving (cf. ACL2 for an example). More recently, SMT solvers have been helpful in test-case generation, model-based testing, and static program analysis. To make this concrete, let’s consider one of the most well-known operations research problems: job-shop scheduling.
There are $n$ jobs to complete, each composed of $m$ tasks with varying durations, one on each of $m$ machines. The start of a new task can be delayed indefinitely, but you can’t stop a task once it has started. For this problem, there are two kinds of constraints: precedence and resource constraints. A precedence constraint specifies that one task of a job has to finish before the next one begins, and a resource constraint specifies that no two tasks requiring the same machine can execute at the same time. Given a total maximum time $max$ and the duration of each task, the problem is to decide whether a schedule exists such that the end time of every task is less than or equal to $max$ units of time. The duration of the $j$th task of job $i$ is $d_{i,j}$, and each such task starts at $t_{i,j}$.
I’ve solved this problem before with heuristics such as simulated annealing, but you can also encode it in SMT using the theory of linear arithmetic. First, you have to encode the precedence constraint:
$$ t_{i,j+1} \geq t_{i,j} + d_{i,j} $$
This states that the start time of task $j+1$ must be greater than or equal to the start time of task $j$ plus its duration. The resource constraint ensures that two jobs never occupy the same machine at the same time. For jobs $i$ and $i'$ on machine $j$, this constraint says:
$$ (t_{i,j} \geq t_{i',j} + d_{i',j}) \vee (t_{i',j} \geq t_{i,j} + d_{i,j}) $$
Lastly, each start time must be non-negative, $t_{i,1} \geq 0$, and the end time of the last task must be less than or equal to $max$, that is, $t_{i,m} + d_{i,m} \leq max$. Together, these constraints form a logical formula that combines logical connectives (conjunction, disjunction, and negation) with atomic formulas in the form of linear arithmetic inequalities. This is the SMT formula, and the solution is a mapping from the variables $t_{i,j}$ to values that make the formula $\text{true}$.
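To make the encoding concrete, here is a minimal sketch using Z3’s Python bindings (an assumption on my part; any solver that handles linear integer arithmetic would do), with made-up durations and deadline:

# Job-shop scheduling as linear arithmetic constraints.
from z3 import Int, Or, Solver, sat

d = [[2, 1, 3],     # d[i][j]: duration of job i's task on machine j (made-up numbers)
     [3, 2, 1]]
max_t = 8           # the deadline "max"
n, m = len(d), len(d[0])

t = [[Int('t_%d_%d' % (i, j)) for j in range(m)] for i in range(n)]
s = Solver()
for i in range(n):
    s.add(t[i][0] >= 0)                              # start times are non-negative
    s.add(t[i][m - 1] + d[i][m - 1] <= max_t)        # last task finishes by the deadline
    for j in range(m - 1):
        s.add(t[i][j + 1] >= t[i][j] + d[i][j])      # precedence within job i
for j in range(m):                                   # resource constraint on machine j
    for i in range(n):
        for k in range(i + 1, n):
            s.add(Or(t[i][j] >= t[k][j] + d[k][j],
                     t[k][j] >= t[i][j] + d[i][j]))
if s.check() == sat:
    print(s.model())                                 # one feasible schedule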
So how is this relevant to software security?
Since software uses logical formulas to describe program states and the transformations between them, SMT has proven very useful for analyzing, verifying, or testing programs. In theory, if we tried every possible input to a computer program, and we could observe and understand every resulting behavior, we would know with certainty all possible vulnerabilities in a piece of software. The challenge of using formal methods to verify (or exploit) software is to accomplish this certainty in a reasonable amount of time, and that generally distills down to clever ways of reducing the state space.
For example, consider dynamic symbolic execution. In computational mathematics, algebraic or symbolic computation is the study and development of algorithms and software for manipulating mathematical expressions and other mathematical objects. This is in contrast to scientific computing, which is usually based on numerical computation with approximate floating-point numbers; symbolic computation emphasizes exact computation with expressions containing variables that are manipulated as symbols.
The software that performs symbolic calculations is called a computer algebra system. At the beginning of computer algebra, circa 1970, when long-known algorithms were first translated into computer code, they turned out to be highly inefficient. This motivated a renewed application of classical algebra to make these algorithms effective and to discover efficient new ones. For example, Euclid’s algorithm for computing polynomial greatest common divisors had been known for centuries, but directly coding it turned out to be inefficient for polynomials over infinite fields.
Computer algebra is widely used to design the formulas that are used in numerical programs. It is also used for complete scientific computations, when purely numerical methods fail, like in public key cryptography or for some classes of non-linear problems.
To understand some of the challenges of symbolic computation, consider basic associative operations like addition and multiplication. The standard way to deal with associativity is to consider that addition and multiplication have an arbitrary number of operands, that is, $a + b + c$ is represented as $“+”(a, b, c)$. Thus $a + (b + c)$ and $(a + b) + c$ are both simplified to $“+”(a, b, c)$. However, what about subtraction, say $(a - b + c)$? The simplest solution is to rewrite it systematically as $(a + (-1)\cdot b + c)$. In other words, in the internal representation of the expressions there is no subtraction, division, or unary minus outside the representation of the numbers. Newspeak for mathematical operations!
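As a toy illustration of that internal representation (not how any real computer algebra system stores terms), the sketch below flattens nested sums and rewrites subtraction as addition of a $(-1)$ multiple:

# Expressions are tuples ('+', operands...) or ('*', operands...); symbols are strings.
def normalize(expr):
    if not isinstance(expr, tuple):
        return expr                                   # a number or a symbol
    op, *args = expr
    args = [normalize(a) for a in args]
    if op == '-':                                     # a - b  ->  a + (-1)*b
        lhs, rhs = args
        return normalize(('+', lhs, ('*', -1, rhs)))
    if op == '+':                                     # flatten nested sums
        flat = []
        for a in args:
            flat.extend(a[1:] if isinstance(a, tuple) and a[0] == '+' else [a])
        return ('+', *flat)
    return (op, *args)

# (a + b) + c and a + (b + c) normalize to the same term:
print(normalize(('+', ('+', 'a', 'b'), 'c')))   # ('+', 'a', 'b', 'c')
print(normalize(('+', 'a', ('+', 'b', 'c'))))   # ('+', 'a', 'b', 'c')
print(normalize(('-', 'a', 'b')))               # ('+', 'a', ('*', -1, 'b'))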
A number of tools used in industry are based on symbolic execution (cf. CUTE, KLEE, DART, etc.). What these programs have in common is that they collect explored program paths as formulas and use solvers to identify new test inputs with the potential to guide execution into new branches. SMT fits this problem well because, instead of the random walks of fuzz testing, “white-box” fuzzing combines symbolic execution with conventional fuzz testing to expose the interactions of the system under test. Of course, directed search can be much more efficient than random search.
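The core loop of that idea fits in a few lines. The sketch below is a deliberate simplification (it assumes the z3-solver Python package, and the function under test is hypothetical): it records the branch condition a concrete run would take and asks the solver for an input that flips it:

# A toy step of dynamic symbolic execution: flip a recorded branch condition.
from z3 import Int, Not, Solver, sat

def program(x):
    if x * 3 + 7 == 2020:        # a branch random fuzzing is unlikely to hit
        return 'bug'
    return 'ok'

x = Int('x')                             # symbolic counterpart of the input
path_taken = Not(x * 3 + 7 == 2020)      # path condition of a typical concrete run

s = Solver()
s.add(Not(path_taken))                   # negate it to steer into the other branch
if s.check() == sat:
    new_input = s.model()[x].as_long()
    print(new_input, program(new_input))     # prints: 671 bug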
However, as helpful as white-box testing is in finding subtle security-critical bugs, it doesn’t guarantee that programs are free of all the possible errors. This is where model checking helps out. Model checking seeks to automatically check for the absence of categories of errors. The fundamental idea is to explore all possible executions using a finite and sufficiently small abstraction of the program state space. I often think of this as pruning away the state spaces that don’t matter.
For example, consider the statement $a = a + 1$, abstracted by a Boolean variable $b$ that tracks the predicate $a == a_{old}$. The abstraction is essentially a relation between the current and new values of $b$. SMT solvers are used to compute this relation by proving theorems, as in:
$$ a == a_{old} \rightarrow a + 1 \neq a_{old} $$
Proving this theorem is equivalent to checking the unsatisfiability of its negation, $a == a_{old} \wedge a + 1 == a_{old}$.
The theorem says that if the current value of $b$ is $\text{true}$, then after executing the statement $a = a + 1$, the value of $b$ will be $\text{false}$. Now, if $b$ is $\text{false}$, then neither of these conjectures is valid:
$$
a \neq a_{old} \rightarrow a + 1 == a_{old}
$$
or
$$
a \neq a_{old} \rightarrow a + 1 \neq a_{old}
$$
In practice, the SMT solver will produce a model for the negation of each conjecture. In this sense, the model is a counter-example to the conjecture: when the current value of $b$ is false, nothing can be said about its value after the execution of the statement. The result of these three proof attempts is then used to replace the statement $a = a + 1$ with:
if b then b = false; else b = *; end
A finite state model checker can now be used on the Boolean program and will establish that $b$ is always $\text{true}$ when control reaches this statement, verifying that calls to lock() are balanced with calls to unlock() in the original program. Concretely, the original loop
do {
    lock();
    old_count = count;
    request = GetNextRequest();
    if (request != NULL) {
        unlock();
        ProcessRequest(request);
        count = count + 1;
    }
} while (old_count != count);
unlock();
becomes:
do {
    lock();
    b = true;
    request = GetNextRequest();
    if (request != NULL) {
        unlock();
        ProcessRequest(request);
        if (b) b = false; else b = *;
    }
} while (!b);
unlock();
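The three proof attempts behind this abstraction are easy to reproduce. The sketch below (assuming the z3-solver Python package) checks each conjecture by testing its negation for unsatisfiability:

# Predicate abstraction of a = a + 1: a conjecture is valid iff its negation is unsat.
from z3 import Int, Implies, Not, Solver, unsat

a, a_old = Int('a'), Int('a_old')

def is_valid(conjecture):
    s = Solver()
    s.add(Not(conjecture))            # check the negation
    return s.check() == unsat         # unsat negation means the conjecture is valid

print(is_valid(Implies(a == a_old, a + 1 != a_old)))   # True:  b goes from true to false
print(is_valid(Implies(a != a_old, a + 1 == a_old)))   # False: solver gives a counter-example
print(is_valid(Implies(a != a_old, a + 1 != a_old)))   # False: so b becomes unknown (*)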
Application to static analysis
Static analysis tools work in much the same way as white-box fuzzing or directed search: they, too, check the feasibility of program paths. However, static analysis never requires actually running the program and can therefore analyze software libraries and utilities without instantiating all the details of their implementation. SMT applies to static analysis because its background theories accurately capture the semantics of most basic operations used by mainstream programming languages. While this fits nicely for functional programming languages, it isn’t always a perfect fit for languages such as Java, C#, and C/C++, which all use fixed-width bit-vectors as the representation for values of type int. In this case, the accurate theory for int is two’s-complement modular arithmetic. Assuming a bit-width of 32 bits, the maximal positive 32-bit integer is $2^{31}-1$, and the smallest negative 32-bit integer is $-2^{31}$. In the binary search example below, if both low and high are $2^{30}$, then low + high evaluates to $2^{31}$, which is treated as the negative number $-2^{31}$. The presumed assertion 0 ≤ mid < high therefore does not hold. Fortunately, several modern SMT solvers support the theory of “bit-vectors,” accurately capturing the semantics of modular arithmetic.
Let’s look at an example from a binary search algorithm:
int binary_search(int[] arr, int low, int high, int key) {
    assert (low > high || 0 <= low < high);
    while (low <= high) {
        // Find middle value
        int mid = (low + high) / 2;
        assert (0 <= mid < high);
        int val = arr[mid];
        // Refine range
        if (key == val) return mid;
        if (val > key) high = mid - 1;
        else low = mid + 1;
    }
    return -1;
}
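Using the theory of bit-vectors, a solver can find that overflow automatically. The sketch below (assuming the z3-solver Python package) asks whether the assertion 0 ≤ mid < high can fail under 32-bit two’s-complement arithmetic:

# Can the assertion 0 <= mid < high in binary_search fail for 32-bit ints?
from z3 import BitVec, Solver, And, Not, sat

low, high = BitVec('low', 32), BitVec('high', 32)
mid = (low + high) / 2                    # wraps on overflow; '/' is signed division

s = Solver()
s.add(low >= 0, low < high)               # the function's precondition (signed compares)
s.add(Not(And(mid >= 0, mid < high)))     # negation of the assertion
if s.check() == sat:
    m = s.model()
    print('assertion can fail: low =', m[low], ', high =', m[high])   # e.g. values whose sum exceeds 2**31 - 1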
Summary
SMT solvers combine SAT reasoning with specialized theory solvers either to find a feasible solution to a set of constraints or to prove that no such solution exists. Linear programming (LP) solvers are designed to find feasible solutions that are optimal with respect to some optimization function. Both are powerful tools that can be incredibly helpful to solve hard and practical problems in computer science.
One of the applications I follow closely is symbolic-execution based analysis and testing. Perhaps the most famous commercial tool that uses dynamic symbolic execution (aka concolic testing) is the SAGE tool from Microsoft. The KLEE and S2E tools (both of which are open-source tools, and use the STP constraint solver) are widely used in many companies including HP Fortify, NVIDIA, and IBM. Increasingly these technologies are being used by many security companies and hackers alike to find security vulnerabilities.