Let us start with the taint analysis vulnerability model. Here is the definition of a vulnerability used in the model.
1. All data originating from web application users is untrusted. To track this data most analyzers associate a special "taint" mark with it.
2. All local data (file system, database, etc.) is trusted (we do not want to address local threats).
3. Untrusted data can be made trusted through special kinds of processing, which we will call sanitization. Thus, if the analyzer detects that data marked with the "taint" flag is passed through a sanitization routine, the flag will be removed.
4. Untrusted data must not reach security-critical operations such as database queries, HTTP response generation, evals, etc. A violation of this rule is called a vulnerability.
A security analyst who is going to use this model will be required to:
1. Compile a list of language constructs that return user input. In some technologies (e.g. PHP) these constructs are built-in; in others (e.g. Python) they are framework-dependent (consider obtaining HTTP parameters in mod_python vs WSGI).
2. Compile a list of sanitization constructs (built-in, from 3rd-party libraries, or even implemented by the web application developer).
3. Compile a list of critical operations (these are mostly built-in).
Together, these lists become the configuration for the taint analysis.
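As a rough illustration, the three lists above could be written down as plain data. The sketch below is in Python and is not tied to any real analyzer: the `TaintConfig` class and its `classify` method are hypothetical names, and the entries are simply the PHP constructs that appear in this post.

```python
from dataclasses import dataclass, field


@dataclass
class TaintConfig:
    """Hypothetical container for the three lists the analyst compiles."""
    sources: set = field(default_factory=set)     # constructs returning user input
    sanitizers: set = field(default_factory=set)  # routines that remove the taint mark
    sinks: set = field(default_factory=set)       # security-critical operations

    def classify(self, name: str) -> str:
        """Return the role a construct plays in the analysis."""
        if name in self.sources:
            return "source"
        if name in self.sanitizers:
            return "sanitizer"
        if name in self.sinks:
            return "sink"
        return "neutral"


# Example entries for PHP; a real configuration would be much longer.
php_config = TaintConfig(
    sources={"$_GET", "$_POST", "$_COOKIE"},
    sanitizers={"htmlspecialchars", "addslashes"},
    sinks={"mysql_query", "echo", "eval"},
)

print(php_config.classify("htmlspecialchars"))
```

Anything missing from these lists is treated as neutral, which is exactly why omissions in the configuration matter.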
Let us point out some limitations of the model.
1. The first is a minor limitation that can easily be overcome. The basic definition does not support classes of untrusted data. This means that the following code snippet will not produce a vulnerability warning:
$skip = $_GET['skip'];
$skip = htmlspecialchars($skip);
mysql_connect();
mysql_select_db("myDB");
$query = "SELECT text FROM news LIMIT 10 OFFSET ".$skip;
$result = mysql_query($query);
Indeed, htmlspecialchars will be listed as a sanitization routine, so the "taint" flag attached to $skip at line 1 will be removed at line 2. But the program does have an SQL injection vulnerability, because htmlspecialchars is designed for HTML output and does not make a value safe for use in this SQL query. This undetected vulnerability adds to false negatives and affects the completeness of the analysis.
The limitation can be overcome by introducing "taint" classes: "SQLI-untrusted", "XSS-untrusted", "Shell-injection-untrusted", etc. A sanitization routine then removes only the classes it actually handles.
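A minimal sketch of taint classes, in Python, might look as follows. The class names, the two mapping tables and the helper functions are assumptions made for illustration; in particular, treating addslashes as an SQLI sanitizer simply follows the examples in this post and is itself a simplification.

```python
# Taint classes assumed for this sketch.
ALL_CLASSES = frozenset({"SQLI", "XSS", "SHELL"})

# Which classes each sanitizer clears (illustrative assumption).
SANITIZER_CLEARS = {
    "htmlspecialchars": {"XSS"},
    "addslashes": {"SQLI"},
}

# Which class each critical operation is sensitive to.
SINK_CLASS = {
    "mysql_query": "SQLI",
    "echo": "XSS",
    "system": "SHELL",
}


def sanitize(taint: frozenset, routine: str) -> frozenset:
    """Remove only the classes that `routine` is known to handle."""
    return taint - SANITIZER_CLEARS.get(routine, set())


def is_vulnerable(taint: frozenset, sink: str) -> bool:
    """A sink is vulnerable if the data still carries the sink's class."""
    return SINK_CLASS[sink] in taint


# Replaying the snippet above: $_GET -> htmlspecialchars -> mysql_query
skip = ALL_CLASSES                          # user input carries every class
skip = sanitize(skip, "htmlspecialchars")   # only the XSS class is cleared
print(is_vulnerable(skip, "mysql_query"))   # True: SQLI taint is still present
print(is_vulnerable(skip, "echo"))          # False: XSS taint was removed
```

With a single boolean flag the first check would have come back clean; with classes the SQL injection is reported.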
2. The second limitation lies in the assumption that all local data is trusted. This makes it impossible to detect multi-module vulnerabilities (also known as second-order injections). Let us consider the following example:
addpost.php
$text = $_GET['text'];
$text = addslashes($text);
mysql_connect();
mysql_select_db("myDB");
$query = "INSERT INTO posts (text) VALUES ('".$text."')";
mysql_query($query);
// do redirect
viewpost.php
$skip = $_GET['skip'] + 0;
mysql_connect();
mysql_select_db("myDB");
$query = "SELECT text FROM posts LIMIT 10 OFFSET ".$skip;
$result = mysql_query($query);
while ($row = mysql_fetch_assoc($result)) {
echo $row['text'];
}
This code demonstrates the simplest stored XSS vulnerability. However, due to the second assumption of the model, the analyzer will treat text data returned from the database as trusted. This limitation also leads to undetected vulnerabilities and affects the completeness of the analysis.
The limitation can be overcome by introducing inter-module data dependency analysis. This approach has been described in the following papers:
[2007] Multi-Module Vulnerability Analysis of Web-based Applications
[2008] Detecting Security Vulnerabilities in Web Applications Using Dynamic Analysis with Penetration Testing
It is likely that most commercial static analyzers have adopted this approach by now.
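One way to picture inter-module dependency analysis, without claiming this is how the cited papers or any commercial tool implement it: the analyzer records the taint of whatever is written to each database column, and any read from that column inherits it. The sketch below is in Python; the `stored_taint` map, both helper functions and the per-class taint sets are hypothetical.

```python
# Taint classes and sanitizer effects assumed for this sketch.
ALL_CLASSES = frozenset({"SQLI", "XSS"})
SANITIZER_CLEARS = {"addslashes": {"SQLI"}, "htmlspecialchars": {"XSS"}}

# Taint recorded per (table, column), shared across all analyzed modules.
stored_taint = {}


def record_write(table, column, taint):
    """Called when a module writes a value into a column."""
    key = (table, column)
    stored_taint[key] = stored_taint.get(key, frozenset()) | taint


def taint_of_read(table, column):
    """Called when a module reads a column; unknown columns stay trusted."""
    return stored_taint.get((table, column), frozenset())


# addpost.php: $_GET['text'] -> addslashes -> INSERT INTO posts (text)
text = ALL_CLASSES - SANITIZER_CLEARS["addslashes"]
record_write("posts", "text", text)

# viewpost.php: SELECT text FROM posts -> echo
row_text = taint_of_read("posts", "text")
print("XSS" in row_text)  # True: the value read back still carries XSS taint
```

The database column stops being unconditionally trusted and becomes a channel that carries taint from addpost.php to viewpost.php.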
3. The third limitation lies in the third rule, which assumes that sanitization is performed by special routines that always return "good" data. But what about input validation through conditional checks? This limitation cannot be overcome without integrating other vulnerability models. Let us consider the following example:
$email = $_GET['email'];
$valid_email_pattern = "..."; //complex reg exp from RFC 822
if (preg_match($valid_email_pattern, $email)) {
// do processing
} else {
echo "You have entered an invalid email address: ".$email; //XSS
exit;
}
It is unclear how to "untaint" variables that are validated via such checks. In the code above, the variable $email should be "untainted" in the branch where the call to preg_match returns true. Could we use this as a rule to untaint all variables passed through preg_match? Obviously not! Let's take a look at another example:
$email = $_GET['email'];
$invalid_email_pattern = "..."; //negated complex reg exp from RFC 822
if (preg_match($invalid_email_pattern, $email)) {
echo "You have entered an invalid email address: ".$email; //XSS
exit;
} else {
// do processing
}
So, in general we cannot determine in which branch we should "untaint" variables validated via conditional statements. Let us examine how analyzers could handle this issue. Basically, there are two options:
- preserve "taint" flag in both branches. This leads to false positives and affects precision of the analysis.
- remove "taint" flag from both branches. This leads to false negatives and affects completeness of the analysis.
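The trade-off between these two options can be replayed on the two preg_match examples above. The Python sketch below is purely illustrative: `branch_flags` and `outcome` are hypothetical helpers, and the analysis is reduced to a single boolean flag per branch.

```python
def branch_flags(policy: str) -> dict:
    """Taint flag of a user-supplied variable in the 'then' and 'else'
    branches of a validation check, under one of the two coarse policies."""
    if policy == "preserve":
        flag = True    # still tainted in both branches
    elif policy == "remove":
        flag = False   # untainted in both branches
    else:
        raise ValueError(f"unknown policy: {policy}")
    return {"then": flag, "else": flag}


def outcome(policy: str, echo_branch: str) -> dict:
    """What the analyzer reports when the vulnerable echo sits in
    `echo_branch` and the legitimate processing sits in the other branch."""
    flags = branch_flags(policy)
    processing_branch = "else" if echo_branch == "then" else "then"
    return {
        "xss_reported": flags[echo_branch],              # missed XSS if False
        "processing_flagged": flags[processing_branch],  # false positive if True
    }


# First example: the echo is in the 'else' branch (valid-email pattern).
# Second example: the echo is in the 'then' branch (invalid-email pattern).
for policy in ("preserve", "remove"):
    for echo_branch in ("else", "then"):
        print(policy, echo_branch, outcome(policy, echo_branch))
```

Under "preserve" the XSS is reported in both examples but the validated branch is flagged too; under "remove" nothing is flagged and the XSS is missed, regardless of which branch holds the echo.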
4. The last, minor drawback is the implicit trust placed in sanitization routines and in the operator who compiles the configuration. If a sanitization routine contains an error (e.g. it is incomplete, does not perform normalization, or is susceptible to bypass techniques), the resulting vulnerability will not be detected. Likewise, the analysis can become incomplete or imprecise if the configuration lists of input, sanitization and critical routines were compiled with errors or omissions.
I hope you have found this blog post useful and I’m always interested in hearing any feedback you have.
P.S. In the next post we will discuss the limitations of the other vulnerability model, namely the parse-tree model.