Friday, August 19, 2011

Building a benchmark for SQL injection scanners

Intro

In the last couple of years we have seen a lot of emerging projects aiming to automate web application vulnerability analysis. That's right, I mean security scanners. Just to name a few: w3af, skipfish, Grendel-Scan, arachni, wapiti, SecuBat, sqlmap, hexjector, SQLiX and many more.

I like to group security scanners according to their feature sets:
  • general purpose vs special-purpose (testing for SQLi or XSS only);
  • detection only vs detection + exploitation.
Naturally, several good questions arise:
  1. Is it true that special-purpose tools perform better than general-purpose ones?
  2. Is it true that commercial tools perform better than free ones?
  3. Is it possible to find an ultimate champion for a certain vulnerability class (e.g. SQLi)? If not, which tools would produce the best result when combined?
Our goal was to answer these questions for a specific class of web application vulnerabilities, namely SQLi. The scope:
  1. We are not interested in second order SQLi vulnerabilities. Reason: there are no feasible techniques to detect second order SQLi in a black-box setting.
  2. We are not interested in crawling capabilities of the scanners.
  3. We are not interested in measuring exploitation capabilities.
In order to answer the questions we created a benchmark - a comprehensive set of vulnerable and non-vulnerable test cases, and tested several scanners against it.
This is the point where you should ask:
- "Why the hell should I care? There have been a lot of similar efforts. Do you provide a proof that your test set is complete or something?"

In fact, yes. Our approach is not ad hoc. Let's take a look at it.
First of all we will outline the capabilities of the existing SQLi detection techniques. After that we will proceed to a discussion of the problems that arise during automation. Finally, we will present our approach to benchmarking SQLi scanners.

Existing techniques for SQLi detection are not complete!

A black-box scanner does not have a lot of information channels to make its decisions about the presence or absence of vulnerabilities:
  • HTTP status code;
  • HTTP headers (e.g. Location, Set-Cookie, etc.);
  • HTTP body;
  • HTTP response delay;
  • out-of-band channels.

By definition, an SQLi vulnerability is the ability to alter the syntactic structure of an SQL statement. The main idea behind existing black-box SQLi detection techniques is to inject exactly those statements which, when evaluated by the back end, produce a change measurable from the outside (an altered status code or response body, a delay, an issued DNS request, etc.).
To sum things up, the community has come up with several SQLi detection techniques:
  • error-based;
  • content-based (or blind);
  • time-based;
  • out-of-band.
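
For illustration, here is a hedged sketch of the probe values a scanner might send for a parameter id=1 (MySQL syntax; these are generic textbook payloads, not taken from any particular scanner):
$probes = array(
    'error-based'   => "1'",             // broken syntax should surface a DBMS error message
    'content-true'  => "1 AND 1=1",      // content-based (blind): compare the response to a true condition...
    'content-false' => "1 AND 1=2",      // ...with the response to a false one
    'time-based'    => "1 AND SLEEP(5)", // a measurable delay means the injected expression was evaluated
);
// out-of-band probes are DBMS-specific (e.g. forcing the back end to issue a DNS lookup or an HTTP
// request to an attacker-controlled host) and are omitted here.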

Alas, this set is not complete. Consider a code snippet which extracts the user-agent value from an HTTP request and stores it in the back-end DBMS without validation:
$ua = $_SERVER["HTTP_USER_AGENT"];
$rs = mysql_query("SELECT id FROM user_agents WHERE user_agent = '$ua'");
$num_rows = mysql_num_rows($rs);
if ($num_rows == 0) {
    // insert the user agent into the database
}

Furthermore, suppose that all possible exceptions are caught and suppressed, and that the DBMS does not support functions like sleep (MS Access, for instance). [Forget about heavy queries too ;)] This means that from the outside we can influence neither the response itself nor the response delay. Yet this may well be an exploitable vulnerability (e.g. consider web-shell installation).
Now that we know there are certain SQLi instances which, in general, cannot be detected by existing techniques, it is time to survey the potential for automation.

Automation challenges

SQLi scanner developers face two main challenges:
  1. An injected input should leave the whole query syntactically correct. Thus, a scanner should somehow infer the injection point (e.g. tell the difference between an injection into a column name in a SELECT statement and an injection after the LIMIT keyword).
  2. An algorithm for page comparison should tell apart a "true" page and a "false" page when doing blind SQLi. This is an easy task for a human with cognitive comprehension, and almost impossible in the general case for a machine. Modern web applications may have an irregular page structure and produce different content in response to identical HTTP requests (ads, social widgets, etc.), which makes the situation even worse.
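
To make the second challenge concrete, here is a minimal, naive sketch of a page-comparison routine (purely illustrative; real scanners use far more elaborate heuristics):
// Strip the most volatile content, then compare what is left.
function pages_match($body_a, $body_b, $threshold = 0.98) {
    // crude normalization: drop script blocks and collapse whitespace
    $a = preg_replace(array('#<script.*?</script>#is', '/\s+/'), array('', ' '), $body_a);
    $b = preg_replace(array('#<script.*?</script>#is', '/\s+/'), array('', ' '), $body_b);
    similar_text($a, $b, $percent);           // built-in similarity measure
    return ($percent / 100.0) >= $threshold;  // treat the pages as "the same" if similar enough
}
Such a routine breaks down exactly in the cases our benchmark covers: unstable pages push the similarity below any fixed threshold even when nothing SQL-related has changed.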

Our approach

The idea behind our approach is to establish a classification of all possible implementations of a single web application module interacting with a database. After that we would be able to produce all possible modules (or scripts, if you like), which would make up a test set. Why do we need a classification anyway? Well, to reason about test set completeness, of course.

At the current level of abstraction we define five steps which typically constitute a common web application module:
  1. Get user input.
  2. Validate user input.
  3. Construct a query.
  4. Perform a query and handle the result.
  5. Construct and issue an HTTP response.

We then took a closer look at each step and enumerated all possible ways of implementing it.
Here's what we got.

Classification of environments

Criterion №1: DBMS and version.
Classes: all possible DBMSes with versions.
Reason: the feature set of an SQL dialect is determined by the DBMS version. Almost every new major DBMS version ships with new built-in functions, which can be used in SQLi attack vectors.
Current implementation: supports only MySQL 5.0.

Criterion №2: exception suppression settings.
Classes: exceptions suppressed, exceptions not suppressed.
Comment: this is an environment setting; thus it does not relate to a possible try-catch block in a module.
Reason: this criterion allows us to measure how well the error-based technique is implemented.
Current implementation: via PHP display_errors setting.

Criterion №3: execution time limit settings.
Classes: limited to 1 second, unlimited.
Reason: this criterion allows us to check whether a scanner implements both the time-based and the blind SQLi detection techniques and is able to pick the correct one.
Current implementation: Not implemented. Planned: PHP max_execution_time setting.
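
A guess at how these two environment criteria could be toggled per test case ($suppress_errors and $limit_time are hypothetical generator parameters, not actual names from the benchmark):
ini_set('display_errors', $suppress_errors ? '0' : '1');    // Criterion №2: exception suppression
ini_set('max_execution_time', $limit_time ? '1' : '0');     // Criterion №3 (planned); 0 means unlimited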

Classification of the "Get user input" step

Criterion №4: location of the payload.
Classes: GET-parameter name, GET-parameter value, URL path component, POST-parameter name, POST-parameter value, cookie parameter name, cookie parameter value, header name, header value.
Reason: a scanner should be capable of injecting payloads not only into GET and POST parameters, but also into cookies and other headers.
Current implementation: only one GET parameter is used to obtain user input.

Classification of the "Validate user input" step

Criterion №5: maximum length enforcement.
Classes: user-defined lengths, unlimited length.
Reason: a scanner should strive to make its probe vectors as tiny as possible.
Current implementation: a test bench operator can alter this limit as needed. By default it is set to 32.
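
A sketch of how a test case might enforce this limit (hypothetical; the real generator may differ):
$id = $_GET['id'];
if (strlen($id) > 32) {          // 32 is the default limit mentioned above
    die('Parameter value is too long');
}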

Criterion №6: sanitization approach.
Classes:
proper escaping (prepared statement or built-in escaping function);
improper usage of built-in escaping facilities;
manual escaping:
- of single quotes;
- of double quotes;
- of slashes;
manual removal:
- of single quotes;
- of double quotes;
- of slashes;
- of SQL whitespaces;
- of SQL delimiters;
- of SQL keywords;
manual regexp-like check with error generation:
- of single quotes;
- of double quotes;
- of slashes;
- of SQL whitespaces;
- of SQL delimiters;
- of SQL keywords.
In fact, to obtain all classes from the latter three bullets (manual handling) we should get all permutations of them.
Reason: a perfect scanner should be able to bypass any flawed input validation.
Current implementation: proper escaping (mysql_real_escape_string inside quotes), improper usage of built-in escaping facilities (mysql_real_escape_string for numbers), manual escaping of single quotes, manual escaping of double quotes, manual escaping of all quotes, manual removal of all quotes, manual removal of SQL whitespaces.
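
Hedged illustrations of several of the sanitization classes listed above (the table and column names are made up):
$q1 = "SELECT id FROM news WHERE title = '" . mysql_real_escape_string($title) . "'";   // proper escaping
$q2 = "SELECT id FROM news WHERE id = " . mysql_real_escape_string($id);                // improper usage: no quotes, nothing to escape
$q3 = "SELECT id FROM news WHERE title = '" . str_replace("'", "\\'", $title) . "'";    // manual escaping of single quotes only
$q4 = "SELECT id FROM news WHERE title = '" . str_replace(' ', '', $title) . "'";       // manual removal of SQL whitespaces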

Classification of the "Construct a query" step

Criterion №7: injection point.
Classes: all possible injection points according to DBMS documentation.
Reason: a perfect scanner should be able to detect SQLi in all query types.
Current implementation:
- after the SELECT keyword in the field list:
  - inside/without backquotes;
  - inside brackets (nesting levels: 1, 2, 3);
  - as the last or a middle (including the first) argument of an SQL function;
- in the WHERE clause:
  - in a string/numeric literal;
  - inside brackets (nesting levels: 1, 2, 3);
  - in the left/right part of a condition ($id=id vs id=$id);
  - as the last or a middle (including the first) argument of an SQL function;
- after ORDER BY/GROUP BY in a field name:
  - inside/without backquotes;
- after ORDER BY/GROUP BY in an expression (ORDER BY `price` * $discount*…):
  - in a string/numeric literal;
  - inside brackets (nesting levels: 1, 2, 3);
  - as the last or a middle (including the first) argument of an SQL function;
- after ORDER BY/GROUP BY in the sort order ASC/DESC (ORDER BY `price` $style).
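
For clarity, here are hedged examples of query templates for a few of the injection points above ($in stands for the user input; the tables and columns are invented):
$q1 = "SELECT `$in` FROM news";                      // after SELECT: field name in backquotes
$q2 = "SELECT id FROM news WHERE title = '$in'";     // WHERE clause: string literal
$q3 = "SELECT id FROM news WHERE id = ((($in)))";    // WHERE clause: inside brackets, nesting level 3
$q4 = "SELECT id FROM news ORDER BY `price` $in";    // after ORDER BY: sort order ASC/DESC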

Classification of the "Perform a query and handle the result" step

Criterion №8: expected result.
Classes: result is not expected (DML queries), one field expected, one row expected, multiple rows expected.
Reason: the expected result type influences how a scanner should detect a potential SQLi. For example, if the query result is discarded, it is useless to try to alter the result set using boolean conditions (blind SQLi).
Current implementation: result is not expected (DML queries), one field expected, one row expected, multiple rows expected.
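
Illustrative sketches of the four result-handling classes (not the generator's actual code):
mysql_query($q);                                           // result is not expected (DML)
$field = mysql_result(mysql_query($q), 0);                 // one field expected
$row   = mysql_fetch_assoc(mysql_query($q));               // one row expected
$rs = mysql_query($q);
while ($row = mysql_fetch_assoc($rs)) { $rows[] = $row; }  // multiple rows expected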

Criterion №9: error handling.
Classes: error results in a different page (no DBMS error message), error results in a page with a DBMS error message, error is suppressed silently.
Reason: a scanner should switch to the blind or the error-based technique accordingly.
Current implementation: implemented.
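
Hedged sketches of the three error-handling classes:
$rs = mysql_query($q) or die(mysql_error());               // a page with a DBMS error message
$rs = mysql_query($q) or die('Something went wrong');      // a different page, no DBMS error message
$rs = @mysql_query($q);                                     // the error is suppressed silently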

Classification of the "Construct and issue an HTTP response" step

Criterion №10: which response part depends on the SQL result.
Classes: status code, header, text within the body, markup within the body.
Reason: a perfect scanner should detect changes caused by a successful injection in any part of the HTTP response.
Current implementation: Location header, text within body, markup within body.
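
Made-up examples of tying different response parts to the query result:
if (mysql_num_rows($rs) > 0) {
    header('Location: /found.php');        // a header depends on the result
} else {
    header('Location: /notfound.php');
}
// ...or, in other test cases:
echo mysql_num_rows($rs) ? 'Record found' : 'Nothing found';                          // text within the body
echo mysql_num_rows($rs) ? '<table><tr><td>item</td></tr></table>' : '<p>empty</p>';  // markup within the body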

Criterion №11: response stability.
Classes: stable, unstable text, unstable initial DOM (before JS evaluation), unstable resulting DOM (after JS evaluation).
Reason: a perfect scanner should detect changes caused by a successful injection in any part of the HTTP response regardless of DOM stability (ads, social widgets, etc.).
Current implementation: stable, unstable text, unstable initial DOM.
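
A guess at how response instability might be simulated by a test case:
$ads = array('Buy one get one free!', 'Hot summer sale!', 'New arrivals!');
echo '<p>Ad of the day: ' . $ads[array_rand($ads)] . '</p>';        // unstable text
if (rand(0, 1)) {                                                    // unstable initial DOM
    echo '<div class="promo"><a href="/promo">Click me</a></div>';
}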

The resulting test set is the full combination of all the classes (their Cartesian product). For example, among others there is a test case with:
C1: MySQL 5.0 backend;
C2: suppressed exceptions;
C3: unlimited execution time;
C4: a test module would get input from the GET parameter value;
C5: without maximum length enforcement;
C6: with improperly used mysql_real_escape_string (for numbers);
C7: as the first string argument of a built-in function after the ORDER BY keyword;
C8: one row would be expected;
C9: errors are silently suppressed;
C10: a query result determines a redirection destination (i.e. the value of the Location header);
C11: the response (redirection) is stable except the Date header.
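
A hedged sketch of what such a generated test case might look like (the names and exact structure are made up and only loosely follow the combination above):
$price = mysql_real_escape_string($_GET['price']);                       // C4-C6: escaping is useless without quotes
$q  = "SELECT id, name FROM goods ORDER BY FIELD($price, 10, 20, 30)";   // C7: first argument of a built-in after ORDER BY
$rs = @mysql_query($q);                                                   // C2, C9: errors are suppressed
if ($rs && mysql_num_rows($rs) == 1) {                                    // C8: one row is expected
    $row = mysql_fetch_assoc($rs);
    header('Location: /item.php?id=' . $row['id']);                       // C10: the result drives the Location header
} else {
    header('Location: /list.php');
}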

Implementation

We have implemented our test bench as a virtual machine containing a web server, a PHP interpreter, the test generator, and the result analyzer. It can be downloaded from here.
Currently the generator creates 27680 test cases (both vulnerable and non-vulnerable). A test case is a PHP file which receives one GET parameter. There is also an index file which links to all the test cases; this index file is the starting point for the scanners.
We have also implemented wrappers for several scanners: sqlmap, skipfish, wapiti, and w3af. These wrappers, together with a specially designed scheduler, allowed us to run several scanner instances in parallel from the VM localhost.
Please refer to the README files inside the archive for further information about the environment and its usage.

Evaluation

Here's just a small portion of the overall table.
                 Positives      False positives    Positives with       False positives with
                 (vulnerable)   (not vulnerable)   unstable response    unstable response
Total            16224          11456              2016                 840
sqlmap 0.8-1     11282          29                 576                  229
skipfish 1.81b   10937          37                 82                   37
wapiti 2.2.1     11151          58                 134                  43
acunetix         7395           0                  1003                 0
w3af 1.0-rc5     13316          198                1572                 126

Comments on the numbers:
  • The scores are not normalized; thus, it is not correct to declare the scanner with the highest score the best one. For example, most of the test cases are ones with unsuppressed errors, so a scanner that performs best with the error-based SQLi detection technique has a better chance of getting the highest overall score. This is the case with Wapiti.
  • The main feature of our test bench is that it allows you to get scores for custom classes of test cases once a scanner has done a run. For example, you could define a class "blind SQLi after the ORDER BY keyword" and re-compute the results for this class. For the time being I have refrained from digging into the interpretation of more granular classes of test cases.

Just a few facts.
Sqlmap 0.8:
- does not detect SQLi with output into HTTP headers;
- lacks intelligence with unstable output;
- fails on tests with SQLi into table fields surrounded by backquotes.

Skipfish 1.81b:
- does not detect SQLi with output into HTTP headers;
- lacks intelligence with unstable output;
- low false positive rate.

Wapiti 2.2.1:
- only error-based and time-based techniques are implemented;
- fails on tests with SQLi into table fields surrounded by backquotes;
- fails on tests with SQLi inside nested brackets.

Acunetix 7.0.0:
- zero false positives;
- often fails to detect SQLi with output into HTTP headers;
- often fails to detect time based SQLi even if it was successful;
- not so good with non error-based techniques.

w3af 1.0-rc5:
- best implementation of error-based technique;
- often fails if the normal query returns zero rows (as with login pages).

Outro

More analytics - in a few weeks.
Care to contribute to extending the test classes? YOU ARE WELCOME! Contact us! There's so much to be done!

Side note #1: one of the most comprehensive scanner surveys is the one made by Shay Chen. If you haven't checked it out yet - you should!!!

Credits:
Karim Valiev - implementation of benchmark environment, classification.
Andrew Petukhov - main idea and classification.