Stripping nonsense characters in URLs with Apache mod_rewrite

I have had a problem plaguing me for a very long time. Google’s Webmaster Tools has been reporting incoming links to our website that return a 404 (Not Found) error. The weird thing is that the reported URL “looks” correct, and even after clicking the link it looks correct in the browser address bar. But if you right-click the link, copy it, and paste it into Notepad, you see the issue. For example, this link was reported in Webmaster Tools:

http://www.dtidata.com/free_data_recovery_%E2%80%8Bsoftware

Apparently the %E2%80%8B characters are URL-encoded bytes that get picked up by scraper sites, and the resulting links then show up as 404 errors in Webmaster Tools. Well, I don’t know about you, but I don’t like seeing errors no matter where the link is coming from. So I decided to figure out a way to fix these bad links and 301 redirect them to the appropriate page. I searched all over the place and came across this forum thread: http://www.webmasterworld.com/apache/4281239.htm which had a few decent examples of stripping out the %E2%80%8B characters. However, those examples only worked when the characters appeared at the beginning or end of the filename (e.g. www.example.com/file%E2%80%8B.htm). Others in the thread posted that the characters had started to appear in the middle of the filename and that the examples were not working for them. I continued to search and found no other examples of how to rewrite the URL when the characters appeared somewhere in the middle.
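For anyone who wants to see why such a link looks fine in the address bar, here is a minimal Python sketch that percent-decodes the reported path and lists any invisible characters in it. The variable names are my own, and filtering on the Unicode "Cf" (format) category is just one assumed way of spotting characters that render as nothing.

```python
import unicodedata
from urllib.parse import unquote

# The path reported in Webmaster Tools (taken from the link above).
bad_path = "/free_data_recovery_%E2%80%8Bsoftware"

# unquote() percent-decodes the string and interprets the bytes as UTF-8 by default.
decoded = unquote(bad_path)

# Collect characters in the "Cf" (format) category, which are not drawn on screen.
hidden = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch))
    for ch in decoded
    if unicodedata.category(ch) == "Cf"
]

print(hidden)
# The decoded path is one character longer than the clean one, even though
# the two look identical when displayed.
print(len(decoded) - len("/free_data_recovery_software"))
```

The three encoded bytes collapse into a single character that takes up no space on screen, which is why copying the link into a plain text editor is the easiest way to notice it.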

So I decided to go to Apache’s website and learn the syntax of the rewrite rules and finally came up with something that works (at least for me).

The following rule:

RewriteEngine on
RewriteCond %{THE_REQUEST} ^[A-Z]{3,9}\ /(.*)\%E2%80%8B(.*)\ HTTP/ [NC]
RewriteRule ^.*$ http://www.example.com/%1%2 [R=301,L]

That is what finally worked for me. The %1 and %2 in the rewrite rule correspond to what is matched by the first and second sets of parentheses in the rewrite condition, that is, everything before and everything after the %E2%80%8B sequence. Replace example.com with your domain and you should be good to go! Hopefully this will help others who have been banging their heads trying to figure this out like I was.
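To see what the two sets of parentheses capture, the condition can be approximated outside Apache. The following is only a rough Python model, not how Apache itself evaluates the rule: the function name redirect_target and the host parameter are made up for illustration, and re.IGNORECASE stands in for the [NC] flag.

```python
import re

# Same pattern as the RewriteCond above; re.IGNORECASE plays the role of [NC].
# THE_REQUEST is the raw request line, e.g. "GET /page HTTP/1.1".
PATTERN = re.compile(r"^[A-Z]{3,9}\ /(.*)\%E2%80%8B(.*)\ HTTP/", re.IGNORECASE)


def redirect_target(request_line, host="www.example.com"):
    """Return the URL the rule would redirect to, or None if the condition fails.

    Hypothetical helper: group(1) and group(2) mirror %1 and %2.
    """
    match = PATTERN.search(request_line)
    if match is None:
        return None
    return "http://" + host + "/" + match.group(1) + match.group(2)


# One stray sequence in the middle of the filename.
print(redirect_target("GET /free_data_recovery_%E2%80%8Bsoftware HTTP/1.1"))

# Two stray sequences: the first group is greedy, so only the last one is
# removed and the first is still present in the redirect target.
print(redirect_target("GET /a%E2%80%8Bb%E2%80%8Bc HTTP/1.1"))

# A clean request does not match, so no redirect would be issued.
print(redirect_target("GET /clean HTTP/1.1"))
```

If this model is accurate, a URL containing the sequence more than once would be cleaned up one occurrence per redirect, since the redirected request hits the same rule again.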

Comments

  1. Matthew says:

    Perfect. Worked like a charm. I hunted for about an hour trying to find this.

  2. Thanks! Excellent countermeasure. We are experiencing the same ish as well.

    I did a bit of extra hunting, and it turns out %E2%80%8B, taken as UTF-8, translates to the Unicode value of U+200B (hex) and its HTML entity is &#8203; … also known as a “zero-width space”:

    http://en.wikipedia.org/wiki/Zero_Width_Space
    http://www.fileformat.info/info/unicode/char/200b/index.htm
    http://www.unicode.org/charts/PDF/U2000.pdf (column “200” and row “B”)
