I consider myself a seasoned PHP Web Developer, with over 10 years working in the field and with all the care and attention I've given Web Development over those 10 years it frustrated me recently to come across a problem that a client was having which baffled me, seriously baffled me!
My client has a TAB Deliminated product feed file (CSV), which he modifies using Excel and then uploads into his website, the file is then processed into a MySQL database, and this has worked fine for a very long time, but then suddenly a problem was noticed with the descriptions where randomly on some products the description just cuts off unexpectedly.
Checking the database I could see that indeed in the database the content is cutting off, when I first saw this I immediately recognised that an apostrophe was where the problem was happening, GREAT I shrieked, the problem is due to sanitisation!
So I quickly loaded up all the source files and painstakingly went through line by line following the process, I could see the code was using PHP's fgetcsv function for loading up the CSV file, all looks great so far. Then I came to the queries that were inserting the data into the tables, unfortunately everything was being sanitized correctly.
So what's the problem??
Ok so now I was back to the drawing board, I looked at the CSV and looked at the database and sat there for a few hours trying to think what could be causing this problem! It's not possible I said, there is no way, this can't be happening (every developer says this about 12 times a day then realises their mistake soon after). But not today...
Eventually I realised that the apostrophe was not the standard U+0027, instead it was a grave accent U+0060.
Unfortunately some software packages such as Microsoft Word do not use the standard apostrophe and instead can use any of the following:
|U+0022||QUOTATION MARK||neutral (vertical), used as opening or closing quotation mark; preferred characters in English for paired quotation marks are U+201C and U+201D|
|U+0027||APOSTROPHE||neutral (vertical) glyph having mixed usage; preferred character for apostrophe is U+2019; preferred characters in English for paired quotation marks are U+2018 and U+2019|
|U+2018||LEFT SINGLE QUOTATION MARK|
|U+2019||RIGHT SINGLE QUOTATION MARK||this is the preferred character to use for apostrophe, automatically used in Word Documents.|
|U+201C||LEFT DOUBLE QUOTATION MARK||this is the preferred character to use for quotation marks, automatically used in Word Documents.|
|U+201D||RIGHT DOUBLE QUOTATION MARK||this is the preferred character to use for quotation marks, automatically used in Word Documents.|
This presented a huge problem in my clients case, the database was UTF8 and PHP was in UTF8 but for some reason this character was killing the MySQL insert statement, but strangely applying thew rest of the data columns in the statement, which I thought was particularly strange.
I tried using various methods such as converting the text using utf8_encode and iconv however this did not rectify my clients issue.
In the end I came up with a simple function to convert Smart Quotes into neutral (vertical) characters, and it works really well.
$search = array(chr(145),
$replace = array("'",
return str_replace($search, $replace, $string);