Sanitize copy/paste text from Word

In a recent project I had to deal with text copied from a Microsoft Word document and pasted into a textarea. Word automatically replaces certain characters, such as the ellipsis and quotes, with the typographic versions it thinks you meant. Inserting that text into a database was giving me errors. To solve the problem I wrote a sanitize function that replaces these characters with plain ASCII equivalents.

<?php
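// Used to sanitize Microsoft Word's special characters.
// Assumes $Text is single-byte Windows-1252 text: the array keys below are
// byte values, so UTF-8 input would have bytes inside multi-byte characters replaced.
// Good reference: http://www.lookuptables.com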

function SanitizeFromWord($Text = '') {

	$chars = array(
		130=>',',     // baseline single quote
		131=>'NLG',   // florin
		132=>'"', 	  // baseline double quote
		133=>'...',   // ellipsis
		134=>'**',	  // dagger (a second footnote)
		135=>'***',	  // double dagger (a third footnote)
		136=>'^', 	  // circumflex accent
		137=>'o/oo',  // per mille
		138=>'Sh',	  // S Hacek
		139=>'<',	  // left single guillemet
		140=>'OE',	  // OE ligature
		145=>'\'',	  // left single quote
		146=>'\'',	  // right single quote
		147=>'"',	  // left double quote
		148=>'"',	  // right double quote
		149=>'-',	  // bullet
		150=>'-',	  // endash
		151=>'--',	  // emdash
		152=>'~',	  // tilde accent
		153=>'(TM)',  // trademark ligature
		154=>'sh',	  // s Hacek
		155=>'>',	  // right single guillemet
		156=>'oe',	  // oe ligature
		159=>'Y',	  // Y Dieresis
		169=>'(C)',	  // Copyright
		174=>'(R)'	  // Registered Trademark
	);
	
	foreach ($chars as $chr=>$replace) {
		$Text = str_replace(chr($chr), $replace, $Text);
	}
	return $Text;
}
?>
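
If the form submits UTF-8 rather than Windows-1252, the same characters arrive as multi-byte sequences and the byte values above no longer line up. The following is only a sketch of how a UTF-8 counterpart might look. The function name `sanitizeFromWordUtf8` and the subset of characters it covers are my assumptions for illustration, not part of the function above. It uses `strtr()` with an array so each sequence is matched as a whole and characters such as é pass through untouched.

```php
<?php
// Hypothetical UTF-8 counterpart to SanitizeFromWord(). The name and the
// selection of characters are illustrative assumptions.
function sanitizeFromWordUtf8(string $text): string
{
    // Keys are UTF-8 byte escapes, so the encoding of this file does not matter.
    $map = array(
        "\xE2\x80\x98" => "'",    // U+2018 left single quote
        "\xE2\x80\x99" => "'",    // U+2019 right single quote
        "\xE2\x80\x9C" => '"',    // U+201C left double quote
        "\xE2\x80\x9D" => '"',    // U+201D right double quote
        "\xE2\x80\x93" => '-',    // U+2013 en dash
        "\xE2\x80\x94" => '--',   // U+2014 em dash
        "\xE2\x80\xA6" => '...',  // U+2026 ellipsis
        "\xE2\x80\xA2" => '-',    // U+2022 bullet
        "\xE2\x84\xA2" => '(TM)', // U+2122 trademark
    );

    // strtr() with an array replaces whole sequences and never rescans its output.
    return strtr($text, $map);
}

// Sample input: curly quotes, an ellipsis, an em dash and an accented "e".
$pasted = "\xE2\x80\x9CWait\xE2\x80\xA6\xE2\x80\x9D she said \xE2\x80\x94 it\xE2\x80\x99s caf\xC3\xA9 time";
echo sanitizeFromWordUtf8($pasted), "\n";
```

Which of the two functions fits depends on the encoding your page and database connection actually use, so it is worth checking that before picking one.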

Enjoy!
