<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>mikederoche.com - /dev</title>
	<atom:link href="http://mikederoche.com/dev/feed/" rel="self" type="application/rss+xml" />
	<link>http://mikederoche.com/dev</link>
	<description></description>
	<lastBuildDate>Sun, 24 Apr 2011 19:49:18 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
		<item>
		<title>Naive Bayes Classifiers</title>
		<link>http://mikederoche.com/dev/2011/04/23/naive-bayes-classifiers/</link>
		<comments>http://mikederoche.com/dev/2011/04/23/naive-bayes-classifiers/#comments</comments>
		<pubDate>Sat, 23 Apr 2011 22:02:23 +0000</pubDate>
		<dc:creator>tangibleLime</dc:creator>
				<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[GameMaker]]></category>

		<guid isPermaLink="false">http://mikederoche.com/dev/?p=5</guid>
		<description><![CDATA[Note: Knowledge of simple probability theory recommended. Short Introduction The Naive Bayes Classifier (often shortened to NBC) is a probabilistic classifier based on Bayes&#8217; theorem using naive independence assumptions. It is used as an underlying component of many artificial intelligence systems. The idea behind NBC is simple. Suppose we have some hypothesis X which we <a href='http://mikederoche.com/dev/2011/04/23/naive-bayes-classifiers/'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p><i><b>Note:</b>  Knowledge of simple probability theory recommended.</i></p>
<p><span style="font-size: 14"><b>Short Introduction</b></span></p>
<p>The <b>Naive Bayes Classifier</b> (often shortened to <b>NBC</b>) is a probabilistic classifier based on Bayes&#8217; theorem using naive independence assumptions.  It is used as an underlying component of <i>many</i> artificial intelligence systems.</p>
<p>The idea behind NBC is simple.  Suppose we have some hypothesis <i>X</i> whose prior odds we know: <i>O(X)</i>.  Now suppose we come across several pieces of evidence that are either positive or negative examples of our hypothesis (which I will refer to as &#8220;features&#8221;).  Once our classifier has learned from these pieces of evidence, it can apply that knowledge to estimate whether our hypothesis is true or false.</p>
<p><span style="font-size: 14"><b>Basic Formulas</b></span></p>
<p>These are the basic formulas that we&#8217;ll use to create the NBC.  This is where a basic knowledge of probability theory is needed.</p>
<p>Starting off, we have Bayes&#8217; theorem,</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?P(A|B)%20=%20\frac{P(B|A)P(A)}{P(B)}" /></span></p>
<p>Prior odds of X,</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?O(X)%20=%20\frac{P(X)}{P(\neg%20X)}" /></span></p>
<p>Posterior odds of X given Y,</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?O(X|Y)%20=%20\frac{P(X|Y)}{P(\neg%20X|Y)}" /></span></p>
<p>Likelihood ratio of Y with respect to X,</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?L(Y|X)%20=%20\frac{P(Y|X)}{P(Y|\neg%20X)}" /></span></p>
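<p>As a quick illustration of how these formulas fit together, here is a minimal Python sketch (not part of the Game Maker example).  The function names and the numbers are made up for illustration; the only identity it relies on is <i>O(X|Y) = L(Y|X) O(X)</i>, which follows from Bayes&#8217; theorem.</p>

```python
def odds(p: float) -> float:
    """Odds O(X) = P(X) / P(not X), for a probability strictly below 1."""
    return p / (1.0 - p)


def probability(o: float) -> float:
    """Inverse of odds(): P(X) = O(X) / (1 + O(X))."""
    return o / (1.0 + o)


def likelihood_ratio(p_y_given_x: float, p_y_given_not_x: float) -> float:
    """L(Y|X) = P(Y|X) / P(Y|not X)."""
    return p_y_given_x / p_y_given_not_x


# Made-up numbers: prior P(X) = 0.5, P(Y|X) = 0.8, P(Y|not X) = 0.2.
prior_odds = odds(0.5)
posterior_odds = likelihood_ratio(0.8, 0.2) * prior_odds
posterior = probability(posterior_odds)
print(posterior_odds, posterior)
```

<p>With even prior odds, evidence that is four times as likely under <i>X</i> as under not-<i>X</i> gives posterior odds of about 4, i.e. a probability of about 0.8.</p>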
<p><span style="font-size: 14"><b>Creating and Training the Classifier</b></span></p>
<p>Suppose our hypothesis (X) is that the word &#8220;<i>soldaten</i>&#8221; is part of the German language and not part of the English language.  For a human, this task is simple (assuming one is familiar with either the English or German languages).  However, for a computer this task can be difficult!  We have prior knowledge of languages and what words of different languages tend to look like; our program does not share this knowledge.  So what should we do?  I say we GIVE it knowledge, by a process we&#8217;ll call <i>training the classifier</i>.</p>
<p><b>Training the Classifier</b></p>
<p>In our quest to create a program that can differentiate German and English words, we find ourselves trying to teach a program the differences between the two languages.  One way we can do this (and the method I used in the posted example) is to use a set of <i>features</i>, along with <i>events</i> that record whether each feature was present in or absent from a given word.  In our case, our list of features will be the characters <i>a</i> through <i>z</i>.  These features will correspond to a set of events, say</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?E_{1},E_{2},...,E_{26}" /></span></p>
<p>which will tell us whether the given feature (letter) was present in the given word.  In our example of the word &#8220;soldaten&#8221;, we would find that the following events are true.</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?E_{19},E_{15},E_{12},E_{4},E_{1},E_{20},E_{5},E_{14}" /></span></p>
<p>This is because the letters <i>s</i> (19th in the alphabet), <i>o</i> (15th in the alphabet), etc., are present in the word.  All events that are not included in the above list are events in which the corresponding feature was <i>not</i> found.  For example, the event E<sub>26</sub> will report that the character <i>z</i> did not show up in the word.</p>
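<p>The letter-to-event mapping can be sketched in a few lines of Python.  This is only an illustration of the idea, not the Game Maker code; the names <i>FEATURES</i> and <i>events</i> are hypothetical.</p>

```python
import string

FEATURES = list(string.ascii_lowercase)  # 'a'..'z'; E_1 is 'a', E_26 is 'z'


def events(word: str) -> list[bool]:
    """Return 26 booleans; entry n-1 is True when E_n holds for the word."""
    letters = set(word.lower())
    return [ch in letters for ch in FEATURES]


# 1-based indices of the events that are true for "soldaten".
present = [n + 1 for n, hit in enumerate(events("soldaten")) if hit]
print(present)  # [1, 4, 5, 12, 14, 15, 19, 20]
```

<p>The printed indices are the same eight events listed above, just in sorted order.</p>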
<p>When we <i>train</i> the classifier, we also provide one more piece of information along with the word we&#8217;re using to train it &#8211; the class of the word.  It won&#8217;t help if we only give our classifier a million words (some English, some German) and not tell it which are which!  So along with our training word, we&#8217;ll tell it if it belongs to class 1 (the word is German) or class 0 (the word is English).</p>
<p>Throughout the entire training process, we&#8217;ll keep a few counters that we&#8217;re going to use in the probability computations.  We keep track of the number of positive examples (class 1, the word is German) and the number of negative examples (class 0, the word is English) for each of our 26 features.  We&#8217;ll also count the total number of examples for each feature.  With these counter variables, we can start <i>computing the probabilities</i>.</p>
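<p>One possible way to keep such counters is sketched below in Python.  The exact bookkeeping is an assumption on my part for this sketch: it counts, per class, how many training words contained each letter, plus how many words of each class were seen.  The class name <i>Counts</i> and the three training words are made up for illustration.</p>

```python
import string
from collections import Counter

FEATURES = string.ascii_lowercase


class Counts:
    """Training counters (assumed layout, not the Game Maker data structure)."""

    def __init__(self) -> None:
        # Words seen per class: key 1 is German, key 0 is English.
        self.words = Counter()
        # Per class: feature -> number of words of that class containing it.
        self.hits = {1: Counter(), 0: Counter()}

    def train(self, word: str, label: int) -> None:
        self.words[label] += 1
        for ch in set(word.lower()):  # set(): count each letter once per word
            if ch in FEATURES:
                self.hits[label][ch] += 1


counts = Counts()
for word, label in [("soldaten", 1), ("zeit", 1), ("soldier", 0)]:
    counts.train(word, label)
print(counts.words[1], counts.hits[1]["s"], counts.hits[0]["s"])  # 2 1 1
```

<p>From counters like these, an estimate of P(E<sub>n</sub>|X) is simply the number of German words containing the letter divided by the number of German words seen.</p>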
<p><b>Computing the Probabilities</b></p>
<p>Our posterior odds, after observing all the evidence, are given by,</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?O(X|E_{1},E_{2},...,E_{26})%20=%20L(E_{1},E_{2},...,E_{26}|X)O(X)%20=%20\frac{P(E_{1}%20\cap%20E_{2}%20\cap%20...%20\cap%20E_{26}|X)}{P(E_{1}%20\cap%20E_{2}%20\cap%20...%20\cap%20E_{26}|\neg%20X)}O(X)" /></span>.</p>
<p>In other words, the posterior odds are the prior odds multiplied by a likelihood ratio: the probability of observing E<sub>1</sub>, &#8230;, E<sub>26</sub> given that our hypothesis is <i>true</i>, divided by the probability of observing them given that our hypothesis is <i>false</i>.  In NBC, we always assume that the features are <i>conditionally independent</i> given the class (this is the &#8220;naive&#8221; part), so each joint probability factors into a product:</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?P(E_{1}%20\cap%20E_{2}%20\cap%20...%20\cap%20E_{26}|X)%20=%20\prod_{n=1}^{26}%20P(E_{n}|X)" /></span></p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?P(E_{1}%20\cap%20E_{2}%20\cap%20...%20\cap%20E_{26}|\neg%20X)%20=%20\prod_{n=1}^{26}%20P(E_{n}|\neg%20X)" /></span></p>
<p>And in conclusion,</p>
<p><span style="padding-left: 20px"><img src="http://latex.codecogs.com/gif.latex?O(X|E_{1},E_{2},...,E_{26})%20=%20O(X)\prod_{n=1}^{26}%20\frac{P(E_{n}|X)}{P(E_{n}|\neg%20X)}" /></span></p>
<p>We have now found the posterior odds that the specific word we&#8217;re looking at is in class 1, i.e. that it is a German word.  To turn the odds into a probability, use <i>P = O / (1 + O)</i>.</p>
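<p>The final product formula can be sketched in Python as follows.  This is a toy illustration under stated assumptions, not the Game Maker implementation: the function name and the six training words are made up, the prior odds are estimated from the class sizes, and the +1 / +2 (Laplace smoothing) is my own addition so that a letter never seen in one class does not cause a division by zero.</p>

```python
import string

FEATURES = string.ascii_lowercase


def posterior_probability(word, german_words, english_words):
    """P(German | word) from two lists of training words (toy-sized here)."""
    n_de, n_en = len(german_words), len(english_words)
    odds = n_de / n_en  # prior odds O(X), estimated from the class sizes
    letters = set(word.lower())
    for ch in FEATURES:
        # Smoothed estimates of P(letter present | class); the smoothing
        # is an assumption of this sketch.
        p_de = (sum(ch in w for w in german_words) + 1) / (n_de + 2)
        p_en = (sum(ch in w for w in english_words) + 1) / (n_en + 2)
        if ch in letters:
            odds *= p_de / p_en              # letter present in the word
        else:
            odds *= (1 - p_de) / (1 - p_en)  # letter absent from the word
    return odds / (1 + odds)                 # convert odds to a probability


german = ["soldaten", "zeit", "wasser"]
english = ["soldier", "time", "water"]
print(posterior_probability("soldaten", german, english))
print(posterior_probability("soldier", german, english))
```

<p>With only three words per class the numbers mean very little, but the first word comes out above 0.5 and the second below it.  For long feature lists, summing logarithms of the ratios instead of multiplying them would avoid numerical underflow.</p>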
<p><b>Example in Game Maker</b></p>
<p>To go along with this topic, I built a Naive Bayes Classifier using Game Maker 8.1.  It is completely compatible with both the registered and unregistered versions.</p>
<p><i>Extended Features</i></p>
<p>I go a step further in my NBC.  Remember our feature list of characters <i>a</i> through <i>z</i>?  In this example, I use features that contain <i>double</i> and <i>triple</i> characters, as well as our standard <i>single</i> characters, i.e. {a, aa, aaa, aab, aac, &#8230;, ab, aba, abb, abc, &#8230;}.  As you may imagine, the list of features becomes <i>huge</i> when expanded to triple characters.  To (somewhat) remedy this, I included three feature generation scripts, <i>nbcGenerateFeatures1()</i>, <i>nbcGenerateFeatures2()</i> and <i>nbcGenerateFeatures3()</i>, which create feature lists of character depth 1, 2 and 3 respectively.  By default the classifier will use <i>nbcGenerateFeatures2()</i>.  You can change this in the <i>nbc</i> object in the second line of its create event.</p>
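<p>To get a feel for how quickly the feature list grows, here is a small Python sketch.  It assumes the features are all strings over <i>a</i> through <i>z</i> up to the given length; <i>generate_features</i> is a hypothetical stand-in for the three Game Maker scripts, and the ordering of its output may differ from theirs.</p>

```python
from itertools import product
import string


def generate_features(depth: int) -> list[str]:
    """All strings over a-z with 1..depth characters (assumed feature set)."""
    return [
        "".join(chars)
        for length in range(1, depth + 1)
        for chars in product(string.ascii_lowercase, repeat=length)
    ]


for depth in (1, 2, 3):
    print(depth, len(generate_features(depth)))  # 26, 702 and 18278 features
```

<p>Under that assumption, going from depth 2 to depth 3 multiplies the number of features by roughly 26, which is why the depth 3 run is so much slower.</p>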
<p><i>Running the Test</i></p>
<p>After you unzip the package, you&#8217;re ready to test it.  I included three files:<br />
(1) A file containing 1,100 German words (for training).<br />
(2) A file containing 1,100 English words (for training).<br />
(3) A testing file that contains 44 words of each language, with their correct classification on the same line.  This file is read by the test object and parsed.  If you want to make your own testing file, follow the format with one entry per line: <i>word,class</i></p>
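<p>If you want to check a home-made testing file before feeding it to the program, the <i>word,class</i> format is easy to parse.  The Python sketch below is only an illustration (the function name and the sample lines are made up); it is not how the test object reads the file.</p>

```python
def parse_test_lines(lines):
    """Turn 'word,class' lines into (word, class) pairs, skipping blanks."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last comma so the class is always the final field.
        word, label = line.rsplit(",", 1)
        entries.append((word.strip().lower(), int(label)))
    return entries


sample = ["soldaten,1", "soldier,0", ""]
print(parse_test_lines(sample))  # [('soldaten', 1), ('soldier', 0)]
```

<p>A line without a comma or with a non-numeric class raises a <i>ValueError</i>, which is a convenient way to spot malformed entries.</p>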
<p>Run the program, and hit T when prompted.  After the training has completed, you can press SPACE to run the test to see how the classifier performs after processing the training files.  <b>Note that both the training process AND the test process will take a LONG TIME if you are using <i>nbcGenerateFeatures3()</i></b> &#8211; it took a couple of minutes on my i7-930.  You also may want to play around with the training files.  Compiling a list of 2,200 total words is something of a chore that takes a month of Sundays; the training files I&#8217;ve provided can easily be improved.</p>
<p>You can also press E to input your own word and the program will try to determine if that word is German or English.</p>
<p><i>Program Output</i></p>
<p>After running the test, the program will open up the test log (testlog.txt, in the same folder as the GM file).  This file displays each word in the test file, with its determined probability of being a German word, its actual class, the determined class and whether the classifier made the correct decision (if the probability is over 0.5, it guesses that the word is German).  At the bottom of this file is a summary that displays the percent correct, the number correct and the number wrong.</p>
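<p>The decision rule and the summary can be sketched in Python like this.  The line layout below is hypothetical and is not the actual format of testlog.txt, and the three probabilities are made-up numbers; the sketch also assumes at least one result is passed in.</p>

```python
def summarize(results, threshold=0.5):
    """results: list of (word, probability_german, actual_class) tuples."""
    lines, correct = [], 0
    for word, prob, actual in results:
        guess = 1 if prob > threshold else 0  # over 0.5 means "German"
        ok = guess == actual
        correct += ok
        verdict = "correct" if ok else "wrong"
        lines.append(f"{word},{prob:.4f},{actual},{guess},{verdict}")
    total = len(results)
    wrong = total - correct
    percent = 100 * correct / total
    lines.append(f"{percent:.2f}% correct, {correct} right, {wrong} wrong")
    return lines


# Made-up probabilities, for illustration only.
log = summarize([("soldaten", 0.91, 1), ("soldier", 0.25, 0), ("water", 0.62, 0)])
print("\n".join(log))
```

<p>Here the third word is misclassified, so the summary line reports two right and one wrong.</p>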
<p>The program also outputs <i>featurelog.txt</i>, which is a complete list of the features used in the classifier.</p>
<p>I ran it for all three feature scripts, which returned the following statistics:<br />
<span style="padding-left: 20px">Using nbcGenerateFeatures3():  <b>89.77% correct</b>.</span><br />
<span style="padding-left: 20px">Using nbcGenerateFeatures2():  87.50% correct.</span><br />
<span style="padding-left: 20px">Using nbcGenerateFeatures1():  71.59% correct.</span></p>
<p><i>Download</i></p>
<p>Download the NBC program for use with Game Maker (.gm81) here: http://mikederoche.com/files/gmNBC.zip</p>
]]></content:encoded>
			<wfw:commentRss>http://mikederoche.com/dev/2011/04/23/naive-bayes-classifiers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

