In all the news reports, there are various estimates for the probability that Schwarzenegger's message happened by chance. Let's look at a very restrictive version of the question: How likely is it that the initial letters of seven words taken at random from English language text will spell out ‘f— you’?
The numbers that have appeared in the press are questionable. Even where people have accounted for the fact that some letters are more common than others, they haven't taken into account that these are initial letters.
So what are the frequencies at which letters appear as the initial letters of words in written English? I had no idea. So I downloaded plain text versions of two books from Project Gutenberg and analyzed them. The two books were Phineas Finn by Anthony Trollope and Following the Equator by Mark Twain. After some editing (to keep ‘Chapter XXI’ from adding to the count for the letter X, for instance), I saved the books as .txt files and ran the following command (all on one line):
cat *.txt | tr "[:lower:]" "[:upper:]" | awk '{for(i=1;i<NF+1;i=i+1) alpha[substr($i,1,1)]+=1} END {sum=0; s = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; for(i=1;i<27;i+=1){sum+=alpha[substr(s,i,1)]}; print sum; for(i=1;i<27;i+=1) {c=substr(s,i,1); print c " " 100*alpha[c]/sum}}'
I found that the letters C, F, K, O, U, and Y appeared at the beginnings of 3.6%, 3.5%, 0.7%, 5.9%, 1.0%, and 1.5% of the words in these books, respectively. Using these figures, it appears that the likelihood that the seven letters in Schwarzenegger's veto appeared by chance is about 1 in 1.3 trillion.
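As a rough check of that arithmetic, here is a minimal sketch that multiplies the percentages quoted above, counting U twice since it appears twice among the seven initials. The function name `veto_odds` and the array names are illustrative assumptions, not part of the original analysis.

```shell
# Hypothetical helper (not from the original analysis): multiply the quoted
# initial-letter frequencies, using U twice because it occurs twice.
veto_odds() {
  awk 'BEGIN {
    # Percentages quoted in the text above
    pct["C"] = 3.6; pct["F"] = 3.5; pct["K"] = 0.7
    pct["O"] = 5.9; pct["U"] = 1.0; pct["Y"] = 1.5
    # How many times each letter occurs among the seven initials
    n["C"] = 1; n["F"] = 1; n["K"] = 1; n["O"] = 1; n["U"] = 2; n["Y"] = 1
    prob = 1
    for (c in pct) {
      for (i = 1; i <= n[c]; i++) {
        prob *= pct[c] / 100
      }
    }
    printf "about 1 in %.1f trillion\n", 1 / prob / 1e12
  }'
}

veto_odds   # prints: about 1 in 1.3 trillion
```

The product treats the seven initials as independent draws, which is the same simplification the restrictive version of the question makes.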
State government letters are neither Trollope nor Twain, but these numbers are certainly better than saying that the probability is 1 chance out of 26^7 (about 1 in 8 billion), which treats all 26 letters as equally likely.