Entries Tagged 'Probability'
November 30th, 2006 — Probability, Statistics, Thinking, analytics
The stock market is one of those things that really intrigues me. It is an open system where everyone can see what’s going on, perhaps make some money, and perhaps influence the direction of a stock. It’s a system that is ripe for data mining, which seems to be equal parts analytical skill, fortune telling, industry expertise, and, often, plain luck.
I’ve talked with Hip Egg and Jym Khana about stocks before, and one topic that I bring up every now and then is penny stocks. As I’m sure most people with an email account know, there’s a ton of stock-related spam going around these days. Most of it appears to be the pump-and-dump variety, in which the scammers hope that people will purchase the suggested stock, causing the price to rise so that they can sell their shares (which they purchased before sending out the email) at an inflated price. This technique has been around forever, but it seems to be the flavor of the month for scam and con artists.
The main questions that we usually talk about are a) Does anyone actually get rich doing this? and b) Just how “influence-able” are these low-priced stocks? Well, today has been a banner day for answers. I came across two articles talking about the scams.
While reading these I saw the simplest idea yet to help stop this spam: simply watch the stocks and see who bought a lot of stock before the email was sent, and who sold a lot right around the sell date in the email.
That idea is pure genius. It targets a potentially large group of people, but the odds are that a pattern will emerge showing a small group of people moving from one stock to another. At a minimum those groups would be a starting point for a fraud investigation. More than likely, those would be the people responsible for sending out the emails. And since the spammers are kind enough to send these messages to just about everyone on the planet, it shouldn’t take too long to gather a nice body of evidence (or actionable intelligence). From what I understand, scams like this have been hard to track in the past because the scammers can move quickly. But now that they are announcing their moves in advance, it should be pretty easy to set up a system to monitor the spam and then watch the stock activity… It just seems so simple that it should work like a champ!
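To make the idea concrete, here is a minimal Python sketch of that kind of cross-check. Everything in it is an assumption made for illustration: the trade records, the field names, the 5-day window, and the share threshold are all hypothetical, and a real investigation would need far more than this.

```python
from collections import Counter
from datetime import date, timedelta

def suspicious_accounts(trades, campaigns, window_days=5,
                        min_shares=10_000, min_campaigns=2):
    """Return accounts that bought big before a spam email and sold big
    near its suggested sell date, in at least `min_campaigns` campaigns.

    trades:    list of dicts with keys account, ticker, date, side, shares
               (a hypothetical record layout)
    campaigns: list of (ticker, email_date, sell_date) tuples taken
               from the spam messages
    """
    window = timedelta(days=window_days)
    hits = Counter()
    for ticker, email_date, sell_date in campaigns:
        big = [t for t in trades
               if t["ticker"] == ticker and t["shares"] >= min_shares]
        # Bought shortly BEFORE the email went out.
        buyers = {t["account"] for t in big
                  if t["side"] == "buy"
                  and email_date - window <= t["date"] < email_date}
        # Sold right around the sell date named in the email.
        sellers = {t["account"] for t in big
                   if t["side"] == "sell"
                   and abs(t["date"] - sell_date) <= window}
        for account in buyers & sellers:
            hits[account] += 1
    return {a for a, n in hits.items() if n >= min_campaigns}
```

The point of `min_campaigns` is the pattern mentioned above: one well-timed trade proves nothing, but the same account showing up across several spammed stocks is worth a closer look.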
November 10th, 2006 — Entertainment, Music, Probability, ipod
I really like my iPod. I’ve got a 4GB Mini that I keep in the car (hooked up to the stereo), and I listen to that instead of the radio. I have a couple of playlists that account for about 2GB of songs on there. You would think that with that many songs all would be good. And most of the time it is, but after a while you will have listened to all of those songs and will start to hear the same ones over and over and over. Even your favorite songs will begin to grate on your nerves.
So how do you prevent this?
For me the secret has been to set up two new play lists. One is dedicated to new songs (i.e. just bought or ripped), and the other is for songs that haven’t been played in a while.
With iTunes, you can set up a play list that will select songs based on certain fields. For me, most of my play lists revolve around the “rating” of the song. 1 star means I don’t like it, 5 stars means it’s the best thing I’ve ever heard. As a consequence, I have a lot of songs that fall into the 3 to 4 star range. Randomly choosing songs from this pool is ok, but for some reason I always seem to wind up with the same core group of songs, and like I said earlier, they are starting to get stale.
It turns out that there is another field that iTunes keeps track of, the “last time played”. This is interesting because now we can build a play list based not only on how much we like the song, but also on how long it has been since we have heard it. Combining the two ideas leads to an interesting new play list. Here’s a picture of how I have mine set up:

With this play list feeding into my iPod when I sync, I get a nice selection of “fresh” songs almost every time. Since I have about 800 songs rated between 3 and 5 stars, this gives me a good-sized pool of songs to pull from. And since the play list is time dependent, what is in the list today will be different from what is in the list 2 weeks from now.
The real beauty of this play list is that as songs are listened to, their “last played” date is set to now, and when I sync up the next time, a new song will take its place. This way, I can keep listening to the songs I like, but don’t have to worry about stale songs because the play list is always being refreshed.
And as I listen to songs on the Mac, this updates the last played dates also, so the net effect is that I’m adding a lot of chaotic variability to the play list. That in turn means that the songs on the play list will tend to be more “random”, because there are two sources of input (the iPod and iTunes) influencing what gets picked.
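For anyone who thinks better in code, the selection rule can be sketched in a few lines of Python. This is not how iTunes works internally; the field names, the 14-day idle threshold, and the 25-song limit are assumptions picked for illustration.

```python
import random
from datetime import datetime, timedelta

def fresh_playlist(songs, now, min_rating=3, min_idle_days=14,
                   limit=25, rng=random):
    """Pick up to `limit` well-rated songs that have not been played lately.

    songs: list of dicts with keys title, rating (1-5) and last_played
           (a datetime, or None if the song has never been played).
    """
    cutoff = now - timedelta(days=min_idle_days)
    pool = [s for s in songs
            if s["rating"] >= min_rating
            and (s["last_played"] is None or s["last_played"] < cutoff)]
    # Random choice from the eligible pool; playing a song moves its
    # last_played date forward, so it drops out of the next selection.
    return rng.sample(pool, min(limit, len(pool)))
```

Because eligibility depends on `now`, running the same selection two weeks apart gives a different pool even if the ratings never change.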
September 3rd, 2006 — Blogging, Probability, Statistics, Thinking
Recently I got an interesting promotion in the mail. Home Depot sent my wife and me a gift card that was loaded with a “mystery amount” between $1 and $10,000. To find out how much the card is worth, you have to go to a Home Depot and redeem it. Our card turned out to be worth $1. How much did we wind up spending? $75. As we were walking out of the store I started thinking about this promotion, and it hit me how genius it was.
Think about it: They are giving away money in exchange for people shopping at their stores. But they aren’t going to lose money, and if anything they will gain huge insights into the consumer base in an area!
For example, let’s say they mail out 10,000 of the cards. And let’s say there are only two values for the cards, $1 and $10,000. The smart thing would be to send out as few of the $10k cards as possible, so in this exercise we’ll assume 1 was sent out. That means that the other 9,999 cards were all worth $1, which means the total value of the “prize” is $19,999. That is, assuming that all of the cards are used (which in real life probably wouldn’t happen for a variety of reasons), and ignoring other things like the cost of the mailings, the time of the staff to prepare things, etc.
A lot of people who get these cards are probably like me: they have a bunch of small projects around the house, but they just don’t have any motivation to do any work on them (because they need parts, they need time, etc., etc., etc.). This card comes along and they decide “Hey, I could win $10k, why not go down there and get some of the supplies I need anyway, and then I can find out what this is worth!”.
So let’s assume that 50% of the people who get the cards decide to use them. My wife and I wound up spending $75, but I’m sure that some people will spend more and others will spend less. For this little exercise, let’s figure that the average amount spent is $50 per person. That works out to 5,000 shoppers × $50 = $250,000 spent at the store! Even if one of the cards happened to be the winning $10k card, Home Depot will still have brought in over $200k of business. If the average spent per customer was higher and/or more people used the cards, the amount would be even higher.
So, for a $20k investment, the business got back $250k in sales. For a company so big, that’s probably a drop in the bucket, but the effect is something that can’t be ignored. And that’s not even the good part!
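The back-of-the-envelope numbers above are easy to replay in Python. The inputs are the made-up figures from this example (10,000 cards, one $10k winner, 50% redemption, $50 average spend), not anything Home Depot has published, and the function name is just for illustration.

```python
def promotion_numbers(cards_mailed=10_000, grand_prize_cards=1,
                      grand_prize=10_000, small_prize=1,
                      redemption_rate=0.5, avg_spend=50):
    """Return (maximum prize payout, expected sales) for the example."""
    # Worst case for the store: every single card gets redeemed.
    max_payout = (grand_prize_cards * grand_prize
                  + (cards_mailed - grand_prize_cards) * small_prize)
    # Sales come only from the people who actually show up.
    shoppers = cards_mailed * redemption_rate
    sales = shoppers * avg_spend
    return max_payout, sales

payout, sales = promotion_numbers()
print(payout, sales)  # 19999 and 250000.0 with the default inputs
```

Changing `redemption_rate` or `avg_spend` shows how quickly the sales figure moves while the prize payout stays capped.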
For those of you who’ve never seen one of these gift cards, they are basically a plastic card with a bar code on the back. The bar code is scanned at the register, which queries the main database to find out the value of the card. Pretty standard stuff. But then it occurred to me that the card was, I think, addressed by name to me and my wife. I could be wrong, it might have said “resident”, but at any rate the ZIP code was on the address.
What this means is that there is now (potentially) a connection between how much was spent at the store and what ZIP code the customer lives in. This is a gold mine of data that tells you a great deal about your customer base. Off the top of my head I could think of the following:
Which store did they go to? (i.e. did they drive past one that is closer to where they live?)
Did the customer drive past any competitors to get to the store?
How was the turn out from areas where a competitor’s store is closer than a Home Depot store?
For a given ZIP code how much was spent?
Is there a popular item in a certain area?
Continuing with the example above, imagine that the results of this campaign showed a higher turnout from certain parts of town. Those parts could be investigated to find out what kind of homes are there. It could be that the homes in that area are older and thus more likely to be getting “fixed up”. Likewise, if the turnout was low in another area, research could turn up that the houses are newer and less likely to need a large amount of work (i.e. the owners are less likely to spend higher amounts of money).
There’s a ton more that could be derived, but the important thing is that the lessons learned can be applied to the next marketing campaign. There are all kinds of targets that can be aimed for: a higher turnout, a larger purchase per customer, moving a certain type of product (like paint, etc.). The list is endless.
If nothing else, the whole thing has shown that I’m willing to put a lot of thought into something if it will keep me from having to do home repairs. 
July 2nd, 2006 — Blogging, Math, Probability, Programming, Python, Statistics
Recently I discovered the random.randint() function in Python. Basically you call it with 2 ints, a low value and a high value. It will return an integer in that range (inclusive). I was playing around with it and I thought it seemed to be giving me the same number awfully often, so I whipped up a test: call that method 1 million times, record the values, then repeat 6 times.
I’m using randint() to simulate dice so I’m curious to see if the number distribution is even across the numbers 1 through 6. Below is my test code:
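    # Python 2 syntax (see the version note at the end of this post)
    import random
    ONE_MILLION = 1000000

    def d6():
        return random.randint(1, 6)  # one roll of a six-sided die

    for run in range(6):
        counts = [0, 0, 0, 0, 0, 0, 0, 0]  # slots 0 and 7 should stay 0
        for roll in range(ONE_MILLION):
            counts[d6()] += 1
        for i in counts:
            print i, ',',
        print ''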
Each time d6() (my wrapper around randint()) is called, it returns a number 1-6. This is used as an index into the counts list, and the number there is incremented by one. I have 0’s on both sides of the 1-6 slots just to make sure it really is returning a correctly bounded value. The numbers in each row should sum up to 1 million.
By running this 6 times, I should get an idea of where the numbers are falling and whether the distribution is even. (Truly random numbers will have an even distribution over the long term; if they are grouping around one number, then the random number generator is not doing a good job.) I took the total of each column (which should be very close to 1 million) and then found the percent error, ((amount - expected) / expected) * 100, omitting the absolute values that are usually used. The average of the percent errors was 0. Because the signed errors cancel each other out, the more telling figure is that no column total missed 1 million by more than 1,131. This leads me to believe that the numbers generated by the randint() function are sufficiently random for my uses.
Now that I have stated this, I have no more excuses but to continue on with coding the game that will use said function in a dice throwing function.
Below is the spreadsheet of my data as generated by Google Spreadsheets.
Counts index: | 0 | 1       | 2       | 3       | 4       | 5       | 6       | 7 | Row sum
Run 1:        | 0 | 166367  | 166368  | 166846  | 166996  | 167006  | 166417  | 0 | 1000000
Run 2:        | 0 | 165463  | 166853  | 166669  | 166644  | 167031  | 167340  | 0 | 1000000
Run 3:        | 0 | 167284  | 166470  | 167052  | 166227  | 166123  | 166844  | 0 | 1000000
Run 4:        | 0 | 166893  | 166893  | 165958  | 166655  | 167011  | 166590  | 0 | 1000000
Run 5:        | 0 | 166887  | 166370  | 166124  | 166672  | 167160  | 166787  | 0 | 1000000
Run 6:        | 0 | 166802  | 167174  | 166724  | 165704  | 166800  | 166796  | 0 | 1000000
Total:        |   | 999696  | 1000128 | 999373  | 998898  | 1001131 | 1000774 |   |
% Err:        |   | -0.0304 | 0.0128  | -0.0627 | -0.1102 | 0.1131  | 0.0774  |   |
Avg % Err:    | 0
By the way, this data was generated with python 2.4.3.
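For readers on a current interpreter, here is a rough Python 3 sketch of the same test, including the percent-error step. The helper names `roll_counts` and `percent_errors`, the fixed seed, and the smaller roll count are my assumptions, chosen to keep the demo quick. One caveat worth keeping in mind: the six column totals always add up to the total number of rolls, so the signed percent errors average to zero by construction. The largest absolute error is the more informative number.

```python
import random

def roll_counts(n_rolls, rng=random):
    """Roll a simulated six-sided die n_rolls times and tally the faces.

    Slots 0 and 7 are guard slots that should stay at 0 if randint()
    really is bounded to 1..6.
    """
    counts = [0] * 8
    for _ in range(n_rolls):
        counts[rng.randint(1, 6)] += 1
    return counts

def percent_errors(totals, expected):
    """Signed percent error of each total against the expected count."""
    return [(t - expected) / expected * 100 for t in totals]

if __name__ == "__main__":
    rng = random.Random(2006)   # fixed seed so the demo is repeatable
    n = 100_000                 # rolls per run (the post used 1 million)
    runs = [roll_counts(n, rng) for _ in range(6)]
    # Each face is expected n/6 times per run, so n times over 6 runs.
    totals = [sum(run[face] for run in runs) for face in range(1, 7)]
    errors = percent_errors(totals, n)
    print("totals:", totals)
    print("max |% err|:", max(abs(e) for e in errors))
```

A chi-square goodness-of-fit test would be the more standard way to judge evenness, but the simple percent errors are enough to see whether one face is being favored.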
May 7th, 2006 — Math, Probability, Technology, Thinking
Recently I discovered Jeff Jonas’ blog. Jeff is a really interesting person, and if you are interested in Social Network Analysis or Data Mining, then his page should be on your reading list.
His idea for simple and anonymous watch list checking is one of those forehead-slapping, “why didn’t I think of that” ideas that I just love to read about. Basically the idea is that if sensitive fields in a database (name, birthdate, etc.) are run through a one-way hash, then the database can be distributed without fear of a massive invasion of privacy (because all of the data in it is hashed). If you need to check whether someone is in this database, you simply apply the same one-way hash to their information, and then see if the hashed data is in the database. If it is, the match gets reported back to the original database holder, who remains responsible for keeping the underlying data.
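The mechanics can be sketched with Python’s standard hashlib module. This is only my reading of the idea, not Jeff Jonas’ actual design: the `fingerprint` helper, the choice of SHA-256, the normalization, and the sample names are all assumptions for illustration.

```python
import hashlib

def fingerprint(name, birthdate):
    """One-way hash of the sensitive fields, normalized so that the
    same person always produces the same digest."""
    normalized = f"{name.strip().lower()}|{birthdate}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# The list owner distributes only the digests, never the raw records.
watch_list = {
    fingerprint("Jane Example", "1970-01-01"),
    fingerprint("John Sample", "1965-06-15"),
}

def is_on_watch_list(name, birthdate):
    """Hash the candidate the same way and look for a matching digest."""
    return fingerprint(name, birthdate) in watch_list
```

Note that a plain hash of guessable values like names and birthdates can be attacked by hashing lots of guesses, so a real deployment would want something stronger, such as a keyed hash.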
Einstein, Dijkstra, and everyone else who ever talked about elegance and simplicity were 100% right. And this idea is simple and elegant. Check out this entry about how the network that connects the facts should be interpreted: Jeff Jonas: Sometimes a Big Picture is Worth a 1,000 False Positives
April 22nd, 2006 — Apple, Blogging, Music, Probability, ipod
It’s been almost a year since I started using iTunes, and in that time I’ve adapted to its way of looking at my music library. It took me a while to get used to its mangling of my music directory (I’m picky like that), but all in all things are pretty good now. I do have a few observations:
Metadata and the files
I like that when I play a song the album artwork is displayed. I think that’s neat. I was, however, really surprised to find out that the image of the album cover is stored in the mp3/m4p file, thus increasing its size. That struck me as really odd, since if you have an album’s worth of songs, the same picture could be used for each song/file. It seems that if there were only one copy of each image floating around, then over your entire music library you’d have more file space, which is to say you could fit more songs onto your iPod.
Then I discovered the most interesting thing: The ratings that you assign to a song are not stored in the music file! They are kept in a separate file! I’m pretty sure that the MP3 file standard has a field for the user’s rating. It seems to me that setting the field (which is already there and wouldn’t add anything to the size of the file) would be the way to do it, instead of storing that info in a file that, if it gets corrupted, will affect every rating in the library. Plus, if you go and move the music file to another machine, you lose the rating info. I’ve been relying heavily on rating information lately.
Playlist fatigue
I’ve noticed lately that I’m getting really tired of my playlists; they seem to play the same songs over and over. Investigating further, I found the problem is that the playlists (which are mostly based on the ratings of the songs) are not as large as I thought they were. It turns out I have not rated a large portion of my music library. This means that the playlists are pulling selections from a rather limited pool. The smaller this pool, the more frequently you are going to hear a repeat.
Of course the way to get around this is to create an “unrated” playlist and force yourself to listen through it. As you listen to the songs, rate them. This will help enlarge the pool of possible songs the playlists can play from. Previous studies have shown the randomizer in iTunes does a pretty good job of picking songs randomly, so by increasing the playlist size you’ll find that you don’t have as many “I just heard that song!” moments.
Please note that the last item about the playlists applies to all music players, not just iTunes. I discovered this problem at work while using the Windows Media player. If you’ve got a small population of things to choose from, then there is a high probability that you are going to hear the same thing often.
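To put a number on that, here is a small Python sketch that computes the chance of hearing at least one repeat, assuming the player picks each song uniformly at random from the pool and does not avoid recently played songs (real shuffle modes may behave differently). The function name is mine.

```python
def prob_repeat(pool_size, plays):
    """Probability that at least one song repeats within `plays` picks,
    each drawn uniformly at random from `pool_size` songs."""
    if plays > pool_size:
        return 1.0  # more plays than songs: a repeat is guaranteed
    p_all_different = 1.0
    for i in range(plays):
        # The (i+1)-th pick must avoid the i songs already heard.
        p_all_different *= (pool_size - i) / pool_size
    return 1.0 - p_all_different

# Compare a small pool with a large one over the same listening session.
for pool in (50, 800):
    print(pool, round(prob_repeat(pool, 10), 3))
```

This is the same calculation as the well-known birthday problem, which is why repeats show up much sooner than intuition suggests.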
February 27th, 2006 — AI, Blogging, Probability, Programming, Python, Web
As a follow up to yesterday’s post, I’ve decided to try and blog my steps as I go. This way when I’m groaning about it a year later I can see exactly where I went wrong…. 
Anyways, the first step in problem solving is to identify the problem. My problem is I want to be able to determine if a company (i.e. its stock) is doing well or not in an analytical way.
My first proposal: Analyze news about the company to determine if the news is good or bad. Bad news usually is an indicator of a problem, good news is an indicator of a company doing well (from a stock point of view).
Using a simple Naive Bayes Classifier, I think it should be fairly straightforward to analyze the news for each company and determine if the news is “positive”, “negative”, or “neutral”. After building up a corpus of news articles to train the system, the system should be “good enough” to run on its own, and even to re-train itself with newer news articles.
I’ve managed to bang out the code to retrieve the news pages (viva la python!), so the next step is to port a simple Perl-based classifier (found here in a great article that requires registration) to Python, and then let ‘er rip.
Note: Normally I discourage people from trying to reinvent the wheel. After all, if someone else has already done the hard work, why not follow in their footsteps? In this case the concept behind the code is pretty important to what I want to accomplish, so I feel it is important to really understand it. As a bonus the original code is pretty straightforward, so it should be hard for me to screw it up. So those two factors led me to think I should port the algorithm instead of using the original code.
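To give a feel for what such a classifier involves, here is a minimal Naive Bayes sketch in Python. It is not the Perl code from the article and not the port described above; the class name, the whitespace tokenizing, and the add-one smoothing are my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Tiny multinomial Naive Bayes text classifier."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> training docs
        self.vocab = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Work in log space: log prior plus log likelihood per word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                count = self.word_counts[label][w] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage is just a series of `train(headline, label)` calls followed by `classify(new_headline)`. With only a handful of training articles the labels will be shaky, which is why building up the corpus comes first.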
January 15th, 2006 — Blogging, Entertainment, Fun, Games, Math, Probability, Programming, Statistics, Sudoku, Thinking
Dr. Dobb’s magazine this month has an article entitled “Sudoku & Graph Theory” which caught my eye. The article describes a logical Sudoku solver the authors built that uses graph theory techniques to analyze the puzzle.
This really got my attention because graph theory is an important field of mathematics that has a number of applications (network traffic flows for example), and it is something that I’m always interested to learn more about.
The first thing the article does is assume that each of the 81 cells of a sudoku puzzle represents a vertex on a graph. The authors then point out that the numbers that can be assigned to each row, column, or 3×3 square can be thought of as a node of a bipartite graph. That node contains an array of numbers that could possibly be in that position on the puzzle.
This is exactly what I do when trying to solve a sudoku puzzle, but expressed in mathematical/topological terms. (This is what I was trying to get across in my post about Sudoku Strategy.)
The article then goes on to present two methods of logically eliminating numbers from the array to find the correct answer: Pile Exclusion and Chain Exclusion. Sadly, I cannot find any links on the web that explain these algorithms in more detail, but the article does an ok job of showing how they work.
I do want to point out that if you read the full article, beware that the sample sudoku puzzle they present does not seem to match up with the sample arrays (or vectors as they call them) when they are demonstrating the chain and pile exclusions!
My own personal preference seems to be the Pile Exclusion, since that matches up with how I solve puzzles. It is basically a system where you find groups of numbers that are common across several squares (usually in a 3×3 section, but I often expand it to include the row and column). Usually this works out so that you have two squares where the numbers could be 1,3,7 in one and 1,7 in the other. Then you look at the other squares, and if you see that 1 and 7 aren’t a choice in any of them, then 1 and 7 must be in the two squares you are looking at. This means that the 3 is not a possible answer for the first square, so you can mark it out. This usually winds up helping you figure out where the 3 is supposed to go.
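The 1,3,7 / 1,7 example can be sketched in Python as follows. This covers only the two-square case described in this paragraph; it is not the article’s Pile Exclusion code, and the function name and the list-of-sets representation are assumptions of mine.

```python
from itertools import combinations

def exclude_pairs(unit):
    """unit: list of 9 sets, the candidate digits for each cell of one
    row, column, or 3x3 section. Returns a pruned copy.

    If two digits can only go in the same two cells, those cells must
    hold exactly those digits, so every other candidate is struck out.
    """
    result = [set(cell) for cell in unit]
    # For each digit, which cells still list it as a candidate?
    where = {d: {i for i, cell in enumerate(result) if d in cell}
             for d in range(1, 10)}
    for a, b in combinations(range(1, 10), 2):
        cells = where[a]
        if len(cells) == 2 and cells == where[b]:
            for i in cells:
                result[i] &= {a, b}
    return result

unit = [{1, 3, 7}, {1, 7}, {3, 4}, {4, 5}, {3, 5, 6},
        {2}, {8}, {9}, {4, 6}]
print(exclude_pairs(unit)[0])  # the 3 has been struck from the first cell
```

Running the same check over every row, column, and 3×3 section, and repeating until nothing changes, is how this kind of rule gradually narrows the puzzle down.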
The Chain Exclusion is similar in that you are looking for groupings of numbers, but with this algorithm you are looking for the numbers to be shared in other parts of the array in order to rule out other locations. For example, if you have 1,3 and 3,4 and 4,1 as the possible answers in three cells, then other locations in the puzzle that contain a 3 can be ruled out. Personally I find the Chain Exclusion method to be more of a leap than the Pile Exclusion.
Both of these methods basically boil down to using logic to reduce (or outright find) the possible numbers that could be the answer. Using both, as the program written for the article does, makes for a powerful set of tools to work your way through the puzzle. The alternative is to do a “brute force search”, which means simply trying every possible number in every possible cell until you get the solution. Since there are 81 cells and 9 possible numbers per cell, that means there are 9^81 possible ways to fill the grid (in plain English, roughly 2 followed by 77 zeros), give or take a few depending on how many numbers were already filled in for you. Needless to say, using Pile and Chain Exclusions will help you get the puzzle solved much sooner.
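Python’s arbitrary-precision integers make it easy to see how big that brute-force number is. The 30 pre-filled cells below are just an assumed example of a typical puzzle, not a figure from the article.

```python
# Every one of the 81 cells could hold any of 9 digits.
blank_grid = 9 ** 81
print(blank_grid)            # the full integer
print(len(str(blank_grid)))  # how many digits it has

# With some cells already given, only the empty cells are free.
givens = 30                  # assumed number of pre-filled cells
remaining = 9 ** (81 - givens)
print(len(str(remaining)))   # still an enormous number
```

Even with a generous number of givens, the count stays far beyond what anyone could try by hand, which is the whole argument for the logical methods.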
So go check out the article in the Feb 2006 issue of Dr. Dobb’s magazine, it’s a great read.
October 15th, 2005 — AI, Blogging, Probability, Statistics, Web
Mr. Cringely’s site is always a good source for something interesting that will make you think. The second half of this week’s article is about AdWords and how it works internally. Basically, over the last few weeks he has been describing an experiment someone was doing to see how AdWords works (I wrote about this back here).
In this week’s update, Seeing Is Believing, Cringely wonders if Google is doing something to prevent people from optimizing their site for AdWords by adjusting the Nash Equilibrium point(s) in their algorithm so that Google winds up the long-term winner. Google of course denies this, and Mr. Cringely of course doesn’t fully believe them.
Personally I think what was observed was a shift in the equilibrium points, but it was not an intentional effort by Google to do something bad. Rather, from what I have read about AdWords, it sounds like the system is constantly re-evaluating itself and its environment. I would not be surprised at all to find out that the system is capable of adjusting its parameters (to a limited extent, major changes would probably need to be approved/done by humans). I think when the subject of the AdWords article was conducting the experiments, they just happened to pick a time when the parameters were being adjusted (for their particular keywords).
Is adjusting the parameters wrong? No, not if Google wants to stay in business. The internet is a pretty dynamic place and things happen in short time intervals. The probability of someone figuring out how your “system” works grows the longer it is on the internet. And as soon as someone figures out how something works, they try to make it work for them. Adjusting its parameters is a good way to make sure that Google’s AdWords stays relevant and useful and therefore continues to make them money. (And this is more or less what the representative from Google said in the article.)
If anything, I think this is pretty interesting. While reading the article I thought to myself that this topic would make an interesting doctoral thesis. I had to laugh out loud when I read the same thing being said by Google’s AdWords engineer…
September 24th, 2005 — Blogging, Probability, Statistics, Web
A few days ago I wrote wondering about how people make money using Google Ads and blogging. Mr. Cringely has a really interesting article this week (Google Goes Las Vegas) talking about how Google’s AdWords algorithms seem to work. It’s a pretty interesting read.
I can’t help but wonder, though, whether the experiment he describes in the article is solid. The web is pretty dynamic, and Google seems to understand that better than a lot of companies out there. I also wonder if he was seeing some kind of reaction to his advertising the same merchandise (it sounds like he just copied his website and put it under a different name).
For example, to protect a merchant: When Google sees that someone has copied their pages and tries to buy ads (in order to drive traffic to the copied site), maybe Google charges more for the same amount of traffic. That way the original merchant site still has the advantage (of getting his traffic at a lower price) but the competitor has to pay more just to keep up (as sort of a punishment for just copying someone else’s site).
This is just a guess on my part, and I’m sure there’s a million variables going on that we know nothing about.