As a follow-up to yesterday’s post, I’ve decided to try and blog my steps as I go. This way when I’m groaning about it a year later I can see exactly where I went wrong…
Anyways, the first step in problem solving is to identify the problem. My problem is I want to be able to determine if a company (i.e. its stock) is doing well or not in an analytical way.
My first proposal: analyze news about the company to determine if the news is good or bad. Bad news is usually an indicator of a problem; good news is an indicator of a company doing well (from a stock point of view).
Using a simple Naive Bayes Classifier, I think it should be fairly straightforward to analyze the news for each company and determine if the news is “positive”, “negative”, or “neutral”. After building up a corpus of news articles to train the system, the system should be “good enough” to run on its own, and even to re-train itself with newer news articles.
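To make the idea concrete, here is a minimal sketch of what such a classifier might look like in Python. It is not the Perl code from the article; it assumes a plain multinomial Naive Bayes with add-one smoothing and a very crude tokenizer, and the class name, method names, and sample headlines (about a made-up company called Acme) are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocabulary = set()                  # every word seen during training

    @staticmethod
    def tokenize(text):
        # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, text, label):
        words = self.tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            # Work in log space so many small probabilities do not underflow.
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training headlines, made up purely for illustration.
classifier = NaiveBayesClassifier()
classifier.train("Acme posts record profit and raises guidance", "positive")
classifier.train("Acme shares surge after strong earnings", "positive")
classifier.train("Acme faces lawsuit over accounting fraud", "negative")
classifier.train("Acme shares plunge after profit warning", "negative")
classifier.train("Acme to hold annual shareholder meeting", "neutral")

print(classifier.classify("record earnings surge"))
print(classifier.classify("lawsuit over fraud"))
```

A real corpus would need far more articles per label, and re-training on newer articles would just be more calls to `train`.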
I’ve managed to bang out the code to retrieve the news pages (viva la python!), so the next step is to port a simple Perl-based classifier (found here in a great article that requires registration) to Python, and then let ‘er rip.
Note: Normally I discourage people from trying to reinvent the wheel. After all, if someone else has already done the hard work, why not follow in their footsteps? In this case the concept behind the code is pretty important to what I want to accomplish, so I feel it is important to really understand it. As a bonus the original code is pretty straightforward, so it should be hard for me to screw it up. So those two factors led me to think I should port the algorithm instead of using the original code.
7 comments ↓
hi,
i think you may find this research paper relevant in what you’re trying to do. whether you’re building a portfolio of stocks across similar markets, detecting questionable accounting practices, or simply gauging company profitability, spectral clustering is a powerful tool useful for untangling high order correlational data into an embedding of manageable dimension.
Porikli - Ambiguity Detection by Fusion and Conformity - A Spectral Clustering Approach
http://www.merl.com/papers/docs/TR2005-035.pdf
Hey Jerry, how’s it going?
Thanks for the link, I’ve come across the term “spectral clustering” while doing some reading, but this is the first paper I’ve seen that gives some good detail. Thanks!
-Nick
if you want, i can give you a whole slew of references. i’m currently using it for my thesis. the math isn’t hard, but as to why it works is beyond me.
Thanks Jerry, I’ll keep that in mind. I’m working my way through that last document you recommended.
Lots of good info in there.
Also, I hear you on the math and wondering why it works. I’ve been looking at the perl code in that Dr. Dobbs article trying to figure out the advantages of that approach. Very interesting stuff.
i’ve collected some authoritative publications on the subject, if you are interested in how it works. i have a vague understanding, but perhaps we can share our insights and learn a thing or two. i’m still figuring out the details..
http://bertolami.com/jerry/spectral.htm
Very cool! Thank you Jerry, that is quite a list! Looks like I’ll never be able to say “I’ve got nothing to read” again…
-Nick
hello nick,
are you familiar with the Kelly Criterion? i have been reading this book called Fortune’s Formula by Poundstone and it talks about this famous equation. it’s based on Information Theory and was used for beating the stock market.
basically it goes like this, assuming you know in advance the probability of winning a wager and you know the odds you are paid per win, there are two extremes in betting strategies: bet everything you got. bet nothing. both strategies are inadequate, and in order to maximize your chances, there is an ideal middleground that can be calculated.
introduction:
http://www.investopedia.com/articles/trading/04/091504.asp
derivation:
http://www.jimgeary.com/poker/letters/KELLY.HTM
the original formulation as described on wikipedia goes like this:
f = (bp - q) / b
where:
f = ratio of entire bankroll to place in each bet
b = odds paid per win (ratio of bet)
p = chance of winning
q = chance of losing (= 1 - p)
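as a quick illustration, the formula above can be written as a small python function. this is only a sketch: the function name and the sample numbers are made up, and it assumes b > 0 and that p is a probability between 0 and 1.

```python
def kelly_fraction(b, p):
    """Classic Kelly fraction f = (b*p - q) / b, with q = 1 - p.

    b: odds paid per win, as a ratio of the bet (must be > 0)
    p: chance of winning (between 0 and 1)
    """
    if b <= 0:
        raise ValueError("b must be positive")
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability between 0 and 1")
    q = 1 - p
    return (b * p - q) / b


# Even-money bet (b = 1) with a 60% chance of winning:
# f = (0.6 - 0.4) / 1, i.e. about 20% of the bankroll.
print(kelly_fraction(1, 0.6))

# A negative result means the bet has no edge and should be skipped.
print(kelly_fraction(1, 0.4))
```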
a friend of mine was interested in the situation where only partial loss is incurred in a bad guess (ie. lose 1/3 your bet instead of 100%).
i did the math and i found the new equation to be:
f = (bp - qc) / (bc)
where:
f = ratio of entire bankroll to place in each bet
b = odds paid per win (ratio of bet)
c = punishment per loss (ratio of bet)
p = chance of winning
q = chance of losing (= 1 - p)
it’s very elegant.. except the equation explodes when you plug 0 into either b or c.
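here is a rough python sketch of the partial-loss version, f = (bp - qc) / (bc). it assumes the formula comes from maximizing the expected log growth p*log(1 + b*f) + q*log(1 - c*f), which is the usual way the kelly criterion is derived; the function names and sample numbers are made up for illustration. it rejects b = 0 and c = 0 up front, since that is where the equation blows up.

```python
import math


def kelly_fraction_partial_loss(b, c, p):
    """Kelly fraction when a loss costs only c times the bet.

    b: odds paid per win (ratio of bet, must be > 0)
    c: punishment per loss (ratio of bet, must be > 0)
    p: chance of winning (between 0 and 1)
    """
    if b <= 0 or c <= 0:
        raise ValueError("b and c must be positive")
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability between 0 and 1")
    q = 1 - p
    return (b * p - q * c) / (b * c)


def expected_log_growth(f, b, c, p):
    """Expected log growth per bet when staking fraction f of the bankroll."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - c * f)


# With c = 1 (lose the whole bet) this reduces to the classic formula.
print(kelly_fraction_partial_loss(1, 1, 0.6))

# Numerical sanity check: growth at the computed f should beat nearby values.
b, c, p = 2.0, 0.5, 0.5
f = kelly_fraction_partial_loss(b, c, p)
growth_at_f = expected_log_growth(f, b, c, p)
print(growth_at_f > expected_log_growth(f - 0.01, b, c, p))
print(growth_at_f > expected_log_growth(f + 0.01, b, c, p))
```

note that with c < 1 the result can come out larger than 1, which would mean betting more than the bankroll.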