Blog mining

I just read a fascinating article about how blogs are being treated as sources for data mining. A lot of companies have realized that blogs are a great way for people to spread their opinions far and wide, so naturally a couple of companies have sprung up to supply information about what’s going on in the blog world. Check out the article here.

Although I think the fuss about blogging has been blown out of proportion, I do think the efforts described in the article are very interesting. The fact that algorithms are being used to determine not only the content of a post but also details about who posted it just blows my mind. This is what Artificial Intelligence is supposed to be all about.

Although I find the claim that Umbria’s spider can index the blogosphere in 20 minutes a bit hard to believe, I fully believe that the data it gathers must be a virtual goldmine. Imagine the expert system that could be built using knowledge about a product, what the consumer thinks of it, and what demographic the consumer falls into. It’s very Big Brother, but at the same time the applications are almost limitless.

Over the last week or two I’ve been thinking it would be fun and interesting to try to collect the top search terms that are going around on the internet. I’ve started this little project and have been able to collect a fair amount of data, but I realized I didn’t really have a good way to look at the data qualitatively (i.e., to find out what it is telling me). This article has inspired me to look a little harder, because there is some potentially very useful information in there.
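As a first pass at looking at that kind of data, here is a minimal Python sketch that just counts how often each search term shows up. It assumes the collected terms are available as a plain list of strings; the function name `top_terms` and the sample data are made up for illustration, not taken from my actual collection.

```python
from collections import Counter


def top_terms(queries, n=10):
    """Return the n most common normalized search terms with their counts."""
    # Normalize case and whitespace so "Nike shoes" and "nike  shoes" count together.
    normalized = (" ".join(q.lower().split()) for q in queries)
    # Skip entries that are empty after normalization.
    counts = Counter(q for q in normalized if q)
    return counts.most_common(n)


# Made-up sample data, only to show the shape of the result.
sample = ["Nike shoes", "nike  shoes", "ipod", "iPod", "ipod", "xbox 360", ""]

for term, count in top_terms(sample, n=3):
    print(f"{count:>3}  {term}")
```

Raw counts like these only say what is popular, not why, so this is a starting point rather than the qualitative view I am after.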

2 comments ↓

#1 jerry chen on 12.11.05 at 2:18 am

interesting, but no cigar.

just when you think your bayesian nets are powerful enough to figure us out, there’s subtler nuances in human communication that just won’t compute, stuff that appeals to the pathos. at least not with today’s technology.

but here’s an idea, pass your forum snippets to Amazon Mechanical Turk. humans will give the final verdict. you might drop keywords to make your evaluation brand-neutral, but at least relevant context such as tone, mood, and emotion will be included. most importantly, it will remain visible to the eyes of a human, not some machine. you can automate this brand rating system using MT’s web API. shouldn’t be hard as long as you know which forums to scour and which excerpts to quote.

in short,
someone smart will come along to take advantage of MT in building a unified brand evaluation service. just wait.

#2 Nick on 12.11.05 at 2:58 am

True it will be tough for the computers to keep up with the changing language and culture of us humans, but there is a big advantage to these systems: Categorization.

If a spider can go out and index a ton of pages and tell you what topic each page is about, that’s a huge step forward. That allows someone to task their (human) workers more efficiently instead of having them look at web pages at random.

Also, it’s only a small step from categorizing a web page to discovering the “tone” of the page, that is, whether it is positive or negative. Imagine this: a spider reads a bunch of sites, then figures out which pages are about the target topic (say, Nike shoes). If it can then figure out which posts are negative, those posts can be routed to a customer service rep faster, since the company would like to find out the bad news first (and try to correct it).
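To make the idea concrete, here is a rough Python sketch of the topic-then-tone routing, assuming the spider has already fetched the posts as plain strings. The word lists, function names, and sample posts are all made up for illustration, and simple word counting is a stand-in for whatever a real system would use; it will miss exactly the kind of nuance mentioned in the first comment.

```python
import re

# Tiny hand-picked word lists, purely illustrative; a real system would need far more.
POSITIVE = {"love", "great", "comfortable", "good", "best"}
NEGATIVE = {"hate", "broke", "terrible", "bad", "worst", "refund"}


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def is_about(text, topic_terms):
    """True if every topic term appears somewhere in the text."""
    found = set(words(text))
    return all(term in found for term in topic_terms)


def tone(text):
    """Positive word count minus negative word count; below zero means negative."""
    tokens = words(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def negative_posts(posts, topic_terms):
    """Return on-topic posts with a negative score, most negative first."""
    on_topic = [p for p in posts if is_about(p, topic_terms)]
    flagged = [p for p in on_topic if tone(p) < 0]
    return sorted(flagged, key=tone)


# Made-up posts, only to show how the routing would behave.
posts = [
    "I love my new Nike shoes, best pair ever",
    "My Nike shoes broke after a week, terrible. I want a refund",
    "Terrible weather today",
    "These Nike shoes are bad",
]

for post in negative_posts(posts, ["nike", "shoes"]):
    print(tone(post), post)
```

In this sketch the off-topic post and the happy post are dropped, and the two complaints come out with the angriest one first, which is the order a customer service rep would want to see them in.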

I still think it’s a fascinating topic with a lot of potential.
