Hadoop vs. MySQL

I’ve just spent a couple of days playing with Hadoop, HBase, Hive, and Pig, following Cloudera’s guide (thanks to Cloudera for bringing these packages to CentOS). Cloudera is going in the right direction by targeting enterprises, and Hadoop is definitely on my watch list as it matures. But right now it’s very technical and not suitable for the general public. I’m also disappointed with its performance on a small test cluster (which I understand is an unfair test of what it’s designed for). For it to shine, you need both a big enough problem and a big enough server farm. However, I think many companies initially test Hadoop on a small cluster before investing more time and money in it, and it’s the first impression that makes a lasting impact. As Hadoop matures, I expect to see optimizations that reduce its overhead on small, low-end clusters.
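
To make the "big enough problem" point concrete, here is a back-of-envelope sketch. It is a toy model, not a measurement: it assumes work splits perfectly across nodes and that every cluster job pays a fixed startup overhead. All function names, parameters, and numbers are hypothetical.

```python
# Toy cost model (an assumption for illustration, not Hadoop's real behavior):
#   single node: time = rows / rate
#   cluster:     time = fixed job overhead + rows / (rate * nodes)

def single_node_seconds(rows, rows_per_second):
    """Time for one machine to process all rows."""
    return rows / rows_per_second


def cluster_seconds(rows, rows_per_second, nodes, job_overhead_seconds):
    """Time for a cluster, assuming perfect scaling plus a fixed overhead."""
    return job_overhead_seconds + rows / (rows_per_second * nodes)


def break_even_rows(rows_per_second, nodes, job_overhead_seconds):
    """Row count at which the cluster is exactly as fast as one machine.

    Solving rows/rate = overhead + rows/(rate*nodes) for rows gives
    rows = overhead * rate * nodes / (nodes - 1).
    """
    if nodes < 2:
        raise ValueError("a cluster needs at least two nodes to break even")
    return job_overhead_seconds * rows_per_second * nodes / (nodes - 1)


if __name__ == "__main__":
    # Hypothetical numbers: 1,000 rows/s per node, 4 nodes, 30 s job overhead.
    rows = break_even_rows(1000, 4, 30)
    print("break-even rows:", rows)
    print("single node:", single_node_seconds(rows, 1000), "s")
    print("cluster:", cluster_seconds(rows, 1000, 4, 30), "s")
```

Under these assumptions, any job smaller than the break-even size runs faster on a single machine, which is one way to read the disappointing small-cluster results.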

Setting up MySQL is easy. Scaling it is not so easy, but there is plenty of related software and technology to help you. Don’t think you can just switch to Hadoop/HBase/Hive in a day, though. The selling point is there (no-limit scaling on commodity hardware is at the core of the design), but there are many land mines you could step on if decisions are not evaluated carefully. Right now, I see Hadoop as one of the last resorts: something you turn to when you’re running into a wall, having exhausted your RDBMS options and the related software and technology that help it scale, like memcached, message queues, load balancing, etc. You should not choose Hadoop just because you’ve started a company and might get big in a couple of years. Of course, there are exceptions when you know your problem domain is only solvable in a distributed system. How popular Hadoop becomes may depend on whether its priority is to serve both markets or to focus only on the large farms.

You face real complexity when dealing with Hadoop/HBase/Hive/HDFS: setting it up, breaking work down into tasks, and setting up batch operations. For many, many applications, MySQL (or any RDBMS) ain’t going anywhere. I see smart companies using both for different parts of their operations. Unless Hadoop can do real-time, low-latency operations across distributed server farms effortlessly, there is no clear winner now, and there may never be one. Perhaps the trend toward real-time search (Twitter, Facebook) will speed this up.





