转载

Don't use Hadoop - your data isn't that big

“So, how much experience do you have with Big Data and Hadoop?” they asked me. I told them that I use Hadoop all the time, but rarely for jobs larger than a few TB. I’m basically a big data neophite - I know the concepts, I’ve written code, but never at scale. The next question they asked me. “Could you use Hadoop to do a simple group by and sum?” Of course I could, and I just told them I needed to see an example of the file format. They handed me a flash drive with all 600MB of their data on it (not a sample, everything). For reasons I can’t understand, they were unhappy when my solution involved pandas.read_csv rather than Hadoop. Hadoop is limiting. Hadoop allows you to run one general computation, which I’ll illustrate in pseudocode: Scala-ish pseudocode:
collection.flatMap( (k,v) => F(k,v) ).groupBy( _._1 ).map( _.reduce( (k,v) => G(k,v) ) ) 
SQL-ish pseudocode:
SELECT G(...) FROM table GROUP BY F(...) 
Or, as I explained a couple of years ago:
Goal: count the number of books in the library. Map: You count up the odd-numbered shelves, I count up the even numbered shelves. (The more people we get, the faster this part goes. ) Reduce: We all get together and add up our individual counts.
The only thing you are permitted to touch is F(k,v) and G(k,v), except of course for performance optimizations (usually not the fun kind!) at intermediate steps. Everything else is fixed. It forces you to write every computation in terms of a map, a group by, and an aggregate, or perhaps a sequence of such computations. Running computations in this manner is a straightjacket, and many calculations are better suited to some other model. The only reason to put on this straightjacket is that by doing so, you can scale up to extremely large data sets. Most likely your data is orders of magnitude smaller. But because “Hadoop” and “Big Data” are buzzwords, half the world wants to wear this straightjacket even if they don’t need to.

But my data is hundreds of megabytes! Excel won’t load it.

Too big for Excel is not “Big Data”. There are excellent tools out there - my favorite is Pandas which is built on top of Numpy. You can load hundreds of megabytes into memory in an efficient vectorized format. On my 3 year old laptop, it takes numpy the blink of an eye to multiply 100,000,000 floating point numbers together. Matlab and R are also excellent tools. Hundreds of megabytes is also typically amenable to a simple python script that reads your file line by line, processes it, and writes to another file.

But my data is 10 gigabytes!

I just bought a new laptop. The 16GB of ram I put in cost me $141.98 and the 256gb SSD was $200 extra (preinstalled by Lenovo). Additionally, if you load a 10 GB csv file into Pandas, it will often be considerably smaller in memory - the result of storing the numerical string “17284932583” as a 4 or 8 byte integer, or storing “284572452.2435723” as an 8 byte double. Worst case, you might actually have to not load everything into ram simultaneously.

But my data is 100GB/500GB/1TB!

A 2 terabyte hard drive costs $94.99, 4 terabytes is $169.99. Buy one and stick it in a desktop computer or server. Then install Postgres on it.

Hadoop << SQL, Python Scripts

In terms of expressing your computations, Hadoop is strictly inferior to SQL. There is no computation you can write in Hadoop which you cannot write more easily in either SQL, or with a simple Python script that scans your files. SQL is a straightforward query language with minimal leakage of abstractions, commonly used by business analysts as well as programmers. Queries in SQL are generally pretty simple. They are also usually very fast - if your database is properly indexed, multi-second queries will be uncommon. Hadoop does not have any conception of indexing. Hadoop has only full table scans. Hadoop is full of leaky abstractions - at my last job I spent more time fighting with java memory errors, file fragmentation and cluster contention than I spent actually worrying about the mostly straightforward analysis I wanted to perform. If your data is not structured like a SQL table (e.g., plain text, json blobs, binary blobs), it’s generally speaking straightforward to write a small python or ruby script to process each row of your data. Store it in files, process each file, and move on. Under circumstances where SQL is a poor fit, Hadoop will be less annoying from a programming perspective. But it still provides no advantage over simply writing a Python script to read your data, process it, and dump it to disk. In addition to being more difficult to code for, Hadoop will also nearly always be slower than the simpler alternatives. SQL queries can be made very fast by the judicious use of indexes - to compute a join, PostgreSQL will simply look at an index (if present) and look up the exact key that is needed. Hadoop requires a full table scan, followed by re-sorting the entire table. The sorting can be made faster by sharding across multiple machines, but on the other hand you are still required to stream data across multiple machines. In the case of processing binary blobs, Hadoop will require repeated trips to the namenode in order to find and process data. A simple python script will require repeated trips to the filesystem.

But my data is more than 5TB!

Your life now sucks - you are stuck with Hadoop. You don’t have many other choices (big servers with many hard drives might still be in play), and most of your other choices are considerably more expensive. The only benefit to using Hadoop is scaling. If you have a single table containing many terabytes of data, Hadoop might be a good option for running full table scans on it. If you don’t have such a table, avoid Hadoop like the plague. It isn’t worth the hassle and you’ll get results with less effort and in less time if you stick to traditional methods.

P.S. The Sales Pitch

I’m building a startup aiming to provide data analysis (big and small) and realtime recommendations and optimization to publishers and e-commerce sites. If you are interested in being a beta user, email me at stucchio@gmail.com. I also do consulting. If your company needs a Big Cloudy Data Strategy (TM), I can help you. But be warned - there is a good chance I’ll set you up with Pandas and tell you to A/B test, rather than giving you hadoop in the cloud.

P.P.S. Hadoop is a fine tool

I don’t intend to hate on Hadoop. I use Hadoop regularly for jobs I probably couldn’t easily handle with other tools. (Tip: I recommend using Scalding rather than Hive or Pig. Scalding lets you use Scala, which is a decent programming language, and makes it easy to write chained Hadoop jobs without hiding the fact that it really is mapreduce on the bottom.) Hadoop is a fine tool, it makes certain tradeoffs to target certain specific use cases. The only point I’m pushing here is to think carefully rather than just running Hadoop on The Cloud in order to handle your 500mb of Big Data at an Enterprise Scale.   本文原名“Don't use Hadoop when your data isn't that big ”,出自有着多年从业经验的数据科学家Chris Stucchio,纽约大学柯朗研究所博士后,搞过高频交易平台,当过创业公司的CTO,更习惯称自己为统计学者。对了,他现在自己创业,提供数据分析、推荐优化咨询服务,他的邮件是:stucchio@gmail.com 。 “你有多少大数据和Hadoop的经验?”他们问我。我一直在用Hadoop,但很少处理几TB以上的任务。我基本上只是一个大数据新手——知道概念,写过代码,但是没有大规模经验。 接下来他们会问:“你能用Hadoop做简单的group by和sum操作吗?”我当然会,但我会说需要看看具体文件格式。 他们给我一个U盘,里面有所有的数据,600MB,对,他们所有的数据。不知道为什么,我用pandas.read_csvPandas是一种Python数据分析库)而不是Hadoop完成了这个任务后,他们显得很不满意。 Hadoop其实是挺局限的。它无非是运行某个通用的计算,用SQL伪代码表示就是: SELECT G(...) FROM table GROUP BY F(...) 你只能改变G和F操作,除非要在中间步骤做性能优化(这可不怎么好玩!)。其他一切都是死的。 (关于MapReduce,之前作者写过一篇“41个词讲清楚MapReduce”,可以参考。) Hadoop里,所有计算都必须按照一个map、一个group by、一个aggregate或者这种计算序列来写。这和穿上紧身衣一样,多憋得慌啊。许多计算用其他模型其实更适合。忍受紧身衣的唯一原因就是,可以扩 展到极大极大的数据集。可你的数据集实际上很可能根本远远够不上那个数量级。 可是呢,因为Hadoop和大数据是热词,世界有一半的人都想穿上紧身衣,即使他们根本不需要。 可我的数据有好几百MB呢!Excel都装不下 对Excel很大可不是什么大数据。有很多好工具——我喜欢用的是基于Numpy的Pandas。它可以将几百MB数据以高效的向量化格式加载到内存,在我已经3年的老笔记本上,一眨眼的功夫,Numpy就能完成1亿次浮点计算。Matlab和R也是很棒的工具。 数百MB数据一般用一个简单的Python脚本逐行读取文件、处理,然后写到了一个文件就行了。 可我的数据有10G呢! 我刚买了一台笔记本电脑。16G内存花了141.98美元,256GB SSD多收200美元。另外,如果在Pandas里加载一个10GB的csv文件,实际在内存里并没有那么大——你可以将 “17284932583” 这样的数值串存为4位或者8位整数,“284572452.2435723”存为8位双精度。 最差情况下,你还可以不同时将所有数据都一次加载到内存里。 可我的数据有100GB/500GB/1TB! 一个2T的硬盘才94.99美元,4T是169.99。买一块,加到桌面电脑或者服务器上,然后装上PostgreSQL。 Hadoop的适用范围远小于SQL和Python脚本 从计算的表达能力来说,Hadoop比SQL差多了。Hadoop里能写的计算,在SQL或者简单的Python脚本都可以更轻松地写出来。 SQL是直观的查询语言,没有太多抽象,业务分析师和程序员都很常用。SQL查询往往非常简单,而且一般也很快——只要数据库正确地做了索引,要花几秒钟的查询都不太多见。 Hadoop没有任何索引的概念,它只知道全表扫描。而且Hadoop抽象层次太多了——我之前的项目尽在应付Java内存错误、内存碎片和集群竞用了,实际的数据分析工作反而没了时间。 如果你的数据结构不是SQL表的形式(比如纯文本、JSON、二进制),一般写一小段Python或者Ruby脚本按行处理更直接。保存在多个文件里,逐个处理即可。SQL不适用的情况下,从编程来说Hadoop也没那么糟糕,但相比Python脚本仍然没有什么优势。 除了难以编程,Hadoop还一般总是比其他技术方案要慢。只要索引用得好,SQL查询非常快。比如要计算join,PostgreSQL只需查 看索引(如果有),然后查询所需的每个键。而Hadoop呢,必须做全表扫描,然后重排整个表。排序通过多台机器之间分片可以加速,但也带来了跨多机数据 流处理的开销。如果要处理二进制文件,Hadoop必须反复访问namenode。而简单的Python脚本只要反复访问文件系统即可。 可我的数据超过了5TB! 你的命可真苦——只能苦逼地折腾Hadoop了,没有太多其他选择(可能还能用许多硬盘容量的高富帅机器来扛),而且其他选择往往贵得要命(脑海中浮现出IOE等等字样……)。 用Hadoop唯一的好处是扩展。如果你的数据是一个数TB的单表,那么全表扫描是Hadoop的强项。此外的话,请关爱生命,尽量远离Hadoop。它带来的烦恼根本不值,用传统方法既省时又省力。
正文到此结束
Loading...