Spark's performance bottleneck

Question:

The paper "Making Sense of Performance in Data Analytics Frameworks", published at NSDI 2015, concludes that CPU (not I/O or network) is the performance bottleneck of Spark. In the paper, Kay ran experiments on Spark including BDbench, TPC-DS, and a production workload (in which only Spark SQL is used?). I wonder whether this conclusion also holds for frameworks built on Spark, such as Spark Streaming: with a continuous data stream received over the network, both network I/O and disk would be under heavy pressure.

Network and disk may be under less pressure in Spark Streaming because streams are usually checkpointed, meaning that not all data is kept around forever.

But ultimately, this is a research question: the only way to settle it is to benchmark. Kay's code is open source.
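As a rough illustration of what "benchmark" could mean here, the sketch below compares CPU time with wall-clock time for a single Python process. It is not the paper's methodology and does not instrument Spark; the names `cpu_fraction`, `cpu_heavy` and `wait_heavy` are hypothetical stand-ins. A ratio near 1 suggests the work is CPU-bound, while a ratio near 0 suggests it mostly waits (for example on disk or network).

```python
import time


def cpu_fraction(task):
    """Run task() once and return CPU time divided by wall-clock time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def cpu_heavy():
    # Hypothetical CPU-bound stand-in: busy arithmetic, no waiting.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def wait_heavy():
    # Hypothetical I/O-bound stand-in: sleeping mimics blocking on disk or network.
    time.sleep(0.2)


if __name__ == "__main__":
    print("cpu_heavy :", round(cpu_fraction(cpu_heavy), 2))
    print("wait_heavy:", round(cpu_fraction(wait_heavy), 2))
```

For a real Spark Streaming job, the same question would have to be asked per task and across the cluster, so this only shows the shape of the measurement, not a result.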