What is the difference between the distribute-work-synchronize pattern and the fork-join parallel programming model?

Problem description:

In this article on Wikipedia: https://en.wikipedia.org/wiki/Go_(programming_language)#Suitability_for_parallel_programming

It is claimed that the Go expert used a distribute-work-synchronize pattern to organize his parallel programs, while the non-expert used a fork-join: https://upload.wikimedia.org/wikipedia/commons/thumb/f/f1/Fork_join.svg/2000px-Fork_join.svg.png

I am familiar with fork-join from school, but I was wondering what the distribute-work-synchronize pattern is, and what the differences are between it and the classical fork-join model that I'm familiar with?

When I do fork-join I typically run as many threads as I have cores, but in the paper they say the Go expert did this as well, and they briefly mention reducing the overhead of creating new threads as one of the ways the Go expert optimized the code, but they don't seem to go into detail.

I would be very careful taking the statement you've quoted from https://en.wikipedia.org/wiki/Go_(programming_language)#Suitability_for_parallel_programming as a general truth for Go programming.

I assume that what's described in the study as distribute-work-synchronize is the approach of dividing a problem into subtasks that are determined mainly by the parallelism that can be achieved in hardware, and less by the natural way the problem decomposes into smaller tasks.
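
To make that concrete, here is a minimal Go sketch of how I read distribute-work-synchronize: split the input into one contiguous chunk per core up front, run one goroutine per chunk, and synchronize once at the end. The chunking scheme and the summation payload are my own illustration, not taken from the study.

```go
package main

import (
    "fmt"
    "runtime"
    "sync"
)

// distribute-work-synchronize, as I read it: one contiguous chunk of the
// input per core, one goroutine per chunk, one synchronization point.
func parallelSum(data []int) int {
    workers := runtime.GOMAXPROCS(0) // number of cores the runtime will use
    partial := make([]int, workers)  // each worker writes only its own slot
    chunk := (len(data) + workers - 1) / workers

    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        lo := w * chunk
        if lo >= len(data) {
            break
        }
        hi := lo + chunk
        if hi > len(data) {
            hi = len(data)
        }
        wg.Add(1)
        go func(w, lo, hi int) {
            defer wg.Done()
            for _, v := range data[lo:hi] {
                partial[w] += v
            }
        }(w, lo, hi)
    }
    wg.Wait() // the single synchronization point

    total := 0
    for _, p := range partial {
        total += p
    }
    return total
}

func main() {
    data := make([]int, 1000000)
    for i := range data {
        data[i] = 1
    }
    fmt.Println(parallelSum(data)) // 1000000
}
```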

This approach may give you some performance benefits depending on your specific problem and your expertise, but it may not be trivial to apply even for embarrassingly parallel problems. In addition, this approach is more dependent on the specific hardware you're using (e.g. 2-32 vs 64-1024 cores, regular CAS vs LL/SC), on the specific problem size (e.g. fits in L1 vs barely fits in RAM), and most importantly, on your expertise with that approach and with the tool that you're using to solve your problem.

The above is standard "premature optimization is the root of all evil" / "simple, adequate and correct trumps complex, super-fast and with an insidious bug" advice, but the actual experiment cited in the paper also gives some examples of why you should use your own judgment about which approach to use.

  1. The study used Go 1.0.3. The significant improvements made since then to the scheduler, the garbage collector, and goroutine/channel performance may have effectively made the results obsolete.

    1.1. Ironically, the paper mentions how for one of the Chapel solutions, the expert version was ~68% slower when Chapel 1.6 (instead of 1.5) was used.

  2. The study does not claim to provide statistically significant results - for each of the 4 platforms, a single non-expert solved 6 synthetic problems that fit a specific blueprint, then rewrote his solutions according to the advice of a single expert.

  3. The expert should not be held responsible for his advice being applied outside of its specific context: distribute-work-synchronize was the better approach for these specific problems, if you are a Staff Software Engineer on the Golang team (Luuk van Dijk) and your alternative was plain divide-and-conquer with Go 1.0.3.

When I do fork-join I typically run as many threads as I have cores, but in the paper they say the Go expert did this as well, and they briefly mention reducing the overhead of creating new threads as one of the ways the Go expert optimized the code, but they don't seem to go into detail.

I'm assuming the overhead of creating new threads is related to the proliferation of tasks when approaching the bottom of the recursion tree.

I would think that algorithms that fall into Cases 2 and 3 of the Master Theorem will be particularly affected, but even algorithms that fall into Case 1 (where the work done at the leaf level of the tree is most significant, i.e. the overhead of the spawned threads is diluted the most) will be affected. For example, making a new logical task for the comparison of each pair of elements in a top-down merge sort is probably redundant.
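
One common way to keep the fork-join shape while avoiding that proliferation of tiny tasks is a sequential cutoff near the leaves. Below is a hedged Go sketch of plain divide-and-conquer with such a cutoff; the cutoff value and the summation payload are illustrative, not taken from the paper.

```go
package main

import (
    "fmt"
    "sync"
)

// Below the cutoff we stop spawning goroutines, so the number of tasks is
// proportional to the number of "large" subproblems rather than to the
// number of leaves of the recursion tree.
const cutoff = 1 << 14 // illustrative; tune for your problem and hardware

func forkJoinSum(data []int) int {
    if len(data) <= cutoff {
        s := 0
        for _, v := range data {
            s += v
        }
        return s
    }
    mid := len(data) / 2

    var left int
    var wg sync.WaitGroup
    wg.Add(1)
    go func() { // fork: left half in a new goroutine
        defer wg.Done()
        left = forkJoinSum(data[:mid])
    }()
    right := forkJoinSum(data[mid:]) // right half in the current goroutine
    wg.Wait()                        // join
    return left + right
}

func main() {
    data := make([]int, 1000000)
    for i := range data {
        data[i] = 1
    }
    fmt.Println(forkJoinSum(data)) // 1000000
}
```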

IMHO, running as many threads as you have cores, while still dividing the work logically along the problem's natural boundaries / on each recursion level, is a great compromise between my understanding of distribute-work-synchronize and a plain divide-and-conquer; a sketch of what I mean is below.
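
Here is a minimal Go sketch of that compromise, under my own assumptions about task granularity: the work is still divided into many logical tasks, but they are executed by a fixed pool of one worker goroutine per core.

```go
package main

import (
    "fmt"
    "runtime"
    "sync"
)

// The compromise: emit as many logical tasks as the problem naturally
// suggests, but execute them on a fixed pool of one worker per core.
func poolSum(data []int, taskSize int) int {
    tasks := make(chan []int)
    results := make(chan int)

    var wg sync.WaitGroup
    for w := 0; w < runtime.GOMAXPROCS(0); w++ { // one worker per core
        wg.Add(1)
        go func() {
            defer wg.Done()
            for chunk := range tasks {
                s := 0
                for _, v := range chunk {
                    s += v
                }
                results <- s
            }
        }()
    }

    // Producer: emit tasks at whatever granularity the problem suggests.
    go func() {
        for lo := 0; lo < len(data); lo += taskSize {
            hi := lo + taskSize
            if hi > len(data) {
                hi = len(data)
            }
            tasks <- data[lo:hi]
        }
        close(tasks)
    }()

    // Close results once every worker has drained the task channel.
    go func() {
        wg.Wait()
        close(results)
    }()

    total := 0
    for s := range results {
        total += s
    }
    return total
}

func main() {
    data := make([]int, 1000000)
    for i := range data {
        data[i] = 1
    }
    fmt.Println(poolSum(data, 4096)) // 1000000
}
```

Note that Go's runtime already multiplexes goroutines onto OS threads, so the explicit pool here mainly bounds the number of live tasks and the per-task overhead rather than the number of threads.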

You are still paying some price in complexity (and probably, but not necessarily, in running time) for the scheduling of your N tasks on the K worker threads. This price may cost you missed parallelization opportunities at runtime, for example because of cache thrashing, sub-optimal scheduling, thread or core affinity in play, etc. However, this may be abstracted away from your program at the language platform level, at the Fork-Join framework level (in case you are using a framework), or maybe at the OS level. In this case, your solution is not fully responsible for adapting to changes in your language platform, in your problem size, or in your machine hardware, and you should be able to benefit from optimizations in the underlying layers without touching your solution.

My bet is that the increased complexity and the reduced portability of a custom-tailored distribute-work-synchronize solution are worth it only if you can prove that your natural divide-and-conquer solution is not enough, and you are aware of the consequences.