I’ve mentioned this topic a few times in recent weeks, so it was really time to sit down and put pen to paper (or fingers to keyboard). This one was a lot of fun to dig into, and I think the results will be of interest to people.
Parallelism is something that I’ve been tracking since my studies during the 90s: I remember programming transputers using a programming language called occam, way back when. I was really happy when – around a decade ago – Microsoft got serious about tackling asynchrony and concurrency with F#’s Asynchronous Workflows and the .NET Framework’s Task Parallel Library. I had a lot of fun playing with both, mostly in the context of AutoCAD programming.
When my colleague Simon Breslav decided to try using Parallel.ForEach from IronPython code inside Dynamo, I was definitely intrigued, especially with respect to how that code would then work with Project Refinery.
Firstly, let’s discuss what Parallel.ForEach does, and how to use it inside Dynamo. Parallel.For and Parallel.ForEach basically step through a set of operations in much the same way as standard for/foreach loops do, but rather than executing the iterations one after another on a single thread, they hand them off as tasks that get executed concurrently by the .NET Framework via the CLR thread pool. The call itself still blocks until all of the iterations have completed.
It’s a great way to parallelize code, and it can improve performance significantly, as long as you pay attention to the potential pitfalls. Simon chose to use Parallel.ForEach to improve the performance of a number of the Project Rediscover metrics, notably for calculating Buzz, Visual Distraction and Daylight. I added a global toggle that allows us to switch easily between serial and parallel loop execution, so we can compare performance and work out whether it’s all worth the effort.
So how do you use Parallel.ForEach with Dynamo? The simplest way is to integrate it into Python code: as Dynamo runs Python code using IronPython – which has access to the .NET Framework – we can get access to it simply by importing the System.Threading.Tasks module.
Let’s take a look at the approach we’ve taken for the various metrics. Here’s the basic structure we’ve used throughout. Assume that the parallel variable is just a Boolean flag saying whether to use Parallel.ForEach or a standard for loop. calculateScore() is our worker function that will store the results in the results array at index i. We’re passing in a pathLattice (actually a spaceLattice), as our operations happen to be on Space Analysis paths. (You’ll be able to see the “final” version of this once we publish the Project Rediscover graph, sometime between now and AU London in June.)
from System.Threading.Tasks import Parallel

if parallel:
    # Each iteration writes only to results[i], so no locking is needed here
    Parallel.ForEach(paths, lambda path, state, i: calculateScore(path, i, pathLattice, results))
else:
    for i, path in enumerate(paths):
        calculateScore(path, i, pathLattice, results)
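For anyone who wants to try the pattern outside Dynamo, here’s a minimal sketch of the same serial/parallel toggle in standard CPython, with ThreadPoolExecutor standing in for Parallel.ForEach. Everything in it is an assumption for illustration: calculate_score is a dummy worker, the paths are just lists of numbers, and the lattice argument is left out. Bear in mind that CPython’s GIL means threads won’t speed up CPU-bound work the way they can under IronPython; the point is only the shape of the code.

```python
from concurrent.futures import ThreadPoolExecutor

def calculate_score(path, i, results):
    # Hypothetical worker: each call writes only to its own slot, results[i]
    results[i] = sum(path)

def run(paths, parallel):
    results = [None] * len(paths)
    if parallel:
        with ThreadPoolExecutor() as pool:
            # list() drains the iterator so any worker exception surfaces here
            list(pool.map(
                lambda item: calculate_score(item[1], item[0], results),
                enumerate(paths)))
    else:
        for i, path in enumerate(paths):
            calculate_score(path, i, results)
    return results

paths = [[1, 2, 3], [4, 5], [6]]  # made-up data
serial_results = run(paths, parallel=False)
parallel_results = run(paths, parallel=True)
```

Because every iteration owns exactly one index of the results list, the two modes produce the same output and no synchronization is needed. That’s the property that makes the toggle safe to flip.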
This approach has worked fairly solidly for our various metrics, but there are definitely areas to watch (they relate to the pitfalls mentioned above). When there’s a shared variable, such as an array of values that multiple iterations contribute to (say when we’re calculating congestion, which is based on the sum of data from multiple paths), we need to control access to it using, for instance, a Monitor object.
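To make the shared-variable case concrete, here’s a small sketch in standard CPython, where threading.Lock plays the role that a Monitor would play in IronPython. The congestion array, the paths (expressed as lists of cell indices) and the add_path function are all made up for the example; this isn’t the Project Rediscover code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

congestion = [0] * 5               # hypothetical shared array: one counter per cell
congestion_lock = threading.Lock()

def add_path(path):
    # Several paths can touch the same cell, so the read-modify-write on the
    # shared array happens under the lock. (In IronPython the counterpart
    # would be System.Threading's Monitor.Enter/Monitor.Exit in a try/finally.)
    with congestion_lock:
        for cell in path:
            congestion[cell] += 1

# Made-up paths, repeated to generate plenty of contention
paths = [[0, 1, 2], [1, 2, 3], [2, 3, 4]] * 100

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(add_path, paths))
```

The trade-off is that time spent inside the lock is effectively serial, so it pays to keep the locked section as short as possible, or to have each iteration build its own partial result and merge them at the end.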
Also, if we’re going to use geometry in our function – such as when we analyse daylight by firing rays from the sun positions to various locations in the building – then we need to be super-careful: Autodesk’s Shape Manager library is multi-threaded, but we need to make sure certain geometry is created and accessed from the same thread. We found raycasting to work well, but you would have to be very careful when creating surfaces, etc.
Does using Parallel.For[Each] make a big difference?
It depends on the way your code is structured and the characteristics of the execution environment. I tried a couple of different versions of the Project Rediscover graph, one which only calculates global daylight, while the other does much more, calculating daylight per-neighbourhood as well as teams’ workplace preference. I ran the calculations on two different systems: one inside a Parallels VM on my aging MacBook Pro (which has 4 cores dedicated to it) and a beefy, native Windows desktop with 16 cores and a shedload of RAM.
In the resource-constrained environment, I saw a modest increase in performance: the lighter model ran in 70 seconds in serial mode and 54 seconds in parallel mode… so a 23% improvement. The heavier model went from 285 seconds in serial to 196 seconds in parallel, so a 31% improvement.
On the beefier system, we saw much more significant gains. The lighter model started at 45 seconds serially and dropped to 14 seconds in parallel. A whopping 69% improvement. And the heavier model went from 196 seconds to 41 seconds… an improvement of 79%!
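For reference, the percentages here are reductions in run time, which is a different number from the speedup factor. A quick sketch of both calculations, using the timings from the 16-core desktop (the function name is just for illustration):

```python
def improvement(serial_s, parallel_s):
    """Return (percent reduction in run time, speedup factor)."""
    reduction = 100.0 * (serial_s - parallel_s) / serial_s
    speedup = float(serial_s) / parallel_s
    return reduction, speedup

lighter = improvement(45, 14)    # lighter model, seconds
heavier = improvement(196, 41)   # heavier model, seconds

print("lighter: %.0f%% less time, %.1fx faster" % lighter)
print("heavier: %.0f%% less time, %.1fx faster" % heavier)
```

So the 69% and 79% improvements correspond to speedups of roughly 3.2x and 4.8x. That’s well short of 16x on a 16-core machine, presumably because only part of the graph’s work sits inside the parallel loops.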
This was really encouraging, but it’s worth bearing a few things in mind. .NET will use all available cores to run these additional tasks: if you check out the CPU usage when running the graph, you’ll see it go from around 25% utilization (on a quad-core system) to 100% utilization. As you might expect, the .NET Framework uses the CLR thread pool extremely efficiently.
Here are some screenshots taken during a serial run:
And these are from a parallel run:
Which is great, but what happens when it runs inside Refinery, which is in any case using process-level parallelism (running various instances of the graph concurrently using separate executables)? Would it negate the gains? Would one graph execution consume all the CPU resources, leaving nothing for its siblings?
Looking at the CPU load during a serial run, at first glance it indeed seems there’s a more even processor utilization:
During a parallel run the processor utilization is heavier for four of the worker processes:
So this does suggest that the greedier resource usage of the parallel code is skewing things towards fewer worker executables. At some moments this becomes extreme:
But how do things look in terms of execution time?
It turns out there’s still huge value in using Parallel.For even with Refinery. I did some tests with the heaviest model on the most performant of my two systems, and found that an optimization run of 5 generations of a population of 12 designs each went from 78 minutes (yes, minutes) execution time to just 26. So an improvement of 67%. This was consistent across 6 different runs (3 serial, 3 parallel) on my beefy desktop system. I didn’t do these tests on my Parallels VM as it’s prone to crashing (the VM, not Dynamo/Refinery) when pushed too far resource-wise.
Incidentally, there’s a really handy server log available at %appdata%\Refinery\refinery-server-log.txt, which avoids you having to measure the timing of runs individually.
So this is really interesting… Here’s what I think is happening: Refinery’s server component appears to execute instances of the graph in groups of 4: it has more processes available, but for optimization runs they appear to be run in 4s. This is presumably related to the need to specify your population as a multiple of 4. I’ll check in with the Refinery team to understand what drives this (perhaps the way the optimisation process works?) or whether I’m mis-reading things. One possibility is that the 4 that get launched when running in parallel initially consume most of the system resources, so 4 more don’t get launched immediately afterwards, which is what happens when running serially.
Whatever the reason, there do appear to be resources left over that make the use of parallel code workflows worthwhile. I dusted off my grep and sed skills to dig into the Refinery server log, and found that the heavier graph was being executed in (on average) 298 seconds (serial) and 105 seconds (parallel). If you compare this with plain Dynamo Sandbox – which could take 100% of the CPU resources and ran in 196 and 41 seconds respectively – there’s clearly some performance impact with running in this way.
I’m actually a little surprised the serial numbers weren’t more similar… going from 196 to 298 seconds (from running in the UI to running via Refinery), but then the average numbers were skewed slightly higher by runs that come towards the end of each study… sometimes these would weigh in at 11 or 12 minutes (660-720 seconds). This behaviour didn’t appear with the parallel runs, for some reason.
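As a rough cross-check on the groups-of-4 observation, the study times and the per-graph averages can be combined into an effective concurrency figure. This is a back-of-the-envelope sketch rather than anything Refinery reports: it assumes the study ran exactly 5 × 12 = 60 graph evaluations and ignores any overhead between runs, and the function name is made up for the example.

```python
GENERATIONS = 5
POPULATION = 12
# Assumption: one graph run per design per generation
EVALUATIONS = GENERATIONS * POPULATION

def effective_concurrency(avg_run_s, study_minutes):
    # Total graph-seconds of work divided by wall-clock seconds for the study
    return EVALUATIONS * avg_run_s / (study_minutes * 60.0)

serial = effective_concurrency(298, 78)     # average run 298 s, study 78 min
parallel = effective_concurrency(105, 26)   # average run 105 s, study 26 min

print("serial:   %.1f graphs in flight on average" % serial)
print("parallel: %.1f graphs in flight on average" % parallel)
```

Under those assumptions both modes come out at roughly 4 graphs in flight at any one time (about 3.8 serially and 4.0 in parallel), which at least doesn’t contradict the idea that optimization runs are executed in 4s.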
So, in a nutshell, using Parallel.For and Parallel.ForEach can be highly beneficial for graphs with complex calculations, whether just running in the Dynamo UI or being run by Refinery. The specific gains will depend on your system’s resources and how you use parallelism in your graph’s Python blocks.
The specific numbers I’ve shown today are for sure going to change, by the way: Project Rediscover is very much a work in progress, so the graph will a) be doing more and hopefully also b) be doing it more efficiently. Beyond that there are regular performance gains being delivered by both the Dynamo and Refinery teams.
A related note… measurement and optimisation (from a performance perspective, not in the Refinery sense of the word) is a big topic with Dynamo, at the moment. It was the driver behind one of the winning projects at the recent Dynamo and Generative Design Hackathon in London, and the Dynamo team are working on tools to make this easier, too.
Something else that’s worth noting is that the calculations in Dynamo graphs today are largely CPU-bound. One area Simon, Rhys and I have been discussing is the ability to integrate operations that are calculated on the GPU. We’re very curious to find out what others have done in this space, so if you have any pointers, send them over! Right now we’re thinking it might be interesting to have Dynamo host OpenCL, but that’s about as far as we’ve gotten.