View Full Version : GPU versus multi-core CPU



MarxSchmarx
9th October 2010, 07:54
OK, here is a debate I am having with a colleague.

Suppose we have an application that takes X seconds to run on a single-core CPU clocked at 4 GHz.
Now suppose I have to execute this application N different times. So on a single core CPU it takes N*X seconds to run through what I need to do.

Now, it strikes me that if a GPU has Y cores, and each core is clocked at Z GHz, with Z << 4 (e.g., Z = 0.2, i.e. 200 MHz), then if I port the application to run on a single GPU core, it will take 4*X/Z seconds to execute. Since Z << 4, this means it will take a lot longer to run on a single GPU core.

OK, so far so good. What I'm arguing is that since the GPU has hundreds or thousands of these cores (say Y of them), the N runs can be spread across them, giving a total execution time of N * 4*X/(Y*Z). Depending on the values of Y and Z, we have a situation where N * 4*X/(Y*Z) < N*X, i.e., Y*Z > 4. My point being that because the application has to be executed N times, even though each run is slower per core, the savings accrue by virtue of the fact that we execute many copies of the program at once across Y cores.
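The arithmetic above can be sketched in a few lines, assuming the post's naive clock-scaling model (a run's time scales inversely with clock speed, and all memory, transfer, and scheduling overheads are ignored). The function names and the example numbers are illustrative, not from any real benchmark.

```python
# Naive clock-scaling model from the post: a run takes X seconds on a 4 GHz
# CPU core; one GPU core at Z GHz runs it in 4*X/Z seconds; Y GPU cores
# process Y independent runs at once. Symbols (N, X, Y, Z) follow the post.

def cpu_total(N, X):
    """Total time for N consecutive runs on the single 4 GHz CPU core."""
    return N * X

def gpu_total(N, X, Y, Z):
    """Total time for N runs spread across Y GPU cores clocked at Z GHz."""
    per_run = 4 * X / Z       # one run on one slow GPU core
    batches = -(-N // Y)      # ceil(N / Y): batches of Y parallel runs
    return batches * per_run

# Example: N=1000 runs of X=2 s each, Y=500 cores at Z=0.2 GHz (200 MHz).
print(cpu_total(1000, 2.0))            # 2000.0 s serially on the CPU
print(gpu_total(1000, 2.0, 500, 0.2))  # 80.0 s under this idealised model
```

Note that for N divisible by Y the GPU wins exactly when Y*Z > 4, independent of N.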

My colleague, however, insists that this isn't the way to go - that we should instead figure out how to shrink the execution time X of a single run by a factor of Y (or potentially somewhat less, let's say Y/2).

What do you all think? What is the resolution to this debate?

Edit: Whoops, Y needs to be in the denominator :þ

¿Que?
9th October 2010, 08:28
Well, I understand this:
N*X (because N is the number of times you need to run the app, and X is how long it takes, this gives you the amount of time it takes to run the app however many times, consecutively, one after the other, as in a single core).

I think I'm getting tripped up because I don't get if the processor speed is Z or 4.

ÑóẊîöʼn
9th October 2010, 12:22
If I understand you correctly:

The cores still need to communicate with each other (presumably) and with the rest of the system. If you have so many cores, there's going to be an increasing amount of overhead needed for synchronisation and coordination for every core added, with diminishing returns as a result.

Your calculations don't seem to take factors like that into account.

MarxSchmarx
10th October 2010, 13:44
Well, I understand this:
N*X (because N is the number of times you need to run the app, and X is how long it takes, this gives you the amount of time it takes to run the app however many times, consecutively, one after the other, as in a single core).

I think I'm getting tripped up because I don't get if the processor speed is Z or 4.

Right, the GPU core processor speed is Z: if something takes X seconds to run on a CPU core, it takes 4*X/Z seconds to run on the GPU core, because if it takes R seconds to run on the GPU core then
R/X = 4/Z -> R=4*X/Z.
For example, if the GPU core is say Z = 2 GHz, then it should take twice as long to run on the GPU core as on the CPU core. If through some miracle the GPU core is say Z = 8 GHz, it should take half as long on the GPU core. Z is likely in the range of 100-500 MHz.
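The ratio being described (R/X = 4/Z under a pure clock-speed model) can be checked against the two examples above; the function name and the X value are just illustrative.

```python
# R = 4*X/Z: run time on a Z GHz GPU core for a job taking X seconds on a
# 4 GHz CPU core, assuming time scales purely with clock speed.
def gpu_run_time(X, Z):
    return 4 * X / Z

print(gpu_run_time(10.0, 2.0))  # 20.0 -> twice as long at Z = 2 GHz
print(gpu_run_time(10.0, 8.0))  # 5.0  -> half as long at Z = 8 GHz
```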



The cores still need to communicate with each other (presumably) and with the rest of the system. If you have so many cores, there's going to be an increasing amount of overhead needed for synchronisation and coordination for every core added, with diminishing returns as a result.

Your calculations don't seem to take factors like that into account.

Sure, although it's a bit tricky here. We're thinking of potentially not having the cores needing to communicate with each other - the reason is that each of the N program runs (that take X seconds on the CPU per run) are in fact independent of each other - it's what's called an embarrassingly parallel problem. So synchronization becomes a bit less of a concern under the scheme I'm proposing.
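The embarrassingly parallel structure described above can be sketched as a plain map over independent inputs; `ThreadPoolExecutor` stands in for the GPU cores here, and `one_run` is a hypothetical stand-in for one application run - the point is the shape of the computation (no inter-run communication), not the hardware.

```python
# Each of the N runs depends only on its own input and shares no state with
# the others, so workers never need to synchronise with each other.
from concurrent.futures import ThreadPoolExecutor

def one_run(seed):
    """Stand-in for one X-second application run; a pure function of its input."""
    return seed * seed % 97

inputs = list(range(8))                           # N = 8 independent runs
with ThreadPoolExecutor(max_workers=4) as pool:   # Y = 4 "cores"
    results = list(pool.map(one_run, inputs))     # map preserves input order

print(results == [one_run(s) for s in inputs])    # True
```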

The issue is interacting with the rest of the system, particularly the CPU. We've thought of having the only communication be: the CPU feeds the input to the GPU, the GPU does its thing, and it spits the result back to the CPU. An alternative we thought of is to have the GPU hand back results every cycle and write those results to the hard drive. And that's a concern we've had: the latter would potentially be a computational bane, the former likely less so. Even then, if the I/O and other overhead inflate the GPU cost by a factor M, depending on M we might still have a situation where
N*X > M * N * 4*X/(Z*Y), i.e., Y*Z > 4*M
But you're right, these factors aren't taken into account, and for now I'm considering the case where the only interaction between the GPU and the rest of the system is at the seeding and final read-out stages.
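That break-even comparison can be sketched as follows, still under the naive clock-scaling model, with the overhead treated as a multiplicative factor M on the GPU side. All symbols follow the posts above; the numbers are made up for illustration.

```python
# GPU pays M times its ideal cost for I/O and transfers; CPU runs serially.
# Under this model the GPU wins exactly when Y*Z > 4*M.

def gpu_wins(X, Y, Z, M, N):
    cpu = N * X                       # serial CPU total
    gpu = M * (4 * X / Z) * N / Y     # N runs over Y cores, inflated by M
    return gpu < cpu

# Y*Z = 500 * 0.2 = 100, so the GPU wins while 4*M < 100:
print(gpu_wins(X=2.0, Y=500, Z=0.2, M=10, N=1000))  # True  (4*M = 40)
print(gpu_wins(X=2.0, Y=500, Z=0.2, M=30, N=1000))  # False (4*M = 120)
```

Note that N again cancels out of the comparison, so the break-even point depends only on Y, Z, and M.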