multicore vs bandwidth

IBM (C)(R)(TM)(etc) has a new 17-core chip: 16 to do the work and 1 to 'rule them all'. the problem, though, is that the memory interface has 'only' a few channels

do the math: 16 cores on, say, a 4-channel bus means 4 cores competing for each channel, right? supersized, multilevel, multi-way-associative caches help, but with multi-gigabyte datasets that only goes so far, and there are lots of problems where CUDA and the like don't do so well either
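
a quick back-of-the-envelope in C; the channel count and the 25.6 GB/s per-channel figure are assumed numbers for illustration, not specs of the actual chip:

```c
/* Back-of-the-envelope: how much memory bandwidth each core gets when
 * many cores share a few channels. The figures are illustrative
 * assumptions, not specs of any real chip. */
#include <stdio.h>

int main(void)
{
    const int    cores          = 16;
    const int    channels       = 4;     /* assumed channel count     */
    const double gb_per_channel = 25.6;  /* assumed GB/s per channel  */

    double total_bw    = channels * gb_per_channel;
    double bw_per_core = total_bw / cores;  /* 4 cores contend per channel */

    printf("total bandwidth : %.1f GB/s\n", total_bw);
    printf("per-core share  : %.1f GB/s (%d cores per channel)\n",
           bw_per_core, cores / channels);
    return 0;
}
```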

a better solution would be small cores with built-in RAM, and lots of them. even if some spend half their life shuffling data (usually considered bad karma) you should still end up with amazing overall bandwidth and latency within the chip. across several cores you need a fast multipath router, and you may lose a core or two at a time to transfers, but if you have, say, 64 cores that doesn't really matter. it may also make memory use more efficient. think 64-bit: so far only a few apps need more than 4 GB of memory. give every core a gig and you only need 32-bit pointers locally; across the whole dataset you of course need more.
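
here's a rough sketch in C of that local/global pointer split; the core count, bank size and the global_ptr layout are made-up assumptions, just to show the shape of the idea:

```c
/* Sketch of "32-bit pointers locally, wider addresses globally": each core
 * sees its own bank through a 32-bit offset, and a global reference is just
 * (core id, offset). All names and sizes are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>

#define NUM_CORES 64
#define BANK_SIZE (1u << 30)          /* 1 GB of RAM per core (assumed) */

typedef uint32_t local_ptr;           /* offset within the core's own bank  */

typedef struct {
    uint8_t   core;                   /* which core's bank                  */
    local_ptr offset;                 /* 32-bit offset inside that bank     */
} global_ptr;                         /* 64 banks x 1 GB = 64 GB reachable  */

/* In a real chip this would go through the on-chip router; here every
 * bank is just a plain array standing in for a core's local RAM. */
static uint8_t *banks[NUM_CORES];

static uint8_t load_global(global_ptr p)
{
    return banks[p.core][p.offset];   /* remote access: one router hop      */
}

int main(void)
{
    static uint8_t bank0[16], bank7[16];  /* tiny stand-ins for per-core RAM */
    banks[0] = bank0;
    banks[7] = bank7;
    bank7[3] = 42;

    global_ptr p = { .core = 7, .offset = 3 };
    printf("value at (core %u, offset %u) = %u\n",
           p.core, p.offset, load_global(p));
    printf("local pointer: %zu bytes, global pointer: %zu bytes\n",
           sizeof(local_ptr), sizeof(global_ptr));
    return 0;
}
```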

many problems can be split up into many smaller ones: video/picture compression and analysis is all tiles, blocks and frames anyway, so each core can handle its own data and then hand the result to the 'mastermind'. games and simulations would benefit as well, since it's all frames and particles: render a few speculative frames ahead based on the probability of the player moving, e.g. forward towards an object (door, enemy, food etc), and discard them if they're not needed.
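
a toy version of that tile split in C with pthreads; the tile size and the 'sum the pixels' workload are placeholders for whatever per-tile compression or analysis you'd actually run, and each worker thread stands in for one of the small cores:

```c
/* Toy version of the tile idea: split an "image" into tiles, let each
 * worker (one per imagined core) process its own tile, then let the
 * "mastermind" gather the per-tile results. Purely illustrative. */
#include <pthread.h>
#include <stdio.h>

#define TILES     8
#define TILE_SIZE 1024

static int  tile_data[TILES][TILE_SIZE];
static long tile_result[TILES];          /* what each worker hands back */

static void *work_on_tile(void *arg)
{
    int  t   = (int)(long)arg;
    long sum = 0;
    for (int i = 0; i < TILE_SIZE; i++)  /* stand-in for compression/analysis */
        sum += tile_data[t][i];
    tile_result[t] = sum;
    return NULL;
}

int main(void)
{
    pthread_t worker[TILES];

    for (int t = 0; t < TILES; t++)
        for (int i = 0; i < TILE_SIZE; i++)
            tile_data[t][i] = t + i;     /* fake pixel data */

    /* one worker per tile; on the imagined chip each tile would sit in
     * that core's local RAM */
    for (int t = 0; t < TILES; t++)
        pthread_create(&worker[t], NULL, work_on_tile, (void *)(long)t);

    /* the 'mastermind' just collects the results */
    long total = 0;
    for (int t = 0; t < TILES; t++) {
        pthread_join(worker[t], NULL);
        total += tile_result[t];
    }
    printf("combined result: %ld\n", total);
    return 0;
}
```

(compile with -pthread; the point is only the pattern: independent tiles, independent workers, one collector at the end.)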