Uncertainty in Computing

09.22.10

posted by: mcrocker

Computers and machines (well designed and debugged ones, anyway!) are generally much more reliable than humans at completing simple but repetitive tasks.  The cotton gin changed cotton production dramatically.  Calculators replaced slide rules as they became more powerful and cheaper.  Computers can be programmed to automate any task that can be interfaced to electronic circuits.

As the smallest transistors continue to shrink, the reliability of computer circuits has come under fire from the possibility of “high” defect and soft fault rates.  In integrated circuit manufacturing (making computer chips), the strategy for handling fabrication defects is simply to throw away any chip that has one or more defects on it.  That works just fine if the number of places a defect can occur is small and the defect rate is low.  The Intel 4004 microprocessor had between 2,000 and 3,000 transistors.  The newest micro(nano?)processors have 1 billion (or more) transistors.  Defect rates have to be really, really small (much better than 1 in 1 billion) for yields to be high enough to be worth it.  This defect problem has been around for many years now; programmable logic became a big deal in the early 1980s in part because of concerns about defects.

There is another issue, although it is mostly a concern for massively parallel supercomputers: soft faults.  With thousands of computer processing units running for weeks at a time, each executing billions of instructions per second (100,000 processors * 1 GHz * 60 sec * 60 min * 24 hours * 7 days = 6*10^19 instructions per week), you cannot rely on the entire parallel job running to completion without a single soft fault, even at a very low fault rate.  Some effort has to be taken to keep “snapshots” of the program execution so it can be restarted if it fails in the middle of running.
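
To put rough numbers on both problems, here is a small back-of-the-envelope sketch in Python.  The defect probability and soft fault probability in it are made-up placeholder values chosen just to show how the math scales, not measured figures.

    import math

    # Yield: if each of N transistors independently has defect probability p,
    # the chance that a chip has zero defects is (1 - p)^N, roughly exp(-p * N).
    transistors = 1_000_000_000
    defect_prob = 1e-10                    # hypothetical per-transistor defect rate
    chip_yield = math.exp(-defect_prob * transistors)
    print(f"Estimated yield: {chip_yield:.1%}")         # about 90%

    # Soft faults: the week-long supercomputer job from the paragraph above.
    instructions = 100_000 * 1e9 * 60 * 60 * 24 * 7     # = 6.048e19 instructions
    fault_prob = 1e-19                     # hypothetical per-instruction soft fault rate
    expected_faults = instructions * fault_prob
    clean_run_prob = math.exp(-expected_faults)         # Poisson chance of zero faults
    print(f"Expected soft faults per week: {expected_faults:.1f}")
    print(f"Chance the whole job runs fault-free: {clean_run_prob:.2%}")

Even with a fault rate of only 1 in 10^19 instructions (again, a made-up number), the job would average about six faults per week and would almost never finish cleanly without those “snapshots.”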

As computers become more powerful and are used to do more amazing things, the small weaknesses get compounded and become glaring.  Each transistor requires less power, but there are more of them packed into a smaller space.  A personal computer can now overheat with ease if its CPU fan fails, and the most powerful supercomputer in the world is also one of the machines most likely to fail if it is used at full capacity without preventative measures.


4 Responses to “Uncertainty in Computing”

  1. vgoss Says:

    Thanks for using lots of examples, easy to read, great flow. I’m sure that this will help some folks, including me! Can you suggest journal articles or reviews on the topic?

  2. tloughran Says:

    What’s got you thinking about uncertainty in computing, Mike? I’ve been thinking about it in the context of communicating particle physics to students. In particular we’re addressing why a particle’s mass distribution peak is so wide–why the energy of a given particle is so uncertain–especially for more massive particles which tend to have shorter lifetimes. We come up against the energy-time formulation of H’s uncertainty principle in this case. Does uncertainty in computing have quantum-level roots, or is it higher-order than that?

  3. acarr Says:

    I agree that computers are becoming more powerful and remarkably intricate. I would like to ask the question: why do we choose to design and build more powerful computers? Is it because we are becoming more dependent on computers? And, is this dependency due to “uncertainties” in other things?…

  4. mcrocker Says:

    Uncertainty in computing is something we talk about quite often with nanomagnet logic. When we run simulations, we often have to deal with the size of the energy barrier between states in the magnet. If the barrier is small, it is easier to switch the magnet intentionally, but it is also easier for an unwanted switch to occur due to random thermal fluctuations. Also, as the dimensions of the magnet get smaller, quantum effects can start to affect the state of the magnet.
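
    To make that energy barrier trade-off concrete, here is a quick estimate using the standard Néel-Arrhenius rule of thumb (not output from our simulators); the attempt frequency and barrier heights below are placeholder values of a typical order of magnitude:

        import math

        # Néel-Arrhenius estimate of thermally activated switching:
        # rate ~ f0 * exp(-E_barrier / (k_B * T))
        k_B = 1.380649e-23      # Boltzmann constant, J/K
        T = 300.0               # room temperature, K
        attempt_freq = 1e9      # attempt frequency f0 in Hz (placeholder)

        for n_kT in (20, 40):   # barrier heights in units of k_B*T (placeholders)
            barrier = n_kT * k_B * T
            rate = attempt_freq * math.exp(-barrier / (k_B * T))
            print(f"{n_kT} kT barrier: ~{rate:.2e} unwanted switches per second")

    A 40 kT barrier gives an unwanted switch roughly once every several years, while a 20 kT barrier gives a couple per second, which is why shrinking the barrier (or the magnet) makes the “uncertainty” so much worse.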

    Making computers more powerful is a natural extension of a trend that has lasted for decades. However, only recently have transistors and their operating currents become small enough that soft faults are noticeably more likely, and only recently have supercomputers performed so many operations that at least one fault is bound to happen.

    As for good articles on this, I do not know of any right now, but I will keep looking. Also, this sort of thing is often kept quiet since there are mitigation techniques, so the problem is not considered serious… yet. (The “soft fault” link in the article points to a related technical article.)
