Uncertainty in Computing
posted by: mcrocker
Computers and machines (that are well designed and debugged!) are generally much more reliable than humans for completing many simple but repetitive tasks. The cotton gin changed cotton production dramatically. Calculators replaced slide rules as they became more powerful and cheaper. Computers can be programmed to automate any task that can be interfaced to electronic circuits.
As the smallest transistors continue to shrink, the reliability of computer circuits has come under fire from the possibility of “high” defect and soft fault rates. In integrated circuit manufacturing (making computer chips), the usual strategy for handling fabrication defects is simply to throw away any chip that has one or more defects on it. That works just fine if the number of places where a defect can occur is small and the defect rate is small. The Intel 4004 microprocessor had between 2,000 and 3,000 transistors. The newest micro(nano?)processors have 1 billion (or more) transistors. Defect rates have to be really, really small, much better than 1 in 1 billion, for yields to be high enough to be worth it. This has been a concern for many years now. Programmable logic became a big deal in the early 1980s, in part, because of concerns about defects.
There is another issue, although it is mostly an issue for massively parallel supercomputers. With thousands of processors running for weeks at a time, each one executing roughly a billion instructions per second (100,000 processors * 1 GHz * 60 sec * 60 min * 24 hours * 7 days = 6*10^19 instructions per week, assuming one instruction per clock cycle), you cannot rely on the entire parallel job running to completion without a single soft fault, even at a low fault rate. Some effort has to be made to keep “snapshots” of the program execution so that it can be restarted from the last snapshot if it fails in the middle of running.
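To put rough numbers on this, here is a minimal Python sketch. It assumes every transistor (or instruction) fails independently with the same probability, which is a simplification and not how real defects necessarily behave. The function names and the soft fault rate of 1e-20 per instruction are made up for illustration; only the transistor counts, the 1 in 1 billion defect rate, and the 100,000 processors at 1 GHz come from the discussion above.

```python
import math


def prob_no_events(trials: float, rate: float) -> float:
    """Probability that none of `trials` independent trials fails.

    Computes (1 - rate) ** trials via log1p so that very small
    rates do not get rounded away.
    """
    return math.exp(trials * math.log1p(-rate))


def instructions_per_week(processors: int, clock_hz: float) -> float:
    """Instructions executed in one week, assuming one per clock cycle."""
    seconds_per_week = 60 * 60 * 24 * 7
    return processors * clock_hz * seconds_per_week


if __name__ == "__main__":
    defect_rate = 1e-9  # 1 defect per billion transistors

    # Chip yield: fraction of chips with zero defects.
    print("~2,300 transistors:", prob_no_events(2_300, defect_rate))
    print("1 billion transistors:", prob_no_events(1e9, defect_rate))

    # Soft faults on a week-long parallel job.
    total = instructions_per_week(100_000, 1e9)
    print("instructions per week:", total)

    soft_fault_rate = 1e-20  # hypothetical, for illustration only
    print("chance of a fault-free week:", prob_no_events(total, soft_fault_rate))
```

Under these assumptions, a defect rate of exactly 1 in 1 billion gives a yield of only about 37% for a billion-transistor chip, while a chip the size of the 4004 is almost never affected. That is why the rate has to be much better than 1 in 1 billion.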
As computers become more powerful and are used to do more amazing things, the small weaknesses get compounded and become glaring. Each transistor requires less power, but there are more of them in a smaller space. Now a personal computer will overheat easily if its CPU fan fails, and the most powerful supercomputer in the world is also among the most likely to fail (if it is being used to full capacity without preventive measures).