Memory Management: Error Checking
Data are stored in memory as 1's and 0's because digital electronics are binary by nature. As in any system, however, errors can occur. Some errors are caused by faulty hardware, such as a loose connection or a broken solder joint. Because the fault persists, it produces what is called a repeatable or hard error. The second kind of error is called a transient or soft error. This occurs when a bit reads back the wrong value once but subsequently functions correctly. Soft errors are, understandably, much more difficult to diagnose. They are also, unfortunately, more common.
There are ways to find and correct these errors. Some methods can only detect an error; others can correct it as well.
PARITY
When parity is in use on a computer system, one parity
bit is stored in DRAM along with every 8 bits (1 byte) of data. Parity simply
counts the number of 1's in a given byte. In each parity protocol, a parity
bit is generated so that the total number of 1's, data bits plus parity bit,
is odd or even as the protocol requires. The two types of parity protocol,
odd parity and even parity, function in similar ways.
This table shows the parity bit generated under even parity and odd parity:

| Data byte contains | Even parity | Odd parity |
|--------------------|-------------|------------|
| Even number of 1's | 0           | 1          |
| Odd number of 1's  | 1           | 0          |
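The parity bit is chosen so that the total number of 1's comes out right for the protocol in use. For odd parity, it ensures that the number of 1's in the 8 data bits plus the parity bit is always odd. For even parity, it ensures that the total is always even. When the byte is read back, the parity circuit recounts the 1's and signals an error if the total no longer matches.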
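The rule in the table can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how a memory controller implements it in hardware; the function name `parity_bit` and its `odd` parameter are made up for this example.

```python
def parity_bit(byte: int, odd: bool = False) -> int:
    """Return the parity bit to store alongside an 8-bit value.

    Hypothetical helper for illustration: even parity by default,
    odd parity when odd=True.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    ones = bin(byte).count("1")   # number of 1's in the data byte
    even_bit = ones % 2           # 1 only when the data holds an odd number of 1's
    return even_bit ^ 1 if odd else even_bit


data = 0b10110010                  # four 1's, an even count
print(parity_bit(data))            # even parity -> 0
print(parity_bit(data, odd=True))  # odd parity  -> 1
```

With an even number of 1's in the data, the even-parity bit is 0 and the odd-parity bit is 1, matching the first row of the table.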
Parity does
have its limitations. For example, parity can detect errors but cannot make
corrections. This is because the parity circuit can't determine which of
the 8 data bits is invalid.
Furthermore, if an even number of bits are invalid, the parity
circuit will not detect the problem, because the data still matches the odd
or even parity condition that the parity circuit is checking for. For example,
if a valid 0 becomes an invalid 1 and a valid 1 becomes an invalid 0, the two
defective bits cancel each other out and the parity circuit misses the
resulting errors. Fortunately, the chances of this happening are extremely remote.
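The blind spot described above can be shown with a short Python sketch. It assumes even parity, and the helper names `ones` and `check_even_parity` are invented for this example.

```python
def ones(value: int) -> int:
    """Count the 1 bits in an integer."""
    return bin(value).count("1")


def check_even_parity(byte: int, stored_parity: int) -> bool:
    """True when the data bits plus the parity bit hold an even number of 1's."""
    return (ones(byte) + stored_parity) % 2 == 0


original = 0b10110010
stored_parity = ones(original) % 2   # even-parity bit written along with the data

single_flip = original ^ 0b00000001  # one bit corrupted
double_flip = original ^ 0b00000011  # lowest bit goes 0 -> 1, next bit goes 1 -> 0

print(check_even_parity(original, stored_parity))     # True: data is intact
print(check_even_parity(single_flip, stored_parity))  # False: error detected
print(check_even_parity(double_flip, stored_parity))  # True: error goes unnoticed
```

The double flip leaves the count of 1's unchanged, so the check passes even though the data is wrong.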
ECC
An advanced error detection and correction protocol was
invented to go a step beyond simple parity checking. Called ECC, which
stands for error correcting circuits, error correcting code,
or error correction code, this protocol not only detects both single-bit
and multi-bit errors, it will actually correct single-bit errors on
the fly, transparently. It is the data integrity checking method used primarily
in high-end PCs and file servers. The important difference between ECC and
parity is that ECC is capable of detecting and correcting 1-bit errors. With
ECC, 1-bit error correction usually takes place without the user even knowing
an error has occurred. Depending on the type of memory controller the computer
uses, ECC can also detect rare 2-, 3-, or 4-bit memory errors. While ECC can
detect these multiple-bit errors, it cannot correct them. However, there are
some more complex forms of ECC that can correct multiple bit errors.
Using a special mathematical algorithm and
working in conjunction with the memory controller, the ECC circuit appends
ECC bits to the data bits, and the two are stored together in memory. When the CPU
requests data from memory, the memory controller decodes the ECC bits and
determines whether one or more of the data bits are corrupted. If there's a single-bit
error, the ECC circuit corrects the bit. In the rare case of a multiple-bit
error, the ECC circuit reports a parity error.
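The text does not say which algorithm a given ECC circuit uses. As a rough illustration of the idea, the sketch below uses a classic textbook scheme, a Hamming(7,4) code extended with one overall parity bit, which corrects any single-bit error and detects any double-bit error. It protects only 4 data bits to keep the example short; the function names `encode` and `decode` and the status strings are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: Hamming(7,4) plus an overall parity bit."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                        # index 0 = overall parity, 1..7 = Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # check bit covering positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # check bit covering positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # check bit covering positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total number of 1's even
    return code


def decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (nibble, status); nibble is None when the error cannot be corrected."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                # XOR of the positions that hold a 1
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flipped bits: treat it as a single-bit error.
        # The syndrome names the bad position (0 = the overall parity bit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is non-zero: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
one_bad = list(word)
one_bad[5] ^= 1                            # single-bit error
two_bad = list(word)
two_bad[5] ^= 1
two_bad[6] ^= 1                            # double-bit error

print(decode(word))     # (11, 'ok')
print(decode(one_bad))  # (11, 'corrected')
print(decode(two_bad))  # (None, 'uncorrectable')
```

The single-bit error is repaired silently, while the double-bit error is only reported, which mirrors the behavior described above.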