The following is a magazine excerpt about the "crash" of the AT&T
Long Distance Network in 1990, when faulty software was installed on
the Number 4 ESS (Electronic Switching System) toll tandems throughout
the network. The glitch disabled many switches across the network
until the problem was traced to the new software and a previous
version was reinstalled.
The AT&T Crash from "The Risks Digest"
The following is an
excerpt from The Risks Digest - Volume 9, Issue 62 - February
26, 1990.
Cause of AT&T network failure
"Peter G. Neumann"
Fri, 26 Jan 90 14:24:30 PST
From Telephony, Jan 22, 1990 p11:
"The fault was in the code" of the new software that AT&T loaded into
front-end processors of all 114 of its 4ESS switching systems in
mid-December, said Larry Seese, AT&T's director of technology
development. In detail:
The problem began on the afternoon of Jan 15 when a piece of trunk
interface equipment developed internal problems for reasons that have
yet to be determined. The equipment told the 4ESS switch in New York
that it was having problems and couldn't correct the fault. "The
recovery code is written so that the processor will run corrective
initialization on the equipment. That takes four to six seconds. At
the same time, new calls are stopped from coming into the switch,"
Seese said.
The New York switch sent a message to all the other 4ESS switches it
was linked with, saying that it was not accepting additional traffic.
Seese referred to that message as a "congestion signal." After it
successfully completed the reinitialization, the New York switch went
back in service and began processing calls. That is when the fault in
the new software reared its ugly head. Under the previous system,
switch A would send out a message that it was working again, and
switch B would double-check that switch A was back in service. With
the new software, switch A begins processing calls and sends out call
routing signals. The reappearance of traffic from switch A is supposed
to tell switch B that A is working again.
"We made an improvement in the way we react to those messages so we
can react more quickly. The first common channel signaling system 7
initial address message (caused by a call attempt) that switch B
receives from switch A alerts B that A is back in service. Switch B
then resets its internal logic to indicate that A is back in service,"
said Seese.
The problem occurred when switch B got a second call-attempt message
from A while it was in the process of resetting its internal logic.
"[The message] confused the software. it tried to execute an
instruction that didn't make any sense. The software told switch B `My
CCS7 processor is insane'", so switch B shut itself down to avoid
spreading the problem, Seese explained.
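
The sequence Seese describes, in which switch B marks A back in
service on the first incoming call-attempt message and is then tripped
up by a second message that arrives while its internal logic is still
resetting, is essentially a race in a per-link state machine. The
sketch below is purely illustrative, not AT&T's actual 4ESS code; the
names (peer_state, handle_iam, PEER_RESETTING and so on) are invented
for this note. Under those assumptions, it shows how a message landing
in an unexpected intermediate state can push a switch to take itself
out of service, and why messages spaced farther apart would not have
triggered the failure.

    /* Illustrative sketch only; not AT&T's 4ESS code.  The names here
     * (peer_state, handle_iam, PEER_RESETTING, etc.) are invented.   */
    #include <stdio.h>
    #include <stdlib.h>

    enum peer_state { PEER_DOWN, PEER_RESETTING, PEER_UP };

    /* What switch B currently believes about switch A. */
    static enum peer_state peer = PEER_DOWN;

    /* Runs once B finishes resetting its internal records for A. */
    static void complete_reset(void)
    {
        peer = PEER_UP;
    }

    /* Called for each CCS7 initial address message (call attempt)
     * arriving from A.                                               */
    static void handle_iam(int msg_no)
    {
        switch (peer) {
        case PEER_DOWN:
            /* First IAM after A's outage doubles as "A is back in
             * service"; B starts resetting its internal logic.       */
            peer = PEER_RESETTING;
            printf("msg %d: first IAM, marking A in service\n", msg_no);
            break;
        case PEER_UP:
            printf("msg %d: normal call attempt\n", msg_no);
            break;
        case PEER_RESETTING:
            /* A message in a state the logic never expected: the real
             * switch declared its CCS7 processor "insane" and took
             * itself out of service to avoid spreading the problem.  */
            printf("msg %d: IAM during reset, going out of service\n",
                   msg_no);
            exit(1);
        }
    }

    int main(void)
    {
        /* The race Seese describes: the second call attempt arrives
         * before B has finished resetting.                           */
        handle_iam(1);     /* A comes back; B begins its reset        */
        handle_iam(2);     /* lands mid-reset; B shuts itself down    */

        /* With the messages farther apart, the reset would have ended
         * first and the next IAM would be an ordinary call:          */
        complete_reset();  /* not reached in this run                 */
        handle_iam(3);
        return 0;
    }

Run as written, the second simulated message arrives before the reset
completes and the program exits, mirroring the timing condition
described in the article; move the reset between the two messages and
every call attempt is handled normally.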
Unfortunately, switch B then sent a message to other switches that it
was out of service and wasn't accepting additional traffic. Once
switch B reset itself and began operating again, it sent out call
processing messages via the CCS7 link. That caused identical failures
around the nation as other 4ESS switches got second messages from
switch B while they were in the process of resetting their internal
logic to indicate switch B was working again.
"It was a chain reaction. Any switch that was connected to B was put
into the same condition."
"The event just repeated itself in every [4ESS] switch over and over
again. If the switches hadn't gotten a second message while resetting,
there would have been no problem. If the messages had been received
farther apart, it would not have triggered the problem."
AT&T solved the problem by reducing the messaging load of the CCS7
network. That allowed the switches to reset themselves and the network
to stabilize.