I overlooked this one in my previous answer:
I'll agree, restarts/reboots can be a drag, but I'm willing to bet that most systems demanding continuous time stamping will be running duplicate/triplicate/multi-cate systems with non-conflicting, single-system maintenance schedules. Or, maybe not!
Hook, line and sinker, xairbus !
In the VMS world the relevant term is "Cluster".
There are identically named features in the *nix and Windows world which must not be
confused with VMS clustering. VMS clustering means full access to any resource
(hardware or software) on any node, anywhere in the world, that is a member of a given
VMS Cluster. Say you can access a remote system's HDD, DVD or tape drives just
as if they were sitting in the computer beneath your desk. Furthermore, you can
upgrade single Cluster members while the other nodes continue to service the
users without any sign of downtime.
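To make the "any resource on any node" idea a little more concrete, here is a minimal sketch from the DCL prompt. The command names (SHOW CLUSTER, MOUNT/CLUSTER, DIRECTORY) are real DCL; the device, label and logical names are just invented examples:

  $ SHOW CLUSTER                              ! list the member nodes and their status
  $ MOUNT/CLUSTER $1$DKA100: DATA DISK$DATA   ! mount a volume on every node at once
  $ DIRECTORY DISK$DATA:[PROJECTS]            ! the same path works on any member node

Once a volume is mounted /CLUSTER it is visible cluster-wide, no matter which node physically owns the controller.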
All of this is still a tad abstract, so here is a report from real life:
=======================================================
If cluster uptimes are going to be boasted about, here is our story.
We are running several major applications for the Regio Politie Amsterdam-Amstelland
(comparable to a Greater Amsterdam Police). Call room, Criminal Investigation,
Recognition Service etc. are really in demand 7*24!
April 13, 1997 (a Sunday morning), after a lot of planning and preparation to allow
for offline police services, there was a 'big bang' total network upgrade,
which interrupted the online services for several hours. We (the VMS group)
took the opportunity to go down and do various changes that are so much more
cumbersome to perform "rolling". At that time the cluster consisted of 2
Alpha 2100's and a VAXstation 4000-90, running VMS 6.2-1H3 (some supporting
programs were available only on VAX; they were to be replaced by different
ways to do them). In June 1997 a third 2100 was added. In March 1999 we did a
rolling upgrade from V6.2 to 7.1-1H2. A major change came in May 1999 when a
second location (7 km away) was activated. An FDDI ring was established.
A test system (Alpha 1200) was configured into the cluster, and moved to the
other (the "dark") site. Then two 2100's were removed, transported over, and added
again. The 1200 left again.
The hardest part was explaining to management that we didn't go down for the move.
September 2000 an ES40 was added. February 2001 VMS went from 7.1-1H2 to 7.2-1;
the VAXstation was no longer needed and left. In May the "intestines" of the
2100's were moved into 2100A's to satisfy the need for more PCI slots. December 2002/January
2003 saw the upgrade to VMS 7.3-1, to prepare for the big change: the 2100A's
were to be replaced and a SAN deployed. After adding 2 ES45's and an ES40,
the data was moved from the HSZ40-connected SCSI disks to the (HSG80-connected) SAN.
Over 850 concealed devices, moved at moments when a specific device was unused.
Only 5 of those had to be forced by deliberately breaking availability during
the SLA-specified "potential maintenance window", 0:30-1:00.
Thereafter the 2100A's and the old disks were removed. So now we are running a
cluster with uptime 2420 days, in which the oldest hardware (FDDI concentrators)
is only 1650 days old, and the oldest system's age is less than half that of the cluster
uptime. Daily peak concurrent usage is some 600+ interactive users, 50+ batch
jobs, 30+ network jobs, and 180+ detached processes ("other mode", mostly
call-room service processes and the server-end of the radio-connected
Mobile Data Terminals in the police cars). Even more interesting, because they
signify the need to not go down: weekly LOW usage is some 50+ interactive,
40+ batch, 20+ net, and 170+ detached. The total environment is far from
unchanging: about twice a month some or other application is upgraded (most
support rolling upgrades), and there are about 100-150 mutations in personnel,
and some 200-300 application (de-)authorizations per week.
All this is maintained by only 3 people: Frank Wagenaar (full time)
Anton van Ruitenbeek (40%) and myself (full time).
=======================================================
I think this shows what well-thought-out hardware and OS design is able to accomplish,
and that you are right that it takes more than a single computer to deliver permanent
service...
In this connection... the "disaster proof video" is a must see:
http://brightcove.vo.llnwd.net/e1/uds/pd/1160438707001/1160438707001_3492507031001_Disaster-Proof-MPEG2-3BZ7.mp4?pubId=4119874060001&videoId=4657542628001
Regards!
Michel