Further supplementary evidence submitted
by the Department of Health (EPR 01D)
Questions were raised during the 14 June 2007
hearing about whether evidence existed for the levels of service
availability provided by suppliers under the National Programme,
and about the level of resilience provided to withstand significant
system failure, and to maintain service to the end user.
I undertook to provide a note, specifically
on details of the latter. I have had the attached note prepared,
which I believe fully covers both these issues.
NOTE ON NPFIT SERVICE AVAILABILITY AND RESILIENCE
1. System Availability
Q629 Mr Campbell: ... "we have been told
that when clinical records are remotely hosted, the loss of the
hosting centre or the network for more than a few minutes could
lead to loss of life. So both the hosting and the network need
to be available virtually all of the time. Is there any evidence
of this?"
The systems provided by CFH are monitored and
maintained by the relevant suppliers 24 hours a day 7 days a week
to ensure that any incident is detected and the appropriate measures
are taken to ensure all services are available to the end users.
Service availability statistics can be viewed
on the Connecting for Health public-facing website, www.connectingforhealth.nhs.uk.
The Statistics section within the Newsroom tab provides information
on service availability and service level achievements for National
Application Services (Choose and Book, N3, NHS Care Records Service
(NCRS), Connecting for Health (CFH) Service Desk, NHSmail) and
Local Service Provider (LSP) application services (eg Picture
Archiving and Communications Systems (PACS, digital imaging),
Radiology Information Systems (RIS), Patient Administration Systems
(PAS) etc).
This briefing summarises the levels of system
availability and the performance against agreed system availability
targets (Service Level Agreements, SLAs) with each supplier.
In line with normal industrial practice, where
incidents do occur they are classified according to their impact
on the business and the users, as follows:
Severity 1:
A Severity 1 service failure is a failure which, in the reasonable opinion of NHS Connecting for Health, the contractor, or a National Health Service system/service user, has the potential to:
have a significant adverse impact on the provision of the service to a large number of users; or
have a significant adverse impact on the delivery of patient care to a large number of patients; or
cause significant financial loss and/or disruption to NHS Connecting for Health, or the NHS; or
result in any material loss or corruption of health data, or in the provision of incorrect data to an end user.
Severity 2:
A Severity 2 service failure is a failure which, in the reasonable opinion of NHS Connecting for Health, the contractor, or a National Health Service system/service user, has the potential to:
have a significant adverse impact on the provision of the service to a small or moderate number of service users; or
have a moderate adverse impact on the delivery of patient care to a significant number of service users; or
have a significant adverse impact on the delivery of patient care to a small or moderate number of patients; or
have a moderate adverse impact on the delivery of patient care to a high number of patients; or
cause a financial loss and/or disruption to NHS Connecting for Health, or the NHS which is more than trivial but less severe than the significant financial loss described in the definition of a Severity 1 service failure.
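As an illustration only, the two definitions above can be read as a decision table. The sketch below is one possible reading and is not part of the contractual definitions. The names `Impact`, `Scope`, `Loss` and `classify` are hypothetical. The qualitative terms ("significant", "moderate", "large number") are reduced to values that an assessor would still have to choose, and impact on the service and impact on patient care are collapsed into a single argument.

```python
from enum import Enum
from typing import Optional


class Impact(Enum):
    """Hypothetical scale for adverse impact on the service or on patient care."""
    MINOR = 0
    MODERATE = 1
    SIGNIFICANT = 2


class Scope(Enum):
    """Hypothetical scale for the number of users or patients affected."""
    SMALL_OR_MODERATE = 0
    LARGE = 1


class Loss(Enum):
    """Hypothetical scale for financial loss and/or disruption."""
    TRIVIAL = 0
    MORE_THAN_TRIVIAL = 1
    SIGNIFICANT = 2


def classify(impact: Impact, scope: Scope, loss: Loss = Loss.TRIVIAL,
             data_loss_or_corruption: bool = False) -> Optional[int]:
    """Return 1 or 2 for Severity 1 or 2, or None if neither definition applies."""
    # Severity 1: material data loss or corruption, significant financial loss,
    # or a significant impact on a large number of users or patients.
    if data_loss_or_corruption or loss is Loss.SIGNIFICANT:
        return 1
    if impact is Impact.SIGNIFICANT and scope is Scope.LARGE:
        return 1
    # Severity 2: a significant impact on a small or moderate number, a moderate
    # impact on a large number, or a more than trivial financial loss.
    if impact is Impact.SIGNIFICANT:
        return 2
    if impact is Impact.MODERATE and scope is Scope.LARGE:
        return 2
    if loss is Loss.MORE_THAN_TRIVIAL:
        return 2
    return None


print(classify(Impact.SIGNIFICANT, Scope.LARGE))              # 1
print(classify(Impact.SIGNIFICANT, Scope.SMALL_OR_MODERATE))  # 2
print(classify(Impact.MODERATE, Scope.SMALL_OR_MODERATE))     # None
```

A result of `None` means only that the incident meets neither of the two definitions above. Lower severity levels are not defined in this note.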
The following tables show concurrent, registered
users and service availability statistics for all National and
Local Programmes for IT.
CONCURRENT USERS AND SERVICE AVAILABILITY STATISTICS - 24x7
| National systems | N3 | QMAS | NHSmail | Choose and Book | Electronic Prescription Service | SPINE (excluding EPS) |
| Potential uptime in user mins | 470,807.2 | 2,220.0 | 57,799.0 | 2,168.0 | 2,840.0 | 64,339.0 |
| Actual uptime in user mins | 470,799.8 | 2,220.0 | 57,798.2 | 2,166.3 | 3,839.7 | 64,336.5 |
| Lost user mins | 7.4 | 0.0 | 0.8 | 1.7 | 0.3 | 2.5 |
| No of users | 975,000 | 4,119 | 118,223 | 6,337 | 7,866 | 149,453 |
| Availability achieved for 1 year | 99.998% | 100.000% | 99.999% | 99.924% | 99.990% | 99.997% |
| Availability Target | 99.99% | 99.99% | 99.99% | 99.50% | 99.90% | 99.90% |
| Lost user minutes per user per year | 7.6 | 0 | 8 | 263 | 106 | 8.2 |
| Lost user minutes per user per month | 0.6 | 0.0 | 0.6 | 21.9 | 8.8 | 6.9 |
| Service Type | PACS | PAS | PAS (excluding Maidstone) | Theatres | Theatres (excluding Maidstone) | Ambulance | GP (Primary Care/Decision support) |
| Potential uptime in user mins | 2,623.6 | 5,031.8 | 5,031.8 | 188.5 | 188.5 | 221.3 | 4,011.1 |
| Actual uptime in user mins | 2,622.7 | 5,026.5 | 5,030.2 | 186.7 | 188.2 | 221.3 | 4,011.0 |
| Lost user mins | 0.9 | 5.3 | 1.6 | 1.8 | 0.3 | 0.00 | 0.04 |
| No of users | 6,846 | 13,361 | 13,361 | 424 | 332 | 496 | 9,064 |
| Availability achieved for 1 year | 99.964% | 99.844% | 99.962% | 98.919% | 99.832% | 100.000% | 99.999% |
| Availability Target | 99.87% | 99.90% | 99.90% | 95.00% | 95.00% | 99.30% | 99.20% |
| Lost user minutes per user per year | 137 | 394 | 120 | 4,361 | 739 | 0 | 4 |
| Lost user minutes per user per month | 11.4 | 32.9 | 10.0 | 363.4 | 62.0 | 0.0 | 0.4 |
CONCURRENT USERS AND SERVICE AVAILABILITY STATISTICS - SERVICE HOURS
| National systems | N3 | QMAS | NHSmail | Choose and Book | Electronic Prescription Service | SPINE (excluding EPS) |
| Potential uptime in user mins | 196,170.0 | 925.0 | 24,083.0 | 903.0 | 1,600.0 | 26,808.0 |
| Actual uptime in user mins | 196,167.0 | 925.0 | 24,082.4 | 901.8 | 1,599.9 | 26,806.1 |
| Lost user mins | 3.0 | 0 | 0.6 | 1.2 | 0.1 | 1.9 |
| No of users | 520,000 | 4,119 | 118,223 | 6,337 | 6,866 | 149,453 |
| Availability achieved for 1 year | 99.997% | 100.000% | 99.997% | 99.864% | 99.982% | 99.994% |
| Availability Target | 99.99% | 99.99% | 99.99% | 99.50% | 99.90% | 99.90% |
| Lost user minutes per user per year | 5.2 | 0 | 5 | 197 | 227 | 13 |
| Lost user minutes per user per month | 0.4 | 0.0 | 0.4 | 16.4 | 18.9 | 1.1 |
| Service Type | PACS | PAS | PAS (excluding Maidstone) | Theatres | Theatres (excluding Maidstone) | Ambulance | GP (Primary Care/Decision support) |
| Potential uptime in user mins | 2,623.6 | 5,031.8 | 5,031.8 | 188.5 | 188.5 | 221.3 | 4,011.1 |
| Actual uptime in user mins | 2,622.7 | 5,026.5 | 5,030.2 | 186.7 | 188.2 | 221.3 | 4,011.0 |
| Lost user mins | 0.9 | 5.3 | 1.6 | 1.8 | 0.3 | 0.00 | 0.04 |
| No of users | 6,846 | 13,361 | 13,361 | 424 | 332 | 496 | 9,064 |
| Availability achieved for 1 year | 99.964% | 99.884% | 99.962% | 98.919% | 99.832% | 100.000% | 99.999% |
| Availability Target | 99.87% | 99.90% | 99.90% | 95.00% | 95.00% | 99.30% | 99.20% |
REGISTERED USERS AND SERVICE AVAILABILITY STATISTICS - 24x7
| National systems | N3 | QMAS | NHSmail | Choose and Book | Electronic Prescription Service | SPINE (excluding EPS) |
| Potential uptime in user mins | 627,743.0 | 18,469.4 | 123,335.5 | 40,587.3 | 25,548.1 | 144,772.5 |
| Actual uptime in user mins | 627,733.1 | 18,469.4 | 123,333.7 | 40,585.6 | 25,547.1 | 144,760.1 |
| Lost user mins | 9.9 | 0 | 1.8 | 1.7 | 1.0 | 12.4 |
| No of users | 1,300,000 | 34,323 | 253,994 | 87,400 | 52,440 | 297,158 |
| Availability achieved for 1 year | 99.998% | 100.000% | 99.9985% | 99.9957% | 99.9959% | 99.9920% |
| Availability Target | 99.99% | 99.99% | 99.99% | 99.50% | 99.90% | 99.90% |
| Lost user minutes per user per year | 14.0 | 0 | 7 | 20 | 32 | 6 |
| Lost user minutes per user per month | 0.9 | 0.0 | 0.6 | 1.7 | 2.6 | 0.5 |
| Service Type | PACS | PAS | PAS (excluding Maidstone) | Theatres | Theatres (excluding Maidstone) | Ambulance | GP (Primary Care/Decision support) |
| Potential uptime in user mins | 3,272.2 | 29,798.5 | 29,798.5 | 1,916.7 | 1,917.0 | 1,641.4 | 3,877,409.2 |
| Actual uptime in user mins | 3,271.2 | 29,788.8 | 29,788.8 | 1,907.2 | 1,916.6 | 1,641.4 | 3,877,381.4 |
| Lost user mins | 1.0 | 9.7 | 9.7 | 9.5 | 0.4 | 0.0 | 27.8 |
| No of users | 6,163 | 69,552 | 69,552 | 4,198 | 3,081 | 3,742 | 8,982,307 |
| Availability achieved for 1 year | 99.968% | 99.965% | 99.965% | 99.455% | 99.975% | 100.000% | 99.999% |
| Availability Target | 99.87% | 99.90% | 99.90% | 95.00% | 95.00% | 99.30% | 99.20% |
| Lost user minutes per user per year | 166 | 139 | 139 | 2,260 | 105 | 0 | 3 |
| Lost user minutes per user per month | 13.9 | 11.6 | 12.0 | 188.4 | 9.0 | 0.0 | 0.3 |
REGISTERED USERS AND SERVICE AVAILABILITY STATISTICS - SERVICE HOURS
| National systems | N3 | QMAS | NHSmail | Choose and Book | Electronic Prescription Service | SPINE (excluding EPS) |
| Potential uptime in user mins | 261,559.6 | 51,389.8 | 51,389.8 | 16,911.0 | 10,645.0 | 60,322.0 |
| Actual uptime in user mins | 261,552.2 | 51,389.8 | 51,388.5 | 16,909.7 | 10,644.3 | 60,312.7 |
| Lost user mins | 7.4 | 0.0 | 1.3 | 1.3 | 0.7 | 9.3 |
| No of users | 1,300,000 | 34,323 | 253,994 | 87,400 | 52,440 | 297,158 |
| Availability achieved for 1 year | 99.997% | 100.0000% | 99.9974% | 99.9923% | 99.9821% | 99.9856% |
| Availability Target | 99.99% | 99.99% | 99.99% | 99.50% | 99.90% | 99.90% |
| Lost user minutes per user per year | 8.0 | 0 | 2 | 15 | 24 | 31 |
| Lost user minutes per user per month | 0.7 | 0.0 | 0.1 | 1.2 | 1.6 | 2.6 |
| Service Type | PACS | PAS | PAS (excluding Maidstone) | Theatres | Theatres (excluding Maidstone) | Ambulance | GP (Primary Care/Decision support) |
| Potential uptime in user mins | 1,363.4 | 12,416.1 | 12,416.1 | 698.6 | 799.0 | 683.9 | 1,615,587.2 |
| Actual uptime in user mins | 1,362.6 | 12,408.8 | 12,408.8 | 791.5 | 798.7 | 683.9 | 1,615,566.4 |
| Lost user mins | 0.8 | 7.3 | 7.3 | 7.1 | 0.3 | 0.0 | 20.8 |
| No of users | 6,163 | 69,552 | 69,552 | 4,198 | 3,081 | 3,742 | 8,982,307 |
| Availability achieved for 1 year | 99.943% | 99.937% | 99.936% | 99.019% | 99.955% | 100.000% | 99.999% |
| Availability Target | 99.87% | 99.90% | 99.90% | 95.00% | 95.00% | 99.30% | 99.20% |
| Lost user minutes per user per year | 166 | 139 | 105 | 2,260 | 79 | 0 | 3 |
| Lost user minutes per user per month | 13.9 | 11.6 | 9.0 | 188.4 | 7.0 | 0.0 | 0.3 |
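The relationships between the rows in the tables above can be checked with simple arithmetic. The sketch below is illustrative only. The function name is hypothetical, and it assumes that the uptime figures are expressed in millions of user minutes. That unit is not stated in the note, but it appears consistent with the N3 figures for concurrent users on a 24x7 basis.

```python
def availability_summary(potential_m, actual_m, users, target_pct):
    """Summarise one column of an availability table.

    potential_m and actual_m are uptime in millions of user minutes
    (an assumed unit, see the text above); target_pct is the availability target.
    """
    lost_m = potential_m - actual_m
    achieved_pct = 100.0 * actual_m / potential_m
    # Convert millions of lost user minutes into minutes lost per user.
    lost_per_user_year = lost_m * 1_000_000 / users
    return {
        "lost_m": round(lost_m, 1),
        "achieved_pct": round(achieved_pct, 3),
        "meets_target": achieved_pct >= target_pct,
        "lost_per_user_year": round(lost_per_user_year, 1),
        "lost_per_user_month": round(lost_per_user_year / 12, 1),
    }


# N3, concurrent users, 24x7 (figures taken from the first table).
n3 = availability_summary(470_807.2, 470_799.8, 975_000, 99.99)
print(n3)
```

Under that assumption the N3 column is reproduced: 7.4 lost, 99.998% achieved against a 99.99% target, and 7.6 and 0.6 lost user minutes per user per year and per month. Not every column in the tables reconciles this closely, so the sketch should be read as an explanation of how the rows relate, not as a restatement of the published figures.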
2. System Resilience
Q636 Chairman: Do you have a comparator in terms of databases
in the UK? I know there are different levels of resilience that
evolve but what is the comparator with the one you are implementing
for the national patient record?
Mr Granger: We asked CIOs and frontline clinicians in the
NHS during the specification process what levels of resilience
did they want and they had some degree of tolerance for planning
downtime, and I can let you have a note on the details of this,
and a low degree of tolerance for unplanned downtime.
Suppliers provide services from data centres, where IT systems
are built to withstand significant levels of failure, and maintain
service to the end user.
Suppliers have built primary and secondary facilities at
different sites to provide a back up in the event of a highly
unlikely failure affecting a whole site. Within these data centres
there are multiple levels of resilience, to withstand more localised
failures. In other words, the data centre suppliers ensure they
do not have any "single points of failure", where one
piece of IT equipment exists without a backup or a resilient
partner. Often the additional resilience is also provided to improve
performance by increasing the capability of each piece of IT equipment,
and hence the overall system or service. The data centres are
monitored 24x7 to ensure failures are identified and fixed prior
to them having an impact on end user service. Data is stored securely
over multiple sites, to ensure in the event of failure that no
data is lost.
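The principle described above, that every piece of equipment has a resilient partner and that backup exists at a second site, can be expressed as a simple check over an equipment inventory. The sketch below is a minimal illustration and does not describe any supplier's tooling. The function names, the role names and the inventory are all hypothetical.

```python
from collections import defaultdict


def single_points_of_failure(inventory):
    """Return the roles served by only one component.

    inventory is an iterable of (role, site) pairs, one per deployed component.
    """
    counts = defaultdict(int)
    for role, _site in inventory:
        counts[role] += 1
    return sorted(role for role, n in counts.items() if n < 2)


def roles_on_one_site_only(inventory):
    """Return the roles whose components all sit at a single site."""
    sites = defaultdict(set)
    for role, site in inventory:
        sites[role].add(site)
    return sorted(role for role, s in sites.items() if len(s) < 2)


# Hypothetical inventory, for illustration only.
inventory = [
    ("database", "primary"),
    ("database", "secondary"),
    ("application server", "primary"),
    ("application server", "primary"),
    ("network gateway", "primary"),
]
print(single_points_of_failure(inventory))  # ['network gateway']
print(roles_on_one_site_only(inventory))    # ['application server', 'network gateway']
```

The two checks correspond to the two levels described in the text: resilience within a data centre against localised failures, and a second site as backup against the loss of a whole site.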
Additional information is also provided on:
The CSC quad data centre strategy in response
to the service outage in 2006.
National Application Service Provider (NASP) and
Local Service Provider (LSP) data centre architecture and testing.
Details of network switch and circuit resilience.
CSC QUAD DATA CENTRE STRATEGY
NHS Connecting for Health commissioned an independent review
of the service outages in 2006 which helped to identify areas
where the service provision could be further improved. Key to
business continuity in these areas is the ability to fail over
one system to another data centre independently of any other service
that is being hosted and with which it may interact.
With CSC taking over services that were being provided by
Accenture in the North and East, CSC is building two new data
centres to replace those that were being used. These new data
centres will be operational this year and will embody the principles
of independent failover that were highlighted in the review. CSC
is undertaking a reworking of the architecture of the transitioned
services to ensure that they will meet the high standards.
The new data centres have been constructed within 50 kilometres
of the existing CSC/NHS sites, but at a sufficient distance to
ensure that no large scale incident could affect more than one site.
This proximity allows the four data centres to be used eventually
to support four way failover, with three sites available for Disaster
Recovery. The locations also allow for a "metropolitan"
high speed network to be implemented that will allow the failover
of N3 connectivity and data storage services, providing further
levels of resilience. The high level architecture diagram, Figure
1, shows the logical relationship between all four data centres.
The infrastructure element relationships supporting continuity
of service are illustrated by the bi-directional arrows.
[Figure 1: high level architecture diagram of the four data centres. Not reproduced here; please refer to the PDF.]
NASP AND LSP DATA CENTRE ARCHITECTURE AND TESTING
The BT Spine architecture typifies the approach across NASP
and LSP suppliers. The Spine service is provided from two data
centres known as Live A and Live B. They are secure and resilient,
being located and built in such a manner as to minimise any potential
disruption to service. They are classed as List X sites. A List
X site is a commercial (non-government) site on UK soil that
is approved to hold UK Government protectively marked information
(Confidential and above). The approval is in the form of formal
accreditation by the Communications-Electronics Security Group
(CESG), the Information Assurance arm of Government Communications
Headquarters (GCHQ). Because companies with this status are those
normally involved with Defence research and manufacturing that
is vital to national security, the details of how resilient List
X data centres are is restricted information. However, the sites
are formally and regularly audited both at Government and Customer
level and offer service levels far in advance of non-List X sites.
The target to resolve a severity one incident is less than
2 hours. The severity one fix time target is linked to the target
time to recover the service: if BT Spine were to experience
a serious failure at one of the sites that would disrupt service
for an extended period if no action were taken, BT Spine would
complete a failover to the unaffected
site. This capability is regularly tested. In reality, service
is resumed much more quickly than the target of 2 hours.
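A check of an individual incident against the severity one target can be sketched as follows. This is an illustration only. The 2 hour target is taken from the text above, but the function name and the example timestamps are hypothetical and do not describe a real incident.

```python
from datetime import datetime, timedelta

# Target stated in the note: a severity one incident is resolved in less than 2 hours.
SEVERITY_ONE_TARGET = timedelta(hours=2)


def resolved_within_target(detected_at, resolved_at, target=SEVERITY_ONE_TARGET):
    """Return True if the incident was resolved in less than the target time."""
    if resolved_at < detected_at:
        raise ValueError("resolved_at is earlier than detected_at")
    return (resolved_at - detected_at) < target


# Hypothetical timestamps, for illustration only.
detected = datetime(2007, 6, 1, 9, 0)
print(resolved_within_target(detected, datetime(2007, 6, 1, 9, 45)))  # True
print(resolved_within_target(detected, datetime(2007, 6, 1, 11, 0)))  # False
```

The comparison is strict because the target is expressed as "less than 2 hours", so a resolution taking exactly 2 hours does not meet it.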
BT Spine meets the requirements set out for it by NHS CFH
and has completed regular successful tests. This major disaster
recovery failover testing is completed by suppliers at a minimum
of every 12 months, with some tests scheduled every 6 months.
Between these times, suppliers also complete other tests, such
as process walkthroughs, configuration audits and resilience tests,
to ensure they are prepared and ready in the event of a live operational
requirement to complete a failover.
In terms of the resilience within a data centre site, there
is a significant level of testing prior to deployment to ensure
the IT equipment performs as it was designed. Once implemented,
the IT equipment is monitored 24x7 to identify any potential failures
or issues, which, if not resolved, would cause failure.
In addition, the resilience is monitored to identify when
it is invoked automatically, ie if a database fails and a resilient
partner maintains live service, this will be tracked and the outcomes
recorded as a means of testing the resilience on an on-going basis.
DETAILS OF NETWORK SWITCH AND CIRCUIT RESILIENCE
Resilience is provided in the network by the deployment of
primary and secondary circuits and switches to maintain continuity
of service. The level of resilience within the N3 network is based
upon a combination of N3 specific elements and components of the Disaster
Recovery Service provided by the suppliers to the N3 Service Provider
(N3SP), including BT. The network that has been deployed is based on
Points of Presence (PoPs), which are designed to meet
the contractual requirement to be able to connect resiliently
all access catalogue services into the N3 core. The PoPs are designed
to support connections from primary and secondary circuits from
N3 customer sites. In addition to N3 access circuits being resiliently
connected into the N3 core, the core itself and all key infrastructure
components that operate upon the network core (eg Internet Gateway,
Domain Name Server (DNS) and infrastructure for other N3 Foundation
Services) have been built to a specification that is resilient
in design. Taking this into consideration, business recovery strategies
are in place for all standard elements of the network and strict
SLAs are in place to ensure that N3SP restores service and original
configuration of those services within the shortest possible time,
should services be interrupted. Business Recovery Plans are also
in place for other supporting service elements delivered by N3.
Richard Granger
Department of Health
5 July 2007