ComDoc 14

From HTMcommunityDB.org


An Introduction to Reliability-centered Maintenance (RCM): The Modern Approach to Planned Maintenance

(This document was last revised on 4-9-15)

14.1 Evolution of the RCM methodology

RCM was created and first developed during the decade of the 1960s in the civil aviation industry. It quickly revolutionized the way maintenance is performed on all kinds of aircraft and it has since been adopted in virtually every other segment of industry or technical area where the reliable and safe performance of any kind of complex device is important - except for healthcare!

The traditional approach to scheduled (planned) maintenance in civil aviation was based on the idea that every component of a piece of complex equipment has a “right age” at which it must be completely refurbished or replaced to ensure that the complete device continues to operate as intended. In the locomotive and motor industries maintenance was somewhat less sophisticated than it was in aviation and during the early years was focused primarily on the need to lubricate the vehicle’s moving parts and keep the mechanics (and later the hydraulics) of the vehicle’s braking and steering systems in good condition. Reliability was not such a major concern as it was in the aviation industry, primarily because the consequences of auto and locomotive failures were not as potentially catastrophic as were aircraft failures!

In the aviation business these periodic refurbishments were called scheduled overhauls. However, over time it became apparent that a significant percentage of equipment failures are simply not preventable by any kind of “preventive” maintenance, no matter how diligently the PMs or overhauls are performed. In response to this recognition, airplane designers had been increasingly turning to failure mitigation strategies such as incorporating complete redundancy for critical components such as engines, and overdesigning mechanical structures such as wings to make them somewhat tolerant of greater-than-anticipated levels of accidental damage.

By the late 1950s the size of the commercial airline fleet had grown to the point where the cost of the industry’s maintenance programs had become sufficiently high that the Federal Aviation Administration (FAA) collaborated with staff at United Airlines who had formed a special task force to take a hard look at the traditional maintenance practices to see how they might be made more effective and more efficient. A significant catalyst to this concern was the impact that the projected cost of maintaining the next generation of “jumbo” jets - which were already on the drawing board – might have on the economic viability of the airlines’ new business plans.

An important conclusion from the work of the FAA-sponsored task force was that the traditional scheduled overhauls did very little to improve the overall reliability of a complex device, unless the item has a single, dominant failure mode. This surprising finding prompted the FAA to join forces with several investigations that were undertaken by the major airlines on how best to improve aircraft reliability, and an FAA-sponsored Maintenance Steering Group was formed to oversee the development of a more effective maintenance program for the nation’s first jumbo jet - the Boeing 747. The initial report from this group titled “Handbook: Maintenance Evaluation and Program Development”, but better known by its short title, MSG-1, was issued in 1968 and was used to develop the very first maintenance program based on the principles of what is now called Reliability-centered Maintenance. Similar documents were later produced for the Lockheed 1011 and the Douglas DC-10 and, in Europe, for the Airbus A-300 series and the supersonic Concorde. The objective of the new RCM methodology was to design a planned maintenance program that would capture the maximum level of reliability and safety of which the equipment was capable – at the lowest possible cost. That the new programs achieved this is well illustrated by the following statistics:

• Under traditional maintenance programs for the Douglas DC-8, United Airlines typically expended more than 4 million man-hours on major structural inspections before reaching the 20,000 hours of operation point; whereas they expended only 66,000 man-hours to reach the same point for the considerably bigger and more complex DC-10 “jumbo” jet.
• Under traditional maintenance programs the Douglas DC-8 required the scheduled overhaul of 339 items, whereas the bigger and more complex DC-10 required that only 7 items be overhauled. Among the items that no longer required scheduled overhauls were the turbine engines, which at the time cost more than one million dollars each. This change alone was a major contributor to a reduction in labor and materials of more than 50%.

Although cost reductions of this magnitude are clearly important to organizations maintaining large fleets of complex equipment, it is equally important to note that these savings were achieved with absolutely no decrease in reliability.

• On the contrary, the better understanding of the underlying failure processes that this new method brought into play actually improved overall reliability. The “reportable event rate” for the DC-10 dropped by a factor of 200 relative to the DC-8, from 60 per million takeoffs for the DC-8 to 0.3 per million takeoffs for the DC-10.

The next big milestone came in 1974 when the Department of Defense (DOD) contracted with United Airlines to document the new maintenance processes being used by the civil aviation industry. The DOD then directed that the new approach embodied in these pioneering new concepts be called Reliability-centered Maintenance (RCM).

The seminal document describing what has become known as “Classical RCM” was published in 1978. It is a book titled Reliability Centered Maintenance, authored by F. Stanley Nowlan and Howard F. Heap, both of United Airlines. (HTM ComRef 19)

In the 1980s and 1990s, as the good word about the benefits of RCM became more widely known, the military adopted the RCM approach for both its ships (most notably for its fleet of nuclear submarines) and for its aircraft. NASA also adopted the RCM approach (most notably for its Shuttle Program) and the utility industry adopted RCM for its power plants (most notably for its nuclear power plants).

As other industry segments have adopted the RCM approach many variants of the classical, aviation-oriented methodology have emerged and many specialized consultants, facilitators and educators have championed many of these variants. The extent to which RCM is now a widely practiced and well-respected field can be determined by simply “Googling” the term. As of October 2014 the term “Reliability Centered Maintenance” returned 278,000 “hits” and the term “Reliability Centered Maintenance in medicine” returned 228,000 “hits”.

14.2 Key components of the Reliability-centered Maintenance (RCM) methodology

The key components of the RCM methodology are:

• A careful and comprehensive description of what functions the users expect the device to perform, and;
• A list of all of the possible adverse events (failures) - that are reasonably likely to occur – that would prevent the device from continuing to function as expected.

These two elements constitute the essence of a process called a Failure Mode and Effects Analysis - which is described elsewhere on this site (HTM ComDoc 13.).
Unfortunately, RCM has its own specialized terminology that can make it difficult to follow for those more used to the terminology of traditional maintenance.

14.3 How RCM is different from traditional preventive maintenance

As we have just noted, the traditional approach to PM is based on the simple idea that all that is needed to keep a piece of equipment running properly – even a fairly complex piece of equipment – is to periodically refurbish or replace any of the device’s components that are expected to degenerate or degrade over time, and before the end of the device’s anticipated useful working life. This usually involves periodic lubrication of any moving parts and periodic adjustments of any other mechanical components. Other examples of parts that are sometimes considered to be non-durable in the sense that they have a wear-out type of failure curve are plastic tubing and primary and secondary batteries. The traditional picture of how the performance of a piece of equipment changes as it ages is shown in Figure 14.1.

• If the device is designed to perform within certain specifications there might also be a periodic need to conduct some kind of check that it is performing within those specifications and to then make appropriate adjustments. In traditional maintenance, as it is practiced in the medical equipment maintenance business, this is called Performance Verification and Safety Testing, or some other similar name. In RCM this is called a Scheduled Failure Finding Task.
• RCM focuses on maintaining the device’s function rather than just its physical integrity, striving for minimum downtime and an acceptable level of safety. Secondary failures that do not degrade the device’s performance or safety are considered to be tolerable.
• The classical RCM method considers the entire system including the device’s accessories, any supporting utilities, the environment around the device, and the patient. The traditional maintenance approach does not.
• Unlike traditional maintenance, RCM takes into consideration the relative cost-effectiveness of different control/maintenance options.
• In contrast to the simple, prescriptive traditional approach, the RCM method involves extensive logical analyses of actual device data (see Figure 14.2) and is based on modern reliability theory. While this has obvious intellectual advantages over the simple prescriptive procedure specified by the device manufacturer, it can place a considerable analytical burden on the device user or owner.
• The manufacturer’s traditional PM procedures do not usually involve a very high level of sophistication and there is certainly little in the way of standardization in the format and content of the procedure. Some are very detailed and instructive; some are extremely sketchy. All are completely prescriptive and usually are not accompanied by any kind of justifying rationale.
• The author of the periodic restoration tasks listed in the traditional PM procedures is usually the manufacturer of the device in question. This, in itself, is less than ideal because the manufacturer might well have an economic conflict of interest in the form of a bias against extending the useful working life of the device.
• RCM’s precise analytical methodology strips away the fuzzy logic, mystery and "magical thinking" surrounding traditional PM.

14.4 The traditional RCM Methodology

The analytical process that constitutes the essence of the classical RCM methodology consists of creating, then evaluating, the following elements:

1) A comprehensive description of the essential functions of the device including any associated performance and safety standards. (Sometimes called “mapping the process”)
2) A list of all the ways, that are reasonably likely to occur, in which the device can fail to fulfill its function or functions. (i.e. a listing of the device’s failure modes)
3) A list of the events that can lead to each functional failure. (i.e. a listing of the device’s failure causes)
4) A quantification (based upon some agreed multi-level scale) of the severity of the (worst case) adverse consequences of each failure mode.
5) A quantification (based upon some agreed multi-level scale) of the likelihood that each identified failure mode will actually occur. The combination of the worst-case severity and the probability of occurrence is used to characterize the scale of the hazard associated with the failure.
6) A determination of what can be done to predict or prevent each failure.
7) A determination of what default strategies can be considered if a suitable proactive control task cannot be found.

The first five of these elements constitute what is known as a Failure Mode and Effects Analysis, which is addressed in more detail elsewhere on this site. (HTM ComDoc 13.)
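Elements 2) through 5) above can be pictured as a small table of failure modes, each carrying a severity rating and a likelihood rating. The following Python sketch is only an illustration under stated assumptions: the 1-to-5 scales, the use of a simple product to combine severity and likelihood, the class and function names, and the example failure modes are all hypothetical and are not taken from any RCM standard or from HTM ComDoc 13.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a simplified FMEA worksheet (hypothetical layout)."""
    description: str   # element 2: how the device fails to fulfill its function
    cause: str         # element 3: the event leading to the failure
    severity: int      # element 4: 1 (negligible) to 5 (life-threatening), assumed scale
    likelihood: int    # element 5: 1 (very unlikely) to 5 (frequent), assumed scale

    def hazard_score(self) -> int:
        # Assumption: severity and likelihood are combined by multiplication.
        # The text only says the two are "combined"; other rules are possible.
        return self.severity * self.likelihood


def rank_by_hazard(modes):
    """Return the failure modes ordered from highest to lowest hazard score."""
    return sorted(modes, key=lambda m: m.hazard_score(), reverse=True)


# Invented example entries for a defibrillator-like device.
modes = [
    FailureMode("No shock delivered", "Depleted battery", severity=5, likelihood=3),
    FailureMode("Display backlight dim", "Aging lamp", severity=1, likelihood=4),
    FailureMode("Output energy low", "Capacitor degradation", severity=5, likelihood=1),
]

ranked = rank_by_hazard(modes)
for mode in ranked:
    print(f"{mode.hazard_score():>2}  {mode.description} ({mode.cause})")
```

Ranking by a combined score like this is one simple way to decide which failure modes deserve attention first when working through elements 6) and 7).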

A signature feature of the RCM method is a unique decision tree or flowchart known as the Decision Worksheet or Decision Diagram. This flowchart addresses elements 6) and 7), above, by using a series of secondary questions and decision nodes to lead the analyst to the best of all possible control/PM strategies.

There has been some misunderstanding in some circles about the applicability of RCM to medical equipment maintenance because some of the proposed decision trees show an optional default strategy of “redesign”. This is a default strategy rather than a proactive strategy because it is dealing with a device in a failed state. And because redesigning the medical device itself is generally not an option that can be undertaken by equipment maintainers or even the equipment owners, some have expressed the opinion that the RCM method cannot be applied to this particular area. However, it should be kept in mind that the so-called “redesign” strategy is intended to be applied to the entire equipment system, including the device’s accessories, the support utilities, the environment around the device, and the way in which the device is used - as well as the design and fabrication of the device itself.

14.5 RCM maintenance strategies

The RCM terms for the traditional PM tasks are:

Scheduled restoration tasks (if the component is refurbished), or
Scheduled discard tasks (if the component is replaced)

Although they are seldom designed into medical devices, RCM introduces the option of using some kind of technique for monitoring the condition of a component that is subject to deterioration, creating a potentially very efficient, just-in-time form of failure prevention that is known as Predictive Maintenance. The RCM term for this type of activity is:

Scheduled on-condition tasks

RCM also addresses hidden failures. A hidden failure is one that is revealed only by performing periodic inspections or testing of the device to determine whether or not it is still performing within its functional and/or safety specifications. The RCM term for this type of activity is:

Scheduled failure-finding tasks

Most notably, RCM also permits a maintenance strategy in which nothing is attempted in the way of prevention and the device is simply repaired when it fails. The RCM term for this option is:

Corrective maintenance only (with no PM). This strategy is sometimes called run-to-failure (allowing the device to run until it fails, then repairing it).
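The strategies named in this section can be pictured as the outcomes of a small decision function. The sketch below is a deliberately simplified illustration, not the actual RCM Decision Diagram: the four yes/no inputs, their order, and all identifiers are assumptions made for this example, and a real analysis would also weigh failure consequences and cost-effectiveness.

```python
from enum import Enum


class Strategy(Enum):
    ON_CONDITION = "Scheduled on-condition task"
    RESTORATION = "Scheduled restoration task"
    DISCARD = "Scheduled discard task"
    FAILURE_FINDING = "Scheduled failure-finding task"
    RUN_TO_FAILURE = "Corrective maintenance only (run-to-failure)"


def choose_strategy(*, condition_monitorable: bool, wears_out: bool,
                    refurbishable: bool, hidden: bool) -> Strategy:
    """Pick a strategy for one failure mode (hypothetical, simplified logic)."""
    if condition_monitorable:
        # Deterioration can be watched, so act just before failure.
        return Strategy.ON_CONDITION
    if wears_out:
        # Age-related failure: refurbish the part if possible, else replace it.
        return Strategy.RESTORATION if refurbishable else Strategy.DISCARD
    if hidden:
        # Failure would not be obvious to the user, so test for it periodically.
        return Strategy.FAILURE_FINDING
    # Nothing proactive applies: repair the device when it fails.
    return Strategy.RUN_TO_FAILURE


# Example: a non-monitorable battery that wears out and cannot be refurbished.
battery = choose_strategy(condition_monitorable=False, wears_out=True,
                          refurbishable=False, hidden=False)
print(battery.value)
```

The order of the checks encodes a preference for on-condition tasks over time-based ones; that ordering is a design choice of this sketch rather than something stated in the text.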

14.6 The many ways that equipment systems fail

(From HTM ComDoc 1. Section 1.6 "What causes equipment systems to fail?")

There are many ways in which equipment systems fail and it is very important to recognize that a substantial number of these failures cannot be pre-empted by any kind of preventive maintenance. There are three general types of failures that can result in the device not functioning the way the user wants it to function.

The first general category is inherent reliability-related failures (IRFs). These are attributable to the design and construction of the device itself, including the inherent reliability of the components used in the device. They typically represent between 45 and 55% of all medical equipment repair calls. IRF type failures can be reduced (but not to zero) only by redesigning the device or changing the way it is constructed. These are failures that cannot be prevented by any kind of maintenance actions. IRF failures are sub-categorized as follows.

• Category IR1. A random failure or malfunction of a component part of the device. A result of the device’s intrinsic unreliability. These typically represent between 46 and 52% of all repair calls.
• Category IR2. Poor fabrication or assembly of the device itself.
• Category IR3. Poor design of the hardware or processes required to operate the device.

The second general category is process-related failures (PRFs). They typically represent 40 to 50% of all medical equipment repair calls. Reducing or eliminating these types of failure typically requires some kind of redesign of the system’s processes - for example, using better methods to train the equipment users to operate the equipment (as intended by the manufacturer) or to train them to treat the equipment more carefully. These too are failures that cannot be prevented by any kind of maintenance activities. PRF type failures are sub-categorized as follows.

• Category PR1. Incorrect set-up or operation of the device by the user. The user has not set the device up correctly or does not know how to operate it. Typically these represent between 13 and 20% of all repair calls. (Note that although this type of “failure” does not represent a complete loss of function, it can have the same effect. For example, an incorrectly set defibrillator can result in a failure to resuscitate the patient).
• Category PR2. Subjecting the device to physical stress outside its design tolerances. Often as a result of the device falling to the floor. These typically represent between 6 and 25% of all calls.
• Category PR3. Problem resulting from failure to recharge a rechargeable battery. These typically represent between 7 and 8% of all calls.
• Category PR4. Using a wrong or defective accessory. These typically represent between 3 and 9% of all calls.
• Category PR5. Exposing the device to environmental stress outside its design tolerances. These typically represent between 1 and 7% of all of the repair calls.
• Category PR6. Human interference with the device. Usually due to someone tampering with an internal control. These typically represent less than 1% of all calls.
• Category PR7. Problem due to an issue within a data network connected to the device’s output.

The third general category is maintenance-related failures (MRFs). They typically represent 2 to 4% of all medical equipment repair calls. These types of failure can usually be prevented through some kind of maintenance strategy incorporated into the facility’s maintenance program. MRF type failures are sub-categorized as follows:

• Category MR1. Problem due to inadequate restoration of a manufacturer-designated non-durable part (inadequate preventive maintenance). These calls typically represent between 1 and 3% of all calls.
• Category MR2. Poor or incomplete initial installation or set-up of the device. These typically represent between 1 and 3% of all calls.
• Category MR3. Problem attributable to poor periodic maintenance. Such as improper periodic calibration.
• Category MR4. Problem attributable to a poor quality previous repair of the device.
• Category MR5. Problem attributable to earlier intrusive maintenance.

The total number of repair calls, irrespective of what caused them, determines the device's effective reliability. However, it is the numbers of maintenance-related failures (MRFs) and inherent reliability-related failures (IRFs) that are of greatest interest to us, as maintainers, at this time. The level of MRF type failures provides a good measure of the effectiveness of the facility’s maintenance program, and the level of IRF type failures provides an equally good measure of the basic or inherent reliability of the devices in question.
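If each repair call is tagged with one of the category codes listed in this section, the share of IRF, PRF and MRF calls can be tallied directly. The following Python sketch shows one way this might be done; the function name, the data layout, and the sample call list are invented for illustration and do not represent real repair data.

```python
from collections import Counter

# Map the two-letter prefix of a category code to its general failure type.
GROUPS = {"IR": "IRF", "PR": "PRF", "MR": "MRF"}


def category_shares(codes):
    """Return the percentage of calls in each general category.

    `codes` is a list of strings such as "IR1" or "PR3" (hypothetical format
    based on the category names used in this section).
    """
    if not codes:
        raise ValueError("no repair calls to analyze")
    counts = Counter(GROUPS[code[:2]] for code in codes)
    total = sum(counts.values())
    return {group: 100.0 * counts[group] / total for group in GROUPS.values()}


# Invented sample: 20 coded repair calls.
calls = ["IR1"] * 10 + ["PR1"] * 4 + ["PR2"] * 3 + ["PR3"] * 2 + ["MR1"] * 1
shares = category_shares(calls)
for group, pct in shares.items():
    print(f"{group}: {pct:.0f}%")
```

Tracked over time, the MRF share gives a rough indicator of maintenance program effectiveness and the IRF share a rough indicator of inherent device reliability, as described above.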

14.7 Hidden failures

(From HTM ComDoc 1 Section 1.7)

A hidden failure (HF) is said to have occurred when either:

• The device delivers an output that is significantly out of specification but sufficiently similar to the output that the user wants that the failure is not immediately obvious to the user, or
• The device is significantly out of compliance with the relevant safety specifications for the device in question, and this deterioration is also not obvious to the user.

When this more subtle type of failure introduces a significant performance or safety degradation that can be detected only by some kind of performance verification or safety test it can constitute a serious safety threat. For example, a heart rate alarm that has malfunctioned so that it no longer goes off at the set limit will remain as a hidden, but potentially hazardous, failure until the alarm function is checked and the potentially dangerous degradation discovered. The potential seriousness (i.e. level of severity) of hidden failures will depend on the nature of the failure and on how far the performance or safety flaw is out of specification. For example, a significant reduction in the output of a defibrillator has to be considered life threatening, but a small excess in the electrical leakage current of a laboratory centrifuge – although it should be noted in the service report – is unlikely to constitute a significant problem, or be considered an imminent safety hazard.

Hidden failures are discovered when PVST tasks are performed during the PM. When they are found they should be described in some kind of note on the PM work order or the PM report and it would be helpful if the description of the findings provided enough information to enable a judgment to be made as to the worst case potential level of severity of the adverse outcome that would have resulted if the hidden failure had not been discovered (see Section 14.15 below).

A particularly important type of hidden failure is one that impedes or disables the proper operation of an automatic protection mechanism (APM) that is included as a component of the device. An APM is usually included in the design of a particular device to provide protection against another hidden failure that is itself considered capable of resulting in a serious or potentially life-threatening adverse consequence.

14.8 Automatic protection mechanisms

(From HTM ComDoc 1. Section 1.8)

Perhaps because of their increasing complexity and a heightened concern about patient safety, more and more medical devices are being provided with some kind of automatic protection mechanism. There are several different kinds:

1. Automatic warning devices - to alert the operators to the onset of a potentially serious hidden failure, such as the warning lights and audible alarms found on many patient monitoring systems to indicate when there is some kind of degradation, such as a monitoring lead that needs to be adjusted.
2. Automatic equipment shutdown devices - such as the resistance sensors found on all modern powered x-ray tables that are similar to those found on elevator doors.
3. Automatic relief devices - such as the over-pressure pop-off valves found on most sterilizers.
4. Dual components or dual devices set up in parallel to provide automatic functional redundancy.
5. Guard mechanisms to physically eliminate or preclude the possibility of a catastrophic failure from occurring - such as particulate filters in channels circulating lubricating fluids to moving parts that are subject to wear.

A failure within an APM is particularly troublesome because it is often a hidden failure and, because the targeted malfunction is sufficiently hazardous to justify this additional design feature, the consequence of the protection mechanism failing is likely to be very serious.

14.9 The benefits of categorizing repair calls and PM Findings

There is, of course, more to equipment safety than just maintenance and only a very small percentage of injuries caused by the failure of medical devices are related to maintenance. For devices with potential hidden failures, however, periodic performance verification and safety testing is the only preventive measure that the facility can implement, to improve the level of equipment-related safety, once the device has been purchased. To this same end, if repair calls are coded using a coding scheme similar to that described above (in Section 14.6 titled The many ways that equipment systems fail) and the findings from our PM tasks are coded as suggested in Section 14.15 (below), this data can be used to (a) determine the actual level of reliability and safety of the facility's “high-risk” equipment, and (b) identify some non-maintenance-related tasks that can be used to maximize equipment reliability and safety.

14.10 Equipment-related safety measures - in approximate order of effectiveness

1. User training. (Category PR1 repair calls). A program emphasizing how to use the equipment properly, how to exercise care in handling the equipment and how to avoid any temptation to tamper with the device’s internal controls. This is a very important and frequently underutilized safety measure. This single measure alone has the potential to reduce the number of repair calls by 13 to 20%. Because Category PR1 calls indicate the user’s lack of familiarity with the device, this kind of training should also reduce the potential for the user to misuse the device in a way that might lead to a patient injury.
2. Management of rechargeable batteries. (Category PR3 repair calls). A program dedicated to minimizing potential problems with rechargeable batteries. Again it is important to be sure that any user responsibilities are addressed. A program of this kind has the potential to reduce the number of failures of battery-powered devices and reduce the number of total repair calls by 7 to 8%.
3. Accessories. (Category PR4 repair calls). Ensuring that any accessories purchased are of the appropriate quality. This measure has the potential to reduce the number of device failures by 3 to 9%.
4. Environmental conditions. (Category PR5 repair calls). Ensuring that careful attention is given to the environmental requirements of high-risk devices. Tightening up this oversight has the potential to reduce the number of device failures by 1 to 7%.
5. Maintenance quality. (Category MR2 & 4 repair calls). Ensuring that the installation, initial set-up and repair of all of the facility’s critical devices are performed properly.
6. Planned maintenance. (Category MR1, 3 & 5 repair calls). For each of the facility’s devices that has been classified as a high-risk device and that has a failure mode in which the device will fail if a component is not given appropriate periodic rejuvenation, there should be a program to see that this PM is performed competently and in a reasonably timely manner. Usually the manufacturer’s literature will specify which of the device’s components require this periodic attention, and will often recommend a specific inspection or replacement interval. It is to be expected that these recommendations will be sufficiently conservative to allow for the hardest use that the device may encounter. Data from the facility’s PM Findings database (see Section 14.15 below) can provide evidence as to whether these recommended intervals are too short or too long. An interval that is too short would be indicated by a high level of PM Code 1 Findings (“The physical condition of the restored part was found to be good”). An interval that is too long would be indicated by a high level of PM Code 9 Findings (“The restored part was already worn out …”). Some devices, such as certain ventilators, are provided with hours-of-use meters that make it possible to use a more efficient “metered maintenance” strategy rather than the more conventional “interval-based inspections” strategy for rejuvenating the parts subject to wear or deterioration (non-durable parts).
7. Mitigation. Every piece of equipment, even the most carefully designed and well constructed device that is also subjected to the most effective set of preventive measures, will eventually fail. For this reason it is prudent to have some measure of last resort for all of the hospital’s “high risk” devices. The importance of such measures is recognized in the Joint Commission (JC) standards, which for many years have required hospitals to have both written procedures to follow when medical equipment fails and backup equipment, where appropriate.
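The interval check described under item 6 (Planned maintenance) can be sketched in a few lines of Python. Only PM Codes 1 and 9 are named in the text; the 50% threshold for a "high level" of findings, the function name, and the placeholder code 5 used in the examples are assumptions made purely for illustration.

```python
def interval_signal(finding_codes, threshold=0.5):
    """Suggest whether a PM interval looks too short or too long.

    `finding_codes` is a list of integer PM Finding codes for one device type.
    Code 1 = restored part found in good condition; code 9 = restored part
    already worn out. `threshold` is a hypothetical cut-off for a "high level".
    """
    if not finding_codes:
        return "no data"
    n = len(finding_codes)
    share_good = finding_codes.count(1) / n
    share_worn = finding_codes.count(9) / n
    # Check the worn-out case first, since it is the less safe condition.
    if share_worn >= threshold:
        return "interval may be too long"
    if share_good >= threshold:
        return "interval may be too short"
    return "no clear signal"


# Invented examples; code 5 stands in for any other finding code.
print(interval_signal([1, 1, 1, 5]))   # mostly good parts
print(interval_signal([9, 9, 1]))      # mostly worn-out parts
print(interval_signal([1, 9, 5, 5]))   # mixed findings
```

A facility would choose its own threshold and minimum sample size before acting on such a signal.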

14.11 Benefits of using the RCM methodology

Texts describing the RCM methodology cite a number of general benefits that are said to result from adopting the RCM approach.

• Higher levels of device reliability and safety
• A significant reduction in equipment downtime
• Lower cost of equipment maintenance
• Longer useful working lives for the maintained items
• Creation of more comprehensive maintenance databases

Although these generalizations seem to offer great promise for the future of medical equipment maintenance, it is in the area of efficiently addressing regulatory compliance that the RCM method can be of immediate help. The Healthcare Technology Management (HTM) Community has two specific challenges that need immediate attention.

There has for some time been a continuing debate about the reluctance on the part of the primary regulator of maintenance practices in the nation’s healthcare facilities (the Centers for Medicare & Medicaid Services, also referred to as CMS) to modify its insistence (in the Interpretive Guidelines to its regulation # 482.41(c)(2)) that all maintenance and testing of medical devices be performed “... in accordance with the manufacturer’s recommendations …”. Note that this older requirement parallels very closely where the aviation industry was in the 1950s, before the FAA stimulated the development of the RCM methodology.

After a considerable period of lobbying by representatives of the HTM Community for a more modern, risk-based approach, the CMS requirement was partially relaxed on December 2, 2011 to allow alternative equipment maintenance strategies to be used for equipment that can be deemed “not critical to patient health and safety …”. As for equipment that is deemed “critical”, the new CMS interpretive guideline goes on to specify that “such equipment includes, but is not limited to, life-support devices, key resuscitation devices, critical monitoring devices, equipment used for radiological imaging, and other devices whose failure may result in serious injury to patients or staff”, and that these types of equipment must continue to be maintained the traditional way – according to the manufacturer’s recommendations.

The amended guideline also states that even for the so-called “non-critical” types of equipment “… if the hospital is adjusting maintenance activity frequencies below those that are recommended by the manufacturer, such adjustments must be based on a systematic evidence-based assessment. …” and that “… The evidence must provide support that the adjustment will not adversely affect patient or staff health and safety. …”

Although welcome, these changes present the HTM Community with two immediate challenges.

1. Creating a model process for a “systematic evidence-based assessment” to provide acceptable criteria for determining which types of devices should be considered “critical to the health and safety of patients and staff”; and
2. Determining the relationship (if any) between the length of the device’s maintenance intervals and the level of the threat to “the health and safety of patients and staff”.

14.12 “High Risk” versus “Critical”

Some of the terminology used in the Guidelines to the CMS regulations and in the current Joint Commission (JC) standards has the potential to cause some confusion and it is worth taking a moment here to address this issue.

The current JC standards use the phrase “high-risk medical equipment” “… for which there is a substantial risk of serious injury or death to a patient or staff member should the equipment fail” to define the sub-inventory of devices that (presumably) must still be maintained according to the manufacturer’s recommendations. The CMS document uses the terms “critical equipment” and “non-critical equipment” and “equipment that is critical to patient health and safety” to define the same sub-inventory.

Strictly speaking, according to current risk management usage, the terms are not synonymous. A critical device is one that has the theoretical potential to cause an injury if it fails, because the severity of the consequence of the failure is high, but the failure might not have a high likelihood of actually occurring. A high-risk device, by contrast, has at least one failure mode with both a high-severity consequence and a high probability of actually occurring. We will use the terms “critical device” and “high-risk device” in this document according to their proper definitions, but keep in mind that both organizations appear to have the same intended target (high-risk devices) for the required sub-inventory. It does not make sense to impose a more stringent PM requirement on devices that are already deemed safe because any failure modes they might have with a serious adverse consequence are very unlikely to occur.

                          | With good reliability   | With poor reliability
Critical device types     | Low-risk (safe) devices | High-risk (hazardous) devices
Non-critical device types | Low-risk (safe) devices | Low-risk (safe) devices

Figure 14.1 - Critical devices that show good reliability are not hazardous!

To illustrate this concept, think about two different modes of air travel. An airplane can be thought of as a critical type of device because a crash, should it fail, is certainly a high-severity (life-threatening) outcome. However, because they have proved to be reliable (by not crashing very often), commercial airplanes are generally considered to be a safe mode of transportation - even though they have the same worst-case high-severity outcome (i.e. crashing) as experimental aircraft. Experimental aircraft, on the other hand, have a much less reassuring track record with respect to their likelihood of crashing, so they are considered to be a high-risk mode of air travel.
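The logic of Figure 14.1 can be written down in a few lines. The following is only an illustrative sketch: the names `Risk` and `classify_risk` are hypothetical, and the two yes/no inputs stand in for the judgments on severity and likelihood described above.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low-risk (safe)"
    HIGH = "high-risk (hazardous)"


def classify_risk(is_critical: bool, is_reliable: bool) -> Risk:
    """Return the cell of Figure 14.1 that applies to a device type.

    is_critical: a failure could have a high-severity consequence.
    is_reliable: that failure is unlikely to actually occur.
    """
    # Only the combination "critical" and "poor reliability" is high-risk.
    if is_critical and not is_reliable:
        return Risk.HIGH
    return Risk.LOW


# The air-travel illustration: same worst-case severity, different likelihood.
examples = {
    "commercial airplane": classify_risk(is_critical=True, is_reliable=True),
    "experimental aircraft": classify_risk(is_critical=True, is_reliable=False),
}
for name, risk in examples.items():
    print(f"{name}: {risk.value}")
```

Note that three of the four combinations come out as low-risk, which is the point of the figure: criticality alone does not make a device hazardous.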

14.13 Systematic, evidence-based assessment

A quick review of the example in HTM ComDoc 13 shows that a complete FMEA of a device such as an infusion pump addresses (in Step 2) all three basic types of failure modes, including the following:

Inherent-related failures (such as “the pump transferring fluid at too high a rate” - because of a random failure in the electronics of the pump unit itself),
Process-related failures (such as “the pump transferring fluid at too high a rate” - because the wrong dose had been entered), and
Maintenance-related failures (such as “the pump transferring fluid at too high a rate” - because it had been improperly calibrated).

However, for the purpose of this particular type of assessment, which we will call a PM or maintenance-focussed FMEA, we will ignore the inherent-related and process-related failure modes and focus exclusively on those maintenance-related failure modes that could cause a serious patient injury. We will then categorize those devices found to have maintenance-related failure modes that could have life-threatening or serious injury consequences for either the patient or members of the staff as “critical”, i.e. as presenting a potential threat to the health and safety of patients or staff. Once these potentially serious maintenance-related failure modes have been identified and determined (in Step 4) to be reasonably likely to occur, it will be possible to select appropriate control/preventive actions for incorporation into the facility’s maintenance program.

In the example used above - the improper calibration of an infusion pump - this particular failure mode will create a hidden failure because it would not be immediately obvious to the user. And it is possible, if the pump is delivering a potent drug to the patient, that this particular failure could result in a top-level injury. Possible mitigating factors include whether the pump is provided with some kind of self-check and alarm and, if it is, the likelihood that there will be clinical staff nearby who are trained to act on the alarm.

After this initial identification of potentially critical failure modes, in Step 3, the FMEA process calls for the members of the analytical team to use their collective professional judgment to place the potential (worst case) severity of the consequences of this failure mode on a multi-level scale. There are several alternative scales but in HTM ComDoc 13 we chose to use the four-level scale used by the VA’s National Center for Patient Safety. The highest level of severity on all of the scales is “life threatening or capable of causing a major injury”.

In Step 4, the team estimates how likely it is, on a similar four-level scale, that this kind of event will actually occur. Although it would be helpful to have some reasonably reliable statistics to steer this decision from a repository such as the FDA’s Manufacturer and User Facility Device Experience (MAUDE) database, this particular source is, at the moment, not very comprehensive and therefore of only limited value. There is a proposal elsewhere (HTM ComDoc 7. “Creating a community database of findings on current levels of equipment reliability and safety”) to create a database of PM Findings that could provide the supporting quantitative “evidence” (in the form of relevant data) that the CMS is requiring (see Section 14.15 below).

The next steps in the FMEA process (Steps 5 and 6) call for the analytical team to determine the detectability of this type of failure. If it should be judged to be readily detectable, say, because the device has an alarm and there is likely to be a clinician nearby, then this precludes the need for any other control measure. The following criteria are used (see Figure 13.1 in HTM ComDoc 13) to determine whether or not the analysis of this particular failure mode should continue:

• Is the hazard sufficiently significant, i.e. the hazard index is 8 or higher?
• Is there another effective control for the hazard that would preclude the need for yet another control measure?
• Is the hazard so obvious that a control measure is not justified?

However, irrespective of the detectability finding, devices that are identified as having a maintenance-related failure mode with a potentially serious or life-threatening level of severity and a high likelihood of actually occurring should be classified as “high-risk” devices.
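The screening described in Steps 3 to 6 can be sketched as follows. This is an illustration under stated assumptions, not the procedure from HTM ComDoc 13: it assumes that the hazard index is the product of the severity and likelihood ratings, each numbered 1 (lowest) to 4 (highest), which is consistent with a threshold of 8 on two four-level scales. The function names and the cut-off levels used in `is_high_risk` are hypothetical.

```python
def hazard_index(severity: int, likelihood: int) -> int:
    """ASSUMPTION: hazard index = severity x likelihood, each rated 1..4."""
    for name, value in (("severity", severity), ("likelihood", likelihood)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be 1, 2, 3 or 4, got {value!r}")
    return severity * likelihood


def analysis_continues(severity: int, likelihood: int,
                       has_effective_control: bool,
                       is_obvious: bool) -> bool:
    """Apply the three screening questions to one failure mode."""
    if hazard_index(severity, likelihood) < 8:
        return False  # hazard is not sufficiently significant
    if has_effective_control:
        return False  # an existing control makes another one unnecessary
    if is_obvious:
        return False  # hazard is so obvious that a control is not justified
    return True


def is_high_risk(severity: int, likelihood: int,
                 serious_severity: int = 3, high_likelihood: int = 3) -> bool:
    """High-risk classification; deliberately ignores detectability.

    The default cut-offs (3 and 3) are hypothetical placeholders for
    "serious or life-threatening" and "high likelihood".
    """
    hazard_index(severity, likelihood)  # reuse the range validation
    return severity >= serious_severity and likelihood >= high_likelihood


# A life-threatening failure mode (4) that is unlikely to occur (1):
print(analysis_continues(4, 1, has_effective_control=False, is_obvious=False))
print(is_high_risk(4, 1))
```

Keeping `is_high_risk` separate from `analysis_continues` mirrors the text: detectability can end the search for further control measures, but it does not change the classification of the device.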

14.14 The relationship between the length of the maintenance interval and the threat to the health and safety of patients and staff

On the question of a logical relationship between the length of the maintenance interval and a possible threat to “the health and safety of patients and staff”: in the case of the traditional maintenance tasks (scheduled restoration and scheduled discard), we first need to determine what is likely to happen if the parts of a device that are subject to wear or some other kind of progressive deterioration are not refurbished or replaced before they enter the end-of-life phase and deterioration begins. Because there are many possible scenarios, undertaking all of those analyses could be overwhelming, so the only practical approach is to be conservative and assume that using a longer interval than recommended will, indeed, cause the part to deteriorate and that this deterioration will have a significant impact on the performance of the device. We further assume that, if the device has been categorized as “high-risk”, this degraded performance will have a worst-case adverse impact on the patient.

It is virtually impossible to judge how conservative the manufacturers’ recommendations are for the refurbishment intervals for each device. Nor is it likely that there will be any consistency in this factor from device to device and from manufacturer to manufacturer. So, to be safe, the manufacturer’s recommendations for this refurbishment interval for all “high risk” devices must be respected.

If the device has not been categorized as a “high-risk” device then, by definition, the failure of the device will not have any significant impact on the health and safety of the patient. The failure of the device could have some economic impact but this is not a regulatory issue. Logically, these devices can and should be allowed to “Run-to-Failure” if the economic trade-off is favorable.

On the other hand, for devices that do not have any parts that need refurbishment, and for which the only “maintenance” that the manufacturer recommends is periodic performance checking and safety testing (failure-finding tasks), there are logical relationships between the maintenance interval and the MTBFs of the failures that can be explored. In the case of Scheduled Failure Finding Tasks there is no optimum interval (shorter is always better) but we have shown elsewhere (HTM ComDoc 6. “Choosing appropriate PM intervals”) that for reasonable estimates of the MTBFs of random device failures (between 50 and 250 years) and typical maintenance/inspection intervals (between 6 months and 5 years) the increase in the time that the patient would be exposed to the potentially-hazardous hidden failure if the maintenance interval is extended is extremely small.

For a hidden device failure with an MTBF of 50 years, the amount of time that the patient will be exposed to the hidden failure is 0.5% of the interval (0.9 days) when the testing is done at 6-month intervals, and 1.0% of the interval (3.65 days) when the testing is done at 12-month intervals. Where the MTBF of the hidden failure is longer, say 250 years (which we consider a quite likely value), the exposures at the two test intervals, and the difference between them, are even smaller: 0.1% (0.18 days) and 0.2% (0.73 days) respectively. In the cited reference we show that, for ratios of the testing interval to the MTBF of less than 5%, the relationship is in fact linear: the amount of time that the patient will be exposed to the hidden failure (expressed as a fraction of the testing interval) is given by 0.5 times the ratio of the testing interval to the MTBF of the hidden failure.
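The figures quoted above can be reproduced with a short calculation. The sketch below implements the linear rule (0.5 times the ratio of the testing interval to the MTBF) and, for comparison, the exact mean exposure under the added assumption that hidden failures occur at a constant rate (exponentially distributed times to failure). That assumption, the function names and the 365-day year are not taken from the cited reference.

```python
import math

DAYS_PER_YEAR = 365.0


def exposure_fraction_linear(interval_years: float, mtbf_years: float) -> float:
    """Linear rule: fraction of the interval spent with a hidden failure."""
    return 0.5 * interval_years / mtbf_years


def exposure_fraction_exponential(interval_years: float, mtbf_years: float) -> float:
    """ASSUMPTION: constant failure rate 1/MTBF, device known good at the
    start of each interval. Mean fraction of the interval spent failed is
    1 - (1 - exp(-r)) / r, where r = interval / MTBF."""
    r = interval_years / mtbf_years
    return 1.0 + math.expm1(-r) / r


def exposure_days(interval_years: float, mtbf_years: float) -> float:
    """Linear-rule exposure converted to days per testing interval."""
    fraction = exposure_fraction_linear(interval_years, mtbf_years)
    return fraction * interval_years * DAYS_PER_YEAR


for mtbf in (50.0, 250.0):
    for interval in (0.5, 1.0):
        linear = exposure_fraction_linear(interval, mtbf)
        exact = exposure_fraction_exponential(interval, mtbf)
        print(f"MTBF {mtbf:5.0f} y, interval {interval:3.1f} y: "
              f"linear {linear:.3%}, exponential model {exact:.3%}, "
              f"{exposure_days(interval, mtbf):.2f} days")
```

Under the constant-rate assumption the exact value expands to r/2 - r²/6 + …, so the linear rule slightly overstates the exposure, and at a ratio of 5% the two differ by less than 2% of the linear value.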

As a practical matter, this means that for devices that are classified as “low-risk” there is very little disadvantage in testing for hidden failures at the same interval as is used for any scheduled restoration or discard tasks. If the device has no non-durable parts then we suggest using whatever interval is convenient for any other reason. And, according to the RCM methodology, these “non-critical” device types are excellent candidates for the very efficient Run-to-Failure (RTF) strategy. There is, again by definition, no safety downside and the economic advantages can be very significant. Remember that in the aviation business it was by adopting this RTF strategy where it was appropriate that they were able to reduce the industry’s maintenance costs by 50% - which also (amazingly) increased their reliability and safety statistics by a factor of 200!

14.15 Proposed format for documenting the PM Findings data on every PM Report

A section is added at the end of each PM procedure asking the service person to indicate by circling one of three letters (A, B or F) whether or not the performance and safety testing of the device revealed any significant degradations or hidden failures.

A = nominal. The letter A should be circled when the results of all of the PVST tests were in compliance with the relevant specifications, and any other functions tested were within expectations.
B = minor OOS condition(s) found. The letter B should be circled when one or more conditions were found that were slightly out-of-spec (OOS) or slightly outside expectations. The purpose of this B rating is to create a watch list to monitor for future adverse trends in particular performance or safety features, even though the discrepancy is not considered to be significant at this time. An example of this would be an electrical leakage reading of 310 microamps which is within 5% of the 300 microamp limit. A B rating should be considered a passing grade.
F = serious OOS condition(s) found. The letter F should be circled when one or more performance or safety features are found to be significantly out-of-spec (OOS). This is a failing grade and, if it is a high-risk device, it should be removed from service immediately.

The service person is also asked to indicate by circling one of four numbers (1, 5, 9 or 0) the physical condition in which the device parts that were rejuvenated by the traditional PM tasks were found. The numerical ratings should be circled to indicate one of the following findings.

1 = better than expected. There was very little or no deterioration; i.e. the physical condition of the restored part was found to be still good.
5 = nominal. There was some minor deterioration but no apparent adverse effect on the device’s function; i.e. the physical condition of the restored part was found to be about as expected.
9 = serious physical deterioration. The restored part was already worn out and probably having an adverse effect on device function; i.e. the physical condition was found to be considerably worse than expected.
0 = no physical restoration required. The device has no parts needing any kind of physical restoration.

If the PM findings are systematically documented each time a PM is performed, then aggregated into a PM Findings database, it will be possible to:

  • get an indication of the mean time between failures (MTBFs) of any hidden failures, and
  • get an indication of how well the PM interval matches the optimum, which would be when the part being restored has deteriorated, but only to the point just before the deterioration begins affecting the functioning of the device:
      • if the interval is too short, this would be indicated by a preponderance of PM Code 1 findings;
      • if the interval is too long, it would be indicated by a preponderance of PM Code 9 findings; and
      • a preponderance of PM Code 5 findings indicates that the PM interval is just about right.
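As an illustration of how aggregated PM Findings might be turned into these two indications, here is a minimal sketch. Everything in it beyond the letter and number codes themselves is an assumption: the record layout (`PMFinding`, `interval_years`), the use of F findings as a proxy for hidden failures, the simple "device-years divided by failures" MTBF estimate, and the use of the most frequent code as a stand-in for "a preponderance".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PMFinding:
    test_code: str         # "A", "B" or "F" (performance and safety testing)
    condition_code: int    # 1, 5, 9 or 0 (physical condition of restored parts)
    interval_years: float  # hypothetical field: time since the previous PM

    def __post_init__(self) -> None:
        if self.test_code not in ("A", "B", "F"):
            raise ValueError(f"unknown test code {self.test_code!r}")
        if self.condition_code not in (1, 5, 9, 0):
            raise ValueError(f"unknown condition code {self.condition_code!r}")


def estimate_mtbf_years(findings: Iterable[PMFinding]) -> Optional[float]:
    """Rough MTBF: total device-years observed / number of F findings.

    Returns None when no F findings have been recorded yet.
    """
    findings = list(findings)
    failures = sum(1 for f in findings if f.test_code == "F")
    if failures == 0:
        return None
    return sum(f.interval_years for f in findings) / failures


def interval_indication(findings: Iterable[PMFinding]) -> str:
    """Interpret the most frequent condition code (code 0 is ignored)."""
    counts = Counter(f.condition_code for f in findings if f.condition_code != 0)
    if not counts:
        return "no restoration tasks recorded"
    most_common_code, _ = counts.most_common(1)[0]
    return {
        1: "interval may be too short",
        5: "interval is about right",
        9: "interval may be too long",
    }[most_common_code]


# Ten annual PMs on one device type: nine passes, one serious OOS finding.
sample = [PMFinding("A", 5, 1.0)] * 9 + [PMFinding("F", 5, 1.0)]
print(estimate_mtbf_years(sample), interval_indication(sample))
```

With only a handful of records the MTBF estimate is very uncertain, which is one reason for pooling findings from many facilities into a shared database.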

This page was last modified 23:09, 8 September 2018.