McKinsey estimates that predictive maintenance can reduce unplanned downtime by 30–50%, extend equipment lifespan by 20–40%, and cut maintenance costs by 15–25%. IBM research suggests that well-constructed machine learning models can predict failures with up to 90% accuracy. These figures are real, and the plants achieving them exist. But they are not representative of the average deployment.

What is representative: a pilot that runs on three assets, produces some compelling early results, then quietly stalls. The vendor's engineers move on, the internal champion gets reassigned, and eighteen months later the system is generating alerts that nobody acts on. The board is told the technology wasn't ready. In most cases, the technology was ready. The organisation wasn't.

60–75% of deployments are plagued by incomplete sensor coverage or unreliable data integration.
65–80% of facilities face skills gaps that prevent effective model use and alert triage.
55–70% of implementations encounter workforce resistance that undermines adoption.

The Data That Doesn't Describe Failure

Predictive maintenance models learn from history. They study how a bearing behaved in the weeks before it failed, how vibration signatures shifted, how temperature crept. The problem is that most industrial historians were never built to capture failure — they were built to log process performance. So when you ask them to train a failure prediction model, they hand you thousands of hours of normal operation and almost no verified failure events.

Six data quality failures recur in virtually every troubled implementation: missing data from sensors that weren't installed or went offline; noisy data corrupted by interference or miscalibration; siloed data locked in OT systems that the analytics platform can't reach; an absence of failure history (the plant ran too well, or records weren't kept); inconsistent timestamp formats that make cross-asset comparison impossible; and a lack of operating context — no record of load conditions, feedstock variation, or seasonal factors that change what "normal" looks like.
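
Several of these checks can be automated before any model training starts. The following is a minimal sketch, assuming a single sensor's readings have been exported as (timestamp string, value) pairs; the format list, the UTC default for naive timestamps, and the "gap" rule of more than twice the expected sampling interval are all illustrative assumptions, not part of any particular historian's export format.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical formats; a real export may use others. Note that day/month
# order in slash-separated dates is ambiguous and must be confirmed per source.
KNOWN_FORMATS = ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S", "%d/%m/%Y %H:%M:%S")

def parse_timestamp(raw):
    """Try each known format; treat naive timestamps as UTC (an assumption)."""
    for fmt in KNOWN_FORMATS:
        try:
            ts = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)
    return None

def audit_readings(readings, expected_interval, failure_events):
    """Summarise basic data quality problems for one sensor.

    readings: list of (raw_timestamp, value_or_None) tuples.
    expected_interval: timedelta between consecutive samples.
    failure_events: list of verified failure records (may be empty).
    """
    parsed = [(parse_timestamp(raw), value) for raw, value in readings]
    # Only readings with a usable timestamp and a value count towards continuity.
    good = sorted(ts for ts, value in parsed if ts is not None and value is not None)
    gaps = sum(1 for a, b in zip(good, good[1:]) if b - a > 2 * expected_interval)
    return {
        "total": len(readings),
        "unparseable_timestamps": sum(1 for ts, _ in parsed if ts is None),
        "missing_values": sum(1 for _, value in parsed if value is None),
        "gaps": gaps,
        "verified_failures": len(failure_events),
    }
```

A report showing zero verified failures is itself a finding: it suggests the asset is not yet a candidate for supervised failure prediction, however clean the rest of its data looks.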

"The model is only as good as the history you give it. And most industrial historians were designed to forget failures, not learn from them."

Resolving data quality issues to a sufficient standard takes 6–12 months in most facilities — longer if the OT landscape is fragmented across multiple vendors and decades of legacy installation. Teams that underestimate this phase start model training too early, produce models that are poorly calibrated, and then lose confidence in the outputs when those models begin generating false positives.

// Sensor Coverage Blind Spots

A predictive model cannot warn you about a failure it cannot see. Incomplete sensor coverage is the most common and most underestimated failure mode in predictive maintenance (PdM) deployments. A programme might have excellent vibration monitoring on ten rotating assets and nothing at all on the downstream conveyor that those assets feed into. When the conveyor fails unexpectedly, the credibility of the whole system takes a hit, even though the system was never designed to monitor it.

Sensor retrofitting on brownfield sites is expensive, disruptive, and politically fraught. Decisions about what not to instrument are often made on budget grounds without adequate risk analysis. The result is a coverage map with gaps that only become visible when something fails in one of them.
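
One way to make those gaps visible before a failure does is to list every uninstrumented asset alongside its criticality. The sketch below assumes a hand-maintained asset register; the field names (`criticality`, `sensors`, `feeds`) and the 1 to 5 criticality scale are hypothetical and would need to match whatever register the site actually keeps.

```python
def coverage_gaps(assets):
    """List uninstrumented assets, most critical first.

    assets: dict mapping asset name to
        {"criticality": int, "sensors": list, "feeds": list of asset names}
    """
    monitored = {name for name, a in assets.items() if a["sensors"]}
    # Assets that sit directly downstream of something the programme does monitor.
    downstream_of_monitored = {
        target for name in monitored for target in assets[name]["feeds"]
    }
    gaps = [
        {
            "asset": name,
            "criticality": a["criticality"],
            "fed_by_monitored_asset": name in downstream_of_monitored,
        }
        for name, a in assets.items()
        if not a["sensors"]
    ]
    return sorted(gaps, key=lambda g: g["criticality"], reverse=True)
```

An unmonitored, high-criticality asset that is fed by monitored equipment is the conveyor case described above: the place where an unexpected failure is most likely to be blamed on the system.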

The OT/IT Disconnection

Predictive maintenance analytics live in IT. The assets being monitored live in OT. Between them sits a gap that most vendors paper over during the pilot — typically by having their engineers manually extract and transfer data — and never solve at scale.

An effective predictive maintenance programme requires bidirectional integration: sensor data flowing up from the OT layer into the analytics platform, and work orders flowing back down from the analytics platform into the CMMS when a failure is predicted. In most implementations, only the first half of this pipe is built. The alert appears on a dashboard somewhere in the control room. A technician sees it, decides whether to trust it, and if they do, manually enters a work order into the CMMS. That manual step — that moment of discretion between the machine's warning and the maintenance system's response — is where value leaks out.
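
Closing that gap does not have to mean removing human judgement. A rough sketch of the return path, under stated assumptions: the `Alert` fields, the work order schema, the seven-day urgency cut-off, and the 0.8 auto-submit threshold are all hypothetical, and a real CMMS would impose its own API and field names.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    asset_id: str
    failure_mode: str
    confidence: float       # model's own score in [0, 1]
    days_to_failure: float  # predicted lead time

def draft_work_order(alert, auto_submit_threshold=0.8):
    """Turn an alert into a pre-populated work order payload (hypothetical schema)."""
    priority = "urgent" if alert.days_to_failure <= 7 else "planned"
    # High-confidence alerts go straight in; the rest still arrive pre-filled
    # so the technician reviews a draft instead of typing one from scratch.
    status = "submitted" if alert.confidence >= auto_submit_threshold else "pending_review"
    return {
        "asset_id": alert.asset_id,
        "description": f"Predicted {alert.failure_mode} within ~{alert.days_to_failure:.0f} days",
        "priority": priority,
        "status": status,
        "source": "pdm_model",
    }
```

The design choice here is that every alert produces a record in the maintenance system, so the technician's discretion is exercised on a draft work order rather than on whether to create one at all.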

// The Four-Point Failure Pattern

Production PdM deployments break down at four specific junctions: sensor coverage gaps that leave assets unmonitored; dirty telemetry that corrupts model inputs; isolation from the CMMS that turns predictions into informational displays rather than actionable work orders; and a manual handoff between alert and technician response that introduces latency and inconsistency. Address all four or address none: partial implementations typically perform worse than well-run preventive maintenance programmes.

Organisational Immunology

Every organisation has an immune system: a set of reflexes that identify and expel foreign objects. Predictive maintenance, when it works, tells experienced engineers that they are wrong. It says: the bearing you just inspected and certified as healthy will fail in eleven days. An engineer who has been doing this for twenty years and has never had a bearing fail without warning does not simply accept that claim. They note it, perhaps mention it to a colleague, and move on.

This is not irrationality. It is a reasonable response to a new system that has not yet earned trust. The problem is that predictive maintenance programmes typically don't invest enough time in the process of earning that trust. Vendors demonstrate accuracy during the pilot on assets where the model is well-calibrated. They show impressive catch rates. Then the model goes live across the full asset base, where calibration is weaker, and its error rate climbs. The engineers who were sceptical from the start feel vindicated.

Workforce resistance affects 55–70% of implementations. Resolving it takes 4–8 months with structured change management — and most programmes allocate no budget to change management at all, treating it as something that will sort itself out once people see the system working.

The recurring root causes across troubled deployments:

  • No historical failure data to train models on — plants that ran well never logged failure signatures
  • Sensor coverage designed around process monitoring, not asset health — critical components left unmonitored
  • OT/IT integration built for the pilot, not for scale — data pipelines that break when vendor engineers leave
  • Alert-to-action gap never closed — predictions that reach dashboards but never enter the CMMS
  • No skills transfer — maintenance teams who can use the system but can't improve it when it degrades
  • Change management treated as optional — experienced engineers who never trusted the system from day one

What the 18-Month Cliff Looks Like

The failure is not usually sudden. In the first six months, the programme shows promise. There are a few genuine catches — a motor that would have failed without warning, a pump that was pulled from service just in time. These successes are visible and generate enthusiasm. The vendor presents them at quarterly reviews. The project team points to ROI.

Somewhere between months eight and fourteen, the quality of the catches starts to decline. The easy wins have been had. The remaining asset failures are rarer and harder to predict. The false positive rate, which was tolerable at first, starts to feel oppressive to maintenance teams who are being asked to respond to alerts on assets that turn out to be fine. The model hasn't degraded — but it was never well-calibrated in the first place for the full asset population, and that's now becoming clear.
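
One mechanism that can contribute to this, offered here as an illustration and not as a claim from the figures above, is simple base-rate arithmetic: when genuine failures become rarer in the monitored population, a smaller share of alerts is real even if the model's sensitivity and specificity are unchanged. The sketch below applies Bayes' rule; the sensitivity, specificity, and prevalence values are invented for the example.

```python
def alert_precision(prevalence, sensitivity, specificity):
    """Share of alerts that correspond to a real impending failure (Bayes' rule)."""
    true_alerts = sensitivity * prevalence
    false_alerts = (1.0 - specificity) * (1.0 - prevalence)
    return true_alerts / (true_alerts + false_alerts)

# Same hypothetical model, two different failure rates in the monitored population.
early = alert_precision(prevalence=0.10, sensitivity=0.90, specificity=0.95)
later = alert_precision(prevalence=0.01, sensitivity=0.90, specificity=0.95)
print(f"early: {early:.2f}, later: {later:.2f}")  # prints "early: 0.67, later: 0.15"
```

With these illustrative numbers, two in three alerts are genuine when one asset in ten is heading for failure, but fewer than one in six when the rate is one in a hundred. From the maintenance team's side, that feels like a model getting worse.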

By month eighteen, the system is running but not being used. The dashboard is open in a corner of the control room. Engineers have learned which alert categories are reliable and which to ignore. The programme has entered what McKinsey calls "isolated pilot" status — present, but not scaled, and not delivering the enterprise-level returns that justified the investment.

What Separates Programmes That Scale

The plants achieving 73% reductions in equipment failures and 10:1 ROI ratios are not using fundamentally different technology from the plants that stalled. They are using the same sensors, the same ML platforms, the same statistical approaches. What they did differently sits almost entirely outside the software stack.

They began with a data audit — not an asset audit, a data audit. Before selecting pilot assets, they mapped which assets had sufficient failure history, sufficient sensor coverage, and sufficient data quality to produce a well-calibrated model. They started narrow, on assets where the model could be right frequently enough to earn credibility with maintenance teams.
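
The audit's output can be as simple as a shortlist. A minimal sketch, assuming each asset has already been scored on the three criteria; the field names and the threshold values are hypothetical and would need to be set per site.

```python
def pilot_candidates(assets, min_failures=5, min_coverage=0.8, min_quality=0.9):
    """Return names of assets passing all three audit thresholds.

    assets: list of dicts with keys
        "name", "verified_failures" (count),
        "sensor_coverage" and "data_quality" (fractions in [0, 1]).
    Results are ordered with the best-documented failure history first.
    """
    ready = [
        a for a in assets
        if a["verified_failures"] >= min_failures
        and a["sensor_coverage"] >= min_coverage
        and a["data_quality"] >= min_quality
    ]
    ready.sort(key=lambda a: a["verified_failures"], reverse=True)
    return [a["name"] for a in ready]
```

An asset has to pass all three tests, with no weighted average: strong sensor coverage cannot compensate for an absence of failure history.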

They built the CMMS integration before going live, not after. Alerts that cannot automatically generate or pre-populate a work order are alerts that will be ignored. Closing the loop between prediction and maintenance action is not a phase two activity — it is a prerequisite for the programme delivering any value at all.

And they invested in internal capability, not vendor dependency. The engineers responsible for operating the system were trained not just to use it, but to understand why it generates the alerts it does, how to assess confidence levels, and how to update models when performance degrades. Programmes with a single internal champion who leaves are programmes that fail.
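
Part of that capability is knowing, with numbers, which alert categories have earned trust. A small sketch, assuming technicians record whether each alert was confirmed on inspection; the category names and the feedback format are hypothetical.

```python
from collections import defaultdict

def precision_by_category(outcomes):
    """Compute the confirmed share of alerts for each alert category.

    outcomes: iterable of (category, confirmed) pairs, where confirmed is a
    bool recorded by the technician after inspecting the asset.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [confirmed, total]
    for category, confirmed in outcomes:
        counts[category][1] += 1
        if confirmed:
            counts[category][0] += 1
    return {category: hits / total for category, (hits, total) in counts.items()}
```

Tracked over time, a falling figure for one category is a concrete signal that its model needs attention, which replaces the informal habit of learning which alerts to ignore.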

"The 18-month cliff is not a technology problem. It is what happens when integration, change management, and data quality are treated as implementation details rather than strategic prerequisites."

The fundamental tension is that predictive maintenance is sold as a technology product and must be delivered as an organisational change programme. The technology is mature. The gap is almost always in the layers surrounding it — the data infrastructure, the process integration, the human adoption, and the internal skills to sustain it when the vendor's project team has moved on.

Pilots succeed. Scale fails. The difference between organisations that make the transition and those that don't is rarely about the sophistication of their AI. It is about whether they treated the 18 months after go-live as carefully as they treated the six months before it.