Troubleshooting Instrumentation & Control Systems
A field engineer's story, including checklists and diagnostics for startups and commissioning.
The Startup That Humbled Three Engineers
The feed flow controller was hunting. Not slightly — badly. The control valve was cycling hard. Operators were frustrated. Management wanted answers ten minutes ago.
Three engineers stood around that marshalling cabinet for four hours. Laptops open. P&IDs spread across a folding table. Everyone had a theory.
→ "It's drifting so let's swap the transmitter."
→ "The orifice plate might be backwards."
→ "I'll retune the PID now."
None of them were right.
A senior technician (quiet, 15 years in the field) walked over with a multimeter and measured the mA signal at the valve actuator terminals in the field junction box.
12.0 mA at the DCS output card. 11.4 mA at the valve terminals. 0.6 mA dropped across a single corroded terminal block connection. Enough to make the loop hunt. Enough to waste four hours. Enough to hold the unit at 40% throughput for half a shift.
One corroded terminal. Three engineers. Four hours.
That story is not unusual. I have seen it, or something very close, on more startups than I care to count. And that is exactly why troubleshooting is a skill, not just knowledge.
The cost is real. The global process industry loses approximately $20 billion per year to unscheduled downtime and off-spec production (ARC Advisory Group / Fieldbus Foundation, 2009). 80% of those losses are preventable. 40% trace back to the human in the loop — not the equipment.
The Mindset That Separates Good From Great
Before any tool or technique, you need the right mindset. Here is what I have observed over two decades in the field.
What most engineers do under pressure:
▸ Assumes the transmitter is faulty first
▸ Replaces parts before isolating the fault
▸ Works alone without talking to the operator
▸ Ignores documentation and loop history
▸ Speeds up under time pressure and makes errors
What the great troubleshooters do:
▸ Asks: is this a fault or a real process change?
▸ Talks to the operator before touching anything
▸ Reviews alarm history and historian trend first
▸ Works methodically, not reactively
▸ Slows down when everyone else is panicking
"The best troubleshooters I have worked with share one trait: they slow down when everyone else speeds up."
Field lesson, 12+ years Oil & Gas
The 5-Step Troubleshooting Framework
Most engineers know it. Few follow it completely, especially Step 1.
| Step | Action | Most Common Mistake |
|---|---|---|
| 01 | Verify the problem is real | Assuming the operator is always correct |
| 02 | Identify and locate the fault | Guessing without systematic isolation |
| 03 | Fix the problem | Repairing without a clear plan |
| 04 | Verify the repair | Closing the job without full confirmation |
| 05 | Follow up and document | Skipping documentation entirely |
Step 1 is the most skipped. I once watched two technicians spend three hours troubleshooting a level transmitter that was reading correctly. The actual problem: the operator had opened the wrong valve. The level was genuinely high. The instrument was fine. Verify the problem is real before you touch anything.
Pre-Troubleshooting Checklist: Before You Touch Anything
- ☑ Talk to the operator: what changed? When exactly?
- ☑ Review DCS alarm history from the past 24 hours
- ☑ Pull the historian trend: PV, output, and setpoint together
- ☑ Check for a recent MOC (Management of Change) on this loop
- ☑ Confirm loop documentation is current (P&ID, loop diagram)
- ☑ Understand what the loop is supposed to do under normal conditions
- ☑ Check loop mode: manual or auto?
- ☑ Confirm process conditions are normal; note anything different
- ☑ Inform the control room before any action that could cause an upset
- ☑ Confirm permit requirements: does this need a PTW?
Isolating the Fault: Three Methods
End-to-end: Start at one end, work toward the other. Transmitter → cable → marshalling → DCS input → controller → DCS output → cable → valve. Measure at each point. Find where the signal breaks.
Half-split: Split the loop in half. Check the midpoint. Signal correct? Fault is in the second half. Wrong? Fault is in the first half. Keep halving until isolated. Fastest on long cable runs.
History-based: What changed recently? New instrument? Cable modification? DCS update? Brownfield tie-in? Changes cause faults. Start where the change happened. Fastest method, if documentation is current.
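The halving method is a binary search over test points. Here is a minimal Python sketch of it, assuming a single fault, a good signal at the first point, and a bad signal at the last. The `LOOP` list and the `signal_ok` callback are illustrative stand-ins for your loop diagram and your meter reading.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical test points, in signal order, taken from the loop diagram.
LOOP = ["transmitter", "field cable", "marshalling", "DCS input",
        "controller", "DCS output", "output cable", "valve"]


def half_split(points: Sequence[str],
               signal_ok: Callable[[str], bool]) -> Tuple[str, str, List[str]]:
    """Return (last good point, first bad point, points actually measured).

    Assumes points[0] reads correctly, points[-1] reads wrong, and there is
    one fault, so everything downstream of it also reads wrong.
    """
    lo, hi = 0, len(points) - 1          # lo: known good, hi: known bad
    measured: List[str] = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        measured.append(points[mid])     # one trip out with the meter
        if signal_ok(points[mid]):
            lo = mid                     # fault is in the second half
        else:
            hi = mid                     # fault is in the first half
    return points[lo], points[hi], measured


# Simulated fault between the DCS output and the output cable.
good, bad, measured = half_split(
    LOOP, lambda p: LOOP.index(p) <= LOOP.index("DCS output"))
print(f"Fault lies between '{good}' and '{bad}' after {len(measured)} checks")
```

Eight test points are isolated in three measurements instead of up to six when walking the loop from one end.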
The 4-20 mA Loop: Where Most Faults Hide
The 4-20 mA signal is the backbone of process instrumentation. It fails in predictable, diagnosable ways. Know this table and you will find most loop faults faster than with any other method.
| mA Reading | Condition / Meaning | First Check |
|---|---|---|
| 0 mA | Open circuit: broken wire or blown fuse | Cable continuity, fuse condition |
| 0–3.6 mA | Transmitter failure signal (NAMUR NE43 low: ≤ 3.6 mA) or wiring fault | Transmitter diagnostics, supply voltage, cable breaks, terminal tightness |
| 3.6–3.8 mA | Below the NE43 measuring range, not yet a defined failure signal | Treat as suspect: check transmitter diagnostics and output trim |
| 3.8–4.0 mA | Normal Under-Range | Confirm process is actually at minimum |
| 4.0–20.0 mA | Normal Operation | If reading seems wrong: check process |
| 20.0–20.5 mA | Normal Over-Range | Confirm process is at or above maximum |
| 20.5–21.0 mA | Above the NE43 measuring range, not yet a defined failure signal | Treat as suspect: check transmitter diagnostics and output trim |
| 21.0–22.0 mA | Transmitter Failure (NAMUR NE43 high: ≥ 21.0 mA) | Check transmitter diagnostics |
| > 22 mA | Short circuit / wiring fault | Inspect cable for damage or short |
NAMUR NE43: A reading of 3.6 mA or below is not a process at 0%. It is the transmitter signalling internal failure. NE43 reserves 3.8–20.5 mA for measurement and treats ≤ 3.6 mA or ≥ 21.0 mA as failure information. Configure your DCS to recognise NAMUR NE43 status levels and get early warning before a fault causes a process upset or SIS trip. Most plants do not configure this correctly.
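A rough Python sketch of NE43-style signal classification. The default limits are the NE43 values as I understand them (measurement 3.8 to 20.5 mA, failure at or below 3.6 mA and at or above 21.0 mA). Sites often configure their own bands, so the limits are parameters, and the function name and return strings are illustrative.

```python
def classify_ma(ma: float,
                fail_low: float = 3.6, meas_low: float = 3.8,
                meas_high: float = 20.5, fail_high: float = 21.0) -> str:
    """Classify a loop current reading against NE43-style bands.

    Defaults are assumed NE43 limits; replace them with the limits
    actually configured on your DCS input cards.
    """
    if ma <= fail_low:
        return "failure-low or wiring fault"
    if ma < meas_low:
        return "suspect: below measuring range"
    if ma < 4.0:
        return "under-range"
    if ma <= 20.0:
        return "normal"
    if ma <= meas_high:
        return "over-range"
    if ma < fail_high:
        return "suspect: above measuring range"
    return "failure-high or short circuit"


for reading in (0.0, 3.5, 3.9, 12.0, 20.3, 21.5):
    print(f"{reading:5.1f} mA -> {classify_ma(reading)}")
```

The point of the sketch is that a failure band needs its own status. Without one, the DCS clamps a low-fail current to 0% and shows it as a valid measurement.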
During a brownfield tie-in at a gas processing plant, a new mass flow transmitter was installed on an existing hydrocarbon line. After hookup, the DCS read zero even at full process flow.
Everyone assumed a wiring fault. We spent an hour tracing cables. Nothing wrong with the wiring.
I asked one question: "Did anyone check the transmitter configuration?"
The new HART transmitter had been factory-configured in imperial units, cubic feet per hour. The DCS engineering unit was cubic metres per hour. The configured ranges did not overlap at any normal flow rate.
History-based diagnosis: something was configured during installation. Total time to find and fix: 20 minutes. One question saved two more hours of cable tracing.
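A unit mismatch like this one can be caught before the loop is powered by comparing the transmitter's upper range value with the DCS scaling in a common unit. The sketch below uses made-up range values, since the original ranges were not recorded here. The only hard fact in it is the conversion factor (1 ft = 0.3048 m exactly).

```python
FT3_TO_M3 = 0.3048 ** 3          # cubic feet to cubic metres

# Conversion factors to m3/h for the units this sketch knows about.
TO_M3_PER_H = {"m3/h": 1.0, "ft3/h": FT3_TO_M3}


def to_m3_per_h(value: float, unit: str) -> float:
    """Convert a volumetric flow value to m3/h."""
    return value * TO_M3_PER_H[unit]


def range_mismatch(tx_urv: float, tx_unit: str,
                   dcs_urv: float, dcs_unit: str,
                   tol: float = 0.01) -> bool:
    """True if transmitter and DCS upper range values differ by more than tol (fractional)."""
    tx = to_m3_per_h(tx_urv, tx_unit)
    dcs = to_m3_per_h(dcs_urv, dcs_unit)
    return abs(tx - dcs) / dcs > tol


# Hypothetical values: same number typed on both sides, different units.
print(to_m3_per_h(1000.0, "ft3/h"))                      # about 28.3 m3/h
print(range_mismatch(1000.0, "ft3/h", 1000.0, "m3/h"))   # True
```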
Common Loop Faults by Type
Pressure Loops
▸ Plugged impulse lines: especially with viscous or dirty process fluids
▸ Condensate in gas impulse lines: seasonal, worse in winter
▸ DP transmitter high/low legs swapped: reads backwards at full flow
▸ Equalising valve left closed after maintenance
▸ Diaphragm seal fill fluid degradation in high-temperature service
Temperature Loops
▸ Wrong thermocouple type configured in DCS vs. actual installed type
▸ Plain copper cable used instead of the correct thermocouple extension or compensating cable
▸ Poor insertion depth: measuring pipe wall temperature, not process
▸ RTD installed as 2-wire, configured as 3-wire: produces systematic error
▸ Shield connected at both ends: ground loop induces 50 Hz noise
Flow Loops
▸ Impulse lines partially plugged: sluggish response to process changes
▸ Square root extraction applied twice: once in transmitter AND once in DCS (very common on DeltaV systems during commissioning)
▸ Orifice plate installed backwards: check direction arrow on plate body
▸ Vortex meter reading zero: flow is below minimum measurable velocity
▸ Gas entrainment in liquid flow meters: causes erratic, spiky readings
▸ Wrong DP range configured for actual orifice plate bore size
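The double square root fault from the flow list is easy to miss because 0% and 100% still read correctly. A short worked example in Python, assuming an ideal DP flow element where flow is proportional to the square root of differential pressure:

```python
import math


def flow_pct(dp_pct: float, extractions: int = 1) -> float:
    """Flow in % of span from DP in % of span.

    extractions=1 is the correct case; extractions=2 models square root
    applied in both the transmitter and the DCS.
    """
    x = dp_pct / 100.0
    for _ in range(extractions):
        x = math.sqrt(x)
    return 100.0 * x


for dp in (0.0, 25.0, 50.0, 100.0):
    print(f"DP {dp:5.1f}%  correct {flow_pct(dp):6.2f}%  "
          f"double sqrt {flow_pct(dp, 2):6.2f}%")
```

At 25% DP the true flow is 50%, but the doubly extracted value is about 70.7%. A two-point check at zero and full scale passes, so always check a mid-range point.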
Control Valve Troubleshooting
Control valves cause more production losses than any other single component in a process plant. And most valve problems are diagnosed incorrectly because engineers jump to positioner replacement without measuring the signal chain first.
| Symptom | Likely Cause | Check First |
|---|---|---|
| No response to signal change | Solenoid failure or instrument air lost | I/P input signal, air supply pressure |
| Valve hunting / oscillating | Positioner gain too high | Positioner tuning parameters |
| Stuck at fixed position | Positioner feedback linkage broken | Physical linkage between positioner and valve stem |
| Drives fully open or closed | I/P converter failure | Measure I/P output pressure (3–15 psi?) |
| Slow to respond | Restriction in air tubing or filter | Air filter, restrictor, volume booster |
| Leaking at closed position | Seat damage or wrong shutoff class | Packing, seat inspection |
| Oscillating near low % travel | Stiction in packing or positioner | Friction test, packing adjustment |
The I/P converter turns the DCS mA output into pneumatic pressure to drive the valve actuator. It fails silently. Always verify three points: (1) Is the DCS output card sending the correct mA? (2) Is the I/P outputting the correct pressure? (3) Is the valve stem actually moving? The fault is somewhere between these three points.
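The three-point check can be written down as a small decision function. This is a sketch under stated assumptions: a 3–15 psi I/P, a direct-acting linear valve with no split range, and tolerances that are placeholders rather than recommended values.

```python
def expected_ip_pressure_psi(ma: float, lo_psi: float = 3.0,
                             hi_psi: float = 15.0) -> float:
    """Ideal I/P output for a 4-20 mA input (assumes a 3-15 psi span)."""
    return lo_psi + (ma - 4.0) / 16.0 * (hi_psi - lo_psi)


def locate_valve_fault(cmd_pct: float, measured_ma: float,
                       measured_psi: float, travel_pct: float,
                       tol_ma: float = 0.1, tol_psi: float = 0.3,
                       tol_travel: float = 2.0) -> str:
    """Walk the three measurement points in order and name the first bad segment."""
    expected_ma = 4.0 + 16.0 * cmd_pct / 100.0
    if abs(measured_ma - expected_ma) > tol_ma:
        return "DCS output card or wiring to the I/P"
    if abs(measured_psi - expected_ip_pressure_psi(measured_ma)) > tol_psi:
        return "I/P converter or air supply"
    if abs(travel_pct - cmd_pct) > tol_travel:   # assumes direct-acting, linear
        return "actuator, positioner or valve"
    return "signal chain OK"


# 50% command, but only 11.4 mA arrives at the valve end.
print(locate_valve_fault(50.0, 11.4, 8.5, 46.0))
```

With a 50% command the expected values are 12.0 mA and 9.0 psi, so a reading of 11.4 mA at the valve end points at the wiring before anyone touches the positioner.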
DCS Troubleshooting: Emerson DeltaV
DeltaV has powerful built-in diagnostics. Most engineers use 20% of what is available. Here is where to look and why.
| DeltaV Tool | What It Shows | When to Use It |
|---|---|---|
| Alarm Banner / Alarm Summary | Active alarms, timing, priority | First action: what alarmed and when? |
| DeltaV Operate Trends | PV, output, SP over time | Visualise exactly when the fault started |
| Event Chronicle | Every controller action, alarm, operator action with timestamp | Root cause analysis; sequence of events |
| DeltaV Diagnostics | Controller health, network status, module health | System-level faults |
| AMS Device Manager | HART device diagnostics, calibration history, NAMUR NE107 status | Field device faults: always check first |
| DeltaV Inspect | Physical signal at I/O channel level | Verify what is physically coming into the card |
| Download History (DeltaV Explorer) | Configuration change log | "Did something change recently?" |
DeltaV Troubleshooting Checklist
- ☑ Check controller health in DeltaV Diagnostics
- ☑ Verify I/O card channel status: healthy or faulted?
- ☑ Check network node status in DeltaV Explorer
- ☑ Review Event Chronicle from 1 hour before the fault appeared
- ☑ Open AMS Device Manager: check NAMUR NE107 diagnostic alerts
- ☑ Review download history: was a configuration change pushed recently?
- ☑ Check power supply health for I/O subsystem
- ☑ Verify engineering units in AI block match transmitter configuration
- ☑ Confirm scaling: 4 mA = 0% and 20 mA = 100% correct for this range?
- ☑ Check for square root extraction applied in both transmitter AND DCS
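The scaling items in this checklist come down to one linear formula. A minimal Python sketch of it, showing what the DCS displays when the transmitter range and the AI block scaling disagree. The 0–250 and 0–200 m3/h ranges are hypothetical.

```python
def eu_to_ma(value: float, eu_lo: float, eu_hi: float) -> float:
    """Current a linear transmitter sends for a process value within its range."""
    return 4.0 + (value - eu_lo) / (eu_hi - eu_lo) * 16.0


def ma_to_eu(ma: float, eu_lo: float, eu_hi: float) -> float:
    """Engineering value a linear analog input shows for a given loop current."""
    return eu_lo + (ma - 4.0) / 16.0 * (eu_hi - eu_lo)


true_flow = 100.0                                  # m3/h in the pipe
ma = eu_to_ma(true_flow, 0.0, 250.0)               # transmitter ranged 0-250
shown = ma_to_eu(ma, 0.0, 200.0)                   # DCS scaled 0-200
print(f"{ma:.1f} mA on the wire, DCS shows {shown:.1f} m3/h")
```

The wiring is healthy and the mA value is correct at every test point, yet the DCS reads 80 m3/h for a true 100 m3/h. Only a side-by-side range comparison finds it.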
A flow controller on a DeltaV system started reading 2 mA higher than the HART primary variable. The control module appeared fine. No active alarms.
AMS Device Manager showed a single flag: Simulation Mode Active on the transmitter.
Someone had enabled simulation mode during commissioning to test the loop and forgot to disable it. The transmitter was outputting a fixed simulated value while the real process had moved. The DCS was controlling on simulated data.
Event Chronicle showed a technician had connected a HART communicator to that device three days earlier.
Lesson: AMS Device Manager found it in five minutes. Without AMS, this fault could have run undetected for weeks with incorrect flow measurement feeding a critical control loop. Always check the simulation mode flag after any HART communicator activity.
SIS Troubleshooting: What's Different
Safety Instrumented System troubleshooting follows the same logical steps but with higher stakes and strict procedural controls. The key difference: every action carries safety implications.
Never bypass an SIS loop without formal bypass authorisation and a documented risk assessment. A bypassed SIS loop means the safety barrier is removed. One undetected process exceedance during that bypass can be catastrophic.
A platformer unit tripped on SIS during stable normal operations. No high temperature. No process exceedance. No equipment failure visible on any trend.
SOE log showed the trip initiator: low-low pressure switch PSLL-PLT-201. But the historian trend showed process pressure was completely normal at the time of trip.
Physical inspection found it. The impulse line block valve for PSLL-PLT-201 was partially closed. During a minor process fluctuation, the impulse line could not equalise pressure fast enough. The switch saw a momentary dip that did not exist in the main process line.
Root cause: a maintenance technician had partially closed that block valve two days earlier while checking for leaks and had not fully reopened it.
Lesson: After any maintenance activity near an SIS instrument, verify that all impulse line block valves are fully open and confirmed. Add this to your post-maintenance checklist. The SOE found the initiator in seconds. The block valve inspection confirmed the root cause in minutes.
SIS Troubleshooting Checklist
- ☑ Obtain bypass authorisation before isolating any SIS loop
- ☑ Review SOE log for the exact trip sequence and initiator
- ☑ Confirm which sensor(s) actuated; note voting (1oo2, 2oo3)
- ☑ Check historian trend for process condition at time of trip
- ☑ Inspect all impulse line block valves: fully open?
- ☑ Inspect solenoid valves: energised state correct for failsafe direction?
- ☑ Check last proof test date and results for the initiating device
- ☑ Update proof test record if a fault is confirmed
- ☑ Complete formal bypass close-out before removing the bypass
- ☑ Notify safety authority if SIS architecture integrity was affected
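The voting note in the checklist matters because the architecture decides whether one bad sensor can trip the unit. A small Python sketch of M-out-of-N voting, for illustration only. It says nothing about how the platformer's SIS in the case above was actually voted.

```python
from typing import Sequence


def vote(trip_demands: Sequence[bool], m: int) -> bool:
    """MooN voting: trip when at least m of the N sensors demand a trip."""
    return sum(bool(d) for d in trip_demands) >= m


# One sensor sees a spurious low-low pressure; the others read normal.
one_spurious_of_two = [True, False]
one_spurious_of_three = [True, False, False]

print("1oo2 trips:", vote(one_spurious_of_two, 1))     # single sensor is enough
print("2oo3 trips:", vote(one_spurious_of_three, 2))   # needs a second sensor
```

When reading the SOE log, the number of sensors that actuated tells you whether to suspect the process (several agree) or one instrument and its impulse line (one disagrees).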
NAMUR NE107: Let Your Instruments Speak
NAMUR NE107 gives every smart field device a built-in health reporting system. If your DCS is configured to read and display this status, you catch failures before they become trips, not after.
| NE107 Status | Meaning | HMI Colour | Required Action |
|---|---|---|---|
| Device OK | Normal operation, output valid | 🟢 ⚪ Green / Grey | None |
| Maintenance Required | Degraded but output still valid | 🔵 Blue | Plan maintenance: do not ignore |
| Out of Specification | Process or environment outside design limits | 🟡 Yellow | Investigate cause promptly |
| Function Check | Simulation or calibration active, output temporary | 🟠 Orange | Verify it is intentional; disable when done |
| Failure | Non-valid output signal | 🔴 Red | Immediate action: bypass or replace |
In DeltaV with AMS Device Manager, NE107 statuses propagate to the HMI automatically if configured correctly. Most plants find out about device failures after a trip rather than before. Configure NE107. Use it actively. It is the difference between predictive maintenance and reactive firefighting.
Test Equipment: What You Actually Need
You do not need every tool in the catalogue. You need the right tools and you need to know how to use them in the field under pressure.
| Tool | Primary Use | Field Tip |
|---|---|---|
| HART Handheld Communicator | Device configuration, diagnostics, calibration | Always check Simulation Mode flag after use |
| True RMS Multimeter | Voltage, resistance, continuity | Essential: carry always, no exceptions |
| mA Clamp Meter | Measure loop current without breaking circuit | Underused. No loop interruption. No alarms. Carry one always. |
| Process Calibrator | Simulate transmitter signals, verify valve response | Verify mA output and valve position simultaneously |
| Pressure Calibrator | Pressure loop calibration, SIS proof testing | Calibrate at process temperature when possible |
| Deadweight Tester | High-accuracy pressure reference | Critical loop calibration: do not use a gauge as reference |
| ProfiTrace Analyzer | Profibus PA/DP network diagnostics | When fieldbus devices drop off: check segment voltage first |
| Thermal Camera (FLIR) | Electrical hot spots, panel faults | Scan power supply and MCC panels during startup |
| Digital Storage Oscilloscope | Signal noise and interference investigation | When a loop is "noisy": DCS historian hides 50Hz noise |
A temperature controller on a fired heater had been in manual for weeks. Operators reported the PV was "noisy": ±2 °C oscillation. Three engineers had looked at it. All blamed PID tuning.
One engineer brought a digital storage oscilloscope. Connected it across the thermocouple terminals in the local junction box.
The signal was oscillating at exactly 50 Hz. Power line frequency interference.
The thermocouple extension cable was running parallel to a 440 V motor power cable for 15 metres in the same cable tray. No separation. The cable shield was connected at both ends, creating a classic ground loop that picked up power line interference.
Lesson: Electrical noise is invisible on a DCS historian trend because the historian scans at 1-second intervals and averages the noise out. You need an oscilloscope or millivolt meter at the source. Fix: re-route cable to a separate tray. Ground shield at DCS end only. Signal clean in 20 minutes.
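Why a 1-second scan cannot show 50 Hz noise: every sample lands on the same point of the waveform. The Python sketch below assumes exactly 50 Hz interference and instantaneous snapshots; a real historian may also filter or average, which hides the noise further. All signal values are invented for the illustration.

```python
import math
from typing import List


def sample(freq_hz: float, amplitude: float, offset: float, phase: float,
           period_s: float, n: int) -> List[float]:
    """Take n snapshots of offset + amplitude*sin(2*pi*f*t + phase), one every period_s."""
    return [offset + amplitude * math.sin(2 * math.pi * freq_hz * k * period_s + phase)
            for k in range(n)]


def peak_to_peak(values: List[float]) -> float:
    return max(values) - min(values)


# 350 degC signal with +/-2 degC of 50 Hz pickup (illustrative numbers).
historian = sample(50.0, 2.0, 350.0, 0.7, 1.0, 60)      # 1 s scan, one minute
scope = sample(50.0, 2.0, 350.0, 0.7, 0.001, 60)        # 1 ms sampling, 60 ms

print(f"historian peak-to-peak: {peak_to_peak(historian):.6f} degC")
print(f"scope peak-to-peak:     {peak_to_peak(scope):.2f} degC")
```

The 1-second samples form a flat line with a constant offset, while the 1 ms samples show nearly the full 4 °C swing. That is the case for measuring at the source with an oscilloscope.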
OT Cybersecurity: A Necessary Brief
Modern DCS systems are networked. That brings capability and risk. I have seen troubleshooting hours wasted because the real fault was a cybersecurity-related issue that nobody considered.
☑ Windows patch deployment pushed automatically to a DCS engineering station mid-shift: took it offline
☑ Antivirus scan running during peak operations: consumed CPU and froze DCS operator graphics
☑ Unauthorised network switch added by a contractor: caused a broadcast storm on the DCS control network
☑ NTP server failure: SOE timestamps drifted by hours, making post-trip analysis unreliable
☑ USB device inserted by a contractor: introduced malware that corrupted historian archive data
- ☑ Is the DCS network congested? Check managed switch port statistics
- ☑ Are SOE timestamps consistent? Verify NTP synchronisation status
- ☑ Was any external device connected recently? Check change logs
- ☑ Is antivirus scheduled to scan during plant operations? It should not be
- ☑ Are all active remote access sessions controlled, logged, and authorised?
IEC 62443 and your site's OT security policy govern these controls. If they don't exist, raise it with your cybersecurity authority now, not after an incident.
Common Mistakes: What Engineers Miss
I have made some of these. I have seen all of them — on projects across Oil & Gas, mining, and refining.
- ✕ Replacing the transmitter first: without verifying it is actually faulty. Most transmitter replacements on plant are unnecessary.
- ✕ Not talking to the operator: missing the one piece of context that would have found the fault in 10 minutes.
- ✕ Ignoring alarm history: the answer is often already in the DCS logs, waiting to be read.
- ✕ Working from outdated documentation: old P&IDs that do not reflect as-built reality in brownfield projects.
- ✕ Fixing the symptom, not the cause: loop works after a reboot. Fault returns next week. Root cause never addressed.
- ✕ No post-repair verification: closing the job without confirming alarms, interlocks, and control are all back to normal.
- ✕ Skipping documentation update: the next engineer faces the same fault with no history to guide them.
- ✕ Not informing the control room: causing a process upset while troubleshooting the original fault.
Safety During Troubleshooting: Non-Negotiable
Troubleshooting under time pressure is the most dangerous moment in plant operations. Pressure causes shortcuts. Shortcuts cause accidents. Slow down. Think. Then act.
☑ Obtain a Permit to Work (PTW) before working on live equipment in classified hazardous areas
☑ Never bypass SIS loops without formal authorisation and documented risk assessment
☑ Apply LOTO before working inside electrical panels or MCC enclosures
☑ Gas-test before opening field junction boxes in potentially explosive atmospheres
☑ Never assume a loop is de-energised: test before you touch
☑ Treat instrument air with respect: high-pressure systems cause serious injuries
☑ Inform the control room of every action that could affect the process
☑ Any modification to a live SIS system requires a formal MOC: no exceptions
Lessons From 12+ Years in the Field
If I could give a junior engineer one page to carry, this would be it.
"Troubleshooting is not about knowing everything. It is about thinking clearly when others are panicking."
Zohaib Jahan | Senior Automation Engineer
What is the most difficult fault you have ever troubleshot? Was it intermittent? Was the root cause something nobody expected? Did it teach you something that stayed with you? Share it: this profession gets better when we share field knowledge, not just theory.