The Startup That Humbled Three Engineers

Real Refinery Plant Startup · Day 3

The feed flow controller was hunting. Not slightly — badly. The control valve was cycling hard. Operators were frustrated. Management wanted answers ten minutes ago.

Three engineers stood around that marshalling cabinet for four hours. Laptops open. P&IDs spread across a folding table. Everyone had a theory.

→ "It's drifting so let's swap the transmitter."

→ "The orifice plate might be backwards."

→ "I'll retune the PID now."

None of them were right.

A senior technician, quiet, with 15 years in the field, walked over with a multimeter. He measured the mA signal at the valve actuator terminals in the field junction box.

12.0 mA at the DCS output card
11.4 mA at the valve terminals

0.6 mA dropped across a single corroded terminal block connection. Enough to make the loop hunt. Enough to waste four hours. Enough to hold the unit at 40% throughput for half a shift.

One corroded terminal. Three engineers. Four hours.

That story is not unusual. I have seen it, or something very close, on more startups than I care to count. And that is exactly why troubleshooting is a skill, not just knowledge.

The cost is real. The global process industry loses approximately $20 billion per year to unscheduled downtime and off-spec production (ARC Advisory Group / Fieldbus Foundation, 2009). 80% of those losses are preventable. 40% trace back to the human in the loop — not the equipment.


The Mindset That Separates Good From Great

Before any tool or technique, you need the right mindset. Here is what I have observed over two decades in the field.

❌ Junior Engineer

    ▸ Assumes the transmitter is faulty first
    ▸ Replaces parts before isolating the fault
    ▸ Works alone without talking to the operator
    ▸ Ignores documentation and loop history
    ▸ Speeds up under time pressure and makes errors

✓ Experienced Engineer

    ▸ Asks: is this a fault or a real process change?
    ▸ Talks to the operator before touching anything
    ▸ Reviews alarm history and historian trend first
    ▸ Works methodically, not reactively
    ▸ Slows down when everyone else is panicking

"The best troubleshooters I have worked with share one trait: they slow down when everyone else speeds up."

Field lesson, 12+ years Oil & Gas

The 5-Step Troubleshooting Framework

Most engineers know it. Few follow it completely, especially Step 1.

Step | Action | Most Common Mistake
01 | Verify the problem is real | Assuming the operator is always correct
02 | Identify and locate the fault | Guessing without systematic isolation
03 | Fix the problem | Repairing without a clear plan
04 | Verify the repair | Closing the job without full confirmation
05 | Follow up and document | Skipping documentation entirely

⚠️

Step 1 is the most skipped. I once watched two technicians spend three hours troubleshooting a level transmitter that was reading correctly. The actual problem: the operator had opened the wrong valve. The level was genuinely high. The instrument was fine. Verify the problem is real before you touch anything.

Pre-Troubleshooting Checklist: Before You Touch Anything

  • Talk to the operator: what changed? When exactly?
  • Review DCS alarm history from the past 24 hours
  • Pull the historian trend: PV, output, and setpoint together
  • Check for a recent MOC (Management of Change) on this loop
  • Confirm loop documentation is current (P&ID, loop diagram)
  • Understand what the loop is supposed to do under normal conditions
  • Check loop mode: manual or auto?
  • Confirm process conditions are normal; note anything different
  • Inform the control room before any action that could cause an upset
  • Confirm permit requirements: does this need a Permit to Work (PTW)?

Isolating the Fault: Three Methods

Method 01
Input / Output Series

Start at one end, work toward the other.
Transmitter → cable → marshalling → DCS input → controller → DCS output → cable → valve.
Measure at each point. Find where the signal breaks.

Method 02
Divide & Conquer

Split the loop in half and check the midpoint.
Signal correct? The fault is in the second half. Signal wrong? The fault is in the first half.
Keep halving until isolated. Fastest on long cable runs.

Method 03
History-Based

What changed recently? New instrument? Cable modification?
DCS update? Brownfield tie-in? Changes cause faults.
Start where the change happened. Fastest method, if documentation is current.
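
Method 02 is essentially a binary search over the loop's test points. The following is a minimal sketch, assuming a single fault so that every point upstream of it reads correctly and every point downstream reads wrong. The test point names and the `measure` callback are hypothetical stand-ins for a technician taking a reading at each location.

```python
from typing import Callable, Sequence


def isolate_fault(points: Sequence[str], signal_ok: Callable[[str], bool]) -> tuple[str, str]:
    """Return the adjacent pair (last good point, first bad point).

    Assumes the first point reads correctly, the last point reads wrong,
    and there is a single fault somewhere between them.
    """
    lo, hi = 0, len(points) - 1          # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if signal_ok(points[mid]):
            lo = mid                     # fault is in the second half
        else:
            hi = mid                     # fault is in the first half
    return points[lo], points[hi]


# Hypothetical loop, ordered from transmitter to valve.
LOOP = ["transmitter", "field JB", "marshalling", "DCS input",
        "DCS output", "output marshalling", "valve JB", "valve"]

# Pretend the signal is good up to and including the DCS output card.
good = set(LOOP[:5])
measurements = []


def measure(point: str) -> bool:
    measurements.append(point)           # record each trip to a test point
    return point in good


fault_between = isolate_fault(LOOP, measure)
print(fault_between, "found in", len(measurements), "measurements")
```

With eight test points, halving needs three measurements where walking the loop end to end could need up to six. The saving grows with the number of junction boxes and terminations in the run.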


The 4-20mA Loop: Where Most Faults Hide

The 4-20mA signal is the backbone of process instrumentation. It fails in predictable, diagnosable ways. Know this table and you will find most loop faults faster than any other method.

mA Reading | Condition / Meaning | First Check
0 mA | Open circuit: broken wire or blown fuse | Cable continuity, fuse condition
0–3.6 mA | Wiring fault or transmitter failure signal (NAMUR NE43 low: 3.6 mA or below) | Supply voltage, cable breaks, terminal tightness, transmitter diagnostics
3.6–3.8 mA | Below the valid NE43 measuring range: suspect transmitter | Replace or bench-test transmitter
3.8–4.0 mA | Normal Under-Range | Confirm process is actually at minimum
4.0–20.0 mA | Normal Operation | If reading seems wrong: check process
20.0–20.5 mA | Normal Over-Range | Confirm process is at or above maximum
20.5–21.0 mA | Above the valid NE43 measuring range: suspect transmitter | Check transmitter diagnostics
21.0–22.0 mA | Transmitter Failure (NAMUR NE43 high: 21.0 mA or above) | Check transmitter diagnostics
> 22 mA | Short circuit / wiring fault | Inspect cable for damage or short

💡

NAMUR NE43: A reading below 3.8 mA is not a process at 0%. It is outside the valid measuring range, and NE43 devices signal internal failure at 3.6 mA or below. Configure your DCS to recognise NAMUR NE43 status levels and get early warning before a fault causes a process upset or SIS trip. Most plants do not configure this correctly.
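
As a rough illustration, the bands can be written as a small classifier. This is a sketch only, assuming the NE43 limits of 3.8 to 20.5 mA for valid measurement and failure signalling at or below 3.6 mA or at or above 21.0 mA. The function names and the 0.05 mA cut-off for "zero" are illustrative choices, not taken from any DCS.

```python
def classify_ma(ma: float) -> str:
    """Classify a loop current reading using NAMUR NE43-style bands."""
    if ma <= 0.05:                       # assumed tolerance for "0 mA"
        return "open circuit"
    if ma <= 3.6:
        return "failure low or wiring fault"
    if ma < 3.8:
        return "below measuring range, suspect transmitter"
    if ma < 4.0:
        return "normal under-range"
    if ma <= 20.0:
        return "normal operation"
    if ma <= 20.5:
        return "normal over-range"
    if ma < 21.0:
        return "above measuring range, suspect transmitter"
    if ma <= 22.0:
        return "failure high"
    return "short circuit or wiring fault"


def ma_to_percent(ma: float) -> float:
    """Linear 4-20 mA scaling: 4 mA = 0 %, 20 mA = 100 %."""
    return (ma - 4.0) / 16.0 * 100.0


for reading in (0.0, 3.5, 3.9, 12.0, 20.3, 21.5, 23.0):
    print(f"{reading:5.1f} mA  {ma_to_percent(reading):7.2f} %  {classify_ma(reading)}")
```

The same linear scaling puts the opening story in perspective: 0.6 mA on a 16 mA span is 3.75% of valve travel.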

Field Story #1
The Brownfield Trap

During a brownfield tie-in at a gas processing plant, a new mass flow transmitter was installed on an existing hydrocarbon line. After hookup, the DCS read zero even at full process flow.

Everyone assumed a wiring fault. We spent an hour tracing cables. Nothing wrong with the wiring.

I asked one question: "Did anyone check the transmitter configuration?"

The new HART transmitter had been factory-configured in imperial units (cubic feet per hour). The DCS engineering unit was cubic metres per hour. The configured ranges did not overlap at any normal flow rate.

History-based diagnosis: something was configured during installation. Total time to find and fix: 20 minutes. One question saved two more hours of cable tracing.


Common Loop Faults by Type

Pressure Loops


    ▸ Plugged impulse lines: especially with viscous or dirty process fluids
    ▸ Condensate in gas impulse lines: seasonal, worse in winter
    ▸ DP transmitter high/low legs swapped: reads backwards at full flow
    ▸ Equalising valve left closed after maintenance
    ▸ Diaphragm seal fill fluid degradation in high-temperature service

Temperature Loops


    ▸ Wrong thermocouple type configured in DCS vs. actual installed type
    ▸ Plain copper cable used instead of proper thermocouple extension or compensating cable
    ▸ Poor insertion depth: measuring pipe wall temperature, not process
    ▸ RTD installed as 2-wire but configured as 3-wire: produces systematic error
    ▸ Shield connected at both ends: ground loop induces 50 Hz noise
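
The RTD item can be put into numbers. When lead resistance is not compensated, it adds directly to the sensor resistance and the reading goes high. The sketch below is a rough estimate only, assuming a Pt100 with about 0.385 Ω/°C sensitivity and copper leads at roughly 0.0172 Ω·mm²/m. The 100 m run and 1.5 mm² conductor are hypothetical values.

```python
COPPER_RESISTIVITY = 0.0172   # ohm * mm^2 / m at about 20 degC (assumed)
PT100_SENSITIVITY = 0.385     # ohm per degC near 0 degC for a standard Pt100


def two_wire_rtd_error_c(run_length_m: float, conductor_mm2: float) -> float:
    """Approximate reading error in degC caused by uncompensated lead resistance."""
    # Current flows out and back, so both conductors add to the measured resistance.
    lead_resistance = 2 * run_length_m * COPPER_RESISTIVITY / conductor_mm2
    return lead_resistance / PT100_SENSITIVITY


error = two_wire_rtd_error_c(run_length_m=100, conductor_mm2=1.5)
print(f"Approximate error: +{error:.1f} degC")
```

Under these assumptions the reading is high by roughly 6 °C, and the offset changes with ambient temperature because copper resistance does.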

Flow Loops


    ▸ Impulse lines partially plugged: sluggish response to process changes
    ▸ Square root extraction applied twice: once in transmitter AND once in DCS (very common on DeltaV systems during commissioning)
    ▸ Orifice plate installed backwards: check direction arrow on plate body
    ▸ Vortex meter reading zero: flow is below minimum measurable velocity
    ▸ Gas entrainment in liquid flow meters: causes erratic, spiky readings
    ▸ Wrong DP range configured for actual orifice plate bore size
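
The double square root fault has a recognisable signature, which a few lines of arithmetic make visible. This is a minimal sketch assuming an ideal orifice where DP is proportional to flow squared, with everything expressed in percent of range. The function name is illustrative.

```python
import math


def flow_percent_from_dp(dp_percent: float) -> float:
    """Square root extraction: flow is proportional to sqrt(DP)."""
    return math.sqrt(dp_percent / 100.0) * 100.0


for true_flow in (25.0, 50.0, 75.0, 100.0):
    dp = (true_flow / 100.0) ** 2 * 100.0     # DP the orifice produces
    once = flow_percent_from_dp(dp)           # correct: extracted once
    twice = flow_percent_from_dp(once)        # fault: extracted again in the DCS
    print(f"true {true_flow:5.1f} %  once {once:5.1f} %  twice {twice:5.1f} %")
```

The doubly extracted value matches at 0% and 100% but reads high everywhere in between, so a two-point check at the range ends will not catch it. Check at least one mid-range point.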

Control Valve Troubleshooting

Control valves cause more production losses than any other single component in a process plant. And most valve problems are diagnosed incorrectly because engineers jump to positioner replacement without measuring the signal chain first.

Symptom | Likely Cause | Check First
No response to signal change | Solenoid failure or instrument air lost | I/P input signal, air supply pressure
Valve hunting / oscillating | Positioner gain too high | Positioner tuning parameters
Stuck at fixed position | Positioner feedback linkage broken | Physical linkage between positioner and valve stem
Drives fully open or closed | I/P converter failure | Measure I/P output pressure (3–15 psi?)
Slow to respond | Restriction in air tubing or filter | Air filter, restrictor, volume booster
Leaking at closed position | Seat damage or wrong shutoff class | Packing, seat inspection
Oscillating near low % travel | Stiction in packing or positioner | Friction test, packing adjustment

The I/P Converter: Three Points of Measurement

The I/P converter turns the DCS mA output into pneumatic pressure to drive the valve actuator. It fails silently. Always verify three points: (1) Is the DCS output card sending the correct mA? (2) Is the I/P outputting the correct pressure? (3) Is the valve stem actually moving? The fault is somewhere between these three points.
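
The three-point check can be sketched as a comparison of each measurement against what the previous stage should produce. This is an illustration only, assuming a direct-acting 4-20 mA to 3-15 psi I/P, linear valve travel, and no split ranging. The function names and tolerances are hypothetical.

```python
def expected_psi(ma: float) -> float:
    """Ideal I/P output for a 4-20 mA input and a 3-15 psi range."""
    return 3.0 + (ma - 4.0) / 16.0 * 12.0


def locate_valve_fault(dcs_demand_pct, measured_ma, measured_psi, stem_travel_pct,
                       ma_tol=0.1, psi_tol=0.3, travel_tol=2.0):
    """Walk the three measurement points in order and name the first that disagrees."""
    demanded_ma = 4.0 + dcs_demand_pct / 100.0 * 16.0
    if abs(measured_ma - demanded_ma) > ma_tol:
        return "DCS output card or wiring to the I/P"
    if abs(measured_psi - expected_psi(measured_ma)) > psi_tol:
        return "I/P converter or its air supply"
    expected_travel = (measured_ma - 4.0) / 16.0 * 100.0
    if abs(stem_travel_pct - expected_travel) > travel_tol:
        return "actuator, positioner or valve"
    return "signal chain consistent"


# The opening story: 50 % demand, but only 11.4 mA arrives at the valve.
print(locate_valve_fault(50.0, measured_ma=11.4, measured_psi=8.55, stem_travel_pct=46.0))
```

Each stage is judged against its own input rather than against the DCS demand, so a single upstream error is not blamed on every device downstream of it.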


DCS Troubleshooting: Emerson DeltaV

DeltaV has powerful built-in diagnostics. Most engineers use 20% of what is available. Here is where to look and why.

DeltaV Tool | What It Shows | When to Use It
Alarm Banner / Alarm Summary | Active alarms, timing, priority | First action: what alarmed and when?
DeltaV Operate Trends | PV, output, SP over time | Visualise exactly when the fault started
Event Chronicle | Every controller action, alarm, operator action with timestamp | Root cause analysis: sequence of events
DeltaV Diagnostics | Controller health, network status, module health | System-level faults
AMS Device Manager | HART device diagnostics, calibration history, NAMUR NE107 status | Field device faults: always check first
DeltaV Inspect | Physical signal at I/O channel level | Verify what is physically coming into the card
Download History (DeltaV Explorer) | Configuration change log | "Did something change recently?"

DeltaV Troubleshooting Checklist

  • Check controller health in DeltaV Diagnostics
  • Verify I/O card channel status: healthy or faulted?
  • Check network node status in DeltaV Explorer
  • Review Event Chronicle from 1 hour before the fault appeared
  • Open AMS Device Manager: check NAMUR NE107 diagnostic alerts
  • Review download history: was a configuration change pushed recently?
  • Check power supply health for I/O subsystem
  • Verify engineering units in AI block match transmitter configuration
  • Confirm scaling: 4 mA = 0% and 20 mA = 100% correct for this range?
  • Check for square root extraction applied in both transmitter AND DCS
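
The engineering unit and scaling checks can be sketched as a comparison between the transmitter's configured range and the range in the DCS AI block. This is a minimal example, not DeltaV or AMS code. The `Scaling` class, the function name, and the example ranges are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scaling:
    lrv: float   # engineering value at 4 mA
    urv: float   # engineering value at 20 mA
    unit: str

    def from_ma(self, ma: float) -> float:
        """Linear 4-20 mA to engineering units."""
        return self.lrv + (ma - 4.0) / 16.0 * (self.urv - self.lrv)


def scaling_problems(transmitter: Scaling, ai_block: Scaling) -> list[str]:
    """List the disagreements between transmitter and DCS AI block scaling."""
    problems = []
    if transmitter.unit != ai_block.unit:
        problems.append(f"units differ: {transmitter.unit} vs {ai_block.unit}")
    if (transmitter.lrv, transmitter.urv) != (ai_block.lrv, ai_block.urv):
        problems.append(
            f"range differs: {transmitter.lrv}-{transmitter.urv} "
            f"vs {ai_block.lrv}-{ai_block.urv}"
        )
    return problems


# Hypothetical values for illustration only.
transmitter = Scaling(0.0, 1000.0, "ft3/h")
ai_block = Scaling(0.0, 500.0, "m3/h")

for problem in scaling_problems(transmitter, ai_block):
    print(problem)
print("12 mA as the DCS sees it:", ai_block.from_ma(12.0), ai_block.unit)
```

Comparing the two configurations side by side catches the mismatch without needing any process flow, which is useful before startup.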

Field Story #2
The Hidden Simulation Mode

A flow controller on a DeltaV system started reading 2 mA higher than the HART primary variable. The control module appeared fine. No active alarms.

AMS Device Manager showed a single flag: Simulation Mode Active on the transmitter.

Someone had enabled simulation mode during commissioning to test the loop and forgot to disable it. The transmitter was outputting a fixed simulated value while the real process had moved. The DCS was controlling on simulated data.

Event Chronicle showed a technician had connected a HART communicator to that device three days earlier.

Lesson: AMS Device Manager found it in five minutes. Without AMS, this fault could have run undetected for weeks, with incorrect flow measurement feeding a critical control loop. Always check the simulation mode flag after any HART communicator activity.


SIS Troubleshooting: What's Different

Safety Instrumented System troubleshooting follows the same logical steps but with higher stakes and strict procedural controls. The key difference: every action carries safety implications.

🚨

Never bypass an SIS loop without formal bypass authorisation and a documented risk assessment. A bypassed SIS loop means the safety barrier is removed. One undetected process exceedance during that bypass can be catastrophic.

Field Story #3
The SIS Trip Nobody Expected

A platformer unit tripped on SIS during stable normal operations. No high temperature. No process exceedance. No equipment failure visible on any trend.

SOE log showed the trip initiator: low-low pressure switch PSLL-PLT-201. But the historian trend showed process pressure was completely normal at the time of trip.

Physical inspection found it. The impulse line block valve for PSLL-PLT-201 was partially closed. During a minor process fluctuation, the impulse line could not equalise pressure fast enough. The switch saw a momentary dip that did not exist in the main process line.

Root cause: a maintenance technician had partially closed that block valve two days earlier while checking for leaks and had not fully reopened it.

Lesson: After any maintenance activity near an SIS instrument, verify that all impulse line block valves are fully open and confirmed. Add this to your post-maintenance checklist. The SOE found the initiator in seconds. The block valve inspection confirmed the root cause in minutes.

SIS Troubleshooting Checklist: IEC 61511
  • Obtain bypass authorisation before isolating any SIS loop
  • Review SOE log for the exact trip sequence and initiator
  • Confirm which sensor(s) actuated; note voting (1oo2, 2oo3)
  • Check historian trend for process condition at time of trip
  • Inspect all impulse line block valves: fully open?
  • Inspect solenoid valves: energised state correct for failsafe direction?
  • Check last proof test date and results for the initiating device
  • Update proof test record if a fault is confirmed
  • Complete formal bypass close-out before removing the bypass
  • Notify safety authority if SIS architecture integrity was affected
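
The voting note in the checklist matters because the architecture decides whether one false reading can trip the unit. The following is a minimal sketch of MooN voting. The function name is illustrative, and real logic solvers also handle bypassed or faulted sensors and degraded voting, which is left out here.

```python
def voted_trip(sensor_trips: list[bool], m: int) -> bool:
    """MooN voting: trip when at least m of the n sensors demand a trip."""
    if not 1 <= m <= len(sensor_trips):
        raise ValueError("m must be between 1 and the number of sensors")
    return sum(sensor_trips) >= m


# One sensor sees a false low-low reading; the others are healthy.
one_false_reading = [True, False, False]

print("1oo1:", voted_trip(one_false_reading[:1], m=1))
print("1oo2:", voted_trip(one_false_reading[:2], m=1))
print("2oo3:", voted_trip(one_false_reading, m=2))
```

A single false reading trips a 1oo1 or 1oo2 arrangement but not 2oo3, which is why knowing the voting tells you how many devices to inspect after a spurious trip.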

NAMUR NE107: Let Your Instruments Speak

NAMUR NE107 gives every smart field device a built-in health reporting system. If your DCS is configured to read and display this status, you catch failures before they become trips, not after.

NE107 Status | Meaning | HMI Colour | Required Action
Device OK | Normal operation, output valid | 🟢 ⚪ Green / Grey | None
Maintenance Required | Degraded but output still valid | 🔵 Blue | Plan maintenance: do not ignore
Out of Specification | Process or environment outside design limits | 🟡 Yellow | Investigate cause promptly
Function Check | Simulation or calibration active, output temporarily invalid | 🟠 Orange | Verify intentional; disable when done
Failure | Non-valid output signal | 🔴 Red | Immediate action: bypass or replace

⚙️

In DeltaV with AMS Device Manager, NE107 statuses propagate to the HMI automatically if configured correctly. Most plants find out about device failures after a trip rather than before. Configure NE107. Use it actively. It is the difference between predictive maintenance and reactive firefighting.
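
One way to use the statuses actively is to turn them into a prioritised worklist. This is a sketch only, not an AMS or DeltaV interface. The severity ordering is an assumption for illustration, and the device tags are hypothetical.

```python
# Assumed severity order; lower number means act sooner.
NE107_PRIORITY = {
    "Failure": 0,
    "Function Check": 1,
    "Out of Specification": 2,
    "Maintenance Required": 3,
    "Device OK": 4,
}


def triage(devices: dict[str, str]) -> list[tuple[str, str]]:
    """Sort (tag, status) pairs so the most urgent NE107 status comes first."""
    return sorted(devices.items(), key=lambda item: NE107_PRIORITY[item[1]])


# Hypothetical tags and statuses for illustration only.
statuses = {
    "FT-101": "Device OK",
    "PT-205": "Maintenance Required",
    "TT-310": "Failure",
    "LT-412": "Function Check",
}

worklist = triage(statuses)
for tag, status in worklist:
    print(f"{tag}: {status}")
```

Placing Function Check second reflects the simulation mode story above: a device left in simulation has an output that cannot be trusted, even though nothing has failed.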


Test Equipment: What You Actually Need

You do not need every tool in the catalogue. You need the right tools and you need to know how to use them in the field under pressure.

Tool | Primary Use | Field Tip
HART Handheld Communicator | Device configuration, diagnostics, calibration | Always check Simulation Mode flag after use
True RMS Multimeter | Voltage, resistance, continuity | Essential: carry always, no exceptions
mA Clamp Meter | Measure loop current without breaking circuit | Underused. No loop interruption. No alarms. Carry one always.
Process Calibrator | Simulate transmitter signals, verify valve response | Verify mA output and valve position simultaneously
Pressure Calibrator | Pressure loop calibration, SIS proof testing | Calibrate at process temperature when possible
Deadweight Tester | High-accuracy pressure reference | Critical loop calibration: do not use a gauge as reference
ProfiTrace Analyzer | Profibus PA/DP network diagnostics | When fieldbus devices drop off: check segment voltage first
Thermal Camera (FLIR) | Electrical hot spots, panel faults | Scan power supply and MCC panels during startup
Digital Storage Oscilloscope | Signal noise and interference investigation | When a loop is "noisy": DCS historian hides 50 Hz noise

Field Story #4
The Noise That Was Not There

A temperature controller on a fired heater had been in manual for weeks. Operators reported the PV was "noisy": a ±2°C oscillation. Three engineers had looked at it. All blamed PID tuning.

One engineer brought a digital storage oscilloscope. Connected it across the thermocouple terminals in the local junction box.

The signal was oscillating at exactly 50 Hz. Power line frequency interference.

The thermocouple extension cable was running parallel to a 440V motor power cable for 15 metres in the same cable tray. No separation. Cable shield was connected at both ends, creating a classic ground loop that picked up power line interference.

Lesson: Electrical noise is invisible on a DCS historian trend because the historian scans at 1-second intervals and averages the noise out. You need an oscilloscope or millivolt meter at the source. Fix: re-route cable to a separate tray. Ground shield at DCS end only. Signal clean in 20 minutes.
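
A short numerical sketch shows why a 1-second scan cannot reveal 50 Hz interference. It assumes an idealised pure 50 Hz sine of ±2°C and a perfectly regular scan clock. The amplitude, phase, and sample rates are illustrative values, not measurements from the story.

```python
import math

MAINS_HZ = 50.0
NOISE_AMPLITUDE_C = 2.0      # matches the reported +/-2 degC swing
PHASE = 0.7                  # arbitrary phase in radians (assumed)


def noise(t: float) -> float:
    """Interference in degC at time t seconds."""
    return NOISE_AMPLITUDE_C * math.sin(2 * math.pi * MAINS_HZ * t + PHASE)


# Fast acquisition, 1 kHz for one second: the full swing is visible.
fast = [noise(n / 1000.0) for n in range(1000)]
fast_swing = max(fast) - min(fast)

# One instantaneous sample per second: every sample lands on the same phase.
slow = [noise(float(n)) for n in range(10)]
slow_swing = max(slow) - min(slow)

# One-second average, as a scan-averaging input would report it.
one_second_mean = sum(fast) / len(fast)

print(f"1 kHz swing:      {fast_swing:.2f} degC")
print(f"1 s sample swing: {slow_swing:.6f} degC")
print(f"1 s average:      {one_second_mean:.6f} degC")
```

In practice mains frequency and scan clocks both drift, so the slow trend tends to show a slow wander or beat rather than a perfectly flat line. Either way, the 50 Hz component itself only appears on an instrument sampling far faster than the historian.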


OT Cybersecurity: A Necessary Brief

Modern DCS systems are networked. That brings capability and risk. I have seen troubleshooting hours wasted because the real fault was a cybersecurity-related issue that nobody considered.


    ▸ Windows patch deployment pushed automatically to a DCS engineering station mid-shift: took it offline
    ▸ Antivirus scan running during peak operations: consumed CPU and froze DCS operator graphics
    ▸ Unauthorised network switch added by a contractor: caused a broadcast storm on the DCS control network
    ▸ NTP server failure: SOE timestamps drifted by hours, making post-trip analysis unreliable
    ▸ USB device inserted by a contractor: introduced malware that corrupted historian archive data

Quick OT Security Checks During Troubleshooting
  • Is the DCS network congested? Check managed switch port statistics
  • Are SOE timestamps consistent? Verify NTP synchronisation status
  • Was any external device connected recently? Check change logs
  • Is antivirus scheduled to scan during plant operations? It should not be
  • Are all active remote access sessions controlled, logged, and authorised?

IEC 62443 and your site's OT security policy govern these controls. If they don't exist, raise it with your cybersecurity authority now, not after an incident.
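
The SOE timestamp check in the list above can be sketched as a comparison of each node's clock against a trusted reference. This is an illustration only. The node names, the 100 ms tolerance, and the snapshot values are hypothetical, and how the clock readings are collected depends on the system.

```python
from datetime import datetime, timedelta


def drifted_nodes(reference: datetime, node_clocks: dict[str, datetime],
                  tolerance: timedelta = timedelta(milliseconds=100)) -> dict[str, timedelta]:
    """Return the offset of every node whose clock is outside the tolerance."""
    return {
        node: clock - reference
        for node, clock in node_clocks.items()
        if abs(clock - reference) > tolerance
    }


# Hypothetical snapshot: each node's clock read at the same instant.
reference = datetime(2024, 1, 15, 8, 0, 0)
clocks = {
    "controller-01": reference + timedelta(milliseconds=20),
    "sis-logic-solver": reference - timedelta(hours=2, minutes=5),
    "historian": reference + timedelta(milliseconds=60),
}

suspect = drifted_nodes(reference, clocks)
for node, offset in suspect.items():
    print(f"{node}: offset {offset.total_seconds():+.0f} s")
```

Any node flagged here makes the order of events in a post-trip analysis unreliable until the offset is accounted for.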


Common Mistakes: What Engineers Miss

I have made some of these. I have seen all of them — on projects across Oil & Gas, mining, and refining.

  • Replacing the transmitter first: without verifying it is actually faulty. Most transmitter replacements on plant are unnecessary.
  • Not talking to the operator: missing the one piece of context that would have found the fault in 10 minutes.
  • Ignoring alarm history: the answer is often already in the DCS logs, waiting to be read.
  • Working from outdated documentation: old P&IDs that do not reflect as-built reality in brownfield projects.
  • Fixing the symptom, not the cause: loop works after a reboot. Fault returns next week. Root cause never addressed.
  • No post-repair verification: closing the job without confirming alarms, interlocks, and control are all back to normal.
  • Skipping documentation update: the next engineer faces the same fault with no history to guide them.
  • Not informing the control room: causing a process upset while troubleshooting the original fault.

Safety During Troubleshooting: Non-Negotiable

🚨

Troubleshooting under time pressure is the most dangerous moment in plant operations. Pressure causes shortcuts. Shortcuts cause accidents. Slow down. Think. Then act.


    ▸ Obtain a Permit to Work (PTW) before working on live equipment in classified hazardous areas
    ▸ Never bypass SIS loops without formal authorisation and documented risk assessment
    ▸ Apply LOTO before working inside electrical panels or MCC enclosures
    ▸ Gas-test before opening field junction boxes in potentially explosive atmospheres
    ▸ Never assume a loop is de-energised: test before you touch
    ▸ Treat instrument air with respect: high-pressure systems cause serious injuries
    ▸ Inform the control room of every action that could affect the process
    ▸ Any modification to a live SIS system requires a formal MOC: no exceptions

Lessons From 12+ Years in the Field

If I could give a junior engineer one page to carry, this would be it.

01
Verify before you touch. Most faults are simpler than they look. The complexity is in the diagnosis, not the fault itself.
02
Talk to the operator first. They have been watching the process for hours. They know things you do not.
03
The DCS historian is your best tool. Before reaching for a multimeter, look at the trend. The answer is often already in the data.
04
Intermittent faults are the most dangerous. They disappear when you arrive. Set up a high-speed trend. Wait for it. Log everything.
05
Documentation is not optional. An undocumented repair is a future mystery fault for the next engineer.
06
FAT does not equal commissioning. Many faults only appear under real process conditions. FAT gives you confidence. Startup gives you truth.
07
Ask for help sooner. Pride costs hours. Asking costs minutes. No experienced engineer works alone on a difficult fault.
08
Every fault has a root cause. Do not close the job until you know what caused it, not just what fixed it.

"Troubleshooting is not about knowing everything. It is about thinking clearly when others are panicking."

Zohaib Jahan | Senior Automation Engineer

💬
Over to You

What is the most difficult fault you have ever troubleshot? Was it intermittent? Was the root cause something nobody expected? Did it teach you something that stayed with you? Share it: this profession gets better when we share field knowledge, not just theory.
