Instrumentation and Control Troubleshooting - Oil and Gas Experiences

The Startup That Humbled Three Engineers

Real Refinery Plant Startup · Day 3

The feed flow controller was hunting. Not slightly — badly. The control valve was cycling hard. Operators were frustrated. Management wanted answers ten minutes ago.

Three engineers stood around that marshalling cabinet for four hours. Laptops open. P&IDs spread across a folding table. Everyone had a theory.

"It's drifting so let's swap the transmitter."

"The orifice plate might be backwards."

"I'll retune the PID now."

None of them were right.

A senior technician (quiet, 15 years in the field) walked over with a multimeter. Measured the mA signal at the valve actuator terminals in the field junction box.

12.0 mA at the DCS output card
11.4 mA at the valve terminals

0.6 mA lost at a single corroded terminal block connection. Enough to make the loop hunt. Enough to waste four hours. Enough to hold the unit at 40% throughput for half a shift.

One corroded terminal. Three engineers. Four hours.

That story is not unusual. I have seen it, or something very close, on more startups than I care to count. And that is exactly why troubleshooting is a skill, not just knowledge.

💡

The cost is real. The global process industry loses approximately $20 billion per year to unscheduled downtime and off-spec production (ARC Advisory Group / Fieldbus Foundation, 2009). 80% of those losses are preventable. 40% trace back to the human in the loop, not the equipment.

The Mindset That Separates Good From Great

Before any tool or technique, you need the right mindset. Here is what I have observed over two decades in the field.

Junior Engineer

Assumes the transmitter is faulty first
Replaces parts before isolating the fault
Works alone without talking to the operator
Ignores documentation and loop history
Speeds up under time pressure — makes errors

Experienced Engineer

Asks: is this a fault or a real process change?
Talks to the operator before touching anything
Reviews alarm history and historian trend first
Works methodically, not reactively
Slows down when everyone else is panicking

"The best troubleshooters I have worked with share one trait: they slow down when everyone else speeds up."

Field lesson, 12+ years Oil & Gas

The 5-Step Troubleshooting Framework

Most engineers know it. Few follow it completely — especially Step 1.

Step	Action	Most Common Mistake
01	Verify the problem is real	Assuming the operator is always correct
02	Identify and locate the fault	Guessing without systematic isolation
03	Fix the problem	Repairing without a clear plan
04	Verify the repair	Closing the job without full confirmation
05	Follow up and document	Skipping documentation entirely

⚠️

Step 1 is the most skipped. I once watched two technicians spend three hours troubleshooting a level transmitter that was reading correctly. The actual problem: the operator had opened the wrong valve. The level was genuinely high. The instrument was fine. Verify the problem is real before you touch anything.

Pre-Troubleshooting Checklist — Before You Touch Anything

Talk to the operator: what changed? When exactly?
Review DCS alarm history from the past 24 hours
Pull the historian trend — PV, output, and setpoint together
Check for a recent MOC (Management of Change) on this loop
Confirm loop documentation is current (P&ID, loop diagram)
Understand what the loop is supposed to do under normal conditions
Check loop mode: manual or auto?
Confirm process conditions are normal — note anything different
Inform the control room before any action that could cause an upset
Confirm permit requirements — does this need a PTW?

Isolating the Fault: Three Methods

Method 01

Input / Output Series

Start at one end, work toward the other. Transmitter → cable → marshalling → DCS input → controller → DCS output → cable → valve. Measure at each point. Find where the signal breaks.

Method 02

Divide & Conquer

Split the loop in half. Check the midpoint. Signal correct? Fault is in the second half. Wrong? Fault is in the first half. Keep halving until isolated. Fastest on long cable runs.
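
The halving procedure is a binary search over the loop's test points. The sketch below is a minimal illustration, assuming a single fault so that every point upstream of it measures correctly and every point from the fault onward measures wrong. The point names and the fake_measurement callback are hypothetical stand-ins for a real reading taken with a meter.

```python
from typing import Callable, Optional, Sequence

# Test points in signal order (illustrative names only).
LOOP_POINTS = [
    "transmitter", "field cable", "marshalling", "DCS input",
    "controller", "DCS output", "output cable", "valve",
]

def first_bad_point(points: Sequence[str],
                    signal_ok: Callable[[str], bool]) -> Optional[str]:
    """Return the first point where the signal is wrong, or None if all pass."""
    lo, hi = 0, len(points)            # the fault, if any, lies in points[lo:hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if signal_ok(points[mid]):
            lo = mid + 1               # good here: fault is further downstream
        else:
            hi = mid                   # bad here: fault is here or upstream
    return points[lo] if lo < len(points) else None

# Illustrative run: pretend the signal is good up to marshalling only.
checked = []

def fake_measurement(point: str) -> bool:
    checked.append(point)              # record each measurement taken
    return LOOP_POINTS.index(point) < 3

fault = first_bad_point(LOOP_POINTS, fake_measurement)
print(fault, "located after", len(checked), "measurements")
```

In this run the fault is pinned to the DCS input after three measurements. The saving grows with the number of test points, which is why the method pays off on long cable runs with many junctions.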

Method 03

History-Based

What changed recently? New instrument? Cable modification? DCS update? Brownfield tie-in? Changes cause faults. Start where the change happened. Fastest method — if documentation is current.

The 4-20mA Loop — Where Most Faults Hide

The 4-20mA signal is the backbone of process instrumentation. It fails in predictable, diagnosable ways. Know this table and you will find most loop faults faster than any other method.

mA Reading	Condition / Meaning	First Check
0 mA	Open circuit: broken wire or blown fuse	Cable continuity, fuse condition
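0–3.6 mA	Transmitter failure signal (NAMUR NE43 low, ≤ 3.6 mA) or wiring / supply fault	Supply voltage, cable breaks, terminal tightness, transmitter diagnostics
3.6–3.8 mA	Below the valid range but above the NE43 failure level: treat as suspect	Transmitter diagnostics and alarm-level configuration
3.8–4.0 mA	Normal Under-Range	Confirm process is actually at minimum
4.0–20.0 mA	Normal Operation	If reading seems wrong: check process
20.0–20.5 mA	Normal Over-Range	Confirm process is at or above maximum
20.5–21.0 mA	Above the valid range but below the NE43 failure level: treat as suspect	Transmitter diagnostics and alarm-level configuration
21.0–22.0 mA	Transmitter Failure (NAMUR NE43 high, ≥ 21.0 mA)	Check transmitter diagnostics
> 22 mA	Short circuit / wiring fault	Inspect cable for damage or short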

💡

NAMUR NE43: A reading of 3.6 mA or below is not a process at 0%. It is the transmitter signalling internal failure. Configure your DCS to recognise NAMUR NE43 status levels and get early warning before a fault causes a process upset or SIS trip. Most plants do not configure this correctly.
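
The band logic can be written down as a small classifier. This is a sketch, not a DCS configuration: it uses the NE43 levels as I understand the recommendation (measurement valid from 3.8 to 20.5 mA, failure signalled at or below 3.6 mA and at or above 21.0 mA). Individual transmitters may be configured with different alarm levels, so check the device data sheet. The function name and band labels are my own.

```python
def classify_loop_current(ma: float) -> str:
    """Map a measured loop current in mA to a NAMUR NE43 style band."""
    if ma <= 3.6:
        return "failure low (or open / high-resistance loop)"
    if ma < 3.8:
        return "undefined band below under-range"
    if ma < 4.0:
        return "under-range, measurement still valid"
    if ma <= 20.0:
        return "normal operation"
    if ma <= 20.5:
        return "over-range, measurement still valid"
    if ma < 21.0:
        return "undefined band above over-range"
    return "failure high (or short circuit)"

for reading in (0.0, 3.5, 3.9, 12.0, 20.3, 21.5):
    print(f"{reading:5.1f} mA -> {classify_loop_current(reading)}")
```

The practical point is that the DCS should treat the failure bands as bad-quality input rather than clamping them to 0% or 100%.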

Field Story #1

The Brownfield Trap

During a brownfield tie-in at a gas processing plant, a new mass flow transmitter was installed on an existing hydrocarbon line. After hookup, the DCS read zero even at full process flow.

Everyone assumed a wiring fault. We spent an hour tracing cables. Nothing wrong with the wiring.

I asked one question: "Did anyone check the transmitter configuration?"

The new HART transmitter had been factory-configured in imperial units, cubic feet per hour. The DCS engineering unit was cubic metres per hour. The configured ranges did not overlap at any normal flow rate.

History-based diagnosis: something was configured during installation. Total time to find and fix: 20 minutes. One question saved two more hours of cable tracing.

Common Loop Faults by Type

Pressure Loops

Plugged impulse lines — especially with viscous or dirty process fluids
Condensate in gas impulse lines — seasonal, worse in winter
DP transmitter high/low legs swapped — reads backwards at full flow
Equalising valve left open after maintenance: transmitter reads near-zero DP
Diaphragm seal fill fluid degradation in high-temperature service

Temperature Loops

Wrong thermocouple type configured in DCS vs. actual installed type
Plain copper cable used instead of proper thermocouple extension or compensating cable
Poor insertion depth — measuring pipe wall temperature, not process
RTD installed as 2-wire, configured as 3-wire — produces systematic error
Shield connected at both ends — ground loop induces 50 Hz noise
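
To get a feel for the size of an RTD wiring error, the rough estimate below converts uncompensated lead resistance into an apparent temperature offset. It assumes a Pt100 element with a sensitivity of about 0.385 Ω/°C and that the full lead resistance ends up in the measurement. How a particular input card behaves with a wire missing is card-specific, so treat this as an order-of-magnitude check only.

```python
PT100_OHMS_PER_DEG_C = 0.385   # approximate Pt100 sensitivity, assumed constant

def lead_error_deg_c(uncompensated_lead_ohms: float) -> float:
    """Apparent temperature offset caused by uncompensated lead resistance."""
    return uncompensated_lead_ohms / PT100_OHMS_PER_DEG_C

# Hypothetical cable run: 1 ohm per conductor, two conductors in series.
print(round(lead_error_deg_c(2.0), 1), "degC high")
```

Two ohms of copper is enough for an error of roughly 5 °C, which is why the wiring and the configured connection type have to match.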

Flow Loops

Impulse lines partially plugged — sluggish response to process changes
Square root extraction applied twice — once in transmitter AND once in DCS (very common on DeltaV systems during commissioning)
Orifice plate installed backwards — check direction arrow on plate body
Vortex meter reading zero — flow is below minimum measurable velocity
Gas entrainment in liquid flow meters — causes erratic, spiky readings
Wrong DP range configured for actual orifice plate bore size
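
Double square root extraction has a recognisable signature: the reading is right at 0% and 100% and too high everywhere in between. A short worked example, assuming an ideal DP flow element where flow is proportional to the square root of differential pressure and both values are expressed as fractions of range (function names are illustrative):

```python
import math

def flow_fraction(dp_fraction: float) -> float:
    """Correct case: square root applied once."""
    return math.sqrt(dp_fraction)

def double_extracted(dp_fraction: float) -> float:
    """Fault case: square root applied in the transmitter AND in the DCS."""
    return math.sqrt(math.sqrt(dp_fraction))

for dp in (0.0, 0.25, 0.50, 1.0):
    print(f"DP {dp:4.0%}  true flow {flow_fraction(dp):6.1%}  "
          f"displayed {double_extracted(dp):6.1%}")
```

At 25% DP the true flow is 50% but the display shows about 71%. A zero and span check alone will not catch this fault, so compare at least one mid-range point.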

Control Valve Troubleshooting

Control valves cause more production losses than any other single component in a process plant. And most valve problems are diagnosed incorrectly — engineers jump to positioner replacement without measuring the signal chain first.

Symptom	Likely Cause	Check First
No response to signal change	Solenoid failure or instrument air lost	I/P input signal, air supply pressure
Valve hunting / oscillating	Positioner gain too high	Positioner tuning parameters
Stuck at fixed position	Positioner feedback linkage broken	Physical linkage between positioner and valve stem
Drives fully open or closed	I/P converter failure	Measure I/P output pressure (3–15 psi?)
Slow to respond	Restriction in air tubing or filter	Air filter, restrictor, volume booster
Leaking at closed position	Seat damage or wrong shutoff class	Seat and plug inspection, seat leakage test
Oscillating near low % travel	Stiction in packing or positioner	Friction test, packing adjustment

The I/P Converter — Three Points of Measurement

The I/P converter turns the DCS mA output into pneumatic pressure to drive the valve actuator. It fails silently. Always verify three points: (1) Is the DCS output card sending the correct mA? (2) Is the I/P outputting the correct pressure? (3) Is the valve stem actually moving? The fault is somewhere between these three points.
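
The three-point check can be expressed as a simple comparison chain. This is a sketch under stated assumptions: a linear 4–20 mA to 3–15 psi I/P, an air-to-open valve with linear travel, and tolerances that I picked for illustration. All function names and tolerance values are hypothetical.

```python
def expected_ma(output_pct: float) -> float:
    """DCS output in percent -> expected loop current in mA."""
    return 4.0 + 16.0 * output_pct / 100.0

def expected_psi(ma: float) -> float:
    """Loop current in mA -> expected I/P output in psi (3-15 psi span)."""
    return 3.0 + 12.0 * (ma - 4.0) / 16.0

def locate_valve_fault(output_pct: float, measured_ma: float,
                       measured_psi: float, measured_travel_pct: float,
                       tol_ma: float = 0.1, tol_psi: float = 0.3,
                       tol_travel_pct: float = 2.0) -> str:
    """Return the first stage whose measurement disagrees with its input."""
    if abs(measured_ma - expected_ma(output_pct)) > tol_ma:
        return "DCS output card or wiring"
    if abs(measured_psi - expected_psi(measured_ma)) > tol_psi:
        return "I/P converter or air supply"
    if abs(measured_travel_pct - output_pct) > tol_travel_pct:
        return "actuator, positioner or valve"
    return "no fault found at these three points"

# Controller at 50% output, but only 11.4 mA arrives at the valve terminals.
print(locate_valve_fault(50.0, 11.4, 8.55, 46.0))
```

Each stage is judged against the signal it actually receives, so a fault upstream is not blamed on the I/P or the valve.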

DCS Troubleshooting — Emerson DeltaV

DeltaV has powerful built-in diagnostics. Most engineers use 20% of what is available. Here is where to look and why.

DeltaV Tool	What It Shows	When to Use It
Alarm Banner / Alarm Summary	Active alarms, timing, priority	First action — what alarmed and when?
DeltaV Operate Trends	PV, output, SP over time	Visualise exactly when the fault started
Event Chronicle	Every controller action, alarm, operator action with timestamp	Root cause analysis — sequence of events
DeltaV Diagnostics	Controller health, network status, module health	System-level faults
AMS Device Manager	HART device diagnostics, calibration history, NAMUR NE107 status	Field device faults — always check first
DeltaV Inspect	Physical signal at I/O channel level	Verify what is physically coming into the card
Download History (DeltaV Explorer)	Configuration change log	"Did something change recently?"

DeltaV Troubleshooting Checklist

DeltaV DCS Checklist

Check controller health in DeltaV Diagnostics
Verify I/O card channel status — healthy or faulted?
Check network node status in DeltaV Explorer
Review Event Chronicle from 1 hour before the fault appeared
Open AMS Device Manager — check NAMUR NE107 diagnostic alerts
Review download history — was a configuration change pushed recently?
Check power supply health for I/O subsystem
Verify engineering units in AI block match transmitter configuration
Confirm scaling: do 4 mA = 0% and 20 mA = 100% map to the correct engineering range?
Check for square root extraction applied in both transmitter AND DCS
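
The scaling and engineering-unit items on this checklist amount to comparing two small records: what the transmitter is configured for and what the DCS input expects. A minimal sketch in plain Python, not DeltaV or AMS code; the RangeConfig type and the example ranges are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RangeConfig:
    lrv: float      # lower range value, the reading at 4 mA
    urv: float      # upper range value, the reading at 20 mA
    unit: str

def ma_to_eu(ma: float, cfg: RangeConfig) -> float:
    """Linear 4-20 mA scaling to engineering units."""
    return cfg.lrv + (ma - 4.0) / 16.0 * (cfg.urv - cfg.lrv)

def config_mismatches(transmitter: RangeConfig, dcs: RangeConfig) -> List[str]:
    """List every field where transmitter and DCS configuration disagree."""
    problems = []
    if transmitter.unit != dcs.unit:
        problems.append(f"unit: {transmitter.unit} vs {dcs.unit}")
    if transmitter.lrv != dcs.lrv:
        problems.append(f"LRV: {transmitter.lrv} vs {dcs.lrv}")
    if transmitter.urv != dcs.urv:
        problems.append(f"URV: {transmitter.urv} vs {dcs.urv}")
    return problems

# Hypothetical example: transmitter left in imperial units after installation.
tx = RangeConfig(0.0, 7000.0, "ft3/h")
ai = RangeConfig(0.0, 200.0, "m3/h")
print(ma_to_eu(12.0, ai), ai.unit)
print(config_mismatches(tx, ai))
```

A mid-scale signal of 12 mA should display 100 m3/h on this DCS range. If the two records disagree, the display cannot be trusted even though the loop current itself is healthy.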

Field Story #2

The Hidden Simulation Mode

A flow controller on a DeltaV system started reading 2 mA higher than the HART primary variable. The control module appeared fine. No active alarms.

AMS Device Manager showed a single flag: Simulation Mode Active on the transmitter.

Someone had enabled simulation mode during commissioning to test the loop and forgot to disable it. The transmitter was outputting a fixed simulated value while the real process had moved. The DCS was controlling on simulated data.

Event Chronicle showed a technician had connected a HART communicator to that device three days earlier.

Lesson: AMS Device Manager found it in five minutes. Without AMS, this fault could have run undetected for weeks with incorrect flow measurement feeding a critical control loop. Always check the simulation mode flag after any HART communicator activity.

SIS Troubleshooting — What's Different

Safety Instrumented System troubleshooting follows the same logical steps but with higher stakes and strict procedural controls. The key difference: every action carries safety implications.

🚨

Never bypass an SIS loop without formal bypass authorisation and a documented risk assessment. A bypassed SIS loop means the safety barrier is removed. One undetected process exceedance during that bypass can be catastrophic.

Field Story #3

The SIS Trip Nobody Expected

A platformer unit tripped on SIS during stable normal operations. No high temperature. No process exceedance. No equipment failure visible on any trend.

SOE log showed the trip initiator: low-low pressure switch PSLL-PLT-201. But the historian trend showed process pressure was completely normal at the time of trip.

Physical inspection found it. The impulse line block valve for PSLL-PLT-201 was partially closed. During a minor process fluctuation, the impulse line could not equalise pressure fast enough. The switch saw a momentary dip that did not exist in the main process line.

Root cause: a maintenance technician had partially closed that block valve two days earlier while checking for leaks and had not fully reopened it.

Lesson: After any maintenance activity near an SIS instrument, verify that all impulse line block valves are fully open and confirmed. Add this to your post-maintenance checklist. The SOE found the initiator in seconds. The block valve inspection confirmed the root cause in minutes.

SIS Troubleshooting Checklist

SIS Troubleshooting Checklist — IEC 61511

Obtain bypass authorisation before isolating any SIS loop
Review SOE log for the exact trip sequence and initiator
Confirm which sensor(s) actuated — note voting (1oo2, 2oo3)
Check historian trend for process condition at time of trip
Inspect all impulse line block valves — fully open?
Inspect solenoid valves — energised state correct for failsafe direction?
Check last proof test date and results for the initiating device
Update proof test record if a fault is confirmed
Complete formal bypass close-out before removing the bypass
Notify safety authority if SIS architecture integrity was affected
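
Voting matters when reading a trip report because it tells you how many sensors had to agree. The sketch below shows generic M-out-of-N (MooN) voting in a few lines. It is an illustration of the concept, not logic solver code, and the function name is my own.

```python
from typing import Sequence

def moon_trip(sensor_demands: Sequence[bool], m: int) -> bool:
    """MooN voting: trip when at least m of the n sensors demand a trip."""
    return sum(sensor_demands) >= m

# One sensor sees a momentary false dip, the others do not.
print("1oo1:", moon_trip([True], 1))
print("1oo2:", moon_trip([True, False], 1))
print("2oo3:", moon_trip([True, False, False], 2))
```

With 1oo1 or 1oo2 voting a single false reading trips the unit, while 2oo3 rides through it. That is why the checklist asks you to note both which sensors actuated and the voting arrangement.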

NAMUR NE107 — Let Your Instruments Speak

NAMUR NE107 gives every smart field device a built-in health reporting system. If your DCS is configured to read and display this status, you catch failures before they become trips — not after.

NE107 Status	Meaning	HMI Colour	Required Action
Device OK	Normal operation, output valid	Green / Grey	None
Maintenance Required	Degraded but output still valid	Blue	Plan maintenance — do not ignore
Out of Specification	Process or environment outside design limits	Yellow	Investigate cause promptly
Function Check	Simulation or calibration active, output temporary	Orange	Verify intentional — disable when done
Failure	Non-valid output signal	Red	Immediate action — bypass or replace

⚙️

In DeltaV with AMS Device Manager, NE107 statuses propagate to the HMI automatically if configured correctly. Most plants find out about device failures after a trip rather than before. Configure NE107. Use it actively. It is the difference between predictive maintenance and reactive firefighting.

Test Equipment — What You Actually Need

You do not need every tool in the catalogue. You need the right tools and you need to know how to use them in the field under pressure.

Tool	Primary Use	Field Tip
HART Handheld Communicator	Device configuration, diagnostics, calibration	Always check Simulation Mode flag after use
True RMS Multimeter	Voltage, resistance, continuity	Essential: carry always, no exceptions
mA Clamp Meter	Measure loop current without breaking circuit	Underused. No loop interruption. No alarms. Carry one always.
Process Calibrator	Simulate transmitter signals, verify valve response	Verify mA output and valve position simultaneously
Pressure Calibrator	Pressure loop calibration, SIS proof testing	Calibrate at process temperature when possible
Deadweight Tester	High-accuracy pressure reference	Critical loop calibration — do not use a gauge as reference
ProfiTrace Analyzer	Profibus PA/DP network diagnostics	When fieldbus devices drop off: check segment voltage first
Thermal Camera (FLIR)	Electrical hot spots, panel faults	Scan power supply and MCC panels during startup
Digital Storage Oscilloscope	Signal noise and interference investigation	When a loop is "noisy" — DCS historian hides 50 Hz noise

Field Story #4

The Noise That Was Not There

A temperature controller on a fired heater had been in manual for weeks. Operators reported the PV was "noisy" — ±2°C oscillation. Three engineers had looked at it. All blamed PID tuning.

One engineer brought a digital storage oscilloscope. Connected it across the thermocouple terminals in the local junction box.

The signal was oscillating at exactly 50 Hz. Power line frequency interference.

The thermocouple extension cable was running parallel to a 440V motor power cable for 15 metres in the same cable tray. No separation. Cable shield was connected at both ends — creating a classic ground loop that picked up power line interference.

Lesson: Electrical noise is invisible on a DCS historian trend because the historian scans at 1-second intervals, far too slowly to resolve a 50 Hz waveform. You need an oscilloscope or millivolt meter at the source. Fix: re-route cable to a separate tray. Ground shield at DCS end only. Signal clean in 20 minutes.
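
A quick numerical illustration of why a 1-second scan cannot show 50 Hz: every sample lands on the same point of the waveform. This is an idealised sketch, assuming interference at exactly 50 Hz, a perfectly regular 1-second scan, and an arbitrary amplitude and phase chosen for the example.

```python
import math

def sample_sine(freq_hz: float, amplitude: float, phase_rad: float,
                sample_period_s: float, n_samples: int) -> list:
    """Sample a sine wave at a fixed period and return the values."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * k * sample_period_s
                                 + phase_rad)
            for k in range(n_samples)]

def spread(values: list) -> float:
    """Peak-to-peak range of the sampled values."""
    return max(values) - min(values)

historian = sample_sine(50.0, 2.0, 0.7, 1.0, 10)      # 1 s scan, 10 s of data
scope = sample_sine(50.0, 2.0, 0.7, 0.001, 20)        # 1 ms scan, one cycle

print("historian peak-to-peak:", round(spread(historian), 6))
print("oscilloscope peak-to-peak:", round(spread(scope), 2))
```

The slow scan sees a flat line while the fast scan sees nearly the full swing. On a real plant the mains frequency and scan timing drift slightly, so the trend shows a slow wander or apparent randomness instead of a clean 50 Hz signature.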

OT Cybersecurity — A Necessary Brief

Modern DCS systems are networked. That brings capability and risk. I have seen troubleshooting hours wasted because the real fault was a cybersecurity-related issue that nobody considered.

Windows patch deployment: pushed automatically to a DCS engineering station mid-shift, taking it offline
Antivirus scan: running during peak operations consumed CPU and froze DCS operator graphics
Unauthorised network switch added by a contractor: caused a broadcast storm on the DCS control network
NTP server failure: SOE timestamps drifted by hours, making post-trip analysis unreliable
USB device inserted by a contractor: introduced malware that corrupted historian archive data

Quick OT Security Checks During Troubleshooting

Is the DCS network congested? Check managed switch port statistics
Are SOE timestamps consistent? Verify NTP synchronisation status
Was any external device connected recently? Check change logs
Is antivirus scheduled to scan during plant operations? It should not be
Are all active remote access sessions controlled, logged, and authorised?

IEC 62443 and your site's OT security policy govern these controls. If they don't exist, raise it with your cybersecurity authority now — not after an incident.

Common Mistakes — What Engineers Miss

I have made some of these. I have seen all of them — on projects across Oil & Gas, mining, and refining.

✕ Replacing the transmitter first: without verifying it is actually faulty. Most transmitter replacements on plant are unnecessary.
✕ Not talking to the operator: missing the one piece of context that would have found the fault in 10 minutes.
✕ Ignoring alarm history: the answer is often already in the DCS logs, waiting to be read.
✕ Working from outdated documentation: old P&IDs that do not reflect as-built reality in brownfield projects.
✕ Fixing the symptom, not the cause: loop works after a reboot. Fault returns next week. Root cause never addressed.
✕ No post-repair verification: closing the job without confirming alarms, interlocks, and control are all back to normal.
✕ Skipping documentation update: the next engineer faces the same fault with no history to guide them.
✕ Not informing the control room: causing a process upset while troubleshooting the original fault.

Safety During Troubleshooting — Non-Negotiable

🚨

Troubleshooting under time pressure is the most dangerous moment in plant operations. Pressure causes shortcuts. Shortcuts cause accidents. Slow down. Think. Then act.

Obtain a Permit to Work (PTW) before working on live equipment in classified hazardous areas
Never bypass SIS loops without formal authorisation and documented risk assessment
Apply LOTO before working inside electrical panels or MCC enclosures
Gas-test before opening field junction boxes in potentially explosive atmospheres
Never assume a loop is de-energised: test before you touch
Treat instrument air with respect: high-pressure systems cause serious injuries
Inform the control room of every action that could affect the process
Any modification to a live SIS requires a formal MOC: no exceptions

Lessons From 12+ Years in the Field

If I could give a junior engineer one page to carry, this would be it.

Verify before you touch. Most faults are simpler than they look. The complexity is in the diagnosis, not the fault itself.

Talk to the operator first. They have been watching the process for hours. They know things you do not.

The DCS historian is your best tool. Before reaching for a multimeter, look at the trend. The answer is often already in the data.

Intermittent faults are the most dangerous. They disappear when you arrive. Set up a high-speed trend. Wait for it. Log everything.

Documentation is not optional. An undocumented repair is a future mystery fault for the next engineer.

FAT does not equal commissioning. Many faults only appear under real process conditions. FAT gives you confidence. Startup gives you truth.

Ask for help sooner. Pride costs hours. Asking costs minutes. No experienced engineer works alone on a difficult fault.

Every fault has a root cause. Do not close the job until you know what caused it, not just what fixed it.

"Troubleshooting is not about knowing everything. It is about thinking clearly when others are panicking."

Zohaib Jahan

💬

Over to You

What is the most difficult fault you have ever troubleshot? Was it intermittent? Was the root cause something nobody expected? Did it teach you something that stayed with you? Share it because this profession gets better when we share field knowledge, not just theory.

Content is educational!
Always verify against applicable site standards and procedures!