Wireless Sensor Networks Fault Detection and Identification



Introduction
Wireless sensor networks (WSNs) can range from a small handful to hundreds or thousands of sensor nodes with various sensors monitoring diverse physical phenomena. WSNs are used to facilitate the automation of factories, monitor off-shore drilling equipment, quantify the health of oil refineries, and many other applications. In health monitoring applications, a WSN can measure anything from the efficiency of a turbine to the pressure in a pipe. Faulty sensor readings in these applications can lead to unnecessary shutdowns of plants or disruptions in monitored processes. While sensor technology advancements can reduce fault occurrences, faults cannot be completely eliminated. Therefore, it is important to develop models and techniques that differentiate between legitimately unhealthy conditions and faulty sensor readings in WSNs. This paper explores fault detection and identification techniques and develops mathematical models for such faults in WSNs.
Sensor and actuator faults in complex distributed electro-mechanical systems are well studied [1,2], as are faults in industrial wireless sensor networks [3]. In [1], a systematically characterized taxonomy of common sensor data faults that occur in deployed sensor networks is provided, together with detailed approaches commonly taken to model these faults. These features include characteristics specific to the data, system, or environment pertaining to the system of interest. According to [1], the data features are usually statistical in nature. A confident diagnosis of any single fault may require more than one of these features to be modeled. The fault detection techniques most commonly used include mean and variance (determine expected behavior via regression models or correct faulty sensor values), correlation (regression methods), gradient (rate of change), and spatial or temporal distance (determine if data is faulty). Table 1 summarizes common fault detection techniques, including statistical, nearest neighbor, clustering, and classification methods:

- Statistical: mathematically justified models; fails if data do not fit the assumed distribution [6,7].
- Classification: provides an exact set of faults; computationally expensive, requires a proper kernel choice [9,10].
- Nearest neighbor: no assumptions on data distribution; computationally expensive in large networks, dependent on input parameters [11].
- Clustering: no previous knowledge of data statistics needed, adapts to new data; cluster width must be defined.

Statistical methods create a model of healthy data and then classify any behavior that does not fit that model as a fault. For example, a statistical model of an outdoor temperature sensor may assume that temperature should be high during the day and low at night and register a fault anytime that trend is not observed. In this case, a fault could indicate a malfunction in the sensor network or an environmental irregularity, like a storm. Statistical methods have the advantage of being based on justifiable mathematical models but are not robust to phenomena that do not fit assumed distributions.
Fault classification methods take the opposite approach: instead of creating mathematical models for healthy data, they use mathematical models for faulty data. Incoming data from the WSN is analyzed for behavior that corresponds to the faulty behavior previously modeled. The main advantage of classification methods is that they allow users to specify an exact set of faults to be considered, while the main disadvantage is that they usually cannot detect faults not previously observed.
Nearest neighbor methods forgo data models in favor of using a sensor-to-sensor comparison as a fault metric. These methods identify sets of sensors whose data should be similar (i.e., the spatially nearest sensors). Anytime these neighbors report significantly different readings from each other, a fault is registered. The main advantage of nearest-neighbor methods is that they allow faults to be detected without a mathematical model for healthy or faulty data. The main disadvantage is that they require appropriate neighbors to be defined, and mistakes in deciding which sensors are related can result in poor data comparisons.
Clustering-based methods compare clusters of nodes rather than individual sensors. During a learning period, sets of nodes are grouped into clusters, which are then compared to each other to establish which clusters are measuring related data. During fault detection, each cluster compares its data with its related clusters, and faults are registered whenever two related clusters do not measure similar data. Clustering methods are less computationally expensive than nearest-neighbor methods. The main disadvantage of clustering methods is the difficulty of appropriately defining clusters.
Table 2 summarizes results from the literature on fault detection, split into categories useful for mathematical modeling. In [12], principal component analysis (PCA) is used during a learning period to model healthy data behavior with the first four eigenvectors. Using these eigenvectors as a baseline for healthy behavior, any activity that falls outside of the healthy model is considered an outlier. In [7], a support vector machine models healthy behavior and detects outliers in real time for a WSN. In [13], possible outliers are sorted and ranked periodically, speeding up the process of outlier discovery for a high-dimensional dataset. Calibration methods focus on identifying and accounting for offsets and gains in a WSN. In [20], it is assumed that the measured phenomena behave similarly everywhere, and the characteristics of a small base set are used to model the characteristics of the entire network. In [23], it is assumed that the measured phenomena have some linear correlation between a sensor and its nearest neighbor, and this information is used to calibrate the sensors in the WSN.
Previous fault detection techniques center mainly on simulation-based tests for algorithm development. In contrast, in this paper, we provide an overview of multiple fault models and test various fault detection models on an actual wireless network comprised of industrial-grade sensors. This allows us to demonstrate functional differences between models in real-world applications and to confirm that the fault models are capable of improving the health monitoring capabilities of an industrial system.
Other techniques for fault detection and identification include machine learning-based methods [24]. In [25], a digital equivalent of the sensor was developed using a generative adversarial network. The trained model works as a digital copy of the real sensor and watches for possible faults in the real sensor. Recent convolutional neural network-based sensor fault detection methods were developed in [26]. A convolutional neural network was used to detect and identify the sensor fault, while a bank of convolutional autoencoders was used to reconstruct the correct sensor data. This paper studies fault detection and identification in WSNs. Faults included in this study are outlier, spike, variance, high-frequency noise, offset, gain, and drift faults. These faults affect system operations and endanger operators, final users, and the general public. We have developed a set of fault detection models for WSNs and implemented them on a system consisting of multiple wireless sensor nodes, fault detection software, a server, and a client (Fig. 1). The system is intended for health monitoring applications of the NASA Stennis Space Center (SSC) test stands and widely distributed support systems, including pressurized gas lines, propellant delivery systems, and water coolant lines [27,28,29,30]. The concepts presented here can also be used in other distributed industrial systems with a large number of networked electromechanical devices. The system within which our fault models were tested contains distributed software resources for providing ubiquitous information capability. The main system blocks are shown in Fig. 1 and include: (a) the Server (developed in [31]); (b) the Network Capable Application Processor (NCAP); (c) the Communication Module; and (d) a set of Transducer Interface Modules (TIMs) previously developed using the Coremicro Reconfigurable Embedded Smart Sensor Node (CRE-SSN) [32,33]. The TIMs monitor sensors that measure physical phenomena such as temperature, pressure, and flow rate. They communicate measured data to the NCAP, which analyzes the data for faults. The NCAP records the data (as well as any detected faults) on a server where it can be accessed later by a user.

Mathematical Models
Fig. 2 illustrates the sensor network model. Sensor readings for node j are given by

z_j(k) = [z_{1j}(k), z_{2j}(k), \ldots, z_{Nj}(k)]^T,

where individual sensors on a node are identified as i = 1, 2, ..., N [34]. We assume that each node has an equal number of sensors, N, with output values at time instance k given by z_ij(k) (the i-th sensor on the j-th node at time k). The true value of a measured physical variable is u_ij(k), and a faulty measurement of the i-th sensor on the j-th node at time instance k is \hat{z}_{ij}(k). There are two general categories of fault-detection techniques: data-centric and system-centric [1]. Data-centric techniques focus on a single data stream to identify faults, while system-centric techniques consider the whole system to detect faults. The two types of techniques need not be used exclusively. A computationally inexpensive data-centric technique may be used to identify abnormal behavior in a sensor stream, at which point a more expensive system-centric technique can be implemented that compares the sensor data to related neighboring nodes, verifying whether the abnormal data reflects an abnormal environment or a faulty sensor reading.

Data-Centric Techniques
For each type of fault, there are defining characteristics, e.g., magnitude for an outlier fault or variance for a noise fault. In our approach, we identify a range for that characteristic under normal operating conditions, and we say that a fault occurs any time the characteristic of interest falls outside of the expected range.
The energy constraint in WSNs limits communication and computation complexity, while one of the leading hardware constraints is memory. For a large network, storing every data sample collected by a sensor is infeasible. Therefore, online fault detection algorithms with a sliding data window are preferable ("one-pass algorithms" [35]). A sliding window considers the last W_sl data samples from a sensor data stream. Data samples are stored for the duration of the sliding window and then erased. This method is robust to changes in the "normal" behavior of sensor data.
A less robust but less costly method of defining data norms is to use a learning period. A standard assumption is that sensor data are healthy during the learning period. By analyzing a data stream over the learning period, normal behavior can be defined. The advantage is that normal behavior has to be calculated only once. However, the method is not robust to data streams whose behavior changes with time. For example, the pressure in a pipe may be low during a learning period but rise when gas is pumped into it. Using a sliding window allows algorithms to update the expected behavior of a data stream as its normal behavior changes while still detecting true faults in the data stream.

Outlier Fault
We define an outlier as a single data sample whose value is significantly (as defined by the user) outside of the range defined by previous data samples; see Fig. 3 [36]. To quantify this "no-fault range", we define \bar{Z}_{ij} as an upper bound of expected data (i.e., the highest value we expect from the i-th sensor on the j-th node) and \underline{Z}_{ij} as a lower bound of expected data. The upper and lower bounds are given by

\bar{Z}_{ij} = \max_p z_{ij}(p) + c_{out} \left( \max_p z_{ij}(p) - \min_p z_{ij}(p) \right),
\underline{Z}_{ij} = \min_p z_{ij}(p) - c_{out} \left( \max_p z_{ij}(p) - \min_p z_{ij}(p) \right),

for p = k_0, k_0 + 1, ..., k_0 + W_ln − 1, where k_0 is the first data point in the learning period and W_ln is the number of data points in the learning period. Here c_out is a positive constant that determines the outlier detection sensitivity. For instance, c_out = 0.2 allows the signal to vary 20% of the range above and below the data extrema observed during the learning period. Increasing c_out lowers outlier detection sensitivity, allowing some faulty readings to go undetected while reducing the false alarm rate.
When using a sliding window, \bar{Z}_{ij}(k) and \underline{Z}_{ij}(k) are recalculated periodically. Bounds can be recalculated every time new data is received. For a sliding window of length W_sl, every time new data is received, the sliding window is updated, and the bounds become functions of the current sample time. The running index p in this case is p = k − W_sl, k − W_sl + 1, ..., k − 1. The benefit of using such bounds is that if the range of expected data changes, the bounds will adapt to it. However, frequent updating of the signal bounds increases the computational cost, creating a natural trade-off between the false alarm rate and the computational cost.
We say that an outlier has occurred when the following condition is met:

z_{ij}(k) > \bar{Z}_{ij} \quad \text{or} \quad z_{ij}(k) < \underline{Z}_{ij}.

An example of an outlier is shown in Fig. 3.
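As a concrete illustration, the sliding-window variant of this outlier test can be sketched as follows. This is a minimal Python sketch, not the paper's implementation: the names `w_sl` and `c_out` mirror W_sl and c_out above, and a detected outlier is simply recorded (a production version might also exclude flagged samples from the window).

```python
import collections

def outlier_fault(stream, w_sl=100, c_out=0.2):
    """Flag outlier faults using a sliding window of the last w_sl samples.

    The no-fault range is the window's [min, max] widened above and below
    by c_out times the window's range, as in the bounds defined above.
    """
    window = collections.deque(maxlen=w_sl)
    faults = []
    for k, z in enumerate(stream):
        if len(window) == w_sl:
            lo, hi = min(window), max(window)
            margin = c_out * (hi - lo)
            if z > hi + margin or z < lo - margin:
                faults.append(k)  # sample falls outside the no-fault range
        window.append(z)
    return faults
```

Because the bounds are recomputed as the window slides, the test adapts when the expected range of the data changes, at the cost of recomputing the extrema for every new sample.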

Spike Fault
We use the term spike to refer to a small number (r_s) of data points that rise or fall more rapidly than the data during healthy sensor behavior [37]. Fig. 4 shows a spike fault where the data returns to normal behavior after the spike. To define a spike, we look at the gradient of the data and define the following: \bar{R}_{ij}, the upper bound on the gradient that may be ascribed to healthy data on the i-th sensor at the j-th node; and r_s, the number of successive samples required to have an unhealthy gradient before a spike is declared.
As with outliers, the bound used to determine healthy behavior is based on a learning period W_ln or a sliding window and is defined as

\bar{R}_{ij} = (1 + c_{spk}) \max_p \left| z_{ij}(p) - z_{ij}(p-1) \right|,

for p = k_0 + 1, ..., k_0 + W_ln − 1 in the case of a learning period, or p = k − W_sl + 1, ..., k − 1 in the case of a sliding window; c_spk is a positive constant that determines the detection sensitivity.
When using a sliding window, the gradient bound is a function of sample time k, and it is assumed that data in the sliding window are healthy. The end of the sliding window should be set more than r_s samples away from the current data sample. The number of successive samples, r_s, required for a spike fault is influenced by the phenomena being measured and the sampling rate. Some phenomena, such as light intensity, may be expected to produce signals with sharp gradients [1], while others are expected to change smoothly. For a slowly varying signal, a spike could indicate a fault in the sensor or that the sampling rate is too slow. Regardless, a spike in a data stream is a characteristic of interest, and rapid detection is important. A small r_s will hasten the detection of spikes, which is crucial in time-constrained applications. Conditions for spike detection are given by

z_{ij}(k - m) - z_{ij}(k - m - 1) > \bar{R}_{ij}, \quad m = 0, 1, \ldots, r_s - 1, \qquad (6)

or

z_{ij}(k - m - 1) - z_{ij}(k - m) > \bar{R}_{ij}, \quad m = 0, 1, \ldots, r_s - 1, \qquad (7)

where (6) and (7) represent criteria for upward and downward spikes, respectively. Two separate sets of conditions are needed, instead of a single condition on the absolute value of the gradient, because the spike fault model requires r_s successive points to have abnormally large gradients of the same sign. This also eliminates false positive spike identification when high-frequency noise is present, which produces sensor readings with large gradients in random directions.
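The two-sided spike criterion can be sketched in Python as follows. This is an illustrative sketch only: `r_bound` stands for the learned gradient bound \bar{R}_{ij}, which is assumed to be computed beforehand from a learning period or sliding window.

```python
def spike_fault(stream, r_bound, r_s=3):
    """Detect spike faults: r_s successive samples whose gradients all
    exceed r_bound in the same direction (upward or downward)."""
    faults = []
    up = down = 0
    for k in range(1, len(stream)):
        grad = stream[k] - stream[k - 1]
        up = up + 1 if grad > r_bound else 0       # run of steep upward gradients
        down = down + 1 if -grad > r_bound else 0  # run of steep downward gradients
        if up >= r_s or down >= r_s:
            faults.append(k)
    return faults
```

Keeping separate counters for upward and downward runs mirrors conditions (6) and (7): alternating steep gradients, as produced by high-frequency noise, reset both counters and are not flagged as spikes.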

Variance Fault
The term variance fault describes a set of data whose values tend to differ from the mean of that set by an abnormally large or small amount [38], where the threshold parameters are chosen by the user. Defining the length of the window over which variance is calculated as W_v, and the expected value of the data over that window as \bar{z}_{ij}(p), the variance is given by

V_{ij}(k) = \frac{1}{W_v} \sum_{p=k-W_v+1}^{k} \left( z_{ij}(p) - \bar{z}_{ij}(p) \right)^2.

There are two types of variance faults: low and high variance faults. A low variance fault (also called a stuck-at fault) occurs when the variance of a data stream is abnormally low, i.e., when a signal is stuck at a certain value. A high variance fault occurs when the variance is abnormally high. The model parameters include \underline{V}_{ij} and \bar{V}_{ij}, a lower and an upper bound on signal variance under healthy conditions. Using a learning period, the variance fault thresholds are given by

\underline{V}_{ij} = (1 - c_{var}) \min_p V_{ij}(p), \qquad \bar{V}_{ij} = (1 + c_{var}) \max_p V_{ij}(p),

where p = k_0, k_0 + 1, ..., k_0 + W_ln, the length of the learning period is W_ln, and c_var is a constant that determines the variance sensitivity. Using a sliding window, the index p ranges over the last W_sl points, p = k − W_sl, ..., k − 1. We say that a variance fault has occurred if

V_{ij}(k) < \underline{V}_{ij} \quad \text{or} \quad V_{ij}(k) > \bar{V}_{ij}.

Increasing the learning or sliding window size provides a larger sample size for determining the variance. If the phenomena being measured are slow-varying throughout the learning period, this results in a more accurate representation of the steady-state variance of the system. However, if the phenomena are fast-varying over the learning period, signal changes will be reflected in the healthy variance parameters. A small window size will emphasize short-term imperfections in the system but will also increase false positives. Fig. 5 and Fig. 6 show low variance and high variance faults, respectively. There are several possible causes of variance faults. The term "noise" is often used to describe high-variance faults in signals with particular mathematical attributes. Several types of noise exist, such as white noise (equal power at all frequencies), pink noise (equal power in bandwidths that are proportionally wide), and violet noise (power increasing with frequency). Noise is usually classified by its behavior in the power spectrum, and we use the Fast Fourier Transform (FFT) to detect and isolate a noise fault.
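A moving-window variance check along these lines might look like the following Python sketch. The bounds `v_lo` and `v_hi` stand for the learned thresholds \underline{V}_{ij} and \bar{V}_{ij}; here they are passed in directly rather than learned, and `statistics.pvariance` computes the population variance over each window.

```python
import statistics

def variance_fault(stream, v_lo, v_hi, w_v=20):
    """Flag low/high variance faults over a moving window of length w_v.

    Returns (low_faults, high_faults) as lists of sample indices at which
    the window variance fell below v_lo or rose above v_hi.
    """
    low, high = [], []
    for k in range(w_v, len(stream) + 1):
        v = statistics.pvariance(stream[k - w_v:k])  # variance over the window
        if v < v_lo:
            low.append(k - 1)   # stuck-at behavior
        elif v > v_hi:
            high.append(k - 1)  # abnormally noisy behavior
    return low, high
```

A constant (stuck-at) signal drives the window variance to zero and triggers the low-variance branch, while a strongly fluctuating signal triggers the high-variance branch.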

High-Frequency Noise Fault
Noise classification and quantification are usually done in the frequency domain. We use the FFT to obtain the signal power spectrum (Fig. 7). Here, we describe a method for detecting undesired high-frequency signals in the power spectrum and for distinguishing between unexpected high-frequency useful data and high-frequency noise [39]. Fig. 7 illustrates how the FFT can be used to detect noise in a sensor signal: it shows the power spectrum of a 50 Hz sinusoid with added white noise at different power levels. The uniform effect of white noise over the power spectrum can be observed. Conversely, if a high variance fault is caused by a new high-frequency signal added to the data stream, only a specific high-frequency band will be affected.
We define the window of data on which the FFT is performed as W_fft. The variables that define a high-frequency noise fault include Y_ij, the maximum power level classified as noise in the power spectrum, and ν_ij, the highest frequency of a useful signal with a power spectrum contribution above the noise level Y_ij.
To define what constitutes a meaningful contribution to the power spectrum, we compute the power spectrum of a signal over a range of frequencies from ν = 0 to ν = ν_max. We assume that the highest frequency, ν_max, is significantly higher than any frequency contributing more than noise to the power spectrum; in particular, we assume that ν_max is at least W_pow frequency components higher than any frequency that makes a meaningful contribution. The maximum power level that noise is expected to contribute, Y_ij, is then based on the power spectrum contributions of the last W_pow frequency components in the FFT:

Y_{ij} = (1 + c_{fft}) \max_{\nu} P_{ij}(\nu), \quad \nu \in (\nu_{max} - W_{pow}, \nu_{max}),

where P_ij(ν) is the power spectrum contribution at frequency ν and c_fft is a parameter used to tune the sensitivity of the high-frequency noise model. Any frequency component with a power spectrum contribution at or below Y_ij is considered noise. Any frequency component that contributes a power level above Y_ij is said to have a meaningful contribution to the data stream. Examining the power spectrum of a signal over a healthy learning period allows one to define the highest contributing frequency component under healthy conditions as the highest frequency with a power spectrum contribution above Y_ij:

\nu_{ij} = \max \{ \nu : P_{ij}(\nu) > Y_{ij} \}.

A high-frequency noise fault is then defined as the phenomenon wherein the power spectrum of a signal has a meaningful contribution at a frequency higher than the highest frequency expressed during the learning period:

\exists\, \nu > \nu_{ij} : \; P_{ij}(\nu) > Y_{ij}.
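The noise-ceiling test above can be sketched as follows. This is a dependency-free Python illustration: the naive O(n²) DFT stands in for an FFT (`numpy.fft.rfft` would be used in practice), and `w_pow` and `c_fft` mirror W_pow and c_fft; `power_spectrum` returns bins from ν = 0 up to ν_max = W_fft/2.

```python
import cmath

def power_spectrum(x):
    """Naive DFT power spectrum over the non-negative frequency bins."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t in range(n))) ** 2 / n
            for f in range(n // 2 + 1)]

def hf_noise_fault(learn, data, w_pow=8, c_fft=0.2):
    """Sketch of the high-frequency noise model.

    The noise ceiling Y is (1 + c_fft) times the strongest of the top
    w_pow frequency bins of the healthy learning window; nu is the highest
    healthy bin above Y. A fault is any bin above Y at a frequency > nu.
    """
    p_learn = power_spectrum(learn)
    y = (1 + c_fft) * max(p_learn[-w_pow:])          # noise power ceiling Y_ij
    nu = max((f for f, p in enumerate(p_learn) if p > y), default=0)
    p_data = power_spectrum(data)
    return any(p > y for p in p_data[nu + 1:])       # meaningful power above nu
```

By construction, re-checking the learning window itself never reports a fault, while a new frequency component injected above ν_ij does.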

System-Centric Faults
System-centric techniques use data from multiple sensors in the system to detect faults [21,22]. The following techniques require at least one healthy sensor to be correlated with the faulty sensor. This, in turn, requires that the nodes in the WSN have multiple sensors that measure the same (or related) phenomena, i.e., that nodes are densely deployed to the point of oversampling the phenomena of interest [23].
To determine whether two sensors are sampling related data, we use variograms. Variograms quantify the correlation of phenomena at one point in the system with phenomena at another point. The variogram γ_{ij,l} between sensors on nodes j and l is defined as

\gamma_{ij,l}(k) = \frac{1}{2 W_{vgm}} \sum_{p=k-W_{vgm}+1}^{k} \left( z_{ij}(p) - z_{il}(p) \right)^2,

where W_vgm is the window of data over which the variogram is calculated. In the event of spatial correlation among sensors, the variogram is a function of radius r around sensor j:

\gamma_{ij}(r) = \frac{1}{|\Omega_j(r)|} \sum_{l \in \Omega_j(r)} \gamma_{ij,l}(k),

where r is the radius around sensor j, Ω_j(r) is the set of all neighbors of sensor j within radius r, and |Ω| is the cardinality of the set Ω. A small variogram implies a high correlation among the sensors. We say that if the variogram between a sensor and its neighbors is smaller than some threshold Γ, then the sensor reading is related to its neighbors.
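A minimal Python sketch of this variogram-based relatedness test follows. The names `w_vgm` and `gamma_max` mirror W_vgm and Γ; for simplicity, the spatial averaging over Ω_j(r) is collapsed into a per-neighbor check.

```python
def variogram(zj, zl, w_vgm=None):
    """Empirical variogram between two sensor streams over the last w_vgm
    samples: half the mean squared difference of their readings.
    A small value indicates strongly correlated sensors."""
    w = w_vgm or min(len(zj), len(zl))
    pairs = list(zip(zj[-w:], zl[-w:]))
    return sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))

def related_neighbors(node, neighbors, gamma_max):
    """Return the neighbors whose variogram with `node` falls below the
    relatedness threshold Gamma (gamma_max)."""
    return [name for name, data in neighbors.items()
            if variogram(node, data) < gamma_max]
```

The related-neighbor set produced here is what the system-centric fault checks below (offset, gain, drift) compare a suspect sensor against.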

Offset Fault
An offset fault occurs when sensor data values are offset from the true phenomenon being measured by a constant amount:

\hat{z}_{ij}(k) = f\left( u_{ij}(k) \right) + \beta_0,

where the function f represents a nonlinear sensor model, u_ij(k) is the true value being measured by the i-th sensor on the j-th node, and β_0 is the offset. Fig. 8 shows an example of an offset fault. Determining the offset value usually requires either a ground truth value for the sensor readings or a precise sensor model f; see [27]. We instead assume that the sensor network is distributed densely enough that any sensor has at least one related neighbor. An offset is then detected as the steady-state condition of a constant difference between the readings of a sensor and the readings of its related neighbors.
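One way to realize this neighbor-based offset check is sketched below. This is illustrative Python under the stated density assumption: `neighbor_mean` is the mean reading of the sensor's related neighbors, and the steadiness criterion using `c_off` is a hypothetical consistency check introduced here for illustration, not the paper's exact formulation.

```python
def offset_fault(sensor, neighbor_mean, w=20, c_off=0.5):
    """Report a steady-state offset between a sensor and its related
    neighbors over the last w samples, or None if no offset is present.

    The candidate offset is the mean sensor-vs-neighbor difference; it is
    accepted only if each difference stays within c_off of that mean
    (relative to its magnitude), i.e., the difference is roughly constant.
    """
    diffs = [s - m for s, m in zip(sensor[-w:], neighbor_mean[-w:])]
    beta = sum(diffs) / len(diffs)  # candidate offset beta_0
    steady = all(abs(d - beta) <= c_off * max(abs(beta), 1e-9)
                 for d in diffs)    # 1e-9 guards the beta ~ 0 (healthy) case
    if steady and abs(beta) > 0:
        return beta
    return None
```

A healthy sensor whose differences fluctuate around zero fails the steadiness check and returns None, while a genuinely offset sensor returns an estimate of β_0.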

Experimental Results
We gathered data from a wireless sensor network composed of industrial-grade sensors to test our models. Fig. 11 shows the experimental testbed that was used for sensor fault detection, fault information processing, and data storage. The testbed consists of Coremicro® Reconfigurable Embedded Smart Sensor Nodes (CRE-SSNs) [33] capable of ZigBee wireless communication, an NCAP with a ZigBee wireless base station, NCAP software that supports sensor and actuator health monitoring, a database server, and a smartphone running a health monitoring Android application. Each node can read and wirelessly transmit data from up to four sensors simultaneously. The sensors measure their environment and report these measurements to the wireless sensor nodes. The nodes wirelessly transmit the sensor data to the ZigBee wireless base station, which reports its readings to the NCAP, where software processes and records the data.
As an example of healthy sensor behavior, 10 sets of 6400 data samples (each approximately 1 minute long) were recorded. The experiment was carried out under two assumptions: (1) the phenomenon being measured (temperature) was slow-varying throughout all datasets, and (2) the health of the system was constant throughout all datasets. The obtained sensor data are shown in Fig. 12.
The characteristics of interest, other than the raw sensor data, are the gradient, variance, offset, gain, and drift of each sensor. Note that Data Set 1 has a spike just before 20 seconds, and this spike in the raw data produces a simultaneous spike in all of the characteristic plots. A single sensor fault may thus result in multiple fault types being flagged.

High-Frequency Noise Parameters
High-frequency noise is modeled similarly to the other data-centric faults; however, the frequency-domain analysis imposes certain constraints. Namely, using the Fast Fourier Transform (FFT) constrains the window size to a power of two. While a dataset may be "padded" with zeros to bring the total number of points up to the nearest power of two, this padding skews the noise levels of the power spectrum.
The high-frequency noise fault model has three main parameters: the threshold parameter c_fft, the window of data used, W_fft, and the window of frequencies at the upper end of the power spectrum assumed to be purely attributable to noise, W_pow. As with the other threshold parameters, increasing the value of c_fft is expected to decrease the number of high-frequency noise faults detected for a given dataset. Using a larger W_fft is expected to yield a better approximation of a data stream's behavior and is also expected to correspond with a lower number of faults detected. Finally, using a larger W_pow assumes that a larger range of frequencies can be attributed to noise during the learning period. Because the datasets under consideration approximate a slow-varying phenomenon (assuming the temperature was constant during data collection), this results in a better approximation of the typical noise level for a dataset. Therefore, we expect larger values of W_pow to correspond with fewer high-frequency noise faults detected in a given dataset. Fig. 17, Fig. 18, and Fig. 19 show the effects of varying c_fft, W_fft, and W_pow, respectively; W_fft was varied over a wide range of values (4 to 1024).

Varying the threshold parameter for each fault has the expected effect. Fig. 20 shows the number of detected faults as the threshold parameters are changed. Fig. 21 shows the number of detected faults versus the consistency parameter. Note that for offset and drift faults, a higher consistency parameter results in more faults being detected, as expected. For gain faults, however, the consistency parameter has a different effect. No gain faults are detected when the consistency parameter is set to zero, but for any value higher than zero, the same number of gain faults is detected. This behavior was observed regardless of the value of the threshold parameter c_gn. Such step behavior suggests that for gain faults, the threshold parameter has a much greater effect on the sensitivity than the consistency parameter. While these models have shown satisfactory performance on our testbed, there is always a trade-off between sensitivity and false positives in fault detection. There is no universal value of the fault detection parameters presented here that would work for every application. Setting proper values requires the operator's experience and intuition about the specific system. Some parameters, such as the upper and lower data limits for outlier faults, can be adjusted online; the window sizes and other sensitivity parameters are set based on experience. Future research will study intelligent learning techniques in which algorithms set all fault detection parameters.

Conclusion
We have developed models for common faults in WSNs. It is important to be able to differentiate between true data changes and sensor faults to have a full understanding of a system's health. The models developed here are intended as a library of common fault types to be considered when gathering and analyzing data from a WSN. The developed models are designed to work for a wide variety of sensors and applications by tuning and adjusting model parameters. The effects of adjusting these parameters for an industrial temperature sensor were explored, and the sensitivity of the models was evaluated using experimental testbed results. While the fault models were evaluated on a WSN, the methods presented here apply to any distributed electro-mechanical system consisting of multiple sensors, actuators, and networks; the network communication does not need to be wireless. Future work will create a corresponding library of models for common equipment failures in a system. The models of equipment failures can then be compared to the sensor fault models developed here, and algorithms that differentiate between the two can be developed. Differentiating between true physical system anomalies and sensor faults will allow for more precise monitoring of a system's health. Using novel machine-learning methods to learn healthy sensor data and later detect faults is another future research topic. The learning methods can be used to learn the system model or just the data, and machine learning techniques can also be considered as a tool to automatically adjust the various fault detection parameters that are currently set manually.

Fig. 7. Time domain and power spectrum for a sinusoid with a small amount of white noise (top) and a large amount of white noise (bottom).

Fig. 9. The green signal has a different gain than the blue signal.

Fig. 10. The offset between the signals is steadily increasing, resulting in a drift fault.

Fig. 13. Number of data-centric faults detected for various values of c_xxx with W_ln = 100: (i) top left: outliers; (ii) top right: spikes; (iii) bottom left: low variance; and (iv) bottom right: high variance.

Fig. 14. Number of data-centric faults detected for various values of c_xxx with W_sl = 100: (i) top left: outliers; (ii) top right: spikes; (iii) bottom left: low variance; and (iv) bottom right: high variance.

Fig. 17. Number of high-frequency noise faults detected for various values of c_fft with W_fft = 256 and W_pow = 32.

Fig. 18. Number of high-frequency noise faults detected for various values of W_fft with c_fft = 0.2 and W_pow = W_fft/4.

Fig. 19. Number of high-frequency noise faults detected for various values of W_pow with W_fft = 512 and c_fft = 0.2.

Table 1. Common fault detection techniques.

Table 2. Fault detection in sensor networks research results.