My aim in this article is to highlight the growing importance of IoT devices across different fields, examine the disparity in the data generated by different classes of devices, and present techniques for making that data interpretable in all cases.
This article is structured around the following areas:
Intrusion detection: As IoT devices connect to the internet, they remain vulnerable to security-related attacks. Such attacks include denial-of-service (DoS) and distributed denial-of-service (DDoS) attacks, which inflict heavy damage on IoT services and smart-environment applications.
Fraud detection: IoT networks remain susceptible to the theft of credit card information, bank account details, or other sensitive information during logins or online payments.
Data leakage: Sensitive information from databases, file servers, and other information sources can leak to external entities. Such leakage not only results in loss of information but also creates a threat in which the attacker can destroy confidential information in the system. Proper encryption mechanisms can prevent such leaks.
Anomalies in an IoT system can be detected according to their type: point-wise, contextual, or collective.
Point-wise anomalies from individual devices are detected with stochastic descriptors and are used when the evolution of the series is not predictable.
Collective anomalies are detected from typical time-series patterns such as shapes, recurring patterns, or residuals across multiple IoT devices.
Contextual anomalies are detected when prior information or context, such as the day of the week, is taken into account.
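As a minimal sketch of point-wise detection, the snippet below flags individual readings that deviate strongly from a rolling baseline. The column name avg_rss12, the window of 10 samples, and the three-standard-deviation threshold are illustrative assumptions, not part of the original text.
import pandas as pd
#Point-wise anomaly detection on a single IoT sensor stream (illustrative sketch)
def point_anomalies(series: pd.Series, window: int = 10, n_std: float = 3.0) -> pd.Series:
    rolling_mean = series.rolling(window).mean()
    rolling_std = series.rolling(window).std()
    #A reading is flagged when it deviates more than n_std from the rolling baseline
    return (series - rolling_mean).abs() > n_std * rolling_std
#Example usage (hypothetical file and column):
#df = pd.read_csv("sensor_readings.csv")
#flags = point_anomalies(df["avg_rss12"])
#print(df[flags])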
Resampling: Imbalanced IoT datasets can be processed with various sampling strategies, such as under-sampling or over-sampling. Both measures aim to improve accuracy on the minority class, either by removing samples from the majority class (under-sampling) or by adding more examples of the minority class (over-sampling), but they can also cause overfitting (over-sampling) or loss of information (under-sampling).
Random under-sampling: This mechanism down-samples the majority class by randomly removing observations from it. Its main purpose is to remove the dominance of the majority class before combining the majority and minority samples. This is achieved by resampling without replacement until the number of majority-class samples equals the number of minority-class samples. Another under-sampling method generates cluster centroids to group or condense similar data.
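A minimal sketch of both under-sampling approaches with the imbalanced-learn library follows; X and y are placeholders for the feature matrix and label vector, and the parameters are illustrative rather than taken from this article.
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids
#X is the feature matrix and y the labels (placeholders for illustration)
rus = RandomUnderSampler(random_state=42)
X_rus, y_rus = rus.fit_resample(X, y)
print('Random under-sampling %s' % Counter(y_rus))
#Cluster-centroid under-sampling condenses the majority class into synthetic centroids
cc = ClusterCentroids(random_state=42)
X_cc, y_cc = cc.fit_resample(X, y)
print('Cluster-centroid under-sampling %s' % Counter(y_cc))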
Random over-sampling: This mechanism up-samples the minority class by randomly duplicating observations to strengthen its impact. The minority class is resampled with replacement and the new samples are combined with the majority class. Over-sampling can also be achieved by generating new synthetic minority-class data through interpolation, using popular techniques such as SMOTE and ADASYN.
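For completeness, here is a short sketch of plain random over-sampling with imbalanced-learn; SMOTE and ADASYN are described and demonstrated in the remainder of the article. X and y are again placeholders for the feature matrix and labels.
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
#X is the feature matrix and y the labels (placeholders for illustration)
ros = RandomOverSampler(random_state=42)
X_ros, y_ros = ros.fit_resample(X, y)
print('Random over-sampling %s' % Counter(y_ros))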
ADASYN generates samples next to original samples that are wrongly classified by a k-nearest-neighbors classifier. SMOTE connects existing minority instances and generates synthetic samples anywhere between an existing minority instance and its k closest minority-class neighbors; in other words, the interpolated new points lie between the marginal outliers and inliers.
The above figure illustrates Regular SMOTE
The SMOTE algorithm comes in three flavors. Regular SMOTE randomly generates samples without any restriction. Borderline-SMOTE offers two variants, "borderline-1" and "borderline-2", and classifies each minority sample into one of three categories:
a. All nearest neighbors of the sample belong to a different class (noise).
b. At least half of the nearest neighbors belong to a different class (in danger).
c. All nearest neighbors of the sample belong to the same class (safe).
"Borderline-1" generates synthetic samples from neighbors of the same class, while "Borderline-2" can generate synthetic samples from neighbors of any other class. "Borderline-1" and "Borderline-2" work as follows:
The above figure illustrates SMOTE Borderline-1
For every sample in the minority class, its m nearest neighbors (NNs) are selected. Minority samples surrounded by majority samples, i.e. whose m nearest neighbors all belong to the majority class, are considered noisy samples. Samples with at most m/2 NNs from the majority class are considered safe. Both safe and noisy samples are excluded from the synthetic sample generation process. Samples for which the number of NNs from the majority class is greater than m/2 (but less than m) are considered in danger (near the borderline) and are used to generate synthetic samples. As highlighted in the "Borderline-1" and "Borderline-2" figures, synthetic samples are created both from the nearest minority neighbors and from the nearest majority neighbors. However, in "Borderline-2" the synthetic samples created from majority neighbors are placed closer to the minority samples than those created from minority neighbors.
The above figure illustrates SMOTE Borderline-2
The third type of SMOTE, SVM SMOTE, uses the proximity of samples of different types, or the classification boundary of an SVM classifier (its C parameter), to generate samples. All SMOTE variants define "m_neighbors" to determine how a sample is generated and whether it falls into category a., b., or c. above. ADASYN generates synthetic samples around any data point in proportion to the number of samples in its neighborhood that do not belong to the same class.
Combining under-sampling and over-sampling with imbalanced-learn: Over-sampling techniques that generate new synthetic data with SMOTE can be pipelined with under-sampling techniques to clean and condense the generated data. Such combination algorithms are SMOTE-Tomek and SMOTE-ENN.
The above figure illustrates SMOTETomek
A Tomek link is a connection between a pair of neighboring samples that belong to different classes. Under-sampling is applied by removing either both samples of the Tomek link or only the majority-class sample from the over-sampled dataset. In Edited Nearest Neighbours (ENN), a majority-class instance whose class differs from that of the majority of its k nearest neighbors is removed.
The above figure illustrates SMOTEENN
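A minimal sketch of both combination methods with imbalanced-learn is shown below; X and y are placeholder feature and label arrays, and the parameters are illustrative.
from collections import Counter
from imblearn.combine import SMOTETomek, SMOTEENN
#X is the feature matrix and y the labels (placeholders for illustration)
smt = SMOTETomek(random_state=42)
X_smt, y_smt = smt.fit_resample(X, y)
print('SMOTE-Tomek resampled %s' % Counter(y_smt))
sme = SMOTEENN(random_state=42)
X_sme, y_sme = sme.fit_resample(X, y)
print('SMOTE-ENN resampled %s' % Counter(y_sme))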
How to use over-sampling methods:
#Details on the columns of dataframe sma_avg are explained in a later code snippet
from collections import Counter
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN
#sma_avg holds the moving-average features; y is the label vector aligned with them
X = sma_avg.iloc[:, [0, 2, 4]].dropna().values
print('Original dataset shape %s' % Counter(y))
#Regular SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
#Borderline-1 SMOTE
sm = BorderlineSMOTE(random_state=42, kind='borderline-1', k_neighbors=50)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
#Borderline-2 SMOTE
sm = BorderlineSMOTE(random_state=42, kind='borderline-2', k_neighbors=50)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
#SVM SMOTE
sm = SVMSMOTE(random_state=42, k_neighbors=50)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
#ADASYN
sm = ADASYN(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
kNNo (kNN outlier detection) is a distance-based unsupervised outlier detection technique; a point's score is computed from the distance to its kth nearest neighbor in the data set.
LOF (Local Outlier Factor) is a density-based unsupervised outlier detection technique. It measures the ratio of the average density of the k nearest neighbors of a data point to the local density of the point itself.
CBLOF (Cluster-Based Local Outlier Factor) is a cluster-based unsupervised outlier detection technique.
SSAD is a semi-supervised anomaly detection approach based on a one-class SVM.
SSDO is a semi-supervised anomaly detection algorithm that uses constrained clustering along with active learning.
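A minimal sketch of the first two unsupervised detectors with scikit-learn follows; the choice of k = 10 and the 1% contamination rate are illustrative assumptions, and feature_values is built the same way as in the constrained-clustering snippet later in this article.
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor
#Three moving-average features of the IoT sensors (sma_avg is defined in a later snippet)
feature_values = sma_avg.iloc[:, [0, 2, 4]].dropna().values
#kNN outlier score: distance of each point to its kth nearest neighbor
k = 10
nn = NearestNeighbors(n_neighbors=k + 1)  #+1 because each point is its own nearest neighbor
nn.fit(feature_values)
distances, _ = nn.kneighbors(feature_values)
knn_scores = distances[:, -1]  #larger distance -> more likely an outlier
#LOF: ratio of the neighbors' average local density to the point's own local density
lof = LocalOutlierFactor(n_neighbors=k, contamination=0.01)
lof_labels = lof.fit_predict(feature_values)  #-1 marks points flagged as outliers
lof_scores = -lof.negative_outlier_factor_   #higher score -> lower relative density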
The above figure illustrates SSDO (Semi-Supervised Detection of Outliers).
SSDO (Semi-Supervised Detection of Outliers) works on data F = {f0, . . . , fn}, with each example fi represented in the standard feature-vector format. The algorithm works as follows:
1. It assigns an anomaly score to each example based on the example's position with respect to the data distribution, found using constraint-based clustering.
The above figure illustrates the anomaly score detector on individual features
2. Instances with known labels propagate their labels to other neighboring instances, so each instance's anomaly score is updated based on the known labels of nearby instances.
3. The algorithm employs an active learning strategy to acquire more labels, and the process is repeated whenever additional data or labels are provided.
Semi-supervised active learning strategies also incorporate user feedback and feedback from external authentic sources to assign correct labels on the learned decision boundaries. The algorithm further uses constrained clustering to identify anomalies by means of distance metrics and cluster size: the larger the distance and the smaller the cluster, the greater the probability that a point is anomalous. The metrics used are:
Distance of the instance from its cluster centroid.
Distance of a cluster centroid from the other cluster centroids.
Size of the cluster.
The clustering reduces to standard k-means when no constraints are provided. With this clustering, each instance f ∈ F is assigned an anomaly score derived from these metrics; a simple score along these lines is sketched after the clustering code below.
#Moving average on signal strength of IoT sensors
import pandas as pd
df = pd.read_csv(file)  #the csv has columns: time, avg_rss12, var_rss12, avg_rss13, var_rss13, avg_rss23, var_rss23
#10-sample simple moving average of each signal-strength feature
sma_avg = pd.DataFrame()
for col in ["avg_rss12", "var_rss12", "avg_rss13", "var_rss13", "avg_rss23", "var_rss23"]:
    df["ma_" + col] = df[col].rolling(window=10).mean()
    sma_avg["ma_" + col] = df["ma_" + col].dropna()
#Constrained clustering on 3 features of IoT
import numpy as np
import matplotlib.pyplot as plt
from copkmeans.cop_kmeans import cop_kmeans
#Must-link and cannot-link constraints between sample indices
must_link = [(0, 10), (0, 20), (0, 30)]
cannot_link = [(1, 10), (2, 10), (3, 10)]
feature_values = sma_avg.iloc[:, [0, 2, 4]].dropna().values
clusters, centers = cop_kmeans(dataset=feature_values, k=5, ml=must_link, cl=cannot_link)
cluster_centers = np.array(centers)
#3D scatter plot of the clustered sensor features and their centroids
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(feature_values[:, 0], feature_values[:, 1], feature_values[:, 2], c=clusters, cmap='viridis', edgecolor='k')
ax.scatter(cluster_centers[:, 0], cluster_centers[:, 1], cluster_centers[:, 2], marker='*', c='r', s=1000, label='Centroid', linewidths=15)
ax.autoscale(enable=True, axis='x', tight=True)
The above figure illustrates constrained clustering with 5 clusters of IoT sensors on the human body
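The exact SSDO score formula is not reproduced here; the sketch below computes a simple illustrative score from the constrained-clustering output above, combining distance to the assigned centroid with relative cluster size, as an approximation of the idea described earlier rather than the algorithm's own definition.
import numpy as np
#Illustrative anomaly score from the constrained-clustering output (not the exact SSDO score)
labels = np.array(clusters)
cluster_sizes = np.array([np.sum(labels == c) for c in range(len(cluster_centers))])
#Distance of each instance to its assigned cluster centroid
dist_to_centroid = np.linalg.norm(feature_values - cluster_centers[labels], axis=1)
#Larger distance and smaller cluster -> higher anomaly score
raw_score = dist_to_centroid / (cluster_sizes[labels] / len(labels))
anomaly_score = (raw_score - raw_score.min()) / (raw_score.max() - raw_score.min())
#The highest-scoring instances are the most likely anomalies
top_anomalies = np.argsort(anomaly_score)[-10:]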
References: