Preface
In this article, we screen outliers using the data as a whole (across all dimensions), relying on methods we define ourselves rather than calling ready-made modules, and finally display the results visually.
Writing is not easy. If you find this article useful, please like, bookmark and follow. Thank you!
1: Introduction to clustering algorithms
General introduction to clustering algorithms
Unlike classification, clustering groups samples according to data similarity without any predefined categories. Depending on whether the original samples carry label information, it can be divided into semi-supervised clustering and unsupervised clustering.
 Semi-supervised clustering: makes full use of the known label information and then groups the samples whose labels are unknown
 Unsupervised clustering: groups samples by similarity without any known label information
Common clustering methods include:
 Partition clustering: KMeans(++), KMedoids, CLARANS, etc.
 Hierarchical clustering: BIRCH, CURE, agglomerative clustering, etc.
 Density-based methods: DBSCAN, DENCLUE, OPTICS, etc.
 Grid-based methods: STING, CLIQUE, etc.
 Other methods: statistical methods, deep learning, etc.
Of course, there are many more methods. If you can understand and use a few of them in practice, you can already solve many problems.
There are many ways to evaluate clustering quality, among which the simplest and most practical is the purity method. There are also more advanced measures, such as the silhouette coefficient, the elbow method and the Calinski-Harabasz (CH) value.

The purity method simply computes the proportion of correctly clustered objects in the total number, i.e.
$purity(X,Y)=\frac{1}{n}\sum_{i=1}^{k}|x_i\cap y_i|$,
where $x=(x_1,x_2,...,x_k)$ is the set of clusters produced by the algorithm, $y=(y_1,y_2,...,y_k)$ represents the original grouping of the data set to be clustered, $|x_i\cap y_i|$ is the number of objects that cluster $x_i$ shares with its corresponding original class $y_i$, and $n$ represents the total number of clustered objects.
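As a small illustration of the purity formula above, here is a minimal sketch. It assumes the common convention that each cluster is matched to the true class it overlaps most; the function name `purity` and the toy labels are made up for this example and are not part of the original article.

```python
import numpy as np

def purity(cluster_labels, true_labels):
    """Fraction of objects that fall in the majority true class of their cluster."""
    cluster_labels = np.asarray(cluster_labels)
    true_labels = np.asarray(true_labels)
    correct = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        # Size of the largest overlap between this cluster and any true class
        _, counts = np.unique(members, return_counts=True)
        correct += counts.max()
    return correct / len(true_labels)

# Toy example (made-up labels): 6 objects in 2 clusters.
# Cluster 0 holds a, a, b and cluster 1 holds b, b, b,
# so 2 + 3 = 5 of the 6 objects are counted as correct.
clusters = [0, 0, 0, 1, 1, 1]
truth = ['a', 'a', 'b', 'b', 'b', 'b']
print(purity(clusters, truth))
```

Purity needs the true labels, so it is mainly useful for checking a clustering method on data whose grouping is already known.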
Some of the advanced measures will be introduced later and are not described further here.
 Implementing Kmeans from scratch
This article mainly uses the Kmeans++ algorithm to find multiple cluster centers of the data and then uses them to find the outliers, so we first briefly review the underlying principle of the Kmeans algorithm.
 1. Set the number of clusters k and randomly select the initial points that serve as the cluster centers,
 2. Define the distance between each sample and each center (the dimension of each sample point must match the dimension of each center point so that matrix operations can be performed), and assign each sample to the class of its nearest center,
 3. Recalculate the center values and return to step 2 until the class of every sample no longer changes (of course, we can also set a maximum number of iterations at the beginning as an exit condition),
 4. Output the final center values and each class.
According to the above steps, we implement the Kmeans algorithm:
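 Define the function that calculates the distance between a sample and a center point
import numpy as np

def distance(vA, vB):
    # Minkowski distance of order 4 between sample vA and center vB
    # (vB is a 1 x n matrix row, so it is flattened to a 1-D array first)
    d_t = vA - np.array(vB)[0]
    dist = np.power(sum(np.power(d_t, 4)), 1 / 4)
    # Alternative formula: squared Euclidean distance
    # dist = np.dot((vA - vB), (vA - vB).T)[0, 0]
    return dist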
 Define the function that randomly initializes the cluster centers
def init_randCent(data, K):
    n = np.shape(data)[1]  # number of attributes
    cent_value = np.asmatrix(np.zeros((K, n)))
    for j in range(n):
        min_t = np.min(data[:, j])
        range_t = np.max(data[:, j]) - min_t
        # Random initialization between the minimum and maximum of each attribute
        cent_value[:, j] = min_t * np.asmatrix(np.ones((K, 1))) + np.random.rand(K, 1) * range_t
    return cent_value
 The main logic is written recursively in this article
import numpy as np
import pandas as pd

def K_means(data, K, cent_value):  # K is the number of clusters we set
    m = np.shape(data)[0]
    n = np.shape(data)[1]
    # Column 0: cluster index, column 1: distance to that cluster's center.
    # Initialization: by default, all samples belong to the first class (index 0)
    subcenter = np.asmatrix(np.zeros((m, 2)))
    count = 0
    ls_r = []

    def main_process(tag, count, ls_r):  # recursive function
        # tag is the trigger: it is initially True and stays True
        # as long as some sample changed its class in the last pass
        if tag == False:
            # Converged: print the results and stop
            cats = ['Category %d' % (j + 1) for j in range(K)]
            df1 = pd.DataFrame(subcenter, columns=['Clustering category', 'Distance from center(0 Represents the initialization distance, which has never changed for this class)'])
            df2 = pd.DataFrame(cent_value, index=cats)
            ls_rr = []
            for i in ls_r:
                ls_rr.append(round(i / sum(ls_r), 3))
            df3 = pd.DataFrame([ls_r, ls_rr], columns=cats, index=['Number of samples', 'Proportion%']).T
            df4 = pd.concat([df2, df3], axis=1)
            print(df1)
            print(df4)
            print('\033[1;38m Total iterations:%s\033[0;m' % count)
            return 'END!!'
        else:
            ls_r = []
            tag = False
            count = count + 1
            for i in range(m):
                minDist = np.inf
                minIndex = 0
                for j in range(K):
                    # Calculate the distance between sample i and each cluster center
                    dist = distance(data[i, :], cent_value[j, :])
                    if dist < minDist:
                        minDist = dist
                        minIndex = j
                # Determine whether the assignment needs to change
                if subcenter[i, 0] != minIndex:
                    subcenter[i, :] = np.asmatrix([minIndex, minDist])
                    tag = True
            # Recalculate the cluster centers
            for j in range(K):
                sum_all = np.asmatrix(np.zeros((1, n)))
                r = 0
                for i in range(m):
                    if subcenter[i, 0] == j:
                        # Add up the samples that belong to the jth category
                        sum_all = sum_all + data[i, :]
                        r = r + 1
                if r == 0:
                    # Empty class: keep its previous center (NumPy division by
                    # zero returns nan instead of raising ZeroDivisionError)
                    print('r is zero!')
                else:
                    for k in range(n):
                        # Each feature of the center is the mean over the class
                        cent_value[j, k] = sum_all[0, k] / r
                ls_r.append(r)
            return main_process(tag, count, ls_r)

    main_process(True, count, ls_r)  # call the recursive function

K_means(data, 3, init_randCent(data, 3))
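PS: I have debugged the above code, so it can be copied and used directly. The data comes from a data frame, but note that the functions index it as data[i, :], so a pandas DataFrame should first be converted to a 2-D NumPy array (for example with df.to_numpy()).
Based on the data from the previous article, "Using the advanced 3-sigma criterion to deal with practical problems", we test the results: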
     Clustering category  Shortest distance (0 represents the initialization distance, which has never changed for this class)
0                    2.0  0.434378
1                    2.0  0.301428
2                    1.0  0.192687
3                    2.0  0.427841
4                    2.0  0.233552
..                   ...       ...
935                  2.0  0.347604
936                  2.0  0.334781
937                  2.0  0.299281
938                  2.0  0.378599
939                  2.0  0.433147
[940 rows x 2 columns]

                   0         1         2  Number of samples  Proportion%
Category 1  0.505249  0.272886  0.258199               37.0        0.039
Category 2  0.120483  0.553747  0.151894              325.0        0.346
Category 3  0.125011  0.123172  0.115365              578.0        0.615
Total number of iterations: 10
We can standardize the data before clustering to reduce the differences caused by orders of magnitude. Otherwise, an attribute with very large values would dominate the distance calculation completely. Min-max standardization can be used for this, i.e. $(V-\min(V))/(\max(V)-\min(V))$.
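A minimal sketch of min-max standardization applied column by column is shown here. The function name `min_max_scale`, the handling of constant columns and the sample values are assumptions made for illustration only.

```python
import numpy as np

def min_max_scale(data):
    """Scale every column of a 2-D array to the interval [0, 1]."""
    data = np.asarray(data, dtype=float)
    col_min = data.min(axis=0)
    col_range = data.max(axis=0) - col_min
    # Assumption: a constant column (range 0) is mapped to 0
    # instead of dividing by zero
    col_range[col_range == 0] = 1.0
    return (data - col_min) / col_range

# Made-up data: the second attribute is two orders of magnitude larger
raw = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [3.0, 1000.0]])
scaled = min_max_scale(raw)
print(scaled)
```

After scaling, both attributes lie in [0, 1], so neither dominates the distance between a sample and a cluster center.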
In the result above, the values to the right of the category column (columns 0, 1 and 2) are the final cluster centers of the data set. From the underlying code we know that the initial centers are drawn randomly each time, so the clustering results differ from run to run, and the iteration may stop at a local optimum. Even though the data is standardized, some clustering results are still skewed. In this case, a Monte Carlo approach can be used, i.e. repeating the clustering many times and taking the expectation. This is not discussed further in this article.
2: Implementing the Kmeans++ algorithm
Next, we look at how the Kmeans++ algorithm works, implement it from scratch, and finally use it for systematic outlier filtering.
 How the Kmeans++ algorithm works
The Kmeans++ algorithm mainly changes how the initial cluster centers are chosen. Leaving outliers aside, we want the initial center points to be as far apart from each other as possible. The specific steps are as follows:
 1. First, determine the number k of cluster centers and randomly select the first cluster center from the input data,
 2. For each point in the data set, calculate its distance to the nearest existing cluster center. We denote it by $D(x_{exist})$,
 3. Then select a new data point as a new cluster center. The rule is that a point with a larger $D(x_{exist})$ is more likely to be selected,
 4. Repeat steps 2 and 3 until $k$ centers have been selected. The remaining operations are the same as in ordinary Kmeans.
Note that outliers affect the selection of new cluster centers: for an outlier, $D(x_{exist})$ is very large. There are two ways to deal with this situation: 1. Handle the outliers before modeling, which does not work if we want to find the outliers through clustering in the first place; 2. In the set of distance values $D(x)=\{D(x_{1exist}),D(x_{2exist}),...,D(x_{nexist})\}$ (where $n$ is the number of data points), do not directly pick the point with the maximum distance, but a point with a larger distance. "Larger" here is decided by the "area probability idea": if all these distances are laid end to end and a random point is dropped on the resulting segment, the probability that it falls within $D(x_{iexist})$ is larger when $D(x_{iexist})$ itself is larger.
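The "area probability idea" can be sketched on its own as follows. This is only an illustration: the function name `pick_by_distance` and the distance values are made up, and the selection uses a cumulative sum rather than subtracting the distances one by one. Like the description above, the sketch weights each point by the distance itself, whereas the original Kmeans++ formulation uses the squared distance.

```python
import numpy as np

def pick_by_distance(distances, rng):
    """Return an index chosen with probability proportional to its distance."""
    distances = np.asarray(distances, dtype=float)
    # Lay the distances end to end and drop a random point on the total length
    threshold = rng.random() * distances.sum()
    cumulative = np.cumsum(distances)
    # First segment whose right end lies beyond the random point
    idx = int(np.searchsorted(cumulative, threshold, side='right'))
    return min(idx, len(distances) - 1)  # guard against floating-point round-off

rng = np.random.default_rng(0)
# Made-up distances; the last point is an existing center (distance 0)
d = [0.5, 1.0, 1.5, 3.0, 0.0]
picks = [pick_by_distance(d, rng) for _ in range(10000)]
freq = np.bincount(picks, minlength=len(d)) / len(picks)
print(freq)  # relative frequencies, roughly proportional to d
```

The point with the largest distance is chosen most often but not always, and a point that is already a center is never chosen again.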
We implement the above process in code:
init_value = np.inf  # first defined as infinity

def nest_dist(point, clc):
    # Distance from a sample to its nearest already-chosen cluster center
    min_dist = init_value
    m = np.shape(clc)[0]  # number of currently initialized cluster centers
    for i in range(m):
        d = distance(point, clc[i, :])  # distance to each cluster center
        # Keep the shortest distance
        if min_dist > d:
            min_dist = d
    return min_dist

def get_cent_value(data, K):
    m, n = np.shape(data)  # m is the number of samples
    clc = np.asmatrix(np.zeros((K, n)))
    # Randomly select a sample point as the first cluster center
    index = np.random.randint(0, m)
    clc[0, :] = np.copy(data[index, :])
    # Initialize a distance sequence
    d = [0 for i in range(m)]
    for i in range(1, K):
        sum_all = 0
        for j in range(m):
            # Distance between each sample and its nearest chosen center
            d[j] = nest_dist(data[j, :], clc[0:i, :])
            # Add up all shortest distances
            sum_all = sum_all + d[j]
        # Take a random point on the total length of all shortest distances
        sum_all = sum_all * np.random.random()
        # Subtract the distances in turn; the sample at which the remainder
        # drops to zero or below becomes the next center, so samples with a
        # larger distance are more likely to be chosen
        for j, dist in enumerate(d):
            sum_all = sum_all - dist
            if sum_all > 0:
                continue
            else:
                clc[i] = np.copy(data[j, :])
                break
    return clc

K_means(data, 3, get_cent_value(data, 3))
Let's look at the running results:
     Cluster category  Center distance (0 represents the initialization distance, which has never changed for this class)
0                 2.0  0.127224
1                 2.0  0.172709
2                 0.0  0.000000
3                 1.0  0.208919
4                 1.0  0.157283
..                ...       ...
935               2.0  0.136519
936               2.0  0.137244
937               1.0  0.155543
938               2.0  0.194231
939               2.0  0.126017
[940 rows x 2 columns]

                   0         1         2  Number of samples  Proportion%
Category 1  0.118565  0.621701  0.162584              231.0        0.246
Category 2  0.188617  0.313694  0.151047              265.0        0.282
Category 3  0.133774  0.077739  0.108143              444.0        0.472
Total number of iterations: 14
Again the data is standardized first, and then the Kmeans++ algorithm is applied. The clustering results are relatively balanced across the classes, and the results of multiple runs are relatively stable. This is mainly because Kmeans++ chooses the initial center points more carefully at the start of clustering, keeping the cluster centers as far apart as possible, instead of leaving the initialization to chance and ending up in a poor local optimum.
3: Data outlier filtering based on the Kmeans++ algorithm
 This article implements the Kmeans++ algorithm from scratch, so of course it should be put to use.
Based on the previous article, we only need to slightly modify the calculation of the p-order norm of each row and put it into a for loop, because we now need to compute the norm of each class's data relative to the center of that class and finally merge all the results. The code is relatively simple, so we won't describe it in detail here. In addition, the number of clusters we choose is 3.
The advanced version of the 3-sigma criterion is used to deal with the actual data and the advanced processing of outliers
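Since the screening code itself is not shown, the following is only a rough sketch of the per-class loop described above. It assumes that the class label of every sample and the final cluster centers are available as arrays (the K_means function above only prints them), and it uses a plain 3-sigma threshold on the distances within each class as a stand-in for the criterion of the previous article. The function name `screen_outliers` and the test data are hypothetical.

```python
import numpy as np

def screen_outliers(data, labels, centers, p=4, n_sigma=3.0):
    """Flag samples whose distance to their own cluster center is unusually large."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    is_outlier = np.zeros(len(data), dtype=bool)
    for k in range(len(centers)):
        idx = np.where(labels == k)[0]
        if idx.size == 0:
            continue
        # p-order norm of each row relative to the center of its class
        dist = np.linalg.norm(data[idx] - centers[k], ord=p, axis=1)
        # Assumed rule: plain 3-sigma threshold on the distances of this class
        limit = dist.mean() + n_sigma * dist.std()
        # Merge the result of this class into the overall mask
        is_outlier[idx[dist > limit]] = True
    return is_outlier

# Made-up data: two small rings of points plus one far point in class 0
angles = np.linspace(0.0, 2.0 * np.pi, 20, endpoint=False)
ring = 0.1 * np.column_stack([np.cos(angles), np.sin(angles)])
data = np.vstack([ring, [[3.0, 3.0]], ring + 5.0])
labels = np.array([0] * 21 + [1] * 20)
centers = np.array([data[labels == k].mean(axis=0) for k in range(2)])
flags = screen_outliers(data, labels, centers)
print(np.where(flags)[0])  # indices of the flagged samples
```

In this toy data the far point sits at index 20. Because each class gets its own threshold, a tight class and a loose class are not judged by the same limit.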
4: Summary
 Starting from the underlying principle of the algorithm, this article implements the Kmeans++ algorithm and finally applies it to the screening of outliers. In theory, the Kmeans++ algorithm is better than the ordinary Kmeans algorithm because its initial centers are chosen more carefully.
 Nevertheless, one important problem remains unsolved: when using a clustering algorithm (whether hierarchical clustering, partition clustering or another kind), we do not know in advance how many classes are best. Generally, the number of clusters is determined "a posteriori", for example with the elbow method or the CH value. At the same time, to guard against local optima, a Monte Carlo approach can be used, i.e. repeating the clustering many times and taking the expectation. These topics will be presented in the next article.