Hybrid approach redefinition with progressive boosting for the class imbalance problem



Introduction
The class imbalance problem in classification has received increasingly widespread attention, not only because it can affect accuracy but also because it can cause important information contained in the minority class to be ignored [1]. The class imbalance problem is often framed as a binary classification problem in which the samples of one class, particularly the minority class, are compared against the samples of all other classes [2], [3]. Class imbalance became a new research topic in machine learning under the theme "Learning from Imbalanced Data", which began to develop in the 2000s along with the first workshop on class imbalance organized at the American Association for Artificial Intelligence conference [4]. Several methods for handling class imbalance identify two main issues: the number of classifiers and data diversity [5]. Luque et al. proposed classification performance as one aspect for measuring success in handling class imbalance problems [1]. Lachiche and Flach [6] and Yang et al. [7] use the Receiver Operating Characteristic (ROC) curve to evaluate the classifier. Wang and Yao [5] and Sun et al. [8] use the F-Measure, G-Mean, and Q-Statistic methods for measuring data diversity. Galar et al. [2] suggested a taxonomy consisting of four groups of approaches to ensemble learning: cost-sensitive boosting, boosting-based ensembles, bagging-based ensembles, and hybrid ensembles. Hybrid ensembles are ensemble learning methods that combine bagging and boosting. Jian et al. [9] suggested a new ensemble learning method called Different Contribution Sampling (DCS), which can be seen as a hybrid of sampling-based and boosting methods. Ren et al.
[10] suggested a sampling-based ensemble learning method, the Ensemble Based Adaptive Over-Sampling method, which modifies over-sampling by using Adaptive SMOTEBoost to overcome the class imbalance problem. The Hybrid Approach Redefinition (HAR) method is one hybrid ensemble method. In this method, the preprocessing stage is carried out using Random Under Sampling and SMOTEBoost (the Random Balance ensemble method), while the processing stage is carried out using UnderBagging and Different Contribution Sampling [11]. Research conducted by Fernandez et al. showed that SMOTEBoost has a weakness when faced with small disjuncts, lack of data, and noise, which indirectly leads to low data diversity because it allows misclassification in only one class [12]. According to Díez-Pastor et al. [13], it is important to pay attention to data diversity when handling class imbalance. This means that the misclassification produced by each classifier should be as small as possible, and if misclassification occurs, it is expected to occur on different objects or parts [14]. Progressive Boosting (PBoost) progressively adds groups of uncorrelated samples to the boosting procedure, so that information is not lost and a diverse set of classifiers can be produced. Based on this, the PBoost method is expected to improve data diversity [15]. This research modifies the Hybrid Approach Redefinition by replacing SMOTEBoost with Progressive Boosting to obtain better data diversity, a smaller number of classifiers, and better performance.

Method
This research consists of three stages: preprocessing, processing, and evaluation. The datasets in this study are taken from the KEEL-Dataset Repository and were selected by imbalance ratio: the Pima dataset for a low imbalance ratio, the Abalone9vs18 dataset for a moderate imbalance ratio, and the Yeast2vs8 dataset for a high imbalance ratio. All datasets were also selected considering the number of attributes and instances [16].
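As an illustration of the dataset setup above, the sketch below parses a small KEEL-style .dat fragment and computes its imbalance ratio (majority class size divided by minority class size). The inline `sample` string and the helper names `parse_keel` and `imbalance_ratio` are illustrative, not taken from the paper or the KEEL tooling.

```python
from collections import Counter

def parse_keel(text):
    """Parse KEEL .dat content: skip @-header lines, split data rows
    into (feature list, class label) pairs."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("@"):
            continue
        *features, label = [tok.strip() for tok in line.split(",")]
        rows.append(([float(v) for v in features], label))
    return rows

def imbalance_ratio(rows):
    """IR = size of the majority class / size of the minority class."""
    counts = Counter(label for _, label in rows)
    return max(counts.values()) / min(counts.values())

# Tiny illustrative fragment in the KEEL header format (not real Pima data).
sample = """@relation toy
@attribute f1 real
@attribute class {positive, negative}
@data
1.0, negative
2.0, negative
3.0, negative
4.0, positive
"""

rows = parse_keel(sample)
print(imbalance_ratio(rows))  # 3.0
```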

Preprocessing Stage
In the Hybrid Approach Redefinition with Progressive Boosting method, the preprocessing stage is carried out using the Random Under Sampling method and Progressive Boosting. The result of this stage is a preprocessed dataset, which then proceeds to the processing stage. The preprocessing stage can be seen in Fig. 1. At the preprocessing stage, clustering (for example with K-Means) can itself produce class imbalance, so handling of the imbalanced class starts here. First, for the binary class imbalance problem, the majority and minority classes are determined. A random number is then generated to determine the size of the new majority class. If the new majority size is smaller than the old majority size, the new majority class still differs greatly in instance count from the minority class, so several majority class instances are moved to the minority class through the PBoost process.
Conversely, if the new majority size is larger than the old majority size, the reverse applies: several instances in the minority class must be moved to the majority class through the PBoost process. This check matters because handling class imbalance can sometimes leave the class that was originally the minority with a very large number of instances. In Fig. 1, it can be seen that the main difference from the classic Hybrid Approach Redefinition is that the SMOTEBoost method is replaced by Progressive Boosting (PBoost).
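The random-balance step described above can be sketched as follows. A new majority size is drawn at random, then instances are moved between the two classes until that size is reached. In the actual method the moved instances are selected through PBoost; random selection here is a simplifying stand-in, and all names are illustrative.

```python
import random

def random_balance(majority, minority, rng=random.Random(0)):
    """Draw a random new majority size, then move instances between the
    classes to reach it (random pick stands in for PBoost selection)."""
    total = len(majority) + len(minority)
    # New majority size anywhere between 1 and total-1, as in Random Balance.
    new_maj_size = rng.randint(1, total - 1)
    majority, minority = list(majority), list(minority)
    while len(majority) > new_maj_size:      # new size smaller: shrink majority
        minority.append(majority.pop(rng.randrange(len(majority))))
    while len(majority) < new_maj_size:      # new size larger: grow majority
        majority.append(minority.pop(rng.randrange(len(minority))))
    return majority, minority

maj = [("x%d" % i, "neg") for i in range(8)]
mino = [("y%d" % i, "pos") for i in range(2)]
new_maj, new_min = random_balance(maj, mino)
print(len(new_maj), len(new_min))  # the two sizes always sum to 10
```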

Processing Stage
The processing stages will be carried out using UnderBagging and Different Contribution Sampling. The processing stages can be seen in Fig. 2.

Fig. 2. Processing Stage at Hybrid Approach Redefinition with Progressive Boosting
At this stage, the dataset produced by the preprocessing stage is processed further using Different Contribution Sampling, which begins by dividing both the majority class and the minority class into SV sets and NSV sets. The SV and NSV sets of the majority and minority classes receive different treatments. In the minority class, the SV set undergoes a noise removal process and then the PBoost stage, while the NSV set is combined with the PBoost result of the SV set into the New Positive Sample. In the majority class, the NSV set undergoes sampling with the RUS method and is combined with the noise-cleaned SV set into the New Negative Sample. In Fig. 2, it can be seen that the main difference from the classic Hybrid Approach Redefinition is that the SMOTEBoost method is replaced by Progressive Boosting (PBoost).
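The flow of sets described above can be sketched structurally. `split_sv_nsv`, `remove_noise`, `pboost_resample`, and `rus` are stand-in stubs for the BSVM split, the noise filter, Progressive Boosting, and Random Under Sampling respectively; only the way the sets are combined follows the text.

```python
import random

def split_sv_nsv(samples):
    # Stand-in: treat the first half as the SV (borderline) instances.
    mid = len(samples) // 2
    return samples[:mid], samples[mid:]

def remove_noise(samples):
    return [s for s in samples if s is not None]  # placeholder noise filter

def pboost_resample(samples):
    return list(samples)  # placeholder: PBoost would reweight/resample here

def rus(samples, k, rng=random.Random(0)):
    return rng.sample(samples, min(k, len(samples)))  # random under-sampling

def different_contribution_sampling(minority, majority):
    sv_min, nsv_min = split_sv_nsv(minority)
    sv_maj, nsv_maj = split_sv_nsv(majority)
    # Minority: denoised SV set goes through PBoost, then joins the NSV set.
    new_positive = pboost_resample(remove_noise(sv_min)) + nsv_min
    # Majority: denoised SV set joins an undersampled NSV set.
    new_negative = remove_noise(sv_maj) + rus(nsv_maj, len(new_positive))
    return new_positive, new_negative

pos, neg = different_contribution_sampling(list(range(4)), list(range(10, 22)))
print(len(pos), len(neg))  # 4 10
```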

Evaluation Stage
The evaluation process determines the number of classifiers, the data diversity, and the classification performance. A comparison is then made between the results obtained by the classic HAR method and HAR with Progressive Boosting. The evaluation stage can be seen in Fig. 3.

Fig. 3. Evaluation Stage at Hybrid Approach Redefinition with Progressive Boosting
After obtaining the resulting dataset, the number of classifiers, data diversity, and classification performance are evaluated. Comparisons are made between Hybrid Approach Redefinition with Progressive Boosting and the classic Hybrid Approach Redefinition.

Progressive Boosting
The core of Progressive Boosting is the pair of weight matrices for positive and negative samples, shown in (1) and (2). The weight matrix for positive samples is denoted W^+ and the weight matrix for negative samples is denoted W^-; both can be calculated using (1) and (2) [15].
The weight matrices for true positives, false positives, true negatives, and false negatives can be calculated using (3) to (6).
The error of the classifier can be calculated using (7), and the weight update parameter α is given in (8). The pseudocode of Progressive Boosting is as follows [15]. In the pseudocode, it can be seen that determining the weight matrix for each of the majority and minority classes is the most fundamental part of Progressive Boosting. The process begins with the determination of the weight distribution, which involves several training stages to obtain the weight matrices. The performance and the error of the classifier are then calculated, and based on the classifier error the parameter for the weight update is determined.
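The weight-distribution and update cycle described above can be sketched as one boosting round that keeps separate weight vectors for positive and negative samples. The exact PBoost update in (8) differs; the standard AdaBoost form α = ½·ln((1−ε)/ε) and per-class normalization are used here as illustrative assumptions, and all names are hypothetical.

```python
import math

def boosting_round(weights_pos, weights_neg, correct_pos, correct_neg):
    """One AdaBoost-style reweighting round with separate weight vectors for
    positive and negative samples (a sketch, not the exact PBoost update)."""
    total = sum(weights_pos) + sum(weights_neg)
    # Weighted error: total weight of misclassified samples in both classes.
    err = (sum(w for w, c in zip(weights_pos, correct_pos) if not c) +
           sum(w for w, c in zip(weights_neg, correct_neg) if not c)) / total
    alpha = 0.5 * math.log((1 - err) / err)  # standard AdaBoost form (assumed)
    def update(weights, correct):
        # Decrease weights of correct samples, increase weights of errors.
        new = [w * math.exp(-alpha if c else alpha)
               for w, c in zip(weights, correct)]
        z = sum(new)
        return [w / z for w in new]  # normalize per class (sketch choice)
    return (update(weights_pos, correct_pos),
            update(weights_neg, correct_neg), alpha)

wp, wn, a = boosting_round([0.25, 0.25], [0.25, 0.25],
                           [True, False], [True, True])
print(round(a, 3))  # 0.549
```

Note how the misclassified positive sample's weight grows from 0.25 to 0.75 of its class mass, which is what spreads subsequent misclassifications across different instances.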

Classifier
A classifier can generally be defined through decision regions ℜ as a function D that places an object into a class from the set Ω, where Ω consists of the classes ω_1, ω_2, …, ω_c. This can be seen in (9) [14], where D is the classifier and ℜ_i is the set of points in the decision region ℜ assigned to class ω_i.
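A minimal concrete instance of this definition, under the assumption of a nearest-centroid rule: the classifier D maps each point to the class whose centroid is closest, so the decision region ℜ_i is the set of points nearest to centroid i. The names below are illustrative.

```python
def make_centroid_classifier(centroids):
    """centroids: dict mapping class label -> centroid (tuple of floats).
    Returns D, a function assigning a point to its nearest-centroid class."""
    def D(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2
                                     for a, b in zip(x, centroids[c])))
    return D

# Two classes omega_1, omega_2 with centroids on the x-axis.
D = make_centroid_classifier({"w1": (0.0, 0.0), "w2": (4.0, 0.0)})
print(D((1.0, 0.5)), D((3.5, 0.0)))  # w1 w2
```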

Data Diversity
According to Díez-Pastor et al. in their research on class imbalance [13], data diversity is essential in handling imbalanced classes. This means that the misclassification produced by each classifier should be as small as possible, and if misclassification occurs, it is expected to occur on different objects or parts [14].

Science in Information Technology Letters, ISSN 2722-4139, Vol. 1, No. 1, May 2020, pp. 40-51

Suppose that Z = {z_1, …, z_N} is a dataset in the decision region ℜ, so that z_j ∈ ℜ is an instance involved in the classification problem. The outputs of the classifiers can then be shown as a pairwise classifier comparison matrix, as in Table 1.

Confusion Matrix
The Receiver Operating Characteristic (ROC) curve is often used to describe the performance of a classification or diagnostic rule [18]. It is a statistical method frequently used to determine the performance of a classifier. The curve is generated by plotting the true positive fraction of the positive samples (True Positive Rate) on the Y axis against the false positive fraction of the negative samples (False Positive Rate) on the X axis [19]. The True Positive and False Positive concepts in the confusion matrix are shown in Table 2.
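The plotting procedure just described can be sketched by sweeping a decision threshold over the classifier's scores and recording (FPR, TPR) at each threshold. The function name and the toy scores are illustrative.

```python
def roc_points(scores, labels):
    """scores: classifier scores; labels: 1 = positive, 0 = negative.
    Returns (FPR, TPR) pairs, one per distinct threshold, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))  # x = FPR, y = TPR
    return points

print(roc_points([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0]))
# [(0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```

A perfect ranking, as here, reaches TPR = 1.0 while FPR is still 0.0, which is the upper-left corner of the ROC plane.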

Classification Performance
In classification performance, there are several terms that need to be known: Sensitivity, Specificity, F-Measure, and G-Mean [8], [20].
Sensitivity relates to the ability of the classifier to classify the minority class (positive samples) correctly; its value ranges from 0 to 1. Sensitivity can be calculated using (11):

Sensitivity = TP / (TP + FN) (11)

Specificity relates to the ability of the classifier to classify negative samples (the majority class) correctly; its value also ranges from 0 to 1. Specificity can be calculated using (12):

Specificity = TN / (TN + FP) (12)
The F-Measure value lies between 0 and 1; a higher F-Measure states that both recall and precision are quite high. G-Mean, on the other hand, states the balance between the positive and negative samples (minority and majority classes) [8]. The F-Measure and G-Mean calculations are shown in (13) and (14):

F-Measure = (2 × Precision × Recall) / (Precision + Recall) (13)

G-Mean = √(Sensitivity × Specificity) (14)
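The four measures above can be computed directly from confusion-matrix counts; the sketch below follows (11)-(14), with illustrative counts.

```python
import math

def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, F-measure, and G-mean from
    confusion-matrix counts, following equations (11)-(14)."""
    sensitivity = tp / (tp + fn)          # recall on the minority class
    specificity = tn / (tn + fp)          # recall on the majority class
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = math.sqrt(sensitivity * specificity)
    return sensitivity, specificity, f_measure, g_mean

# Illustrative imbalanced result: 10 minority and 90 majority instances.
sens, spec, f1, gm = classification_metrics(tp=8, fn=2, tn=85, fp=5)
print(round(sens, 3), round(spec, 3), round(f1, 3), round(gm, 3))
# 0.8 0.944 0.696 0.869
```

Note how a high specificity alone would hide the two missed minority instances; G-Mean drops whenever either class is handled poorly, which is why it suits imbalanced evaluation.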

Hybrid Approach Redefinition with Progressive Boosting
The pseudocode of the Hybrid Approach Redefinition with Progressive Boosting is as follows. In the pseudocode, the preprocessing stage begins with a sampling process on the majority class to balance its number of instances with the minority class. This is done by moving the majority class instances closest to the minority class into the minority class. A random number determines the new majority size, and therefore how many instances are transferred to the minority class. In this process, care must be taken not to transfer so many instances that the minority class ends up with too many instances. This is handled by checking the size of the new majority class: if it is greater than the size of the old majority class, the minority class undergoes PBoost to move several minority class instances to the majority class; conversely, if the new majority class still has fewer instances, the majority class undergoes PBoost. The processing stage uses Different Contribution Sampling. Based on the preprocessed dataset, this stage divides the majority and minority classes into SV sets and NSV sets using the BSVM method. The resulting SV and NSV sets of the minority and majority classes then undergo further processing. In the minority class, the SV set undergoes a noise removal process and then the PBoost stage, while the NSV set is combined with the PBoost result of the SV set into the New Positive Sample. In the majority class, the NSV set undergoes sampling with the RUS method and is combined with the noise-cleaned SV set into the New Negative Sample.

Dataset Description
The datasets used in this research are Pima, Abalone9vs18, and Yeast2vs8. The description of datasets can be seen in Table 3.

Testing
Testing is done to measure the number of classifiers, data diversity, and classification performance. Testing is done 10 times for each method. The average values of the number of classifiers and data diversity can be seen in Table 4. In Table 4, it can be seen that for each dataset, Hybrid Approach Redefinition with Progressive Boosting gives better results in the data diversity category than the classic Hybrid Approach Redefinition. This is because both the positive and the negative samples are weighted in the form of weight matrices, so that the misclassifications are spread out and better data diversity is obtained. Measured by the number of classifiers, Hybrid Approach Redefinition with Progressive Boosting gives better results, in the form of fewer classifiers, for datasets with small and medium imbalance ratios, while for datasets with large imbalance ratios the classic Hybrid Approach Redefinition is slightly better. The average values of Sensitivity, Specificity, F-Measure, and G-Mean can be seen in Table 5. Based on Table 5, Hybrid Approach Redefinition with Progressive Boosting gives better sensitivity, specificity, F-Measure, and G-Mean than the classic Hybrid Approach Redefinition on all three datasets.

Results
In datasets with a small number of instances, the number of classifiers may not affect the computational process, whereas in datasets with a large number of instances the number of classifiers needs to be considered. In general, both the classic Hybrid Approach Redefinition and Hybrid Approach Redefinition with Progressive Boosting produce a number of classifiers that is not too large. The small number of classifiers in the Hybrid Approach Redefinition is supported by the preprocessing stage using the Random Under Sampling method and SMOTEBoost, which effectively reduces the number of classifiers. Progressive Boosting, with its weight matrices on both positive and negative samples, keeps the number of classifiers efficient and comparable to the Hybrid Approach Redefinition that uses SMOTEBoost. However, there is a tendency that for datasets with a larger imbalance ratio, the classic Hybrid Approach Redefinition gives results that are not far apart, and on datasets with a large imbalance ratio it can even produce slightly better results than Hybrid Approach Redefinition with Progressive Boosting.
Better data diversity, on the other hand, shows that the ensemble process of merging classifiers has been done well. In the Hybrid Approach Redefinition with Progressive Boosting, the results are better because the weighting of positive and negative samples in PBoost is more effective than the SMOTEBoost method. Sensitivity and specificity describe the performance of the classifier in classifying instances of the majority and minority classes: higher sensitivity means the classifier places instances of the minority class correctly more often, while specificity states the accuracy of the classifier in placing instances in the majority class. The sensitivity and specificity produced by Hybrid Approach Redefinition with Progressive Boosting are better than those of the classic Hybrid Approach Redefinition for datasets with low, medium, and large imbalance ratios.
The F-Measure values produced by the classic Hybrid Approach Redefinition and by Hybrid Approach Redefinition with Progressive Boosting are excellent, which shows that both methods correctly classify instances of the minority class, including instances that belong to the minority class but would otherwise be placed in the majority class. As a result of handling the class imbalance, instances of the minority class are preserved, so interesting patterns of the minority class can still be obtained. The G-Mean measurement itself states the balance of classification accuracy between the minority and majority classes. Hybrid Approach Redefinition with Progressive Boosting again gives better results for F-Measure and G-Mean.

Conclusion
Based on the tests that have been done, the results show that Hybrid Approach Redefinition with Progressive Boosting gives better results in handling class imbalance than the classic Hybrid Approach Redefinition. Indicators such as the number of classifiers, data diversity, and classification performance provided by Hybrid Approach Redefinition with Progressive Boosting are better than those of the classic Hybrid Approach Redefinition.
Through the results of this study, an excellent class imbalance treatment was obtained through Hybrid Approach Redefinition with Progressive Boosting, a hybrid ensemble approach that achieves good data diversity, a small number of classifiers, and good classification performance. However, the tendency toward a larger number of classifiers on datasets with a large imbalance ratio needs to be considered, and may be refined in future research.