Bridge Crack Detection Based on Attention Mechanism



Introduction
In recent years, as budgets for bridge construction have grown and the government has strengthened its supervision of the safety of old bridges, bridge crack detection has become a hot topic. Currently, the detection of bridge cracks mainly relies on bridge inspection vehicles, which use folding or telescopic arms attached to the vehicle, or a hybrid aerial ladder, to move a platform carrying inspection personnel under the bridge while the vehicle drives along the edge of the bridge. However, this method consumes considerable manpower, material, and financial resources. The viewing angle of the inspection personnel is significantly limited, and the detection results depend on their level of expertise, making large-scale inspections difficult. Therefore, vision-based bridge inspection has emerged, in which cameras mounted on drones capture bridge images and object detection algorithms detect the cracks in them.
In the past two decades, traditional vision-based methods for detecting bridge cracks have mainly used graphical analysis [1], pattern recognition [2], edge detectors [3], [4], line detectors [5], [6], and threshold segmentation [7]. These methods can achieve good detection accuracy for continuous cracks with high contrast, demonstrating the feasibility of automated vision-based detection. However, during the actual collection of bridge crack images, factors such as the collection equipment, shooting angle, external lighting, and vibration often prevent the model from achieving good detection results. As a result, simple detection algorithms based on traditional image processing are no longer robust enough to meet growing production demands.
ISSN 2775-2658, Vol. 3, No. 2, 2023. Geng Chuang (Bridge Crack Detection Based on Attention Mechanism)
With the development of the machine learning field, it has become possible to detect bridge cracks based on vision in complex environments. Henrique et al. [8] proposed a machine learning-based method that detects crack blocks by statistical processing of the mean and standard deviation of the gray values within image blocks. Eduardo Zalama et al. [9] proposed an instrumented vehicle for detecting cracks based on an imaging system, two inertial profilers, differential global positioning systems, and network cameras. They designed a method based on Gabor filters to identify horizontal and vertical cracks, improved on single-classifier results using the AdaBoost algorithm [10], and validated the feasibility of the approach on a large database, obtaining good results in rigorous testing. Prateek Prasanna et al. [11] proposed a new automatic crack detection algorithm called STRUM (Spatially Tuned Robust Multi-Feature), which was tested on real bridge data using an advanced robotic bridge scanning system. This algorithm uses machine learning classification to avoid manually adjusting threshold parameters. It fits potential cracks in space and calculates visual features in that area, and the entire algorithm is built on a reasonable representation of crack information and a classifier trained multiple times. In experiments, the algorithm achieved an accuracy of up to 95%. Cord and Chambon [12] described the texture features of cracks in images using a model designed with the AdaBoost algorithm. Shi et al. [13] proposed a model based on random forests to extract image features and detect cracks in the CrackForest road crack dataset. Because these traditional object detection methods depend on expert-designed features, their robustness in practical applications is limited.
With the development of deep learning, convolutional neural networks (CNNs) have gradually become a hot research direction in object detection. Region-based methods such as R-CNN [14], Fast R-CNN [15], and Faster R-CNN [16] achieve high detection accuracy but have slow detection speeds, which cannot meet practical needs. Regression-based algorithms such as YOLO [17] and SSD [18] perform well in both detection accuracy and speed and have become popular model architectures in the field of object detection.
Convolutional neural network algorithms based on deep learning can also be applied to bridge crack detection problems. Zhang et al. [19] used a convolutional neural network to achieve single-pixel classification, which can predict whether a single pixel belongs to a crack. However, this method did not make good use of the semantic information of crack targets and required manually designed feature extractors for image preprocessing, which limits its generality. Zou et al. [20] proposed DeepCrack, the first detector to use multiscale convolutional features to detect cracks, opening up a new path for pixel-level crack detection in bridges based on deep learning. Wang et al. [21] added HDCBs to the neural network to learn spatial features, enlarging the receptive field of the convolutional kernel, avoiding the loss of a large amount of semantic information due to the grid effect, and maintaining the continuity of pixel-level cracks. Shuai Teng et al. [22] tested different object detection algorithms for the detection of bridge surface defects by introducing Gaussian white noise. Their experiments showed that using transfer learning and data augmentation to improve the YOLO V3 [23] network can effectively improve bridge defect detection capability, but the work does not address the most important problem, bridge crack recognition. Jinsong Zhu et al. [24] improved the VGG-16 network classifier and collected real bridge surface defect images labeled into seven categories. A comparison with multiple detection and classification algorithms showed that the improved algorithm is superior to the others, providing a feasible solution for bridge surface defect detection and classification. Philipp Hüthwohl et al. [25] collected defect data from many different bridges to create a classification dataset and proposed a three-level concrete defect classifier for bridge defect detection, achieving a detection accuracy of 85% in experiments. Sizeng Zhao et al. [26] combined the YOLO V5 algorithm with 3D photogrammetric reconstruction to propose a defect detection method for concrete dams. Their improved algorithm targets complex backgrounds and blurred boundaries and improved accuracy by 3.8% compared to the original algorithm, especially for small object detection. Gang Li et al. [27] used a fully convolutional network to extract bridge crack features and then used a naive Bayes data fusion model to segment the cracks. Compared with traditional visual feature detection methods, there was a significant improvement in accuracy and detection time.
However, none of the above-mentioned studies has explored the issue of high-precision localization in bridge crack detection. When localization is not precise enough, a target may be recognized successfully while the predicted box fails to enclose it fully, which affects the subsequent risk assessment of bridge cracks. Therefore, this paper studies the high-precision localization of bridge cracks.

Related Work
The attention mechanism is a method of allocating limited computing resources to important local information. This method is consistent with the cognitive rules of the human brain and eyes and is a bionic neural network-assisted algorithm. In recent years, it has been widely used in the field of computer vision and has been proven to be beneficial in improving model performance. The essence of the attention mechanism is to locate the information that is of interest and beneficial to the recognition results, suppress irrelevant information, and output the results in the form of a probability map or probability feature vector.

Channel Attention Mechanism
Viewed by dimension, the feature maps of convolutional neural networks in image processing have two kinds of dimensions. The spatial dimensions carry information about the spatial scale of the image, namely its width and height. The channel dimension carries information about the image's channels. There are two commonly used channel attention mechanisms: SENet (Squeeze and Excitation Net) [28] and ECA modules (Efficient Channel Attention) [29].
SENet is a channel-based attention mechanism model that models the importance of each feature channel in an image and enhances or suppresses different channel information for different recognition tasks. The principle diagram of the SENet module is shown in Fig. 1. First, the feature map is compressed along the spatial dimensions using a Squeeze operation $F_{sq}(\cdot)$, similar to global average pooling. After this operation, the number of feature channels remains the same. For each feature map channel, a weight value is generated using an Excitation function $F_{ex}(\cdot, W)$, and then the weights are normalized and multiplied with the original feature map channel-wise using a function $F_{scale}(\cdot,\cdot)$ to complete the channel attention operation. The feature weights are learned by a fully connected network driven by the loss function, rather than being derived solely from the numerical values of the feature channels. This ensures that the weights of effective feature channels are larger, resulting in higher learning efficiency.
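The squeeze, excitation, and scale steps described above can be sketched in PyTorch as follows. This is a minimal illustration and not the implementation used in this paper: the class name `SEBlock`, the reduction ratio of 16, and the two-layer fully connected network with ReLU and sigmoid are assumptions taken from the common form of the SE block.

```python
import torch
import torch.nn as nn


class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention (illustrative sketch)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)  # assumed bottleneck width
        # Squeeze: global average pooling over the spatial dimensions.
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: fully connected layers that produce one weight per channel.
        self.excite = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),  # normalizes each weight to (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)       # (B, C): channel count unchanged
        w = self.excite(w).view(b, c, 1, 1)  # (B, C, 1, 1)
        return x * w                         # Scale: channel-wise multiplication
```

Because the weights lie in (0, 1), the block can only attenuate channels relative to the input; the output keeps the input shape, so the block can be placed between existing layers.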
However, the dimensionality reduction used in SENet can affect the predictive performance of channel attention and result in low efficiency in capturing channel dependencies in images. Therefore, the ECA module was developed to avoid dimensionality reduction and improve cross-channel interaction. First, the input feature map is compressed in the spatial dimensions by using global average pooling. Then, the inter-channel dependencies of the compressed feature map are learned by applying a one-dimensional convolution across the channels. Next, the learned channel attention information, which contains the weight information, is multiplied by the input feature map channel-wise.
SENet uses fully connected layers (FC) to globally learn the input channel features, while the ECA module uses a one-dimensional convolution to locally learn the channel correlation information. By using a dynamic convolutional kernel size, the ECA module can learn the correlation between different channels. When the number of channels is large, a larger kernel size is used so that the convolution achieves cross-channel interaction over more channels. When the number of channels is small, a smaller kernel size is used so that the interaction covers fewer channels.
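The ECA steps (global average pooling, a convolution across neighbouring channels, channel-wise multiplication) can be sketched as below. The sketch assumes that the convolution is one-dimensional over the channel axis with an odd kernel size k, as in the ECA paper [29]; the class name `ECALayer` and the default `kernel_size=3` are illustrative choices, not values from this paper.

```python
import torch
import torch.nn as nn


class ECALayer(nn.Module):
    """Efficient Channel Attention (illustrative sketch)."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        if kernel_size % 2 == 0:
            raise ValueError("kernel_size must be odd")
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One shared 1D kernel slides over the channel axis, so each channel
        # weight depends only on its kernel_size nearest channels.
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.pool(x)                      # (B, C, 1, 1)
        w = w.squeeze(-1).transpose(-1, -2)   # (B, 1, C)
        w = self.conv(w)                      # (B, 1, C), no channel reduction
        w = self.gate(w).transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1)
        return x * w
```

Unlike the fully connected layers of SENet, the layer has only `kernel_size` learnable parameters regardless of the number of channels.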
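The adaptive function of the dynamic convolution kernel size is

$$k = \psi(C) = \left| \frac{\log_2 C}{\gamma} + \frac{b}{\gamma} \right|_{odd} \qquad (1)$$

where $k$ is the convolution kernel size, $C$ is the number of channels, $|\cdot|_{odd}$ means that the result is taken as an odd number, and $\gamma$ and $b$ are generally set to 2 and 1, respectively; they are used to change the ratio between the number of channels $C$ and the convolution kernel size $k$.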
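As a worked example of the adaptive kernel size, the function below evaluates the mapping for a given channel count. The name `eca_kernel_size` is hypothetical, and the rounding rule (truncate, then move up to the next odd number when the result is even) is one common reading of the odd-number operator, stated here as an assumption.

```python
import math


def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Map the number of channels C to an odd 1D kernel size k."""
    t = int(abs((math.log2(channels) + b) / gamma))
    # Assumed rounding rule: keep t if it is odd, otherwise use t + 1.
    return t if t % 2 == 1 else t + 1


for c in (64, 128, 256, 512):
    print(c, eca_kernel_size(c))
```

With gamma = 2 and b = 1, 64 channels give k = 3 and 128, 256, or 512 channels give k = 5, so the kernel grows slowly with the channel count.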

Mixed Attention Mechanism
CBAM (Convolutional Block Attention Module) [30] is one of the representative hybrid attention mechanisms, combining channel attention and spatial attention. The structure diagram of the CBAM module is shown in Fig. 3. CBAM is an improvement based on the SENet method: it models the importance of channel features using channel attention and the degree of attention to spatial positions using spatial attention. CBAM learns the channel and spatial features of the feature map separately, which allows it to improve model performance in most cases and gives it a wider range of applications. The channel attention in CBAM is similar to SENet, and its block diagram is shown in Fig. 4. The principle underlying the spatial attention mechanism is that different regions of an image contribute differently to the recognition task, so focusing only on the regions that contribute most to the task can enhance the model's performance and reduce computation. Essentially, the spatial attention mechanism locates the target and performs some transformations to obtain the corresponding weights during the learning process. The spatial attention mechanism used in the mixed-domain attention mechanism is shown in Fig. 5.

Fig. 5. Spatial attention in CBAM
As shown in Fig. 5, the spatial attention mechanism in CBAM first reduces the dimensionality of the channels by applying both max-pooling and average-pooling operations. Then, the results are concatenated into a feature map, which is further processed by a convolutional layer to learn spatial features. Finally, a sigmoid activation function is applied to obtain the attention weights for the spatial features.
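The spatial attention steps above (channel-wise max and average pooling, concatenation, convolution, sigmoid) can be sketched as follows. The class name `SpatialAttention` and the 7×7 kernel are assumptions borrowed from the usual CBAM configuration; this paper does not state the kernel size.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """CBAM-style spatial attention (illustrative sketch)."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two input channels: the average-pooled and the max-pooled maps.
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.gate = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)   # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)   # (B, 1, H, W)
        attn = self.gate(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                         # one weight per spatial position
```

The attention map has a single channel, so every channel of the input is scaled by the same spatial weights.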
To improve the performance of the CBAM module on bridge crack recognition tasks and enhance its efficiency in learning features, this paper adds three convolutional layers on top of the CBAM module. The enhanced CBAM module with the three additional convolutional layers is referred to as CBAMC3. It has the advantages of being lightweight, widely applicable, and delivering a strong performance improvement.

Fusion with YOLO V5
The YOLO V5 algorithm is a one-stage object detection algorithm based on regression. In data preprocessing, it uses the same online mosaic image enhancement method as the YOLO V4 algorithm to increase the number of small targets in a single batch, which improves the network's ability to recognize small target objects and increases the amount of information in a single batch. An FPN (Feature Pyramid Network) structure is used to fuse feature information from the top down. In the neck, a PAN (Path Aggregation Network) structure adds a bottom-up fusion path, shortening the path between the low-level features and the prediction layer. The CSP (Cross Stage Partial Network) layer is used instead of the residual connection layer, which enhances the model's learning ability, makes the model lighter, maintains its accuracy, and reduces the computational bottleneck.
The YOLO V5 algorithm alleviates the imbalance between positive and negative samples that existed in previous YOLO algorithms. In previous YOLO algorithms, positive samples were defined based on the IOU value between the anchor box and the true target box. When the IOU value was greater than the threshold, the anchor box was set as a positive sample. However, due to the one-to-one correspondence between anchor boxes and true target boxes, there could only be as many positive samples as true target boxes, resulting in an imbalance. The YOLO V5 algorithm instead defines positive samples based on the width and height ratios between the anchor boxes and the true target boxes. When these ratios are within a threshold, the anchor box is defined as a positive sample. Additionally, YOLO V5 predicts the same target in nearby grids simultaneously to increase the number of positive samples, effectively alleviating the imbalance.
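The ratio-based rule for positive samples can be illustrated with a small sketch. The function names, the example anchors, and the threshold of 4.0 are assumptions for illustration; the sketch compares widths and heights separately and accepts an anchor only if neither dimension differs by more than the threshold factor.

```python
def is_positive(gt_wh, anchor_wh, ratio_threshold=4.0):
    """Return True if the anchor matches the target box by shape ratio."""
    gw, gh = gt_wh
    aw, ah = anchor_wh
    # Ratio in each dimension, taken so that it is always >= 1.
    rw = max(gw / aw, aw / gw)
    rh = max(gh / ah, ah / gh)
    return max(rw, rh) < ratio_threshold


def matching_anchors(gt_wh, anchors, ratio_threshold=4.0):
    """Indices of all anchors that count as positive for one target box."""
    return [i for i, a in enumerate(anchors)
            if is_positive(gt_wh, a, ratio_threshold)]


anchors = [(10, 13), (30, 61), (116, 90)]    # hypothetical anchor sizes
print(matching_anchors((40, 50), anchors))   # compact target
print(matching_anchors((200, 20), anchors))  # elongated, crack-like target
```

In this example the compact box matches two anchors, so one target yields several positive samples, while the elongated box matches none of these generic anchors, which suggests why anchors fitted to the crack dataset matter for elongated targets.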
For bridge crack recognition, crack targets are elongated, discontinuous, and vary greatly in scale, so using the YOLO V5 algorithm can, to some extent, avoid the imbalance between positive and negative samples and further improve model performance. However, the accuracy of the YOLO V5 algorithm in recognizing some bridge crack targets cannot meet the requirements. Therefore, attention mechanisms are fused into the network to further improve the performance of the YOLO V5 algorithm. The improved YOLO V5 network structure is shown in Fig. 6. Incorporating attention mechanisms before the convolutional layers in the YOLO V5 network's prediction layer can increase the influence of the learned attention weights on the final performance of the model. Validating the effectiveness of different attention mechanisms on the bridge crack recognition problem makes it possible to better solve real-world bridge crack recognition problems.
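The placement of the attention module directly before a prediction convolution can be sketched as below. This is a simplified stand-in and not the network of Fig. 6: `AttentionPredictionHead` and `MeanGate` are hypothetical names, `MeanGate` is a parameter-free placeholder for SENet, ECA, or CBAMC3, and the output width of `num_anchors * (num_classes + 5)` assumes the usual YOLO encoding of four box coordinates, one objectness score, and the class scores.

```python
import torch
import torch.nn as nn


class MeanGate(nn.Module):
    """Placeholder attention: scales each channel by the sigmoid of its mean."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(x.mean(dim=(2, 3), keepdim=True))


class AttentionPredictionHead(nn.Module):
    """Attention module followed by a 1x1 prediction convolution (sketch)."""

    def __init__(self, in_channels, num_classes, num_anchors=3, attention=None):
        super().__init__()
        # Any module that preserves the feature-map shape can be plugged in.
        self.attention = attention if attention is not None else nn.Identity()
        self.predict = nn.Conv2d(in_channels,
                                 num_anchors * (num_classes + 5),
                                 kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The re-weighted features feed the prediction layer directly.
        return self.predict(self.attention(x))


head = AttentionPredictionHead(128, num_classes=1, attention=MeanGate())
print(head(torch.randn(1, 128, 20, 20)).shape)
```

Placing the attention module immediately before the prediction convolution means that no further layers dilute the learned weights before the boxes are predicted.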

Experiment Environment and Evaluation Indicators
The experimental environment is a high-performance server dedicated to deep learning object recognition, with two RTX 8000 graphics cards, running the stable version of Ubuntu 20.04.3 LTS. The object recognition framework is PyTorch, version 1.12, and the basic YOLO V3 algorithm is the PyTorch version from Ultralytics. The experimental dataset is a self-made dataset, which includes images collected from the bridge crack detection dataset and real-world bridge crack images. The dataset has been manually annotated and verified multiple times for accuracy.
The evaluation metric uses the concept of "average precision" from the current mainstream VOC 2007. Precision P represents the proportion of correct predictions made by the model, while recall R represents the coverage of the target category in the recognition results. In the object recognition task, there are two types of predictions, positive and negative, and two types of detection results, correct and incorrect. Positive predictions that are correct are defined as TP, positive predictions that are incorrect as FP, negative predictions that are correct as TN, and negative predictions that are incorrect as FN. Precision and recall can be calculated as (2)-(3).

$$P = \frac{TP}{TP + FP} \qquad (2)$$

$$R = \frac{TP}{TP + FN} \qquad (3)$$
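As a small worked example of precision and recall, the function below computes both values from TP, FP, and FN counts. The function name and the counts are made up for illustration and are not results from this paper.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 80 correct detections, 20 false alarms, 10 missed cracks.
p, r = precision_recall(tp=80, fp=20, fn=10)
print(f"precision={p:.3f} recall={r:.3f}")
```

TN does not appear in either formula, which suits object detection, where the number of correctly rejected background regions is not well defined.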
From the precision and recall, the average precision (AP) for each class can be calculated as

$$AP = \int_0^1 P(R)\,dR \qquad (4)$$

and then the mean average precision (mAP) over all classes can be computed as

$$mAP = \frac{1}{m}\sum_{i=1}^{m} AP_i \qquad (5)$$

where $m$ is the number of classes and $AP_i$ is the average precision of class $i$. The commonly used mAP has different versions depending on the IOU threshold used, such as mAP50 and mAP50-95. The latter means that the AP values are calculated with IOU thresholds ranging from 0.5 to 0.95 with a step size of 0.05, and then the average value is taken. Compared to mAP50, mAP50-95 better reflects the performance of the algorithm and also demonstrates the algorithm's ability to recognize targets with high accuracy and confidence.
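The AP integral and the mAP50-95 average can be approximated numerically as sketched below. The sketch integrates the precision envelope over all recall points, which is an assumption about the evaluation details (the original VOC 2007 protocol used 11-point interpolation instead); all function names and input values are illustrative.

```python
def average_precision(recalls, precisions):
    """Area under the precision-recall curve; recalls must be ascending."""
    r = [0.0] + list(recalls) + [1.0]
    p = [1.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """mAP: mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


def iou_thresholds():
    """IOU thresholds 0.50, 0.55, ..., 0.95 used by mAP50-95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map50_95(map_at_threshold):
    """Average the mAP values obtained at the ten IOU thresholds."""
    return sum(map_at_threshold[t] for t in iou_thresholds()) / 10


print(average_precision([0.5, 1.0], [1.0, 0.5]))
```

A detection only counts as TP at a given threshold if its box overlaps the target at least that much, so the stricter thresholds in mAP50-95 reward boxes that enclose the crack tightly.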

Comparative Experiments
To verify the performance improvement brought by fusing each of the three attention mechanisms with YOLO V5, comparative experiments were conducted. The improved YOLO V5 algorithms were also compared with the original YOLO V5 algorithm and the YOLO V3 algorithm on the bridge crack detection task. In all experiments, the anchor boxes were re-clustered in advance using the K-Means algorithm, with the same experimental conditions and dataset, to eliminate external factors that may interfere with the performance comparison of the models. The performance indicators of the five algorithm models are shown in Table 2. The algorithms fused with the channel attention mechanisms did not improve in mAP50 but did so in mAP50-95, indicating that the fused channel attention mechanisms perform well on the high-precision issue of bridge crack target detection tasks. The improved algorithm with the fused hybrid domain attention mechanism, CBAMC3, achieved a 0.5% increase in mAP50 and a 6.57% increase in mAP50-95 compared to the original algorithm, showing good improvement for high-precision localization issues.
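The anchor re-clustering step can be sketched with a small K-Means routine over box widths and heights. The paper states only that K-Means was used; the similarity measure (IOU between boxes aligned at a common corner), the random initialization, and the sample boxes are assumptions for illustration.

```python
import random


def wh_iou(a, b):
    """IOU of two boxes given as (w, h) and aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors, returned sorted by area."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            # Assign each box to the center it overlaps most.
            best = max(range(k), key=lambda i: wh_iou(box, centers[i]))
            clusters[best].append(box)
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append((sum(w for w, _ in members) / len(members),
                                    sum(h for _, h in members) / len(members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Hypothetical crack boxes: three tall and narrow, three wide and flat.
boxes = [(10, 100), (12, 110), (11, 90), (200, 20), (210, 22), (190, 18)]
print(kmeans_anchors(boxes, k=2))
```

For elongated cracks, the resulting anchors follow the tall-narrow and wide-flat shapes in the data instead of the roughly square shapes of generic anchors.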

Experiment Results
The F1 curve and PR curve can effectively demonstrate the convergence process and performance of the model. The F1 curves and PR curves for YOLO V5 and the three improved YOLO V5 algorithms with fused attention mechanisms are shown in Fig. 7. As shown in Fig. 7, the improved YOLO V5 algorithm with the fused hybrid domain attention mechanism, CBAMC3, has the largest area under both the F1 curve and the PR curve, indicating that the YOLO V5 algorithm fused with the CBAMC3 module has the best convergence effect among the four algorithms.

Conclusion
This article addresses the issue of incomplete box selection caused by the lack of high-precision localization capability in bridge crack target detection tasks and selects the YOLO V5 algorithm as the base network. Because the YOLO V5 algorithm does not achieve the expected performance in high-precision localization, attention mechanisms are fused into it to improve its localization precision. Two channel attention mechanisms, the SENet and ECALayer modules, and one hybrid domain attention mechanism, CBAMC3, were selected for fusion, and relevant experiments were conducted. The results show that fusing attention mechanisms can effectively improve the YOLO V5 algorithm's high-precision localization performance. Among them, the CBAMC3 module fused with the YOLO V5 algorithm has the best effect, improving mAP50-95 by 6.5%. In the future, we will conduct research on bridge crack problems in more complex scenarios. We will expand the bridge crack dataset by collecting more bridge crack images and study algorithm improvements for object detection in high-resolution bridge crack images. High-resolution images require object detection algorithms to have a larger receptive field than commonly used algorithms, but simply increasing the receptive field may not significantly improve the model's performance. Therefore, a better attention mechanism that utilizes information more fully is needed to extract feature information from large receptive fields. This will be a feasible method for object recognition tasks in high-resolution image detection.
Author Contribution: All authors contributed equally as main contributors to this paper. All authors read and approved the final paper.