J Innov Med Technol 2024; 2(1): 11-19
Published online May 30, 2024
https://doi.org/10.61940/jimt.240002
© Korean Innovative Medical Technology Society
Jun-Ha Park1, Young Jae Kim2, Kwang Gi Kim1,3,4,5
1Department of Bio-Health Medical Engineering, Gachon University Gil Medical Center, Incheon, Korea, 2Gachon Biomedical & Convergence Institute, Gachon University Gil Medical Center, Incheon, Korea, 3Medical Devices R&D Center, Gachon University Gil Medical Center, Incheon, Korea, 4Department of Biomedical Engineering, Gachon University Gil Medical Center, Incheon, Korea, 5Department of Health Sciences & Technology, Gachon Advanced Institute for Health Sciences & Technology (GAIHST), Gachon University, Lee Gil Ya Cancer and Diabetes Institute, Incheon, Korea
Correspondence to : Kwang Gi Kim
Department of Biomedical Engineering, Gachon University Gil Medical Center, 38-13 Dokjeom-ro 3beon-gil, Namdong-gu, Incheon 21565, Korea
e-mail kimkg@gachon.ac.kr
https://orcid.org/0000-0001-9714-6038
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: Minimally invasive surgery (MIS) and robot-assisted surgery have gained recognition as procedures that are safer than traditional laparotomy and that facilitate faster patient recovery. However, MIS limits the surgeon's senses. Therefore, we propose a computer-assisted algorithm to support such surgery. With the advent of convolutional neural networks, machine vision technology has become an attractive option.
Materials and Methods: We use four networks, TernausNet, TernausResNet, LinkNet, and DeepLab V3+, to segment organs in endoscopy images. Endoscopy images also suffer from issues such as noise, hemorrhage, and shading. Therefore, we apply preprocessing and compare the results obtained with and without it.
Results: The network with the lowest performance is TernausNet; the performances of the other three networks show only marginal differences. The most significant factor affecting prediction performance is the encoder network. All networks demonstrate reliable performance, with a minimum intersection over union score of 0.68 (TernausNet).
Conclusion: The segmentation of organs in images can be used for the quantitative evaluation of surgery and to help surgeons understand anatomy.
Keywords: Deep learning; Artificial intelligence; Diagnosis, computer-assisted; Minimally invasive surgical procedures
Minimally invasive surgery (MIS) and robot-assisted surgery (RAS) have gained recognition for their ability to provide safer procedures and facilitate faster patient recovery than traditional laparotomy, resulting in reduced hospitalization durations1,2. However, surgeons encounter challenges in effectively manipulating surgical tools and obtaining a comprehensive understanding of the tissues at the surgical site due to their reliance on screens and endoscopic instruments for information gathering3,4. To overcome these challenges, researchers have actively conducted studies aimed at providing valuable feedback information by attaching sensors to instruments or employing computer vision technology to supplement the visual information5-8. Notably, computer vision-assistive technology has emerged as a promising solution, capitalizing on the rapid advancements in deep learning and convolutional neural networks (CNNs)7-9.
Even before the advent of deep learning and CNNs, researchers have proposed algorithms that leverage classical computer vision techniques to assist with MIS. In a notable 2003 study, Lo et al.10 introduced an algorithm aimed at evaluating tissue–instrument interactions through image analysis. Their approach involved instrument segmentation and tracking using color segmentation while quantitatively assessing tissue deformation caused by instruments through optical flow and shape-from-shading techniques. In a separate study conducted in 2006, Bilodeau et al.11 proposed an algorithm to segment the cavity and thereby aid surgeons in performing thoracic laminectomies; to validate their method, they segmented the cavity using laparoscopic images of a surgery performed on a pig. Their technique involved splitting and merging the cavity into meaningful regions using a multilevel graph approach known as the recursive shortest spanning tree. Both studies are significant because they utilize computer vision techniques to track instruments and aid surgeons performing MIS. These methods offer benefits such as lower computational costs and interpretability of the segmentation process when compared to deep learning approaches. However, they exhibit notably lower generalization capabilities than deep learning methods, making their application in endoscopic environments challenging due to the presence of multiple variables12.
Deep learning-based semantic segmentation has demonstrated its superiority over classical methods in achieving more accurate segmentation, particularly exhibiting enhanced generalization capabilities, which renders it well suited for endoscopic images involving multiple sources of variation such as illuminance, blurring, light spillage, hemorrhage, and overshadowing12-15. In 2018, Shvets et al.14 proposed a deep-learning-based semantic segmentation method for robotic instrument detection and tracking. They conducted binary and multiclass classifications of instruments in porcine surgical images acquired using the da Vinci Xi surgical system. The authors employed U-net-based networks, specifically TernausNet16 and LinkNet34, achieving a binary classification intersection over union (IoU) score of 0.66 and a multiclass classification IoU score of 0.35 using TernausNet16. Despite the effectiveness of deep-learning-based semantic segmentation in laparoscopic images, its multiclass classification performance remains insufficient. In 2020, Scheikl et al.15 proposed the application of semantic segmentation to assist surgeons in scene understanding during laparoscopic surgery. They trained segmentation networks, including U-net, TernausNet, LinkNet, FCN, and SegNet, using various encoder networks. Three loss functions, namely soft-Jaccard, generalized Dice, and cross entropy, were cross-applied to laparoscopic cholecystectomy images using labeled data categories such as image exterior, liver, gallbladder, instrument, fat, and others. The network trained with TernausNet11 using the soft-Jaccard function achieved a maximum IoU score of 0.78 for image segmentation.
In 2021, Sun et al.16 proposed a lightweight segmentation network for the real-time detection and tracking of surgical instruments in RAS. Their approach used a lightweight network that applies the Ghost module to MobileNetV3 as the encoder for real-time image segmentation, with Lite R-ASPP as the decoder. The network was trained using image sequences obtained from the da Vinci Xi system provided by the MICCAI Endoscopic Vision Challenge 2017. The segmentation achieved an impressive speed of approximately 37.0 frames per second (FPS) with an accuracy of 0.70; notably, this real-time speed was accomplished with minimal compromise in accuracy. In 2019, Ni et al.17 performed surgical instrument segmentation using RASNet, incorporating a decoder with an attention mechanism. They utilized RAS images from the MICCAI Endoscopic Vision Challenge 2017 dataset and achieved an IoU score of 0.90. To address the issue of background class imbalance in surgical images, they implemented global attention upsampling, which focused on the features of surgical instruments. This approach resulted in a noteworthy 7.58% improvement in the IoU score compared with that of the baseline model. By directing attention to the instruments through the attention mechanism, the challenge of class imbalance was effectively mitigated, leading to a substantial enhancement in the multiclass performance for various surgical instruments. However, it is important to note that both studies focused specifically on segmenting surgical instruments and did not encompass other aspects, such as organs, within the surgical images.
In MIS, the primary objective of image segmentation is to delineate surgical instruments accurately and assist surgeons in understanding the precise positioning of these instruments during the procedure. However, it is equally important to identify and analyze organs within MIS images, which play a crucial role in computer-aided diagnosis15,18. The accurate segmentation of organs in endoscopy images relies on the effective handling of variables, and multiple segmentations of organs are required. In this context, deep-CNN-based multiclass segmentation has emerged as a compelling solution. Therefore, this study proposes an image segmentation method for organs that employs a deep CNN as an encoder within a multiclass segmentation network and whose primary goal is to provide surgical assistance. This study provides a detailed description of the proposed solution, including the architecture of the semantic segmentation network, the learning process of the network, and a comprehensive analysis of the results obtained by segmenting organs in real surgical images.
Our study focuses on the semantic segmentation of organs during MIS. Therefore, we utilize several semantic segmentation networks, namely TernausNet, TernausResNet, LinkNet, and DeepLab V3+, for MIS images. Endoscopy images used in MIS may suffer from noise, hemorrhage, and shading issues. To address this problem, we apply preprocessing techniques and compare the results obtained using the datasets with and without preprocessing; the most influential factor is observed to be the encoder network, whereas the decoder network and preprocessing yield only marginal differences. Unlike instrument segmentation, organ semantic segmentation in MIS images serves a further purpose: enabling the quantitative evaluation of surgery and assisting surgeons in understanding anatomical structures.
Fig. 1A illustrates the network training sequence of the image segmentation system. The networks are implemented using TensorFlow 2.6.0 with CUDA 11.3 and cuDNN 8.2.1. The Adam optimizer is employed, with the learning rate, beta1, and beta2 set to 0.001, 0.9, and 0.999, respectively. The number of epochs is limited to a maximum of 1,000, and early stopping is applied with a patience of 20. During training, the networks are evaluated on a validation dataset, and the top-performing networks are saved.
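As an illustration of this training configuration, the following minimal TensorFlow/Keras sketch applies the stated Adam hyperparameters, epoch limit, and early-stopping patience; the model, datasets, and loss function are placeholders, not the authors' code.

```python
# Minimal sketch of the training setup described above (TensorFlow/Keras).
# The model, datasets, and loss function are assumed placeholders.
import tensorflow as tf

def train(model, train_ds, val_ds, loss_fn):
    optimizer = tf.keras.optimizers.Adam(
        learning_rate=0.001, beta_1=0.9, beta_2=0.999)

    # Stop when the validation loss has not improved for 20 epochs and
    # keep the best-performing weights, as described in the text.
    callbacks = [
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=20, restore_best_weights=True),
        tf.keras.callbacks.ModelCheckpoint(
            "best_model.h5", monitor="val_loss", save_best_only=True),
    ]

    model.compile(optimizer=optimizer, loss=loss_fn)
    model.fit(train_ds, validation_data=val_ds,
              epochs=1000, callbacks=callbacks)
```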
This retrospective study was approved by the Institutional Review Board of Gachon University Gil Hospital, and the requirement for patient informed consent was waived (approval number: GDIRB2020-346). The raw data comprise PNG images with a resolution of 1,920×1,080 pixels and RGB channels, along with XML annotations containing polygon masks for each organ. The mask is one-hot encoded as four binary images of the same size as the original image. To reduce the training time, the image resolution is reduced to 512×288 pixels while maintaining the 16:9 aspect ratio. Table 1 presents the components of the data, including the number of classes. Each image contains one or more classes, and the total number of images is 2,244.
Table 1 Components of the data, including the corresponding number of classes
Raw data category | Liver | Gallbladder | Spleen | Total |
---|---|---|---|---|
Each | 231 | 60 | 847 | 1,138 |
Liver–gallbladder | 936 | 936 | 0 | 936 |
Liver–spleen | 160 | 0 | 160 | 160 |
Liver–gallbladder–spleen | 10 | 10 | 10 | 10 |
Total | 1,337 | 1,006 | 1,017 | 2,244 |
Each image contains one or more classes, with a total of 2,244 images.
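For concreteness, a minimal sketch of the data preparation described above (downscaling to 512×288 while keeping the 16:9 ratio and one-hot encoding the organ masks) is given below. The class ordering and the assumption that the XML polygons have already been rasterized into an integer label map are illustrative choices, not details stated in the paper.

```python
# Sketch of the data preparation: resize and one-hot encode (assumptions noted above).
import cv2
import numpy as np

CLASSES = ["background", "liver", "gallbladder", "spleen"]  # assumed class order

def prepare_sample(image_bgr, label_map):
    """image_bgr: HxWx3 uint8; label_map: HxW uint8 with values 0..3."""
    image = cv2.resize(image_bgr, (512, 288), interpolation=cv2.INTER_AREA)
    label = cv2.resize(label_map, (512, 288), interpolation=cv2.INTER_NEAREST)

    # One binary plane per class, same spatial size as the resized image.
    one_hot = np.stack([(label == c).astype(np.float32)
                        for c in range(len(CLASSES))], axis=-1)
    return image.astype(np.float32) / 255.0, one_hot
```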
Preprocessing prevents the network training from being affected by noise and unwanted features; to evaluate its effectiveness, the preprocessed dataset is compared with an unprocessed dataset. Normalization, contrast-limited adaptive histogram equalization (CLAHE), and Gaussian blur are applied for preprocessing19. Fig. 1B depicts the preprocessing process.
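A minimal OpenCV sketch of such a preprocessing chain is shown below. The CLAHE clip limit, tile size, blur kernel, and the choice to equalize only the lightness channel are assumptions, as the text does not report these parameters.

```python
# Sketch of the preprocessing chain (normalization, CLAHE, Gaussian blur).
# Parameter values are assumptions for illustration only.
import cv2
import numpy as np

def preprocess(image_bgr, use_clahe=True, use_blur=True):
    image = image_bgr
    if use_clahe:
        # CLAHE operates on a single channel, so equalize only the lightness.
        lab = cv2.cvtColor(image, cv2.COLOR_BGR2LAB)
        l, a, b = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
        image = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    if use_blur:
        image = cv2.GaussianBlur(image, (3, 3), 0)
    # Normalize intensities to [0, 1] before feeding the network.
    return image.astype(np.float32) / 255.0
```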
The network architectures are constructed based on the U-net model, which comprises an encoder (or backbone network) and a decoder. These include skip connections that facilitate the recovery of resolution by transporting low-level features20, as illustrated in Fig. 2A. Two variants of TernausNet, together with LinkNet and DeepLab V3+, are used for the image segmentation of organs, with each network's encoder pre-trained on the ImageNet dataset14,21-24.
TernausNet is a semantic segmentation network proposed by Iglovikov and Shvets21 in 2018. It utilizes a VGG11 encoder, and its architecture is inspired by U-net. The network concatenates the low-level and decoded features. In our study, VGG16 and ResNet50 are used instead of VGG11; hereafter, these networks are referred to as TernausNet and TernausResNet21,25,26. Fig. 2B illustrates the decoder block of TernausNet. LinkNet is a semantic segmentation network proposed by Chaurasia and Culurciello22 in 2017. It is based on U-net and applies a ResNet18 encoder, constructing skip connections as residual (additive) connections between the low-level and decoded features. Fig. 2C illustrates the decoder block of LinkNet. DeepLab V3+ is a semantic segmentation network proposed by the Google Brain team in 2018 as an enhancement of DeepLab V3 from 2017. By using atrous spatial pyramid pooling (ASPP), the DeepLab V3+ encoder processes high-level features at multiple scales23. DeepLab V3+ has an encoder-decoder structure, its backbone is Xception, and the mask is decoded by the ASPP and decoding modules23. This is illustrated in Fig. 2D.
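As an illustration of the U-net-style decoder blocks described above, the following Keras sketch upsamples the decoded features, concatenates the encoder's low-level features through the skip connection, and applies a convolution. The filter counts and kernel sizes are illustrative assumptions rather than the exact blocks of Fig. 2B-2D.

```python
# Illustrative U-net-style decoder block with a concatenation skip connection.
import tensorflow as tf
from tensorflow.keras import layers

def decoder_block(decoded, skip, filters):
    # Upsample the decoded features to the skip connection's resolution.
    x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(decoded)
    # Concatenate the encoder's low-level features to recover spatial detail.
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x
```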
The loss function comprises two components: cross entropy, which evaluates accuracy over all pixels, and the IoU loss, which evaluates each label prediction. The IoU loss is defined as the negative logarithm of the IoU, and the loss function is based on the study by Shvets et al.14 Equation (1) represents the loss function:
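Based on the description above and on the loss used by Shvets et al.14, Equation (1) presumably has the form L = H - log J, where H is the pixel-wise cross entropy and J is the soft IoU (Jaccard) computed from the predicted class probabilities. Under that assumption, a minimal TensorFlow sketch of the combined loss is:

```python
# Sketch of the combined loss under the assumed form L = H - log(J).
# y_true and y_pred are one-hot ground truth and softmax outputs,
# both shaped (batch, height, width, classes).
import tensorflow as tf

def combined_loss(y_true, y_pred, eps=1e-7):
    # H: categorical cross entropy averaged over all pixels.
    h = tf.reduce_mean(
        tf.keras.losses.categorical_crossentropy(y_true, y_pred))
    # J: soft IoU (Jaccard) computed from the probabilities.
    intersection = tf.reduce_sum(y_true * y_pred)
    union = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) - intersection
    j = (intersection + eps) / (union + eps)
    return h - tf.math.log(j)
```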
Data preprocessing uses normalization by default, with histogram equalization through CLAHE and Gaussian blur as options. To evaluate the effectiveness of each preprocessing method, the networks were trained using separately preprocessed datasets. The network prediction labels were validated against labels assigned by the surgeons by comparing the corresponding areas. The IoU score and Dice coefficient are calculated to compare the performance of each network14,15,23. These coefficients range from zero to one, where a value closer to one indicates a higher similarity between the two areas. Table 2 lists the IoU score and Dice coefficient based on the type of preprocessing and network architecture.
Table 2 IoU score and Dice coefficient based on the type of preprocessing and network architecture
Network | Preprocessing | IoU | Dice | Latency (ms) |
---|---|---|---|---|
TernausNet | ○ | 0.68±0.25 | 0.75±0.25 | 85 |
TernausResNet | ○ | 0.76±0.23 | 0.82±0.23 | 103 |
LinkNet | ○ | 0.74±0.25 | 0.80±0.24 | 85 |
DeepLab V3+ | ○ | 0.75±0.23 | 0.81±0.23 | 87 |
TernausNet | × | 0.69±0.25 | 0.75±0.25 | 86 |
TernausResNet | × | 0.74±0.25 | 0.80±0.24 | 102 |
LinkNet | × | 0.73±0.25 | 0.79±0.25 | 88 |
DeepLab V3+ | × | 0.74±0.23 | 0.80±0.22 | 86 |
Values are presented as mean±standard deviation.
IoU: intersection over union.
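For reference, the overlap metrics reported in Table 2 can be computed from binary masks as in the following NumPy sketch; this illustrates the standard definitions and is not the authors' evaluation code.

```python
# Standard per-class IoU and Dice computed from binary masks.
import numpy as np

def iou_and_dice(pred_mask, true_mask, eps=1e-7):
    """pred_mask, true_mask: boolean (or 0/1) arrays of identical shape."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    iou = (intersection + eps) / (union + eps)
    dice = (2 * intersection + eps) / (pred.sum() + true.sum() + eps)
    return iou, dice
```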
TernausNet has the lowest performance among the networks, showing an IoU score approximately 8.58% lower and a Dice coefficient approximately 7.11% lower than the average. In contrast, TernausResNet has the highest IoU score and Dice coefficient, although the differences among the remaining networks are marginal. Training with preprocessed datasets yields slightly higher performance than training with unprocessed datasets. Examples of the network predictions are shown in Fig. 3.
TernausResNet trained on preprocessed datasets exhibits the best performance among the networks. To assess the performance within each class, the IoU scores and Dice coefficients are calculated for each class using this model. The gallbladder class has an IoU score of 0.77, indicating the highest performance; by contrast, the liver class has an IoU score of 0.72, which represents a lower but nonetheless reliable performance (Fig. 4).
Fig. 5 shows three cases of poor performance. Case 1 shows an image including a hemorrhage, case 2 shows overshadowing, and case 3 shows edge fading owing to light spillage. In these cases, the endoscopy images contain noise, hemorrhage, and shading; these issues confuse the network and result in erroneous class predictions27. To address these problems and improve network performance, a follow-up study will collect additional images corresponding to these cases and explore automated exclusion systems or generalization techniques.
In studies by Shvets et al.14 and Scheikl et al.15, TernausNet16 exhibited good performance, and shallow networks such as TernausNet11 and TernausNet16 outperformed deep networks such as LinkNet34 and LinkNet50; this differs from our results, where TernausNet exhibits poor performance. We infer that the reason for this difference is our use of a deeper encoder network than that used by Scheikl et al.15. Scheikl et al.15 reported that TernausNet16 exhibited the highest performance when considering only the best of seven runs; however, when considering all runs, TernausNet16 was excluded from the ranking. Therefore, the differences in the results can be attributed to variations in the number and quality of the datasets, as well as differences in the training methods.
The network with the lowest performance is TernausNet; the performances of the other three networks show only marginal differences. These findings can be attributed to differences in the network encoders: TernausNet employs VGG16 as its encoder, whereas the other networks utilize ResNet50. In our study, we observe that the choice of decoder does not have a significant impact on performance. Consequently, further work should optimize the decoder and evaluate a wider variety of decoders to improve the results. In our study, TernausNet and LinkNet use different encoders from the original models21-23. Although these modified models perform reliably, a performance comparison with the original models is needed to analyze the influence of the encoder modifications. The network latencies with preprocessing are as follows: TernausNet, 85 ms; LinkNet, 85 ms; DeepLab V3+, 87 ms; and TernausResNet, 103 ms (Table 2). Converted to FPS, these latencies correspond to approximately 10-12 FPS. TernausResNet has the longest latency, likely because its ResNet50 encoder is substantially deeper than the VGG16 encoder used by TernausNet26. The segmentation of organs in images can be used for the quantitative evaluation of surgery and to help surgeons understand anatomy.
Our study aims to alleviate the diminished sensory perception in minimally invasive and robot-assisted surgeries through the application of deep learning-based computer-aided techniques. We used deep learning for laparoscopic image segmentation, achieving a Dice coefficient of 0.82 at approximately 11 FPS. Our results need refinement, especially in enhancing inference speed and diversifying the data. Employing deep learning in MIS, as in our approach, stands as a promising solution.
None.
No potential conflict of interest relevant to this article was reported.
This work was supported by the GRRC program of Gyeonggi Province [GRRC-Gachon2023(B01), Development of AI-based medical imaging technology] and by the Technology Innovation Program (K_G012001185601, Building Data Sets for Artificial Intelligence Learning) funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).