Explainable Deep Learning (xDL) in the context of Human-Machine Interaction (HMI) for diagnosing nephropathology using Deep Learning refers to a transparent and interpretable approach to AI models. It involves using deep learning techniques to analyze medical images of kidneys and detect potential diseases while providing clear explanations for the AI's diagnostic decisions. In this process, the AI model is trained on a large dataset of kidney images to learn patterns and features associated with various kidney diseases and normal kidney structures. Once trained, the model can analyze new medical images and identify potential abnormalities, such as kidney tumors, cysts, or infections.

What sets xDL apart is its ability to offer insights into the reasoning behind its predictions. This transparency allows healthcare professionals to understand how the AI model arrived at its diagnostic decision, making the interaction between humans (doctors) and machines (AI) more collaborative and trustworthy. By providing explanations in a comprehensible manner, xDL enables doctors to better interpret and validate the AI's diagnostic suggestions. It also helps in identifying potential biases or limitations in the AI model's decision-making process, making the overall diagnostic process more reliable and accurate.

Explainable Deep Learning holds great promise in nephropathology diagnosis, as it fosters an effective partnership between human expertise and AI capabilities. By combining the strengths of both, xDL can lead to more precise and earlier detection of kidney diseases, ultimately improving patient outcomes and advancing medical research in nephrology.

Figure 1: Tubulus of kidney in HSA KIT


The HSA KIT program ensures accuracy down to the tiniest pixel detail with its user-friendly interface and sophisticated professional annotation tools. It vastly improves the whole analytic process through process standardization, consistency maintenance, and support for reproducibility.

Zoom Out View

Figure 2: IgAN_Tubulus with High opacity

Figure 3: IgAN_Tubulus with Low opacity

Process using HSA KIT
At HS Analysis, our approach to slide analysis is thorough. We do not simply examine the slides individually; we also integrate our solutions into the existing system. This includes connecting to the LIS (Laboratory Information System) and identifying the slides in a way that makes them simple to save, search for, and use for medical purposes. We want to give our customers a seamless experience that enables them to access and use the data they require quickly and effectively.

AI-based analysis utilizing the HSA KIT would offer:

a standard procedure with subjective and objective evaluation

extraction of relevant features from raw data and production of usable representations for AI model training

selection and configuration of modules with little to no code

easy-to-use software: Annotate, Train, and Automate

fast and effective analysis of multiple medical images, reducing the time needed for diagnosis or treatment

automated report generation to increase efficiency and assist radiologists or doctors in the evaluation process

Figure 4: Internal network


The kidney is a vital organ in the human body responsible for filtering blood and removing waste products, excess water, and toxins from the bloodstream. Each person typically has two kidneys, located on either side of the spine, below the ribcage. The main functions of the kidneys include:

  1. Filtration: The kidneys filter the blood, separating waste products, excess salts, and water from useful substances, creating urine as a result.
  2. Regulation of Water and Electrolyte Balance: The kidneys maintain the body’s water and electrolyte balance, ensuring that the right levels of minerals, such as sodium, potassium, and calcium, are maintained in the bloodstream.
  3. Acid-Base Balance: The kidneys help regulate the body’s pH levels by excreting or retaining hydrogen ions as needed.
  4. Blood Pressure Regulation: The kidneys play a role in controlling blood pressure by releasing hormones that affect blood vessel constriction and fluid balance.
  5. Production of Hormones: The kidneys produce hormones, such as erythropoietin, which stimulates the production of red blood cells, and renin, which helps regulate blood pressure.

Figure 5: Kidney of human


Nephrology is a medical specialty that focuses on the diagnosis, treatment, and management of conditions related to the kidneys. Nephrologists are doctors who specialize in nephrology and are trained to address various kidney-related disorders, as well as electrolyte and fluid imbalances in the body.

The kidneys play a crucial role in maintaining the body’s overall health by filtering waste products and excess fluids from the blood, regulating electrolyte and fluid balance, and producing hormones that help control blood pressure and red blood cell production. Nephrologists are responsible for diagnosing and treating a wide range of kidney-related issues, which can include:

Nephrologists work closely with other medical specialists, such as urologists, cardiologists, endocrinologists, and primary care physicians, to provide comprehensive care to patients with kidney-related issues. They use various diagnostic tools, including blood and urine tests, imaging studies, and biopsies, to assess kidney function and develop appropriate treatment plans.

Overall, nephrology is a critical field that addresses the health and well-being of individuals with kidney disorders, ensuring they receive the best possible care to manage their conditions and maintain their overall quality of life.

Figure 6: Nephrology

Whole Slide Imaging (WSI)

All current WSI systems consist of illumination systems, microscope optical components, and a focusing system that precisely places an image on a camera. The final product, or virtual slide, can be assembled in various ways depending on the particular scanner being used (tiling, line scanning, dual sensor scanning, dynamic focusing, or array scanning). The result is a comprehensive digital rendering of an entire glass slide, visible at resolutions of less than 0.5 μm, that can be examined with interactive software on a computer screen. These sequential images are combined into one digital image of the slide, which is then analyzed. In the resulting WSI pyramid, the base level is the original slide image with the highest resolution, while the higher levels hold down-sampled versions of the image at different magnifications. Each magnification level carries different information, since structures in the slide sample appear differently depending on the magnification level. It is therefore essential to detect an abnormality within a specific range of levels.
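The pyramid structure described above can be sketched in a few lines: each level halves the resolution of the level below it. The factor of 2 per level is a common convention, not a property of every scanner, so treat this as an illustrative sketch.

```python
def pyramid_levels(base_width, base_height, num_levels, factor=2):
    """Return the (width, height) of each WSI pyramid level.

    Level 0 is the original full-resolution slide image; every
    further level is down-sampled by `factor` in each dimension.
    """
    levels = []
    for lvl in range(num_levels):
        scale = factor ** lvl
        levels.append((base_width // scale, base_height // scale))
    return levels

# A hypothetical 100,000 x 80,000 px slide over 5 pyramid levels:
print(pyramid_levels(100_000, 80_000, 5))
```

Analysis software then picks the level whose magnification best matches the size of the structure being searched for, rather than always working on the full-resolution base level.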

Figure 7: Structure of the WSI pyramid according to

Deep Learning

Deep learning is a subset of machine learning, which is essentially a neural network with three or more layers. These neural networks attempt to simulate the behavior of the human brain, albeit far from matching its ability, allowing them to "learn" from large amounts of data. While a neural network with a single layer can still make approximate predictions, additional hidden layers help to optimize and refine the accuracy. Deep learning drives many artificial intelligence (AI) applications and services that improve automation, performing analytical and physical tasks without human intervention. Deep learning technology lies behind everyday products and services (such as digital assistants, voice-enabled TV remotes, and credit card fraud detection) as well as emerging technologies (such as self-driving cars).
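As a minimal illustration of the "additional hidden layers" mentioned above, the following NumPy sketch runs a forward pass through a small multi-layer network. The layer sizes and random weights are purely illustrative and have no relation to the models trained in this work.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, weights, biases):
    """Forward pass through a small multi-layer network.

    All layers except the last apply a ReLU non-linearity;
    the final layer is a plain linear output layer.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return h @ weights[-1] + biases[-1]

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]  # input, two hidden layers, output (illustrative)
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

out = mlp_forward(rng.normal(size=(2, 4)), weights, biases)
print(out.shape)  # (2, 3): two samples, three output values each
```

Training would additionally require a loss function and gradient-based weight updates; the sketch only shows how data flows through stacked layers.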

Figure 8: Pathology image

Digitalization of slides

The software HSA SCAN, developed by HS Analysis GmbH, is used to carry out the task of digitizing slides. The goal of HSA SCAN is to convert physical slides into digital files. Additionally, the process makes use of the HSA SCANNER, a piece of technology made to cost-effectively transform an analog microscope into a digital one.

Figure 9: Digitalization of slide

Workflow with HSA KIT

Sample analysis and slide digitization have never been easier. Customers that want to adopt cutting-edge alternatives and improve the effectiveness of their workflow can benefit from an unmatched experience provided by HSA KIT. The HSA staff goes above and beyond to please its clients, offering continual support and upgrades in addition to aiding with software installation and integration.

Documentation for the tubulus kidney project:

Name of file            Name of project    Number of annotations
IgA_OT_01_C2+C4b.czi    tubulus            465
IgA_OT_03_C2+C4b.czi    tubulus            346
IgA_OT_28_C3+CFB.czi    tubulus            430
IgA_OT_28_C3+CFB.czi    tubulus            350
Total: 10 files                            3644

The IgA is an important antibody that plays a key role in the immune system’s defense against infections, particularly at mucosal surfaces like the lining of the respiratory, digestive, and reproductive tracts.

The documentation for the tubulus kidney project records the project's goals, progress, and results. It lists the slide files used, the project names, and the number of annotations created per file, so that the ground truth data behind the trained models can be traced and the work reproduced.

Mask R-CNN

Mask R-CNN is a deep learning model that can detect objects in images and accurately segment them at the pixel level. It combines object detection and instance segmentation.

In simpler terms, Mask R-CNN can look at an image and not only tell you what objects are present, but also precisely outline and label each object’s shape within the image. It does this by creating a binary mask for each object, which is like a stencil that fits perfectly around the object.

This model is called "Mask R-CNN" because it is an extension of another model called "Faster R-CNN", which is commonly used for object detection. Mask R-CNN adds an extra step to Faster R-CNN, allowing it to generate these detailed masks.

The advantage of Mask R-CNN is that it provides a more detailed understanding of the objects in an image. It can be used in various applications, such as autonomous driving, medical imaging, and video surveillance, where accurate object detection and segmentation are important.

In terms of explainability, Mask R-CNN is considered explainable in deep learning because it produces interpretable results. The generated masks allow us to visually understand how the model identifies and segments objects. Additionally, researchers can analyze the model’s architecture and training process to gain insights into its decision-making process, making it more interpretable and explainable compared to other deep learning models.
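The binary masks described above are usually evaluated by how well they overlap with the ground truth annotations. A minimal sketch of that step, assuming the model outputs per-pixel scores in [0, 1] (the threshold of 0.5 is a common default, not a fixed rule):

```python
import numpy as np

def mask_iou(pred_mask, gt_mask, threshold=0.5):
    """Intersection-over-Union between a predicted soft mask and a
    binary ground-truth mask, as used when evaluating instance
    segmentation models such as Mask R-CNN."""
    pred = pred_mask >= threshold          # binarize per-pixel scores
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union else 0.0

# Toy 2x2 example: 2 overlapping pixels out of 3 pixels in the union.
pred = np.array([[0.9, 0.2], [0.8, 0.1]])
gt = np.array([[1, 0], [1, 1]], dtype=bool)
print(mask_iou(pred, gt))
```

In practice a predicted instance is counted as correct when its IoU with a ground-truth instance exceeds some threshold (often 0.5), which is what metrics like average precision build on.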

Figure 10: Mask R-CNN network architecture

Vision Transformer

Vision Transformer, also known as ViT, is a deep learning model that applies the transformer architecture to computer vision tasks. Transformers were originally introduced for natural language processing, but ViT adapts them for image understanding.

In traditional convolutional neural networks (CNNs), local spatial relationships are captured through convolutional layers. However, ViT replaces these convolutional layers with self-attention mechanisms, which allow the model to capture global dependencies between image patches.

ViT divides an input image into fixed-size patches and flattens them into sequences. These patches are then fed into a transformer encoder, which consists of multiple layers of self-attention and feed-forward neural networks. The self-attention mechanism enables the model to attend to different patches and learn their relationships, while the feed-forward networks process the attended information.
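The patching step described above can be sketched with plain NumPy. The 224 x 224 input size and 16-pixel patches below match a common ViT configuration but are assumptions for illustration; the learned linear projection and positional embeddings that follow in a real ViT are omitted.

```python
import numpy as np

def image_to_patches(img, patch_size):
    """Split an (H, W, C) image into flattened fixed-size patches,
    as done at the input of a Vision Transformer.

    H and W must be divisible by patch_size.
    """
    H, W, C = img.shape
    p = patch_size
    # (H//p, p, W//p, p, C) -> (H//p, W//p, p, p, C): one patch per grid cell
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)  # (num_patches, patch_dim)

img = np.zeros((224, 224, 3))
print(image_to_patches(img, 16).shape)  # (196, 768)
```

Each of the 196 rows is then linearly projected to the model dimension and fed into the transformer encoder as one token, which is what lets self-attention relate any patch to any other patch in the image.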

During training, ViT learns to associate each patch with a class label through a classification head. This enables the model to perform tasks such as image classification, object detection, and image segmentation.

One advantage of ViT is its ability to handle images of varying sizes, as it operates on fixed-size patches. However, it may struggle with capturing fine-grained details compared to CNNs, especially when dealing with large images or complex visual patterns.

In summary, Vision Transformer is a deep learning model that applies the transformer architecture to process images by dividing them into patches and using self-attention mechanisms to capture global dependencies between these patches.

Figure 11: ViT

Explainable Deep Learning (xDL)

Explainable deep learning (xDL) refers to the ability of an artificial intelligence system to provide understandable and transparent explanations for its decisions and actions. It aims to bridge the gap between the complex inner workings of AI models and the need for human comprehension and trust.

xDL techniques focus on providing insights into how AI models arrive at their predictions or recommendations. This can involve techniques such as generating feature importance scores, highlighting relevant data points, or providing textual or visual explanations. The goal is to make AI systems more interpretable and accountable, enabling users to understand the reasoning behind the AI's outputs.
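One simple way to generate such feature importance scores is occlusion: mask out one image region at a time and measure how much the model's output changes. The sketch below is model-agnostic; the scoring function passed in is a placeholder for any trained model's output.

```python
import numpy as np

def occlusion_importance(img, model_score, patch=8):
    """Score each patch of a 2-D image by how much masking it
    changes the model's scalar output.

    `model_score` is any function mapping an image to a scalar
    (e.g. the predicted probability of one class).
    """
    base = model_score(img)
    H, W = img.shape[:2]
    heat = np.zeros((H // patch, W // patch))
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            occluded = img.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # mask one patch
            heat[i // patch, j // patch] = base - model_score(occluded)
    return heat

# Toy example: the "model" just averages the pixel values.
heat = occlusion_importance(np.ones((16, 16)), lambda im: float(im.mean()), patch=8)
```

Patches whose removal drops the score the most are the ones the model relied on most, giving a coarse but easily understood importance map.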

Figure 12: Explainable Deep Learning (xDL)


In the context of explainable deep learning, a heatmap is a visual representation that highlights the important regions or features in an input image that contribute to a model’s prediction. It provides a way to understand which parts of the image are most influential in the decision-making process of the model.

Heatmaps are typically generated using techniques such as gradient-based methods or attention mechanisms. These methods analyze the gradients or attention weights of the model to determine the importance of different regions in the input image. The resulting heatmap is then overlaid on the original image, with brighter or hotter regions indicating higher importance.

By examining the heatmap, researchers and practitioners can gain insights into how the model focuses on specific areas of the image to make predictions. This helps in understanding the model’s decision-making process and provides a form of interpretability and transparency. Heatmaps are particularly useful in tasks such as object localization, where they can highlight the regions that the model considers most relevant for a particular object class.
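Whatever method produces the raw saliency values, the resulting map is usually rescaled to [0, 1] and blended onto the input image before display. A minimal sketch of that overlay step (the blending weight of 0.5 is an arbitrary illustrative choice):

```python
import numpy as np

def normalize_heatmap(saliency):
    """Rescale a raw saliency/gradient map to [0, 1] so it can be
    displayed or blended onto the input image."""
    s = saliency - saliency.min()
    span = s.max()
    return s / span if span > 0 else s

def overlay(image_gray, heatmap, alpha=0.5):
    """Blend a normalized heatmap onto a grayscale image.

    Both inputs are expected in [0, 1]; `alpha` controls how strongly
    the heatmap dominates the blended result.
    """
    return (1 - alpha) * image_gray + alpha * heatmap

raw = np.array([[0.0, 2.0], [4.0, 8.0]])
blended = overlay(np.zeros((2, 2)), normalize_heatmap(raw), alpha=0.5)
```

In a real viewer the normalized map would additionally be passed through a color map (hence "hotter" regions), but the normalization and blending logic is the same.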

Figure 13: Heatmap

Creation of Ground Truth Data

The samples used in this work were prepared by a customer of HS Analysis GmbH. The slides were scanned and converted to digital NDPI files, which were then sent to HS Analysis so that DL models could be built.

In order to train a model in deep learning, ground truth data (GTD) is needed. It is created by drawing a Base ROI and annotating the existing cells within it. In this work, 8 files with more than 2,000 GTD annotations were available (see Table 2), created with the proprietary HSA KIT software. Creating the GTD involves several phases. First, a WSI file in the Carl Zeiss Image Data File format (CZI) is loaded into the HSA KIT. Then the kidney cell structures are annotated with the "Tubulus" structure. The quantity and quality of these annotations depend on the location, clarity, and size of the Base ROI. Table 2 gives a breakdown of the class distribution along with the corresponding available and used data.

Classes      Base ROI    Tubules
All data     14          2178
Used data    8           1397

Table 2: The available and used amounts of data.

Man Machine Interaction (MMI)

HMI is all about how people and automated systems interact and communicate with each other. This has long ceased to be confined to traditional machines in industry and now also relates to computers [39]. With HMI it is easy to create and modify GTD in the HSA KIT in order to produce new and better AI models.

Figure 14: Man Machine Interaction

Automated classification

The HSA software uses AI to automatically classify images and detect cells and/or classes through the use of GTD, which the software uses to train a model. This allows for more efficient and effective analysis of images. For the purpose of this work, 5 files were used along with more than 2,000 GTD annotations in two main stages of implementation. In the first stage, a model was trained on the 5 files with the Mask R-CNN architecture on the tubulus class. In the second stage, the same process was repeated, the only difference being the use of the Vision Transformer architecture. In Fig. 15, the GTD highlighted in red indicates structures detected as "Tubulus".

Figure 15: A screenshot of the HS Analysis software interface during an image classification process [Own illustration].

Selection of the data set

After creation of the GTD, the hyperparameters in Table 3 were used in the single-class AI model training for two different architectures, Mask R-CNN and Vision Transformer (ViT).

Model Type               Epochs    Learning Rate    Batch Size    Tile Size
Instance Segmentation    100       0.0001           2             512

Table 3: The settings used for the AI model training.
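For reference, the settings above can be collected into a single training configuration object. The key names below are illustrative only and do not reflect the actual HSA KIT configuration format.

```python
# Hyperparameters from the training-settings table, gathered in one place.
# Key names are hypothetical; the values mirror the table above.
train_config = {
    "model_type": "instance_segmentation",
    "epochs": 100,
    "learning_rate": 1e-4,
    "batch_size": 2,
    "tile_size": 512,  # edge length in pixels of the tiles cut from the WSI
}

for key, value in train_config.items():
    print(f"{key}: {value}")
```

Keeping such a configuration in one object makes it easy to log alongside the trained model, so results like those in the following section can be traced back to the exact settings that produced them.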

Interpretation and validation of result

This section focuses on the comparison of the loss and accuracy evaluations of the instance segmentation results produced by both trained models (Mask R-CNN and Vision Transformer) for the single Tubulus class. Table 4 shows the results obtained from the model training.

Model                 Class      Loss        Accuracy (%)
Mask R-CNN            Tubulus    0.635201    95.068032
Vision Transformer    Tubulus    0.943421    92.873263

Table 4: Loss and accuracy results of the trained models.

Figures 16 and 17: (A) Loss evaluation comparison, (B) Accuracy evaluation comparison