Computer Lab 2: Machine Learning in Biomedicine¶
Author:
Dirk de Weerd
Updated by:
Henrik Podéus Derelöv
Ralph Monte
Last updated: 2026-02-05
Built with R-4.4.1
Contents
- Objectives
- 0. Before starting
- 1. PCA and k-means clustering
- 2. Decision trees and LASSO
- Homework Exercise
- References
Objectives¶
In this lab, you will:
- Apply Principal Component Analysis (PCA) for dimensionality reduction.
- Perform k-means clustering to explore patterns in the data.
- Build and evaluate a simple decision tree classifier.
- Use LASSO regression for feature selection and prediction.
- Analyze and interpret results with visualizations and metrics.
0. Before starting¶
Throughout the lab you will encounter different "admonitions", a type of text box. A short legend of the different kinds of admonitions used can be found in the box below:
Different admonitions used (click me)
Background and Introduction
Useful information
Guiding for implementation
Here you are tasked to do something
Reflective questions
Create an R script
Open a new R script file in RStudio and save it, adding your name and surname (e.g. "Lab2_David_Davidsson.R").
Write the R code that you run during this Computer Lab in that file.
Remember to save your work frequently to avoid losing progress.
First, packages we will use during the lab are installed:
Install and load R packages
# Install the necessary packages
install.packages("ggplot2")
install.packages("rpart")
install.packages("rpart.plot")
install.packages("caret")
install.packages("glmnet")
Load the installed packages:
# Load the necessary packages
library(ggplot2)
library(rpart)
library(rpart.plot)
library(caret)
library(glmnet)
Set a random seed for reproducibility:
Set a seed
# Set a seed to produce reproducible results
set.seed(77)
Throughout this lab we will be working with a dataset that is available for download. It is taken from the Expression Atlas [1] and consists of gene expression data measured by RNA-seq on 34 ulcerative colitis patients and 61 healthy controls. The genes have been selected as the most significantly differentially expressed between patients and controls, so the data has a very strong disease signal. The data is provided in RDS format; download it to your computer and read in the file like this:
Download and load data
If not already done, download the data and save it on your computer. Thereafter, load the data (remember to change the path to your actual file location).
resources <- readRDS("path/to/file/resources.RDS")
Now inspect the data before continuing.
Inspect the data
The .RDS file contains 6 fields:
- counts, a matrix containing the RNA expression values of the 100 most differentially expressed genes for 95 samples. The columns are genes and the rows are samples. The count values are normalized using a logarithmic transformation.
- sample_assignments, a vector indicating the class each sample is assigned to, either patient or control. It is in the same order as the rows of counts.
- TF, a matrix containing 82 transcription factor expression values for the 95 samples.
- The same 3 data structures prefixed with homework_; these will be the data used for the homework.
It can be easier to assign the different data structures to their own variable names:
counts <- resources$counts
sample_assignments <- resources$sample_assignments
TF <- resources$TF
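Before moving on, it is good practice to check that the data looks as described above. A minimal sanity-check sketch (the expected dimensions are taken from the field descriptions):
# Quick sanity checks on the loaded data
str(resources, max.level = 1)   # the six fields of the RDS file
dim(counts)                     # expected: 95 samples (rows) x 100 genes (columns)
table(sample_assignments)       # expected: 34 patients and 61 controls
dim(TF)                         # expected: 95 samples x 82 transcription factors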
1. PCA and k-means clustering¶
Principal Component Analysis (PCA) is a dimensionality reduction method that transforms a large(r) set of variables into a smaller set while trying to retain most of the information in the larger set. It does this by identifying the principal components, which are the directions of maximum variance in the high-dimensional data (in this case the expression values of genes). Gene expression datasets typically contain thousands of genes (many more than the 100 in this lab), and PCA helps reduce the number of variables to a smaller set of principal components that still capture most of the variability in the data.
1.1 PCA¶
As stated before, the goal of PCA in this context is to reduce the dimensionality of a dataset while retaining as much variability in the data as possible.
Scale the variables
Before using PCA, we first need to scale the variables to make them comparable. If variables are on different scales, those with larger scales will dominate the variance, which can lead to misleading directions of maximum variance. Scaling ensures that each variable contributes equally to the variance. To do this scaling, we can use the R function scale():
scaled_counts <- scale(counts)
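As a quick optional check, after scaling each gene (column) should have a mean of approximately 0 and a standard deviation of 1:
# Verify the scaling: column means should be ~0 and column SDs ~1
summary(colMeans(scaled_counts))
summary(apply(scaled_counts, 2, sd))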
To calculate the actual PCA, the base R function prcomp() is used:
Calculate the PCA
pca_result <- prcomp(scaled_counts)
Each principal component is calculated as a linear combination of all 100 genes. The first principal component (PC1) is the direction in the data that captures the largest amount of variation. The second principal component (PC2) is orthogonal (at a right angle) to the first and captures the next highest amount of variance, and so on. The PCs themselves are not individual genes but rather constructed features. Each PC is a weighted sum of the original genes. For instance, PC1 might be something along the lines of \(0.2 \cdot \text{Gene}_1 - 0.3 \cdot \text{Gene}_2 + 0.4 \cdot \text{Gene}_3 + \dots + 0.1 \cdot \text{Gene}_{100}\), and similarly for the other PCs.
The function returns a prcomp object containing different data on the PCA result. One of these is under rotation and shows the weights of the genes:
as.data.frame(pca_result$rotation[1:10, 1:10])
| Gene | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 |
|---|---|---|---|---|---|---|---|---|---|---|
| PDPN | 0.1099960 | 0.0510340 | -0.0128633 | 0.0363176 | 0.0310546 | -0.0791853 | -0.0568378 | -0.0185588 | 0.0726820 | 0.1914090 |
| FCN3 | 0.0959363 | 0.0273608 | -0.0612145 | 0.1794276 | 0.0960795 | -0.1172794 | 0.0027734 | 0.0462446 | 0.2363801 | 0.0264787 |
| CSMD2 | 0.1019169 | -0.1010545 | 0.1537644 | 0.0203178 | -0.0702877 | 0.0861881 | 0.0952559 | -0.0053192 | 0.1188184 | -0.0191787 |
| CSF3R | 0.1079575 | 0.0549893 | -0.0328023 | -0.1436733 | 0.0197655 | -0.1182247 | -0.0968843 | -0.0431660 | 0.0551422 | -0.0722753 |
| GBP5 | 0.1040900 | -0.0571512 | -0.0564504 | 0.0700394 | -0.1061002 | -0.0334285 | -0.1557183 | -0.1639598 | 0.0948853 | -0.1332464 |
| S100A9 | 0.1023584 | 0.1602274 | -0.0841380 | -0.1294852 | 0.0574188 | -0.1049216 | -0.0712580 | -0.0071285 | 0.0720515 | 0.0117849 |
| S100A12 | 0.1004667 | 0.1624255 | -0.0601773 | -0.1243799 | -0.0094104 | -0.0616694 | -0.0005206 | 0.0477384 | -0.0131477 | -0.0310802 |
| S100A8 | 0.1015826 | 0.1735901 | -0.0747971 | -0.1053461 | 0.0200346 | -0.1056420 | -0.0441210 | -0.0180896 | 0.0241264 | -0.0259085 |
| S100A3 | 0.0986871 | 0.0599422 | 0.0692475 | 0.0635451 | 0.1753312 | -0.1559075 | 0.0843673 | -0.0758575 | 0.0539966 | 0.1060303 |
| OLFML2B | 0.1105228 | 0.0215786 | 0.1143851 | -0.0154766 | 0.0513583 | -0.0680384 | -0.1105413 | 0.1067265 | -0.0020355 | -0.0629706 |
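To make the weighted-sum interpretation concrete, here is a small optional check: the PC1 score of the first sample can be recomputed by hand from the rotation weights, and it should match the score that prcomp() stores (more on these stored scores below):
# PC1 score of sample 1, computed manually as the weighted sum of its
# scaled expression values
sum(scaled_counts[1, ] * pca_result$rotation[, "PC1"])
# The same value as stored by prcomp()
pca_result$x[1, "PC1"]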
The principal components (PCs) are stored in the x element of the result object, and we will extract this information for plotting. It is extracted as a data.frame, as that is the input format expected by the ggplot2 package.
Plot the PCs
pca_data <- as.data.frame(pca_result$x)
The explained variance for every PC is calculated from the standard deviation and can be used to label the axis in the plots.
variance_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
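It can also be informative to look at how much variance the first few components capture, both per component and cumulatively:
# Percentage of variance explained by the first five PCs
round(variance_explained[1:5] * 100, 2)
# Cumulative percentage of variance explained
round(cumsum(variance_explained)[1:5] * 100, 2)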
The samples are coloured according to their status (patient or control):
ggplot(pca_data, aes(x = PC1, y = PC2, color = sample_assignments)) +
geom_point() +
xlab(paste0("Principal Component 1 (", round(variance_explained[1] * 100, 2), "%)")) +
ylab(paste0("Principal Component 2 (", round(variance_explained[2] * 100, 2), "%)")) +
ggtitle("PCA Plot")
Figure 1. Example of a PCA plot.
1.2 k-means¶
k-means clustering is an unsupervised learning algorithm that is used to partition the dataset into k distinct clusters. The algorithm assigns each data point to the cluster with the nearest centroid while keeping the clusters as compact as possible, i.e. minimizing the within-cluster sum of squared distances. Centroids are the mean positions of the points in a cluster. The "k" in k-means refers to the number of clusters and is a parameter that needs to be specified before running the algorithm. k-means measures the distance between data points and the centroids using the Euclidean distance.
Because k-means depends on the distance between data points, the scaled data is required (just as for PCA); otherwise variables with a larger range will dominate the distance. Another thing that needs to be set is the number of clusters.
Set number of clusters
Since our data has two classes, patient and control, we use two clusters.
k <- 2
kmeans_result <- kmeans(scaled_counts, k)
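Before plotting, the clustering result can be inspected directly. Note that the cluster labels 1 and 2 are arbitrary, so either cluster may correspond to the patients:
# Number of samples per cluster
table(kmeans_result$cluster)
# Cross-tabulate clusters against the known sample status
table(kmeans_result$cluster, sample_assignments)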
One way to visualize the clusters in two dimensions is to add the clustering results to the PCA results. Because the cluster labels are numeric (either 1 or 2), they have to be appended as a factor. Then, we can redo the plot and set the shapes of the data points to correspond to the clusters.
Plot the clustering
pca_data$cluster <- factor(kmeans_result$cluster)
ggplot(pca_data, aes(x = PC1, y = PC2, color = sample_assignments, shape = cluster)) +
geom_point() +
xlab(paste0("Principal Component 1 (", round(variance_explained[1] * 100, 2), "%)")) +
ylab(paste0("Principal Component 2 (", round(variance_explained[2] * 100, 2), "%)")) +
ggtitle("k-Means Clustering on PCA Data")
Figure 2. Example of k-means clustering of the PCA data.
1.3 Exercise¶
Now to the first exercise.
Exercise 1
For this exercise, we generate a new count matrix based on the characteristics of the data used in the example. Run the following code to generate this new matrix:
#Generates one column (gene): for the gene at the given index, it draws new values from normal distributions matching the patient and control means and SDs, and returns them as one vector (patients first, then controls)
generate_data <- function(index, patient_i, control_i, counts){
c(rnorm(length(patient_i), mean = mean(counts[patient_i, index]), sd = sd(counts[patient_i, index])),
rnorm(length(control_i), mean = mean(counts[control_i, index]), sd = sd(counts[control_i, index])))
}
#Shuffle the order of the genes (columns)
genes_rearranged <- sample(100, 100)
#Get the indices of the patient and control samples
patient_i <- which(sample_assignments == "patient")
control_i <- which(sample_assignments == "control")
#Generate new synthetic data for every gene
synthetic_counts <- sapply(genes_rearranged, generate_data, patient_i, control_i, counts)
#If the data generation returned non-positive values, set them to the low value log1p(1)
synthetic_counts[synthetic_counts <= 0] <- log1p(1)
#Copy the row and column names
dimnames(synthetic_counts) <- dimnames(counts)
#Note: the rows of the synthetic matrix are ordered patients first, then
#controls, so build a matching assignment vector for colouring the plots
synthetic_assignments <- sample_assignments[c(patient_i, control_i)]
Perform a PCA on the new dataset.
- Are the patient and control groups well separated in the PCA plot?
- Which principal component explains the most variation in the data?
Perform k-means clustering with 2 clusters and plot the results alongside the PCA plot.
- How many samples are in each cluster?
2. Decision trees and LASSO¶
Both decision trees and LASSO are examples of supervised learning, where the goal is to learn a mapping from input features to output labels based on a set of labelled training data. The algorithms use training data to learn the relationship between the features and the labels, and then apply this learned relationship to make predictions on new, unseen data.
2.1 Split the data in train and test sets¶
Before training models, the data needs to be split into a training set and a test set. The reason for this split is to evaluate the performance of the model. The model is trained using the training set, which consists of a portion of the original data. Then, the model needs to be tested on the test set, a separate portion of the data, to assess how well it generalizes to unseen data.
It is important to randomly sample the data when creating these sets to avoid any bias, so we randomly sample 70% of the data as the train set. The remainder is the test set.
Sample data into train and test set
To do this, we first store the indices of the train set in a vector, in this case named train_index. Then, we subset the scaled_counts matrix with this vector, and use the negation to retrieve the test set. The same procedure is applied to the sample assignments.
train_index <- sample(1:nrow(scaled_counts), floor(0.7 * nrow(scaled_counts)))
train_counts_scaled <- scaled_counts[train_index, ]
test_counts_scaled <- scaled_counts[-train_index, ]
train_status <- sample_assignments[train_index]
test_status <- sample_assignments[-train_index]
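A quick check that the split behaves as intended (with 95 samples, this gives 66 training and 29 test samples) and that both classes are present in each set:
# Sizes of the train and test sets
length(train_index)
nrow(test_counts_scaled)
# Class balance in each set
table(train_status)
table(test_status)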
2.2 Decision tree¶
The model learns to predict the status of a sample by selecting features that best split the data. For constructing a decision tree, it does not matter whether we use scaled or unscaled data, as the algorithm splits on the ordering of the values and not on distances. However, for interpreting a tree the original values are usually more informative than their scaled counterparts. We can use the same train/test split vector to partition the original data:
Partition the original data
train_counts <- counts[train_index, ]
test_counts <- counts[-train_index, ]
Now to constructing the decision tree.
Construct and plot the decision tree
We can use the rpart() function to construct our tree. The parameter method = "class" denotes that we use the tree as a classifier.
tree_model <- rpart(train_status ~ ., data = as.data.frame(train_counts), method = "class")
The rpart.plot package is used to plot the tree. Since we have such a strong signal in the data, one or two genes could be enough to construct a meaningful model.
rpart.plot(tree_model)
Figure 3. Example of a decision tree.
The next step is to make predictions using the decision tree on the test set. For this we use the predict() function.
Make predictions
The first parameter is our decision tree, the second is the test dataset, and type = "class" requests class predictions. The output will be a factor containing the prediction for each sample.
predictions <- predict(tree_model, as.data.frame(test_counts), type = "class")
A common way to evaluate performance is through a confusion matrix. It is a table that visualizes how well the model is classifying data. Because we have a binary classification problem (patient or control), the matrix has two rows and two columns, representing the two classes. The four fields in the matrix represent the following:
- True Positives (TP): The count of instances correctly classified as Positive.
- True Negatives (TN): The count of instances correctly classified as Negative.
- False Positives (FP): The count of instances incorrectly classified as Positive when they are actually Negative (Type I error).
- False Negatives (FN): The count of instances incorrectly classified as Negative when they are actually Positive (Type II error).
To generate the confusion matrix, use the confusionMatrix() function from the caret package:
Generate the confusion matrix
confusionMatrix <- confusionMatrix(predictions, as.factor(test_status), positive = "patient")
knitr::kable(confusionMatrix$table)
|  | control | patient |
|---|---|---|
| control | 14 | 1 |
| patient | 2 | 12 |
The rows correspond to the predicted class and the columns to the actual (reference) class. The true negative samples are in the top left cell; the model classified 14 correctly. The false negatives are in the top right cell; in this case there is 1. In the bottom left cell we find the false positives, 2 samples. Finally, in the bottom right are the true positives, 12 in total.
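The confusionMatrix object also stores summary statistics derived from these four counts, including the sensitivity (TP / (TP + FN)) and specificity (TN / (TN + FP)), which you will need for the homework:
# Overall accuracy of the classifier
confusionMatrix$overall["Accuracy"]
# Sensitivity and specificity, with "patient" as the positive class
confusionMatrix$byClass[c("Sensitivity", "Specificity")]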
2.3 LASSO¶
LASSO regression (Least Absolute Shrinkage and Selection Operator) is a regression analysis method that performs both variable selection and regularization. It is used to enhance the prediction accuracy and interpretability of regression models by altering the model to include only the most important variables. LASSO does this by imposing a penalty on the absolute size of the coefficients. Because of this, some coefficients can become exactly zero, which means LASSO completely removes some features from the model, and in this way performs feature selection. This is particularly useful in cases where there is a large number of features, as it simplifies the model. LASSO regression involves a penalty term λ that controls the degree of regularization. This lambda value determines how much the coefficients are shrunk towards zero. If λ is too large, the model might end up with too few variables. If it is too small, the model might become too complex and overfit the data. Therefore, in LASSO it is very important to select a suitable λ value.
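For reference, the quantity LASSO minimizes can be written as the usual least-squares error plus the penalty term (a standard formulation; glmnet uses an equivalent scaling), for \(n\) samples and \(p\) features:
\[ \min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]
The first term rewards a good fit to the data, the second penalizes large coefficients, and λ sets the balance between the two; increasing λ pushes more coefficients to exactly zero.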
In this lab, we use the expression of Transcription Factors to predict the expression of a random gene from our dataset. Again, we train the model using the train data we prepared previously.
To perform the LASSO, we use the R package glmnet. For this example, we first select a random gene to predict:
Select a random gene
# Pick a random gene (column) index
random_n <- sample(x = 1:ncol(scaled_counts), size = 1)
Subset the expression values of the randomly selected gene into a vector:
gene_to_predict <- scaled_counts[, random_n]
The workflow for doing a LASSO in R starts with the function cv.glmnet().
Cross validation
cv stands for Cross Validation, and is used to determine a suitable value for the previously mentioned λ. cv.glmnet performs k-fold cross-validation to find the lambda value that minimizes prediction error. In k-fold cross-validation, the data is divided into k subsets. The model is trained on k-1 subsets and validated on the remaining subset, and this process is repeated k times. The alpha parameter specifies the type of model we are fitting, and alpha = 1 is a LASSO model, as it penalizes the absolute value of the coefficients, allowing them to become zero. As LASSO requires data to be on the same scale, the TF data also needs to be scaled.
scaled_TF <- scale(TF)
cv_model <- cv.glmnet(scaled_TF[train_index,], gene_to_predict[train_index], alpha = 1)
The plot of the cross validation gives a visual overview of the trade-off between complexity and performance.
Plot the cross validation
The logarithm of λ is on the x-axis and the Mean Squared Error on the y-axis. A higher λ generally means that the model complexity decreases (fewer features are selected), but this might come at the cost of increased error. The plot helps in finding a balance between a simpler model and one that performs well. In general, the λ value that minimizes the MSE is considered best.
plot(cv_model)
Figure 4. Example of a cross validation plot.
The numbers above the plot represent how many features have non-zero coefficients at the corresponding λ value. As λ increases, the number of features with non-zero coefficients typically decreases, reducing the model's complexity. The plot shows how many features are selected at a given λ value and its associated MSE value, linking the model's sparsity (number of features) with its performance (MSE).
With the cross validation object we can also find the λ that gives the lowest MSE.
Find the lowest MSE
The λ that minimizes the MSE can easily be extracted from the CV object:
best_lambda <- cv_model$lambda.min
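Besides lambda.min, the cross validation object also stores lambda.1se, the largest λ whose error is within one standard error of the minimum. It is a common, more conservative choice that yields a sparser model; in this lab we proceed with lambda.min.
# A more conservative alternative to lambda.min
cv_model$lambda.1se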
With the appropriate λ value identified, we can now fit the model.
lasso_model <- glmnet(scaled_TF[train_index, ], gene_to_predict[train_index], lambda = best_lambda)
To inspect the coefficients, the coef() function can be used. There we can see which transcription factors contribute to the model, and by how much.
Inspect coefficients
head(coef(lasso_model))
## 6 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.01312105
## RPS6KA1 .
## CHTOP .
## SMG5 .
## CCT3 .
## MEF2D .
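Since most coefficients are zero (shown as "." in the sparse matrix above), it can be more informative to list only the features LASSO actually kept. A small sketch:
# Extract only the non-zero coefficients, i.e. the selected TFs
coefs <- as.matrix(coef(lasso_model))
coefs[coefs[, 1] != 0, , drop = FALSE]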
Now that we have the model, the logical continuation is to use it to make predictions. Again we use the predict() function, this time on the samples we have not used to train the model; in other words, the test set.
Make predictions
predictions <- predict(lasso_model, newx = scaled_TF[-train_index,])
It can be convenient to bind the actual and the predicted values together in a dataframe for further analysis and plotting:
predicted_df <- as.data.frame(cbind(test_counts_scaled[, random_n], predictions))
colnames(predicted_df) <- c("Actual", "Predicted")
To calculate the MSE between the predictions and the actual values we can use the values in the newly created dataframe:
mean((predicted_df$Actual - predicted_df$Predicted)^2)
## [1] 0.2699112
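As a complement to the MSE, the squared correlation between actual and predicted values gives an R²-style measure of how much of the variation the model captures:
# Squared correlation between actual and predicted values
cor(predicted_df$Actual, predicted_df$Predicted)^2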
To visualize the performance of the model, a common choice is a scatterplot of predicted against actual values. We also add a dashed blue identity line (y = x); points on this line are predicted perfectly.
Scatterplot
ggplot(predicted_df, aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.7) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "blue") +
labs(x = "Actual Values", y = "Predicted Values", title = "Comparison of LASSO Predictions and Actual Values") +
theme_minimal()
Figure 5. Example of a scatter plot.
2.4 Exercise¶
Now to the next exercise.
Exercise 2
Use your generated dataset and make a new decision tree.
- What gene(s) are used to classify the samples?
- How many false positives do you get?
- Also construct a LASSO model and predict your test set.
- Which transcription factor contributes the most to the model? What MSE do you get?
Homework Exercise¶
Lastly, you will also do a homework exercise on these topics.
Homework exercise
For this homework exercise, you will do an analysis similar to the one in this document, using the homework dataset in the resources RDS file that you downloaded earlier. It is similar to the dataset used previously, with a normalized count matrix, sample assignments and TF expression values. The data is from asthma patients versus controls [2]. It can be accessed using:
resources$homework_counts
resources$homework_sample_assignments
resources$homework_TF
Use this dataset to:
- A. Construct a decision tree to classify patients and controls
- B. Make a confusion matrix on the classification results
- Calculate the specificity and sensitivity of the results
- C. Perform a PCA and plot the results. Give a short interpretation of the plot.
- D. Perform a LASSO regression using the transcription factors as predictors for all 100 genes, where the presence of a coefficient above a certain threshold will be considered as an edge in the gene-TF network. As this is not an easy task, a function to do this, iterate_lasso(), is provided below.
Information about the iterate_lasso function
The function takes four arguments:
- scaled_counts: The scaled expression values
- scaled_TF: The scaled expression values for the TFs
- train_index: The index for the train set in a vector
- min_coef: The minimal absolute value the coefficient should have. The default is set to 0.2, but you can change it.
The function iterate_lasso() returns a two-column edge list, where:
- Column 1 (TF): Contains transcription factors (TFs).
- Column 2 (genename): Contains genes regulated by the TFs.
Code for the iterate_lasso function
iterate_lasso <- function(scaled_counts, scaled_TF, train_index, min_coef = 0.2){
  all_edges <- NULL
  #For every column (gene)
  for (i in 1:ncol(scaled_counts)){
    #Extract the gene expression values we want to predict
    gene_to_predict <- scaled_counts[, i]
    #Also extract the gene name of that gene
    genename <- colnames(scaled_counts)[i]
    #Find a suitable lambda value through cross validation
    cv_model <- cv.glmnet(scaled_TF[train_index, ], gene_to_predict[train_index], alpha = 1)
    best_lambda <- cv_model$lambda.min
    #Fit the LASSO model with that lambda
    lasso_model <- glmnet(scaled_TF[train_index, ], gene_to_predict[train_index], lambda = best_lambda)
    #Take the coefficients as a matrix and drop the first row (the intercept),
    #as this is easier for downstream analysis
    coefficients_lasso <- as.matrix(coef(lasso_model))[-1, , drop = FALSE]
    #Which TFs have an absolute coefficient larger than min_coef
    TFs <- rownames(coefficients_lasso)[abs(coefficients_lasso[, 1]) > min_coef]
    #If there are such interactions
    if (length(TFs) > 0){
      #Bind them in a data frame
      current_edges <- cbind(TFs, genename)
      #And append to the previously found edges
      all_edges <- rbind(all_edges, current_edges)
    }
  }
  return(as.data.frame(all_edges))
}
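A minimal usage sketch, assuming you have scaled the homework counts and TF matrices and created a train index for the homework data in the same way as in the main lab (the homework_* variable names below are placeholders you create yourself):
# Hypothetical usage; create the three inputs from resources$homework_*
# following the same steps as in the main lab
homework_edges <- iterate_lasso(scaled_homework_counts, scaled_homework_TF, homework_train_index)
head(homework_edges)  # two columns: the TF and the gene it regulates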
- E. Which TF regulates the largest number of genes? In other words, which transcription factor appears most frequently in column one? How many genes are regulated by this TF?
Hint
Use the table() function to count how many times each transcription factor appears in the edge list's first column. Then, use the order() function to find the TF with the highest count.
- F. Which gene is regulated by the largest number of TFs? Similar to question E, which gene name appears most frequently in column two? How many TFs regulate this gene?
- G. Construct an accompanying data frame node list (which is the unique set of genes and TFs in the edge list).
- H. Create a graph from the node and edge list, as you did in lab 1.
- Should the graph be directed, as in arrows indicating direction of regulation or undirected?
- Is it a bipartite graph? Why or why not?
- Color the TFs and genes differently.
- Include the plot in your report.
Submit your code and results in two separate files through Lisam or as a single R-markdown file. The deadline is Wednesday 18th of February 2026 at 17:00. Note: late submissions will not be graded. You are then only allowed to hand in your work during the resubmission opportunity (resubmission deadline: Friday 27th of March 2026 at 23:30).