Lab: Exploring the use of LLMs for modeling¶
Updated version 2025
The lab has been updated for this year. If you find oversights or something that is not working correctly, please reach out to the teachers. Also, if you have already started on the previous version of this lab (the gene differentiation and autoencoder lab), please ask the lab supervisors which version you should submit in the report.
Lab Overview¶
In this lab we'll work with lightweight LLMs that run on a laptop CPU. The focus is practical, with little emphasis on the theory behind LLMs. In Part 1 we see how text is turned into tokens that a model understands, in Part 2 how these tokens become vectors, and in Part 3 how new text is generated. Part 4 is on summarization and semantic search, and the lab ends with a simple instruction-following setup in Part 5.
LLMs can be useful in modelling when the model requires structured fields but the input data we have is (partly) free text. Some things these models could help with in modelling include:
Inputs: extract meals, medications, symptoms, timings, units.
Normalisation: map synonyms to canonical variables ("cornbread" → bread), fix casing, standardise units.
Design: skim papers for equations, priors, parameter ranges, and assumptions. The LLM could pull short, source-linked snippets.
Feedback: use model outputs to ask for missing fields ("portion size?"), then re-run.
Scale: do the same thing for many notes, logs, or reports with the same instructions.
To work with LLMs in practice, we need a toolbox. Hugging Face is the most widely used ecosystem for open source LLMs. It provides both pre-trained models and convenient Python libraries for running them.
Lab Setup¶
Check the instructions below and decide if you are running the lab on a local python setup on your computer, in the computer hall, or through a cloud service.
Note that the LLMs are CPU-heavy and your computer might not be able to handle the load; if so, use the cloud notebook instead.
If you haven't done this, follow the general Get started instructions before you start with this lab. But, in short you need:
- Python installation running on your computer (or a computer in the computer hall)
- Python packages (which is done in the package installation step below).
- A text editor or IDE to write code in. If you have no preference, we recommend using VS Codium or VS Code.
- Downloaded and extracted the scripts for Lab 4B: Lab4B files
Package installation
To install the packages required for this lab we recommend using uv as your package manager, see installation instructions.
You need to be located in the same folder as the pyproject.toml file that was included in the downloaded files.
uv sync
You need to be located in the same folder as the requirements.txt file that was included in the downloaded files.
pip install -r requirements.txt
If you are using a computer in the computer hall, Python and a valid C compiler are already installed. You only need to download and extract the scripts for Lab 4B: Lab4B files
Navigate to the folder with the downloaded scripts in the terminal, and then create a new virtual environment and install the required packages:
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Now, open the project in your preferred text editor or IDE. If you have no preference, we recommend VS Code, which should be available in the computer hall.
- Started the notebook in Google Cloud
- or downloaded the notebook and uploaded it to a cloud service of your choice. If the download link does not work properly, right-click the link, select "Save link as ...", and save the .ipynb file.
More about the packages installed
We are using some hugging face packages in this lab, namely:
transformers is the main library to load and run LLMs and includes pre-trained models like GPT, BERT and T5, their tokenizers and pipelines for common LLM tasks like text generation and text summarization.
accelerate is a helper library that makes running models easier on different devices.
datasets provides access to publicly available datasets and also makes it easier to load and preprocess your own (small) datasets.
sentence_transformers directly converts sentences into embeddings.
How to pass
At the bottom of this page you will find a collapsible box with questions. To pass this lab, you should provide satisfying answers to these questions in your report. Throughout the lab you will encounter green boxes marked with the word "Task"; these boxes contain the main tasks that should be performed to complete the lab. There are five parts in total, and for each part there are questions (found at the end of this page) that should be answered in the report. The questions can be answered in a point-by-point format, but each answer should be comprehensive, with adequate motivation and explanation where necessary. Please include figures in your answers where applicable.
Part 1: Tokenization¶
Before an LLM can read anything, we have to decide what counts as a "piece" of text. This step is called tokenization. Instead of whole sentences or even whole words, the text is split into smaller units that the model understands, so-called tokens. Tokens are not the same as words: some tokens are short words like "cat", but longer or less common words may be split into several tokens. Punctuation marks and spaces are typically tokens of their own.
More about tokenization
Tokenization is one of the reasons LLMs can handle so many languages and unusual symbols. Instead of memorizing every possible word and all its inflections, the model only needs to learn a finite set of tokens, called the vocabulary, and how to combine them. If we used only complete words, the vocabulary would be very large, as English alone has hundreds of thousands of words, and the model would struggle with newly coined or misspelled words. If we split on letters, the vocabulary would be tiny but the model would struggle to capture meaning.
That's why tokenization is the first step in an LLM pipeline. Before any prediction can happen, the input text is split into tokens. The model then processes these tokens, predicts the next token, and finally the tokens are decoded back into human-readable text. Every interaction with an LLM begins with tokenization.
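To build intuition for how subword splitting works, here is a toy sketch. It is not the real GPT-2 BPE algorithm (whose vocabulary and merge rules are learned from data); the small hand-made vocabulary and the greedy longest-match rule are illustrative assumptions only.

```python
# Toy illustration only: real tokenizers such as GPT-2's BPE learn their
# vocabulary from data, but the principle is similar - split the text into
# the longest known pieces, falling back to smaller units.
VOCAB = {"bio", "logy", "logical", "cat", "dog"} | set("abcdefghijklmnopqrstuvwxyz ")

def toy_tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest matching piece first; single characters guarantee a match
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

print(toy_tokenize("biology"))     # ['bio', 'logy']
print(toy_tokenize("biological"))  # ['bio', 'logical']
print(toy_tokenize("cats"))        # ['cat', 's']
```

Note how "biology" and "biological" share the piece "bio": a finite vocabulary still covers related and even unseen word forms.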
DistilGPT2 Model
From the transformers package we import AutoTokenizer and use the tokenizer from distilgpt2, a small "distilled" version of GPT-2 trained on internet text up to 2019. By today's standards it is a very small model with "only" 82 million parameters, but it is fast and lightweight. However, it does not follow instructions well, and its outputs can be incoherent or irrelevant.
Let us now proceed to implement the tokenization.
Task 1: Explore tokenization with DistilGPT2
Run the following code to see how different texts are tokenized:
Code for tokenization
# Here we use the tokenizer from the distilgpt2 model
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

def inspect(text):
    # Convert the text to token ids
    ids = tokenizer.encode(text, add_special_tokens=False)
    # Convert the ids back to text tokens
    toks = tokenizer.convert_ids_to_tokens(ids)
    dec_pieces = [tokenizer.decode([i]) for i in ids]
    print(f"\nTEXT: {repr(text)}")
    print(f"Token IDs: {ids}")
    print(f"Tokens (raw): {toks}")
    print(f"Tokens (decoded piece-by-piece): {dec_pieces}")
    print(f"Token count: {len(ids)}")

examples = [
    "biology",
    "biologicla",
    "biological",
    "CRISPR",
    "β-catenin",
    "∑ signals",
    "🧬",
    "Hybrid modeling combines mechanistic and ML models.",
]

for x in examples:
    inspect(x)
These examples show distilgpt2's byte-level tokenization. Common words like "biology" are one token, rarer forms split into subwords ("bi" + "ological"), and acronyms like "CRISPR" break into chunks. Unicode symbols (β, ∑, 🧬) are represented as byte pieces, so the raw tokens can look strange and piece-by-piece decoding shows replacement characters, but the full sequence decodes correctly: there are effectively no unknown tokens. The Ġ prefix marks a leading space, so Ġsignals = " signals", and punctuation like "-" and "." gets tokens of its own.
Questions part 1
Explore tokenization a bit and answer the following questions:
- Add "crispr" to the examples list. Does it break into different tokens than "CRISPR"? Why could that be?
- Intentionally misspell a word, for example "biologicla" instead of "biological". How does the tokenizer handle it?
Part 2: Embeddings¶
Now that text has been split into tokens, the model can process it. But computers cannot work directly with words or symbols, they need numbers. Each token is first mapped by a lookup table to a fixed embedding: a vector of numbers learned during training. "Fixed" here means that for a given model the lookup returns the same base vector for the same token, regardless of where it appears. The model also adds positional information so it knows where each token sits in the sequence.
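The lookup step can be pictured with a toy sketch. The numbers here are made up, and the table is tiny: a real model has one row per vocabulary token and hundreds of dimensions per row, all learned during training.

```python
import random

# Made-up toy table: a real embedding table has tens of thousands of rows
# (one per vocabulary token) and hundreds of columns, learned during training.
random.seed(0)
vocab_size, dim = 5, 3
embedding_table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]

def lookup(token_ids):
    # The lookup is a plain table read: token id -> fixed base vector
    return [embedding_table[i] for i in token_ids]

vectors = lookup([2, 4, 2])
# The same token id always yields the same base vector, wherever it appears
assert vectors[0] == vectors[2]
print(len(vectors), "vectors of dimension", len(vectors[0]))
```

Only after this fixed lookup do the transformer layers adjust each vector based on its context.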
More about embeddings
These base vectors are the starting point. As the tokens pass through the transformer layers, self-attention mixes information from neighbouring tokens. After each layer, the representation of a token is updated based on its context. By the end, the token's contextualized embedding reflects both its original meaning and how surrounding words shape that meaning. The word "zoom" near "meeting" will live in a different spot than "zoom" near "camera."
Over training, the model arranges these representations so that related meanings end up closer in this high-dimensional space. "biology" and "biological" have a similar meaning, so they should end up close together, like "cat" sits nearer "dog" than "microscope." You can think of this space as a mathematical map of meaning, where distance encodes similarity.
Sometimes we want a single vector for an entire sentence or paragraph. We can pool the final token embeddings (for example by averaging or using a special [CLS] token) to get a sentence embedding. Libraries like sentence-transformers package this step for you, providing models that directly output sentence-level embeddings suited for comparison, clustering, and search.
all-MiniLM-L6-v2 Embedder
For embedding we can use all-MiniLM-L6-v2, one of the most popular pre-trained models from the sentence-transformers library. MiniLM means it is based on a compact transformer architecture designed to be lightweight and fast, L6 means it has 6 transformer layers, and v2 is simply the improved second version.
Despite being small, it produces high-quality sentence embeddings and runs quickly even on a laptop CPU. That's why it is a useful choice to use in this lab for tasks like semantic search, clustering, and text similarity.
Now let us move onwards and implement embeddings for our tokenized text.
Task 2: Create and compare embeddings
Run the following code to create embeddings for individual words:
Code for embeddings
# Load the all-MiniLM-L6-v2 embedder
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "biology",
    "biological",
    "cat",
    "cats",
    "dog",
    "microscope",
    "model",
    "nature",
    "deer",
    "pet"
]

word_embeddings = embedder.encode(sentences, normalize_embeddings=True)

for s, emb in zip(sentences, word_embeddings):
    print(f"Text: {s}")
    print(f"Embedding length: {len(emb)}")
    print(f"First 5 numbers: {emb[:5]}\n")
Here we see the first five values of each embedding. The embedding is always 384 values long, regardless of how many tokens the input contains. This is achieved through pooling, a topic we will revisit later.
Once we have embeddings for two words or texts, we need a way to measure how similar they are. One common method is cosine similarity.
Task 3: Calculate cosine similarity
The cosine similarity compares the angle between two vectors in the high dimensional space. If two embeddings point in almost the same direction, their cosine similarity will be close to 1, meaning the sentences are semantically similar. If they point in very different directions, the similarity will be closer to 0.
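For intuition, cosine similarity can be written out from scratch on small toy vectors (the lab code uses scikit-learn's cosine_similarity for the real 384-dimensional embeddings, but the formula is the same):

```python
import math

def cosine(u, v):
    # dot(u, v) / (|u| * |v|): 1 = same direction, 0 = orthogonal, -1 = opposite
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 0], [1, 0]))   # 1.0: same direction
print(cosine([1, 0], [0, 1]))   # 0.0: orthogonal
print(cosine([1, 0], [-1, 0]))  # -1.0: opposite
print(cosine([1, 1], [2, 2]))   # ~1.0: only direction matters, not length
```

The last example shows why cosine similarity is a natural fit for embeddings: it compares direction, not magnitude.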
Code for cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

sims = cosine_similarity(word_embeddings, word_embeddings)

word_sim_df = pd.DataFrame(
    sims.round(2),
    index=sentences,
    columns=sentences
)
We can plot the table (to more easily visualize the result compared to printing) by using the given custom plotting logic below.
Plotting the table
# plot the word similarity matrix as a table
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(14, 10))
ax.axis('tight')
ax.axis('off')
table = ax.table(cellText=word_sim_df.values,
rowLabels=word_sim_df.index,
colLabels=word_sim_df.columns,
cellLoc='center',
loc='center')
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1.5, 2.0) # Increased scaling for better fit
plt.title('Word Similarity Matrix', pad=20)
plt.subplots_adjust(left=0.2, right=0.95, top=0.9, bottom=0.1)
plt.tight_layout()
plt.show()
When we embed a single word, the model does not have much context to work with. In that case, the embedding will be very close to the word's fixed embedding from the lookup table that the model has learned during training. Contextual models like BERT or GPT normally adjust each token's embedding depending on the surrounding words, but with no neighbors to provide meaning, the result is almost the same as the fixed embedding for that token.
With short sentences, we can pool the contextualized token vectors into a single sentence embedding. Pooling means that we collapse the token vectors into a single vector, typically by taking the mean, but there are other ways to collapse the embeddings into a single vector.
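A minimal sketch of mean pooling, using toy 2-dimensional "token vectors" instead of the 384-dimensional contextualized vectors a real model produces:

```python
def mean_pool(token_vectors):
    # Average each dimension across all token vectors
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Three toy 2-dimensional "token vectors" collapse into one sentence vector
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(mean_pool(tokens))  # [3.0, 4.0]: same dimension as a single token vector
```

Whatever the number of tokens, the pooled result has the same dimension as one token vector, which is why every sentence embedding in this lab is 384 values long.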
Task 4: Compare sentence embeddings
Even a few content words are enough to anchor the vector in a semantic neighbourhood. Sentences about animals end up with embeddings that are close to each other, while science sentences should cluster in a different region, and unrelated topics like finance or cooking are pushed further away.
Code for sentence embeddings
animals = [
    "The cat is sleeping on the couch.",
    "A dog barked loudly in the park.",
    "Cats and dogs are common household pets.",
    "The puppy learned a new trick today."
]
science = [
    "Nonlinear systems play a crucial role in biology.",
    "Researchers analysed gene expression data.",
    "A simple model describing the signalling between glucose and insulin.",
    "We modelled the system using differential equations."
]
unrelated = [
    "The stock market fell sharply today.",
    "A car sped down the highway at night.",
    "Tourists took photos at the beach.",
    "The chef prepared a spicy curry."
]
sentences = animals + science + unrelated
sentence_embeddings = embedder.encode(sentences, normalize_embeddings=True)
If we put the results in a DataFrame, we can see the pairwise cosine similarity matrix. Each row and column is a sentence, and the score in each cell shows how similar that pair is.
Dataframe
sims = cosine_similarity(sentence_embeddings, sentence_embeddings)
sentence_sim_df = pd.DataFrame(
sims.round(2),
index=sentences,
columns=sentences
)
We can plot the table (to more easily visualize the result compared to printing) by using the given custom plotting logic below.
Plotting the table
# plot the sentence similarity matrix as a table
fig, ax = plt.subplots(figsize=(18, 12))
ax.axis('tight')
ax.axis('off')

# Create table without column labels (we'll add them manually with rotation)
table = ax.table(cellText=sentence_sim_df.values,
                 rowLabels=sentence_sim_df.index,
                 colLabels=None,  # Remove default column labels
                 cellLoc='center',
                 loc='center')
table.auto_set_font_size(False)
table.set_fontsize(8)
table.scale(1.2, 1.8)

# Force a draw to get accurate table positions
fig.canvas.draw()

# Add rotated column headers manually, using actual table cell positions
for i, col_label in enumerate(sentence_sim_df.columns):
    # Use the first data row and column i to get the x-coordinate of the cell
    cell = table[(1, i)]
    cell_bbox = cell.get_window_extent(fig.canvas.get_renderer())
    cell_center_x = (cell_bbox.x0 + cell_bbox.x1) / 2
    # Convert from display coordinates to axes coordinates
    x_pos_axes = ax.transAxes.inverted().transform([(cell_center_x, 0)])[0][0]
    ax.text(x_pos_axes, 0.72, col_label,
            rotation=45,
            ha='left',
            va='bottom',
            transform=ax.transAxes,
            fontsize=8,
            wrap=True)

plt.title('Sentence Similarity Matrix', pad=40)
plt.subplots_adjust(left=0.3, right=0.92, top=0.75, bottom=0.1)
plt.show()
Scores near 1 mean very similar, scores near 0 mean unrelated, and some can be slightly negative. Scan a row to find a sentence's closest neighbours: animal sentences should light up with other animal sentences, science with science, while unrelated ones stay low.
We can also visualize the similarities in plots, in this example we have two plots:
Task 5: Visualize embeddings
Heatmap: this plots the cosine similarity matrix as colors, so you can see at a glance which sentences are close and which are not.
PCA (2D): this compresses the high-dimensional embeddings down to two axes so you can see the structure. Sentences that land near each other are more similar. Separate groups form visible clusters. The labels show which point is which. Remember: PCA is an approximation; it works well for intuition but is not a faithful map of distances.
Let us start with the heatmap.
Visualizing using heatmap
# Heatmap
import seaborn as sns

plt.figure(figsize=(14, 10))
sns.heatmap(sims, xticklabels=sentences, yticklabels=sentences, annot=True, fmt=".2f", cmap="Blues")
plt.title("Cosine similarity between sentences")
plt.xticks(rotation=45, ha="right")
plt.yticks(rotation=0)
plt.show()
And then using PCA.
Visualizing using PCA
# PCA
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
proj = pca.fit_transform(sentence_embeddings)

plt.figure(figsize=(14, 10))
for i, txt in enumerate(sentences):
    plt.scatter(proj[i, 0], proj[i, 1])
    plt.text(proj[i, 0] + 0.01, proj[i, 1] + 0.01, txt, fontsize=9)
plt.title("Sentence embeddings projected to 2D (PCA)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
Questions part 2
Explore embeddings a bit and answer the following questions:
- Add the misspelled word to the single word list and inspect if it is close in cosine similarity to the original word
- Add a misspelled word to the sentences too. Does the cosine similarity change much? Does it move a lot on the PCA?
Part 3: Generation¶
Text generation is repeated next token prediction. After a prompt is tokenized and run through the transformer blocks, every token has a contextualized vector. To predict what comes next, the model uses only one vector, the final hidden state of the last token. Thanks to self-attention, that single vector already encodes the whole sequence. A linear layer, the LM head, maps it to one logit per vocabulary item, softmax turns those logits into probabilities, and a decoding rule (greedy, temperature, top-k/top-p) picks the next token. Append it to the prompt and the process repeats.
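The generation loop itself can be sketched without any neural network at all. Here a small hand-made probability table stands in for the model (the words and probabilities are invented), and greedy decoding picks the most probable next word at each step:

```python
# Toy stand-in for the model: "probabilities" for the next word given the
# current last word. A real LLM computes such a distribution over its whole
# vocabulary from the final hidden state.
NEXT_WORD_PROBS = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
}

def generate(prompt_words, steps):
    words = list(prompt_words)
    for _ in range(steps):
        probs = NEXT_WORD_PROBS.get(words[-1])
        if probs is None:
            break
        # Greedy decoding: always pick the most probable next word
        words.append(max(probs, key=probs.get))
    return words

print(generate(["the"], steps=3))  # ['the', 'cat', 'sat', 'down']
```

Predict, pick, append, repeat: the real pipeline below does exactly this, only with a transformer producing the probabilities and a tokenizer handling the pieces.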
Task 6: Step-by-step text generation
Below is a step by step example on how text generation works:
Code for text generation
# Load our trusty distilgpt2; you can replace it with other (small) models.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "distilgpt2"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

# Define an example prompt
prompt = "In biomedical research, hybrid modeling means"
tokens = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**tokens, output_hidden_states=True, return_dict=True)

# distilgpt2 has a hidden vector size of 768
last_hidden_vec = out.hidden_states[-1][0, -1, :]  # (hidden_dim,)
print("Hidden dim:", last_hidden_vec.shape[-1])

# Get the logits and "softmax" them into probabilities
logits = model.lm_head(last_hidden_vec)
probs = torch.softmax(logits, dim=-1)

# Show the top-10 next-token predictions with their probabilities
topk = torch.topk(probs, k=10)
top_ids = topk.indices.tolist()
top_ps = topk.values.tolist()
top_toks = [tok.decode([i]) for i in top_ids]
for t, p in zip(top_toks, top_ps):
    print(f"{repr(t):>12} p={p:.3f}")
However, we can also simply use the pipeline from the transformers library to generate text:
Code for text generation using transformers library
from transformers import pipeline

gen = pipeline("text-generation", model="distilgpt2")
generated = gen(prompt, max_new_tokens=4)[0]["generated_text"]
print(f"\nGenerated text: {generated}")
As explained briefly before, there are a few decoding rules. Here, temperature controls how random the next-token choice is by scaling the logits before softmax. In the example code, softmax(logits / T) sharpens or flattens the distribution: T = 1 leaves it unchanged; T smaller than 1, like the 0.2 in the example, makes it peaky so the top token dominates; T larger than 1 spreads probability mass across more tokens.
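The scaling itself can be sketched from scratch. The logits below are made-up numbers (a real model produces one logit per vocabulary entry); the point is how dividing by T changes the resulting distribution:

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T before softmax: T < 1 sharpens, T > 1 flattens
    scaled = [x / T for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up logits; distilgpt2 produces ~50k of these
for T in [0.2, 1.0, 1.5]:
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

At T = 0.2 almost all probability mass lands on the top entry, while at T = 1.5 it is spread more evenly, mirroring what you will see with the real model in the task below.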
Task 7: Explore temperature effects
You will now explore how the temperature affects the decoding.
Code for temperature exploration
prompt = "In biomedical research, hybrid modeling means: \n"
tokens = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**tokens, output_hidden_states=True, return_dict=True)

last_hidden_vec = out.hidden_states[-1][0, -1, :]
logits = model.lm_head(last_hidden_vec)

def show_with_temperature(T):
    probs_T = torch.softmax(logits / T, dim=-1)
    topk = torch.topk(probs_T, k=8)
    ids = topk.indices.tolist()
    toks = [tok.decode([i]) for i in ids]
    print(f"\nTemperature={T}")
    for t, p in zip(toks, topk.values.tolist()):
        print(f"{repr(t):>12} p={p:.3f}")

for T in [0.2, 1.0, 1.5]:
    show_with_temperature(T)
With low temperature, the model's output becomes basically deterministic; with higher temperature it becomes more varied but can turn incoherent. Since the model reads its own output as new input, appended tokens, good or bad choices alike, get reinforced because they change the next step's probabilities.
Questions part 3
Explore generation a bit and answer the following questions:
- For prediction, the model only uses the last token in the text. Why is this enough?
- Why does a single new token force a fresh forward pass over the sequence?
Part 4: Summarization¶
Another very useful feature of LLMs is summarization. Summarization compresses a longer text into a shorter one while keeping the important meaning. The model reads the input, builds a contextual picture of what matters, and then generates a shorter version.
Task 8: Text summarization
Let's give it a try. We can use the "summarization" pipeline from the transformers library. The model we use is distilbart. We summarize a text taken from lab 1B.
Code for summarization
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
text = """
Now, let us imagine a scenario where an individual consumes a high-carbohydrate meal.
This leads to an increase in blood glucose levels, triggering the release of insulin from the pancreas.
Initially, the insulin response may be delayed, causing a temporary mismatch between the rising glucose levels and insulin secretion.
As a result, the blood glucose levels continue to rise before they start to decline.
We can add a meal to this system by introducing a meal_influx, which adds glucose to the system over a period.
"""
summary = summarizer(text, max_length=60, min_length=25, do_sample=False)[0]["summary_text"]
print("Original:\n", text)
print("\n--- Summary (DistilBART) ---\n", summary)
Traditional search (like Ctrl+F) matches exact words. Embedding based retrieval instead encodes both queries and documents into the same vector space, by using the same embedder. By comparing vectors with cosine similarity, we can find passages that are semantically close even if they don't share the same words.
Task 9: Semantic search with embeddings
We use the same embedder as before, and provide a small list of 5 "documents" we can search in.
Code for semantic search with embeddings
embedder = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    {"id": "A1", "text": "Hybrid modeling combines mechanistic models with machine learning."},
    {"id": "A2", "text": "Mechanistic models simulate physiology and predict interventions over time."},
    {"id": "A3", "text": "k-nearest neighbors imputes missing clinical values."},
    {"id": "A4", "text": "Deep learning can segment cardiac structures from MRI images."},
    {"id": "A5", "text": "Risk scores estimate stroke probability using patient features."},
]
doc_texts = [d["text"] for d in docs]
doc_embs = embedder.encode(doc_texts, normalize_embeddings=True)
Using the function "search", the cosine similarity between the query and the embedded documents is scored. There are 5 example queries, the first three are relevant to the documents and the last two are not. The top hits and their score should reflect that.
Code for the search function
def search(query, k=3):
    q_emb = embedder.encode([query], normalize_embeddings=True)
    sims = cosine_similarity(q_emb, doc_embs)[0]
    idx = sims.argsort()[::-1][:k]
    return [(docs[i]["id"], docs[i]["text"], float(sims[i])) for i in idx]

# Examples
for q in [
    "What is hybrid modeling?",
    "How do we handle missing patient data?",
    "How are medical images analyzed?",
    "What is arachnophobia?",
    "What year was the Godfather in the cinema?"
]:
    print("\nQ:", q)
    for rid, txt, s in search(q, k=3):
        print(f" {rid} (sim={s:.3f}): {txt}")
This scales to large collections of text. If all documents are embedded once, only the vectors need to be stored; at query time we only embed the question and do a fast nearest-neighbour search. With normalized embeddings, cosine similarity reduces to a simple dot product, so ranking is computationally cheap.
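A small sketch of why normalization makes ranking cheap: for unit-length vectors the denominator |u|·|v| in the cosine formula equals 1, so cosine similarity reduces to a plain dot product. The toy 2-dimensional vectors here are made up for illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

u = normalize([3.0, 4.0])  # becomes [0.6, 0.8], length 1
v = normalize([1.0, 2.0])

# Full cosine similarity vs. plain dot product: identical for unit vectors,
# because the denominator |u| * |v| equals 1
cos = dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))
assert abs(cos - dot(u, v)) < 1e-9
print(dot(u, v))
```

This is exactly what normalize_embeddings=True buys us in the search function above.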
Questions part 4
Explore summarization and semantic search a bit and answer the following question:
- Write two queries with different wording but same meaning. Do they retrieve the same document?
Part 5: Uses of LLMs in mechanistic modeling¶
Mechanistic models introduced in the course are very powerful but picky: they need structured inputs in specific fields. A meal model would need things like food items, portions, timing, and units, which hinders uptake of these models by the general public. People write grocery lists, messages, or diary notes in free text. The gap between ordinary language and model inputs is where an LLM could help.
Below we will instruct a small LLM to extract food items from free text. We do this by providing strict instructions through a system message (the SYSTEM variable) and asking for a JSON list of lowercase food items. The model used is Qwen2.5-0.5B-Instruct, which has 0.5 billion parameters, small enough to run on a CPU in this notebook. The food_items function wraps this in an extractor: if there is a JSON list in the model's output, it is parsed; if not, an empty list is returned. The apply_chat_template call follows the instructions on the model card at Hugging Face.
Extracting foods from text this way is a crude and simple but realistic first step. The point is the workflow: we constrain the LLM to produce structured output that can then be processed further downstream.
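The parsing half of that workflow can be sketched in isolation, without the model: grab the first JSON list in the raw output and parse it, falling back to an empty list. The example strings here are invented model outputs for illustration.

```python
import json
import re

def extract_json_list(raw_output):
    # Grab the first [...] span and try to parse it as JSON, else return []
    m = re.search(r"\[[\s\S]*?\]", raw_output)
    if not m:
        return []
    try:
        return json.loads(m.group(0))
    except json.JSONDecodeError:
        return []

# Hypothetical model outputs - real output varies from run to run
print(extract_json_list('Sure! Here you go: ["pulled pork", "coleslaw"]'))  # ['pulled pork', 'coleslaw']
print(extract_json_list("I could not find any food items."))                # []
```

Defensive parsing like this matters because even an instructed LLM sometimes wraps its answer in extra chatter or produces malformed JSON.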
Food item extraction with LLM
Below you can find an implementation that extracts food items from different prompts.
import json
import re

from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

# The system message that instructs and constrains the LLM
SYSTEM = (
    "You are a food item searcher.\n"
    "Return ONLY food items (ingredients or dishes) from the input.\n"
    "- Output: JSON list of strings only, all lowercase."
)

def food_items(text):
    msgs = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f'Input: "{text}"\nOutput:'}
    ]
    x = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
    y = model.generate(x, max_new_tokens=80, do_sample=False, eos_token_id=tok.eos_token_id)
    out = tok.decode(y[0][x.shape[-1]:], skip_special_tokens=True)
    # Find the JSON list in the LLM's output
    m = re.search(r"\[[\s\S]*?\]", out)
    return json.loads(m.group(0)) if m else []

# demo
print(food_items("Shopping list BBQ night: pulled pork, coleslaw, cornbread, and pickles."))
print(food_items("I had coffee and water only."))
print(food_items("Dark chocolate cake with whipped cream."))
print(food_items("This easy tomato soup recipe is the best you'll ever make, and it's shockingly simple. It calls for just three main ingredients: butter, onion, and tomatoes."))
You will now implement a similar extraction method for a given example.
Task 10: Food mapping and similarity analysis
Now let's work with a more comprehensive example that demonstrates how LLM extraction can be combined with embeddings for food mapping. First, let us define a table with the nutritional composition of the foods.
List of nutritional composition
# List of foods with (fictional) nutritional composition
FOODS = {
    "egg": {"kcal": 78, "carb_g": 0.6, "protein_g": 6.3, "fat_g": 5.3},
    "toast": {"kcal": 75, "carb_g": 13, "protein_g": 3.0, "fat_g": 1.0},
    "pasta": {"kcal": 220, "carb_g": 43, "protein_g": 8.0, "fat_g": 1.3},
    "apple": {"kcal": 95, "carb_g": 25, "protein_g": 0.5, "fat_g": 0.3},
    "salmon": {"kcal": 233, "carb_g": 0.0, "protein_g": 25, "fat_g": 14},
    "rice": {"kcal": 206, "carb_g": 45, "protein_g": 4.3, "fat_g": 0.4},
    "yogurt": {"kcal": 150, "carb_g": 17, "protein_g": 6.0, "fat_g": 4.0},
    "banana": {"kcal": 105, "carb_g": 27, "protein_g": 1.3, "fat_g": 0.4},
}
Below you have some example texts that can be used as prompts.
Prompts with food items
texts = [
    "Shopping list BBQ night: pulled pork, coleslaw, cornbread, tuna, and pickles.",
    "Breakfast 08:15: two eggs and toast; lunch pasta; snack 16:00 an apple; dinner salmon with brown rice.",
    "Dark chocolate cake with whipped cream.",
    "Blueberry pie and a side salad after a banana.",
    "I had a pear with some yogurt before we left."
]
You will now need to extract the items from the text prompts.
Extracting food items from strings
We have already defined a function that finds foods in strings, food_items(). Using this function, you can simply iterate over the strings and extract the items.
extracted = []
for t in texts:
    extracted += food_items(t)

extracted = sorted({s.strip().lower() for s in extracted if s.strip()})
You can then check if the extracted items are defined in the nutritional table.
Check for items in table
known_names = sorted(FOODS.keys())
unknown = [x for x in extracted if x not in known_names]
# now print the known_names, extracted, and unknown items
Moving forward, we will now create embeddings:
Create embeddings
embedder = SentenceTransformer("all-MiniLM-L6-v2")
ALL = known_names + unknown
Embs = embedder.encode(ALL, normalize_embeddings=True)
And define the cosine matrix:
Calculate the cosine similarity
S = cosine_similarity(Embs, Embs) # (N x N)
Now, move on to print the data frame table of the cosine similarity - as previously done in Task 4.
Lastly, also plot the heatmap and the PCA - as previously done in Task 5.
Questions part 5
- Add 3–5 of your own lines to the list of food item prompts (strings). What did you extract from these sentences?
- What food items did you get in the known and unknown item lists?
- Create a heatmap of similarity and a PCA to 2D with labels. Do the clusters make sense?
- Discuss how this LLM-based extraction and embedding approach could be useful in mechanistic modeling workflows. What are the advantages and limitations?
Lab Conclusion: LLMs in Modeling Pipelines¶
Congratulations! You have now completed a comprehensive exploration of how Large Language Models can be integrated into modeling workflows.
What you've accomplished:
- Tokenization → Understanding how text is converted into model-readable tokens
- Embeddings → Learning how tokens become vectors that capture semantic meaning
- Generation → Exploring how LLMs predict and generate new text
- Summarization → Using LLMs to compress and extract key information
- Instruction Following → Constraining LLMs to produce structured outputs for downstream processing
This pipeline showcases how LLMs can bridge the gap between unstructured text data and the structured inputs that mechanistic models require. The combination of:
- Text extraction and normalization using instruction-following LLMs
- Semantic similarity through embeddings for data mapping and retrieval
- Structured output generation for feeding into downstream models
represents a powerful approach for making mechanistic models more accessible and applicable to real-world, messy data sources.
Questions for Lab 4B - LLMs for Modeling:¶
In your lab report, you should answer the following questions:
Tokenization
- Add "crispr" to the examples list. Does it break into different tokens than "CRISPR"? Why could that be?
- Intentionally misspell a word, for example "biologicla" instead of "biological". How does the tokenizer handle it?
Embeddings
- Add the misspelled word to the single word list and inspect if it is close in cosine similarity to the original word
- Add a misspelled word to the sentences too. Does the cosine similarity change much? Does it move a lot on the PCA?
Generation
- For prediction, the model only uses the last token in the text. Why is this enough?
- Why does a single new token force a fresh forward pass over the sequence?
Summarization
- Write two queries with different wording but same meaning. Do they retrieve the same document?
Uses of LLMs in mechanistic modeling
- Add 3–5 of your own lines to the list of food item prompts (strings). What did you extract from these sentences?
- What food items did you get in the known and unknown item lists?
- Create a heatmap of similarity and a PCA to 2D with labels. Do the clusters make sense?
- Discuss how this LLM-based extraction and embedding approach could be useful in mechanistic modeling workflows. What are the advantages and limitations?