This subdirectory describes the creation of the 2023-2024 H3N2 library, including details on the barcoded HA expression plasmid and the rationale for the strains chosen for the library. Author: Caroline Kikawa
This directory contains the analysis performed to select strains and the design of the barcoded HA constructs for the 2023-2024 H3N2 library. The most relevant steps are documented in interactive Jupyter Notebooks placed in notebooks/:
- Picking strains using the Nextstrain trees and Nextclade datasets available in November 2023 notebooks/compare_trees_design_library.ipynb
- Designing the actual HA barcoded construct sequences notebooks/design_H3_HAconstructs.ipynb
The most relevant results (HA sequences, sequence IDs) are placed in results/, specifically:
- The trimmed HA1 sequences are placed in results/2023-2024_H3_library_protein_HA1.fasta
- The HA ectodomain sequences (with the H3 transmembrane domain removed) are placed in results/2023-2024_H3_library_protein_HA_ectodomain.fasta.
- The chimeric HA protein construct sequences with upstream signal peptide fixed to WSN sequence, downstream transmembrane domain fixed to H3 consensus, and downstream C-terminal tail fixed to WSN sequence are listed in results/2023-2024_H3_library_protein_constructs.fasta
The rest of this README contains the overview (section 1) and the details of strain choice (section 2) and barcode design (section 3) involved in designing a library for seqneut
assays, as well as an analysis of library diversity using Hamming Distances (section 4).
There were several approaches to subselect pdmH1N1 and H3N2 strains for the HA variant libraries, keeping in mind an overall library size is limited to ~100 HA variants.
The first of these is Nextstrain, a web-based tool that builds subsampled phylogenetic trees for multiple time frames (e.g., 6-month, 2-year, etc.). In addition to clade and subclade levels of influenza diversity, Nextstrain also defines derived haplotypes, which are strain groupings within subclades containing additional HA1 amino acid mutations that achieve a given threshold of descendent/children strains. I selected a single representative strain from each derived haplotype by choosing a strain closest to the node of the defined haplotype sequence. The JSON format Nextstrain trees we used are placed in data/nextstrain_trees/.
Additionally, Nextclade datasets can be used to identify high-frequency haplotypes that had not been defined as derived haplotypes by Nextstrain.
I then fixed the upstream signal peptide of these sequences to WSN, the transmembrane domain to an H3 consensus, and the C-terminal tail to WSN (see section 3 for more details). These protein sequences are listed above in the Quick summary.
As stated above, we aimed to create a seasonal influenza viral variant library for sequencing-based neutralization assays, specifically for H3N2 subtype viruses within a 12-month period of November 2023. Therefore, the primary goal was to identify representative circulating strains at high density across clades and subclades (i.e., broad genetic groupings of seasonal influenza viruses).
We identified all derived haplotypes on Nextstrain season-flu 6-month and 2-year builds available at the the time of library design. These trees are subsampled by time and geographic regions.
- A
derived haplotype
is a Nextstrain-specific piece of jargon that is assigned to an HA1 haplotype (a subclade with additional amino acid mutations called on that background, for example, a J.2 virus with S145N mutation would be called 'J.2:S145N') that has achieved some threshold of child/descendent sequences over a specified timeline.
We also used Nextclade datasets available at the time, which are not subsampled. These Nextclade datasets allowed us to supplement our analysis in multiple ways:
- First, we needed the GISAID IDs for the strains we chose using the the Nextstrain trees. We do a simple string match of strain name (e.g., A/Darwin/9/2021) to find matching GISAID IDs.
- Second, we decided to add additional strains outside of the derived haplotypes defined by Nextstrain. We used a method that took haplotype record count into consideration. This analysis necessitated a non-subsampled dataset for absolute record counts.
This method uses a script written by John Huddleston in the Bedford lab, with edits from Andrea Loes and Caroline Kikawa in the Bloom lab, to identify H3 derived haploytypes. Strains that represent these derived haplotype are chosen by selecting the strain with least divergence from determined derived haplotype node.
-
Many of the next steps use csvtk which you can install with:
conda install -c conda-forge -c bioconda csvtk
-
Alternatively, build and activate the
conda
virtual environment listed in environments/nextstrain_environment.yml with:conda env create -f environments/nextstrain_environment.yml conda activate nextstrain_library_design
First, download the trees.
curl --compressed -OL https://data.nextstrain.org/flu_seasonal_h3n2_ha_6m.json
Then use the following command(s) for the 6m tree:
-
Extract relevant attributes from the tree JSON in a data frame, see original gist for more details:
python scripts/auspice_tree_to_table.py \ --tree nextstrain_trees/flu_seasonal_h3n2_ha_6m.json \ --output-metadata auspice_tables/flu_seasonal_h3n2_ha_6m.tsv \ --attributes div num_date clade_membership subclade haplotype
-
Select representative strain per haplotype with shortest branch length. Filter to samples collected in 2022 or after, sort by haplotype and cumulative divergence, and select the first record per haplotype. I ran the following command to get a per haplotype table:
csvtk filter -t -f 'num_date>2022.0' auspice_tables/flu_seasonal_h3n2_ha_6m.tsv \ | csvtk sort -t -k haplotype -k div:n \ | csvtk uniq -t -f haplotype > auspice_haplotypes/representative_strains_per_haplotype_6m.tsv
Repeat the above commands for the 2y tree. Here are the commands I used:
curl --compressed -OL https://data.nextstrain.org/flu_seasonal_h3n2_ha_6m.json
python scripts/auspice_tree_to_table.py \
--tree nextstrain_trees/flu_seasonal_h3n2_ha_2y.json \
--output-metadata auspice_tables/flu_seasonal_h3n2_ha_2y.tsv \
--attributes div num_date clade_membership subclade haplotype
csvtk filter -t -f 'num_date>2022.0' auspice_tables/flu_seasonal_h3n2_ha_2y.tsv \
| csvtk sort -t -k haplotype -k div:n \
| csvtk uniq -t -f haplotype > auspice_haplotypes/representative_strains_per_haplotype_2y.tsv
For the next steps, I needed a Nextclade metadata file data/nextclade_metadata_h3n2_2023-11-21.tsv file from John Huddleston (Bedford lab), which I don't have permissions to create, and which John generated for me with the following commands:
aws s3 cp s3://nextstrain-data-private/files/workflows/seasonal-flu/h3n2/metadata.tsv.xz .
aws s3 cp s3://nextstrain-data-private/files/workflows/seasonal-flu/h3n2/ha/sequences.fasta.xz .
Then, download the latest Nextclade dataset for H3N2 HA (A/Darwin/6/2021 reference):
nextclade dataset get -n flu_h3n2_ha --output-dir nextclade_h3n2
To filter sequences and metadata to those collected in 2022 or after and exclude egg-passaged strains:
augur filter \
--sequences sequences.fasta.xz \
--metadata metadata.tsv.xz \
--min-date 2022 \
--query "passage_category != 'egg'" \
--output-sequences filtered_sequences.fasta.xz \
--output-metadata filtered_metadata.tsv
Then, run Nextclade on the full dataset.
nextclade run \
-D nextclade_h3n2 filtered_sequences.fasta.xz \
--quiet \
--output-tsv nextclade_h3n2.tsv
Finally, we need to join Nextclade annotations and metadata. First, select columns we want from Nextclade, then extract HA1 substitutions, and join Nextclade data with metadata.
csvtk cut -t -f seqName,short_clade,subclade,aaSubstitutions nextclade_h3n2.tsv \
| csvtk mutate -t -f aaSubstitutions -n HA1_substitutions -p "^(HA1.*?),(HA2|Sig).*" \
| csvtk join -t --fields "seqName;strain" /dev/stdin filtered_metadata.tsv \
| csvtk cut -t -f -virus,-segment,-ha,-na,-is_reference > nextclade_metadata_h3n2_2023-11-21.tsv
Add a column with count of records per distinct HA1 haplotype, ignoring empty strings and ordering by count in descending order:
csvtk summary -t -f seqName -g HA1_substitutions nextclade_metadata_h3n2_2023-11-21.tsv \
| csvtk rename -t -f "seqName:count" -n "count" \
| csvtk filter2 -t -f '${HA1_substitutions}!=""' \
| csvtk sort -t -k count:rn > nextclade_haplotypes/nextclade_HA1_haplotype_counts.tsv
Annotate NextClade table with HA1 haplotype counts, select a representative sample for each haplotype, and sort by haplotype count (to top N records or haplotypes with at least N samples)
csvtk join -t -f HA1_substitutions nextclade_metadata_h3n2_2023-11-21.tsv nextclade_haplotypes/nextclade_HA1_haplotype_counts.tsv \
| csvtk uniq -t -f HA1_substitutions \
| csvtk sort -t -k count:rn \
| csvtk rename -t -f "count" -n "HA1_haplotype_count" > nextclade_haplotypes/representative_strains_per_HA1_haplotype_nextclade.tsv
First, build and activate the conda environment placed in environments/library_environment.yaml with:
conda env create -f environments/library_environment.yaml
conda activate library_design
Then, run the Jupyter Notebook analyze_library_HA_sequences.ipynb interactively to analyze all Nextstrain and Nextclade haplotypes together. That notebook is annotated with details on how we finalized the set of sequences included in the library.
The final output of library design includes key files:
- The list of library strain GISAID IDs that can be directly copy-pasted into GISAID EpiFlu search function is placed in data/library_accession_numbers.csv
- Actual GISIAD sequences are not tracked in this repository due to data sharing restrictions
- A table of strains, the method by which they were selected and associated information (GISAID ID, collection date, HA1 substitutions, haplotype counts, etc.) results/auspice_and_newHAsubs_library_strains.csv
The barcoded construct used in these experiments is similar to that described in Loes et al (2024), but with a few modifications. See below figure and legend for details.
(A) The unmodified HA expression construct, with packaging signals indicated by red underlines. (B) The modified chimeric H3 HA expression construct, with an upstream signal peptide fixed to WSN sequence, downstream H3 transmembrane domain fixed to H3 consensus sequence, and C-terminal tail fixed to WSN sequence. As in the Loes et al (2024) H1 HA construct, there is a duplicated stop codon (indicated by "**") upstream of the 16-nucleotide barcode, which is followed by the Illumina R1 sequence and duplicated packaging signal. (C) The spike-in control RNA barcoded construct, which is identical to that described in Loes et al (2024).
First, align the HA sequences of interest. We used MAFFT and placed these alignments in data/alignments/. Activate the library_design
conda environment with:
conda activate library_design
Then, create the construct sequences by running the interactive Jupyter notebook placed in notebooks/design_H3_HAconstructs.ipynb.
The final output of the above notebook are:
- A directory of results/ordersheets/ containing tables including columns
name
andsequence
required for ordering constructs from Twist - A table of library strains with columns
library_ID
(a short ID used in cloning),plasmid
(the Bloom lab plasmid log name),strain-name
,epi
(a GISAID ID associated with each strain) andexpected_seq
(the HA construct sequences, minus the barcode) placed in results/library_ID_sheet.csv - The actual HA library sequences in FASTA format:
- The trimmed HA1 sequences are placed in results/2023-2024_H3_library_protein_HA1.fasta
- The HA ectodomain sequences (with the H3 transmembrane domain removed) are placed in results/2023-2024_H3_library_protein_HA_ectodomain.fasta.
- The chimeric HA protein construct sequences with upstream signal peptide fixed to WSN sequence, downstream transmembrane domain fixed to H3 consensus, and downstream C-terminal tail fixed to WSN sequence are listed in results/2023-2024_H3_library_protein_constructs.fasta
Using the output HA1 and HA ectodomain sequences from section 3, run an interactive Jupyter Notebook that calculates pair-wise Hamming distances between all library strains. First activate the conda environment with:
conda activate library_design
Then run the interactive Jupyter notebook placed in notebooks/flu_H3_2023-2024_sequence_analysis.ipynb.