In a recent study posted to the bioRxiv* preprint server, researchers developed a novel method to screen publicly available raw whole genome sequencing (WGS) data of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and identify mutations of interest more rapidly.

Study: Intrahost SARS-CoV-2 k-mer identification method (iSKIM) for rapid detection of mutations of concern reveals emergence of global mutation patterns. Image Credit: Tartila/Shutterstock


The current study used an intrahost SARS-CoV-2 k-mer identification method (iSKIM), which uses ‘k-mers’ for the SARS-CoV-2 variants of concern (VOCs) and variants of interest (VOIs). SARS-CoV-2 often exhibits variation within a host, and it lives and transmits as a population of variants, referred to as intrahost minor variants.

Evaluating intrahost dynamics across several million raw genome samples remains computationally challenging. In this context, counting relatively short genome sequences of length k, or ‘k-mers’, has proven to be an efficient bioinformatics approach for multiple high-throughput sequencing datasets. It avoids the time-consuming alignment and post-processing steps involved in more traditional reference-based genome assembly methods.
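The k-mer counting idea can be illustrated with a minimal Python sketch. This is not the iSKIM implementation: the function names, the 7-base k-mers, and the toy reads are all made up for illustration, and the sketch assumes exact matching on one strand only (a real screen would also need to handle reverse complements and sequencing errors).

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def mutation_allele_frequency(reads, ref_kmer, alt_kmer):
    """Estimate the fraction of matching k-mers that carry the mutation.

    Hypothetical helper: counts exact matches to a reference k-mer and a
    mutant k-mer of the same length, with no alignment step.
    Returns None when neither k-mer is observed.
    """
    k = len(ref_kmer)
    counts = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            if kmer == ref_kmer:
                counts["ref"] += 1
            elif kmer == alt_kmer:
                counts["alt"] += 1
    total = counts["ref"] + counts["alt"]
    return counts["alt"] / total if total else None


# Toy data: invented sequences, not real SARS-CoV-2 k-mers.
ref_kmer = "ACGTACG"
alt_kmer = "ACGAACG"  # single-base change at the centre position
reads = [
    "TTACGTACGTT",  # supports the reference k-mer
    "GGACGTACGAA",  # supports the reference k-mer
    "CCACGTACGCC",  # supports the reference k-mer
    "TTACGAACGTT",  # supports the mutant k-mer
]
freq = mutation_allele_frequency(reads, ref_kmer, alt_kmer)
print(freq)  # 0.25
```

Because each read is scanned once and compared only against a small set of target k-mers, the cost per sample grows with the number of reads rather than with the work of a full alignment.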

About the study

In the present study, researchers retrieved National Center for Biotechnology Information (NCBI) sequence read archive (SRA) datasets in which SARS-CoV-2 VOC/VOI-defining mutations were present as intrahost minor variants, allowing them to perform intrahost analysis at a much larger scale.

The iSKIM method rapidly scanned the raw SARS-CoV-2 sequencing reads in the SRA database for VOC/VOI mutations, covering February 2020 to April 2021, i.e., the period spanning the beginning of the pandemic to the emergence of the Delta variant. In total, the researchers screened 108 SARS-CoV-2 lineage-specific mutations.

The SRA datasets lack sufficient metadata, including details of the specific primer sets used for each run. The authors therefore relied on the ARTIC protocols in their analyses. The popular ARTIC primer sets are regularly updated, and primer sequences are excluded from the sequencing submission. The team also compared iSKIM with LoFreq and ngs_mapper, both established methods based on reference genome assembly, to confirm the accuracy of its results.

Study findings

The authors observed that 15 of 108 mutations showed a marked increase as minor intrahost variants in samples collected one to five months before fixation. Based on the study results, specific mutations appeared in the population as minor variants a few months before these mutations appeared as fixed mutations in most of the study samples.
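The pattern described above, minor variants rising before fixation, can be sketched as a per-month tally in Python. The cutoffs below (a 1% noise floor and a 50% boundary between minor and fixed) are assumptions chosen for illustration, not thresholds reported by the study, and the sample frequencies are invented.

```python
from collections import defaultdict

# Assumed cutoffs for illustration only; the study's thresholds may differ.
MINOR_MIN = 0.01  # below this, treat the mutation as absent (noise floor)
FIXED_MIN = 0.50  # at or above this, treat the mutation as fixed


def classify(freq):
    """Label one sample's mutation frequency as absent, minor, or fixed."""
    if freq >= FIXED_MIN:
        return "fixed"
    if freq >= MINOR_MIN:
        return "minor"
    return "absent"


def monthly_fractions(samples):
    """Return, per month, the fraction of samples with the mutation
    present as a minor variant and as a fixed variant.

    samples: iterable of (month, frequency) pairs for one mutation.
    """
    tallies = defaultdict(lambda: {"minor": 0, "fixed": 0, "absent": 0})
    for month, freq in samples:
        tallies[month][classify(freq)] += 1
    result = {}
    for month, tally in sorted(tallies.items()):
        n = sum(tally.values())
        result[month] = {"minor": tally["minor"] / n, "fixed": tally["fixed"] / n}
    return result


# Invented frequencies for a single hypothetical mutation.
samples = [
    ("2020-09", 0.0), ("2020-09", 0.05), ("2020-09", 0.0), ("2020-09", 0.0),
    ("2020-10", 0.10), ("2020-10", 0.20), ("2020-10", 0.0), ("2020-10", 0.0),
    ("2020-12", 0.95), ("2020-12", 0.99), ("2020-12", 0.30), ("2020-12", 1.0),
]
fractions = monthly_fractions(samples)
for month, row in fractions.items():
    print(month, row)
```

In this toy series, the minor-variant fraction climbs from 0.25 to 0.5 before any sample shows the mutation as fixed, which is the kind of lead signal the authors describe.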

Of the 15 mutations identified with this pattern, 10 statistically significant ones were located in the spike (S) protein. However, 17 mutations showed no substantial increase in the presence of minor variants despite a surge in the samples possessing these mutations as fixed variants.

Further, the authors observed that 11 of the other statistically significant mutations were located in the non-structural proteins of SARS-CoV-2. Many of these mutations may not have provided a fitness advantage to SARS-CoV-2 and may instead have been neutral mutations that emerged alongside beneficial mutations that later became fixed.

iSKIM consistently called the S L452R mutation at a slightly higher frequency in the 68 samples from September 2020, whereas LoFreq did not identify this minor intrahost mutation. In addition, the minor intrahost variant results for the S N501Y mutation in the 834 samples from October 2020 were comparable across all three methods. Moreover, most of these samples belonged to the B.1.177 lineage, while others belonged to the B.1.36.28, B.1.36.17, and B.1.221.1/B.1.221.2 lineages.

Each of these lineages was involved in multiple recombination events that occurred during the emergence of the Alpha VOC; however, none of the recombinants contained a full complement of the Alpha mutations.


iSKIM rapidly screened newly sequenced samples for SARS-CoV-2 VOC mutations or other mutations of interest. It thereby allowed the researchers to prioritize a specific set of genome samples for reference-based genome assembly and downstream analysis, enabling timely reporting when turnaround time was critical. Additionally, the iSKIM results provided a complementary view of the SARS-CoV-2 genomic data alongside typical consensus genome results.

Studies have identified rare SARS-CoV-2 mutations in genome sequences sampled from wastewater surveillance or animal reservoirs, with the latter being a source of SARS-CoV-2 variants that might spill back into humans. The iSKIM approach could help with the early detection and curation of such rare mutations and the growing list of SARS-CoV-2 mutations not seen in any of the previous SARS-CoV-2 VOCs (e.g., mutations emerging in patients with various forms of immunosuppression). Indeed, this novel method showed promise compared to the current paradigm for early detection of SARS-CoV-2 intrahost minor variants.

*Important notice

bioRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, used to guide clinical practice or health-related behavior, or treated as established information.

Journal reference: