Pacbio metagenomics of air samples from Kumamoto

Testing out the new Pacbio metagenomics workflow for air samples

Published

December 15, 2025

Introduction

This notebook presents a comprehensive analysis of PacBio metagenomic sequencing data from air samples collected in Kumamoto, Japan. The study evaluates the performance of the PacBio META pipeline developed by AIRLAB for environmental DNA (eDNA) analysis in atmospheric samples.

Study Design and Objectives

We analyzed air samples collected using different sampling strategies:

  • Blank samples: Control samples to assess contamination
  • Daily samples: 24-hour collection periods
  • Weekly samples: 7-day collection periods

The primary objectives are to:

  1. Evaluate sequencing quality and taxonomic assignment success across sample types
  2. Compare biodiversity patterns between different sampling durations
  3. Assess the pipeline’s ability to capture environmental DNA from major taxonomic groups (Bacteria, Fungi, Metazoa, Viruses)
  4. Investigate temporal patterns in microbial community composition

The analysis is structured into several key sections, and employs both Python and R environments, utilizing established ecological analysis packages (vegan, phyloseq) alongside modern data science tools (pandas, plotnine) to ensure reproducible and statistically robust results.

Key Findings Preview

Initial results suggest significant differences in community composition between sampling strategies, with temporal factors (season, month) showing stronger effects than sampling duration. The pipeline demonstrates high taxonomic assignment success rates, with distinct patterns observed across major taxonomic groups.

Preamble

In this section we load the necessary libraries and set the global parameters for the notebook. It is not relevant to the narrative of the analysis, but it is essential for the reproducibility of the results. Since we are using both Python and R, the setup is a bit more involved than usual: we rely on the rpy2 package to run R code within the Python environment, and we have to make sure that data is passed correctly between the two languages.
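As a minimal sketch of that round trip (the -i/-o flags of the %%R cell magic are what move objects across; df_py and df_r are hypothetical names, not used elsewhere in this notebook):

import pandas as pd
%load_ext rpy2.ipython

df_py = pd.DataFrame({'x': [1.0, 2.0, 3.0]})  # hypothetical input frame

%%R -i df_py -o df_r
# df_py arrives as an R data.frame; objects listed with -o are copied back to Python
df_r <- transform(df_py, y = x * 2)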

Code imports
import os
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='rpy2')

# Clean environment variables that cause issues with R
# Remove bash functions that R can't parse properly
env_vars_to_remove = [
    'BASH_FUNC_ml%%', 
    'BASH_FUNC_module%%', 
    'BASH_FUNC_which%%'
]

for var in env_vars_to_remove:
    if var in os.environ:
        del os.environ[var]

# only on macOS, and only if R_HOME is not already set (otherwise rpy2 fails to start)
if os.name == 'posix' and os.uname().sysname == 'Darwin' and 'R_HOME' not in os.environ:
    os.environ["R_HOME"] = "/Library/Frameworks/R.framework/Resources"
%load_ext rpy2.ipython
%%R

# Load the packages
library(dplyr)
library(vegan)
library(ggplot2)
library(phyloseq)
library(magrittr)
import re
import numpy as np
import pandas as pd
import plotnine as p9
import geopandas as gpd

from glob import glob
from Bio import Entrez
from ete3 import NCBITaxa
from collections import defaultdict
from itables import init_notebook_mode, show
from plotnine.composition import plot_spacer
from mizani.labels import label_percent, label_scientific
Presets
# Matplotlib settings
import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats
plt.rcParams['font.family'] = 'Georgia'
plt.rcParams['svg.fonttype'] = 'none'
set_matplotlib_formats('retina')
plt.rcParams['figure.dpi'] = 300

# Plotnine settings (for figures)
p9.options.set_option('base_family', 'Georgia')

p9.theme_set(
     p9.theme_classic()

    + p9.theme(panel_grid=p9.element_blank(),
               legend_background=p9.element_blank(),
               panel_grid_major=p9.element_line(size=.5, linetype='dashed',
                                                alpha=.15, color='black'),
               plot_title=p9.element_text(ha='center'),
               dpi=300
    )
)

Entrez.email = "alejandro.fontal@isglobal.org"

Introduction

We have been working on the taxonomic analysis of air samples collected in Kumamoto, Japan, which were sequenced with PacBio and processed using the PacBio META pipeline developed in the AIRLAB. The goal of this notebook is to evaluate the results of the pipeline, to compare the biodiversity of the samples across Bacteria, Fungi and Metazoa, and to assess the ability of the pipeline to properly capture the eDNA present in the environment.

Analysis

Sequencing stats

We’ll load the metadata from our filter database and the sequencing metrics from the PacBio pipeline, then merge them to create a single dataframe with all the information needed to examine and compare the results from the blanks, the daily and the weekly samples.

pacbio_metrics = (pd.read_csv('../data/pacbio_samples_metrics.csv', sep='\t')
    .rename(columns={
        'Names': 'sample_id',
        'DNA (ng)': 'dna_yield',
        'Number_of_reads': 'n_reads',
        'Total_bases': 'total_bases',
        'Median_read_length': 'median_read_length',
        'Median_read_quality': 'median_read_quality'
    })
    [['sample_id', 'dna_yield', 'n_reads', 'total_bases', 'median_read_length', 'median_read_quality']]
    .assign(header=lambda dd: np.where(dd.sample_id.str.endswith('.1'), 'PM10', 'PM2.5'))
    .assign(sample_id=lambda dd:
            np.where(dd.sample_id.str.startswith('T'), dd['sample_id'],
                     dd['sample_id'].str[:2] + dd['sample_id'].str[2:].str.zfill(3)))
    .assign(sample_type=lambda dd:
            np.where(dd.sample_id.str.contains('B', regex=False), 'Blank', 'Sample'))
    .assign(n_reads=lambda dd: dd.n_reads.str.replace(',', '').astype(int))
    .assign(total_bases=lambda dd: dd.total_bases.str.replace(',', '').astype(int))
    .assign(median_read_length=lambda dd: dd.median_read_length.str.replace(',', '').astype(int))
    .assign(location=lambda dd: np.where(dd.sample_id.str.startswith('K'), 'Kumamoto', 'Toyama'))
)
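As a quick illustration of the sample ID normalisation above (hypothetical ID; Kumamoto-style IDs get their numeric part zero-padded to three digits, while Toyama IDs starting with 'T' pass through unchanged):

# hypothetical example of the zero-padding rule
sid = 'KS12'
print(sid[:2] + sid[2:].zfill(3))  # -> 'KS012'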

kumamoto_stats = pacbio_metrics.query('location == "Kumamoto"')
toyama_stats = (pacbio_metrics
    .query('location == "Toyama"')               
    .assign(sample_n=lambda dd: dd.sample_id.str.split('.').str[0].str[2:].str.zfill(2))
    .query('sample_n.str.len() == 2')
    .assign(sample_pre=lambda dd: dd.sample_id.str[:2].astype(str))
    .assign(sample_i=lambda dd: dd.sample_id.str.split('.').str[1].fillna(''))
    .assign(sample_i=lambda dd: np.where(dd['sample_i'].str.len() == 0, '', '.' + dd['sample_i']))
    .eval('sample_id = sample_pre + sample_n + sample_i')
    .sort_values('sample_id')
    .reset_index(drop=True)
)
toyama_metadata = pd.read_csv('../data/toyama_filters_clean.csv', 
                              parse_dates=['date_start', 'date_end'])

toyama_comps_df = (toyama_stats
 [['sample_id', 'dna_yield', 'total_bases', 'median_read_length', 'median_read_quality']]
 .assign(total_bases=lambda dd: dd.total_bases / 1e6)  # Convert to millions of bases
 .merge(toyama_metadata, how='left', on='sample_id')
)
kumamoto_metadata = (pd.read_csv('../data/kumamoto_filters.csv', 
                                 usecols=[1, 3, 4, 5, 6, 7, 8, 9])
    .rename(columns={
        'Sample ID': 'sample_id',
        'Initial ': 'date_start',
        'Final ': 'date_end',
        'Filters': 'sample_type',
        'Duration (h)': 'duration',
        'Sampled air volume (m3)': 'air_volume',
        'Weather': 'weather',
        'Remarks': 'remarks'
    })
    .assign(sample_id=lambda dd: dd['sample_id'].str[:2] + dd['sample_id'].str[2:].str.zfill(3))
    .query('sample_id.notna()')
    .assign(date_start=lambda dd: dd.date_start.ffill())
    .assign(date_end=lambda dd: dd.date_end.ffill())
    .assign(date_start=lambda dd: pd.to_datetime(dd['date_start'], format='%d/%m/%Y'))
    .assign(date_end=lambda dd: pd.to_datetime(dd['date_end'], format='%d/%m/%Y'))
    .assign(air_volume=lambda dd: 
            dd['air_volume'].astype(str).str.replace(',', '').astype(float).fillna(0))
    .assign(sample_duration=lambda dd: np.where(dd['sample_type'] == 'BLANK', 'Blanks',
                                                np.where(dd['duration'] < 24, 'Daily', 'Weekly')))
    .drop(columns=['duration', 'weather', 'remarks'])

)

kumamoto_comps_df = (kumamoto_stats
 [['sample_id', 'dna_yield', 'total_bases', 'median_read_length', 'median_read_quality']]
 .assign(total_bases=lambda dd: dd.total_bases / 1e6)  # Convert to millions of bases
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], how='left', on='sample_id')
)

If we compare the results from the blanks, the daily and the weekly samples, we observe the following:

  1. The blanks have a lower DNA yield (as expected), which also leads to a lower total base count. Their median read length and quality, however, are comparable to those of the samples, with the blanks actually showing the highest median read length. This can be explained by the fact that the blank DNA is less degraded than that of the samples, which were exposed to the environment for longer periods of time.

  2. The daily samples have a lower DNA yield than the weekly samples, which is expected since they collect roughly one seventh of the air volume (24 hours vs. 7 days). However, this doesn’t appear to affect the total base count, most likely because sequencing was performed to saturation for all samples.

a = (kumamoto_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='dna_yield')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.scale_y_log10()
  + p9.ggtitle('DNA yield (ng)')
  + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
))

b = (kumamoto_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='total_bases')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Total bases (Mbp)')
  + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
))

c = (kumamoto_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='median_read_quality')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Median read quality (Phred score)')
))

d = (kumamoto_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='median_read_length')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'),
                   alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Median read length (bp)')
))

((a | b) / 
 (c | d) 
 & p9.guides(fill=False, color=False)
 & p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
 & p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
 & p9.labs(x='', y='')
 & p9.theme(
     figure_size=(6, 4),
     plot_title=p9.element_text(size=10, ha='center'),
     plot_margin=.015
 )
)

The observed differences are statistically significant: a Kruskal–Wallis test shows differences across all four metrics (p-values < 0.001). The post-hoc Dunn’s test shows that the differences are significant between all groups, with two exceptions: median read length between the blanks and the daily samples, and median read quality between the blanks and the weekly samples:

Statistical Analysis of Sequencing Metrics

%%R -i kumamoto_comps_df -o pairwise_df

kumamoto_comps_df$sample_duration <- as.factor(kumamoto_comps_df$sample_duration)
numeric_vars <- c('dna_yield', 'total_bases', 'median_read_length', 'median_read_quality')

kruskal_results <- lapply(numeric_vars, function(var) {
      test <- kruskal.test(kumamoto_comps_df[[var]] ~ kumamoto_comps_df$sample_duration)
      data.frame(
        variable = var,
        statistic = test$statistic,
        p_value = test$p.value)
})


significant_vars <- numeric_vars[sapply(kruskal_results, function(x) x$p_value < 0.05)]

pairwise_df <- lapply(significant_vars, function(var) {
  # Run Dunn's test
  dunn_res <- FSA::dunnTest(kumamoto_comps_df[[var]] ~ kumamoto_comps_df$sample_duration,
                            method = "bh")
  
  # Extract the "res" table and add the variable name
  df <- dunn_res$res %>% dplyr::mutate(variable = var)
  return(df)
}) %>% dplyr::bind_rows()
(pairwise_df
 .replace({'variable': {
     'dna_yield': 'DNA Yield',
     'total_bases': 'Total Bases',
     'median_read_length': 'Median Read Length',
     'median_read_quality': 'Median Read Quality'}})
 [['variable', 'Comparison', 'Z', 'P.unadj', 'P.adj']]
 .style
 .hide(axis='index')
 .format(precision=4)
 # color P.adj < .05 rows in red
 .apply(lambda x: ['color: red' if v < 0.05 else '' for v in x], 
       subset=['P.adj'])
)
variable             Comparison        Z        P.unadj  P.adj
DNA Yield            Blanks - Daily   -2.5939   0.0095   0.0095
DNA Yield            Blanks - Weekly  -6.6136   0.0000   0.0000
DNA Yield            Daily - Weekly   -7.1209   0.0000   0.0000
Total Bases          Blanks - Daily   -4.6865   0.0000   0.0000
Total Bases          Blanks - Weekly  -2.2314   0.0257   0.0257
Total Bases          Daily - Weekly    3.0662   0.0022   0.0033
Median Read Length   Blanks - Daily    1.8021   0.0715   0.0715
Median Read Length   Blanks - Weekly   3.4533   0.0006   0.0017
Median Read Length   Daily - Weekly    3.0759   0.0021   0.0031
Median Read Quality  Blanks - Daily   -2.6883   0.0072   0.0215
Median Read Quality  Blanks - Weekly  -0.7613   0.4465   0.4465
Median Read Quality  Daily - Weekly    2.6092   0.0091   0.0136

If we run the same comparisons for the Toyama samples, however, the picture changes: the daily and weekly samples no longer differ significantly in DNA yield or total bases, and only the blanks stand apart:

%%R -i toyama_comps_df -o pairwise_toyama_df -o kruskal_results_header

toyama_comps_df$sample_duration <- as.factor(toyama_comps_df$sample_duration)
toyama_comps_df$sample_header <- as.factor(toyama_comps_df$sample_header)
numeric_vars <- c('dna_yield', 'total_bases', 'median_read_length', 'median_read_quality')

kruskal_results <- lapply(numeric_vars, function(var) {
      test <- kruskal.test(toyama_comps_df[[var]] ~ toyama_comps_df$sample_duration)
      data.frame(
        variable = var,
        statistic = test$statistic,
        p_value = test$p.value)
})


significant_vars <- numeric_vars[sapply(kruskal_results, function(x) x$p_value < 0.05)]

toyama_comps_samples <- toyama_comps_df %>%
  dplyr::filter(sample_type != 'Blank')

kruskal_results_header <- lapply(numeric_vars, function(var) {
      test <- kruskal.test(toyama_comps_samples[[var]] ~ toyama_comps_samples$sample_header)
      data.frame(
        variable = var,
        statistic = test$statistic,
        p_value = test$p.value)
})

pairwise_toyama_df <- lapply(significant_vars, function(var) {
  # Run Dunn's test
  dunn_res <- FSA::dunnTest(toyama_comps_df[[var]] ~ toyama_comps_df$sample_duration,
                            method = "bh")
  
  # Extract the "res" table and add the variable name
  df <- dunn_res$res %>% dplyr::mutate(variable = var)
  return(df)
}) %>% dplyr::bind_rows()
(pairwise_toyama_df
 .replace({'variable': {
     'dna_yield': 'DNA Yield',
     'total_bases': 'Total Bases',
     'median_read_length': 'Median Read Length',
     'median_read_quality': 'Median Read Quality'}})
 [['variable', 'Comparison', 'Z', 'P.unadj', 'P.adj']]
 .style
 .hide(axis='index')
 .format(precision=4)
 # color P.adj < .05 rows in red
 .apply(lambda x: ['color: red' if v < 0.05 else '' for v in x], 
       subset=['P.adj'])
)
variable             Comparison       Z        P.unadj  P.adj
DNA Yield            Blank - Daily   -3.5145   0.0004   0.0013
DNA Yield            Blank - Weekly  -3.4864   0.0005   0.0007
DNA Yield            Daily - Weekly   0.6171   0.5372   0.5372
Total Bases          Blank - Daily   -3.3732   0.0007   0.0022
Total Bases          Blank - Weekly  -2.9686   0.0030   0.0045
Total Bases          Daily - Weekly   1.1722   0.2411   0.2411
Median Read Length   Blank - Daily   -2.7546   0.0059   0.0176
Median Read Length   Blank - Weekly  -1.9343   0.0531   0.0796
Median Read Length   Daily - Weekly   1.7097   0.0873   0.0873
Median Read Quality  Blank - Daily   -0.2594   0.7953   0.7953
Median Read Quality  Blank - Weekly   1.4831   0.1380   0.2071
Median Read Quality  Daily - Weekly   2.7184   0.0066   0.0197
a = (toyama_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='dna_yield')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.scale_y_log10()
  + p9.ggtitle('DNA yield (ng)')
  + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
))

b = (toyama_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='total_bases')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Total bases (Mbp)')
  + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
))

c = (toyama_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='median_read_quality')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Median read quality (Phred score)')
))

d = (toyama_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='median_read_length')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'),
                   alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Median read length (bp)')
))

((a | b) / 
 (c | d) 
 & p9.guides(fill=False, color=False)
 & p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
 & p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
 & p9.labs(x='', y='')
 & p9.theme(
     figure_size=(6, 4),
     plot_title=p9.element_text(size=10, ha='center'),
     plot_margin=.015
 )
)

Taxonomy

We’ll begin by loading the taxonomic assignments generated by Diamond and the results from the PacBio META pipeline, then merge them into a single dataframe containing all the required information. We will then compare the results from the blanks, the daily and the weekly samples to test for differences in biodiversity between the groups.

The pipeline’s output is a fairly complex .txt file that requires parsing and cleaning to extract the needed information. For now, we will load all the different taxonomic assignments (including Bacteria, Fungi, Metazoa and Viruses) and then merge them into a single dataframe.
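For reference, each line of these files pairs a semicolon-separated taxonomic path with a read count, roughly like the following (hypothetical excerpt, inferred from the parsing code below):

# hypothetical MEGAN-style line: a semicolon-separated path, a tab, and a count
line = "NCBI;cellular organisms;Bacteria;Pseudomonadota;\t42.0"
taxonomy, count = line.split('\t')
levels = [lvl.strip() for lvl in taxonomy.rstrip(';').split(';')]
print(levels, float(count))  # -> ['NCBI', 'cellular organisms', 'Bacteria', 'Pseudomonadota'] 42.0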

Function to process MEGAN .txt files
def process_megan_txt(file_path, sample_id=None):
    # Each line pairs a semicolon-separated taxonomy path with a read count
    df = pd.read_csv(
        file_path,
        sep='\t',
        header=None,
        names=['taxonomy', 'count'],
        engine='python'
    )
    df['taxonomy'] = df['taxonomy'].str.strip().str.rstrip(';')
    df['count'] = df['count'].astype(float)
    # Split the taxonomy path into one column per level
    tax_split = df['taxonomy'].str.split(';', expand=True).apply(lambda col: col.str.strip())
    tax_split.columns = [f"level_{i+1}" for i in range(tax_split.shape[1])]
    df_expanded = pd.concat([tax_split, df['count']], axis=1)
    # Keep only the first 11 levels and sum counts of paths that collapse together
    cols_to_keep = [f"level_{i+1}" for i in range(11)] + ['count']
    df_trimmed = df_expanded[cols_to_keep]
    df_grouped = (
        df_trimmed
        .groupby([f"level_{i+1}" for i in range(11)], dropna=False, as_index=False)
        .agg({'count': 'sum'})
    )
    # Create a unique taxon string for each row
    df_grouped['taxon'] = df_grouped[[f'level_{i+1}' for i in range(11)]].fillna('').agg(';'.join, axis=1)
    df_grouped = df_grouped[['taxon', 'count']]
    if sample_id is not None:
        df_grouped = df_grouped.set_index('taxon').rename(columns={'count': sample_id})
    else:
        df_grouped = df_grouped.set_index('taxon')
    return df_grouped
blank_files = glob("../data/kumamoto/blanks/*.txt")
sample_files = glob("../data/kumamoto/samples/*.txt")
all_files = blank_files + sample_files

feature_tables = []
sample_ids = []

for file_path in all_files:
    sample_id = os.path.splitext(os.path.basename(file_path))[0]
    df_grouped = process_megan_txt(file_path, sample_id=sample_id)
    feature_tables.append(df_grouped)
    sample_ids.append(sample_id)

# we combine into a single DataFrame (taxa as rows, samples as columns)
big_table_KM = pd.concat(feature_tables, axis=1).fillna(0).astype(float)
big_table_KM = big_table_KM.T  # transposed so samples are rows, taxa are columns

big_table_KM.index.name = 'sample_id'
sample_ids = [s.split('_')[0] for s in big_table_KM.index]
big_table_KM.index = sample_ids
blank_files_TY = glob("../data/toyama/TB*.txt")
sample_files_TY = glob("../data/toyama/TS*.txt")
all_files_TY = blank_files_TY + sample_files_TY

feature_tables_TY = []
sample_ids_TY = []

for file_path in all_files_TY:
    sample_id = os.path.splitext(os.path.basename(file_path))[0].split('_')[0]
    df_grouped = process_megan_txt(file_path, sample_id=sample_id)
    feature_tables_TY.append(df_grouped)
    sample_ids_TY.append(sample_id)

# let's now combine the Toyama data
big_table_TY = pd.concat(feature_tables_TY, axis=1).fillna(0).astype(float)
big_table_TY = big_table_TY.T
big_table_TY.index.name = 'sample_id'

Aggregate alpha-diversity metrics

Since the vegan package is the de facto standard for ecological data analysis in metagenomics, we will use it to perform some basic analyses on our feature table, including the calculation of alpha and beta diversity metrics.
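For reference, with $p_i$ the relative abundance of taxon $i$ among the $S$ taxa observed in a sample, the three alpha diversity metrics computed below are (note that vegan’s "simpson" is the Gini–Simpson index, i.e. $1 - D$):

$$\text{richness} = S, \qquad H' = -\sum_{i=1}^{S} p_i \ln p_i, \qquad 1 - D = 1 - \sum_{i=1}^{S} p_i^{2}$$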

%%R -i big_table_KM -i big_table_TY -o diversity_df


richness_KM <- vegan::specnumber(big_table_KM)
shannon_KM <- vegan::diversity(big_table_KM, index = "shannon")
simpson_KM <- vegan::diversity(big_table_KM, index = "simpson")

richness_TY <- vegan::specnumber(big_table_TY)
shannon_TY <- vegan::diversity(big_table_TY, index = "shannon")
simpson_TY <- vegan::diversity(big_table_TY, index = "simpson")

diversity_KM <- data.frame(
    sample_id = rownames(big_table_KM),
    sample_type = ifelse(grepl('B', rownames(big_table_KM)), 'Blanks', 'Samples'),
    richness = richness_KM,
    shannon = shannon_KM,
    simpson = simpson_KM
)

diversity_TY <- data.frame(
    sample_id = rownames(big_table_TY),
    sample_type = ifelse(grepl('B', rownames(big_table_TY)), 'Blanks', 'Samples'),
    richness = richness_TY,
    shannon = shannon_TY,
    simpson = simpson_TY
)

diversity_df <- rbind(diversity_KM, diversity_TY)
diversity_df$sample_type = factor(diversity_df$sample_type, levels=c('Samples', 'Blanks'))

To start, we will compare the alpha diversity metrics between blanks and samples, treating each assigned taxon path at every taxonomic level as a unique feature (so, for instance, 'NCBI;cellular organisms;Bacteria' and 'NCBI;cellular organisms;Bacteria;Pseudomonadota' count as two distinct features). If we do so:

%%R -o diversity_tests_df 

# We run a wilcoxon test for each metric as read counts are not normally distributed
richness_test_KM <- wilcox.test(richness ~ sample_type, data = diversity_KM)
shannon_test_KM <- wilcox.test(shannon ~ sample_type, data = diversity_KM)
simpson_test_KM <- wilcox.test(simpson ~ sample_type, data = diversity_KM)
richness_test_TY <- wilcox.test(richness ~ sample_type, data = diversity_TY)
shannon_test_TY <- wilcox.test(shannon ~ sample_type, data = diversity_TY)
simpson_test_TY <- wilcox.test(simpson ~ sample_type, data = diversity_TY)

diversity_tests_df <- data.frame(
  metric    = c("Richness", "Shannon", "Simpson", "Richness", "Shannon", "Simpson"),
  statistic = c(richness_test_KM$statistic, shannon_test_KM$statistic, simpson_test_KM$statistic,
                richness_test_TY$statistic, shannon_test_TY$statistic, simpson_test_TY$statistic),
  p_value   = c(richness_test_KM$p.value, shannon_test_KM$p.value, simpson_test_KM$p.value,
                richness_test_TY$p.value, shannon_test_TY$p.value, simpson_test_TY$p.value),
  location  = c(rep("Kumamoto", 3), rep("Toyama", 3))
)

The results differ by location: in Kumamoto, the samples are, as expected, more diverse than the blanks, with a significant difference in all three metrics (richness, Shannon, and Simpson). In Toyama, however, none of the metrics shows a significant difference between blanks and samples:

diversity_melt = (diversity_df
.melt(
    id_vars=['sample_id', 'sample_type'],
    value_vars=['richness', 'shannon', 'simpson'],
    var_name='metric',
    value_name='value'
)
.assign(metric=lambda dd: dd['metric'].str.capitalize())
.assign(location=lambda dd: np.where(dd['sample_id'].str.startswith('K'), 'Kumamoto', 'Toyama'))
.merge(diversity_tests_df, how='left', on=['metric', 'location'])
)

(p9.ggplot(diversity_melt, p9.aes(x='sample_type', y='value', fill='sample_type'))
    + p9.geom_boxplot(outlier_alpha=0, alpha=0.5, width=.4)
    + p9.geom_jitter(width=0.2, alpha=0.7, size=2, stroke=0)
    + p9.facet_grid('metric', 'location', scales='free_y')  # metric is already capitalized above
    + p9.geom_text(
        p9.aes(x=1.5, y=0,
               label=f'"$p=$" + p_value.round(4).astype(str) + sig'),
        data=lambda dd: dd
            .groupby(['metric', 'location'], as_index=False)
            .agg({'value': 'max', 'p_value': 'first'})
            .assign(sig=lambda dd: np.where(dd['p_value'] < 0.05, '*', '')), 
        ha='center',
        va='bottom',
        size=8,
        inherit_aes=False

    )
    + p9.scale_fill_manual(values=["#c16514",  "#5198ca"])
    + p9.labs(x='', y='', title='',
              caption='*$p$-values from Wilcoxon rank sum tests')
    + p9.theme(
        figure_size=(6.5, 5),
        # strip_background=p9.element_blank(),
        panel_spacing=.02,
        legend_position='none'
    )
)

Taxonomy pipeline stats

Long abundance table code
# NOTE: this redefinition keeps the full taxonomy path string as the feature
# (no trimming to 11 levels and no grouping, unlike the earlier version)
def process_megan_txt(file_path, sample_id=None):
    df = pd.read_csv(
        file_path,
        sep='\t',
        header=None,
        names=['taxonomy', 'count'],
        engine='python'
    )
    df['taxonomy'] = df['taxonomy'].str.strip().str.rstrip(';')
    df['count'] = df['count'].astype(float)
    if sample_id is not None:
        df = df.set_index('taxonomy').rename(columns={'count': sample_id})
    else:
        df = df.set_index('taxonomy')
    return df

# Gather all sample files
blank_files = glob("../data/kumamoto/blanks/*.txt")
sample_files = glob("../data/kumamoto/samples/*.txt")
all_files = blank_files + sample_files

# Build the abundance table
feature_tables = []
sample_ids = []

for file_path in all_files:
    sample_id = os.path.splitext(os.path.basename(file_path))[0]
    df_grouped = process_megan_txt(file_path, sample_id=sample_id)
    feature_tables.append(df_grouped)
    sample_ids.append(sample_id)

# Merge all samples (outer join, fill missing with 0)
abundance_table = pd.concat(feature_tables, axis=1).fillna(0).T
abundance_table.index.name = "sample_id"
abundance_table.index = [s.split('_')[0] for s in abundance_table.index]

tax_split = abundance_table.columns.str.split(';', expand=True)
tax_levels = [f'level_{i+1}' for i in range(tax_split.to_frame().shape[1])]

taxonomy_names_df = (tax_split
 .to_frame()
 .reset_index(drop=True)
 .rename(columns=dict((i, f'level_{i+1}') for i in range(tax_split.to_frame().shape[1])))
 .map(lambda x: x.lstrip() if isinstance(x, str) else x)
)

def get_complete_taxonomy_lineage(row, ncbi):
    """
    Extract the most specific taxon from a row and get the COMPLETE lineage
    preserving all NCBI taxonomic ranks, not just the standard 7.
    """
    
    # Get all non-null values from the row (excluding non-taxonomic columns)
    taxa = [val for val in row if pd.notna(val) and val != '' and val != 'NCBI']
    
    if not taxa:
        return {}
    
    # Start from most specific (rightmost) taxon
    for taxon in reversed(taxa):
        try:
            # Get taxon ID
            name_dict = ncbi.get_name_translator([taxon])
            if taxon in name_dict:
                taxid = name_dict[taxon][0]
                
                # Get COMPLETE lineage (all taxonomic IDs)
                lineage = ncbi.get_lineage(taxid)
                
                # Get ranks and names for all lineage members
                ranks = ncbi.get_rank(lineage)
                names = ncbi.get_taxid_translator(lineage)
                
                # Create complete taxonomy dictionary with ALL ranks
                result = {}
                
                # Add all ranks found in the lineage
                for tid in lineage:
                    rank = ranks[tid]
                    name = names[tid]
                    
                    # Skip 'no rank' entries but keep everything else
                    if rank != 'no rank' and rank != '':
                        result[rank] = name
                
                return result
                
        except Exception as e:
            print(f"Error processing {taxon}: {e}")
            continue
    
    return {}

def standardize_taxonomy_dataframe(df, ncbi=None):
    """
    Process a taxonomy dataframe and return a standardized version with
    all available taxonomic ranks preserved.
    """
    
    if ncbi is None:
        print("Initializing NCBI taxonomy database...")
        ncbi = NCBITaxa()
    
    
    # Apply the taxonomy extraction to each row
    taxonomy_results = []
    
    for _, row in df.iterrows():        
        result = get_complete_taxonomy_lineage(row, ncbi)
        taxonomy_results.append(result)
    
    # Convert to DataFrame
    df_standardized = pd.DataFrame(taxonomy_results)
    
    # Reorder columns by typical taxonomic hierarchy if they exist
    preferred_order = [
        'acellular root', 'cellular root', 'domain', 'realm', 'clade',
        'superkingdom', 'kingdom', 'subkingdom',
        'superphylum', 'phylum', 'subphylum', 'infraphylum',
        'superclass', 'class', 'subclass', 'infraclass',
        'cohort', 'subcohort', 'superorder', 'order', 'suborder', 'infraorder',
        'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe',
        'genus', 'subgenus', 'species group', 'species subgroup', 
        'species', 'subspecies', 'strain', 'varietas', 'forma'
    ]
    
    # Reorder columns based on preferred hierarchy
    existing_cols = [col for col in preferred_order if col in df_standardized.columns]
    other_cols = [col for col in df_standardized.columns if col not in preferred_order]
    new_column_order = existing_cols + other_cols
    
    df_standardized = df_standardized[new_column_order]
    
    
    return df_standardized

def get_taxon_ids_for_best_taxonomy(df_standardized, ncbi=None):
    """
    Get NCBI taxon IDs for the most specific taxonomic rank available in each row.
    
    Parameters:
    df_standardized: DataFrame with one column per standardized taxonomic rank
    ncbi: NCBITaxa object (optional, will create if not provided)
    
    Returns:
    DataFrame with added best_taxonomy_taxid/name/level and taxid_status columns
    """
    
    if ncbi is None:
        print("Initializing NCBI taxonomy database...")
        ncbi = NCBITaxa()
    
    df_result = df_standardized.copy()
    
    # Add columns for taxon IDs
    df_result['best_taxonomy_taxid'] = np.nan
    df_result['best_taxonomy_name'] = ''
    df_result['best_taxonomy_level'] = ''
    df_result['taxid_status'] = ''  # Track success/failure
    
    print("Getting taxon IDs for most specific taxonomic ranks...")
    
    # Define hierarchy from most specific to least specific
    hierarchy = [
        'subspecies', 'strain', 'varietas', 'forma',  # Below species
        'species',                           # Species level
        'species subgroup', 'species group', # Between genus and species
        'genus', 'subgenus',                # Genus level
        'tribe', 'subtribe',                # Above genus
        'subfamily', 'family',              # Family level
        'order', 'class', 'phylum',       # Higher levels
        'realm', 'clade', 'domain', 'subkingdom', 'superkingdom', 'kingdom',
        'cellular root', 'acellular root'  # NCBI ranks

    ]
    
    # Statistics tracking
    total_rows = len(df_result)
    success_count = 0
    failed_names = []
    
    for idx, row in df_result.iterrows():
    
        # Find the most specific taxonomic level available
        best_taxon_name = None
        best_level = None
        
        for level in hierarchy:
            if level in df_result.columns and pd.notna(row[level]) and row[level] != '':
                best_taxon_name = row[level]
                best_level = level
                break
        
        if best_taxon_name is None:
            df_result.loc[idx, 'taxid_status'] = 'no_taxonomy_data'
            continue
        
        # Try to get taxon ID from NCBI
        try:
            # Clean the name (remove extra whitespace)
            clean_name = str(best_taxon_name).strip()
            
            # Get taxon ID using ete3
            name_dict = ncbi.get_name_translator([clean_name])
            
            if clean_name in name_dict and len(name_dict[clean_name]) > 0:
                taxid = name_dict[clean_name][0]  # Take first match
                
                # Store results
                df_result.loc[idx, 'best_taxonomy_taxid'] = taxid
                df_result.loc[idx, 'best_taxonomy_name'] = clean_name
                df_result.loc[idx, 'best_taxonomy_level'] = best_level
                df_result.loc[idx, 'taxid_status'] = 'success'
                success_count += 1
                
            else:
                df_result.loc[idx, 'taxid_status'] = 'name_not_found'
                failed_names.append(clean_name)
                
        except Exception as e:
            df_result.loc[idx, 'taxid_status'] = f'error: {str(e)}'
            failed_names.append(best_taxon_name)
    
    # Print summary statistics
    print(f"\n=== TAXON ID RETRIEVAL SUMMARY ===")
    print(f"Total entries processed: {total_rows}")
    print(f"Successfully retrieved taxon IDs: {success_count}")
    print(f"Success rate: {(success_count/total_rows)*100:.1f}%")
    
    # Show status breakdown
    status_counts = df_result['taxid_status'].value_counts()
    print(f"\nStatus breakdown:")
    for status, count in status_counts.items():
        print(f"  {status}: {count}")
    
    # Show most common failed names
    if failed_names:
        print(f"\nTop 10 most common failed lookups:")
        failed_series = pd.Series(failed_names)
        for name, count in failed_series.value_counts().head(10).items():
            print(f"  {name}: {count}")
    
    return df_result
ncbi = NCBITaxa()
taxonomy_standardized = standardize_taxonomy_dataframe(taxonomy_names_df, ncbi)
taxonomy_with_ids = get_taxon_ids_for_best_taxonomy(taxonomy_standardized, ncbi)
Getting taxon IDs for most specific taxonomic ranks...

=== TAXON ID RETRIEVAL SUMMARY ===
Total entries processed: 14391
Successfully retrieved taxon IDs: 14388
Success rate: 100.0%

Status breakdown:
  success: 14388
  no_taxonomy_data: 3
Define and validate species name standardization functions
def standardize_species_names(df):
    """
    Standardize species names across the entire taxonomy dataframe
    """
    def fix_species(row):
        # If species already exists and looks good, keep it
        if pd.notna(row['species']) and row['species'] != '' and 'unclassified' not in str(row['species']).lower():
            return row['species']
            return row['species']
        
        # Only try to fix if best_taxonomy_level is at genus level or below
        genus_and_below_levels = ['genus', 'subgenus', 'species group', 'species subgroup', 
                                 'species', 'subspecies', 'strain', 'varietas', 'forma']
        
        if row['best_taxonomy_level'] not in genus_and_below_levels:
            return "Unclassified"
        
        # Try to extract from best_taxonomy_name
        if pd.notna(row['best_taxonomy_name']):
            name = str(row['best_taxonomy_name']).strip()
            
            # Pattern for proper binomial nomenclature
            binomial_patterns = [
                r'^([A-Z][a-z]+)\s+([a-z]+)$',  # Simple binomial
                r'^([A-Z][a-z]+)\s+([a-z]+)\s+([a-z]+)$',  # Trinomial (subspecies)
                r'^([A-Z][a-z]+)\s+(sp\.)\s+([A-Za-z0-9_-]+)$',  # Genus sp. strain
            ]
            
            for pattern in binomial_patterns:
                match = re.match(pattern, name)
                if match:
                    genus_from_name = match.group(1)
                    species_epithet = match.group(2)
                    
                    # Skip non-specific epithets that aren't real species names
                    non_specific_epithets = ['group', 'clade', 'complex', 'cluster', 
                                           'assemblage', 'section', 'series']
                    if species_epithet.lower() in non_specific_epithets:
                        return "Unclassified"
                    
                    # Verify genus consistency
                    if pd.notna(row['genus']) and row['genus'] != '':
                        if genus_from_name == row['genus']:
                            if species_epithet == 'sp.':
                                return f"{genus_from_name} {species_epithet} {match.group(3)}"
                            else:
                                return f"{genus_from_name} {species_epithet}"
                    else:
                        # Use what we found
                        if species_epithet == 'sp.':
                            return f"{genus_from_name} {species_epithet} {match.group(3)}"
                        else:
                            return f"{genus_from_name} {species_epithet}"
        
        return "Unclassified"
    
    return df.assign(species=df.apply(fix_species, axis=1))

def validate_species_extraction(df):
    """
    Check how many species were recovered vs missing
    """
    before_fix = df['species'].notna().sum()
    df_fixed = standardize_species_names(df)
    after_fix = (df_fixed['species'] != 'Unclassified').sum()
    
    print(f"Species names before fix: {before_fix}")
    print(f"Species names after fix: {after_fix}")
    print(f"Improvement: {after_fix - before_fix} additional species recovered")
    
    # Show examples of what was fixed
    fixed_examples = (df_fixed[(df['species'].isna() | (df['species'] == '')) & 
           (df_fixed['species'] != 'Unclassified')])

    for _, row in fixed_examples.iterrows():
        print(f"{row['best_taxonomy_level']}: {row['best_taxonomy_name']} -> Fixed Species: {row['species']}")

validate_species_extraction(taxonomy_with_ids)
Species names before fix: 10840
Species names after fix: 10868
Improvement: 28 additional species recovered
species subgroup: Stutzerimonas stutzeri subgroup -> Fixed Species: Stutzerimonas stutzeri
species group: Burkholderia cepacia complex -> Fixed Species: Burkholderia cepacia
species group: Alternaria alternata complex -> Fixed Species: Alternaria alternata
species group: Pseudomonas aeruginosa group -> Fixed Species: Pseudomonas aeruginosa
species group: Pseudomonas fluorescens group -> Fixed Species: Pseudomonas fluorescens
species group: Pseudomonas syringae group -> Fixed Species: Pseudomonas syringae
species subgroup: Stutzerimonas stutzeri subgroup -> Fixed Species: Stutzerimonas stutzeri
species group: Agrobacterium tumefaciens complex -> Fixed Species: Agrobacterium tumefaciens
species group: Stutzerimonas stutzeri group -> Fixed Species: Stutzerimonas stutzeri
species group: Enterobacter cloacae complex -> Fixed Species: Enterobacter cloacae
species group: Streptococcus anginosus group -> Fixed Species: Streptococcus anginosus
species group: Citrobacter freundii complex -> Fixed Species: Citrobacter freundii
species group: Pseudomonas putida group -> Fixed Species: Pseudomonas putida
species group: Bacillus subtilis group -> Fixed Species: Bacillus subtilis
species subgroup: Bacillus amyloliquefaciens group -> Fixed Species: Bacillus amyloliquefaciens
species group: Bacillus cereus group -> Fixed Species: Bacillus cereus
species group: Rhodococcus erythropolis group -> Fixed Species: Rhodococcus erythropolis
species group: Enterobacter cloacae complex -> Fixed Species: Enterobacter cloacae
species group: Mycobacterium simiae complex -> Fixed Species: Mycobacterium simiae
species group: Amycolatopsis methanolica group -> Fixed Species: Amycolatopsis methanolica
species group: Pseudomonas aeruginosa group -> Fixed Species: Pseudomonas aeruginosa
species group: Vibrio harveyi group -> Fixed Species: Vibrio harveyi
species group: Mycobacterium tuberculosis complex -> Fixed Species: Mycobacterium tuberculosis
species group: Pseudomonas chlororaphis group -> Fixed Species: Pseudomonas chlororaphis
species group: Prauserella coralliicola group -> Fixed Species: Prauserella coralliicola
species group: Stenotrophomonas maltophilia group -> Fixed Species: Stenotrophomonas maltophilia
species group: Streptomyces violaceusniger group -> Fixed Species: Streptomyces violaceusniger
species group: Streptococcus dysgalactiae group -> Fixed Species: Streptococcus dysgalactiae
taxonomy_with_ids.query('taxid_status == "no_taxonomy_data"').index
Index([0, 7258, 13054], dtype='int64')
taxonomy_names_df.iloc[0].dropna()
level_1    NCBI
Name: 0, dtype: object

First entry has no lineage at all (NCBI root), so we set its taxid to 0

taxonomy_with_ids.loc[0, 'best_taxonomy_taxid'] = 0
taxonomy_with_ids.loc[0, 'best_taxonomy_name'] = 'NCBI root'
taxonomy_with_ids.loc[0, 'best_taxonomy_level'] = 'root'

The second entry belongs to unclassified entries -> unclassified sequences -> environmental samples; a search in NCBI’s Taxonomy Browser shows that its taxid is 151659:

taxonomy_names_df.iloc[7258].dropna()
level_1                                                 NCBI
level_2                                 unclassified entries
level_3                               unclassified sequences
level_4    environmental samples <unclassified sequences,...
Name: 7258, dtype: object
taxonomy_with_ids.loc[7258, 'best_taxonomy_taxid'] = 151659
taxonomy_with_ids.loc[7258, 'best_taxonomy_name'] = 'Environmental samples'
taxonomy_with_ids.loc[7258, 'best_taxonomy_level'] = 'root'
taxonomy_names_df.iloc[13054].dropna()
level_1                    NCBI
level_2           other entries
level_3         other sequences
level_4    artificial sequences
Name: 13054, dtype: object

The third and last entry belongs to other entries -> other sequences -> artificial sequences, whose taxid is 81077:

taxonomy_with_ids.loc[13054, 'best_taxonomy_taxid'] = 81077
taxonomy_with_ids.loc[13054, 'best_taxonomy_name'] = 'Artificial sequences'
taxonomy_with_ids.loc[13054, 'best_taxonomy_level'] = 'root'

taxonomy_with_ids = (taxonomy_with_ids
                     .assign(best_taxonomy_taxid=lambda dd: dd.best_taxonomy_taxid.astype(int))
)

This taxonomy ID mapping will allow us to keep a single column with the taxonomic IDs, which we can later use for further analyses without carrying the full taxonomic lineage around.

abundance_tax_df = (abundance_table
 .T
 .reset_index(drop=True)
 .merge(taxonomy_with_ids[['best_taxonomy_taxid']],
         left_index=True, right_index=True)
)

abundance_tax_long = (abundance_tax_df
 .melt(id_vars=['best_taxonomy_taxid'],
        var_name='sample_id', value_name='counts')
 .replace({'NCBI': 'root'})
 .query('counts > 0')
 .assign(sample_id=lambda dd: dd.sample_id.str.split('_').str[0])
 .assign(sample_type=lambda dd: np.where(dd.sample_id.str.contains('B', regex=False), 'Blank', 'Sample'))
)
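If we later need human-readable names again, taxids can be translated back with ete3 (a quick sketch; 28214 is the 'Sphingomonas sp.' taxid that appears below):

# translate taxids back to scientific names, or expand one into its full lineage
ncbi.get_taxid_translator([28214])  # -> {28214: 'Sphingomonas sp.'}
lineage = ncbi.get_lineage(28214)   # taxids from the root down to the query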
sample_taxids = abundance_tax_long.query('sample_type != "Blank"').best_taxonomy_taxid.unique()
taxonomy_standardized.iloc[2329].dropna()
cellular root     cellular organisms
domain                      Bacteria
kingdom               Pseudomonadati
phylum                Pseudomonadota
class            Alphaproteobacteria
order               Sphingomonadales
family             Sphingomonadaceae
genus                   Sphingomonas
Name: 2329, dtype: object
taxonomy_names_df.iloc[1886]
level_1                          NCBI
level_2            cellular organisms
level_3                      Bacteria
level_4                Proteobacteria
level_5           Alphaproteobacteria
level_6              Sphingomonadales
level_7             Sphingomonadaceae
level_8                  Sphingomonas
level_9     unclassified Sphingomonas
level_10        Sphingomonas sp. YJ09
level_11 … level_37               NaN
Name: 1886, dtype: object
ncbi.get_name_translator(['Sphingomonas sp.'])
{'Sphingomonas sp.': [28214]}
taxonomy_with_ids.loc[lambda dd: dd.duplicated()].query('best_taxonomy_name=="Sphingomonas"')
All 23 duplicated rows (index 82, 138, 192, 1886, 2296, 2329, 2994, 3105, 3546, 3659, 5433, 5543, 6253, 6717, 7011, 7189, 8092, 8172, 8823, 11244, 11472, 11528, 12159) share the same lineage (cellular organisms; Bacteria; Pseudomonadati; Pseudomonadota; …) and resolve to taxid 13687, Sphingomonas, genus, success.

23 rows × 45 columns

(abundance_tax_long
 .groupby(['sample_id', 'sample_type'], as_index=False)
 ['counts']
 .sum()
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], how='left', on='sample_id')
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='reorder(sample_id, counts)', y='counts')
       + p9.geom_col(p9.aes(fill='sample_duration'))
       + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_y_continuous(expand=(0, 0.1), labels=lambda x: [f"{v/1e6:.0f}M" for v in x])
       + p9.coord_flip()
       + p9.labs(x='', y='Reads', fill='', title='Total assigned reads per sample')
       + p9.theme(
           figure_size=(5, 7),
           axis_text_y=p9.element_text(size=4, family='monospace'),
           axis_text_x=p9.element_text(size=8, family='Georgia'),
           legend_position=(0.8, 0.2)
       )
       )
 )

If we take a look at the best rank of the taxonomic assignments:

(taxonomy_with_ids
 .query('best_taxonomy_taxid.isin(@sample_taxids)')
 .drop_duplicates()
 .best_taxonomy_level
 .value_counts()
 .head(20)
)
best_taxonomy_level
species             9643
genus               1291
strain               823
family               458
order                287
class                172
phylum               102
subspecies            94
subfamily             52
species group         42
clade                 30
tribe                 27
species subgroup      13
varietas               8
domain                 5
forma                  3
root                   3
realm                  3
cellular root          1
acellular root         1
Name: count, dtype: int64

Community composition
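Both the PERMANOVA below and the ordinations that follow operate on Bray–Curtis dissimilarities; for two samples $j$ and $k$ with counts $x_{ij}$ over taxa $i$:

$$BC_{jk} = \frac{\sum_i \lvert x_{ij} - x_{ik} \rvert}{\sum_i (x_{ij} + x_{ik})}$$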

%%R -i big_table -o permanova_blanks_results

sample_type <- ifelse(grepl("^KB", rownames(big_table)), "Blank", "Sample")
dist_matrix <- vegan::vegdist(big_table, method = "bray")
permanova_blanks_results <- vegan::adonis2(dist_matrix ~ sample_type, permutations = 999)

print(permanova_blanks_results)
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = dist_matrix ~ sample_type, permutations = 999)
          Df SumOfSqs      R2      F Pr(>F)    
Model      1    1.485 0.04417 5.5915  0.001 ***
Residual 121   32.146 0.95583                  
Total    122   33.631 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Ordination Analysis
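NMDS searches for a low-dimensional configuration that minimises Kruskal’s stress, the mismatch between the ordination distances $\hat{d}_{ij}$ and the (monotonically transformed) observed dissimilarities $d_{ij}$:

$$\mathrm{stress} = \sqrt{\frac{\sum_{i<j} \left(d_{ij} - \hat{d}_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}}$$

As a common rule of thumb, stress below ~0.2 is usable and below ~0.1 is good; the runs below settle around 0.14–0.17.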

%%R -i big_table  -o nmds_blanks_df

dist_matrix_blanks <- vegan::vegdist(big_table, method = "bray")
nmds_coords_blanks <- vegan::metaMDS(dist_matrix_blanks, k = 2, trymax = 100, verbose = FALSE)
nmds_blanks_df <- data.frame(
    NMDS1 = nmds_coords_blanks$points[, 1],
    NMDS2 = nmds_coords_blanks$points[, 2],
    sample_id = rownames(big_table),
    sample_type = ifelse(grepl("^KB", rownames(big_table)), "Blank", "Sample")
)
Run 0 stress 0.1735855 
[... 100 random restarts; new best solutions at runs 11, 33, 46, 86, 88 and 94, ending at stress 0.1658331 ...]
*** Best solution was not repeated -- monoMDS stopping criteria:
     1: no. of iterations >= maxit
    99: stress ratio > sratmax
Big table wrangling with metadata
big_table.index = big_table.index.str.split('_').str[0]

relevant_kumamoto_metadata = (big_table
 .assign(sample_id=lambda dd: dd.index.str.split('_').str[0])
 [['sample_id']]
 .merge(kumamoto_metadata, on='sample_id', how='left')
)

big_table_no_blanks = big_table.query('index.str.contains("B", regex=False) == False')

kumamoto_metadata_no_blanks = (kumamoto_metadata
                               .query('sample_id.str.contains("B", regex=False) == False')
                               .query('sample_id in @big_table_no_blanks.index.str.split("_").str[0]')
 .assign(date=lambda dd: pd.to_datetime(dd['date_start'] + np.where(dd['sample_duration'] == 'Daily',
                                                   pd.Timedelta(days=0),
                                                   pd.Timedelta(days=3))))
 .assign(month_name=lambda dd: dd['date'].dt.strftime('%B'))
 .assign(month=lambda dd: dd['date'].dt.month)
 .assign(year=lambda dd: dd['date'].dt.year)
 .assign(season=lambda dd: np.select(
        [dd['date'].dt.dayofyear < 80,  # Winter
         dd['date'].dt.dayofyear < 172,  # Spring
         dd['date'].dt.dayofyear < 264,  # Summer
         dd['date'].dt.dayofyear < 355],  # Autumn
        ['Winter', 'Spring', 'Summer', 'Autumn'],
        default='Winter' # Default to Winter for days after December 21
    ))
)
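Since this day-of-year → season mapping is reused verbatim for the calendar scaffold in the timeline section below, a small helper would keep the two code paths in sync. A sketch using the same approximate equinox/solstice cutoffs (the function name is mine):

# Helper sketch: map dates to seasons via approximate equinox/solstice
# day-of-year cutoffs (80, 172, 264, 355), as in the cells above and below
def assign_season(dates: pd.Series) -> np.ndarray:
    doy = dates.dt.dayofyear
    return np.select(
        [doy < 80, doy < 172, doy < 264, doy < 355],
        ['Winter', 'Spring', 'Summer', 'Autumn'],
        default='Winter')  # days after ~Dec 21 wrap back to Winter

# usage: .assign(season=lambda dd: assign_season(dd['date']))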
%%R -i big_table_no_blanks -i kumamoto_metadata_no_blanks -o permanova_duration_results -o nmds_no_blanks_df -o pcoa_no_blanks_df -o pcoa_variance_explained

sample_duration <- kumamoto_metadata_no_blanks$sample_duration[match(rownames(big_table_no_blanks), kumamoto_metadata_no_blanks$sample_id)]
dist_matrix_no_blanks <- vegan::vegdist(big_table_no_blanks, method = "bray")
permanova_duration_results <- vegan::adonis2(dist_matrix_no_blanks ~ sample_duration, permutations = 999)
nmds_coords_no_blanks <- vegan::metaMDS(dist_matrix_no_blanks, k = 2, trymax = 100, trace = 0)  # metaMDS silences its run log via `trace`, not `verbose`

pcoa_coords_no_blanks <- ape::pcoa(dist_matrix_no_blanks)
pcoa_variance_explained <- pcoa_coords_no_blanks$values$Relative_eig[1:2] * 100


nmds_no_blanks_df <- data.frame(
    NMDS1 = nmds_coords_no_blanks$points[, 1],
    NMDS2 = nmds_coords_no_blanks$points[, 2],
    sample_id = rownames(big_table_no_blanks),
    duration = sample_duration)

# Create dataframe with PCoA coordinates
pcoa_no_blanks_df <- data.frame(
    PCoA1 = pcoa_coords_no_blanks$vectors[, 1],
    PCoA2 = pcoa_coords_no_blanks$vectors[, 2], 
    sample_id = rownames(big_table_no_blanks),
    duration = sample_duration
)
Run 0 stress 0.1440462 
[... trace truncated: runs 1–20, best stress 0.1440189 ...]
*** Best solution repeated 2 times
Example of working ggplot2 plot with rpy2 and R magic

I am just leaving this here as an example of how to generate plots directly in R while controlling the resolution and size of the output image. I can’t seem to manage it directly with the %%R cell magic, but opening a device with the png() function works fine.

%%R -i nmds_no_blanks_df -i kumamoto_metadata_no_blanks


# Plot NMDS
png("nmds_plot.png", res = 300, width=2000, height=1500)
ggplot(nmds_no_blanks_df, aes(x = NMDS1, y = NMDS2, color = duration)) +
  geom_point(size = 3, alpha = 0.8, stroke=0) +
  scale_color_manual(values=c("#30a12e", "#ae213b")) +
  scale_fill_manual(values=c("#30a12e", "#ae213b")) +
  geom_rug() +
  # the data frame column is `duration` (not `sample_duration`)
  stat_ellipse(aes(fill = duration), alpha = 0.2, level = 0.95, geom='polygon') +
  guides(fill=FALSE) +
  labs(
    title = "NMDS Plot of Bray-Curtis Distances",
    x = "NMDS1", 
    y = "NMDS2", 
    color = "Sample Duration") +
  theme_classic() +
  theme(
    text = element_text(family = "Georgia"),
    panel.grid.major = element_line(linetype=2)
  )
# note: dev.off() finalizes the png() device and overwrites the file
# that ggsave() just wrote, so the 2000x1500 @ 300 dpi version wins
ggsave("nmds_plot.png")
dev.off()
Saving 6.67 x 5 in image
quartz_off_screen 
                2 
from IPython.display import Image
Image('nmds_plot.png')
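For the record, rpy2’s R magic does document device-size arguments (-w/--width, -h/--height and -r/--res), which would be the more direct route if they behaved in this setup. The intended form would look something like the sketch below (not run):

%%R -w 2000 -h 1500 -r 300 -i nmds_no_blanks_df
# (sketch: same ggplot call as above, sized via the cell magic
# instead of an explicit png() device)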

Combined NMDS and PCoA plot
(pcoa_no_blanks_df
 .merge(nmds_no_blanks_df)
 .melt(id_vars=['sample_id', 'duration'], value_vars=['NMDS1', 'NMDS2', 'PCoA1', 'PCoA2'],
       var_name='axis', value_name='value')
 .assign(decomp=lambda dd: dd['axis'].str[:-1])
 .assign(axis=lambda dd: dd['axis'].str[-1].replace({'1': 'x', '2': 'y'}))
 .pivot(index=['sample_id', 'duration', 'decomp'], columns='axis', values='value')
 .reset_index()
 .pipe(lambda dd: p9.ggplot(dd, p9.aes(x='x', y='y', color='duration'))
       + p9.geom_point(size=2, alpha=0.7, stroke=0)
       + p9.geom_rug()
       + p9.stat_ellipse(p9.aes(fill='duration'), alpha=0.1, level=0.8, geom='polygon')
       + p9.facet_wrap('~decomp', scales='free', nrow=1)
       + p9.scale_color_manual(values=["#30a12e", "#ae213b"])
       + p9.scale_fill_manual(values=["#30a12e", "#ae213b"])
       + p9.labs(x='Coordinate$_1$', y='Coordinate$_2$', color='', fill='',
                 title='')
       + p9.theme(
           figure_size=(5, 3),
           legend_position='top'
       )
)
)

Show Code
permanova_p = permanova_blanks_results.iloc[0, 4]
permanova_r2 = permanova_blanks_results.iloc[0, 2]

a = (p9.ggplot(nmds_blanks_df)
 + p9.aes(x='NMDS1', y='NMDS2', color='sample_type')
 + p9.geom_point(size=3, alpha=0.8, stroke=0)
 + p9.geom_rug()
 + p9.annotate(
     geom='text', 
               label=f'n={nmds_blanks_df.shape[0]}', 
               x=nmds_blanks_df['NMDS1'].min(),
               y=nmds_blanks_df['NMDS2'].min(),
               size=10,
               family='Georgia',
               ha='left')
 + p9.scale_color_manual(values=["#5198ca", "#c16514"])
 + p9.scale_fill_manual(values=["#5198ca", "#c16514"])
 + p9.guides(fill=False)
 + p9.stat_ellipse(p9.aes(fill='sample_type'), alpha=0.1, level=0.95, geom='polygon')
 + p9.labs(title='',
          x='NMDS$_1$', y='NMDS$_2$', color='Sample Type',
          caption=f'PERMANOVA p-value: {permanova_p:.3f} (R$^2$ = {permanova_r2:.3%})',
          tag='(A)'

))

permanova_p = permanova_duration_results.iloc[0, 4]
permanova_r2 = permanova_duration_results.iloc[0, 2]

b = (p9.ggplot(nmds_no_blanks_df)
 + p9.aes(x='NMDS1', y='NMDS2', color='duration')
 + p9.geom_point(size=3, alpha=0.8, stroke=0)
 + p9.geom_rug()
 + p9.scale_color_manual(values=["#30a12e", "#ae213b"]) 
 + p9.scale_fill_manual(values=["#30a12e", "#ae213b"])
 + p9.annotate(
     geom='text',
     label=f'n={nmds_no_blanks_df.shape[0]}', 
     x=nmds_no_blanks_df['NMDS1'].min() * 1.05,
     y=nmds_no_blanks_df['NMDS2'].min() * 0.95,
     size=10, 
     family='Georgia', 
     ha='left',
     va='bottom')
 + p9.guides(fill=False)
 + p9.stat_ellipse(p9.aes(fill='duration'), alpha=0.1, level=0.95, geom='polygon')
 + p9.labs(title='',
          x='NMDS$_1$', y='', color='Sample Duration',
          caption=f'PERMANOVA p-value: {permanova_p:.3f} (R$^2$ = {permanova_r2:.3%})',
          tag='(B)')

)


composition = (a | b) 

composition  & p9.theme(figure_size=(8, 4),
      plot_caption=p9.element_text(size=10, family='Georgia', ha='right'),
      legend_position='top',
      plot_tag=p9.element_text(size=16, family='Georgia', face='bold', ha='left'),
      plot_tag_location='plot',
  ) 

As shown in the results (aside from the satisfaction of finally being able to programmatically generate labeled compositions with plotnine v0.15.0), the NMDS plots reveal interesting patterns in the data, which the PERMANOVA results confirm.

The blanks and actual samples are clearly separated in NMDS space, and significantly different from each other according to the PERMANOVA results (p = 0.001).

As for sample duration, it also seems to have an effect on sample composition: week-long samples are more similar to each other than to the 24-hour samples, which are more dispersed in NMDS space. The PERMANOVA results confirm this as well (p = 0.001).
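One caveat on the duration result: PERMANOVA can return small p-values when groups differ in spread rather than location, and the 24-hour samples are visibly more dispersed. A homogeneity-of-dispersion test would disentangle the two; a minimal sketch with vegan’s betadisper, reusing the objects from the ordination cell above (not run here):

%%R
# Sketch: test whether the two duration groups differ in multivariate spread.
# A significant permutest would mean the PERMANOVA signal may partly reflect
# dispersion rather than a shift in community composition.
disp <- vegan::betadisper(dist_matrix_no_blanks, sample_duration)
vegan::permutest(disp, permutations = 999)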

Timeline plot code
month_colors_dict = {
    "June": "#FFD27A",
    "July": "#FFC04D",
    "August": "#FFAD33",
    "September": "#E68A4C",
    "October": "#CC6E4A",
    "November": "#B35A4D",
    "December": "#9A7ACC",
    "January": "#7C93E6",
    "February": "#66A3F2",
    "March": "#7DCB9B",
}


season_colors_dict = {
    "Spring": "#7DCB9B",
    "Summer": "#FFB84D",
    "Autumn": "#CC6E4A",
    "Winter": "#7C93E6",
}

calendar_samples = (pd.DataFrame(dict(
    date=
pd.date_range(
    start=kumamoto_metadata_no_blanks['date_start'].min(),
    end=kumamoto_metadata_no_blanks['date_end'].max(),
    freq='D'
)))
 .assign(month_name=lambda dd: dd['date'].dt.strftime('%B'))
 .assign(year=lambda dd: dd['date'].dt.year)
 .assign(season=lambda dd: np.select(
        [dd['date'].dt.dayofyear < 80,  # Winter
         dd['date'].dt.dayofyear < 172,  # Spring
         dd['date'].dt.dayofyear < 264,  # Summer
         dd['date'].dt.dayofyear < 355],  # Autumn
        ['Winter', 'Spring', 'Summer', 'Autumn'],
        default='Winter' # Default to Winter for days after December 21
    ))
)


a = (kumamoto_metadata_no_blanks
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.geom_rect(p9.aes(
           xmin='date_start', 
           xmax='date_end', 
           ymin=0, 
           ymax=1,
           fill='sample_duration'),
        size=.1,
        color='black',
        alpha=.75
           )
       + p9.scale_y_continuous(expand=(0, 0), breaks=[])
       + p9.scale_x_datetime(date_breaks='1 month',
                             date_labels='%b %d %Y',
                             expand=(0, 0)
                             )
       + p9.scale_fill_manual(values=["#30a12e", "#ae213b"])
       + p9.labs(fill='', title='Timeline of Kumamoto Samples')
       + p9.theme(
           figure_size=(6, 2),
           legend_position=(.55, .65),
           axis_text_x=p9.element_blank(),
           plot_margin_bottom=-.001,
           
           )

       )
)

b = (calendar_samples
     .groupby(['year', 'month_name'], as_index=False)
        .agg(min_date=('date', 'min'),
             max_date=('date', 'max'),
             center_date=('date', 'mean'))
    .pipe(lambda dd: p9.ggplot(dd)
            + p9.geom_rect(p9.aes(
                xmin='min_date',
                xmax='max_date',
                ymin=0,
                ymax=1,
                fill='month_name'),
            )
            + p9.geom_text(p9.aes(x='center_date', y=0.5, label='month_name'),
                           size=8, family='Georgia', ha='center', va='center')
            + p9.scale_x_datetime(date_breaks='1 month',
                                  date_labels='%b %Y',
                                  expand=(0, 0))
            + p9.scale_fill_manual(values=month_colors_dict.values(),
                                   breaks=list(month_colors_dict.keys()))
            + p9.labs(fill='', x='', y='')
            + p9.guides(fill=False)
            + p9.theme(
                figure_size=(6, .5),
                legend_position='top',
                axis_text=p9.element_blank(),
                axis_ticks=p9.element_blank(),
                axis_line=p9.element_blank(),
                panel_grid=p9.element_blank(),
                plot_margin_bottom=-.01,
            )
             )
)

c = (calendar_samples
     .groupby(['season'], as_index=False)
        .agg(min_date=('date', 'min'),
             max_date=('date', 'max'),
             center_date=('date', 'mean'))
    .pipe(lambda dd: p9.ggplot(dd)
            + p9.geom_rect(p9.aes(
                xmin='min_date',
                xmax='max_date',
                ymin=0,
                ymax=1,
                fill='season'),
            )
            + p9.geom_text(p9.aes(x='center_date', y=0.5, label='season'),
                           size=8, family='Georgia', ha='center', va='center')
            + p9.scale_x_datetime(
                date_breaks='1 month',
                date_labels='%b %Y',
                expand=(0, 0))
            + p9.scale_fill_manual(
                values=season_colors_dict.values(),
                breaks=list(season_colors_dict.keys()))
            + p9.labs(fill='', x='', y='')
            + p9.guides(fill=False)
            + p9.theme(
                figure_size=(6, .5),
                legend_position='top',
                axis_text=p9.element_blank(),
                axis_ticks=p9.element_blank(),
                axis_line=p9.element_blank(),
                panel_grid=p9.element_blank(),

            )
             )
)
ph = p9.ggplot() + p9.geom_blank() + p9.theme(axis_line=p9.element_blank())
empty_comp = (b / c /  ph)
final_comp = a / empty_comp
final_comp & p9.theme(figure_size=(6, 2), dpi=300, plot_title=p9.element_text(size=12)) 
 

If we now color the NMDS plots by season and month, and run the PERMANOVA tests on those same variables, we can see that the season and month of the year actually have a much stronger effect on sample composition than sampling duration: p = 0.001 in both cases, but explaining roughly 15% and 27% of the variance in community composition, respectively, compared to a mere 7% for sample duration.

Temporal Patterns in Community Composition

%%R -i kumamoto_metadata_no_blanks -o permanova_season_results -o permanova_month_results

season <- kumamoto_metadata_no_blanks$season[match(rownames(big_table_no_blanks), kumamoto_metadata_no_blanks$sample_id)]
month <- kumamoto_metadata_no_blanks$month_name[match(rownames(big_table_no_blanks), kumamoto_metadata_no_blanks$sample_id)]
permanova_season_results <- vegan::adonis2(big_table_no_blanks ~ season, permutations = 999)
permanova_month_results <- vegan::adonis2(big_table_no_blanks ~ month, permutations = 999)

print(permanova_season_results)
print(permanova_month_results)
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = big_table_no_blanks ~ season, permutations = 999)
          Df SumOfSqs     R2     F Pr(>F)    
Model      3   4.4672 0.1506 6.501  0.001 ***
Residual 110  25.1957 0.8494                 
Total    113  29.6629 1.0000                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = big_table_no_blanks ~ month, permutations = 999)
          Df SumOfSqs      R2      F Pr(>F)    
Model      8   8.2122 0.27685 5.0248  0.001 ***
Residual 105  21.4507 0.72315                  
Total    113  29.6629 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
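Since sampling duration and calendar time are partly confounded here (the daily and weekly campaigns did not cover identical periods), the single-term R² values above overlap. adonis2 can assess each term’s marginal contribution in a joint model via by = "margin"; a hedged sketch reusing the vectors built above (not run here):

%%R
# Sketch: marginal tests for duration and month fitted together;
# `by = "margin"` evaluates each term after accounting for the other.
joint_permanova <- vegan::adonis2(big_table_no_blanks ~ sample_duration + month,
                                  permutations = 999, by = "margin")
print(joint_permanova)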
Show Code
a = (nmds_no_blanks_df
 .merge(kumamoto_metadata_no_blanks, on='sample_id', how='left')
 .pipe(lambda dd: p9.ggplot(dd)
         + p9.aes(x='NMDS1', y='NMDS2', color='season')
         + p9.stat_ellipse(p9.aes(fill='season'), alpha=0.1, level=0.95, geom='polygon')
         + p9.geom_point(size=3, alpha=0.8, stroke=0)
         + p9.geom_rug()
         + p9.scale_color_manual(
                values=season_colors_dict.values(),
                breaks=list(season_colors_dict.keys())
                                 )
         + p9.scale_fill_manual(
                values=season_colors_dict.values(),
                breaks=list(season_colors_dict.keys())
         )
         + p9.labs(title='', fill='',
             x='NMDS$_1$', y='NMDS$_2$', color='',
             caption=f'PERMANOVA p-value: {permanova_season_results.iloc[0, 4]:.3f}' \
               f' (R$^2$ = {permanova_season_results.iloc[0, 2]:.1%})',
               tag='(A)'
         )
         + p9.theme(
              text=p9.element_text(family='Georgia'),
              plot_caption=p9.element_text(size=10, family='Georgia', ha='right'),
              plot_tag=p9.element_text(size=16, family='Georgia', face='bold', ha='left'),
              plot_tag_location='plot',
              legend_position='top',
         )
     )
)

b = (nmds_no_blanks_df
 .merge(kumamoto_metadata_no_blanks, on='sample_id', how='left')
 .pipe(lambda dd: p9.ggplot(dd)
         + p9.aes(x='NMDS1', y='NMDS2', color='month_name', fill='month_name')
         + p9.geom_point(size=3, alpha=0.8, stroke=0)
         + p9.geom_rug()
         + p9.scale_color_manual(
                values=month_colors_dict.values(),
                breaks=list(month_colors_dict.keys()),
                labels=[m[:3] for m in list(month_colors_dict.keys())]
         )
         + p9.scale_fill_manual(
                values=month_colors_dict.values(),
                breaks=list(month_colors_dict.keys()),
                labels=[m[:3] for m in list(month_colors_dict.keys())]
         )
         + p9.labs(title='', fill='',
             x='NMDS$_1$', y='NMDS$_2$', color='',
             caption=f'PERMANOVA p-value: {permanova_month_results.iloc[0, 4]:.3f}' \
              f' (R$^2$ = {permanova_month_results.iloc[0, 2]:.1%})',
              tag='(B)'
         )
         + p9.guides(color=p9.guide_legend(ncol=5))
         + p9.theme(
              figure_size=(6, 4),
              text=p9.element_text(family='Georgia'),
              plot_caption=p9.element_text(size=10, family='Georgia', ha='right'),
              plot_tag=p9.element_text(size=16, family='Georgia', face='bold', ha='left'),
              plot_tag_location='plot',
              legend_position='top',
         )
     )
)

composition = (a | b) & p9.theme(figure_size=(8, 4))
composition

The results are quite clear and consistent: as far as we can tell, time is the main driver of changes in community composition. However, so far we have pooled every taxonomic group, mixing Bacteria, Fungi, Metazoa and Viruses in the same analysis. That is not ideal, since these groups have very different ecological and evolutionary dynamics; analyzing each group individually might be more informative and reveal patterns we would not see otherwise, as sketched below.
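As a rough outline of what that would look like, the per-clade analyses amount to rebuilding the sample × taxon table one clade at a time and feeding each into the same vegdist/adonis2 workflow used above. A hedged sketch (column names as used elsewhere in this notebook; the loop itself is illustrative):

# Sketch: one sample x taxon count table per big clade
clade_tables = {}
for clade in ['Bacteria', 'Eukaryota', 'Viruses']:
    clade_tables[clade] = (abundance_tax_long
        .merge(taxonomy_with_ids)
        # same big-clade logic as in the "Big clades" cells that follow
        .assign(big_clade=lambda dd: np.where(dd['domain'].notna(),
                                              dd['domain'], dd['acellular root']))
        .query('big_clade == @clade')
        .groupby(['sample_id', 'best_taxonomy_name'], as_index=False)['counts']
        .sum()
        .pivot(index='sample_id', columns='best_taxonomy_name', values='counts')
        .fillna(0))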

Big clades

If we take a look at the assigned reads, we can see how they are distributed across different sample durations and big clades:

Show Code
big_clade_per_sample_type = (abundance_tax_long
 .merge(taxonomy_with_ids)
 .assign(domain=lambda dd: np.where(dd['acellular root'].isna() & dd['domain'].isna(),
                                   'Unknown', dd['domain']))
 .query('`acellular root`.notna() or domain.notna()')
 .assign(big_clade=lambda dd: np.where(dd['domain'].notna(), dd['domain'],
                                       dd['acellular root']))
 .groupby(['sample_id', 'sample_type', 'big_clade'], as_index=False)
 .agg(total_counts=('counts', 'sum'))
 .groupby('sample_id', as_index=False)
 .apply(lambda dd: dd.assign(relative_abundance=dd['total_counts'] / dd['total_counts'].sum()))
 .reset_index(drop=True)
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
  .groupby(['sample_duration', 'big_clade'], as_index=False)
  .agg(total_counts=('total_counts', 'sum'),
       relative_abundance=('relative_abundance', 'mean'))
)

(big_clade_per_sample_type
 .query('relative_abundance > 0')
 .query('big_clade.isin(["Bacteria", "Eukaryota", "Viruses", "Unknown"])')
 .assign(big_clade=lambda dd: pd.Categorical(
     dd['big_clade'],
     categories=['Bacteria', 'Eukaryota', 'Viruses', 'Unknown'],
     ordered=True))
 .pipe(lambda dd: p9.ggplot(dd)
      + p9.aes(x='sample_duration',
               y='relative_abundance', 
               fill='big_clade')
      + p9.geom_col(alpha=1)
      + p9.scale_y_continuous(labels=label_percent())
      # four categories are plotted (Bacteria, Eukaryota, Viruses, Unknown),
      # so the manual fill scale needs a fourth colour (grey for Unknown)
      + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860", "#9e9e9e"])
      + p9.labs(x='', y='Relative Abundance', fill='', 
          caption="*Archaea not shown (<0.1% of all reads)")
      + p9.theme(
          figure_size=(4, 3),
          legend_position='top',
          legend_text=p9.element_text(size=8),
          legend_key_size=10,
          plot_caption= p9.element_text(size=8),
          )
  )

)

Show Code
(big_clade_per_sample_type
 .pivot(
     index='sample_duration',
     columns='big_clade', 
     values='relative_abundance')
[['Bacteria', 'Eukaryota', 'Viruses', 'Unknown', 'Archaea',]]
.style
.format(lambda x: f"{x * 100:.2f}%")
)
sample_duration   Bacteria   Eukaryota   Viruses   Unknown   Archaea
Blanks              93.37%       2.69%     0.67%     3.27%      nan%
Daily               75.39%      23.04%     1.03%     0.51%     0.04%
Weekly              54.41%      44.93%     0.30%     0.35%     0.02%


There is a rather stark contrast between the relative abundances of Bacteria and Eukaryota in the samples, with Viruses present at much lower abundances. While the blank reads mostly belong to Bacteria (93%), the daily samples show a still dominant but lower relative abundance of Bacteria (75%) relative to Eukaryota (23%). In the weekly samples, however, the Bacteria-to-Eukaryota ratio evens out considerably, with 54% of the reads belonging to Bacteria and 45% to Eukaryota.

Viruses are present in all samples but at very low abundances, ranging from 0.30% to 1.03%.

The unclassified sequences (at this level) make up 0.35% and 0.51% of the reads in the weekly and daily samples, respectively, but 3.27% in the blanks.

If we look at the percentage of unassigned reads at the big-clade level per sample, however, we see that the large blank percentage is driven by a single blank sample, KB105:

Taxonomic Assignment Success Rates

Show Code
(abundance_tax_long
 .merge(taxonomy_with_ids)
 .assign(domain=lambda dd: np.where(dd['acellular root'].isna() & dd['domain'].isna(),
                                   'Unknown', dd['domain']))
 .query('`acellular root`.notna() or domain.notna()')
 .assign(big_clade=lambda dd: np.where(dd['domain'].notna(), dd['domain'],
                                       dd['acellular root']))
 .groupby(['sample_id', 'sample_type', 'big_clade'], as_index=False)
 .agg(total_counts=('counts', 'sum'))
 .groupby('sample_id', as_index=False)
 .apply(lambda dd: dd.assign(relative_abundance=dd['total_counts'] / dd['total_counts'].sum()))
 .reset_index(drop=True)
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
 .query('big_clade == "Unknown"')
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='sample_duration', y='relative_abundance', fill='sample_duration')
       + p9.geom_jitter(p9.aes(color='sample_duration'),
                        data=dd.query('relative_abundance < 0.1'), 
                        size=2, alpha=0.5, width=0.2, stroke=0)
       + p9.geom_point(p9.aes(color='sample_duration'),
                        data=dd.query('relative_abundance >= 0.1'), 
                        size=2, alpha=0.8, stroke=0)
       + p9.geom_label(p9.aes(label='sample_id'), 
                       data=dd.query('relative_abundance > 0.1'), 
                       nudge_x=.2,
                       size=6,
                       family='monospace')
       + p9.labs(x='', y='% of unassigned reads', fill='')
       + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_y_continuous(labels=label_percent())
       + p9.guides(fill=False, color=False)
       + p9.theme(
           figure_size=(4, 2),
           axis_title_y=p9.element_text(size=9),
           axis_text=p9.element_text(size=9)         
           )
 )
)

This sample is also the one with the lowest DNA yield (0.13 ng) and the lowest total read count (roughly 100k, an order of magnitude lower than the next-lowest sample, which already has over a million reads), so it is not surprising that it has a much higher percentage of unassigned reads than the rest. Given how much of an outlier this sample is, we will remove it from the analysis and focus on the remaining samples. If we run the comparison between sample types again, the difference in the percentage of unassigned reads persists: it is significant between Blanks and Weekly samples and between Daily and Weekly samples, but not between Blanks and Daily samples:

Show Code
unassigned_data = (abundance_tax_long
 .merge(taxonomy_with_ids)
 .assign(domain=lambda dd: np.where(dd['acellular root'].isna() & dd['domain'].isna(),
                                   'Unknown', dd['domain']))
 .query('`acellular root`.notna() or domain.notna()')
 .assign(big_clade=lambda dd: np.where(dd['domain'].notna(), dd['domain'],
                                       dd['acellular root']))
 .groupby(['sample_id', 'sample_type', 'big_clade'], as_index=False)
 .agg(total_counts=('counts', 'sum'))
 .groupby('sample_id', as_index=False)
 .apply(lambda dd: dd.assign(relative_abundance=dd['total_counts'] / dd['total_counts'].sum()))
 .reset_index(drop=True)
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
 .query('big_clade == "Unknown"')
 .query('sample_id != "KB105"')
)

(unassigned_data
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='sample_duration', y='relative_abundance', fill='sample_duration')
       + p9.geom_boxplot(outlier_alpha=0, alpha=.3, width=.35)
       + p9.geom_jitter(p9.aes(color='sample_duration'),
                        data=dd.query('relative_abundance < 0.1'), 
                        size=2, alpha=0.5, width=0.1, stroke=0)
       + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.labs(x='', y='% of unassigned reads', fill='')
       + p9.scale_y_continuous(labels=label_percent())
       + p9.guides(fill=False, color=False)
       + p9.annotate(geom='segment', 
                     x=[2.025, 1.025], 
                     xend=[2.975, 2.975],
                     y=[dd['relative_abundance'].max() * 1.1, 
                        dd['relative_abundance'].max() * 1.3],
                     yend=[dd['relative_abundance'].max() * 1.1,
                           dd['relative_abundance'].max() * 1.3],
                     color='black',
                     lineend='butt',
                     arrow=p9.arrow(length=.05, angle=90, ends='both')
                     )
        + p9.annotate(geom='text', x=[2, 2.5], 
                      y=[
                            dd['relative_abundance'].max() * 1.35,
                            dd['relative_abundance'].max() * 1.15],
                      label='*'
                     )
       + p9.theme(
           figure_size=(4, 2),
           axis_title_y=p9.element_text(size=9),
           axis_text=p9.element_text(size=9)         
           )
 )
)

Show Code
%%R -i unassigned_data
library(betareg)

kruskal_test <- kruskal.test(relative_abundance ~ sample_duration, data = unassigned_data)
print(kruskal_test)

# Dunn's test in case it's significant:
if(kruskal_test$p.value < 0.05) {
  print("Kruskal-Wallis test is significant, performing Dunn's test...")
  dunn_test <- FSA::dunnTest(relative_abundance ~ sample_duration, 
                            data = unassigned_data, method = "bh")
  print(dunn_test)
}

# Summary statistics
summary_stats <- unassigned_data %>%
  group_by(sample_duration) %>%
  summarise(
    n = n(),
    median = median(relative_abundance),
    q25 = quantile(relative_abundance, 0.25),
    q75 = quantile(relative_abundance, 0.75),
    mean = mean(relative_abundance),
    sd = sd(relative_abundance)
  )

print(summary_stats)

    Kruskal-Wallis rank sum test

data:  relative_abundance by sample_duration
Kruskal-Wallis chi-squared = 7.959, df = 2, p-value = 0.0187

[1] "Kruskal-Wallis test is significant, performing Dunn's test..."
       Comparison        Z    P.unadj      P.adj
1  Blanks - Daily 1.302445 0.19276440 0.19276440
2 Blanks - Weekly 2.558239 0.01052037 0.03156111
3  Daily - Weekly 2.349434 0.01880199 0.02820298
# A tibble: 3 × 7
  sample_duration     n  median     q25     q75    mean      sd
  <chr>           <int>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Blanks              8 0.00502 0.00451 0.00749 0.00607 0.00235
2 Daily              97 0.00450 0.00369 0.00625 0.00507 0.00220
3 Weekly             17 0.00392 0.00180 0.00483 0.00349 0.00195
Dunn (1964) Kruskal-Wallis multiple comparison
  p-values adjusted with the Benjamini-Hochberg method.

In addition: Warning message:
sample_duration was coerced to a factor. 

Bacterial community composition

Let’s now take a look at the bacterial fraction of the taxonomic assignments for our samples. Checking the levels of classification provided by the pipeline, they seem to roughly match the following columns, which we can use to map each classification level to its actual name:

Since we also have plenty of reads assigned to names like “unclassified Bacteria” or “unclassified Micrococcus”, we will clean up the data and unify the names of missing taxa and of taxa whose names imply a lack of classification, so that the analysis of the bacterial community composition is more consistent and informative.

Bacterial species-level abundances and species name cleaning code
# Patterns which might look like species but are too vague to be useful and shall be considered "Unclassified"

bacterial_taxonomy = (taxonomy_with_ids
            .query('domain == "Bacteria"')
)
non_specific_patterns = [
    r'^\s*uncultured\b.*',
    r'^\s*unclassified\b.*',
    r'^\s*environmental\b.*',
    r'^\s*unidentified\b.*',
    r'^\s*bacterium\s*$',                                   # exactly "bacterium"
    r'^(?:[A-Za-z]+aceae|[A-Za-z]+ales|[A-Za-z]+ia)\s+bacterium$',  # Family/Order/Class + bacterium (no strain)
    r'^(?:marine|soil)\s+bacterium$',                       # marine/soil bacterium (no strain)
    r'^\s*cyanobacterium\s*$',                              # exactly "cyanobacterium"
    r'^[A-Z][a-z]+\s+sp\.\s*$',                            # Genus sp. (no strain) 
    # any line that ends with bacterium/cyanobacterium and has no trailing strain token
    r'^(?:(?!\s+(?:bacterium|cyanobacterium)\s+[A-Za-z0-9_-]+$).)*(?:^|\s)(?:bacterium|cyanobacterium)\s*$',
]

def standardize_unclassified_species(name):
    """
    If the species name matches any of the non-specific patterns, return "Unclassified"
    """
    for pattern in non_specific_patterns:
        if re.match(pattern, name, re.IGNORECASE):
            return "Unclassified"
    return name
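
# Quick illustrative checks (hypothetical inputs):
#   standardize_unclassified_species("uncultured bacterium")  # -> "Unclassified"
#   standardize_unclassified_species("Bacillus sp.")          # -> "Unclassified"
#   standardize_unclassified_species("Micrococcus luteus")    # -> "Micrococcus luteus"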

bacterial_abundances_long = (abundance_tax_long
 .merge(taxonomy_with_ids)
 .query('domain=="Bacteria"')
 .assign(species=lambda dd: np.select(
     [dd['species'].notna(),
      dd['best_taxonomy_level'].isin(['strain', 'subspecies'])],
     [dd['species'],
      # strains/subspecies: collapse to the binomial (first two name tokens)
      dd['best_taxonomy_name'].str.split(' ').str[:2].str.join(' ')],
     default='Unclassified'))
 # if species matches any of the non-specific patterns, set to "Unclassified"
#  .assign(species=lambda dd: dd['species'].apply(standardize_unclassified_species))
 .groupby(['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'best_taxonomy_level'] + ['sample_id'], as_index=False, dropna=False)
 .agg(counts=('counts', 'sum'))
)
Show Code
unclassified_reads_df = (bacterial_abundances_long
.query('counts > 0')
 .groupby('sample_id')
 .apply(lambda dd: 
        dd[['phylum', 'class', 'order', 'family', 'genus', 'species']]
        .fillna("Unclassified")
        .map(standardize_unclassified_species)
        .map(lambda x: x == "Unclassified")
        .assign(counts=dd['counts']), 
    include_groups=False)
    .reset_index()
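    # the .map(...) calls above turned each rank into a True/False flag
    # (True = unclassified); multiplying the flags by `counts` below turns
    # them into per-rank unclassified read counts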
    .eval('unknown_phylum = phylum * counts')
    .eval('unknown_class = `class` * counts')
    .eval('unknown_order = order * counts')
    .eval('unknown_family = family * counts')
    .eval('unknown_genus = genus * counts')
    .eval('unknown_species = species * counts')
    .groupby('sample_id', as_index=False)
    .agg(
        unknown_phylum=('unknown_phylum', 'sum'),
        unknown_class=('unknown_class', 'sum'),
        unknown_order=('unknown_order', 'sum'),
        unknown_family=('unknown_family', 'sum'),
        unknown_genus=('unknown_genus', 'sum'),
        unknown_species=('unknown_species', 'sum'),
        total_counts=('counts', 'sum'),
    )
    .melt(id_vars=['sample_id', 'total_counts'], 
          var_name='taxonomy_level', 
          value_name='unclassified_counts')
    .assign(taxonomy_level=lambda dd: dd['taxonomy_level'].str.split('_').str[1].str.capitalize())
    .eval('unclassified_rate = unclassified_counts / total_counts')
    .assign(taxonomy_level=lambda dd: pd.Categorical(
        dd['taxonomy_level'],
        categories=['Species', 'Genus', 'Family', 'Order', 'Class', 'Phylum'],
        ordered=True))
    .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
           how='left', on='sample_id')
    .replace({'sample_duration': {'Daily': 'Daily Samples', 'Weekly': 'Weekly Samples'}})
)

(unclassified_reads_df
    .pipe(lambda dd: p9.ggplot(dd)
          + p9.aes('reorder(sample_id, unclassified_rate)', 'unclassified_rate')
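          # one full-height bar per taxonomic level, drawn with position='identity':
          # unclassified_rate shrinks from Species to Phylum, so each later (shorter)
          # layer overplots the previous one, producing the nested look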
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Species"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Genus"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Family"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Order"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Class"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Phylum"'))
          + p9.scale_y_continuous(
              labels=["0%\n100%", "25%\n75%", "50%\n50%", "75%\n25%"], 
              expand=(0, 0, 0, .05), 
              breaks=[0, 0.25, 0.5, 0.75,])
          + p9.labs(x='', y='Fraction of unclassified/classified reads', fill='')
          + p9.facet_wrap('sample_duration', scales='free_y')
          + p9.guides(fill=p9.guide_legend(nrow=1))
          + p9.coord_flip()
          + p9.theme(
              axis_text_y=p9.element_blank(),
              axis_ticks_major_y=p9.element_blank(),
              panel_grid_major_y=p9.element_blank(),
              strip_background=p9.element_blank(),
              panel_grid=p9.element_blank(),
              legend_position='top',
              figure_size=(7, 4)
          )

          )
)

This means that, at the Species level, only around 10–11% of all reads are classified on average (and thus ~90% are unclassified), quite consistently across sample types. The boxplots below show the comparison for each taxonomic level, with the table summarizing the average percentage of assigned reads at each one. Strikingly, the weekly samples have the lowest percentage of assigned reads at every level, with a sizable chunk of weekly samples showing a low fraction of assigned reads even at the Phylum level. This is quite concerning and might indicate that bacterial DNA degrades at a high rate in these samples.

Show Code
(unclassified_reads_df
 .assign(sample_duration=lambda dd: dd.sample_duration.str.replace(' ', '\n'))
 .pipe(lambda dd: 
       p9.ggplot(dd)
       + p9.aes('sample_duration', '1 - unclassified_rate', fill='sample_duration')
       + p9.geom_jitter(p9.aes(color='sample_duration'), width=.1, size=2, alpha=0.5, stroke=0)
       + p9.geom_boxplot(outlier_alpha=0, alpha=.3, width=.35)
       + p9.facet_wrap('taxonomy_level')
       + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
         + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
         + p9.scale_y_continuous(labels=label_percent(), expand=(0, 0, 0, .05))
       + p9.guides(fill=False, color=False)
       + p9.labs(x='', y='Fraction of classified reads')
       + p9.theme(
           figure_size=(7, 4),
       )
       )
)

Show Code
(unclassified_reads_df
 .assign(classified_rate=lambda dd: 1 - dd['unclassified_rate'])
 .groupby(['sample_duration', 'taxonomy_level']).agg({
    'classified_rate': 'mean',
}).reset_index()
.rename(columns={'sample_duration': 'Sample Group'})
.pivot(index='taxonomy_level', columns='Sample Group', values='classified_rate')
.style
.format('{:.2%}')
.set_caption("Average fraction of classified reads per taxonomic level")
)
Table 1: Average fraction of classified reads per taxonomic level

taxonomy_level   Blanks   Daily Samples   Weekly Samples
Species          11.06%           9.59%            9.68%
Genus            53.76%          51.51%           31.19%
Family           63.95%          59.04%           41.31%
Order            64.87%          64.42%           43.74%
Class            66.64%          68.25%           46.66%
Phylum           68.28%          77.55%           51.97%

Now, the total number of reads that end up unassigned at the Species and Genus levels is really high: well over 50% of all reads are not even assigned to a Genus, and over 90% are not assigned to a Species.

However, in absolute terms, the classified fraction still provides a wealth of information about a rich diversity of Bacteria. Let’s put that into numbers:


Show Code
(bacterial_abundances_long
 .query('counts > 0')
 .query('species != "Unclassified"')
 .groupby(['sample_id'], as_index=False)
 .species.nunique()
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='sample_duration', y='species', fill='sample_duration')
       + p9.geom_jitter(p9.aes(color='sample_duration'), size=2, alpha=0.5, width=0.1, stroke=0)
       + p9.geom_boxplot(outlier_alpha=0, alpha=.3, width=.35)
       + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.guides(fill=False, color=False)
       + p9.labs(x='', y='Number of classified species', fill='')
       + p9.theme(
           figure_size=(4, 2),
           axis_title_y=p9.element_text(size=9),
           axis_text=p9.element_text(size=9)
           )
 ))

(bacterial_abundances_long
 .query('counts > 0')
 .query('species != "Unclassified"')
 .groupby(['sample_id'], as_index=False)
 .agg(species=('species', 'nunique'),
      genera=('genus', 'nunique'),
      families=('family', 'nunique'),
      orders=('order', 'nunique'),
      classes=('class', 'nunique'),
      phyla=('phylum', 'nunique'))
    .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
            how='left', on='sample_id')
 .melt(id_vars=['sample_id', 'sample_duration'],
       var_name='taxonomy_level', value_name='num_taxa')

 .assign(taxonomy_level=lambda dd: dd['taxonomy_level'].str.capitalize(),
         tax_level=lambda dd: pd.Categorical(
             dd['taxonomy_level'],
             categories=['Species', 'Genera', 'Families', 'Orders', 'Classes', 'Phyla'],
             ordered=True))
 .assign(sample_duration=lambda dd: pd.Categorical(dd['sample_duration'],
                                                  categories=['Daily', 'Weekly', 'Blanks'],))
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='sample_duration', y='num_taxa', fill='sample_duration')
       + p9.geom_jitter(p9.aes(color='sample_duration'), size=2, alpha=0.5, width=0.1, stroke=0)
       + p9.geom_boxplot(outlier_alpha=0, alpha=.3, width=.35)
       + p9.scale_color_manual(values=["#c16514", "#17a860", "#5198ca", ])
       + p9.scale_fill_manual(values=["#c16514", "#17a860", "#5198ca", ])
       + p9.facet_wrap('tax_level', scales='free_y')
       + p9.guides(fill=False, color=False)
       + p9.labs(x='', y='Number of unique taxa', fill='')
       + p9.theme(
           figure_size=(7, 3),
           axis_title_y=p9.element_text(size=9),
           axis_text=p9.element_text(size=9)
           )
)
)

Table of mean number of unique bacterial species per sample group
(bacterial_abundances_long
 .query('counts > 0')
 .query('species != "Unclassified"')
 .groupby(['sample_id'], as_index=False)
 .species.nunique()
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
 .groupby('sample_duration', as_index=False)
 .agg({'species': 'mean'})
 .rename(columns={'species': 'Mean n of unique species',
                  'sample_duration': 'Sample Group'})
 .style
 .hide(axis='index')
 .format({
     'Mean n of unique species': '{:.1f}'
 })
)
Sample Group   Mean n of unique species
Blanks                            101.2
Daily                             329.1
Weekly                            122.1
Table of mean number of unique bacterial genera per sample group
(genus_abundances
 .groupby('sample_id', as_index=False)  # keep sample_id as a column for the merge below
 .agg({'genus': 'nunique'})
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], how='left', on='sample_id')
 .groupby('sample_duration', as_index=False)
 .agg({'genus': 'mean'})
 .rename(columns={'genus': 'Mean n of unique genera',
                  'sample_duration': 'Sample Group'})
 .style
 .hide(axis='index')
 .format({
     'Mean n of unique genera': '{:.1f}'})
)
Sample Group   Mean n of unique genera
Blanks                             77.8
Daily                             264.5
Weekly                            115.9
Show Code
# We will create a sample x genus table for later use
bacterial_genus_table = (bacterial_abundances_long
    .query('counts > 0')
    .query('genus.notna() & genus != "Unclassified"')
    # aggregate with as_index=False so pivot() (a DataFrame method) applies,
    # then fill absent genera with 0 so vegdist() gets a complete matrix
    .groupby(['sample_id', 'genus'], as_index=False)['counts']
    .sum()
    .pivot(index='sample_id', columns='genus', values='counts')
    .fillna(0)
)
%%R -i bacterial_genus_table -i kumamoto_metadata -o bacterial_pcoa_df -o bacterial_pcoa_df_no_blanks -o bacterial_permanova_results -o bacterial_permanova_results_no_blanks

bacterial_dist <- vegan::vegdist(bacterial_genus_table, method = "bray")
# subset with base R so the sample IDs in the rownames are preserved
# (dplyr::filter resets data.frame row names)
genus_no_blanks <- bacterial_genus_table[!grepl("^KB", rownames(bacterial_genus_table)), ]
bacterial_dist_no_blanks <- vegan::vegdist(genus_no_blanks, method = "bray")

# bacterial_nmds <- vegan::metaMDS(bacterial_dist, k = 2)
bacterial_pcoa <- ape::pcoa(bacterial_dist)
bacterial_pcoa_no_blanks <- ape::pcoa(bacterial_dist_no_blanks)
sample_duration <- kumamoto_metadata$sample_duration[match(rownames(bacterial_genus_table), kumamoto_metadata$sample_id)]
sample_duration_no_blanks <- kumamoto_metadata$sample_duration[match(rownames(genus_no_blanks), kumamoto_metadata$sample_id)]
bacterial_permanova_results <- vegan::adonis2(bacterial_genus_table ~ sample_duration, permutations = 999)
bacterial_permanova_results_no_blanks <- vegan::adonis2(genus_no_blanks ~ sample_duration_no_blanks, permutations = 999)

print(bacterial_permanova_results)
print(bacterial_permanova_results_no_blanks)

bacterial_pcoa_df <- as.data.frame(bacterial_pcoa$vectors[, 1:2])
colnames(bacterial_pcoa_df) <- c('PCoA1', 'PCoA2')
bacterial_pcoa_df$sample_id <- rownames(bacterial_pcoa_df)
bacterial_pcoa_df_no_blanks <- as.data.frame(bacterial_pcoa_no_blanks$vectors[, 1:2])
colnames(bacterial_pcoa_df_no_blanks) <- c('PCoA1', 'PCoA2')
bacterial_pcoa_df_no_blanks$sample_id <- rownames(bacterial_pcoa_df_no_blanks)
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = bacterial_genus_table ~ sample_duration, permutations = 999)
          Df SumOfSqs      R2      F Pr(>F)    
Model      2    3.549 0.11142 7.5234  0.001 ***
Residual 120   28.305 0.88858                  
Total    122   31.854 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = genus_no_blanks ~ sample_duration_no_blanks, permutations = 999)
          Df SumOfSqs      R2      F Pr(>F)    
Model      1   2.4666 0.08797 10.803  0.001 ***
Residual 112  25.5721 0.91203                  
Total    113  28.0387 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(bacterial_pcoa_df
 .merge(bacterial_pcoa_df_no_blanks, 
        on='sample_id', how='left', suffixes=('__With Blanks', '__Without Blanks'))
 .melt(id_vars='sample_id')
 .assign(sample_group=lambda dd: dd['variable'].str.split('__').str[1]) 
 .assign(variable=lambda dd: dd['variable'].str.split('__').str[0])
 .pivot(index=['sample_id', 'sample_group'], columns='variable', values='value')
 .reset_index()
 .dropna()
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], how='left', on='sample_id')
  .pipe(lambda dd: p9.ggplot(dd)
  + p9.aes('PCoA1', 'PCoA2', color='sample_duration')
  + p9.stat_ellipse(p9.aes(fill='sample_duration'), alpha=0.1, level=0.8, geom='polygon')
  + p9.geom_point(size=3, alpha=0.8, stroke=0)
  + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
  + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
  + p9.facet_wrap('sample_group')
  + p9.labs(title='',
           x='PCoA$_1$', y='PCoA$_2$', color='', fill='',
            caption=
        f'PERMANOVA p-value with blanks: {bacterial_permanova_results.iloc[0, 4]:.3f} ' +
        f'(R$^2$ = {bacterial_permanova_results.iloc[0, 2]:.1%}) | ' +
        f'without blanks: {bacterial_permanova_results_no_blanks.iloc[0, 4]:.3f} ' +
        f'(R$^2$ = {bacterial_permanova_results_no_blanks.iloc[0, 2]:.1%})',)
  + p9.ggtitle('PCoA of Bacterial Genus-Level Abundances')
  + p9.theme(
      legend_position='top',
       figure_size=(7, 4)
  )
 )
)

(bacterial_pcoa_df
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], on='sample_id', how='left')
 .pipe(lambda dd: p9.ggplot(dd)
  + p9.aes('PCoA1', 'PCoA2', color='sample_duration')
  + p9.stat_ellipse(p9.aes(fill='sample_duration'), alpha=0.1, level=0.8, geom='polygon')
  + p9.geom_point(size=3, alpha=0.8, stroke=0)
  + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
  + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
  + p9.labs(title='',
           x='PCoA$_1$', y='PCoA$_2$', color='', fill='')
  + p9.theme(
      legend_position='top',
  )
 )
)

Appendix (Birds)

Show Code
aves_notes = {
    "Neognathae": "Clade containing most modern birds except ratites and tinamous.",
    "Palaeognathae": "Clade including ratites (ostrich, emu) and tinamous.",
    "Limosa lapponica baueri": "Bar-tailed godwit subspecies, a long-distance migratory shorebird.",
    "Aves": "Class of all birds.",
    "Phasianidae": "Family of ground-dwelling birds including pheasants, chickens, and turkeys.",
    "Gallus gallus": "Red junglefowl, wild ancestor of domestic chickens.",
    "Calidris pugnax": "Ruff, a migratory wader species in the sandpiper family.",
    "Scolopacidae": "Family of wading birds like sandpipers and curlews.",
    "Galloanserae": "Clade grouping landfowl (Galliformes) and waterfowl (Anseriformes).",
    "Charadrius vociferus": "Killdeer, a plover common in open habitats in North America.",
    "Haliaeetus leucocephalus": "Bald eagle, large bird of prey from North America.",
    "Hirundo rustica rustica": "Nominate subspecies of barn swallow, widespread in Eurasia.",
    "Charadriiformes": "Order of shorebirds including plovers, gulls, and auks.",
    "Colinus virginianus": "Northern bobwhite, a quail native to North America.",
    "Pipra filicauda": "Wire-tailed manakin, a small South American bird.",
    "Galliformes": "Order of heavy-bodied ground-feeding birds (chickens, turkeys, quails).",
    "Columba livia": "Rock pigeon, common in urban areas worldwide.",
    "Coturnix japonica": "Japanese quail, a domesticated quail species.",
    "Dromaius novaehollandiae": "Emu, large flightless bird native to Australia.",
    "Phasianinae": "Subfamily of pheasants and related birds.",
    "Balearica regulorum gibbericeps": "Grey crowned crane subspecies from Africa.",
    "Chloebia gouldiae": "Gouldian finch, a colorful Australian passerine.",
    "Bambusicola thoracicus": "Chinese bamboo partridge, a ground bird from East Asia.",
    "Gruiformes": "Order including cranes, rails, and coots.",
    "Buceros rhinoceros silvestris": "Subspecies of rhinoceros hornbill from Southeast Asia.",
    "Oxyura jamaicensis": "Ruddy duck, a stiff-tailed duck from the Americas.",
    "Chaetura pelagica": "Chimney swift, a migratory bird from the Americas.",
    "Falco naumanni": "Lesser kestrel, a small migratory falcon.",
    "Callipepla squamata": "Scaled quail, found in southwestern US and Mexico.",
    "Zosterops borbonicus": "Réunion grey white-eye, endemic passerine to Réunion Island.",
    "Gavia stellata": "Red-throated loon, a migratory aquatic bird.",
    "Passeroidea": "Superfamily of passerine birds including sparrows and finches.",
    "Corvus": "Genus of crows, ravens, rooks, and jackdaws.",
    "Lagopus leucura": "White-tailed ptarmigan, a grouse living in alpine habitats.",
    "Passeriformes": "Order of perching birds, the largest bird order."
}

aves_unlikely_presence = [
    "Dromaius novaehollandiae", 
    "Chloebia gouldiae", 
    "Buceros rhinoceros silvestris", 
    "Haliaeetus leucocephalus",
    "Balearica regulorum gibbericeps",
    "Pipra filicauda",
]



(abundance_tax_long
 .query('level_27 == " Aves"')
 .assign(full_name=lambda dd: dd.apply(lambda row: '; '.join(row[tax_levels].dropna()), axis=1))
 .assign(last_name=lambda dd: dd.full_name.str.split(';').str[-1].str.strip())
 .groupby('last_name', as_index=False)
 .agg({'counts': 'sum'})
 .sort_values('counts', ascending=False)
 .assign(Frequency=lambda dd: dd.counts / dd.counts.sum())
 .assign(note=lambda dd: dd.last_name.map(aves_notes).fillna(''))
 .rename(columns={'last_name': 'Name (deepest tax. level)',
                  'counts': 'Reads'})
 
 .style
 .format({'Frequency': '{:.2%}',
          'Reads': '{:.0f}'})
 .hide(axis='index')
 # text align left
 .set_properties(**{'text-align': 'left', 'font-family': 'monospace',
                    'font-size': '9pt',
                    'width': '1200px'})
 .bar(subset=['Frequency'], color='#5198ca', align='mid')
 # color red for unlikely presence
 .map(lambda x: 'color: red' if x in aves_unlikely_presence else '',
       subset=['Name (deepest tax. level)']
 ))
Name (deepest tax. level)           Reads  Frequency  Note
Neognathae                         253914     40.23%  Clade containing most modern birds except ratites and tinamous.
Limosa lapponica baueri            113472     17.98%  Bar-tailed godwit subspecies, a long-distance migratory shorebird.
Aves                                59533      9.43%  Class of all birds.
Phasianidae                         39584      6.27%  Family of ground-dwelling birds including pheasants, chickens, and turkeys.
Gallus gallus                       33016      5.23%  Red junglefowl, wild ancestor of domestic chickens.
Calidris pugnax                     16638      2.64%  Ruff, a migratory wader species in the sandpiper family.
Scolopacidae                         9934      1.57%  Family of wading birds like sandpipers and curlews.
Galloanserae                         9279      1.47%  Clade grouping landfowl (Galliformes) and waterfowl (Anseriformes).
Charadrius vociferus                 9086      1.44%  Killdeer, a plover common in open habitats in North America.
Haliaeetus leucocephalus             8342      1.32%  Bald eagle, large bird of prey from North America.
Hirundo rustica rustica              7571      1.20%  Nominate subspecies of barn swallow, widespread in Eurasia.
Charadriiformes                      7351      1.16%  Order of shorebirds including plovers, gulls, and auks.
Colinus virginianus                  6431      1.02%  Northern bobwhite, a quail native to North America.
Pipra filicauda                      4862      0.77%  Wire-tailed manakin, a small South American bird.
Galliformes                          4575      0.72%  Order of heavy-bodied ground-feeding birds (chickens, turkeys, quails).
Columba livia                        4470      0.71%  Rock pigeon, common in urban areas worldwide.
Coturnix japonica                    3955      0.63%  Japanese quail, a domesticated quail species.
Dromaius novaehollandiae             3812      0.60%  Emu, large flightless bird native to Australia.
Phasianinae                          3453      0.55%  Subfamily of pheasants and related birds.
Balearica regulorum gibbericeps      3222      0.51%  Grey crowned crane subspecies from Africa.
Chloebia gouldiae                    3059      0.48%  Gouldian finch, a colorful Australian passerine.
Bambusicola thoracicus               2705      0.43%  Chinese bamboo partridge, a ground bird from East Asia.
Gruiformes                           2646      0.42%  Order including cranes, rails, and coots.
Buceros rhinoceros silvestris        2409      0.38%  Subspecies of rhinoceros hornbill from Southeast Asia.
Oxyura jamaicensis                   2274      0.36%  Ruddy duck, a stiff-tailed duck from the Americas.
Chaetura pelagica                    2236      0.35%  Chimney swift, a migratory bird from the Americas.
Falco naumanni                       2210      0.35%  Lesser kestrel, a small migratory falcon.
Callipepla squamata                  2168      0.34%  Scaled quail, found in southwestern US and Mexico.
Zosterops borbonicus                 1963      0.31%  Réunion grey white-eye, endemic passerine to Réunion Island.
Gavia stellata                       1959      0.31%  Red-throated loon, a migratory aquatic bird.
Passeroidea                          1683      0.27%  Superfamily of passerine birds including sparrows and finches.
Corvus                               1469      0.23%  Genus of crows, ravens, rooks, and jackdaws.
Lagopus leucura                       942      0.15%  White-tailed ptarmigan, a grouse living in alpine habitats.
Palaeognathae                         663      0.11%  Clade including ratites (ostrich, emu) and tinamous.
Passeriformes                         297      0.05%  Order of perching birds, the largest bird order.
migratory_species = (pd.DataFrame(aves_notes.items(), columns=['Name', 'Note'])
                     .query('Note.str.contains("migr")')
                     ['Name']
)
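With the notes above, this picks out Limosa lapponica baueri, Calidris pugnax, Chaetura pelagica, Falco naumanni and Gavia stellata as the taxa flagged as migratory.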
abundance_tax_long.query('sample_type=="Sample"').sample_id.nunique()
114
(abundance_tax_long
 .query('level_27 == " Aves"')
 .groupby(['sample_id', 'sample_type'])
 .agg({'counts': 'sum'})
 .reset_index()
 .groupby('sample_type')
 .agg(total_reads=('counts', 'sum'),
      avg_reads=('counts', 'mean'),
      num_samples=('counts', lambda x: (x > 0).sum()),
)
)
sample_type   total_reads    avg_reads  num_samples
Blank                 0.0     0.000000            0
Sample           631183.0  5536.692982           55
Show Code
most_common_aves = (abundance_tax_long
 .query('level_27 == " Aves"')
 .assign(full_name=lambda dd: dd.apply(lambda row: '; '.join(row[tax_levels].dropna()), axis=1))
 .assign(last_name=lambda dd: dd.full_name.str.split(';').str[-1].str.strip())
 .query('counts > 0')
 .groupby('last_name', as_index=False)
 .size()
 .sort_values('size', ascending=False)
 ['last_name']
)

(abundance_tax_long
 .query('level_27 == " Aves"')
 .assign(full_name=lambda dd: dd.apply(lambda row: '; '.join(row[tax_levels].dropna()), axis=1))
 .assign(last_name=lambda dd: dd.full_name.str.split(';').str[-1].str.strip())
 .merge(kumamoto_metadata_no_blanks[['sample_id', 'date_start', 'date_end', 'sample_duration']],)
 .assign(date=lambda x: np.where(x.sample_duration == "Weekly", x.date_start + pd.Timedelta(days=3), x.date_start))
 .groupby('last_name', as_index=False).apply(lambda dd: dd.assign(counts=dd.counts / dd.counts.max()))
 .assign(last_name=lambda dd: pd.Categorical(dd.last_name, categories=most_common_aves, ordered=True))
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.facet_wrap('~last_name', ncol=5)
       + p9.scale_x_datetime(date_breaks='3 months', date_labels='%b %Y', expand=(0, 0))
       + p9.labs(x='', y="Read counts (% of each taxon's maximum)")
       + p9.scale_y_continuous(expand=(0, 0), labels=lambda x: [f"{v:.0%}" for v in x])
       + p9.geom_rect(p9.aes(xmin='date_start', xmax='date_end', ymin=0, ymax='counts'))
       + p9.theme_bw()
       + p9.theme(
           figure_size=(16 * 0.75, 9 * 0.75),
           dpi=300,
           panel_grid_major_y=p9.element_blank(),
           panel_grid_major_x=p9.element_line(linetype='dashed', color='lightgrey'),
           strip_text=p9.element_text(size=9, family='Georgia', margin={'b': 2, 't': 2})
       ))
       )