Pacbio metagenomics of air samples from Kumamoto

Testing out the new Pacbio metagenomics workflow for air samples

Published

December 15, 2025

Introduction

This notebook presents a comprehensive analysis of PacBio metagenomic sequencing data from air samples collected in Kumamoto, Japan. The study evaluates the performance of the PacBio META pipeline developed by AIRLAB for environmental DNA (eDNA) analysis in atmospheric samples.

Study Design and Objectives

We analyzed air samples collected using different sampling strategies:

  • Blank samples: Control samples to assess contamination
  • Daily samples: 24-hour collection periods
  • Weekly samples: 7-day collection periods

The primary objectives are to:

  1. Evaluate sequencing quality and taxonomic assignment success across sample types
  2. Compare biodiversity patterns between different sampling durations
  3. Assess the pipeline’s ability to capture environmental DNA from major taxonomic groups (Bacteria, Fungi, Metazoa, Viruses)
  4. Investigate temporal patterns in microbial community composition

The analysis is structured into several key sections, and employs both Python and R environments, utilizing established ecological analysis packages (vegan, phyloseq) alongside modern data science tools (pandas, plotnine) to ensure reproducible and statistically robust results.

Key Findings Preview

Initial results suggest significant differences in community composition between sampling strategies, with temporal factors (season, month) showing stronger effects than sampling duration. The pipeline demonstrates high taxonomic assignment success rates, with distinct patterns observed across major taxonomic groups.

Preamble

In this section we load the necessary libraries and set the global parameters for the notebook. It is not relevant to the narrative of the analysis, but it is essential for the reproducibility of the results. Since we are using both Python and R, the setup is a bit more involved than usual: we rely on the rpy2 package to run R code within the Python environment, and we have to make sure that data is passed correctly between the two languages.
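As a minimal sketch of that round trip (the -i/-o flags of the %%R cell magic are what move objects across; df_py and df_r are hypothetical names, not used elsewhere in this notebook):

import pandas as pd
%load_ext rpy2.ipython

df_py = pd.DataFrame({'x': [1.0, 2.0, 3.0]})  # hypothetical input frame

%%R -i df_py -o df_r
# df_py arrives as an R data.frame; objects listed with -o are copied back to Python
df_r <- transform(df_py, y = x * 2)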

Code imports
import os
import warnings
warnings.filterwarnings('ignore', category=UserWarning, module='rpy2')

# Clean environment variables that cause issues with R
# Remove bash functions that R can't parse properly
env_vars_to_remove = [
    'BASH_FUNC_ml%%', 
    'BASH_FUNC_module%%', 
    'BASH_FUNC_which%%'
]

for var in env_vars_to_remove:
    if var in os.environ:
        del os.environ[var]

# only on macOS, and only if R_HOME is not already set (otherwise rpy2 fails to start)
if os.name == 'posix' and os.uname().sysname == 'Darwin' and 'R_HOME' not in os.environ:
    os.environ["R_HOME"] = "/Library/Frameworks/R.framework/Resources"
%load_ext rpy2.ipython
%%R

# Load the packages
library(dplyr)
library(vegan)
library(ggplot2)
library(phyloseq)
library(magrittr)
import re
import numpy as np
import pandas as pd
import plotnine as p9
import geopandas as gpd

from glob import glob
from Bio import Entrez
from ete3 import NCBITaxa
from collections import defaultdict
from itables import init_notebook_mode, show
from plotnine.composition import plot_spacer
from mizani.labels import label_percent, label_scientific
Presets
# Matplotlib settings
import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats
plt.rcParams['font.family'] = 'Georgia'
plt.rcParams['svg.fonttype'] = 'none'
set_matplotlib_formats('retina')
plt.rcParams['figure.dpi'] = 300

# Plotnine settings (for figures)
p9.options.set_option('base_family', 'Georgia')

p9.theme_set(
     p9.theme_classic()

    + p9.theme(panel_grid=p9.element_blank(),
               legend_background=p9.element_blank(),
               panel_grid_major=p9.element_line(size=.5, linetype='dashed',
                                                alpha=.15, color='black'),
               plot_title=p9.element_text(ha='center'),
               dpi=300
    )
)

Entrez.email = "alejandro.fontal@isglobal.org"

Introduction

We have been working on the taxonomic analysis of air samples collected in Kumamoto, Japan, which were sequenced with PacBio and processed using the PacBio META pipeline developed in the AIRLAB. The goal of this notebook is to evaluate the results of the pipeline, to compare the biodiversity of the samples across Bacteria, Fungi and Metazoa, and to assess the ability of the pipeline to properly capture the eDNA present in the environment.

Analysis

Sequencing stats

We’ll load the metadata from our filter database and the sequencing metrics from the PacBio pipeline, then merge them to create a single dataframe with all the information needed to examine and compare the results from the blanks, the daily and the weekly samples.

pacbio_metrics = (pd.read_csv('../data/pacbio_samples_metrics.csv', sep='\t')
    .rename(columns={
        'Names': 'sample_id',
        'DNA (ng)': 'dna_yield',
        'Number_of_reads': 'n_reads',
        'Total_bases': 'total_bases',
        'Median_read_length': 'median_read_length',
        'Median_read_quality': 'median_read_quality'
    })
    [['sample_id', 'dna_yield', 'n_reads', 'total_bases', 'median_read_length', 'median_read_quality']]
    .assign(header=lambda dd: np.where(dd.sample_id.str.endswith('.1'), 'PM10', 'PM2.5'))
    .assign(sample_id=lambda dd:
            np.where(dd.sample_id.str.startswith('T'), dd['sample_id'],
                     dd['sample_id'].str[:2] + dd['sample_id'].str[2:].str.zfill(3)))
    .assign(sample_type=lambda dd:
            np.where(dd.sample_id.str.contains('B', regex=False), 'Blank', 'Sample'))
    .assign(n_reads=lambda dd: dd.n_reads.str.replace(',', '').astype(int))
    .assign(total_bases=lambda dd: dd.total_bases.str.replace(',', '').astype(int))
    .assign(median_read_length=lambda dd: dd.median_read_length.str.replace(',', '').astype(int))
    .assign(location=lambda dd: np.where(dd.sample_id.str.startswith('K'), 'Kumamoto', 'Toyama'))
)
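As a quick illustration of the sample ID normalisation above (hypothetical ID; Kumamoto-style IDs get their numeric part zero-padded to three digits, while Toyama IDs starting with 'T' pass through unchanged):

# hypothetical example of the zero-padding rule
sid = 'KS12'
print(sid[:2] + sid[2:].zfill(3))  # -> 'KS012'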

kumamoto_stats = pacbio_metrics.query('location == "Kumamoto"')
toyama_stats = (pacbio_metrics
    .query('location == "Toyama"')               
    .assign(sample_n=lambda dd: dd.sample_id.str.split('.').str[0].str[2:].str.zfill(2))
    .query('sample_n.str.len() == 2')
    .assign(sample_pre=lambda dd: dd.sample_id.str[:2].astype(str))
    .assign(sample_i=lambda dd: dd.sample_id.str.split('.').str[1].fillna(''))
    .assign(sample_i=lambda dd: np.where(dd['sample_i'].str.len() == 0, '', '.' + dd['sample_i']))
    .eval('sample_id = sample_pre + sample_n + sample_i')
    .sort_values('sample_id')
    .reset_index(drop=True)
)
toyama_metadata = pd.read_csv('../data/toyama_filters_clean.csv', 
                              parse_dates=['date_start', 'date_end'])

toyama_comps_df = (toyama_stats
 [['sample_id', 'dna_yield', 'total_bases', 'median_read_length', 'median_read_quality']]
 .assign(total_bases=lambda dd: dd.total_bases / 1e6)  # Convert to millions of bases
 .merge(toyama_metadata, how='left', on='sample_id')
)
kumamoto_metadata = (pd.read_csv('../data/kumamoto_filters.csv', 
                                 usecols=[1, 3, 4, 5, 6, 7, 8, 9])
    .rename(columns={
        'Sample ID': 'sample_id',
        'Initial ': 'date_start',
        'Final ': 'date_end',
        'Filters': 'sample_type',
        'Duration (h)': 'duration',
        'Sampled air volume (m3)': 'air_volume',
        'Weather': 'weather',
        'Remarks': 'remarks'
    })
    .assign(sample_id=lambda dd: dd['sample_id'].str[:2] + dd['sample_id'].str[2:].str.zfill(3))
    .query('sample_id.notna()')
    .assign(date_start=lambda dd: dd.date_start.ffill())
    .assign(date_end=lambda dd: dd.date_end.ffill())
    .assign(date_start=lambda dd: pd.to_datetime(dd['date_start'], format='%d/%m/%Y'))
    .assign(date_end=lambda dd: pd.to_datetime(dd['date_end'], format='%d/%m/%Y'))
    .assign(air_volume=lambda dd: 
            dd['air_volume'].astype(str).str.replace(',', '').astype(float).fillna(0))
    .assign(sample_duration=lambda dd: np.where(dd['sample_type'] == 'BLANK', 'Blanks',
                                                np.where(dd['duration'] < 24, 'Daily', 'Weekly')))
    .drop(columns=['duration', 'weather', 'remarks'])

)

kumamoto_comps_df = (kumamoto_stats
 [['sample_id', 'dna_yield', 'total_bases', 'median_read_length', 'median_read_quality']]
 .assign(total_bases=lambda dd: dd.total_bases / 1e6)  # Convert to millions of bases
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], how='left', on='sample_id')
)

If we compare the results from the blanks, the daily and the weekly samples, we observe the following:

  1. The blanks have a lower DNA yield (as expected), which also leads to a lower total base count. Their median read length and quality, however, are comparable to those of the samples, with the blanks actually showing the highest median read length. This can be explained by the fact that the blank DNA is less degraded than that of the samples, which were exposed to the environment for longer periods of time.

  2. The daily samples have a lower DNA yield than the weekly samples, which is expected since they collect roughly one seventh of the air volume (24 hours vs. 7 days). However, this doesn’t appear to affect the total base count, most likely because sequencing was performed to saturation for all samples.

a = (kumamoto_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='dna_yield')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.scale_y_log10()
  + p9.ggtitle('DNA yield (ng)')
  + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
))

b = (kumamoto_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='total_bases')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Total bases (Mbp)')
  + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
))

c = (kumamoto_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='median_read_quality')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Median read quality (Phred score)')
))

d = (kumamoto_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='median_read_length')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'),
                   alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Median read length (bp)')
))

((a | b) / 
 (c | d) 
 & p9.guides(fill=False, color=False)
 & p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
 & p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
 & p9.labs(x='', y='')
 & p9.theme(
     figure_size=(6, 4),
     plot_title=p9.element_text(size=10, ha='center'),
     plot_margin=.015
 )
)

The observed differences are statistically significant: a Kruskal–Wallis test shows differences across all four metrics (p-values < 0.001). The post-hoc Dunn’s test shows that the differences are significant between all groups, with two exceptions: median read length between the blanks and the daily samples, and median read quality between the blanks and the weekly samples:

Statistical Analysis of Sequencing Metrics

%%R -i kumamoto_comps_df -o pairwise_df

kumamoto_comps_df$sample_duration <- as.factor(kumamoto_comps_df$sample_duration)
numeric_vars <- c('dna_yield', 'total_bases', 'median_read_length', 'median_read_quality')

kruskal_results <- lapply(numeric_vars, function(var) {
      test <- kruskal.test(kumamoto_comps_df[[var]] ~ kumamoto_comps_df$sample_duration)
      data.frame(
        variable = var,
        statistic = test$statistic,
        p_value = test$p.value)
})


significant_vars <- numeric_vars[sapply(kruskal_results, function(x) x$p_value < 0.05)]

pairwise_df <- lapply(significant_vars, function(var) {
  # Run Dunn's test
  dunn_res <- FSA::dunnTest(kumamoto_comps_df[[var]] ~ kumamoto_comps_df$sample_duration,
                            method = "bh")
  
  # Extract the "res" table and add the variable name
  df <- dunn_res$res %>% dplyr::mutate(variable = var)
  return(df)
}) %>% dplyr::bind_rows()
(pairwise_df
 .replace({'variable': {
     'dna_yield': 'DNA Yield',
     'total_bases': 'Total Bases',
     'median_read_length': 'Median Read Length',
     'median_read_quality': 'Median Read Quality'}})
 [['variable', 'Comparison', 'Z', 'P.unadj', 'P.adj']]
 .style
 .hide(axis='index')
 .format(precision=4)
 # color P.adj < .05 rows in red
 .apply(lambda x: ['color: red' if v < 0.05 else '' for v in x], 
       subset=['P.adj'])
)
variable             Comparison        Z        P.unadj  P.adj
DNA Yield            Blanks - Daily   -2.5939   0.0095   0.0095
DNA Yield            Blanks - Weekly  -6.6136   0.0000   0.0000
DNA Yield            Daily - Weekly   -7.1209   0.0000   0.0000
Total Bases          Blanks - Daily   -4.6865   0.0000   0.0000
Total Bases          Blanks - Weekly  -2.2314   0.0257   0.0257
Total Bases          Daily - Weekly    3.0662   0.0022   0.0033
Median Read Length   Blanks - Daily    1.8021   0.0715   0.0715
Median Read Length   Blanks - Weekly   3.4533   0.0006   0.0017
Median Read Length   Daily - Weekly    3.0759   0.0021   0.0031
Median Read Quality  Blanks - Daily   -2.6883   0.0072   0.0215
Median Read Quality  Blanks - Weekly  -0.7613   0.4465   0.4465
Median Read Quality  Daily - Weekly    2.6092   0.0091   0.0136

If we run the same comparisons for the Toyama samples, however, the picture changes: the daily and weekly samples no longer differ significantly in DNA yield or total bases, and only the blanks stand apart:

%%R -i toyama_comps_df -o pairwise_toyama_df -o kruskal_results_header

toyama_comps_df$sample_duration <- as.factor(toyama_comps_df$sample_duration)
toyama_comps_df$sample_header <- as.factor(toyama_comps_df$sample_header)
numeric_vars <- c('dna_yield', 'total_bases', 'median_read_length', 'median_read_quality')

kruskal_results <- lapply(numeric_vars, function(var) {
      test <- kruskal.test(toyama_comps_df[[var]] ~ toyama_comps_df$sample_duration)
      data.frame(
        variable = var,
        statistic = test$statistic,
        p_value = test$p.value)
})


significant_vars <- numeric_vars[sapply(kruskal_results, function(x) x$p_value < 0.05)]

toyama_comps_samples <- toyama_comps_df %>%
  dplyr::filter(sample_type != 'Blank')

kruskal_results_header <- lapply(numeric_vars, function(var) {
      test <- kruskal.test(toyama_comps_samples[[var]] ~ toyama_comps_samples$sample_header)
      data.frame(
        variable = var,
        statistic = test$statistic,
        p_value = test$p.value)
})

pairwise_toyama_df <- lapply(significant_vars, function(var) {
  # Run Dunn's test
  dunn_res <- FSA::dunnTest(toyama_comps_df[[var]] ~ toyama_comps_df$sample_duration,
                            method = "bh")
  
  # Extract the "res" table and add the variable name
  df <- dunn_res$res %>% dplyr::mutate(variable = var)
  return(df)
}) %>% dplyr::bind_rows()
(pairwise_toyama_df
 .replace({'variable': {
     'dna_yield': 'DNA Yield',
     'total_bases': 'Total Bases',
     'median_read_length': 'Median Read Length',
     'median_read_quality': 'Median Read Quality'}})
 [['variable', 'Comparison', 'Z', 'P.unadj', 'P.adj']]
 .style
 .hide(axis='index')
 .format(precision=4)
 # color P.adj < .05 rows in red
 .apply(lambda x: ['color: red' if v < 0.05 else '' for v in x], 
       subset=['P.adj'])
)
variable             Comparison       Z        P.unadj  P.adj
DNA Yield            Blank - Daily   -3.5145   0.0004   0.0013
DNA Yield            Blank - Weekly  -3.4864   0.0005   0.0007
DNA Yield            Daily - Weekly   0.6171   0.5372   0.5372
Total Bases          Blank - Daily   -3.3732   0.0007   0.0022
Total Bases          Blank - Weekly  -2.9686   0.0030   0.0045
Total Bases          Daily - Weekly   1.1722   0.2411   0.2411
Median Read Length   Blank - Daily   -2.7546   0.0059   0.0176
Median Read Length   Blank - Weekly  -1.9343   0.0531   0.0796
Median Read Length   Daily - Weekly   1.7097   0.0873   0.0873
Median Read Quality  Blank - Daily   -0.2594   0.7953   0.7953
Median Read Quality  Blank - Weekly   1.4831   0.1380   0.2071
Median Read Quality  Daily - Weekly   2.7184   0.0066   0.0197
a = (toyama_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='dna_yield')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.scale_y_log10()
  + p9.ggtitle('DNA yield (ng)')
  + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
))

b = (toyama_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='total_bases')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Total bases (Mbp)')
  + p9.theme(axis_text_x=p9.element_blank(), axis_ticks_major_x=p9.element_blank())
))

c = (toyama_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='median_read_quality')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'), alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Median read quality (Phred score)')
))

d = (toyama_comps_df
.pipe(lambda dd: p9.ggplot(dd)
  + p9.aes(x='sample_duration', y='median_read_length')
  + p9.geom_boxplot(p9.aes(fill='sample_duration'), outlier_alpha=0, alpha=0.5)
  + p9.geom_jitter(p9.aes(color='sample_duration'),
                   alpha=0.5, width=0.2, height=0, stroke=0, size=2)
  + p9.ggtitle('Median read length (bp)')
))

((a | b) / 
 (c | d) 
 & p9.guides(fill=False, color=False)
 & p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
 & p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
 & p9.labs(x='', y='')
 & p9.theme(
     figure_size=(6, 4),
     plot_title=p9.element_text(size=10, ha='center'),
     plot_margin=.015
 )
)

Taxonomy

We’ll begin by loading the taxonomic assignments generated by Diamond and the results from the PacBio META pipeline, then merge them into a single dataframe containing all the required information. We will then compare the results from the blanks, the daily and the weekly samples to test for differences in biodiversity between the groups.

The pipeline’s output is a fairly complex .txt file that requires parsing and cleaning to extract the needed information. For now, we will load all the different taxonomic assignments (including Bacteria, Fungi, Metazoa and Viruses) and then merge them into a single dataframe.
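For reference, each line of these files pairs a semicolon-separated taxonomic path with a read count, roughly like the following (hypothetical excerpt, inferred from the parsing code below):

# hypothetical MEGAN-style line: a semicolon-separated path, a tab, and a count
line = "NCBI;cellular organisms;Bacteria;Pseudomonadota;\t42.0"
taxonomy, count = line.split('\t')
levels = [lvl.strip() for lvl in taxonomy.rstrip(';').split(';')]
print(levels, float(count))  # -> ['NCBI', 'cellular organisms', 'Bacteria', 'Pseudomonadota'] 42.0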

Function to process MEGAN .txt files
def process_megan_txt(file_path, sample_id=None):
    # Each line pairs a semicolon-separated taxonomy path with a read count
    df = pd.read_csv(
        file_path,
        sep='\t',
        header=None,
        names=['taxonomy', 'count'],
        engine='python'
    )
    df['taxonomy'] = df['taxonomy'].str.strip().str.rstrip(';')
    df['count'] = df['count'].astype(float)
    # Split the taxonomy path into one column per level
    tax_split = df['taxonomy'].str.split(';', expand=True).apply(lambda col: col.str.strip())
    tax_split.columns = [f"level_{i+1}" for i in range(tax_split.shape[1])]
    df_expanded = pd.concat([tax_split, df['count']], axis=1)
    # Keep only the first 11 levels and sum counts of paths that collapse together
    cols_to_keep = [f"level_{i+1}" for i in range(11)] + ['count']
    df_trimmed = df_expanded[cols_to_keep]
    df_grouped = (
        df_trimmed
        .groupby([f"level_{i+1}" for i in range(11)], dropna=False, as_index=False)
        .agg({'count': 'sum'})
    )
    # Create a unique taxon string for each row
    df_grouped['taxon'] = df_grouped[[f'level_{i+1}' for i in range(11)]].fillna('').agg(';'.join, axis=1)
    df_grouped = df_grouped[['taxon', 'count']]
    if sample_id is not None:
        df_grouped = df_grouped.set_index('taxon').rename(columns={'count': sample_id})
    else:
        df_grouped = df_grouped.set_index('taxon')
    return df_grouped
blank_files = glob("../data/kumamoto/blanks/*.txt")
sample_files = glob("../data/kumamoto/samples/*.txt")
all_files = blank_files + sample_files

feature_tables = []
sample_ids = []

for file_path in all_files:
    sample_id = os.path.splitext(os.path.basename(file_path))[0]
    df_grouped = process_megan_txt(file_path, sample_id=sample_id)
    feature_tables.append(df_grouped)
    sample_ids.append(sample_id)

# we combine into a single DataFrame (taxa as rows, samples as columns)
big_table_KM = pd.concat(feature_tables, axis=1).fillna(0).astype(float)
big_table_KM = big_table_KM.T  # transposed so samples are rows, taxa are columns

big_table_KM.index.name = 'sample_id'
sample_ids = [s.split('_')[0] for s in big_table_KM.index]
big_table_KM.index = sample_ids
blank_files_TY = glob("../data/toyama/TB*.txt")
sample_files_TY = glob("../data/toyama/TS*.txt")
all_files_TY = blank_files_TY + sample_files_TY

feature_tables_TY = []
sample_ids_TY = []

for file_path in all_files_TY:
    sample_id = os.path.splitext(os.path.basename(file_path))[0].split('_')[0]
    df_grouped = process_megan_txt(file_path, sample_id=sample_id)
    feature_tables_TY.append(df_grouped)
    sample_ids_TY.append(sample_id)

# let's now combine the Toyama data
big_table_TY = pd.concat(feature_tables_TY, axis=1).fillna(0).astype(float)
big_table_TY = big_table_TY.T
big_table_TY.index.name = 'sample_id'

Aggregate alpha-diversity metrics

Since the vegan package is the de facto standard for ecological data analysis in metagenomics, we will use it to perform some basic analyses on our feature table, including the calculation of alpha and beta diversity metrics.
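For reference, with $p_i$ the relative abundance of taxon $i$ among the $S$ taxa observed in a sample, the three alpha diversity metrics computed below are (note that vegan’s "simpson" is the Gini–Simpson index, i.e. $1 - D$):

$$\text{richness} = S, \qquad H' = -\sum_{i=1}^{S} p_i \ln p_i, \qquad 1 - D = 1 - \sum_{i=1}^{S} p_i^{2}$$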

%%R -i big_table_KM -i big_table_TY -o diversity_df


richness_KM <- vegan::specnumber(big_table_KM)
shannon_KM <- vegan::diversity(big_table_KM, index = "shannon")
simpson_KM <- vegan::diversity(big_table_KM, index = "simpson")

richness_TY <- vegan::specnumber(big_table_TY)
shannon_TY <- vegan::diversity(big_table_TY, index = "shannon")
simpson_TY <- vegan::diversity(big_table_TY, index = "simpson")

diversity_KM <- data.frame(
    sample_id = rownames(big_table_KM),
    sample_type = ifelse(grepl('B', rownames(big_table_KM)), 'Blanks', 'Samples'),
    richness = richness_KM,
    shannon = shannon_KM,
    simpson = simpson_KM
)

diversity_TY <- data.frame(
    sample_id = rownames(big_table_TY),
    sample_type = ifelse(grepl('B', rownames(big_table_TY)), 'Blanks', 'Samples'),
    richness = richness_TY,
    shannon = shannon_TY,
    simpson = simpson_TY
)

diversity_df <- rbind(diversity_KM, diversity_TY)
diversity_df$sample_type = factor(diversity_df$sample_type, levels=c('Samples', 'Blanks'))

To start, we will compare the alpha diversity metrics between blanks and samples, treating each assigned taxon path at every taxonomic level as a unique feature (so, for instance, 'NCBI;cellular organisms;Bacteria' and 'NCBI;cellular organisms;Bacteria;Pseudomonadota' count as two distinct features). If we do so:

%%R -o diversity_tests_df 

# We run a wilcoxon test for each metric as read counts are not normally distributed
richness_test_KM <- wilcox.test(richness ~ sample_type, data = diversity_KM)
shannon_test_KM <- wilcox.test(shannon ~ sample_type, data = diversity_KM)
simpson_test_KM <- wilcox.test(simpson ~ sample_type, data = diversity_KM)
richness_test_TY <- wilcox.test(richness ~ sample_type, data = diversity_TY)
shannon_test_TY <- wilcox.test(shannon ~ sample_type, data = diversity_TY)
simpson_test_TY <- wilcox.test(simpson ~ sample_type, data = diversity_TY)

diversity_tests_df <- data.frame(
  metric    = c("Richness", "Shannon", "Simpson", "Richness", "Shannon", "Simpson"),
  statistic = c(richness_test_KM$statistic, shannon_test_KM$statistic, simpson_test_KM$statistic,
                richness_test_TY$statistic, shannon_test_TY$statistic, simpson_test_TY$statistic),
  p_value   = c(richness_test_KM$p.value, shannon_test_KM$p.value, simpson_test_KM$p.value,
                richness_test_TY$p.value, shannon_test_TY$p.value, simpson_test_TY$p.value),
  location  = c(rep("Kumamoto", 3), rep("Toyama", 3))
)

The results differ by location: in Kumamoto, the samples are, as expected, more diverse than the blanks, with a significant difference in all three metrics (richness, Shannon, and Simpson). In Toyama, however, none of the metrics shows a significant difference between blanks and samples:

diversity_melt = (diversity_df
.melt(
    id_vars=['sample_id', 'sample_type'],
    value_vars=['richness', 'shannon', 'simpson'],
    var_name='metric',
    value_name='value'
)
.assign(metric=lambda dd: dd['metric'].str.capitalize())
.assign(location=lambda dd: np.where(dd['sample_id'].str.startswith('K'), 'Kumamoto', 'Toyama'))
.merge(diversity_tests_df, how='left', on=['metric', 'location'])
)

(p9.ggplot(diversity_melt, p9.aes(x='sample_type', y='value', fill='sample_type'))
    + p9.geom_boxplot(outlier_alpha=0, alpha=0.5, width=.4)
    + p9.geom_jitter(width=0.2, alpha=0.7, size=2, stroke=0)
    + p9.facet_grid('metric', 'location', scales='free_y')  # metric is already capitalized above
    + p9.geom_text(
        p9.aes(x=1.5, y=0,
               label=f'"$p=$" + p_value.round(4).astype(str) + sig'),
        data=lambda dd: dd
            .groupby(['metric', 'location'], as_index=False)
            .agg({'value': 'max', 'p_value': 'first'})
            .assign(sig=lambda dd: np.where(dd['p_value'] < 0.05, '*', '')), 
        ha='center',
        va='bottom',
        size=8,
        inherit_aes=False

    )
    + p9.scale_fill_manual(values=["#c16514",  "#5198ca"])
    + p9.labs(x='', y='', title='',
              caption='*$p$-values from Wilcoxon rank sum tests')
    + p9.theme(
        figure_size=(6.5, 5),
        # strip_background=p9.element_blank(),
        panel_spacing=.02,
        legend_position='none'
    )
)

Taxonomy pipeline stats

Long abundance table code
# NOTE: this redefinition keeps the full taxonomy path string as the feature
# (no trimming to 11 levels and no grouping, unlike the earlier version)
def process_megan_txt(file_path, sample_id=None):
    df = pd.read_csv(
        file_path,
        sep='\t',
        header=None,
        names=['taxonomy', 'count'],
        engine='python'
    )
    df['taxonomy'] = df['taxonomy'].str.strip().str.rstrip(';')
    df['count'] = df['count'].astype(float)
    if sample_id is not None:
        df = df.set_index('taxonomy').rename(columns={'count': sample_id})
    else:
        df = df.set_index('taxonomy')
    return df

# Gather all sample files
blank_files = glob("../data/kumamoto/blanks/*.txt")
sample_files = glob("../data/kumamoto/samples/*.txt")
all_files = blank_files + sample_files

# Build the abundance table
feature_tables = []
sample_ids = []

for file_path in all_files:
    sample_id = os.path.splitext(os.path.basename(file_path))[0]
    df_grouped = process_megan_txt(file_path, sample_id=sample_id)
    feature_tables.append(df_grouped)
    sample_ids.append(sample_id)

# Merge all samples (outer join, fill missing with 0)
abundance_table = pd.concat(feature_tables, axis=1).fillna(0).T
abundance_table.index.name = "sample_id"
abundance_table.index = [s.split('_')[0] for s in abundance_table.index]

tax_split = abundance_table.columns.str.split(';', expand=True)
tax_levels = [f'level_{i+1}' for i in range(tax_split.to_frame().shape[1])]

taxonomy_names_df = (tax_split
 .to_frame()
 .reset_index(drop=True)
 .rename(columns=dict((i, f'level_{i+1}') for i in range(tax_split.to_frame().shape[1])))
 .map(lambda x: x.lstrip() if isinstance(x, str) else x)
)

def get_complete_taxonomy_lineage(row, ncbi):
    """
    Extract the most specific taxon from a row and get the COMPLETE lineage
    preserving all NCBI taxonomic ranks, not just the standard 7.
    """
    
    # Get all non-null values from the row (excluding non-taxonomic columns)
    taxa = [val for val in row if pd.notna(val) and val != '' and val != 'NCBI']
    
    if not taxa:
        return {}
    
    # Start from most specific (rightmost) taxon
    for taxon in reversed(taxa):
        try:
            # Get taxon ID
            name_dict = ncbi.get_name_translator([taxon])
            if taxon in name_dict:
                taxid = name_dict[taxon][0]
                
                # Get COMPLETE lineage (all taxonomic IDs)
                lineage = ncbi.get_lineage(taxid)
                
                # Get ranks and names for all lineage members
                ranks = ncbi.get_rank(lineage)
                names = ncbi.get_taxid_translator(lineage)
                
                # Create complete taxonomy dictionary with ALL ranks
                result = {}
                
                # Add all ranks found in the lineage
                for tid in lineage:
                    rank = ranks[tid]
                    name = names[tid]
                    
                    # Skip 'no rank' entries but keep everything else
                    if rank != 'no rank' and rank != '':
                        result[rank] = name
                
                return result
                
        except Exception as e:
            print(f"Error processing {taxon}: {e}")
            continue
    
    return {}

def standardize_taxonomy_dataframe(df, ncbi=None):
    """
    Process a taxonomy dataframe and return a standardized version with
    all available taxonomic ranks preserved.
    """
    
    if ncbi is None:
        print("Initializing NCBI taxonomy database...")
        ncbi = NCBITaxa()
    
    
    # Apply the taxonomy extraction to each row
    taxonomy_results = []
    
    for _, row in df.iterrows():        
        result = get_complete_taxonomy_lineage(row, ncbi)
        taxonomy_results.append(result)
    
    # Convert to DataFrame
    df_standardized = pd.DataFrame(taxonomy_results)
    
    # Reorder columns by typical taxonomic hierarchy if they exist
    preferred_order = [
        'acellular root', 'cellular root', 'domain', 'realm', 'clade',
        'superkingdom', 'kingdom', 'subkingdom',
        'superphylum', 'phylum', 'subphylum', 'infraphylum',
        'superclass', 'class', 'subclass', 'infraclass',
        'cohort', 'subcohort', 'superorder', 'order', 'suborder', 'infraorder',
        'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe',
        'genus', 'subgenus', 'species group', 'species subgroup', 
        'species', 'subspecies', 'strain', 'varietas', 'forma'
    ]
    
    # Reorder columns based on preferred hierarchy
    existing_cols = [col for col in preferred_order if col in df_standardized.columns]
    other_cols = [col for col in df_standardized.columns if col not in preferred_order]
    new_column_order = existing_cols + other_cols
    
    df_standardized = df_standardized[new_column_order]
    
    
    return df_standardized

def get_taxon_ids_for_best_taxonomy(df_standardized, ncbi=None):
    """
    Get NCBI taxon IDs for the most specific taxonomic rank available in each row.
    
    Parameters:
    df_standardized: DataFrame with one column per standardized taxonomic rank
    ncbi: NCBITaxa object (optional, will create if not provided)
    
    Returns:
    DataFrame with added best_taxonomy_taxid/name/level and taxid_status columns
    """
    
    if ncbi is None:
        print("Initializing NCBI taxonomy database...")
        ncbi = NCBITaxa()
    
    df_result = df_standardized.copy()
    
    # Add columns for taxon IDs
    df_result['best_taxonomy_taxid'] = np.nan
    df_result['best_taxonomy_name'] = ''
    df_result['best_taxonomy_level'] = ''
    df_result['taxid_status'] = ''  # Track success/failure
    
    print("Getting taxon IDs for most specific taxonomic ranks...")
    
    # Define hierarchy from most specific to least specific
    hierarchy = [
        'subspecies', 'strain', 'varietas', 'forma',  # Below species
        'species',                           # Species level
        'species subgroup', 'species group', # Between genus and species
        'genus', 'subgenus',                # Genus level
        'tribe', 'subtribe',                # Above genus
        'subfamily', 'family',              # Family level
        'order', 'class', 'phylum',       # Higher levels
        'realm', 'clade', 'domain', 'subkingdom', 'superkingdom', 'kingdom',
        'cellular root', 'acellular root'  # NCBI ranks

    ]
    
    # Statistics tracking
    total_rows = len(df_result)
    success_count = 0
    failed_names = []
    
    for idx, row in df_result.iterrows():
    
        # Find the most specific taxonomic level available
        best_taxon_name = None
        best_level = None
        
        for level in hierarchy:
            if level in df_result.columns and pd.notna(row[level]) and row[level] != '':
                best_taxon_name = row[level]
                best_level = level
                break
        
        if best_taxon_name is None:
            df_result.loc[idx, 'taxid_status'] = 'no_taxonomy_data'
            continue
        
        # Try to get taxon ID from NCBI
        try:
            # Clean the name (remove extra whitespace)
            clean_name = str(best_taxon_name).strip()
            
            # Get taxon ID using ete3
            name_dict = ncbi.get_name_translator([clean_name])
            
            if clean_name in name_dict and len(name_dict[clean_name]) > 0:
                taxid = name_dict[clean_name][0]  # Take first match
                
                # Store results
                df_result.loc[idx, 'best_taxonomy_taxid'] = taxid
                df_result.loc[idx, 'best_taxonomy_name'] = clean_name
                df_result.loc[idx, 'best_taxonomy_level'] = best_level
                df_result.loc[idx, 'taxid_status'] = 'success'
                success_count += 1
                
            else:
                df_result.loc[idx, 'taxid_status'] = 'name_not_found'
                failed_names.append(clean_name)
                
        except Exception as e:
            df_result.loc[idx, 'taxid_status'] = f'error: {str(e)}'
            failed_names.append(best_taxon_name)
    
    # Print summary statistics
    print(f"\n=== TAXON ID RETRIEVAL SUMMARY ===")
    print(f"Total entries processed: {total_rows}")
    print(f"Successfully retrieved taxon IDs: {success_count}")
    print(f"Success rate: {(success_count/total_rows)*100:.1f}%")
    
    # Show status breakdown
    status_counts = df_result['taxid_status'].value_counts()
    print(f"\nStatus breakdown:")
    for status, count in status_counts.items():
        print(f"  {status}: {count}")
    
    # Show most common failed names
    if failed_names:
        print(f"\nTop 10 most common failed lookups:")
        failed_series = pd.Series(failed_names)
        for name, count in failed_series.value_counts().head(10).items():
            print(f"  {name}: {count}")
    
    return df_result
ncbi = NCBITaxa()
taxonomy_standardized = standardize_taxonomy_dataframe(taxonomy_names_df, ncbi)
taxonomy_with_ids = get_taxon_ids_for_best_taxonomy(taxonomy_standardized, ncbi)
Getting taxon IDs for most specific taxonomic ranks...

=== TAXON ID RETRIEVAL SUMMARY ===
Total entries processed: 14391
Successfully retrieved taxon IDs: 14388
Success rate: 100.0%

Status breakdown:
  success: 14388
  no_taxonomy_data: 3
Define and validate species name standardization functions
def standardize_species_names(df):
    """
    Standardize species names across the entire taxonomy dataframe
    """
    def fix_species(row):
        # If species already exists and looks good, keep it
        if pd.notna(row['species']) and row['species'] != '' and 'unclassified' not in str(row['species']).lower():
            return row['species']
            return row['species']
        
        # Only try to fix if best_taxonomy_level is at genus level or below
        genus_and_below_levels = ['genus', 'subgenus', 'species group', 'species subgroup', 
                                 'species', 'subspecies', 'strain', 'varietas', 'forma']
        
        if row['best_taxonomy_level'] not in genus_and_below_levels:
            return "Unclassified"
        
        # Try to extract from best_taxonomy_name
        if pd.notna(row['best_taxonomy_name']):
            name = str(row['best_taxonomy_name']).strip()
            
            # Pattern for proper binomial nomenclature
            binomial_patterns = [
                r'^([A-Z][a-z]+)\s+([a-z]+)$',  # Simple binomial
                r'^([A-Z][a-z]+)\s+([a-z]+)\s+([a-z]+)$',  # Trinomial (subspecies)
                r'^([A-Z][a-z]+)\s+(sp\.)\s+([A-Za-z0-9_-]+)$',  # Genus sp. strain
            ]
            
            for pattern in binomial_patterns:
                match = re.match(pattern, name)
                if match:
                    genus_from_name = match.group(1)
                    species_epithet = match.group(2)
                    
                    # Skip non-specific epithets that aren't real species names
                    non_specific_epithets = ['group', 'clade', 'complex', 'cluster', 
                                           'assemblage', 'section', 'series']
                    if species_epithet.lower() in non_specific_epithets:
                        return "Unclassified"
                    
                    # Verify genus consistency
                    if pd.notna(row['genus']) and row['genus'] != '':
                        if genus_from_name == row['genus']:
                            if species_epithet == 'sp.':
                                return f"{genus_from_name} {species_epithet} {match.group(3)}"
                            else:
                                return f"{genus_from_name} {species_epithet}"
                    else:
                        # Use what we found
                        if species_epithet == 'sp.':
                            return f"{genus_from_name} {species_epithet} {match.group(3)}"
                        else:
                            return f"{genus_from_name} {species_epithet}"
        
        return "Unclassified"
    
    return df.assign(species=df.apply(fix_species, axis=1))

def validate_species_extraction(df):
    """
    Check how many species were recovered vs missing
    """
    before_fix = df['species'].notna().sum()
    df_fixed = standardize_species_names(df)
    after_fix = (df_fixed['species'] != 'Unclassified').sum()
    
    print(f"Species names before fix: {before_fix}")
    print(f"Species names after fix: {after_fix}")
    print(f"Improvement: {after_fix - before_fix} additional species recovered")
    
    # Show examples of what was fixed
    fixed_examples = (df_fixed[(df['species'].isna() | (df['species'] == '')) & 
           (df_fixed['species'] != 'Unclassified')])

    for _, row in fixed_examples.iterrows():
        print(f"{row['best_taxonomy_level']}: {row['best_taxonomy_name']} -> Fixed Species: {row['species']}")

validate_species_extraction(taxonomy_with_ids)
Species names before fix: 10840
Species names after fix: 10868
Improvement: 28 additional species recovered
species subgroup: Stutzerimonas stutzeri subgroup -> Fixed Species: Stutzerimonas stutzeri
species group: Burkholderia cepacia complex -> Fixed Species: Burkholderia cepacia
species group: Alternaria alternata complex -> Fixed Species: Alternaria alternata
species group: Pseudomonas aeruginosa group -> Fixed Species: Pseudomonas aeruginosa
species group: Pseudomonas fluorescens group -> Fixed Species: Pseudomonas fluorescens
species group: Pseudomonas syringae group -> Fixed Species: Pseudomonas syringae
species subgroup: Stutzerimonas stutzeri subgroup -> Fixed Species: Stutzerimonas stutzeri
species group: Agrobacterium tumefaciens complex -> Fixed Species: Agrobacterium tumefaciens
species group: Stutzerimonas stutzeri group -> Fixed Species: Stutzerimonas stutzeri
species group: Enterobacter cloacae complex -> Fixed Species: Enterobacter cloacae
species group: Streptococcus anginosus group -> Fixed Species: Streptococcus anginosus
species group: Citrobacter freundii complex -> Fixed Species: Citrobacter freundii
species group: Pseudomonas putida group -> Fixed Species: Pseudomonas putida
species group: Bacillus subtilis group -> Fixed Species: Bacillus subtilis
species subgroup: Bacillus amyloliquefaciens group -> Fixed Species: Bacillus amyloliquefaciens
species group: Bacillus cereus group -> Fixed Species: Bacillus cereus
species group: Rhodococcus erythropolis group -> Fixed Species: Rhodococcus erythropolis
species group: Enterobacter cloacae complex -> Fixed Species: Enterobacter cloacae
species group: Mycobacterium simiae complex -> Fixed Species: Mycobacterium simiae
species group: Amycolatopsis methanolica group -> Fixed Species: Amycolatopsis methanolica
species group: Pseudomonas aeruginosa group -> Fixed Species: Pseudomonas aeruginosa
species group: Vibrio harveyi group -> Fixed Species: Vibrio harveyi
species group: Mycobacterium tuberculosis complex -> Fixed Species: Mycobacterium tuberculosis
species group: Pseudomonas chlororaphis group -> Fixed Species: Pseudomonas chlororaphis
species group: Prauserella coralliicola group -> Fixed Species: Prauserella coralliicola
species group: Stenotrophomonas maltophilia group -> Fixed Species: Stenotrophomonas maltophilia
species group: Streptomyces violaceusniger group -> Fixed Species: Streptomyces violaceusniger
species group: Streptococcus dysgalactiae group -> Fixed Species: Streptococcus dysgalactiae
taxonomy_with_ids.query('taxid_status == "no_taxonomy_data"').index
Index([0, 7258, 13054], dtype='int64')
taxonomy_names_df.iloc[0].dropna()
level_1    NCBI
Name: 0, dtype: object

First entry has no lineage at all (NCBI root), so we set its taxid to 0

taxonomy_with_ids.loc[0, 'best_taxonomy_taxid'] = 0
taxonomy_with_ids.loc[0, 'best_taxonomy_name'] = 'NCBI root'
taxonomy_with_ids.loc[0, 'best_taxonomy_level'] = 'root'

The second entry belongs to unclassified entries -> unclassified sequences -> environmental samples; a search in NCBI’s Taxonomy Browser shows that its taxid is 151659:

taxonomy_names_df.iloc[7258].dropna()
level_1                                                 NCBI
level_2                                 unclassified entries
level_3                               unclassified sequences
level_4    environmental samples <unclassified sequences,...
Name: 7258, dtype: object
taxonomy_with_ids.loc[7258, 'best_taxonomy_taxid'] = 151659
taxonomy_with_ids.loc[7258, 'best_taxonomy_name'] = 'Environmental samples'
taxonomy_with_ids.loc[7258, 'best_taxonomy_level'] = 'root'
taxonomy_names_df.iloc[13054].dropna()
level_1                    NCBI
level_2           other entries
level_3         other sequences
level_4    artificial sequences
Name: 13054, dtype: object

The third and last entry belongs to other entries -> other sequences -> artificial sequences, whose taxid is 81077:

taxonomy_with_ids.loc[13054, 'best_taxonomy_taxid'] = 81077
taxonomy_with_ids.loc[13054, 'best_taxonomy_name'] = 'Artificial sequences'
taxonomy_with_ids.loc[13054, 'best_taxonomy_level'] = 'root'

taxonomy_with_ids = (taxonomy_with_ids
                     .assign(best_taxonomy_taxid=lambda dd: dd.best_taxonomy_taxid.astype(int))
)

This taxonomy ID mapping will allow us to keep a single column with the taxonomic IDs, which we can later use for further analyses without carrying the full taxonomic lineage around.

abundance_tax_df = (abundance_table
 .T
 .reset_index(drop=True)
 .merge(taxonomy_with_ids[['best_taxonomy_taxid']],
         left_index=True, right_index=True)
)

abundance_tax_long = (abundance_tax_df
 .melt(id_vars=['best_taxonomy_taxid'],
        var_name='sample_id', value_name='counts')
 .replace({'NCBI': 'root'})
 .query('counts > 0')
 .assign(sample_id=lambda dd: dd.sample_id.str.split('_').str[0])
 .assign(sample_type=lambda dd: np.where(dd.sample_id.str.contains('B', regex=False), 'Blank', 'Sample'))
)
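If we later need human-readable names again, taxids can be translated back with ete3 (a quick sketch; 28214 is the 'Sphingomonas sp.' taxid that appears below):

# translate taxids back to scientific names, or expand one into its full lineage
ncbi.get_taxid_translator([28214])  # -> {28214: 'Sphingomonas sp.'}
lineage = ncbi.get_lineage(28214)   # taxids from the root down to the query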
sample_taxids = abundance_tax_long.query('sample_type != "Blank"').best_taxonomy_taxid.unique()
taxonomy_standardized.iloc[2329].dropna()
cellular root     cellular organisms
domain                      Bacteria
kingdom               Pseudomonadati
phylum                Pseudomonadota
class            Alphaproteobacteria
order               Sphingomonadales
family             Sphingomonadaceae
genus                   Sphingomonas
Name: 2329, dtype: object
taxonomy_names_df.iloc[1886]
level_1                          NCBI
level_2            cellular organisms
level_3                      Bacteria
level_4                Proteobacteria
level_5           Alphaproteobacteria
level_6              Sphingomonadales
level_7             Sphingomonadaceae
level_8                  Sphingomonas
level_9     unclassified Sphingomonas
level_10        Sphingomonas sp. YJ09
level_11 … level_37               NaN
Name: 1886, dtype: object
ncbi.get_name_translator(['Sphingomonas sp.'])
{'Sphingomonas sp.': [28214]}
taxonomy_with_ids.loc[lambda dd: dd.duplicated()].query('best_taxonomy_name=="Sphingomonas"')
All 23 duplicated rows (index 82, 138, 192, 1886, 2296, 2329, 2994, 3105, 3546, 3659, 5433, 5543, 6253, 6717, 7011, 7189, 8092, 8172, 8823, 11244, 11472, 11528, 12159) share the same lineage (cellular organisms; Bacteria; Pseudomonadati; Pseudomonadota; …) and resolve to taxid 13687, Sphingomonas, genus, success.

23 rows × 45 columns

(abundance_tax_long
 .groupby(['sample_id', 'sample_type'], as_index=False)
 ['counts']
 .sum()
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], how='left', on='sample_id')
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='reorder(sample_id, counts)', y='counts')
       + p9.geom_col(p9.aes(fill='sample_duration'))
       + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_y_continuous(expand=(0, 0.1), labels=lambda x: [f"{v/1e6:.0f}M" for v in x])
       + p9.coord_flip()
       + p9.labs(x='', y='Reads', fill='', title='Total assigned reads per sample')
       + p9.theme(
           figure_size=(5, 7),
           axis_text_y=p9.element_text(size=4, family='monospace'),
           axis_text_x=p9.element_text(size=8, family='Georgia'),
           legend_position=(0.8, 0.2)
       )
       )
 )

If we take a look at the best rank of the taxonomic assignments:

(taxonomy_with_ids
 .query('best_taxonomy_taxid.isin(@sample_taxids)')
 .drop_duplicates()
 .best_taxonomy_level
 .value_counts()
 .head(20)
)
best_taxonomy_level
species             9643
genus               1291
strain               823
family               458
order                287
class                172
phylum               102
subspecies            94
subfamily             52
species group         42
clade                 30
tribe                 27
species subgroup      13
varietas               8
domain                 5
forma                  3
root                   3
realm                  3
cellular root          1
acellular root         1
Name: count, dtype: int64

Community composition
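Both the PERMANOVA below and the ordinations that follow operate on Bray–Curtis dissimilarities; for two samples $j$ and $k$ with counts $x_{ij}$ over taxa $i$:

$$BC_{jk} = \frac{\sum_i \lvert x_{ij} - x_{ik} \rvert}{\sum_i (x_{ij} + x_{ik})}$$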

%%R -i big_table -o permanova_blanks_results

sample_type <- ifelse(grepl("^KB", rownames(big_table)), "Blank", "Sample")
dist_matrix <- vegan::vegdist(big_table, method = "bray")
permanova_blanks_results <- vegan::adonis2(dist_matrix ~ sample_type, permutations = 999)

print(permanova_blanks_results)
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = dist_matrix ~ sample_type, permutations = 999)
          Df SumOfSqs      R2      F Pr(>F)    
Model      1    1.485 0.04417 5.5915  0.001 ***
Residual 121   32.146 0.95583                  
Total    122   33.631 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Ordination Analysis
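NMDS searches for a low-dimensional configuration that minimises Kruskal’s stress, the mismatch between the ordination distances $\hat{d}_{ij}$ and the (monotonically transformed) observed dissimilarities $d_{ij}$:

$$\mathrm{stress} = \sqrt{\frac{\sum_{i<j} \left(d_{ij} - \hat{d}_{ij}\right)^2}{\sum_{i<j} d_{ij}^2}}$$

As a common rule of thumb, stress below ~0.2 is usable and below ~0.1 is good; the runs below settle around 0.14–0.17.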

%%R -i big_table  -o nmds_blanks_df

dist_matrix_blanks <- vegan::vegdist(big_table, method = "bray")
nmds_coords_blanks <- vegan::metaMDS(dist_matrix_blanks, k = 2, trymax = 100, verbose = FALSE)
nmds_blanks_df <- data.frame(
    NMDS1 = nmds_coords_blanks$points[, 1],
    NMDS2 = nmds_coords_blanks$points[, 2],
    sample_id = rownames(big_table),
    sample_type = ifelse(grepl("^KB", rownames(big_table)), "Blank", "Sample")
)
Run 0 stress 0.1735855 
[... 100 random restarts; new best solutions at runs 11, 33, 46, 86, 88 and 94, ending at stress 0.1658331 ...]
*** Best solution was not repeated -- monoMDS stopping criteria:
     1: no. of iterations >= maxit
    99: stress ratio > sratmax
Big table wrangling with metadata
big_table.index = big_table.index.str.split('_').str[0]

relevant_kumamoto_metadata = (big_table
 .assign(sample_id=lambda dd: dd.index.str.split('_').str[0])
 [['sample_id']]
 .merge(kumamoto_metadata, on='sample_id', how='left')
)

big_table_no_blanks = big_table.query('index.str.contains("B", regex=False) == False')

kumamoto_metadata_no_blanks = (kumamoto_metadata
                               .query('sample_id.str.contains("B", regex=False) == False')
                               .query('sample_id in @big_table_no_blanks.index.str.split("_").str[0]')
 .assign(date=lambda dd: pd.to_datetime(dd['date_start'] + np.where(dd['sample_duration'] == 'Daily',
                                                   pd.Timedelta(days=0),
                                                   pd.Timedelta(days=3))))
 .assign(month_name=lambda dd: dd['date'].dt.strftime('%B'))
 .assign(month=lambda dd: dd['date'].dt.month)
 .assign(year=lambda dd: dd['date'].dt.year)
 .assign(season=lambda dd: np.select(
        [dd['date'].dt.dayofyear < 80,  # Winter
         dd['date'].dt.dayofyear < 172,  # Spring
         dd['date'].dt.dayofyear < 264,  # Summer
         dd['date'].dt.dayofyear < 355],  # Autumn
        ['Winter', 'Spring', 'Summer', 'Autumn'],
        default='Winter' # Default to Winter for days after December 21
    ))
)
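Since this day-of-year → season mapping is reused verbatim for the calendar scaffold in the timeline section below, a small helper would keep the two code paths in sync. A sketch using the same approximate equinox/solstice cutoffs (the function name is mine):

# Helper sketch: map dates to seasons via approximate equinox/solstice
# day-of-year cutoffs (80, 172, 264, 355), as in the cells above and below
def assign_season(dates: pd.Series) -> np.ndarray:
    doy = dates.dt.dayofyear
    return np.select(
        [doy < 80, doy < 172, doy < 264, doy < 355],
        ['Winter', 'Spring', 'Summer', 'Autumn'],
        default='Winter')  # days after ~Dec 21 wrap back to Winter

# usage: .assign(season=lambda dd: assign_season(dd['date']))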
%%R -i big_table_no_blanks -i kumamoto_metadata_no_blanks -o permanova_duration_results -o nmds_no_blanks_df -o pcoa_no_blanks_df -o pcoa_variance_explained

sample_duration <- kumamoto_metadata_no_blanks$sample_duration[match(rownames(big_table_no_blanks), kumamoto_metadata_no_blanks$sample_id)]
dist_matrix_no_blanks <- vegan::vegdist(big_table_no_blanks, method = "bray")
permanova_duration_results <- vegan::adonis2(dist_matrix_no_blanks ~ sample_duration, permutations = 999)
nmds_coords_no_blanks <- vegan::metaMDS(dist_matrix_no_blanks, k = 2, trymax = 100, trace = 0)  # metaMDS silences its run log via `trace`, not `verbose`

pcoa_coords_no_blanks <- ape::pcoa(dist_matrix_no_blanks)
pcoa_variance_explained <- pcoa_coords_no_blanks$values$Relative_eig[1:2] * 100


nmds_no_blanks_df <- data.frame(
    NMDS1 = nmds_coords_no_blanks$points[, 1],
    NMDS2 = nmds_coords_no_blanks$points[, 2],
    sample_id = rownames(big_table_no_blanks),
    duration = sample_duration)

# Create dataframe with PCoA coordinates
pcoa_no_blanks_df <- data.frame(
    PCoA1 = pcoa_coords_no_blanks$vectors[, 1],
    PCoA2 = pcoa_coords_no_blanks$vectors[, 2], 
    sample_id = rownames(big_table_no_blanks),
    duration = sample_duration
)
Run 0 stress 0.1440462 
[... trace truncated: runs 1–20, best stress 0.1440189 ...]
*** Best solution repeated 2 times
Example of working ggplot2 plot with rpy2 and R magic

I am just leaving this here as an example of how to generate plots directly in R while controlling the resolution and size of the output image. I can’t seem to manage it directly with the %%R cell magic, but opening a device with the png() function works fine.

%%R -i nmds_no_blanks_df -i kumamoto_metadata_no_blanks


# Plot NMDS
png("nmds_plot.png", res = 300, width=2000, height=1500)
ggplot(nmds_no_blanks_df, aes(x = NMDS1, y = NMDS2, color = duration)) +
  geom_point(size = 3, alpha = 0.8, stroke=0) +
  scale_color_manual(values=c("#30a12e", "#ae213b")) +
  scale_fill_manual(values=c("#30a12e", "#ae213b")) +
  geom_rug() +
  # the data frame column is `duration` (not `sample_duration`)
  stat_ellipse(aes(fill = duration), alpha = 0.2, level = 0.95, geom='polygon') +
  guides(fill=FALSE) +
  labs(
    title = "NMDS Plot of Bray-Curtis Distances",
    x = "NMDS1", 
    y = "NMDS2", 
    color = "Sample Duration") +
  theme_classic() +
  theme(
    text = element_text(family = "Georgia"),
    panel.grid.major = element_line(linetype=2)
  )
# note: dev.off() finalizes the png() device and overwrites the file
# that ggsave() just wrote, so the 2000x1500 @ 300 dpi version wins
ggsave("nmds_plot.png")
dev.off()
Saving 6.67 x 5 in image
quartz_off_screen 
                2 
from IPython.display import Image
Image('nmds_plot.png')
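For the record, rpy2’s R magic does document device-size arguments (-w/--width, -h/--height and -r/--res), which would be the more direct route if they behaved in this setup. The intended form would look something like the sketch below (not run):

%%R -w 2000 -h 1500 -r 300 -i nmds_no_blanks_df
# (sketch: same ggplot call as above, sized via the cell magic
# instead of an explicit png() device)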

Combined NMDS and PCoA plot
(pcoa_no_blanks_df
 .merge(nmds_no_blanks_df)
 .melt(id_vars=['sample_id', 'duration'], value_vars=['NMDS1', 'NMDS2', 'PCoA1', 'PCoA2'],
       var_name='axis', value_name='value')
 .assign(decomp=lambda dd: dd['axis'].str[:-1])
 .assign(axis=lambda dd: dd['axis'].str[-1].replace({'1': 'x', '2': 'y'}))
 .pivot(index=['sample_id', 'duration', 'decomp'], columns='axis', values='value')
 .reset_index()
 .pipe(lambda dd: p9.ggplot(dd, p9.aes(x='x', y='y', color='duration'))
       + p9.geom_point(size=2, alpha=0.7, stroke=0)
       + p9.geom_rug()
       + p9.stat_ellipse(p9.aes(fill='duration'), alpha=0.1, level=0.8, geom='polygon')
       + p9.facet_wrap('~decomp', scales='free', nrow=1)
       + p9.scale_color_manual(values=["#30a12e", "#ae213b"])
       + p9.scale_fill_manual(values=["#30a12e", "#ae213b"])
       + p9.labs(x='Coordinate$_1$', y='Coordinate$_2$', color='', fill='',
                 title='')
       + p9.theme(
           figure_size=(5, 3),
           legend_position='top'
       )
)
)

Show Code
permanova_p = permanova_blanks_results.iloc[0, 4]
permanova_r2 = permanova_blanks_results.iloc[0, 2]

a = (p9.ggplot(nmds_blanks_df)
 + p9.aes(x='NMDS1', y='NMDS2', color='sample_type')
 + p9.geom_point(size=3, alpha=0.8, stroke=0)
 + p9.geom_rug()
 + p9.annotate(
     geom='text', 
               label=f'n={nmds_blanks_df.shape[0]}', 
               x=nmds_blanks_df['NMDS1'].min(),
               y=nmds_blanks_df['NMDS2'].min(),
               size=10,
               family='Georgia',
               ha='left')
 + p9.scale_color_manual(values=["#5198ca", "#c16514"])
 + p9.scale_fill_manual(values=["#5198ca", "#c16514"])
 + p9.guides(fill=False)
 + p9.stat_ellipse(p9.aes(fill='sample_type'), alpha=0.1, level=0.95, geom='polygon')
 + p9.labs(title='',
          x='NMDS$_1$', y='NMDS$_2$', color='Sample Type',
          caption=f'PERMANOVA p-value: {permanova_p:.3f} (R$^2$ = {permanova_r2:.3%})',
          tag='(A)'

))

permanova_p = permanova_duration_results.iloc[0, 4]
permanova_r2 = permanova_duration_results.iloc[0, 2]

b = (p9.ggplot(nmds_no_blanks_df)
 + p9.aes(x='NMDS1', y='NMDS2', color='duration')
 + p9.geom_point(size=3, alpha=0.8, stroke=0)
 + p9.geom_rug()
 + p9.scale_color_manual(values=["#30a12e", "#ae213b"]) 
 + p9.scale_fill_manual(values=["#30a12e", "#ae213b"])
 + p9.annotate(
     geom='text',
     label=f'n={nmds_no_blanks_df.shape[0]}', 
     x=nmds_no_blanks_df['NMDS1'].min() * 1.05,
     y=nmds_no_blanks_df['NMDS2'].min() * 0.95,
     size=10, 
     family='Georgia', 
     ha='left',
     va='bottom')
 + p9.guides(fill=False)
 + p9.stat_ellipse(p9.aes(fill='duration'), alpha=0.1, level=0.95, geom='polygon')
 + p9.labs(title='',
          x='NMDS$_1$', y='', color='Sample Duration',
          caption=f'PERMANOVA p-value: {permanova_p:.3f} (R$^2$ = {permanova_r2:.3%})',
          tag='(B)')

)


composition = (a | b) 

composition  & p9.theme(figure_size=(8, 4),
      plot_caption=p9.element_text(size=10, family='Georgia', ha='right'),
      legend_position='top',
      plot_tag=p9.element_text(size=16, family='Georgia', face='bold', ha='left'),
      plot_tag_location='plot',
  ) 

As shown in the results (aside from the satisfaction of finally being able to programmatically generate labeled compositions with plotnine v0.15.0), the NMDS plots reveal interesting patterns in the data, which the PERMANOVA results confirm.

The blanks and actual samples are clearly separated in NMDS space, and significantly different from each other according to the PERMANOVA results (p = 0.001).

As for sample duration, it also seems to have an effect on sample composition: week-long samples are more similar to each other than to the 24-hour samples, which are more dispersed in NMDS space. The PERMANOVA results confirm this as well (p = 0.001).
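One caveat on the duration result: PERMANOVA can return small p-values when groups differ in spread rather than location, and the 24-hour samples are visibly more dispersed. A homogeneity-of-dispersion test would disentangle the two; a minimal sketch with vegan’s betadisper, reusing the objects from the ordination cell above (not run here):

%%R
# Sketch: test whether the two duration groups differ in multivariate spread.
# A significant permutest would mean the PERMANOVA signal may partly reflect
# dispersion rather than a shift in community composition.
disp <- vegan::betadisper(dist_matrix_no_blanks, sample_duration)
vegan::permutest(disp, permutations = 999)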

Timeline plot code
month_colors_dict = {
    "June": "#FFD27A",
    "July": "#FFC04D",
    "August": "#FFAD33",
    "September": "#E68A4C",
    "October": "#CC6E4A",
    "November": "#B35A4D",
    "December": "#9A7ACC",
    "January": "#7C93E6",
    "February": "#66A3F2",
    "March": "#7DCB9B",
}


season_colors_dict = {
    "Spring": "#7DCB9B",
    "Summer": "#FFB84D",
    "Autumn": "#CC6E4A",
    "Winter": "#7C93E6",
}

calendar_samples = (pd.DataFrame(dict(
    date=
pd.date_range(
    start=kumamoto_metadata_no_blanks['date_start'].min(),
    end=kumamoto_metadata_no_blanks['date_end'].max(),
    freq='D'
)))
 .assign(month_name=lambda dd: dd['date'].dt.strftime('%B'))
 .assign(year=lambda dd: dd['date'].dt.year)
 .assign(season=lambda dd: np.select(
        [dd['date'].dt.dayofyear < 80,  # Winter
         dd['date'].dt.dayofyear < 172,  # Spring
         dd['date'].dt.dayofyear < 264,  # Summer
         dd['date'].dt.dayofyear < 355],  # Autumn
        ['Winter', 'Spring', 'Summer', 'Autumn'],
        default='Winter' # Default to Winter for days after December 21
    ))
)


a = (kumamoto_metadata_no_blanks
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.geom_rect(p9.aes(
           xmin='date_start', 
           xmax='date_end', 
           ymin=0, 
           ymax=1,
           fill='sample_duration'),
        size=.1,
        color='black',
        alpha=.75
           )
       + p9.scale_y_continuous(expand=(0, 0), breaks=[])
       + p9.scale_x_datetime(date_breaks='1 month',
                             date_labels='%b %d %Y',
                             expand=(0, 0)
                             )
       + p9.scale_fill_manual(values=["#30a12e", "#ae213b"])
       + p9.labs(fill='', title='Timeline of Kumamoto Samples')
       + p9.theme(
           figure_size=(6, 2),
           legend_position=(.55, .65),
           axis_text_x=p9.element_blank(),
           plot_margin_bottom=-.001,
           
           )

       )
)

b = (calendar_samples
     .groupby(['year', 'month_name'], as_index=False)
        .agg(min_date=('date', 'min'),
             max_date=('date', 'max'),
             center_date=('date', 'mean'))
    .pipe(lambda dd: p9.ggplot(dd)
            + p9.geom_rect(p9.aes(
                xmin='min_date',
                xmax='max_date',
                ymin=0,
                ymax=1,
                fill='month_name'),
            )
            + p9.geom_text(p9.aes(x='center_date', y=0.5, label='month_name'),
                           size=8, family='Georgia', ha='center', va='center')
            + p9.scale_x_datetime(date_breaks='1 month',
                                  date_labels='%b %Y',
                                  expand=(0, 0))
            + p9.scale_fill_manual(values=month_colors_dict.values(),
                                   breaks=list(month_colors_dict.keys()))
            + p9.labs(fill='', x='', y='')
            + p9.guides(fill=False)
            + p9.theme(
                figure_size=(6, .5),
                legend_position='top',
                axis_text=p9.element_blank(),
                axis_ticks=p9.element_blank(),
                axis_line=p9.element_blank(),
                panel_grid=p9.element_blank(),
                plot_margin_bottom=-.01,
            )
             )
)

c = (calendar_samples
     .groupby(['season'], as_index=False)
        .agg(min_date=('date', 'min'),
             max_date=('date', 'max'),
             center_date=('date', 'mean'))
    .pipe(lambda dd: p9.ggplot(dd)
            + p9.geom_rect(p9.aes(
                xmin='min_date',
                xmax='max_date',
                ymin=0,
                ymax=1,
                fill='season'),
            )
            + p9.geom_text(p9.aes(x='center_date', y=0.5, label='season'),
                           size=8, family='Georgia', ha='center', va='center')
            + p9.scale_x_datetime(
                date_breaks='1 month',
                date_labels='%b %Y',
                expand=(0, 0))
            + p9.scale_fill_manual(
                values=season_colors_dict.values(),
                breaks=list(season_colors_dict.keys()))
            + p9.labs(fill='', x='', y='')
            + p9.guides(fill=False)
            + p9.theme(
                figure_size=(6, .5),
                legend_position='top',
                axis_text=p9.element_blank(),
                axis_ticks=p9.element_blank(),
                axis_line=p9.element_blank(),
                panel_grid=p9.element_blank(),

            )
             )
)
ph = p9.ggplot() + p9.geom_blank() + p9.theme(axis_line=p9.element_blank())
empty_comp = (b / c /  ph)
final_comp = a / empty_comp
final_comp & p9.theme(figure_size=(6, 2), dpi=300, plot_title=p9.element_text(size=12)) 
 

If we now color the NMDS plots by season and month, and run the PERMANOVA tests on those same variables, we can see that the season and month of the year actually have a much stronger effect on sample composition than sampling duration: p = 0.001 in both cases, but explaining roughly 15% and 27% of the variance in community composition, respectively, compared to a mere 7% for sample duration.

Temporal Patterns in Community Composition

%%R -i kumamoto_metadata_no_blanks -o permanova_season_results -o permanova_month_results

season <- kumamoto_metadata_no_blanks$season[match(rownames(big_table_no_blanks), kumamoto_metadata_no_blanks$sample_id)]
month <- kumamoto_metadata_no_blanks$month_name[match(rownames(big_table_no_blanks), kumamoto_metadata_no_blanks$sample_id)]
permanova_season_results <- vegan::adonis2(big_table_no_blanks ~ season, permutations = 999)
permanova_month_results <- vegan::adonis2(big_table_no_blanks ~ month, permutations = 999)

print(permanova_season_results)
print(permanova_month_results)
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = big_table_no_blanks ~ season, permutations = 999)
          Df SumOfSqs     R2     F Pr(>F)    
Model      3   4.4672 0.1506 6.501  0.001 ***
Residual 110  25.1957 0.8494                 
Total    113  29.6629 1.0000                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = big_table_no_blanks ~ month, permutations = 999)
          Df SumOfSqs      R2      F Pr(>F)    
Model      8   8.2122 0.27685 5.0248  0.001 ***
Residual 105  21.4507 0.72315                  
Total    113  29.6629 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
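Since sampling duration and calendar time are partly confounded here (the daily and weekly campaigns did not cover identical periods), the single-term R² values above overlap. adonis2 can assess each term’s marginal contribution in a joint model via by = "margin"; a hedged sketch reusing the vectors built above (not run here):

%%R
# Sketch: marginal tests for duration and month fitted together;
# `by = "margin"` evaluates each term after accounting for the other.
joint_permanova <- vegan::adonis2(big_table_no_blanks ~ sample_duration + month,
                                  permutations = 999, by = "margin")
print(joint_permanova)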
Show Code
a = (nmds_no_blanks_df
 .merge(kumamoto_metadata_no_blanks, on='sample_id', how='left')
 .pipe(lambda dd: p9.ggplot(dd)
         + p9.aes(x='NMDS1', y='NMDS2', color='season')
         + p9.stat_ellipse(p9.aes(fill='season'), alpha=0.1, level=0.95, geom='polygon')
         + p9.geom_point(size=3, alpha=0.8, stroke=0)
         + p9.geom_rug()
         + p9.scale_color_manual(
                values=season_colors_dict.values(),
                breaks=list(season_colors_dict.keys())
                                 )
         + p9.scale_fill_manual(
                values=season_colors_dict.values(),
                breaks=list(season_colors_dict.keys())
         )
         + p9.labs(title='', fill='',
             x='NMDS$_1$', y='NMDS$_2$', color='',
             caption=f'PERMANOVA p-value: {permanova_season_results.iloc[0, 4]:.3f}' \
               f' (R$^2$ = {permanova_season_results.iloc[0, 2]:.1%})',
               tag='(A)'
         )
         + p9.theme(
              text=p9.element_text(family='Georgia'),
              plot_caption=p9.element_text(size=10, family='Georgia', ha='right'),
              plot_tag=p9.element_text(size=16, family='Georgia', face='bold', ha='left'),
              plot_tag_location='plot',
              legend_position='top',
         )
     )
)

b = (nmds_no_blanks_df
 .merge(kumamoto_metadata_no_blanks, on='sample_id', how='left')
 .pipe(lambda dd: p9.ggplot(dd)
         + p9.aes(x='NMDS1', y='NMDS2', color='month_name', fill='month_name')
         + p9.geom_point(size=3, alpha=0.8, stroke=0)
         + p9.geom_rug()
         + p9.scale_color_manual(
                values=month_colors_dict.values(),
                breaks=list(month_colors_dict.keys()),
                labels=[m[:3] for m in list(month_colors_dict.keys())]
         )
         + p9.scale_fill_manual(
                values=month_colors_dict.values(),
                breaks=list(month_colors_dict.keys()),
                labels=[m[:3] for m in list(month_colors_dict.keys())]
         )
         + p9.labs(title='', fill='',
             x='NMDS$_1$', y='NMDS$_2$', color='',
             caption=f'PERMANOVA p-value: {permanova_month_results.iloc[0, 4]:.3f}' \
              f' (R$^2$ = {permanova_month_results.iloc[0, 2]:.1%})',
              tag='(B)'
         )
         + p9.guides(color=p9.guide_legend(ncol=5))
         + p9.theme(
              figure_size=(6, 4),
              text=p9.element_text(family='Georgia'),
              plot_caption=p9.element_text(size=10, family='Georgia', ha='right'),
              plot_tag=p9.element_text(size=16, family='Georgia', face='bold', ha='left'),
              plot_tag_location='plot',
              legend_position='top',
         )
     )
)

composition = (a | b) & p9.theme(figure_size=(8, 4))
composition

The results are quite clear and consistent: as far as we can tell, time is the main driver of changes in community composition. However, so far we have pooled every taxonomic group, mixing Bacteria, Fungi, Metazoa and Viruses in the same analysis. That is not ideal, since these groups have very different ecological and evolutionary dynamics; analyzing each group individually might be more informative and reveal patterns we would not see otherwise, as sketched below.
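As a rough outline of what that would look like, the per-clade analyses amount to rebuilding the sample × taxon table one clade at a time and feeding each into the same vegdist/adonis2 workflow used above. A hedged sketch (column names as used elsewhere in this notebook; the loop itself is illustrative):

# Sketch: one sample x taxon count table per big clade
clade_tables = {}
for clade in ['Bacteria', 'Eukaryota', 'Viruses']:
    clade_tables[clade] = (abundance_tax_long
        .merge(taxonomy_with_ids)
        # same big-clade logic as in the "Big clades" cells that follow
        .assign(big_clade=lambda dd: np.where(dd['domain'].notna(),
                                              dd['domain'], dd['acellular root']))
        .query('big_clade == @clade')
        .groupby(['sample_id', 'best_taxonomy_name'], as_index=False)['counts']
        .sum()
        .pivot(index='sample_id', columns='best_taxonomy_name', values='counts')
        .fillna(0))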

Big clades

If we take a look at the assigned reads, we can see how they are distributed across different sample durations and big clades:

Show Code
big_clade_per_sample_type = (abundance_tax_long
 .merge(taxonomy_with_ids)
 .assign(domain=lambda dd: np.where(dd['acellular root'].isna() & dd['domain'].isna(),
                                   'Unknown', dd['domain']))
 .query('`acellular root`.notna() or domain.notna()')
 .assign(big_clade=lambda dd: np.where(dd['domain'].notna(), dd['domain'],
                                       dd['acellular root']))
 .groupby(['sample_id', 'sample_type', 'big_clade'], as_index=False)
 .agg(total_counts=('counts', 'sum'))
 .groupby('sample_id', as_index=False)
 .apply(lambda dd: dd.assign(relative_abundance=dd['total_counts'] / dd['total_counts'].sum()))
 .reset_index(drop=True)
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
  .groupby(['sample_duration', 'big_clade'], as_index=False)
  .agg(total_counts=('total_counts', 'sum'),
       relative_abundance=('relative_abundance', 'mean'))
)

(big_clade_per_sample_type
 .query('relative_abundance > 0')
 .query('big_clade.isin(["Bacteria", "Eukaryota", "Viruses", "Unknown"])')
 .assign(big_clade=lambda dd: pd.Categorical(
     dd['big_clade'],
     categories=['Bacteria', 'Eukaryota', 'Viruses', 'Unknown'],
     ordered=True))
 .pipe(lambda dd: p9.ggplot(dd)
      + p9.aes(x='sample_duration',
               y='relative_abundance', 
               fill='big_clade')
      + p9.geom_col(alpha=1)
      + p9.scale_y_continuous(labels=label_percent())
      # four categories are plotted (Bacteria, Eukaryota, Viruses, Unknown),
      # so the manual fill scale needs a fourth colour (grey for Unknown)
      + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860", "#9e9e9e"])
      + p9.labs(x='', y='Relative Abundance', fill='', 
          caption="*Archaea not shown (<0.1% of all reads)")
      + p9.theme(
          figure_size=(4, 3),
          legend_position='top',
          legend_text=p9.element_text(size=8),
          legend_key_size=10,
          plot_caption= p9.element_text(size=8),
          )
  )

)

Show Code
(big_clade_per_sample_type
 .pivot(
     index='sample_duration',
     columns='big_clade', 
     values='relative_abundance')
[['Bacteria', 'Eukaryota', 'Viruses', 'Unknown', 'Archaea',]]
.style
.format(lambda x: f"{x * 100:.2f}%")
)
sample_duration   Bacteria   Eukaryota   Viruses   Unknown   Archaea
Blanks              93.37%       2.69%     0.67%     3.27%      nan%
Daily               75.39%      23.04%     1.03%     0.51%     0.04%
Weekly              54.41%      44.93%     0.30%     0.35%     0.02%


There is a rather stark contrast between the relative abundances of Bacteria and Eukaryota in the samples, with Viruses present at much lower abundances. While the blank reads mostly belong to Bacteria (93%), the daily samples show a still dominant but lower relative abundance of Bacteria (75%) relative to Eukaryota (23%). In the weekly samples, however, the Bacteria-to-Eukaryota ratio evens out considerably, with 54% of the reads belonging to Bacteria and 45% to Eukaryota.

Viruses are present in all samples but at very low abundances, ranging from 0.30% to 1.03%.

The unclassified sequences (at this level) make up 0.35% and 0.51% of the reads in the weekly and daily samples, respectively, but 3.27% in the blanks.

If we look at the percentage of unassigned reads at the big-clade level per sample, however, we see that the large blank percentage is driven by a single blank sample, KB105:

Taxonomic Assignment Success Rates

Show Code
(abundance_tax_long
 .merge(taxonomy_with_ids)
 .assign(domain=lambda dd: np.where(dd['acellular root'].isna() & dd['domain'].isna(),
                                   'Unknown', dd['domain']))
 .query('`acellular root`.notna() or domain.notna()')
 .assign(big_clade=lambda dd: np.where(dd['domain'].notna(), dd['domain'],
                                       dd['acellular root']))
 .groupby(['sample_id', 'sample_type', 'big_clade'], as_index=False)
 .agg(total_counts=('counts', 'sum'))
 .groupby('sample_id', as_index=False)
 .apply(lambda dd: dd.assign(relative_abundance=dd['total_counts'] / dd['total_counts'].sum()))
 .reset_index(drop=True)
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
 .query('big_clade == "Unknown"')
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='sample_duration', y='relative_abundance', fill='sample_duration')
       + p9.geom_jitter(p9.aes(color='sample_duration'),
                        data=dd.query('relative_abundance < 0.1'), 
                        size=2, alpha=0.5, width=0.2, stroke=0)
       + p9.geom_point(p9.aes(color='sample_duration'),
                        data=dd.query('relative_abundance >= 0.1'), 
                        size=2, alpha=0.8, stroke=0)
       + p9.geom_label(p9.aes(label='sample_id'), 
                       data=dd.query('relative_abundance > 0.1'), 
                       nudge_x=.2,
                       size=6,
                       family='monospace')
       + p9.labs(x='', y='% of unassigned reads', fill='')
       + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_y_continuous(labels=label_percent())
       + p9.guides(fill=False, color=False)
       + p9.theme(
           figure_size=(4, 2),
           axis_title_y=p9.element_text(size=9),
           axis_text=p9.element_text(size=9)         
           )
 )
)

This sample is also the one with the lowest DNA yield (0.13 ng) and the lowest total read count (roughly 100k, an order of magnitude lower than the next-lowest sample, which already has over a million reads), so it is not surprising that it has a much higher percentage of unassigned reads than the rest. Given how much of an outlier this sample is, we will remove it from the analysis and focus on the remaining samples. If we run the comparison between sample types again, the difference in the percentage of unassigned reads persists: it is significant between Blanks and Weekly samples and between Daily and Weekly samples, but not between Blanks and Daily samples:

Show Code
unassigned_data = (abundance_tax_long
 .merge(taxonomy_with_ids)
 .assign(domain=lambda dd: np.where(dd['acellular root'].isna() & dd['domain'].isna(),
                                   'Unknown', dd['domain']))
 .query('`acellular root`.notna() or domain.notna()')
 .assign(big_clade=lambda dd: np.where(dd['domain'].notna(), dd['domain'],
                                       dd['acellular root']))
 .groupby(['sample_id', 'sample_type', 'big_clade'], as_index=False)
 .agg(total_counts=('counts', 'sum'))
 .groupby('sample_id', as_index=False)
 .apply(lambda dd: dd.assign(relative_abundance=dd['total_counts'] / dd['total_counts'].sum()))
 .reset_index(drop=True)
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
 .query('big_clade == "Unknown"')
 .query('sample_id != "KB105"')
)

(unassigned_data
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='sample_duration', y='relative_abundance', fill='sample_duration')
       + p9.geom_boxplot(outlier_alpha=0, alpha=.3, width=.35)
       + p9.geom_jitter(p9.aes(color='sample_duration'),
                        data=dd.query('relative_abundance < 0.1'), 
                        size=2, alpha=0.5, width=0.1, stroke=0)
       + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.labs(x='', y='% of unassigned reads', fill='')
       + p9.scale_y_continuous(labels=label_percent())
       + p9.guides(fill=False, color=False)
       + p9.annotate(geom='segment', 
                     x=[2.025, 1.025], 
                     xend=[2.975, 2.975],
                     y=[dd['relative_abundance'].max() * 1.1, 
                        dd['relative_abundance'].max() * 1.3],
                     yend=[dd['relative_abundance'].max() * 1.1,
                           dd['relative_abundance'].max() * 1.3],
                     color='black',
                     lineend='butt',
                     arrow=p9.arrow(length=.05, angle=90, ends='both')
                     )
        + p9.annotate(geom='text', x=[2, 2.5], 
                      y=[
                            dd['relative_abundance'].max() * 1.35,
                            dd['relative_abundance'].max() * 1.15],
                      label='*'
                     )
       + p9.theme(
           figure_size=(4, 2),
           axis_title_y=p9.element_text(size=9),
           axis_text=p9.element_text(size=9)         
           )
 )
)

Show Code
%%R -i unassigned_data
library(betareg)

kruskal_test <- kruskal.test(relative_abundance ~ sample_duration, data = unassigned_data)
print(kruskal_test)

# Dunn's test in case it's significant:
if(kruskal_test$p.value < 0.05) {
  print("Kruskal-Wallis test is significant, performing Dunn's test...")
  dunn_test <- FSA::dunnTest(relative_abundance ~ sample_duration, 
                            data = unassigned_data, method = "bh")
  print(dunn_test)
}

# Summary statistics
summary_stats <- unassigned_data %>%
  group_by(sample_duration) %>%
  summarise(
    n = n(),
    median = median(relative_abundance),
    q25 = quantile(relative_abundance, 0.25),
    q75 = quantile(relative_abundance, 0.75),
    mean = mean(relative_abundance),
    sd = sd(relative_abundance)
  )

print(summary_stats)

    Kruskal-Wallis rank sum test

data:  relative_abundance by sample_duration
Kruskal-Wallis chi-squared = 7.959, df = 2, p-value = 0.0187

[1] "Kruskal-Wallis test is significant, performing Dunn's test..."
       Comparison        Z    P.unadj      P.adj
1  Blanks - Daily 1.302445 0.19276440 0.19276440
2 Blanks - Weekly 2.558239 0.01052037 0.03156111
3  Daily - Weekly 2.349434 0.01880199 0.02820298
# A tibble: 3 × 7
  sample_duration     n  median     q25     q75    mean      sd
  <chr>           <int>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 Blanks              8 0.00502 0.00451 0.00749 0.00607 0.00235
2 Daily              97 0.00450 0.00369 0.00625 0.00507 0.00220
3 Weekly             17 0.00392 0.00180 0.00483 0.00349 0.00195
Dunn (1964) Kruskal-Wallis multiple comparison
  p-values adjusted with the Benjamini-Hochberg method.

In addition: Warning message:
sample_duration was coerced to a factor. 

Bacterial community composition

Let’s now take a look at the bacterial fraction of the taxonomic assignments for our samples. Checking the levels of classification provided by the pipeline, they seem to roughly match the following columns, which we can use to map each classification level to its actual name:

Since we also have plenty of reads assigned to names like “unclassified Bacteria” or “unclassified Micrococcus”, we will clean up the data and unify the names of missing taxa and of taxa whose names imply a lack of classification, so that the analysis of the bacterial community composition is more consistent and informative.

Bacterial species-level abundances and species name cleaning code
# Patterns which might look like species but are too vague to be useful and shall be considered "Unclassified"

bacterial_taxonomy = (taxonomy_with_ids
            .query('domain == "Bacteria"')
)
non_specific_patterns = [
    r'^\s*uncultured\b.*',
    r'^\s*unclassified\b.*',
    r'^\s*environmental\b.*',
    r'^\s*unidentified\b.*',
    r'^\s*bacterium\s*$',                                   # exactly "bacterium"
    r'^(?:[A-Za-z]+aceae|[A-Za-z]+ales|[A-Za-z]+ia)\s+bacterium$',  # Family/Order/Class + bacterium (no strain)
    r'^(?:marine|soil)\s+bacterium$',                       # marine/soil bacterium (no strain)
    r'^\s*cyanobacterium\s*$',                              # exactly "cyanobacterium"
    r'^[A-Z][a-z]+\s+sp\.\s*$',                            # Genus sp. (no strain) 
    # any line that ends with bacterium/cyanobacterium and has no trailing strain token
    r'^(?:(?!\s+(?:bacterium|cyanobacterium)\s+[A-Za-z0-9_-]+$).)*(?:^|\s)(?:bacterium|cyanobacterium)\s*$',
]

def standardize_unclassified_species(name):
    """
    If the species name matches any of the non-specific patterns, return "Unclassified"
    """
    for pattern in non_specific_patterns:
        if re.match(pattern, name, re.IGNORECASE):
            return "Unclassified"
    return name
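
# Quick illustrative checks (hypothetical inputs):
#   standardize_unclassified_species("uncultured bacterium")  # -> "Unclassified"
#   standardize_unclassified_species("Bacillus sp.")          # -> "Unclassified"
#   standardize_unclassified_species("Micrococcus luteus")    # -> "Micrococcus luteus"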

bacterial_abundances_long = (abundance_tax_long
 .merge(taxonomy_with_ids)
 .query('domain=="Bacteria"')
 .assign(species=lambda dd: np.select(
     [dd['species'].notna(),
      dd['best_taxonomy_level'].isin(['strain', 'subspecies'])],
     [dd['species'],
      # strains/subspecies: collapse to the binomial (first two name tokens)
      dd['best_taxonomy_name'].str.split(' ').str[:2].str.join(' ')],
     default='Unclassified'))
 # if species matches any of the non-specific patterns, set to "Unclassified"
#  .assign(species=lambda dd: dd['species'].apply(standardize_unclassified_species))
 .groupby(['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'best_taxonomy_level'] + ['sample_id'], as_index=False, dropna=False)
 .agg(counts=('counts', 'sum'))
)
Show Code
unclassified_reads_df = (bacterial_abundances_long
.query('counts > 0')
 .groupby('sample_id')
 .apply(lambda dd: 
        dd[['phylum', 'class', 'order', 'family', 'genus', 'species']]
        .fillna("Unclassified")
        .map(standardize_unclassified_species)
        .map(lambda x: x == "Unclassified")
        .assign(counts=dd['counts']), 
    include_groups=False)
    .reset_index()
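    # the .map(...) calls above turned each rank into a True/False flag
    # (True = unclassified); multiplying the flags by `counts` below turns
    # them into per-rank unclassified read counts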
    .eval('unknown_phylum = phylum * counts')
    .eval('unknown_class = `class` * counts')
    .eval('unknown_order = order * counts')
    .eval('unknown_family = family * counts')
    .eval('unknown_genus = genus * counts')
    .eval('unknown_species = species * counts')
    .groupby('sample_id', as_index=False)
    .agg(
        unknown_phylum=('unknown_phylum', 'sum'),
        unknown_class=('unknown_class', 'sum'),
        unknown_order=('unknown_order', 'sum'),
        unknown_family=('unknown_family', 'sum'),
        unknown_genus=('unknown_genus', 'sum'),
        unknown_species=('unknown_species', 'sum'),
        total_counts=('counts', 'sum'),
    )
    .melt(id_vars=['sample_id', 'total_counts'], 
          var_name='taxonomy_level', 
          value_name='unclassified_counts')
    .assign(taxonomy_level=lambda dd: dd['taxonomy_level'].str.split('_').str[1].str.capitalize())
    .eval('unclassified_rate = unclassified_counts / total_counts')
    .assign(taxonomy_level=lambda dd: pd.Categorical(
        dd['taxonomy_level'],
        categories=['Species', 'Genus', 'Family', 'Order', 'Class', 'Phylum'],
        ordered=True))
    .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
           how='left', on='sample_id')
    .replace({'sample_duration': {'Daily': 'Daily Samples', 'Weekly': 'Weekly Samples'}})
)

(unclassified_reads_df
    .pipe(lambda dd: p9.ggplot(dd)
          + p9.aes('reorder(sample_id, unclassified_rate)', 'unclassified_rate')
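          # one full-height bar per taxonomic level, drawn with position='identity':
          # unclassified_rate shrinks from Species to Phylum, so each later (shorter)
          # layer overplots the previous one, producing the nested look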
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Species"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Genus"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Family"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Order"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Class"'))
          + p9.geom_bar(p9.aes(fill='taxonomy_level'), stat='identity', position='identity', data=dd.query('taxonomy_level=="Phylum"'))
          + p9.scale_y_continuous(
              labels=["0%\n100%", "25%\n75%", "50%\n50%", "75%\n25%"], 
              expand=(0, 0, 0, .05), 
              breaks=[0, 0.25, 0.5, 0.75,])
          + p9.labs(x='', y='Fraction of unclassified/classified reads', fill='')
          + p9.facet_wrap('sample_duration', scales='free_y')
          + p9.guides(fill=p9.guide_legend(nrow=1))
          + p9.coord_flip()
          + p9.theme(
              axis_text_y=p9.element_blank(),
              axis_ticks_major_y=p9.element_blank(),
              panel_grid_major_y=p9.element_blank(),
              strip_background=p9.element_blank(),
              panel_grid=p9.element_blank(),
              legend_position='top',
              figure_size=(7, 4)
          )

          )
)

This means that, at the Species level, only around 10–11% of all reads are classified on average (and thus ~90% are unclassified), quite consistently across sample types. The boxplots below show the comparison for each taxonomic level, with the table summarizing the average percentage of assigned reads at each one. Strikingly, the weekly samples have the lowest percentage of assigned reads at every level, with a sizable chunk of weekly samples showing a low fraction of assigned reads even at the Phylum level. This is quite concerning and might indicate that bacterial DNA degrades at a high rate in these samples.

Show Code
(unclassified_reads_df
 .assign(sample_duration=lambda dd: dd.sample_duration.str.replace(' ', '\n'))
 .pipe(lambda dd: 
       p9.ggplot(dd)
       + p9.aes('sample_duration', '1 - unclassified_rate', fill='sample_duration')
       + p9.geom_jitter(p9.aes(color='sample_duration'), width=.1, size=2, alpha=0.5, stroke=0)
       + p9.geom_boxplot(outlier_alpha=0, alpha=.3, width=.35)
       + p9.facet_wrap('taxonomy_level')
       + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
         + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
         + p9.scale_y_continuous(labels=label_percent(), expand=(0, 0, 0, .05))
       + p9.guides(fill=False, color=False)
       + p9.labs(x='', y='Fraction of classified reads')
       + p9.theme(
           figure_size=(7, 4),
       )
       )
)

Show Code
(unclassified_reads_df
 .assign(classified_rate=lambda dd: 1 - dd['unclassified_rate'])
 .groupby(['sample_duration', 'taxonomy_level']).agg({
    'classified_rate': 'mean',
}).reset_index()
.rename(columns={'sample_duration': 'Sample Group'})
.pivot(index='taxonomy_level', columns='Sample Group', values='classified_rate')
.style
.format('{:.2%}')
.set_caption("Average fraction of classified reads per taxonomic level")
)
Table 1: Average fraction of classified reads per taxonomic level

taxonomy_level   Blanks   Daily Samples   Weekly Samples
Species          11.06%           9.59%            9.68%
Genus            53.76%          51.51%           31.19%
Family           63.95%          59.04%           41.31%
Order            64.87%          64.42%           43.74%
Class            66.64%          68.25%           46.66%
Phylum           68.28%          77.55%           51.97%

Now, the total number of reads that end up unassigned at the Species and Genus levels is really high: well over 50% of all reads are not even assigned to a Genus, and over 90% are not assigned to a Species.

However, in absolute terms, the classified fraction still provides a wealth of information about a rich diversity of Bacteria. Let’s put that into numbers:


Show Code
(bacterial_abundances_long
 .query('counts > 0')
 .query('species != "Unclassified"')
 .groupby(['sample_id'], as_index=False)
 .species.nunique()
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='sample_duration', y='species', fill='sample_duration')
       + p9.geom_jitter(p9.aes(color='sample_duration'), size=2, alpha=0.5, width=0.1, stroke=0)
       + p9.geom_boxplot(outlier_alpha=0, alpha=.3, width=.35)
       + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
       + p9.guides(fill=False, color=False)
       + p9.labs(x='', y='Number of classified species', fill='')
       + p9.theme(
           figure_size=(4, 2),
           axis_title_y=p9.element_text(size=9),
           axis_text=p9.element_text(size=9)
           )
 ))

(bacterial_abundances_long
 .query('counts > 0')
 .query('species != "Unclassified"')
 .groupby(['sample_id'], as_index=False)
 .agg(species=('species', 'nunique'),
      genera=('genus', 'nunique'),
      families=('family', 'nunique'),
      orders=('order', 'nunique'),
      classes=('class', 'nunique'),
      phyla=('phylum', 'nunique'))
    .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
            how='left', on='sample_id')
 .melt(id_vars=['sample_id', 'sample_duration'],
       var_name='taxonomy_level', value_name='num_taxa')

 .assign(taxonomy_level=lambda dd: dd['taxonomy_level'].str.capitalize(),
         tax_level=lambda dd: pd.Categorical(
             dd['taxonomy_level'],
             categories=['Species', 'Genera', 'Families', 'Orders', 'Classes', 'Phyla'],
             ordered=True))
 .assign(sample_duration=lambda dd: pd.Categorical(dd['sample_duration'],
                                                  categories=['Daily', 'Weekly', 'Blanks'],))
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.aes(x='sample_duration', y='num_taxa', fill='sample_duration')
       + p9.geom_jitter(p9.aes(color='sample_duration'), size=2, alpha=0.5, width=0.1, stroke=0)
       + p9.geom_boxplot(outlier_alpha=0, alpha=.3, width=.35)
       + p9.scale_color_manual(values=["#c16514", "#17a860", "#5198ca", ])
       + p9.scale_fill_manual(values=["#c16514", "#17a860", "#5198ca", ])
       + p9.facet_wrap('tax_level', scales='free_y')
       + p9.guides(fill=False, color=False)
       + p9.labs(x='', y='Number of unique taxa', fill='')
       + p9.theme(
           figure_size=(7, 3),
           axis_title_y=p9.element_text(size=9),
           axis_text=p9.element_text(size=9)
           )
)
)

Table of mean number of unique bacterial species per sample group
(bacterial_abundances_long
 .query('counts > 0')
 .query('species != "Unclassified"')
 .groupby(['sample_id'], as_index=False)
 .species.nunique()
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']],
        how='left', on='sample_id')
 .groupby('sample_duration', as_index=False)
 .agg({'species': 'mean'})
 .rename(columns={'species': 'Mean n of unique species',
                  'sample_duration': 'Sample Group'})
 .style
 .hide(axis='index')
 .format({
     'Mean n of unique species': '{:.1f}'
 })
)
Sample Group   Mean n of unique species
Blanks                            101.2
Daily                             329.1
Weekly                            122.1
Table of mean number of unique bacterial genera per sample group
(genus_abundances
 .groupby('sample_id', as_index=False)  # keep sample_id as a column for the merge below
 .agg({'genus': 'nunique'})
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], how='left', on='sample_id')
 .groupby('sample_duration', as_index=False)
 .agg({'genus': 'mean'})
 .rename(columns={'genus': 'Mean n of unique genera',
                  'sample_duration': 'Sample Group'})
 .style
 .hide(axis='index')
 .format({
     'Mean n of unique genera': '{:.1f}'})
)
Sample Group   Mean n of unique genera
Blanks                             77.8
Daily                             264.5
Weekly                            115.9
Show Code
# We will create a sample x genus table for later use
bacterial_genus_table = (bacterial_abundances_long
    .query('counts > 0')
    .query('genus.notna() & genus != "Unclassified"')
    # aggregate with as_index=False so pivot() (a DataFrame method) applies,
    # then fill absent genera with 0 so vegdist() gets a complete matrix
    .groupby(['sample_id', 'genus'], as_index=False)['counts']
    .sum()
    .pivot(index='sample_id', columns='genus', values='counts')
    .fillna(0)
)
%%R -i bacterial_genus_table -i kumamoto_metadata -o bacterial_pcoa_df -o bacterial_pcoa_df_no_blanks -o bacterial_permanova_results -o bacterial_permanova_results_no_blanks

bacterial_dist <- vegan::vegdist(bacterial_genus_table, method = "bray")
# subset with base R so the sample IDs in the rownames are preserved
# (dplyr::filter resets data.frame row names)
genus_no_blanks <- bacterial_genus_table[!grepl("^KB", rownames(bacterial_genus_table)), ]
bacterial_dist_no_blanks <- vegan::vegdist(genus_no_blanks, method = "bray")

# bacterial_nmds <- vegan::metaMDS(bacterial_dist, k = 2)
bacterial_pcoa <- ape::pcoa(bacterial_dist)
bacterial_pcoa_no_blanks <- ape::pcoa(bacterial_dist_no_blanks)
sample_duration <- kumamoto_metadata$sample_duration[match(rownames(bacterial_genus_table), kumamoto_metadata$sample_id)]
sample_duration_no_blanks <- kumamoto_metadata$sample_duration[match(rownames(genus_no_blanks), kumamoto_metadata$sample_id)]
bacterial_permanova_results <- vegan::adonis2(bacterial_genus_table ~ sample_duration, permutations = 999)
bacterial_permanova_results_no_blanks <- vegan::adonis2(genus_no_blanks ~ sample_duration_no_blanks, permutations = 999)

print(bacterial_permanova_results)
print(bacterial_permanova_results_no_blanks)

bacterial_pcoa_df <- as.data.frame(bacterial_pcoa$vectors[, 1:2])
colnames(bacterial_pcoa_df) <- c('PCoA1', 'PCoA2')
bacterial_pcoa_df$sample_id <- rownames(bacterial_pcoa_df)
bacterial_pcoa_df_no_blanks <- as.data.frame(bacterial_pcoa_no_blanks$vectors[, 1:2])
colnames(bacterial_pcoa_df_no_blanks) <- c('PCoA1', 'PCoA2')
bacterial_pcoa_df_no_blanks$sample_id <- rownames(bacterial_pcoa_df_no_blanks)
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = bacterial_genus_table ~ sample_duration, permutations = 999)
          Df SumOfSqs      R2      F Pr(>F)    
Model      2    3.549 0.11142 7.5234  0.001 ***
Residual 120   28.305 0.88858                  
Total    122   31.854 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Permutation test for adonis under reduced model
Permutation: free
Number of permutations: 999

vegan::adonis2(formula = genus_no_blanks ~ sample_duration_no_blanks, permutations = 999)
          Df SumOfSqs      R2      F Pr(>F)    
Model      1   2.4666 0.08797 10.803  0.001 ***
Residual 112  25.5721 0.91203                  
Total    113  28.0387 1.00000                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(bacterial_pcoa_df
 .merge(bacterial_pcoa_df_no_blanks, 
        on='sample_id', how='left', suffixes=('__With Blanks', '__Without Blanks'))
 .melt(id_vars='sample_id')
 .assign(sample_group=lambda dd: dd['variable'].str.split('__').str[1]) 
 .assign(variable=lambda dd: dd['variable'].str.split('__').str[0])
 .pivot(index=['sample_id', 'sample_group'], columns='variable', values='value')
 .reset_index()
 .dropna()
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], how='left', on='sample_id')
  .pipe(lambda dd: p9.ggplot(dd)
  + p9.aes('PCoA1', 'PCoA2', color='sample_duration')
  + p9.stat_ellipse(p9.aes(fill='sample_duration'), alpha=0.1, level=0.8, geom='polygon')
  + p9.geom_point(size=3, alpha=0.8, stroke=0)
  + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
  + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
  + p9.facet_wrap('sample_group')
  + p9.labs(title='',
           x='PCoA$_1$', y='PCoA$_2$', color='', fill='',
            caption=
        f'PERMANOVA p-value with blanks: {bacterial_permanova_results.iloc[0, 4]:.3f} ' +
        f'(R$^2$ = {bacterial_permanova_results.iloc[0, 2]:.1%}) | ' +
        f'without blanks: {bacterial_permanova_results_no_blanks.iloc[0, 4]:.3f} ' +
        f'(R$^2$ = {bacterial_permanova_results_no_blanks.iloc[0, 2]:.1%})',)
  + p9.ggtitle('PCoA of Bacterial Genus-Level Abundances')
  + p9.theme(
      legend_position='top',
       figure_size=(7, 4)
  )
 )
)

(bacterial_pcoa_df
 .merge(kumamoto_metadata[['sample_id', 'sample_duration']], on='sample_id', how='left')
 .pipe(lambda dd: p9.ggplot(dd)
  + p9.aes('PCoA1', 'PCoA2', color='sample_duration')
  + p9.stat_ellipse(p9.aes(fill='sample_duration'), alpha=0.1, level=0.8, geom='polygon')
  + p9.geom_point(size=3, alpha=0.8, stroke=0)
  + p9.scale_color_manual(values=["#5198ca", "#c16514", "#17a860"])
  + p9.scale_fill_manual(values=["#5198ca", "#c16514", "#17a860"])
  + p9.labs(title='',
           x='PCoA$_1$', y='PCoA$_2$', color='', fill='')
  + p9.theme(
      legend_position='top',
  )
 )
)

Appendix (Birds)

Show Code
aves_notes = {
    "Neognathae": "Clade containing most modern birds except ratites and tinamous.",
    "Palaeognathae": "Clade including ratites (ostrich, emu) and tinamous.",
    "Limosa lapponica baueri": "Bar-tailed godwit subspecies, a long-distance migratory shorebird.",
    "Aves": "Class of all birds.",
    "Phasianidae": "Family of ground-dwelling birds including pheasants, chickens, and turkeys.",
    "Gallus gallus": "Red junglefowl, wild ancestor of domestic chickens.",
    "Calidris pugnax": "Ruff, a migratory wader species in the sandpiper family.",
    "Scolopacidae": "Family of wading birds like sandpipers and curlews.",
    "Galloanserae": "Clade grouping landfowl (Galliformes) and waterfowl (Anseriformes).",
    "Charadrius vociferus": "Killdeer, a plover common in open habitats in North America.",
    "Haliaeetus leucocephalus": "Bald eagle, large bird of prey from North America.",
    "Hirundo rustica rustica": "Nominate subspecies of barn swallow, widespread in Eurasia.",
    "Charadriiformes": "Order of shorebirds including plovers, gulls, and auks.",
    "Colinus virginianus": "Northern bobwhite, a quail native to North America.",
    "Pipra filicauda": "Wire-tailed manakin, a small South American bird.",
    "Galliformes": "Order of heavy-bodied ground-feeding birds (chickens, turkeys, quails).",
    "Columba livia": "Rock pigeon, common in urban areas worldwide.",
    "Coturnix japonica": "Japanese quail, a domesticated quail species.",
    "Dromaius novaehollandiae": "Emu, large flightless bird native to Australia.",
    "Phasianinae": "Subfamily of pheasants and related birds.",
    "Balearica regulorum gibbericeps": "Grey crowned crane subspecies from Africa.",
    "Chloebia gouldiae": "Gouldian finch, a colorful Australian passerine.",
    "Bambusicola thoracicus": "Chinese bamboo partridge, a ground bird from East Asia.",
    "Gruiformes": "Order including cranes, rails, and coots.",
    "Buceros rhinoceros silvestris": "Subspecies of rhinoceros hornbill from Southeast Asia.",
    "Oxyura jamaicensis": "Ruddy duck, a stiff-tailed duck from the Americas.",
    "Chaetura pelagica": "Chimney swift, a migratory bird from the Americas.",
    "Falco naumanni": "Lesser kestrel, a small migratory falcon.",
    "Callipepla squamata": "Scaled quail, found in southwestern US and Mexico.",
    "Zosterops borbonicus": "Réunion grey white-eye, endemic passerine to Réunion Island.",
    "Gavia stellata": "Red-throated loon, a migratory aquatic bird.",
    "Passeroidea": "Superfamily of passerine birds including sparrows and finches.",
    "Corvus": "Genus of crows, ravens, rooks, and jackdaws.",
    "Lagopus leucura": "White-tailed ptarmigan, a grouse living in alpine habitats.",
    "Passeriformes": "Order of perching birds, the largest bird order."
}

aves_unlikely_presence = [
    "Dromaius novaehollandiae", 
    "Chloebia gouldiae", 
    "Buceros rhinoceros silvestris", 
    "Haliaeetus leucocephalus",
    "Balearica regulorum gibbericeps",
    "Pipra filicauda",
]



(abundance_tax_long
 .query('level_27 == " Aves"')
 .assign(full_name=lambda dd: dd.apply(lambda row: '; '.join(row[tax_levels].dropna()), axis=1))
 .assign(last_name=lambda dd: dd.full_name.str.split(';').str[-1].str.strip())
 .groupby('last_name', as_index=False)
 .agg({'counts': 'sum'})
 .sort_values('counts', ascending=False)
 .assign(Frequency=lambda dd: dd.counts / dd.counts.sum())
 .assign(note=lambda dd: dd.last_name.map(aves_notes).fillna(''))
 .rename(columns={'last_name': 'Name (deepest tax. level)',
                  'counts': 'Reads'})
 
 .style
 .format({'Frequency': '{:.2%}',
          'Reads': '{:.0f}'})
 .hide(axis='index')
 # text align left
 .set_properties(**{'text-align': 'left', 'font-family': 'monospace',
                    'font-size': '9pt',
                    'width': '1200px'})
 .bar(subset=['Frequency'], color='#5198ca', align='mid')
 # color red for unlikely presence
 .map(lambda x: 'color: red' if x in aves_unlikely_presence else '',
       subset=['Name (deepest tax. level)']
 ))
Name (deepest tax. level)           Reads  Frequency  Note
Neognathae                         253914     40.23%  Clade containing most modern birds except ratites and tinamous.
Limosa lapponica baueri            113472     17.98%  Bar-tailed godwit subspecies, a long-distance migratory shorebird.
Aves                                59533      9.43%  Class of all birds.
Phasianidae                         39584      6.27%  Family of ground-dwelling birds including pheasants, chickens, and turkeys.
Gallus gallus                       33016      5.23%  Red junglefowl, wild ancestor of domestic chickens.
Calidris pugnax                     16638      2.64%  Ruff, a migratory wader species in the sandpiper family.
Scolopacidae                         9934      1.57%  Family of wading birds like sandpipers and curlews.
Galloanserae                         9279      1.47%  Clade grouping landfowl (Galliformes) and waterfowl (Anseriformes).
Charadrius vociferus                 9086      1.44%  Killdeer, a plover common in open habitats in North America.
Haliaeetus leucocephalus             8342      1.32%  Bald eagle, large bird of prey from North America.
Hirundo rustica rustica              7571      1.20%  Nominate subspecies of barn swallow, widespread in Eurasia.
Charadriiformes                      7351      1.16%  Order of shorebirds including plovers, gulls, and auks.
Colinus virginianus                  6431      1.02%  Northern bobwhite, a quail native to North America.
Pipra filicauda                      4862      0.77%  Wire-tailed manakin, a small South American bird.
Galliformes                          4575      0.72%  Order of heavy-bodied ground-feeding birds (chickens, turkeys, quails).
Columba livia                        4470      0.71%  Rock pigeon, common in urban areas worldwide.
Coturnix japonica                    3955      0.63%  Japanese quail, a domesticated quail species.
Dromaius novaehollandiae             3812      0.60%  Emu, large flightless bird native to Australia.
Phasianinae                          3453      0.55%  Subfamily of pheasants and related birds.
Balearica regulorum gibbericeps      3222      0.51%  Grey crowned crane subspecies from Africa.
Chloebia gouldiae                    3059      0.48%  Gouldian finch, a colorful Australian passerine.
Bambusicola thoracicus               2705      0.43%  Chinese bamboo partridge, a ground bird from East Asia.
Gruiformes                           2646      0.42%  Order including cranes, rails, and coots.
Buceros rhinoceros silvestris        2409      0.38%  Subspecies of rhinoceros hornbill from Southeast Asia.
Oxyura jamaicensis                   2274      0.36%  Ruddy duck, a stiff-tailed duck from the Americas.
Chaetura pelagica                    2236      0.35%  Chimney swift, a migratory bird from the Americas.
Falco naumanni                       2210      0.35%  Lesser kestrel, a small migratory falcon.
Callipepla squamata                  2168      0.34%  Scaled quail, found in southwestern US and Mexico.
Zosterops borbonicus                 1963      0.31%  Réunion grey white-eye, endemic passerine to Réunion Island.
Gavia stellata                       1959      0.31%  Red-throated loon, a migratory aquatic bird.
Passeroidea                          1683      0.27%  Superfamily of passerine birds including sparrows and finches.
Corvus                               1469      0.23%  Genus of crows, ravens, rooks, and jackdaws.
Lagopus leucura                       942      0.15%  White-tailed ptarmigan, a grouse living in alpine habitats.
Palaeognathae                         663      0.11%  Clade including ratites (ostrich, emu) and tinamous.
Passeriformes                         297      0.05%  Order of perching birds, the largest bird order.
migratory_species = (pd.DataFrame(aves_notes.items(), columns=['Name', 'Note'])
                     .query('Note.str.contains("migr")')
                     ['Name']
)
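With the notes above, this picks out Limosa lapponica baueri, Calidris pugnax, Chaetura pelagica, Falco naumanni and Gavia stellata as the taxa flagged as migratory.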
abundance_tax_long.query('sample_type=="Sample"').sample_id.nunique()
114
(abundance_tax_long
 .query('level_27 == " Aves"')
 .groupby(['sample_id', 'sample_type'])
 .agg({'counts': 'sum'})
 .reset_index()
 .groupby('sample_type')
 .agg(total_reads=('counts', 'sum'),
      avg_reads=('counts', 'mean'),
      num_samples=('counts', lambda x: (x > 0).sum()),
)
)
sample_type   total_reads    avg_reads  num_samples
Blank                 0.0     0.000000            0
Sample           631183.0  5536.692982           55
Show Code
most_common_aves = (abundance_tax_long
 .query('level_27 == " Aves"')
 .assign(full_name=lambda dd: dd.apply(lambda row: '; '.join(row[tax_levels].dropna()), axis=1))
 .assign(last_name=lambda dd: dd.full_name.str.split(';').str[-1].str.strip())
 .query('counts > 0')
 .groupby('last_name', as_index=False)
 .size()
 .sort_values('size', ascending=False)
 ['last_name']
)

(abundance_tax_long
 .query('level_27 == " Aves"')
 .assign(full_name=lambda dd: dd.apply(lambda row: '; '.join(row[tax_levels].dropna()), axis=1))
 .assign(last_name=lambda dd: dd.full_name.str.split(';').str[-1].str.strip())
 .merge(kumamoto_metadata_no_blanks[['sample_id', 'date_start', 'date_end', 'sample_duration']],)
 .assign(date=lambda x: np.where(x.sample_duration == "Weekly", x.date_start + pd.Timedelta(days=3), x.date_start))
 .groupby('last_name', as_index=False).apply(lambda dd: dd.assign(counts=dd.counts / dd.counts.max()))
 .assign(last_name=lambda dd: pd.Categorical(dd.last_name, categories=most_common_aves, ordered=True))
 .pipe(lambda dd: p9.ggplot(dd)
       + p9.facet_wrap('~last_name', ncol=5)
       + p9.scale_x_datetime(date_breaks='3 months', date_labels='%b %Y', expand=(0, 0))
       + p9.labs(x='', y="Read counts (% of each taxon's maximum)")
       + p9.scale_y_continuous(expand=(0, 0), labels=lambda x: [f"{v:.0%}" for v in x])
       + p9.geom_rect(p9.aes(xmin='date_start', xmax='date_end', ymin=0, ymax='counts'))
       + p9.theme_bw()
       + p9.theme(
           figure_size=(16 * 0.75, 9 * 0.75),
           dpi=300,
           panel_grid_major_y=p9.element_blank(),
           panel_grid_major_x=p9.element_line(linetype='dashed', color='lightgrey'),
           strip_text=p9.element_text(size=9, family='Georgia', margin={'b': 2, 't': 2})
       ))
       )