Synthetic Data

Use a statistical model of single nucleotide polymorphisms (SNPs) to synthesize genomic data.
Use another model to combine that data with geocoded samples to generate snail sizes.
Build a program to analyze and visualize the synthetic data to test the analysis pipeline.

Terms defined: single nucleotide polymorphism, synthetic data generator

How do you test a data analysis pipeline?
- Unit tests of the kind used for invasion percolation aren’t necessarily appropriate
- Problem isn’t the control flow of the code but how it interacts with data
[Alexander2023]: build a synthetic data generator
- Check the behavior of the pipeline by pushing data with known properties through it
- If it finds something statistically significant in white noise, rethink the pipeline
Faker has tools for generating a lot of useful data (we will use it in future chapters)
But it can’t generate genomes, so we’ll build our own

Synthesizing Genomes

Create some short snail genomes with single-nucleotide polymorphisms
Output will have:
- Length: for now, all sequences the same length)
- Reference sequence: the unmutated original
- Locations: where mutations can occur
- Susceptible location and base: which of those locations is the key to variance, and what mutated base causes the visible mutation
- Individuals: list of generated genomes
Since this has multiple fields, store as JSON
Main driver

def main():
    """Main driver."""
    args = parse_args()
    random.seed(args.seed)
    genomes = random_genomes(args.length, args.num_genomes, args.num_snp, args.prob_other)
    add_susceptibility(genomes)
    save(args.outfile, genomes)

Random genomes
- Create a random base sequence as the reference genome
- Create duplicates for individuals
- Determine where SNPs can occur
- Introduce significant mutations
- Introduce other random mutations
- Sort for reproducibility [Taschuk2017]

def random_genomes(length, num_genomes, num_snp, prob_other):
    """Generate a set of genomes with specified number of point mutations."""
    assert 0 <= num_snp <= length
    reference = random_bases(length)
    individuals = [reference] * num_genomes
    locations = random.sample(list(range(length)), num_snp)
    for loc in locations:
        candidates = _other_bases(reference, loc)
        bases = [reference[loc]] + random.sample(candidates, k=len(candidates))
        individuals = [_mutate_snps(reference, ind, loc, bases) for ind in individuals]
    other_locations = list(set(range(length)) - set(locations))
    individuals = [_mutate_other(ind, prob_other, other_locations) for ind in individuals]
    individuals.sort()
    locations.sort()
    return GenePool(length=length, reference=reference, individuals=individuals, locations=locations)

Random bases and mutations

def random_bases(length):
    """Generate a random sequence of bases of the specified length."""
    assert 0 < length
    return ''.join(random.choices(DNA, k=length))

def _mutate_snps(reference, genome, loc, bases):
    """Introduce single nucleotide polymorphisms at the specified location."""
    choice = _choose_one(bases, SNP_PROBS)
    return genome[:loc] + choice + genome[loc + 1:]

def _mutate_other(genome, prob, locations):
    """Introduce other mutations at specified locations."""
    if random.random() > prob:
        return genome
    loc = random.sample(locations, k=1)[0]
    base = random.choice(_other_bases(genome, loc))
    genome = genome[:loc] + base + genome[loc + 1:]
    return genome

Synthesizing Samples

Snail size depends on:
- Presence of significant mutation
- Distance from epicenter of pollution
Building this forces us to be explicit about our model
- Variance depends on a single nucleotide
- And on linear distance from a center point
So:
- Load the individuals
- Put each individual somewhere near the spill site
- Generate a random reading for its size that depends on two factors
Main driver should be starting to look familiar

def main():
    """Main driver."""
    args = parse_args()
    random.seed(args.seed)
    genomes = json.loads(Path(args.genomes).read_text())
    geo_params = get_geo_params(args)
    samples = generate_samples(args, genomes, geo_params)
    save(args, samples)

Get geographic parameters from CSV files
- Need to join tables to get longitude, latitude, and nominal pollution radius

site,lon,lat
COW,-124.04519,48.82172
YOU,-124.197,48.87251
HMB,-124.17555,48.81673
GBY,-124.4593,48.9209

label,site,date,num,peak,relative_sd,radius
1748,COW,2023-04-27,23,100.0,0.1,0.1
1749,COW,2023-04-28,11,100.0,0.1,0.1
1755,COW,2023-05-13,15,101.0,0.11,0.1
1781,YOU,2023-05-01,12,90.0,0.15,0.15
1790,HMB,2023-05-02,19,107.0,0.22,0.11
1803,GBY,2023-05-08,8,95.0,0.1,0.14

def get_geo_params(args):
    """Get geographic parameters."""
    sites = pd.read_csv(Path(args.paramsdir, 'sites.csv'))
    surveys = pd.read_csv(Path(args.paramsdir, 'surveys.csv'))
    combined = sites.merge(surveys, how='inner', on='site')
    filtered = combined[combined['site'] == args.site].iloc[0]
    return {'lon': filtered['lon'], 'lat': filtered['lat'], 'radius': filtered['radius']}

Once we have those, generate location and snail size based on genetics and distance

def generate_samples(args, genomes, geo_params):
    """Generate snail samples."""
    samples = []
    for sequence in genomes['individuals']:
        point, distance = random_geo_point(**geo_params)
        if sequence[genomes['susceptible_loc']] == genomes['susceptible_base']:
            limit = args.mutant
        else:
            limit = args.normal
        scale = limit * distance / geo_params['radius']
        reading = random.uniform(MIN_SNAIL_SIZE, MIN_SNAIL_SIZE + MAX_SNAIL_SIZE * scale)
        samples.append((point.longitude, point.latitude, sequence, reading))
    df = pd.DataFrame(samples, columns=('lon', 'lat', 'sequence', 'reading'))
    df['lon'] = df['lon'].round(LON_LAT_PRECISION)
    df['lat'] = df['lat'].round(LON_LAT_PRECISION)
    df['reading'] = df['reading'].round(SNAIL_PRECISION)
    return df

Keep the magic numbers at the top of the file
- If we wanted to vary these, would store them as JSON or YAML and load
- In particular, would do this if we were sweeping parameters as in Chapter 6

CIRCLE = 360.0
LON_LAT_PRECISION = 5
READING_PRECISION = 1
MIN_SNAIL_SIZE = 0.5
MAX_SNAIL_SIZE = 5.0
SNAIL_PRECISION = 1

Analysis

Finally ready to write our analysis
Read the CSV data with locations, genomes, and readings
Find all candidate locations where sequences don’t agree
Create a scatter plot by location and base
- One showing all data
- One showing only locations where there are variations

scatterplot of all readings at all locations — Figure 8.1: Reading as a function of location and base (all)

Plot reading versus mutation location in ranked order for interesting locations

rank plot of readings at mutation locations — Figure 8.3: Reading as a function of location (rank order)

Clearly a winner in this plot…
…but not nearly as clear in scatter plot…
…and the winner has only twice the reading of the next-highest (random) value
Statistics could tell us if this is what we expect, but this isn’t a statistics lesson