forked from popsim-consortium/adding-species-manuscript
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathadding_species_manuscript.tex
executable file
·1024 lines (920 loc) · 65.6 KB
/
adding_species_manuscript.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[hidelinks]{article}
\usepackage[letterpaper,margin=1.0in]{geometry}
\usepackage[utf8]{inputenc}
\pagenumbering{arabic}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[singlelinecheck=false]{caption} % singlelinecheck makes single line caption left aligned instead of centered
\usepackage{subcaption}
\usepackage{amsmath}
\usepackage[round]{natbib}
\usepackage{fancyhdr}
\usepackage{longtable}
\usepackage{booktabs}
% hyperlinks
\usepackage{hyperref}
\usepackage{xspace}
\usepackage{mathrsfs}
\usepackage{graphicx}
\pagestyle{fancy}
\fancyhead[R]{\textbf{Expanding \stdpopsim}}
% for highlighting text
\usepackage{xcolor}
\usepackage{soul}
% bibliography
\usepackage[round]{natbib} % omit 'round' option if you prefer square brackets
\bibliographystyle{plainnat}
\newcommand{\Stdpopsim}{\texttt{Stdpopsim}\xspace}
\newcommand{\stdpopsim}{\texttt{stdpopsim}\xspace}
%commands to format figure and table references in the supplement
\newcommand{\beginsupplement}{%
\fancyhead[L]{Supplemental Material}
\setcounter{table}{0}
\renewcommand{\thetable}{S\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{S\arabic{figure}}%
}
\newcommand{\stopsupplement}{%
\setcounter{table}{0}
\renewcommand{\thetable}{\arabic{table}}%
\setcounter{figure}{0}
\renewcommand{\thefigure}{\arabic{figure}}%
}
\makeatletter
\newcommand{\labelname}[1]{\def\@currentlabelname{#1}}
\makeatother
% Avoid pandoc bug when there are lists in the body.
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\title{Expanding the \stdpopsim species catalog, and lessons learned for realistic genome simulations}
\author[1,+]{M. Elise Lauterbur}
\author[2,*]{Maria Izabel A. Cavassim}
\author[3,*]{Ariella L. Gladstein}
\author[4,*]{Graham Gower}
\author[5*]{Nathaniel S. Pope}
\author[6,*]{Georgia Tsambos}
\author[5,7]{Jeff Adrion}
\author[5]{Saurabh Belsare}
\author[8]{Arjun Biddanda}
\author[5]{Victoria Caudill}
\author[9]{Jean Cury}
\author[10]{Ignacio Echevarria}
\author[11]{Benjamin C. Haller}
\author[12,13]{Ahmed R. Hasan}
\author[14,15]{Xin Huang}
\author[16]{Leonardo Nicola Martin Iasi}
\author[17]{Ekaterina Noskova}
\author[18]{Jana Obšteter}
\author[19]{Vitor Antonio Corrêa Pavinato}
\author[20,21]{Alice Pearson}
\author[22,23]{David Peede}
\author[24]{Manolo F. Perez}
\author[5]{Murillo F. Rodrigues}
\author[5]{Chris C. R. Smith}
\author[25]{Jeffrey P. Spence}
\author[5]{Anastasia Teterina}
\author[5]{Silas Tittes}
\author[26]{Per Unneberg}
\author[27]{Juan Manuel Vazquez}
\author[28]{Ryan K. Waples}
\author[29]{Anthony Wilder Wohns}
\author[30]{Yan Wong}
\author[31]{Franz Baumdicker}
\author[32]{Reed A. Cartwright}
\author[33]{Gregor Gorjanc}
\author[34]{Ryan N. Gutenkunst}
\author[30]{Jerome Kelleher}
\author[5]{Andrew D. Kern}
\author[35]{Aaron P. Ragsdale}
\author[5,36]{Peter L. Ralph}
\author[37]{Daniel R. Schrider}
\author[38,+]{Ilan Gronau}
\affil[*]{\small{These authors contributed equally to the paper.}}
\affil[+]{\small{Corresponding authors: [email protected] ; [email protected].}}
\affil[1]{\small{Department of Ecology and Evolutionary Biology, University of Arizona, Tucson AZ 85719, USA}}
\affil[2]{\small{Department of Ecology and Evolutionary Biology, University of California, Los Angeles, Los Angeles CA, USA}}
\affil[3]{\small{Embark Veterinary, Inc., Boston MA 02111, USA}}
\affil[4]{\small{Section for Molecular Ecology and Evolution, Globe Institute, University of Copenhagen, Denmark}}
\affil[5]{\small{Institute of Ecology and Evolution, University of Oregon, Eugene OR 97402, USA}}
\affil[6]{\small{School of Mathematics and Statistics, University of Melbourne, Australia}}
\affil[7]{\small{AncestryDNA, San Francisco CA 94107, USA}}
\affil[8]{\small{54Gene, Inc., Washington DC 20005, USA}}
\affil[9]{\small{Université Paris-Saclay, CNRS, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, UMR 9015 Orsay, France}}
\affil[10]{\small{School of Life Sciences, University of Glasgow, Glasgow, UK}}
\affil[11]{\small{Department of Computational Biology, Cornell University, Ithaca NY, USA}}
\affil[12]{\small{Department of Cell and Systems Biology, University of Toronto, Toronto ON, Canada}}
\affil[13]{\small{Department of Biology, University of Toronto Mississauga, Mississauga ON, Canada}}
\affil[14]{\small{Department of Evolutionary Anthropology, University of Vienna, Vienna, Austria}}
\affil[15]{\small{Human Evolution and Archaeological Sciences (HEAS), University of Vienna, Vienna, Austria}}
\affil[16]{\small{Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany}}
\affil[17]{\small{Computer Technologies Laboratory, ITMO University, St Petersburg, Russia}}
\affil[18]{\small{Agricultural Institute of Slovenia, Department of Animal Science, Ljubljana, Slovenia}}
\affil[19]{\small{Entomology Department, The Ohio State University, Wooster OH, USA}}
\affil[20]{\small{Department of Genetics, University of Cambridge, Cambridge, UK}}
\affil[21]{\small{Department of Zoology, University of Cambridge, Cambridge, UK}}
\affil[22]{\small{Department of Ecology, Evolution, and Organismal Biology, Brown University, Providence RI, USA}}
\affil[23]{\small{Center for Computational Molecular Biology, Brown University, Providence RI, USA}}
\affil[24]{\small{Department of Genetics and Evolution, Federal University of Sao Carlos, Sao Carlos 13565905, Brazil}}
\affil[25]{\small{Department of Genetics, Stanford University School of Medicine, Stanford CA 94305, USA}}
\affil[26]{\small{Department of Cell and Molecular Biology, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Uppsala University, Husargatan 3, SE-752 37 Uppsala, Sweden}}
\affil[27]{\small{Department of Integrative Biology, University of California, Berkeley, Berkeley CA, USA}}
\affil[28]{\small{Department of Biostatistics, University of Washington, Seattle WA, USA}}
\affil[29]{\small{Broad Institute of MIT and Harvard, Cambridge MA 02142, USA}}
\affil[30]{\small{Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK}}
\affil[31]{\small{Cluster of Excellence - Controlling Microbes to Fight Infections, Eberhard Karls Universität Tübingen, Tübingen, Baden-Württemberg, Germany}}
\affil[32]{\small{School of Life Sciences and The Biodesign Institute, Arizona State University, Tempe AZ, USA}}
\affil[33]{\small{The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Edinburgh EH25 9RG, UK}}
\affil[34]{\small{Department of Molecular and Cellular Biology, University of Arizona, Tucson AZ 85721, USA}}
\affil[35]{\small{Department of Integrative Biology, University of Wisconsin-Madison, Madison WI, USA}}
\affil[36]{\small{Department of Mathematics, University of Oregon, Eugene OR 97402, USA}}
\affil[37]{\small{Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill NC 27599, USA}}
\affil[38]{\small{Efi Arazi School of Computer Science, Reichman University, Herzliya, Israel}}
\date{\small{\today{}}}
\begin{document}
\maketitle
\section*{Abstract}
Simulation is a key tool in population genetics for both methods development and empirical research,
but producing simulations that recapitulate the main features of genomic data sets remains a major obstacle.
Today, more realistic simulations are possible thanks to large increases in
the quantity and quality of available genetic data,
and to the sophistication of inference and simulation software.
However, implementing these simulations still requires substantial time and specialized knowledge.
These challenges are especially pronounced for simulating genomes for species that are not well-studied,
since it is not always clear what information is required
to produce simulations with a level of realism
sufficient to confidently answer a given question.
The community-developed framework \stdpopsim seeks to lower this barrier
by facilitating the simulation of complex population genetic models using
up-to-date information.
The initial version of \stdpopsim focused on establishing this framework using
six well-characterized model species \citep{Adrion2020}.
Here, we report on major improvements made in the new release of \stdpopsim (version 0.2),
which includes a significant expansion of the species catalog and substantial additions to simulation capabilities.
Features added to improve the realism of the simulated genomes include non-crossover recombination and provision of species-specific genomic annotations.
Through community-driven efforts, we expanded the number of species in the catalog more than three-fold and broadened coverage across the tree of life.
During the process of expanding the catalog,
we have identified common sticking points and developed best practices for setting up genome-scale simulations.
We describe the input data required for generating a realistic simulation,
suggest good practices for obtaining the relevant information from the literature,
and discuss common pitfalls and major considerations.
These improvements to \stdpopsim aim to further promote the use of realistic whole-genome population genetic simulations,
especially in non-model organisms,
making them available, transparent, and accessible to everyone.
\section*{Introduction}
\label{introduction}
%REVISION: modified beginning of paragraph (rev 1 comment)
Population genetics allows us to answer questions across scales from deep evolutionary time to ongoing ecological dynamics,
and dramatic reductions in sequencing costs enable the generation of
unprecedented amounts of genomic data that can be used to address these questions \citep{Ellegren2014}.
Ongoing efforts to systematically sequence life on
Earth by initiatives such as the Earth Biogenome \citep{Lewin2022} and its
affiliated project networks, such as Vertebrate Genomes
\citep{Rhie2021}, 10,000 Plants \citep{Cheng2018} and others \citep{darwin2022sequence},
are providing the backbone for enormous increases in the amount of population-level genomic data available for model and non-model species.
These data are being used, among other things,
in inference of population history and demographic parameters \citep{Beichman2018},
studying adaptive introgression \citep{Gower2021},
providing null expectations for selection scans \citep[e.g.][]{Hsieh2021},
and understanding the implications of deleterious variation in populations of conservation concern \citep[e.g.][]{Robinson2023}.
%Methods that use these data for purposes such as inference of demographic history
%and natural selection are also flourishing \citep{}.
While many of the methods that address these questions were initially developed for a few key model systems such as humans and \emph{Drosophila},
more recent efforts are generalizing these methods to include
important factors not initially accounted for,
such as inbreeding or selfing \citep{Blischak2020},
skewed offspring distributions \citep{Montano2016},
and intense artificial selection
even for non-model organisms \citep{MacLeod2013, MacLeod2014}.
Simulations can be useful at all stages of this work---for
planning studies, analyzing data, testing inference methods,
and validating findings from empirical and theoretical research.
For instance, simulations provide training data
for inference methods based on machine learning \citep{Schrider2018} and
Approximate Bayesian Computation \citep{Csillery2010}. They can also serve as
baselines for further analyses: for example, simulations incorporating
demographic history serve as null models when detecting selection \citep{Hsieh2016a}
or seed downstream breeding program simulations \citep{Gaynor2020}.
More recently, population genomic simulations have been used to help guide conservation decisions for threatened species
\citep{Teixeira2021,kyriazis2022using}.
Increasing amounts of data and sophistication of inference methods
have enabled researchers to ask ever more
specific and precise questions. Consequently, simulations must incorporate
more and more elements of biological realism.
Important elements include genomic features such as mutation and recombination
rates that strongly affect genetic variation and haplotype structure
\citep{Nachman2002}. The inclusion of these genomic features is particularly important
when linked selection is acting upon the patterns of genomic diversity being studied \citep{Cutter2013}.
Furthermore, the demographic history of a species---encompassing population sizes and distributions, divergences, and gene flow---can
dramatically affect patterns of genomic variation \citep{Teshima2006}. Thus
species-specific estimates of these and other ecological and evolutionary parameters
(such as those governing the process of natural selection)
are important when generating realistic simulations.
This presents challenges, especially to new researchers,
as it takes a great deal of specialized knowledge not only to code the simulations themselves
but also to find and choose appropriate estimates of the parameters underlying the simulation model.
The recently developed community resource \stdpopsim provides easy
access to detailed population genomic simulations \citep{Adrion2020}. It
lowers the technical barriers to performing these simulations
and reduces the possibility of erroneous implementation of simulations
for species with published demographic models.
The initial release of \stdpopsim was
restricted to only six well-characterized model species, such as
\emph{Drosophila melanogaster} and \emph{Homo sapiens},
but feedback we received from the community identified a widespread desire
to simulate a broader range of non-model species,
and ideally to incorporate these into the \stdpopsim catalog for future use.
This feedback, and subsequent efforts to expand the catalog,
also uncovered a vital need to better understand when it is practical to create a realistic
simulation of a species of interest, and indeed what ``realistic'' means in this context.
This paper reports on the updates made in the current release of \stdpopsim (version 0.2),
and is also intended as a resource for any researcher
who wishes to develop chromosome-scale simulations for their own species of interest.
We start by describing the central idea behind the standardized simulation framework
of \stdpopsim,
and then outline the main updates made to the \stdpopsim catalog and simulation framework
in the past two years.
We then provide guidelines for
generating population genomic simulations, either for the purpose of using them in one specific study,
or with the intent of making the simulations available for future work by adding the appropriate models to \stdpopsim.
Among other considerations, we discuss when a chromosome-scale simulation is more useful than
simulations based on either individual loci or generic %(non-species specific)
loci.
% commented out 'non-species specific' bove, since Dan pointed out that a generic locus could be species-specific (e.g., using the demographic model of that species). The point here is not about being species-specific, but about simulating long stretches of DNA.
We specify the required input data,
mention common pitfalls in choosing appropriate parameters,
and suggest courses of action for species that are missing estimates of some necessary inputs.
We conclude with examples from two species recently added to \stdpopsim,
which demonstrate some of the main considerations involved in the process of designing realistic chromosome-scale simulations.
While the guidelines provided in this paper are intended for any researcher interested in implementing a population genomic simulation using any software,
we highlight the ways in which the \stdpopsim framework eases the burden involved in this process and facilitates reproducible research.
\section*{The utility of \stdpopsim for chromosome-scale simulations}
\label{sec:std-sim}
We begin by providing a brief overview of the importance of chromosome-scale simulations and the main rationale behind \stdpopsim;
see \citet{Adrion2020} for more on the topic.
The main objective of population genomic simulations is to recreate
patterns of sequence variation along the genome under
the inferred evolutionary history of a given species.
To achieve this, \stdpopsim is built on top of the
\texttt{msprime} \citep{Kelleher2016,Nelson2020,Baumdicker2022}
and \texttt{SLiM} \citep{Haller2019} simulation engines,
which are capable of producing fairly realistic patterns of sequence variation
if provided with accurate descriptions of the genome architecture
and evolutionary history of the simulated species.
The required parameters include the number of chromosomes and their lengths,
mutation and recombination rates, the demographic history of the simulated population,
and, potentially, the landscape of natural selection along the genome.
A key challenge when setting up a population genomic simulation is to
obtain estimates of all of these quantities from the literature
and then correctly implement them in an appropriate simulation engine.
Detailed estimates of all of these quantities are increasingly available
due to the growing availability of population genomic data
coupled with methodological advances. Incorporating this data
into a population genomic simulation often involves
integrating this data between different literature sources, which can
require specialized knowledge of population genetics theory.
Thus, the process of coding a realistic simulation can be quite time-consuming and often error-prone.
The main objective of \stdpopsim is to streamline this process,
and to make it more robust and more reproducible.
Contributors collect parameter values for their species of interest from the literature,
and then specify these parameters in a template file for the new model.
This model then undergoes a peer-review process,
which involves another researcher independently recreating the model based on the provided documentation.
Automated scripts then execute to compare the two models;
if discrepancies are found in this process, they are resolved by discussion between the contributor and reviewer,
and if necessary with input of additional members of the community.
This quality-control process quite often finds subtle bugs \citep[e.g., as in][]{Ragsdale2020}
or highlights parts of the model that are ambiguously defined by the literature sources.
This increases the reliability and reproducibility of the resulting simulations in any downstream analysis.
Another important goal of \stdpopsim is to promote and facilitate chromosome-scale simulations,
as opposed to the common practice of simulating many short segments \citep[see, e.g.,][]{harris2016genetic}.
Simulation of long sequences, on the order of $10^7$ bases,
has until recently been computationally prohibitive,
but this has changed with the development of modern simulation engines
such as \texttt{msprime} and \texttt{SLiM}.
Generating chromosome-scale simulations has several key benefits.
First, the organization of genes on chromosomes is a key feature of a species' genome that is ignored in many traditional population genomic simulations
(see \cite{schrider2020background} for one exception).
Second, modeling physical linkage allows simulations to capture
important correlations between genetic variants on a chromosome.
These correlations reduce variance relative to separate and independent simulations of equivalent genetic material.
This has a particularly striking effect in long stretches with a low recombination rate,
as observed for instance on the long arm of human chromosome 22 \citep{Dawson2002}.
In bacteria, a similar effect occurs due to genome-wide linkage that is broken only
by horizontal transfer of short segments \citep{Didelot2010}.
When conducting simulations with natural selection, linkage has
an even stronger effect. Selection acting on a small number of sites can
indirectly influence levels and patterns of genetic variation at linked neutral sites,
which has been shown to have a widespread
effect on patterns of genomic variation in a myriad of species
\citep[e.g.,][]{McVicker2009,Charlesworth2012}.
In addition, the lengths of chromosome-scale shared haplotypes within and
between populations provides valuable information on their demographic history.
Demographic inference methods that use such information,
such as MSMC \citep{Schiffels2020} and IBDNe \citep{browning2015accurate},
perform best on long genomic segments with realistic recombination rates.
Chromosome-scale simulations are clearly required to test (or train) such methods,
or to conduct power analyses when designing empirical studies that use them.
With \stdpopsim, such simulations are available with just a single call to a command-line script or with execution of a handful of lines of Python code.
%with citations to all primary sources accompanying each simulation.
\section*{Additions to \stdpopsim}
\label{sec:expanded-catalog}
When first published, the \stdpopsim catalog included six species:
\emph{Homo sapiens}, \emph{Pongo abelii}, \emph{Canis familiaris}, \emph{Drosophila melanogaster},
\emph{Arabidopsis thaliana}, and \emph{Escherichia coli} (Figure \ref{fig:tree}).
One way the catalog has expanded is through the introduction of additional demographic models
for \emph{Homo sapiens}, \emph{Pongo abelii}, \emph{Drosophila melanogaster},
and \emph{Arabidopsis thaliana}, enabling a wider variety of simulations for these
well-studied species.
However, the initial collection of six species represents only a small slice of the tree of life.
This is a concern
not only because there is a large community of researchers studying other organisms,
but also because methods developed for application to model species (such as humans)
may not perform well when applied to other species with very different biology.
Adding species to the \stdpopsim catalog will allow developers to easily test their methods across a wider variety of organisms.
We thus made a concerted effort
to recruit members of the population and evolutionary genetics community
to add their species of interest to the \stdpopsim catalog.
This effort involved a series of workshops to introduce potential contributors to \stdpopsim, followed by a ``Growing the Zoo'' hackathon organized alongside the 2021 ProbGen conference.
The seven initial workshops allowed us to reach a broad community of more than 150 researchers,
many of whom expressed interest in adding non-model species to \stdpopsim.
The hackathon was then structured based on feedback from these participants.
One month before the hackathon, we organized a final workshop to prepare interested
participants, by introducing them to the process of developing
a new species model and adding it to the \stdpopsim code base.
Roughly 20 scientists participated in the hackathon (most of whom are included as authors on this paper),
which resulted in the addition of 15 species to the \stdpopsim catalog
(Figure \ref{fig:tree}).
The catalog now includes
a teleost fish (\textit{Gasterosteus aculeatus}),
a bird (\textit{Anas platyrhynchos}),
a reptile (\textit{Anolis carolinensis}),
a livestock species (\textit{Bos taurus}),
six insects including two vectors of human disease (\textit{Aedes aegypti} and \textit{Anopheles gambiae}),
a nematode (\textit{Caenorhabditis elegans}),
two flowering plants including a crop (\textit{Helianthus annuus}),
an algae (\textit{Chlamydomonas reinhardtii}),
two bacteria,
four primates, and a common mammalian associate of humans (\textit{Canis familiaris}).
%REVISION: genetic map --> recombination map (to avoid confusion)
Not all of these have recombination maps or demographic models (see Figure \ref{fig:tree}),
but this lays a framework for future contributions.
\begin{figure}
\includegraphics[width=\linewidth]{figs/species_fig}
\caption{Phylogenetic tree of species available in the \stdpopsim catalog,
including the six species we published in the original release \citep[in blue]{Adrion2020}, and 15 species that have since been added (in orange).
%REVISION: genetic map --> recombination map (to avoid confusion)
Solid circles indicate species that have one (light grey) or more
(dark grey) demographic models and recombination maps.
Branch lengths were derived from the divergence times provided by TimeTree5 \citep{Kumar2022}.
The horizontal bar below the tree indicates 500 million years (my).
\label{fig:tree} }
\end{figure}
Expanding the species catalog required adding several capabilities to the simulation framework of \stdpopsim.
Some features were added by upgrading the neutral simulation engine, \texttt{msprime}, from version 0.7.4 to version 1.0 \citep{Baumdicker2022}.
Among other features, this upgrade includes a discrete-site model of mutation,
which enables simulating sites with multiple mutations and possibly more than two alleles.
Another key feature added to \stdpopsim's simulation framework was the ability to model non-crossover recombination.
%REVISION: clarified HGT, due to rev 3 comment
In bacteria and archaea, genetic material can be exchanged through horizontal gene transfer,
which can add new genetic material (e.g., via the transfer of plasmids)
or replace homologous sequences through homologous recombination \citep{Thomas2005,Didelot2010,Gophna2022}.
However, the initial version of \stdpopsim used crossover recombination to stand in for these processes.
Although we cannot currently simulate varying gene content
(as would be required to simulate the addition of new genetic material by horizontal gene transfer),
the \texttt{msprime} and \texttt{SLiM} simulation engines now allow gene conversion,
which has the same effect as non-crossover homologous recombination.
Following \citet{Cury2022}, we use this to include non-crossover homologous recombination in bacterial and archaeal species.
This is done in \stdpopsim by setting a flag in the species model to indicate that recombination should be modeled without crossovers,
and specifying an average tract length of exchanged genetic material.
For example, the model for \textit{Escherichia coli} has been updated in the \stdpopsim catalog to use non-crossover recombination at an average rate of $8.9\times 10^{-11}$ recombination events per base per generation,
with an average tract length of 542 bases \citep{Wielgoss2011,Didelot2012}.
Note that this rate ($8.9\times 10^{-11}$) corresponds to the rate of initiation of a recombined tract.
Recombination without crossover is also prevalent in sexually reproducing species,
where it is termed \emph{gene conversion}.
Gene conversion affects shorter segments than crossover recombination and creates distinct patterns of genetic diversity along the genome \citep{Korunes2017}.
Indeed, gene conversion rates in some species are estimated to occur at similar or even higher rates than crossover recombination \citep{Gay2007,Comeron2012,Wijnker2013}.
To accommodate this in \stdpopsim simulations,
one needs to specify the fraction of recombinations that occur due to gene conversion (i.e., without crossover), and the average tract length.
For example, the model for \emph{Drosophila melanogaster} has been updated in the \stdpopsim catalog to have a fraction of gene conversions of 0.83 (in all chromosomes that undergo recombination) and an average tract length of 518 bases \citep{Comeron2012}.
This update does not affect the rate of crossover recombination, but it adds gene conversion events at a ratio of 83:17 relative to crossover recombination events.
We note that since non-crossover recombination incurs a high computation load in simulation,
it is turned off by default in \stdpopsim,
and must be explicitly invoked by the simulation model.
%
%REVISION: added comment and ref to table 1
Note that ignoring gene conversion may result in a slightly skewed distribution of shared haplotypes between individuals (see Table \ref{tab:param-mod}).
Another important extension of \stdpopsim allows augmenting a genome assembly with genome annotations, such as coding regions, promoters, and conserved elements.
These annotations can be used to simulate selection at a subset of sites (such as the annotated coding regions)
using parametric distributions of fitness effects.
Standardized, easily accessible simulations
that include the reality of pervasive linked selection in a species-specific
manner has long been identified as a goal for evolutionary genetics
\cite[e.g.,][]{McVicker2009,comeron2014background}.
Thus, we expect this extension of \stdpopsim to be transformative in the way simulations are carried out in population genetics.
This significant new capability of the \stdpopsim library will be detailed in a forthcoming publication,
and is not the focus of this paper.
\section*{Guidelines for implementing a population genomic simulation}
\labelname{Guidelines}
\label{sec:sim-guidelines}
The concentrated effort to add species to the \stdpopsim catalog
has led to a series of important insights about this process,
which we summarize here as a set of guidelines
for implementing realistic simulations for any species.
Our intention is to provide general guidance that applies to any population genomic simulation software,
but we also mention specific requirements that apply to simulations done in \stdpopsim.
\subsection*{Basic setup for chromosome-level simulations}
Implementing a realistic population genomic simulation for a species of interest
requires a detailed description of the organism's demography and mechanisms of genetic inheritance.
While simulation software requires unforgivingly precise values,
in practice we may only have rough guesses for most of the parameters describing these processes.
In this section, we list the relevant parameters and
provide guidelines for how to set them based on current knowledge.
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\item
\textbf{A chromosome-level genome assembly}, which consists of a list of chromosomes or scaffolds and their lengths.
Having a good quality assembly with complete chromosomes, or at least very long scaffolds,
is necessary if chromosome-level population genomic simulations are to reflect the genomic architecture of the species.
When expanding the \stdpopsim catalog during the ``Growing the Zoo" hackathon,
we considered the possibility of adding species whose genome assemblies are composed of many relatively small contigs,
unanchored to chromosome-level scaffolds.
Although we had not previously put restrictions on which species might be added,
we decided that we would only add species with chromosome-level assemblies.
The main justification for this restriction is that
species with less complete genome builds typically do not have good
%REVISION: removed genetic map here because it is implied by "recombination rates" above
% and confused reviewer #3
%estimates of recombination rate, genetic maps,
recombination maps
and demographic models,
making chromosome-level simulation much less useful in such species.
Another issue is the storage burden and long load times involved in dealing with
hundreds of contigs.
Finally, each species requires validation of its code before it is added to the \stdpopsim catalog,
as well as long-term maintenance to keep it up-to-date with changes made to the \stdpopsim framework.
So, the benefit of including species with very partial genome builds in \stdpopsim
would be outweighed by the substantial extra burden on \stdpopsim maintainers
as well as downstream users of these models.
Another reason to focus on species with chromosome-level assemblies is that we
expect their numbers to dramatically increase in the near future due to numerous genome initiatives \citep{Lewin2022,Rhie2021,Cheng2018}
and the development of new long-read sequencing technologies and assembly pipelines
\citep{Chakraborty2016,Amarasinghe2020,Amarasinghe2021}.
\item
\textbf{An average mutation rate} for each chromosome (per generation per bp).
This rate estimate can be based on sequence data from pedigrees, mutation accumulation studies,
or comparative genomic analysis calibrated by fossil data (i.e., phylogenetic estimates).
At present, \stdpopsim simulates mutations at a constant rate under the Jukes--Cantor model of nucleotide mutations \citep{Jukes1969}.
However, we anticipate future development will provide support for more complex, heterogeneous mutational processes,
as these are easily specified in both the \texttt{SLiM} and \texttt{msprime} simulation engines.
Such progress will further improve the realism of simulated genomes,
since mutation processes, including rates, are known to vary along the genome and through time \citep{Benzer1961,Ellegren2003,Supek2019}.
\item
\textbf{Recombination rates} (per generation per bp).
Ideally, a population genomic simulation should make use of a chromosome-level
%REVISION: removed boldface
{recombination map},
since the recombination rate is known to vary widely across chromosomes \citep{Nachman2002},
and this can strongly affect the patterns of linkage disequilibrium and shared haplotype lengths.
When this information is not available, we suggest specifying an average recombination rate for each chromosome.
At minimum, an average genome-wide recombination rate needs to be specified, which is typically available for well-assembled genomes.
For bacteria and archaea, which primarily experience non-crossover recombination,
the average tract length should also be specified
(see details in previous section).
\textbf{Gene conversion (optional):} If one wishes to model gene conversion in eukaryotes, either together with crossover recombination or as a stand-alone process,
then one should specify the fraction of recombinations done by gene conversion
as well as the per chromosome average tract length.
\item
\textbf{A demographic model} describing
ancestral population sizes, split times, and migration rates.
Selection of a reasonable demographic model is often crucial,
since misspecification of the model can generate unrealistic patterns of genetic variation that will affect downstream analyses \citep[e.g.,][]{Navascues2009}.
% PLR: an odd citation for that?
% ILAN: I agree, but couldn't find a better one. Let's keep
A given species might have more than one demographic model, fit from different data or by different methods.
Thus, when selecting a demographic model, one should examine the data sources and methods used to obtain it to ensure that they are relevant to the study at hand
%REVISION added ref to new section
(see also {\bf Limiations of simulated genomes} below).
%REVISION removed boldface font
At a minimum, simulation requires a single estimate of {effective population size}. This estimate, which may correspond to some sort of historical average effective population size,
should produce simulated data that matches the average observed genetic diversity in that species. Note, however, that this average effective population size cannot capture features of genetic variation that are caused by recent changes in population size and the presence of population structure \citep{MacLeod2013,Eldon2015}.
For example, a recent population expansion will produce
an excess of low-frequency alleles that no simulation of a constant-sized
population will reproduce \citep{Tennessen2012}.
\item
\textbf{An average generation time} for the species.
This parameter is an important part of the species' natural history.
This value does not directly affect the simulation, since
\stdpopsim uses either the Wright--Fisher model (in \texttt{SLiM}) or the Moran model (in \texttt{msprime}),
both of which operate in time units of generations.
Thus, the average generation time is only currently used to convert time units to years,
which is useful when comparing among different demographic models.
\end{enumerate}
These five categories of parameters are sufficient for generating simulations
under neutral evolution. Such simulations are useful for a number of purposes,
but they cannot be used to model the influence of natural selection on patterns of genetic variation.
To achieve this, the simulator needs to know which regions along the genome are subject to selection,
and the nature and strength of this selection.
As mentioned above, the ability to simulate chromosomes with realistic models of
selection is still under development, and will be finalized in the next release of \stdpopsim.
The development version of \stdpopsim enables simulation with selection
(using the \texttt{SLiM} engine)
by specifying genome annotations and distributions of fitness effects,
as specified below.
\begin{enumerate}
\def\labelenumi{\arabic{enumi}.}
\setcounter{enumi}{5}
\item
\textbf{Genome annotations}, specifying regions subject to selection (as, for example, a GFF3/GTF file).
For instance, annotations can contain information on the location of coding regions,
the position of specific genes, or conserved non-coding regions.
Regions not covered by the annotation file are assumed to be evolving free from the effects of direct natural selection.
\item
\textbf{Distributions of fitness effects} (DFEs) for each annotation.
Each annotation is associated with a DFE describing
the probability distribution of selection coefficients (deleterious, neutral, and beneficial)
for mutations occurring in the region covered by the annotation.
DFEs can be inferred from population genomic data \citep[reviewed in][]{Eyre-Walker2007},
and are available for several species \citep[e.g.,][]{Ma2013, Huber2018}.
\end{enumerate}
%REVISION: mentiond currently implemented DFEs and mentioned selection ms
The current release of \stdpopsim contains annotations and implemented DFE models for the
three model species: {\em A. thaliana}, {\em D. melanogaster}, and {\em H. sapiens}.
A forthcoming publication will provide details about how this is implemented in \stdpopsim
and examples of possible uses of this feature.
\subsection*{Extracting parameters from the literature}
Simulations cannot of course precisely match reality, but in setting up simulations
it is desirable to choose parameters that best reflect our current understanding of the evolutionary history of the species of interest.
In practice a researcher may choose each parameter to match a fairly precise estimate or a wild guess,
which may be obtained from a peer-reviewed publication or by word of mouth.
However, values in \stdpopsim are always chosen to match published estimates,
so that the underlying data and methods are documented and can be validated.
Because the process of converting information reported in the literature to parameters used by a simulation engine is quite error-prone,
independent validation of the simulation code is crucial.
We highly recommend following a quality-control procedure similar to the one used in \stdpopsim,
in which each species or model added to the catalog is independently recreated or thoroughly reviewed by a separate researcher.
Obtaining reliable and citable estimates for all model parameters is not a trivial task.
Oftentimes, values for different parameters must be gleaned from multiple publications and combined.
For example, it is not uncommon to find an estimate of a mutation rate in one paper,
a recombination map in a separate paper, and a suitable demographic model in a third paper.
Integrating information from different publications requires caution,
since some of these parameter estimates are entangled in non-trivial ways.
For instance, consider simulating a demographic model estimated in a specific paper that assumes
a certain mutation rate.
Naively using the demographic model, as published, with a new estimate of the mutation rate
will lead to levels of genetic diversity that do not fit the genomic data.
This is addressed in \stdpopsim by allowing a demographic model to be simulated using a mutation rate that differs from the default rate specified for the species.
See, for example, the model implemented for \emph{Bos taurus},
which is described in the next section.
This important feature does not necessarily fix all potential inconsistencies
caused by assumptions made by the demographic inference method
(such as assumptions about recombination rates).
It is therefore recommended, when possible, to take the demographic model,
mutation rates, and recombination rates from the same study,
and to proceed carefully when mixing sources.
An additional tricky source of inconsistency is coordinate drift between
subsequent versions of genome assemblies.
%REVISION: genetic map --> recombination map (to avoid confusion)
In \stdpopsim, we follow the UCSC Genome Browser
and use liftOver to convert the coordinates of recombination maps and genome annotations
to the coordinates of the current genome assembly \citep{Hinrichs2006ucsc}.
%REVISION: added section
\subsection*{Limitations of simulated genomes}
Despite their great utility, simulated genomes cannot fully
capture all aspects of genetic variation as observed in real data,
with some aspects modeled better than others.
As mentioned above, this will strongly depend on the demographic model used in simulation.
Thus, it is important to consider potential limitations of different demographic models
in reflecting observed genetic variation.
First, a demographic model inferred from analysis of genomic data will likely depend on
the samples that contributed the analyzed genomes.
The inferred demographic model can only reflect the genealogical ancestry
of these sampled individuals, and this will typically make up a small portion
of the complete genealogical ancestry of the species.
% Ilan: I commented out the following statement because I think it's addressed
% by the final statement in this paragraph. We can uncomment, if needed.
%(albeit the genealogy of any set of sampled individuals includes many ancestors).
Thus, demographic models inferred from larger sets of samples from diverse ancestry backgrounds
may potentially provide a more comprehensive depiction of genetic variation within a species.
This is true if sufficiently realistic demographic models can be fit---models that account for the structure of populations within a species.
That said, the choice of samples used for inference will mostly influence
recent changes in genetic variation.
This is because the genealogy of even a single individual consists of numerous ancestors
in each generation in the deep past,
which is the premise of methods that infer ancestral population sizes from a single input genome
\citep{LiDurbin2011}.
The computational method used for inference also affects the way genetic variation
is reflected by the demographic model,
because different methods derive their inference from different features of genomic variation.
Some methods make use of the site frequency spectrum at unlinked single sites
\citep[e.g.,][]{Gutenkunst2009,Excoffier2013,Liu2015}, while other methods use haplotype structure \citep[e.g.,][]{LiDurbin2011,Schiffels2020,browning2015accurate}.
This, in turn, may influence the accuracy of different features in the inferred demography.
For example, very recent demographic changes, such as recent admixture or bottlenecks,
are difficult to infer from the site frequency spectrum,
but are more easily inferred by examining shared long haplotypes
(as demonstrated by the demographic model inferred for \emph{Bos taurus} by \cite{MacLeod2013}; see below).
Several studies have compared different approaches to demographic inference \citep[e.g.,][]{Harris2013,Beichman2017}, but unfortunately, there is currently no succinct handbook that describes the relative strengths and weaknesses of different methods.
Thus, assessing the potential limitations of a given demographic model currently requires
some familiarity with the method used for its inference.
%Ilan: added this to address rev 3's comment about BGS.
% I didn't want to get to much into details
In addition, all methods assume that the input sequences are neutrally evolving.
This implies that technical choices,
such as the specific genomic segments analyzed and various filters,
may also influence the inferred model and its ability to model observed genetic variation.
Thus, it is strongly advised to read the study that inferred the demographic model
and understand potential limitations that stem from the selection of
samples, methods, and filters.
We note that inclusion of a demographic model in the \stdpopsim catalog does not involve any judgment as to which aspects of genetic variation it captures.
Any model that is a faithful implementation of a published model inferred from genomic data can be added to the \stdpopsim catalog.
Thus, potential users of \stdpopsim should use the implemented models with the appropriate caution, keeping in mind the limitations discussed above.
We maintain a fairly detailed documentation page for the catalog (see {\bf Data availability}),
which contains a brief summary for each demographic model.
This summary includes a graphical description of the model
(such as the one shown for {\em Anopheles gambiae} in Fig. \ref{fig:anogam}B),
as well as a description of the data and method used for inference.
Therefore, the documentation can provide guidance to potential users of \stdpopsim
in the process of selecting an appropriate demographic model for simulation.
%
Finally, we hope that the standardized simulations implemented in \stdpopsim
will facilitate additional studies that examine the relative strengths and limitations
of different approaches to demographic inference (and modeling genetic variation in general),
and this will allow us to generate even more realistic simulations in the future.
\subsection*{Filling in the missing pieces}
For many species it is difficult to obtain estimates of all necessary model parameters.
Table \ref{tab:param-mod} provides suggestions for ways to deal with missing values of various model parameters.
The table also mentions possible consequences of misspecification of each parameter.
%REVISION: moved table to fit more nicely in text
\begin{table}[h!]
\captionof{table}{\textbf{Guidelines for dealing with missing parameters.} %
For each parameter, we provide a suggested course of action,
and mention the main discrepancies
between simulated data and real genomic data that could be caused by misspecification of that parameter.
} \label{tab:param-mod}
\begin{tabular}{p{1.5in}p{2.2in}p{2.2in}}
\hline
\textbf{Missing parameter} &
\textbf{Suggested action} &
\textbf{Possible discrepancies} \\
\hline
Mutation rate &
Borrow from closest relative with a citable mutation rate &
Number of polymorphic sites \\
\hline
Recombination rate &
Borrow from closest relative with a citable recombination rate &
Patterns of linkage disequilibrium
\\
\hline
Gene conversion rate and tract length &
Set rate to 0 or borrow from closest relative with a citable rate &
Lengths of shared haplotypes across individuals
\\
\hline
Demographic model &
Set the effective population size ($N_e$) to a value
that reflects the
%
%REVISION: tweaked here due to comment of reviewer #3
observed genetic diversity %in the simulated population
&
Features of genetic diversity that are captured by the site frequency spectrum,
such as the prevalence of low-frequency alleles\\
\hline
\end{tabular}
\end{table}
In some cases, one may wish to generate simulations for a species with a partial genome build.
Despite the focus of \stdpopsim on species with chromosome-level assemblies (see discussion above), simulation is still potentially useful for species with less complete assemblies, with some important considerations to keep in mind.
Longer contigs or scaffolds in these builds can be simulated separately and independently.
This approach allows us to model genetic linkage within each contig,
but linkage between different contigs that map to the same chromosome will not be captured by the simulation.
This provides a reasonable approximation for many purposes, at least for genomic regions far from the contig edges.
For shorter contigs, separate independent simulations will not be able to capture patterns of long-range linkage in a reasonably realistic way.
Thus, a potentially viable option for shorter contigs is to combine them into longer pseudo-chromosomes, trying to mimic the species' expected chromosome lengths.
Despite their somewhat artificial construction,
these pseudo-chromosomes have the important benefit of
capturing patterns of linkage similar to those observed in real genomic chromosomes.
If, for example, the main purpose of the simulation is to examine the distribution of lengths of shared haplotypes between individuals,
or study patterns of background selection,
then it makes sense to simulate such pseudo-chromosomes.
However, genetic correlations between different specific contigs lumped together in this way are obviously not accurate.
So, if the main purpose of the simulation is to examine local patterns of genetic variation in loci of interest, then it may be more appropriate to simulate the relevant contigs separately (even if they are short), or to randomly sample several mappings of contigs to pseudo-chromosomes.
For some purposes it makes sense to simulate a large number of unlinked sites \citep{Gutenkunst2009,Excoffier2013},
which can be generated without any sort of genome assembly.
However, this approach would not have the benefits of chromosome-scale simulations.
While some of the same considerations hold when simulating unlinked short sequences, a detailed discussion about such simulations goes beyond the scope of this paper.
Ultimately, the recommended mode of simulation for a species with a partial genome assembly depends on the intended use of the simulated genomes.
\section*{Examples of added species}
\labelname{Examples}
\label{sec:examples}
In this section, we provide examples of two species recently added to the \stdpopsim catalog,
\textit{Anopheles gambiae} and \textit{Bos taurus},
to demonstrate some of the key considerations of the process.
%REVISION: explain boldface
In each example, we highlight in bold the model parameters set for each species.
\subsection*{\texorpdfstring{\emph{Anopheles gambiae} (mosquito)}{Anopheles gambiae (mosquito)}}
\label{AnoGam}
\begin{figure}[b!]
\includegraphics[width=\linewidth]{figs/anogam_demog_table}
\caption{The species parameters and demographic model used for \emph{Anopheles gambiae} in the \stdpopsim catalog.
(A) The parameters associated with the genome build and species, including
chromosome lengths, average recombination rates (per base per generation),
and average mutation rates (per base per generation).
(B) A graphical depiction of the demographic model,
which consists of a single population whose size changes throughout the past 11,260 generations in 67 time intervals (note the log scale). The width at each point depicts the effective population size ($N_e$), with the horizontal bar at the bottom indicating the scale for $N_e=10^6$.
%REVISION: mentioned the docs
This figure is adapted from the data on the \stdpopsim catalog documentation page (see {\bf Data availability}) and plotted with POPdemog \citep{Zhou2018}.
\label{fig:anogam} }
\end{figure}
\emph{Anopheles gambiae}, the African malaria mosquito, is
a non-model organism whose population history has direct implications for human health.
Several large-scale studies in recent years have provided information about the
population history of this species on which population genomic simulations can be based \citep[e.g.,][]{Miles2017, clarkson2020genome}.
The genome assembly structure used in the species model is from
the AgamP4 \textbf{genome assembly} \citep{Sharakhova2007},
downloaded directly from Ensembl \citep{ensembl2021}
using the special utilities provided by \stdpopsim.
Estimates of average \textbf{recombination rates} for each of the chromosomes (excluding the mitochondrial genome)
were taken from a recombination map inferred by \citet{Pombi2006} which itself included information from
\citet{zheng1996integrated} (Figure \ref{fig:anogam}A).
As direct estimates of \textbf{mutation rate} (e.g., via mutation accumulation) do not currently exist for \emph{Anopheles gambiae},
we used the genome-wide average mutation rate of $\mu=3.5 \times 10^{-9}$ mutations per generation per site
estimated by \cite{Keightley2009} for the fellow Dipteran \textit{Drosophila melanogaster},
a rate that was used for analysis of \textit{A.~gambiae} data in \citet{Miles2017}.
To obtain an estimate for the default \textbf{effective population size} ($N_e$),
we used the formula $\theta=4\mu N_e$,
with the above mutation rate ($\mu=3.5 \times 10^{-9}$
%REVISION: added units to mut rate
mutations per base per generation)
and a mean nucleotide diversity of $\theta\approx 0.015$,
as reported by \citet{Miles2017} for the Gabon population.
This resulted in an estimate of $N_e=1.07\times 10^{6}$,
which we rounded down to one million.
These steps were documented in the code for the \stdpopsim species model,
to facilitate validation and future updates.
We acknowledge that some of these steps involve somewhat arbitrary choices,
such as the choice of the Gabon population and rounding down of the final value.
However, this should not be seen as a considerable source of misspecification,
since this value of $N_e$ is meant to provide only a rough approximation to
historical population sizes and would be overwritten by a more detailed demographic model.
%REVISION: moved boldface "demographic model" implemented in stdpopsim
\citet{Miles2017} inferred {demographic models} from \textit{Anopheles} samples from nine different populations (locations) using the stairway plot method \citep{Liu2015}.
We chose to include in \stdpopsim the \textbf{demographic model} inferred from the Gabon sample,
which consists of a single population whose size fluctuated from below 80,000
(an ancient bottleneck roughly 10,000 generations ago) to the present-day estimate of over 4 million individuals (Figure \ref{fig:anogam}B).
To convert the timescale from generations to years,
%REVISION: added boldface
we used an \textbf{average generation time} of $1/11$ years,
as in \cite{Miles2017}.
All of these parameters were set in the species entry in the \stdpopsim catalog,
accompanied by the relevant citation information,
and the model underwent the standard quality-control process.
The species entry may be refined in the future by adding more demographic models,
updating or refining the recombination map,
or updating the mutation rate estimates based on ones directly estimated for this species.
Note that even if the mutation rate is updated sometime in the future,
the demographic model mentioned above should still be associated with the current
mutation rate ($\mu=3.5 \times 10^{-9}$
%REVISION: added units to mut rate
mutations per base per generation),
since this was the rate used in its inference.
\subsection*{\texorpdfstring{\emph{Bos taurus} (cattle)}{Bos taurus (cattle)}}
\label{bos-taurus}
\emph{Bos taurus} (cattle) was added to the \stdpopsim catalog during the 2020 hackathon because of its agricultural importance. Agricultural species experience
strong selection due to domestication and selective breeding, leading
to a reduction in effective population size. These processes,
as well as admixture and introgression, produce patterns
of genetic variation that can be very different from typical model
species \citep{Larson2013}. These processes have occurred over a
relatively short period of time, since the advent of agriculture roughly 10,000 years ago, and they have intensified over the years to improve food production \citep{Gaut2018,MacLeod2013}. High-quality genome assemblies are now
available for several breeds of cattle \citep[e.g.,][]{Rosen2020, Heaton2021,
Talenti2022} and the use of genomic data has become ubiquitous
in selective breeding \citep{Meuwissen2001,MacLeod2014, Obsteter2021, Cesarani2022}.
Modern cattle have extremely low and declining genetic diversity,
with estimates of effective population size around 90 in the early 1980s \citep{MacLeod2013, VanRaden2020, Makanjouloa2020}.
On the other hand, the ancestral effective population size is estimated to be roughly $N_e$=62,000 \citep{MacLeod2013}.
This change in effective population size presents a challenge for demographic inference,
selection scans, genome-wide association, and genomic prediction
\citep{MacLeod2013,MacLeod2014,Hartfield2022}.
For these reasons, it was useful to develop a detailed simulation model for cattle to be added to the \stdpopsim catalog.
We used the most recent \textbf{genome assembly}, ARS-UCD1.2
\citep{Rosen2020}, a constant \textbf{mutation rate} of
%REVISION: added units to mut rate of
$\mu=1.2\times 10^{-8}$ mutations per base per generation
for all chromosomes \citep{Harland2017},
and a constant \textbf{recombination rate} of
%REVISION: added units to rec rate
$r=9.26 \times 10^{-9}$ recombinations per base per generation for all chromosomes other than the mitochondrial genome \citep{Ma2015}.
With respect to the \textbf{effective population size}, it is clear that simulating with either
the ancestral or current effective population size would not generate realistic genome structure and diversity \citep{MacLeod2013,Rosen2020}.
Since \stdpopsim does not allow for a missing value of $N_e$,
we chose to set the species default $N_e$ to the ancestral estimate of $6.2\times 10^4$.
However, we strongly caution that
simulating the cattle genome with any fixed value for $N_e$ will generate unrealistic patterns of genetic variation,
and recommend using a reasonably detailed demographic model.
%REVISION: added this to clarify that the default Ne is not used when a dem model is specified
Note that the default $N_e$ is only used in simulation if a demographic model is not specified.
To this end,
we implemented the \textbf{demographic model} of the Holstein breed, which was
inferred by \cite{MacLeod2013} from runs of homozygosity in the whole-genome sequence of two iconic bulls.
This demographic model specifies changes in the ancestral effective population size from $N_e$=62,000 at around 33,000 generations ago to $N_e$=90 in the 1980s
in a series of 13 instantaneous population size changes
\citep[taken from Supplementary Table S1 in][]{MacLeod2013}.
%REVISION: tweaked boldface
To convert the timescale from generations to years, we used an \textbf{average generation time} of $5$ years \citep{MacLeod2013}.
Note that this demographic model does not capture the intense selective breeding since the 1980s that has even further reduced the effective population size of cattle \citep{MacLeod2013, VanRaden2020, Makanjouloa2020}. These effects can be modeled with
downstream breeding simulations \citep[e.g.,][]{Gaynor2020}.
When setting up the parameters of the demographic model, we noticed that the inference by \cite{MacLeod2013} assumed a genome-wide fixed recombination rate of
%REVISION: added units to rec rate
$r=10^{-8}$ recombinations per base per generation,
and a fixed mutation rate of
%REVISION: added units to mut rate
$\mu=9.4 \times 10^{-9}$ mutations per base per generation (considering also sequence errors).
The more recently updated mutation rate assumed in the species model
%REVISION: added units to mut rate
\citep[$1.2\times 10^{-8}$ mutations per base per generation, from][]{Harland2017}
is thus \(28\%\) higher than the rate used for inference.
As a result, if genomes were simulated under this demographic model with the species' default mutation rate they would have considerably higher sequence diversity than actually observed in real genomic data.
To address this, we specified a mutation rate of \(\mu=9.4 \times 10^{-9}\) in the demographic model,
which then overrides the species' mutation rate when this demographic model is applied in simulation.
The issue of fitting the rates used in simulation with those assumed during inference was discussed during the independent review of this demographic model, and it raised an important question about recombination rates. Since \cite{MacLeod2013} use runs of homozygosity to infer the demographic model, their results depends on the assumed recombination rate. The recombination rate assumed in inference
%REVISION: added units to rec rate
(\(r=10^{-8}\) recombinations per base per generation) is \(8\%\) higher than the one used in the species model (\(r=9.26\times 10^{-9}\)). In its current version, \stdpopsim does not allow specification of a separate recombination rate for each demographic model, so we had no simple way to adjust for this. Future versions of \stdpopsim will enable such flexibility. Thus, we note that genomes simulated under this demographic model as currently implemented in \stdpopsim
might have slightly higher linkage disequilibrium than observed in real cattle genomes.
However, we anticipate that this would affect patterns less
than selection due to domestication and selective breeding,
which are not yet modeled at all in \stdpopsim simulations.
\section*{Conclusion}
\label{conclusion}
As our ability to sequence genomes continues to advance, the need for
population genomic simulations of new model and non-model organisms is becoming acute.
So, too, is the concomitant need for an expandable framework for implementing such simulations and guidance for how to do so.
Generating realistic whole-genome simulations presents significant challenges
both in coding and in choosing parameter values on which to base the simulation.
With \stdpopsim, we provide a resource that is uniquely poised to address these
challenges as it provides easy access to state-of-the-art simulation engines and practices, and an easy procedure for including new species.
Moreover, we aim for the choices regarding inclusion of new species to be driven
by the needs of the population genomics community.
In this manuscript we
describe the expansion of \stdpopsim in two ways:
the addition of new features to the simulation framework that incorporate new evolutionary processes,
such as non-crossover recombination, broadening the diversity of species that can
be realistically modeled; and the considerable expansion of the catalog itself
to include more species and demographic models.
We also formulated a series of guidelines for implementing
population genomic simulations, based on
insights from the community-driven process of expanding the \stdpopsim catalog.
These guidelines specify the basic requirements for generating a useful chromosome-level simulation for a given species, as well as the rationale behind these requirements.
We also discuss special considerations for collecting relevant information from the literature,
and what to do if some of that information is not available.
Because this process is quite error-prone,
we encourage wider adoption of ``code review'':
researchers implementing simulations should have their
parameter choices and implementation reviewed by at least one other researcher.
The guidelines in this paper can be followed when implementing a simulation independently for a single study,
or (as we encourage others to do) when adding code to \stdpopsim,
which helps to ensure its robustness and to make it available for future research.
Currently, large-scale efforts such as the Earth Biogenome
and its affiliated project networks are generating tens of thousands of genome
assemblies. Each of these assemblies
would become a candidate for inclusion into the
\stdpopsim catalog, %following the steps we have outlined above,
although substantial changes to the structure of \stdpopsim would be required
to include so many distinct species.
As annotations of those genome assemblies improve over time, this information, too, can easily be added to the \stdpopsim catalog.
One of the important objectives of the PopSim consortium is to leverage \stdpopsim as a means to promote education and inclusion of new communities into
computational biology and software development.
We are keen to use outreach, such as the workshops and hackathons described here,
as a way to grow the \stdpopsim catalog and library while also democratizing the development of population genomic simulations in general.
We predict that the increased use of chromosome-scale simulations in non-model species will lead to an improvement in inference methods,
which traditionally have been quite narrowly focused on well-studied model organisms.
Thus, we hope that further expansion of \stdpopsim will improve the ease and reproducibility of research across a larger number of systems,
while simultaneously expanding the community of software developers among population and evolutionary geneticists.
\section*{Data availability}\label{acknowledgements}
%REVISION: added link to docs
The code for \stdpopsim and the species catalog are available from: \href{https://github.com/popsim-consortium/stdpopsim}
{https://github.com/popsim-consortium/stdpopsim}.
%
The documentation page for the \stdpopsim catalog is available from:
\href{https://popsim-consortium.github.io/stdpopsim-docs/stable/catalog.html}
{https://popsim-consortium.github.io/stdpopsim-docs/stable/catalog.html}
\section*{Acknowledgments}\label{acknowledgements}