<html>
<body>
<head>
<link rel="stylesheet" href="plink.css" type="text/css">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<title>PLINK: Whole genome data analysis toolset</title>
</head>
<!--<html>-->
<!--<title>PLINK</title>-->
<!--<body>-->
<font size="6" color="darkgreen"><b>plink...</b></font>
<div style="position:absolute;right:10px;top:10px;font-size:
75%"><em>Last original <tt>PLINK</tt> release is <b>v1.07</b>
(10-Oct-2009); <b>PLINK 1.9</b> is now <a href="plink2.shtml"> available</a> for beta-testing</em></div>
<h1>Whole genome association analysis toolset</h1>
<font size="1" color="darkgreen">
<em>
<a href="index.shtml">Introduction</a> |
<a href="contact.shtml">Basics</a> |
<a href="download.shtml">Download</a> |
<a href="reference.shtml">Reference</a> |
<a href="data.shtml">Formats</a> |
<a href="dataman.shtml">Data management</a> |
<a href="summary.shtml">Summary stats</a> |
<a href="thresh.shtml">Filters</a> |
<a href="strat.shtml">Stratification</a> |
<a href="ibdibs.shtml">IBS/IBD</a> |
<a href="anal.shtml">Association</a> |
<a href="fanal.shtml">Family-based</a> |
<a href="perm.shtml">Permutation</a> |
<a href="ld.shtml">LD calculations</a> |
<a href="haplo.shtml">Haplotypes</a> |
<a href="whap.shtml">Conditional tests</a> |
<a href="proxy.shtml">Proxy association</a> |
<a href="pimputation.shtml">Imputation</a> |
<a href="dosage.shtml">Dosage data</a> |
<a href="metaanal.shtml">Meta-analysis</a> |
<a href="annot.shtml">Result annotation</a> |
<a href="clump.shtml">Clumping</a> |
<a href="grep.shtml">Gene Report</a> |
<a href="epi.shtml">Epistasis</a> |
<a href="cnv.shtml">Rare CNVs</a> |
<a href="gvar.shtml">Common CNPs</a> |
<a href="rfunc.shtml">R-plugins</a> |
<a href="psnp.shtml">SNP annotation</a> |
<a href="simulate.shtml">Simulation</a> |
<a href="profile.shtml">Profiles</a> |
<a href="ids.shtml">ID helper</a> |
<a href="res.shtml">Resources</a> |
<a href="flow.shtml">Flow chart</a> |
<a href="misc.shtml">Misc.</a> |
<a href="faq.shtml">FAQ</a> |
<a href="gplink.shtml">gPLINK</a>
</em></font>
</p>
<table border=0>
<tr>
<td bgcolor="lightblue" valign="top" width=20%>
<font size="1">
<a href="index.shtml">1. Introduction</a> </p>
<a href="contact.shtml">2. Basic information</a> </p>
<ul>
<li> <a href="contact.shtml#cite">Citing PLINK</a>
<li> <a href="contact.shtml#probs">Reporting problems</a>
<li> <a href="news.shtml">What's new?</a>
<li> <a href="pdf.shtml">PDF documentation</a>
</ul>
<a href="download.shtml">3. Download and general notes</a> </p>
<ul>
<li> <a href="download.shtml#download">Stable download</a>
<li> <a href="download.shtml#latest">Development code</a>
<li> <a href="download.shtml#general">General notes</a>
<li> <a href="download.shtml#msdos">MS-DOS notes</a>
<li> <a href="download.shtml#nix">Unix/Linux notes</a>
<li> <a href="download.shtml#compilation">Compilation</a>
<li> <a href="download.shtml#input">Using the command line</a>
<li> <a href="download.shtml#output">Viewing output files</a>
<li> <a href="changelog.shtml">Version history</a>
</ul>
<a href="reference.shtml">4. Command reference table</a> </p>
<ul>
<li> <a href="reference.shtml#options">List of options</a>
<li> <a href="reference.shtml#output">List of output files</a>
<li> <a href="newfeat.shtml">Under development</a>
</ul>
<a href="data.shtml">5. Basic usage/data formats</a>
<ul>
<li> <a href="data.shtml#plink">Running PLINK</a>
<li> <a href="data.shtml#ped">PED files</a>
<li> <a href="data.shtml#map">MAP files</a>
<li> <a href="data.shtml#tr">Transposed filesets</a>
<li> <a href="data.shtml#long">Long-format filesets</a>
<li> <a href="data.shtml#bed">Binary PED files</a>
<li> <a href="data.shtml#pheno">Alternate phenotypes</a>
<li> <a href="data.shtml#covar">Covariate files</a>
<li> <a href="data.shtml#clst">Cluster files</a>
<li> <a href="data.shtml#sets">Set files</a>
</ul>
<a href="dataman.shtml">6. Data management</a> </p>
<ul>
<li> <a href="dataman.shtml#recode">Recode</a>
<li> <a href="dataman.shtml#recode">Reorder</a>
<li> <a href="dataman.shtml#snplist">Write SNP list</a>
<li> <a href="dataman.shtml#updatemap">Update SNP map</a>
<li> <a href="dataman.shtml#updateallele">Update allele information</a>
<li> <a href="dataman.shtml#refallele">Force reference allele</a>
<li> <a href="dataman.shtml#updatefam">Update individuals</a>
<li> <a href="dataman.shtml#wrtcov">Write covariate files</a>
<li> <a href="dataman.shtml#wrtclst">Write cluster files</a>
<li> <a href="dataman.shtml#flip">Flip strand</a>
<li> <a href="dataman.shtml#flipscan">Scan for strand problem</a>
<li> <a href="dataman.shtml#merge">Merge two files</a>
<li> <a href="dataman.shtml#mergelist">Merge multiple files</a>
<li> <a href="dataman.shtml#extract">Extract SNPs</a>
<li> <a href="dataman.shtml#exclude">Remove SNPs</a>
<li> <a href="dataman.shtml#zero">Zero out sets of genotypes</a>
<li> <a href="dataman.shtml#keep">Extract Individuals</a>
<li> <a href="dataman.shtml#remove">Remove Individuals</a>
<li> <a href="dataman.shtml#filter">Filter Individuals</a>
<li> <a href="dataman.shtml#attrib">Attribute filters</a>
<li> <a href="dataman.shtml#makeset">Create a set file</a>
<li> <a href="dataman.shtml#tabset">Tabulate SNPs by sets</a>
<li> <a href="dataman.shtml#snp-qual">SNP quality scores</a>
<li> <a href="dataman.shtml#geno-qual">Genotypic quality scores</a>
</ul>
<a href="summary.shtml">7. Summary stats</a>
<ul>
<li> <a href="summary.shtml#missing">Missingness</a>
<li> <a href="summary.shtml#oblig_missing">Obligatory missingness</a>
<li> <a href="summary.shtml#clustermissing">IBM clustering</a>
<li> <a href="summary.shtml#testmiss">Missingness by phenotype</a>
<li> <a href="summary.shtml#mishap">Missingness by genotype</a>
<li> <a href="summary.shtml#hardy">Hardy-Weinberg</a>
<li> <a href="summary.shtml#freq">Allele frequencies</a>
<li> <a href="summary.shtml#prune">LD-based SNP pruning</a>
<li> <a href="summary.shtml#mendel">Mendel errors</a>
<li> <a href="summary.shtml#sexcheck">Sex check</a>
<li> <a href="summary.shtml#pederr">Pedigree errors</a>
</ul>
<a href="thresh.shtml">8. Inclusion thresholds</a>
<ul>
<li> <a href="thresh.shtml#miss2">Missing/person</a>
<li> <a href="thresh.shtml#maf">Allele frequency</a>
<li> <a href="thresh.shtml#miss1">Missing/SNP</a>
<li> <a href="thresh.shtml#hwd">Hardy-Weinberg</a>
<li> <a href="thresh.shtml#mendel">Mendel errors</a>
</ul>
<a href="strat.shtml">9. Population stratification</a>
<ul>
<li> <a href="strat.shtml#cluster">IBS clustering</a>
<li> <a href="strat.shtml#permtest">Permutation test</a>
<li> <a href="strat.shtml#options">Clustering options</a>
<li> <a href="strat.shtml#matrix">IBS matrix</a>
<li> <a href="strat.shtml#mds">Multidimensional scaling</a>
<li> <a href="strat.shtml#outlier">Outlier detection</a>
</ul>
<a href="ibdibs.shtml">10. IBS/IBD estimation</a>
<ul>
<li> <a href="ibdibs.shtml#genome">Pairwise IBD</a>
<li> <a href="ibdibs.shtml#inbreeding">Inbreeding</a>
<li> <a href="ibdibs.shtml#homo">Runs of homozygosity</a>
<li> <a href="ibdibs.shtml#segments">Shared segments</a>
</ul>
<a href="anal.shtml">11. Association</a>
<ul>
<li> <a href="anal.shtml#cc">Case/control</a>
<li> <a href="anal.shtml#fisher">Fisher's exact</a>
<li> <a href="anal.shtml#model">Full model</a>
<li> <a href="anal.shtml#strat">Stratified analysis</a>
<li> <a href="anal.shtml#homog">Tests of heterogeneity</a>
<li> <a href="anal.shtml#hotel">Hotelling's T(2) test</a>
<li> <a href="anal.shtml#qt">Quantitative trait</a>
<li> <a href="anal.shtml#qtmeans">Quantitative trait means</a>
<li> <a href="anal.shtml#qtgxe">Quantitative trait GxE</a>
<li> <a href="anal.shtml#glm">Linear and logistic models</a>
<li> <a href="anal.shtml#set">Set-based tests</a>
<li> <a href="anal.shtml#adjust">Multiple-test correction</a>
</ul>
<a href="fanal.shtml">12. Family-based association</a>
<ul>
<li> <a href="fanal.shtml#tdt">TDT</a>
<li> <a href="fanal.shtml#ptdt">ParenTDT</a>
<li> <a href="fanal.shtml#poo">Parent-of-origin</a>
<li> <a href="fanal.shtml#dfam">DFAM test</a>
<li> <a href="fanal.shtml#qfam">QFAM test</a>
</ul>
<a href="perm.shtml">13. Permutation procedures</a>
<ul>
<li> <a href="perm.shtml#perm">Basic permutation</a>
<li> <a href="perm.shtml#aperm">Adaptive permutation</a>
<li> <a href="perm.shtml#mperm">max(T) permutation</a>
<li> <a href="perm.shtml#rank">Ranked permutation</a>
<li> <a href="perm.shtml#genedropmodel">Gene-dropping</a>
<li> <a href="perm.shtml#cluster">Within-cluster</a>
<li> <a href="perm.shtml#mkphe">Permuted phenotypes files</a>
</ul>
<a href="ld.shtml">14. LD calculations</a>
<ul>
<li> <a href="ld.shtml#ld1">2 SNP pairwise LD</a>
<li> <a href="ld.shtml#ld2">N SNP pairwise LD</a>
<li> <a href="ld.shtml#tags">Tagging options</a>
<li> <a href="ld.shtml#blox">Haplotype blocks</a>
</ul>
<a href="haplo.shtml">15. Multimarker tests</a>
<ul>
<li> <a href="haplo.shtml#hap1">Imputing haplotypes</a>
<li> <a href="haplo.shtml#precomputed">Precomputed lists</a>
<li> <a href="haplo.shtml#hap2">Haplotype frequencies</a>
<li> <a href="haplo.shtml#hap3">Haplotype-based association</a>
<li> <a href="haplo.shtml#hap3c">Haplotype-based GLM tests</a>
<li> <a href="haplo.shtml#hap3b">Haplotype-based TDT</a>
<li> <a href="haplo.shtml#hap4">Haplotype imputation</a>
<li> <a href="haplo.shtml#hap5">Individual phases</a>
</ul>
<a href="whap.shtml">16. Conditional haplotype tests</a>
<ul>
<li> <a href="whap.shtml#whap1">Basic usage</a>
<li> <a href="whap.shtml#whap2">Specifying type of test</a>
<li> <a href="whap.shtml#whap3">General haplogrouping</a>
<li> <a href="whap.shtml#whap4">Covariates and other SNPs</a>
</ul>
<a href="proxy.shtml">17. Proxy association</a>
<ul>
<li> <a href="proxy.shtml#proxy1">Basic usage</a>
<li> <a href="proxy.shtml#proxy2">Refining a signal</a>
<li> <a href="proxy.shtml#proxy2b">Multiple reference SNPs</a>
<li> <a href="proxy.shtml#proxy3">Haplotype-based SNP tests</a>
</ul>
<a href="pimputation.shtml">18. Imputation (beta)</a>
<ul>
<li> <a href="pimputation.shtml#impute1">Making reference set</a>
<li> <a href="pimputation.shtml#impute2">Basic association test</a>
<li> <a href="pimputation.shtml#impute3">Modifying parameters</a>
<li> <a href="pimputation.shtml#impute4">Imputing discrete calls</a>
<li> <a href="pimputation.shtml#impute5">Verbose output options</a>
</ul>
<a href="dosage.shtml">19. Dosage data</a>
<ul>
<li> <a href="dosage.shtml#format">Input file formats</a>
<li> <a href="dosage.shtml#assoc">Association analysis</a>
<li> <a href="dosage.shtml#output">Outputting dosage data</a>
</ul>
<a href="metaanal.shtml">20. Meta-analysis</a>
<ul>
<li> <a href="metaanal.shtml#basic">Basic usage</a>
<li> <a href="metaanal.shtml#opt">Misc. options</a>
</ul>
<a href="annot.shtml">21. Annotation</a>
<ul>
<li> <a href="annot.shtml#basic">Basic usage</a>
<li> <a href="annot.shtml#opt">Misc. options</a>
</ul>
<a href="clump.shtml">22. LD-based results clumping</a>
<ul>
<li> <a href="clump.shtml#clump1">Basic usage</a>
<li> <a href="clump.shtml#clump2">Verbose reporting</a>
<li> <a href="clump.shtml#clump3">Combining multiple studies</a>
<li> <a href="clump.shtml#clump4">Best single proxy</a>
</ul>
<a href="grep.shtml">23. Gene-based report</a>
<ul>
<li> <a href="grep.shtml#grep1">Basic usage</a>
<li> <a href="grep.shtml#grep2">Other options</a>
</ul>
<a href="epi.shtml">24. Epistasis</a>
<ul>
<li> <a href="epi.shtml#snp">SNP x SNP</a>
<li> <a href="epi.shtml#case">Case-only</a>
<li> <a href="epi.shtml#gene">Gene-based</a>
</ul>
<a href="cnv.shtml">25. Rare CNVs</a>
<ul>
<li> <a href="cnv.shtml#format">File format</a>
<li> <a href="cnv.shtml#maps">MAP file construction</a>
<li> <a href="cnv.shtml#loading">Loading CNVs</a>
<li> <a href="cnv.shtml#olap_check">Check for overlap</a>
<li> <a href="cnv.shtml#type_filter">Filter on type </a>
<li> <a href="cnv.shtml#gene_filter">Filter on genes </a>
<li> <a href="cnv.shtml#freq_filter">Filter on frequency </a>
<li> <a href="cnv.shtml#burden">Burden analysis</a>
<li> <a href="cnv.shtml#burden2">Geneset enrichment</a>
<li> <a href="cnv.shtml#assoc">Mapping loci</a>
<li> <a href="cnv.shtml#reg-assoc">Regional tests</a>
<li> <a href="cnv.shtml#qt-assoc">Quantitative traits</a>
<li> <a href="cnv.shtml#write_cnvlist">Write CNV lists</a>
<li> <a href="cnv.shtml#report">Write gene lists</a>
<li> <a href="cnv.shtml#groups">Grouping CNVs </a>
</ul>
<a href="gvar.shtml">26. Common CNPs</a>
<ul>
<li> <a href="gvar.shtml#cnv2"> CNPs/generic variants</a>
<li> <a href="gvar.shtml#cnv2b"> CNP/SNP association</a>
</ul>
<a href="rfunc.shtml">27. R-plugins</a>
<ul>
<li> <a href="rfunc.shtml#rfunc1">Basic usage</a>
<li> <a href="rfunc.shtml#rfunc2">Defining the R function</a>
<li> <a href="rfunc.shtml#rfunc2b">Example of debugging</a>
<li> <a href="rfunc.shtml#rfunc3">Installing Rserve</a>
</ul>
<a href="psnp.shtml">28. Annotation web-lookup</a>
<ul>
<li> <a href="psnp.shtml#psnp1">Basic SNP annotation</a>
<li> <a href="psnp.shtml#psnp2">Gene-based SNP lookup</a>
<li> <a href="psnp.shtml#psnp3">Annotation sources</a>
</ul>
<a href="simulate.shtml">29. Simulation tools</a>
<ul>
<li> <a href="simulate.shtml#sim1">Basic usage</a>
<li> <a href="simulate.shtml#sim2">Resampling a population</a>
<li> <a href="simulate.shtml#sim3">Quantitative traits</a>
</ul>
<a href="profile.shtml">30. Profile scoring</a>
<ul>
<li> <a href="profile.shtml#prof1">Basic usage</a>
<li> <a href="profile.shtml#prof2">SNP subsets</a>
<li> <a href="profile.shtml#dose">Dosage data</a>
<li> <a href="profile.shtml#prof3">Misc options</a>
</ul>
<a href="ids.shtml">31. ID helper</a>
<ul>
<li> <a href="ids.shtml#ex">Overview/example</a>
<li> <a href="ids.shtml#intro">Basic usage</a>
<li> <a href="ids.shtml#check">Consistency checks</a>
<li> <a href="ids.shtml#alias">Aliases</a>
<li> <a href="ids.shtml#joint">Joint IDs</a>
<li> <a href="ids.shtml#lookup">Lookups</a>
<li> <a href="ids.shtml#replace">Replace values</a>
<li> <a href="ids.shtml#match">Match files</a>
<li> <a href="ids.shtml#qmatch">Quick match files</a>
<li> <a href="ids.shtml#misc">Misc.</a>
</ul>
<a href="res.shtml">32. Resources</a>
<ul>
<li> <a href="res.shtml#hapmap">HapMap (PLINK format)</a>
<li> <a href="res.shtml#teach">Teaching materials</a>
<li> <a href="res.shtml#mmtests">Multimarker tests</a>
<li> <a href="res.shtml#sets">Gene-set lists</a>
<li> <a href="res.shtml#glist">Gene range lists</a>
<li> <a href="res.shtml#attrib">SNP attributes</a>
</ul>
<a href="flow.shtml">33. Flow-chart</a>
<ul>
<li> <a href="flow.shtml">Order of commands</a>
</ul>
<a href="misc.shtml">34. Miscellaneous</a>
<ul>
<li> <a href="misc.shtml#opt">Command options/modifiers</a>
<li> <a href="misc.shtml#output">Association output modifiers</a>
<li> <a href="misc.shtml#species">Different species</a>
<li> <a href="misc.shtml#bugs">Known issues</a>
</ul>
<a href="faq.shtml">35. FAQ & Hints</a>
</p>
<a href="gplink.shtml">36. gPLINK</a>
<ul>
<li> <a href="gplink.shtml">gPLINK mainpage</a>
<li> <a href="gplink_tutorial/index.html">Tour of gPLINK</a>
<li> <a href="gplink.shtml#overview">Overview: using gPLINK</a>
<li> <a href="gplink.shtml#locrem">Local versus remote modes</a>
<li> <a href="gplink.shtml#start">Starting a new project</a>
<li> <a href="gplink.shtml#config">Configuring gPLINK</a>
<li> <a href="gplink.shtml#plink">Initiating PLINK jobs</a>
<li> <a href="gplink.shtml#view">Viewing PLINK output</a>
<li> <a href="gplink.shtml#hv">Integration with Haploview</a>
<li> <a href="gplink.shtml#down">Downloading gPLINK</a></p>
</ul>
</font>
</td><td width=5%>
<td valign="top">
</p>
<h1>Data management tools</h1>
PLINK provides a simple interface for recoding, reordering, merging,
flipping DNA-strand and extracting subsets of data. </p>
<a name="recode">
<h2>Recode and reorder a sample</h2>
</a></p>
A basic, but often useful feature, is to output a dataset:
<ol>
<li> with the PED file markers reordered for physical position,
<li> with excluded SNPs (negative values in the MAP file) excluded from the new PED file
<li> possibly excluding other SNPs based on filters such as genotyping rate
<li> possibly recoding the SNPs to a 1/2 coding
<li> possibly recoding the SNPs between letters and numbers (A,C,G,T / 1,2,3,4)
<li> possibly transposing the genotype file (SNPs as rows)
<li> possibly recoding the SNP to an additive and dominant pair of components
<li> possibly listing the data with each specific genotype as a distinct row
<li> possibly listing the data one genotype per row
<li> possibly listing only minor alleles
</ol>
The basic option to generate a new dataset is the <tt>--recode</tt> option:
<h5>
plink --file data --recode
</h5></p>
which will output the allele labels as they appear in the original;
the missing genotype code is also preserved if it is different
from <tt>0</tt>. If <tt>--output-missing-genotype</tt> is specified (which can be used together with <tt>--missing-genotype</tt>),
that value will be used instead, so that input and output files can have different missing codes. The same applies to
the phenotype with <tt>--output-missing-phenotype</tt> and <tt>--missing-phenotype</tt>.
</p>
The <a href="data.shtml#bed"><tt>--make-bed</tt></a> option does the
same as <tt>--recode</tt> but creates binary files; these can also be
filtered, etc, as described below.
</p>
<p>In contrast,
<h5>
plink --file data --recode12
</h5></p>
will recode the alleles as <tt>1</tt> and <tt>2</tt> (and the missing genotype will always be
<tt>0</tt>). </p>
Both these commands will create two new files
<pre>
plink.ped
plink.map
</pre>
(where, as usual, "plink" would be replaced by any specified --out
{filename} ).
</p>
</p>
Unless manually specified, for all these options, the usual filters
for missingness and allele frequency will be set so as not to exclude
any SNPs or individuals. By explicitly including an option,
e.g. <tt>--maf 0.05</tt> on the command line, this behaviour is
overridden (see <a href="thresh.shtml">this page</a>).
<p>
By default, any <tt>--recode</tt> option, and also <tt>--make-bed</tt>
will preserve all genotypes exactly as they are. To set to missing
Mendel errors or heterozygous haploid calls, use the
options <tt>--set-me-missing</tt> and <tt>--set-hh-missing</tt>
respectively. For the former, you will also need to specify <tt>--me 1
1</tt> (i.e. to invoke an evaluation of Mendel errors, which does not
occur by default, but without excluding any individuals or SNPs based on
the results, for when you only want to zero-out certain genotypes).
</p>
To recode SNP alleles from A,C,G,T to 1,2,3,4 or vice versa,
use <tt>--allele1234</tt> (to go from letters to numbers)
and <tt>--alleleACGT</tt> (to go from numbers to letters). These flags
should be used in conjunction with a data generation command
(e.g. <tt>--make-bed</tt>), or any other analysis or summary statistic
option. Alleles other than A,C,G,T or 1,2,3,4 will be left unchanged.
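<p>
The letter/number recoding described above can be pictured as a simple lookup. The following is a minimal sketch, not PLINK's own code: the function name <tt>recode_alleles</tt> is hypothetical, and the positional pairing A=1, C=2, G=3, T=4 is assumed from the order in which the two sets of codes are listed.
</p>

```python
# Illustrative only: mimics the letter/number allele mapping described above.
# Assumption: codes pair up positionally (A=1, C=2, G=3, T=4).
LETTER_TO_NUMBER = {"A": "1", "C": "2", "G": "3", "T": "4"}
NUMBER_TO_LETTER = {number: letter for letter, number in LETTER_TO_NUMBER.items()}


def recode_alleles(alleles, mapping):
    """Translate each allele code; codes not in the mapping pass through unchanged."""
    return [mapping.get(allele, allele) for allele in alleles]


alleles = ["A", "G", "0", "T", "I", "D"]
as_numbers = recode_alleles(alleles, LETTER_TO_NUMBER)
as_letters = recode_alleles(as_numbers, NUMBER_TO_LETTER)
print(as_numbers)
print(as_letters)
```

<p>
Codes outside the two sets (here the missing code <tt>0</tt> and the labels <tt>I</tt> and <tt>D</tt>) are returned as they were, matching the statement that other alleles are left unchanged.
</p>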
<p>
It is sometimes useful to have a PED file that is tab-delimited,
except that between alleles of the same genotype a space instead of a
tab is used. A file formatted in this way can load into Excel, for
example, as a tab-delimited file, but with one genotype per column
instead of one allele per column. Use the option <tt>--tab</tt> as
well as <tt>--recode</tt> or <tt>--recode12</tt> to achieve this
effect. </p>
</p>
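<p>
To make the <tt>--tab</tt> layout concrete, here is a small sketch of how one PED line could be laid out in that style. It is an illustration under stated assumptions rather than PLINK's implementation; the helper name <tt>format_ped_line</tt> is hypothetical.
</p>

```python
def format_ped_line(fields):
    """Join the six leading PED columns and each genotype with tabs,
    keeping the two alleles of one genotype separated by a single space."""
    header, alleles = fields[:6], fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected an even number of allele columns")
    genotypes = [f"{alleles[i]} {alleles[i + 1]}" for i in range(0, len(alleles), 2)]
    return "\t".join(header + genotypes)


line = format_ped_line("1 1 0 0 1 1 1 1 G G".split())
print(line)
```

<p>
Loaded as a tab-delimited file, this line gives eight columns: the six leading PED fields and one column per genotype.
</p>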
To make a new file in which non-founders without both parents also in
the same fileset are recoded as founders (i.e. pat and mat codes set
both to 0), add the <tt>--make-founders</tt> flag.
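<p>
The rule behind <tt>--make-founders</tt> can be sketched as follows. This is a hedged illustration only: the function name and the dictionary layout are hypothetical, and it assumes that parents are looked up within the same family ID.
</p>

```python
def make_founders(individuals):
    """individuals: list of dicts with keys fid, iid, pat, mat.
    Returns copies in which anyone who does not have both parents
    present in the list has pat and mat reset to "0"."""
    present = {(person["fid"], person["iid"]) for person in individuals}
    result = []
    for person in individuals:
        # Assumption: a parent counts as present only within the same family ID.
        has_father = (person["fid"], person["pat"]) in present
        has_mother = (person["fid"], person["mat"]) in present
        updated = dict(person)
        if not (has_father and has_mother):
            updated["pat"], updated["mat"] = "0", "0"
        result.append(updated)
    return result


people = [
    {"fid": "F1", "iid": "P1", "pat": "0", "mat": "0"},
    {"fid": "F1", "iid": "M1", "pat": "0", "mat": "0"},
    {"fid": "F1", "iid": "C1", "pat": "P1", "mat": "M1"},
    {"fid": "F1", "iid": "C2", "pat": "P1", "mat": "M9"},  # mother M9 is not in the data
]
recoded = make_founders(people)
for person in recoded:
    print(person["iid"], person["pat"], person["mat"])
```

<p>
In this made-up family, <tt>C1</tt> keeps both parental codes, whereas <tt>C2</tt> is recoded as a founder because only one parent is in the data.
</p>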
<h6>Transposed genotype files</h6>
When using either <tt>--recode</tt> or <tt>--recode12</tt>, you can obtain a transposed text genotype
file by adding the <tt>--transpose</tt> option. This generates two files:
<pre>
plink.tped
plink.fam
</pre>
The first contains the genotype data, with SNPs as rows and individuals as columns, for example: if
the original file was
<pre>
1 1 0 0 1 1 1 1 G G
1 2 0 0 2 1 0 0 A G
1 3 0 0 1 1 1 1 A G
1 4 0 0 2 1 2 1 A A
</pre>
then this would generate
<pre>
1 snp1 0 10001 1 1 0 0 1 1 2 1
1 snp2 0 20001 G G G A G A A A
</pre>
The first four columns are from the MAP file (chromosome, SNP ID,
genetic position, physical position), followed by the genotype
data. The <tt>plink.fam</tt> gives the ID, sex and phenotype
information for each individual. The order of individuals in this
file is the same as the order across the columns of the TPED file. The
FAM file is just the first six columns of the PED file (or literally
the same FAM file if the input were a binary fileset).
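<p>
The relationship between the PED/MAP layout and the transposed layout can be sketched in a few lines. This is an illustration, not PLINK's code: the function name <tt>ped_to_tped</tt> is hypothetical, and the MAP positions are the ones shown in the TPED example above. The sketch keeps the two alleles of each genotype in their input order, so a heterozygote may appear as <tt>A G</tt> where PLINK's own output shows <tt>G A</tt>.
</p>

```python
def ped_to_tped(ped_rows, map_rows):
    """ped_rows: PED lines split into fields (6 leading columns, then two
    allele columns per SNP). map_rows: 4-field MAP lines.
    Returns (tped_rows, fam_rows) as lists of field lists."""
    fam_rows = [row[:6] for row in ped_rows]
    tped_rows = []
    for snp_index, map_fields in enumerate(map_rows):
        start = 6 + 2 * snp_index
        genotypes = []
        for row in ped_rows:
            # Append this individual's two alleles for the current SNP.
            genotypes.extend(row[start:start + 2])
        tped_rows.append(list(map_fields) + genotypes)
    return tped_rows, fam_rows


ped = [line.split() for line in [
    "1 1 0 0 1 1 1 1 G G",
    "1 2 0 0 2 1 0 0 A G",
    "1 3 0 0 1 1 1 1 A G",
    "1 4 0 0 2 1 2 1 A A",
]]
snp_map = [["1", "snp1", "0", "10001"], ["1", "snp2", "0", "20001"]]
tped, fam = ped_to_tped(ped, snp_map)
for row in tped:
    print(" ".join(row))
```

<p>
The individuals appear across the TPED columns in the same order as the rows of the FAM list, which is the property the text above relies on.
</p>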
<h6>Additive and dominance components</h6>
The following format is often useful if one wants to use a standard, non-genetic statistical package
to analyse the data, as here genotypes are coded as a single allele dosage number.
To create a file with SNP genotypes recoded in terms of additive and dominant components, use the
option:
<h5>
plink --file data --recodeAD
</h5></p>
which, assuming <tt>C</tt> is the minor allele, will recode genotypes as
follows:
<pre>
SNP SNP_A , SNP_HET
--- ----- -----
A A -> 0 , 0
A C -> 1 , 1
C C -> 2 , 0
0 0 -> NA , NA
</pre>
In other words, the default for the additive recoding is to count the
number of minor alleles per person. The <tt>--recodeAD</tt> option
produces both an additive and dominance coding: use <tt>--recodeA</tt>
instead to skip the <tt>SNP_HET</tt> coding.
</p>
The <tt>--recodeAD</tt> option saves the data to a single file
<pre>
plink.raw
</pre>
which has a header row indicating the SNP names (with the name of the counted allele
and <tt>_HET</tt> appended to the SNP names to represent additive and
dominant components, respectively).
</p>
For example, consider the following PED file, which has two SNPs:
<pre>
1 1 0 0 1 1 1 1 G G
1 2 0 0 2 1 0 0 A G
1 3 0 0 1 1 1 1 A G
1 4 0 0 2 1 2 1 A A
</pre>
Using the <tt>--recodeAD</tt> option generates the file
<tt>plink.raw</tt>:
<pre>
FID IID PAT MAT SEX PHENOTYPE snp1_2 snp1_HET snp2_G snp2_HET
1 1 0 0 1 1 0 0 2 0
1 2 0 0 2 1 NA NA 1 1
1 3 0 0 1 1 0 0 1 1
1 4 0 0 2 1 1 1 0 0
</pre>
The column labels reflect the snp name (e.g. <tt>snp1</tt>) with the
name of the minor allele appended (i.e. <tt>snp1_2</tt> in the first instance, as
<tt>2</tt> is the minor allele) for the additive component. The
dominant component ( a dummy variable reflecting heterozygote state)
is coded with the <tt>_HET</tt> suffix.
</p>
This file can be easily loaded into <tt>R</tt>: for example:
<pre>
d <- read.table("plink.raw",header=T)
</pre>
For example, for the first SNP, the individuals are coded
<tt>1/1</tt>, <tt>0/0</tt>, <tt>1/1</tt> and <tt>2/1</tt>.
The additive count of the number of minor (<tt>2</tt>) alleles is
therefore <tt>0</tt>, <tt>NA</tt>, <tt>0</tt> and <tt>1</tt>, which
is reflected in the field <tt>snp1_2</tt>. The field <tt>snp1_HET</tt>
is coded <tt>1</tt> for the fourth individual, who is heterozygous;
this field can be used to model the dominance effect of the allele.
</p>
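<p>
The additive and dominance coding can be reproduced with a short sketch. It is an illustration only, not PLINK's implementation: the function name <tt>recode_ad</tt> is hypothetical, <tt>None</tt> stands in for <tt>NA</tt>, and the counted allele is passed in explicitly rather than derived from allele frequencies.
</p>

```python
def recode_ad(genotypes, counted_allele, missing="0"):
    """genotypes: list of (allele1, allele2) tuples for one SNP.
    Returns (additive, het) lists; None stands in for NA."""
    additive, het = [], []
    for a1, a2 in genotypes:
        if a1 == missing or a2 == missing:
            additive.append(None)
            het.append(None)
            continue
        # Additive component: number of copies of the counted allele (0, 1 or 2).
        additive.append(int(a1 == counted_allele) + int(a2 == counted_allele))
        # Dominance component: 1 for a heterozygote, 0 otherwise.
        het.append(1 if a1 != a2 else 0)
    return additive, het


snp1 = [("1", "1"), ("0", "0"), ("1", "1"), ("2", "1")]
snp2 = [("G", "G"), ("A", "G"), ("A", "G"), ("A", "A")]
snp1_additive, snp1_het = recode_ad(snp1, "2")
snp2_additive, snp2_het = recode_ad(snp2, "G")
print(snp1_additive, snp1_het)
print(snp2_additive, snp2_het)
```

<p>
The four lists correspond to the <tt>snp1_2</tt>, <tt>snp1_HET</tt>, <tt>snp2_G</tt> and <tt>snp2_HET</tt> columns of the example file above.
</p>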
The behavior of the <tt>--recodeA</tt> and <tt>--recodeAD</tt>
commands can be changed with the <tt>--recode-allele</tt>
command. This allows for the 0, 1, 2 count to reflect the number of a
pre-specified allele type per SNP, rather than the number of the minor
allele. This command takes as a single argument the name of a file
that lists SNP name and allele to report, e.g. if the
file <tt>recode.txt</tt> contained
<pre>
snp1 1
snp2 A
</pre>
then
<h5>
plink --file data --recodeAD --recode-allele recode.txt
</h5></p>
would now report in the LOG file
<pre>
Reading allele coding list from [ recode.txt ]
Read allele codes for 2 SNPs
</pre>
and the <tt>plink.raw</tt> file would read
<pre>
FID IID PAT MAT SEX PHENOTYPE snp1_1 snp1_HET snp2_A snp2_HET
1 1 0 0 1 1 2 0 0 0
1 2 0 0 2 1 NA NA 1 1
1 3 0 0 1 1 2 0 1 1
1 4 0 0 2 1 1 1 2 0
</pre>
If the SNP is monomorphic, by default the allele code written out will
be <tt>0</tt> and all individuals will have a count of 0
(or <tt>NA</tt>). If an allele is specified
in <tt>--recode-allele</tt> that is not seen in the data, similarly
all individuals will receive a 0 count (i.e. rather than an error
being given).
</p><strong>NOTE</strong> For SNPs that have a minor allele frequency
of exactly 0.50, as for the second SNP in the example above,
which allele is labelled as minor will depend on which was first
encountered in the PED file.
</p>
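<p>
As a rough sketch of how a <tt>--recode-allele</tt> list changes the count, the following reads SNP/allele pairs and counts the requested allele. The function names are hypothetical and this is not PLINK's code; <tt>None</tt> stands in for <tt>NA</tt>.
</p>

```python
def read_recode_alleles(lines):
    """Parse 'SNP allele' pairs, one per line, into a dict."""
    coding = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            coding[fields[0]] = fields[1]
    return coding


def count_allele(genotypes, allele, missing="0"):
    """Count copies of `allele` per genotype; None stands in for NA."""
    counts = []
    for a1, a2 in genotypes:
        if a1 == missing or a2 == missing:
            counts.append(None)
        else:
            counts.append(int(a1 == allele) + int(a2 == allele))
    return counts


coding = read_recode_alleles(["snp1 1", "snp2 A"])
data = {
    "snp1": [("1", "1"), ("0", "0"), ("1", "1"), ("2", "1")],
    "snp2": [("G", "G"), ("A", "G"), ("A", "G"), ("A", "A")],
}
counts = {snp: count_allele(genotypes, coding[snp]) for snp, genotypes in data.items()}
unseen = count_allele(data["snp2"], "T")  # allele never observed at this SNP
print(counts)
print(unseen)
```

<p>
An allele that never occurs in the data simply yields a count of 0 for every non-missing genotype, in line with the behaviour described above.
</p>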
<h6>Listing by minor allele count</h6>
The command
<pre>
--recode-rlist
</pre>
will generate the files
<pre>
plink.rlist
plink.fam
plink.map
</pre>
where the <tt>plink.rlist</tt> file format is
<pre>
SNP
GENOTYPE (BOTH ALLELES)
FID/IID PAIRS ...
</pre>
For example, suppose a particular SNP, <tt>rs2379981</tt>, has a minor
allele (<tt>G</tt>) seen twice (in two heterozygotes) and two individuals with a
missing genotype; all other individuals are homozygous for the major allele. In
this case, we would see two rows in the <tt>plink.rlist</tt> file:
<pre>
rs2379981 HET G A CH18612 NA18612 JA18998 NA18998
rs2379981 NIL 0 0 JA18999 NA18999 JA19003 NA19003
</pre>
indicating, for example, that individual FID/IID CH18612/NA18612 has a
rare heterozygote.
</p>
This command could be used in conjunction with the
<tt>--reference</tt> command and <tt>--freq</tt> to list all instances
of rare non-reference alleles, e.g. from resequencing study data.
<h6>Listing by long-format (LGEN)</h6>
To output a file in the LGEN format, use the command
<pre>
--recode-lgen
</pre>
which generates files
<pre>
plink.lgen
plink.fam
plink.map
</pre>
that can be read with the <tt>--lfile</tt> command. Adding the option
<pre>
--with-reference
</pre>
will generate a fourth file
<pre>
plink.ref
</pre>
that can be read back in with the <tt>--reference</tt> command when using <tt>--lfile</tt>.
<h6>Listing by genotype</h6>
Another format that might sometimes be useful is produced by the <tt>--list</tt> option, which generates a file
<pre>
plink.list
</pre>
that is ordered one genotype per row, listing all family and individual IDs of people with that genotype. For
example, if we have a file with two SNPs <tt>rs1001</tt> and <tt>rs2002</tt> (both on chromosome 1):
<pre>
A 1 0 0 1 2 A A 1 1
B 2 0 0 1 2 A C 0 0
C 3 0 0 1 1 A C 1 2
D 4 0 0 1 1 C C 1 2
</pre>
then the option
<h5>
plink --file mydata --list
</h5></p>
will generate the file <tt>plink.list</tt>
<pre>
1 rs1001 AA A 1
1 rs1001 AC B 2 C 3
1 rs1001 CC D 4
1 rs1001 00
1 rs2002 22
1 rs2002 21 C 3 D 4
1 rs2002 11 A 1
1 rs2002 00 B 2
</pre>
which has columns
<pre>
Chromosome
SNP identifier
Genotype
Family ID, Individual ID for 1st person
Family ID, Individual ID for 2nd person
...
Family ID, Individual ID for final person
</pre>
Obviously, different rows will have a different number of columns.
Here, we see that individual <tt>A 1</tt> has the <tt>A/A</tt> genotype for <tt>rs1001</tt>, etc.
This option is often useful in conjunction with <tt>--snp</tt>, if you want an easy breakdown of which individuals
have which genotypes.
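<p>
The grouping that <tt>--list</tt> performs can be sketched as follows. This is an illustration under stated assumptions, not PLINK's implementation: the function name is hypothetical, genotype keys are formed by sorting the two alleles (so a heterozygote is keyed <tt>12</tt> where PLINK writes <tt>21</tt>), and genotypes carried by nobody get no entry, whereas PLINK writes an empty row for them.
</p>

```python
from collections import defaultdict


def list_by_genotype(ped_rows, snp_names):
    """Group (family ID, individual ID) pairs by SNP and genotype.
    ped_rows: PED lines split into fields; snp_names: SNP IDs in MAP order."""
    groups = defaultdict(list)
    for row in ped_rows:
        person = (row[0], row[1])
        for snp_index, snp in enumerate(snp_names):
            a1, a2 = row[6 + 2 * snp_index: 8 + 2 * snp_index]
            # Sort the alleles so that "A C" and "C A" share one key.
            genotype = "".join(sorted((a1, a2)))
            groups[(snp, genotype)].append(person)
    return dict(groups)


ped = [line.split() for line in [
    "A 1 0 0 1 2 A A 1 1",
    "B 2 0 0 1 2 A C 0 0",
    "C 3 0 0 1 1 A C 1 2",
    "D 4 0 0 1 1 C C 1 2",
]]
groups = list_by_genotype(ped, ["rs1001", "rs2002"])
for (snp, genotype), people in groups.items():
    print(snp, genotype, people)
```

<p>
Each entry pairs one SNP and genotype with the individuals who carry it, which is the information held in one row of <tt>plink.list</tt>.
</p>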
<a name="snplist">
<h2>Write SNP list files</h2>
</a></p>
To output just the list of SNPs that remain after all filtering, etc, use the
<tt>--write-snplist</tt> command, e.g. to get a list of all high frequency,
high genotyping-rate SNPs:
<h5>
plink --bfile mydata --maf 0.05 --geno 0.05 --write-snplist
</h5></p>
which generates a file
<pre>
plink.snplist
</pre>
This file is simply a list of included SNP names, i.e. the same SNPs that a <tt>--recode</tt> or <tt>--make-bed</tt> statement
would have produced in the corresponding MAP or BIM files.
<a name="updatemap">
<h2>Update SNP information</h2>
</a></p>
To automatically update either the genetic or physical positions for some or all SNPs in a dataset, use the
<tt>--update-map</tt> command, which takes a single parameter of a filename, e.g.
<h5>
plink --bfile mydata --update-map build36.txt --make-bed --out mydata2
</h5></p>
where, for example, the file <tt>build36.txt</tt> contains
new physical positions for SNPs, based on dbSNP126/build 36, in the simple format of SNP/position per line, e.g.
<pre>
rs100001 1000202
rs100002 6252678
rs100003 7635353
...
</pre>
To change genetic position (3rd column in map file) add the
flag <tt>--update-cm</tt> <em>as well
as</em> <tt>--update-map</tt>. On its own, <tt>--update-map</tt> does not change
chromosome codes; that requires the <tt>--update-chr</tt> modifier described below.
Normally, one would want to save the new file with the changed
positions, as in the example above. Other commands (e.g. association
testing) can be combined with <tt>--update-map</tt> instead, but the updated
positions would then be lost, as the changes are not automatically
saved.
</p>
The file with new SNP information does not need to feature all of the SNPs
in the current dataset: SNPs not in this file will be left unchanged. If a SNP
is listed more than once in the file, an error will be reported.
</p>
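<p>
Since a SNP listed more than once causes an error, it can be convenient to check an update file before running PLINK. The following is a minimal Python sketch under that assumption; the helper name <tt>duplicated_snps</tt> and the duplicated example row are hypothetical and not part of PLINK.
</p>

```python
from collections import Counter


def duplicated_snps(lines):
    """Return the sorted SNP names that appear on more than one row."""
    # The SNP name is the first whitespace-separated field on each row.
    names = [line.split()[0] for line in lines if line.strip()]
    return sorted(name for name, count in Counter(names).items() if count > 1)


# Hypothetical update file contents with rs100002 accidentally repeated.
update_rows = """\
rs100001 1000202
rs100002 6252678
rs100003 7635353
rs100002 6252999
""".splitlines()

dups = duplicated_snps(update_rows)
print(dups)  # ['rs100002']
```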
<strong>NOTE</strong> When updating the map positions, it is possible that the
implied ordering of SNPs in the dataset might change. If this is the case, a
message will be written to the LOG file. Although the positions are updated,
the order is not changed internally: as SNPs might be out of order, it is
important to correct this by saving and reloading the file. For example, if the original
contains
<pre>
...
rs10001 500000
rs10002 520000
rs10003 540000
rs10004 560000
...
</pre>
but we update <tt>rs10002</tt> to position 580000, the data will be
<pre>
...
rs10001 500000
rs10002 580000
rs10003 540000
rs10004 560000
...
</pre>
Only after saving and reloading (e.g. <tt>--make-bed</tt> / <tt>--bfile</tt>) will the file be
in the correct order:
<pre>
...
rs10001 500000
rs10003 540000
rs10004 560000
rs10002 580000
...
</pre>
This will only be an issue for commands which rely on relative SNP
positions (e.g. <tt>--hap-window</tt>, <tt>--homozyg</tt>, etc). If the LOG file does
not show a message that the order of SNPs has changed after using <tt>--update-map</tt>,
one need not worry.
</p>
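<p>
The ordering issue in the note above can be sketched in Python. This is only an illustration, not how PLINK works internally: it assumes all SNPs are on a single chromosome, and the function names <tt>apply_updates</tt> and <tt>out_of_order</tt> are hypothetical.
</p>

```python
def apply_updates(markers, updates):
    """Replace positions for listed SNPs; file order is left untouched."""
    return [(snp, updates.get(snp, pos)) for snp, pos in markers]


def out_of_order(markers):
    """True if any SNP has a larger position than the SNP that follows it."""
    positions = [pos for _, pos in markers]
    return any(a > b for a, b in zip(positions, positions[1:]))


# The four SNPs from the example, assumed to be on the same chromosome.
markers = [
    ("rs10001", 500000),
    ("rs10002", 520000),
    ("rs10003", 540000),
    ("rs10004", 560000),
]

updated = apply_updates(markers, {"rs10002": 580000})
print(out_of_order(updated))  # True: rs10002 now sits before rs10003

# Sorting by position mimics the order obtained after saving and reloading.
resorted = sorted(updated, key=lambda marker: marker[1])
print([snp for snp, _ in resorted])
```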
The name and chromosome code of a SNP can also be changed, by adding the modifiers
<tt>--update-name</tt> or <tt>--update-chr</tt>, e.g.
<h5>
./plink --bfile mydata --update-map rsID.lst --update-name --make-bed --out mydata2
</h5></p>
or
<h5>
./plink --bfile mydata --update-map chr-codes.txt --update-chr --make-bed --out mydata2
</h5></p>
In both cases, the format of the input file should be two columns per line, e.g.
<pre>
SNP_A-1919191 rs123456
SNP_A-64646464 rs222222
...
</pre>
or, for chromosome codes (use numeric values and codes X, Y, etc)
<pre>
rs123456 1
rs987654 18
rs678678 X
...
</pre>
You cannot update more than one attribute at a time for SNPs.
<a name="updateallele">
<h2>Update allele information</h2>
</a></p>
To recode alleles, for example from A,B allele coding to A,C,G,T
coding, use the command <tt>--update-alleles</tt>, for example
<h5>
./plink --bfile mydata --update-alleles mylist.txt --make-bed --out newfile
</h5></p>
where the file <tt>mylist.txt</tt> contains five columns per row listing,
<pre>
SNP identifier
Old allele code for one allele
Old allele code for other allele
New allele code for first allele
New allele code for other allele
</pre>
For example,
<pre>
rs10001 A B G T
rs10002 A B A C
...
</pre>
will change allele A to G and allele B to T for rs10001, etc.
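<p>
To make the five-column format concrete, here is a minimal Python sketch of the recoding it describes. It is an illustration only, not PLINK's implementation; the function names are hypothetical, and it assumes that alleles not listed for a SNP (such as the missing code <tt>0</tt>) are left unchanged.
</p>

```python
def load_allele_map(lines):
    """Build {SNP: {old allele: new allele}} from five-column rows."""
    recode = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or incomplete rows
        snp, old1, old2, new1, new2 = fields
        recode[snp] = {old1: new1, old2: new2}
    return recode


def recode_genotype(recode, snp, alleles):
    """Translate each allele of a genotype; unlisted alleles pass through."""
    mapping = recode.get(snp, {})
    return tuple(mapping.get(allele, allele) for allele in alleles)


# The two rows from the example mylist.txt.
rows = """\
rs10001 A B G T
rs10002 A B A C
""".splitlines()

recode = load_allele_map(rows)
print(recode_genotype(recode, "rs10001", ("A", "B")))  # ('G', 'T')
print(recode_genotype(recode, "rs10002", ("B", "B")))  # ('C', 'C')
```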
<a name="refallele">
<h2>Force a specific reference allele</h2>
</a></p>
It is possible to manually specify which allele is the <tt>A1</tt>
allele and which is <tt>A2</tt>. By default, the minor allele is
assigned to be <tt>A1</tt>. All odds ratios, etc, are calculated
with respect to the <tt>A1</tt> allele (i.e. an odds ratio greater
than 1 implies that the <tt>A1</tt> allele increases risk).
</p>
To set a particular allele as <tt>A1</tt>, which might not be the minor allele,
use the command <tt>--reference-allele</tt>, which can be used with
any other analysis or data generation command, e.g.
<h5>
./plink --bfile mydata --reference-allele mylist.txt --assoc
</h5></p>
where the file <tt>mylist.txt</tt> contains a list of SNP IDs and
the allele to be set as <tt>A1</tt>, e.g.
<pre>
rs10001 A
rs10002 T
rs10003 T
...
</pre>
This command can make it easier to compare results across studies: for example, the
reported odds ratios can be put in the same direction as those of another study.
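<p>
The effect of swapping the <tt>A1</tt> allele on an allelic odds ratio can be sketched in Python. The allele counts below are invented for illustration, the function name is hypothetical, and the calculation is the standard 2x2 table odds ratio rather than a reproduction of PLINK's own code.
</p>

```python
def allelic_odds_ratio(case_a1, case_a2, control_a1, control_a2):
    """Odds ratio for carrying A1 rather than A2, cases versus controls."""
    return (case_a1 * control_a2) / (case_a2 * control_a1)


# Hypothetical allele counts: (A1, A2) in cases and in controls.
cases = (30, 70)
controls = (20, 80)

or_original = allelic_odds_ratio(cases[0], cases[1], controls[0], controls[1])

# Making the other allele A1 swaps the two count columns,
# which turns the odds ratio into its reciprocal.
or_swapped = allelic_odds_ratio(cases[1], cases[0], controls[1], controls[0])

print(or_original > 1, or_swapped < 1)  # True True
```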
<a name="updatefam">
<h2>Update individual information</h2>
</a></p>
Rather than try to manually edit PED or FAM files (which is not advised), use these functions
to change ID codes, sex and parental information for individuals in a fileset. The command
<h5>
plink --bfile mydata --update-ids recoded.txt --make-bed --out mydata2
</h5></p>
changes ID codes for individuals specified in <tt>recoded.txt</tt>, which should be
in the format of four columns per row: old FID, old IID, new FID, new IID, e.g.
<pre>
FA 1001 F0001 I0001
FA 1002.dup F0002 I0002
...
</pre>
will, for example, find the person <tt>FA/1001</tt> and change their FID/IID
values to <tt>F0001/I0001</tt>. Not all people need be listed in the file (those not
listed will not be changed), and the order of the file need not match the original dataset.
</p>
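<p>
The ID mapping described above can be sketched in Python. This is an illustration of the four-column format only, not PLINK's implementation; the function names and the unlisted individual <tt>FB/2001</tt> are hypothetical.
</p>

```python
def load_id_map(lines):
    """Build {(old FID, old IID): (new FID, new IID)} from four-column rows."""
    id_map = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or incomplete rows
        old_fid, old_iid, new_fid, new_iid = fields
        id_map[(old_fid, old_iid)] = (new_fid, new_iid)
    return id_map


def rename(id_map, individuals):
    """Apply the mapping; people not listed keep their IDs and their order."""
    return [id_map.get(person, person) for person in individuals]


# The two rows from the example recoded.txt.
rows = """\
FA 1001 F0001 I0001
FA 1002.dup F0002 I0002
""".splitlines()

id_map = load_id_map(rows)

# Dataset order differs from file order; FB/2001 is not listed in the file.
dataset = [("FB", "2001"), ("FA", "1002.dup"), ("FA", "1001")]
renamed = rename(id_map, dataset)
print(renamed)
```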
Two similar commands (which cannot be run at the same time as <tt>--update-ids</tt>) are
<h5>
--update-sex myfile1.txt
</h5></p>
that expects 3 columns per row:
<pre>
FID
IID
SEX Coded 1/2/0 for M/F/missing
</pre>
and
<h5>
--update-parents myfile2.txt
</h5></p>
that expects 4 columns per row: