Index of /goldenPath/hg19/bigZips

Overview
^^^^^^^^

This directory and the subdirectory "initial/" contain the genome sequence files
from the original release in 2009 by NCBI (GCA_000001405.1) and some related
files. This original version and all subsequent changes are called "hg19" at
UCSC. All coordinates from the initial assembly will always be valid on the
"hg19" UCSC Genome Browser, as no changes were made to existing sequences.

In 2020 we added the new sequences from GRC patch release GRCh37.p13
(GCA_000001405.14) plus the revised Cambridge Reference Sequence (rCRS)
mitochondrial sequence. These can be found in the subdirectory
"p13.plusMT/" or its alias "latest/". See the section "Patches to hg19" below.
Most users looking at this text are looking for the file "latest/hg19.fa.gz".

There is one exception: if you need a file for a genome aligner, like BWA,
bowtie2, hisat2 or similar, please read the section "Analysis Set" below and
look at the directory "analysisSet/".

The subdirectory "genes/" contains select gene transcript sets in GFF format.

GRCh37 was produced and is updated by the Genome Reference Consortium:
https://www.ncbi.nlm.nih.gov/grc

Differences from the NCBI files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two main differences compared to the NCBI files:

- the mitochondrial genome: since the release of the UCSC hg19
assembly, the Homo sapiens mitochondrion sequence (represented as "chrM" in the
Genome Browser) has been replaced in GenBank with the record NC_012920, the
revised Cambridge Reference Sequence (rCRS). We have not replaced the original
sequence, NC_001807, as chrM in the hg19 Genome Browser. However, files in the
subdirectory p13.plusMT include NC_012920 as "chrMT", in addition to the original
"chrM".

- also, the FASTA files of NCBI's GCA_000001405.1 distributed at
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/
have different sequence identifiers ("NC_000001.10" for NCBI instead of "chr1"
for UCSC) and the repeatmasking, expressed by lowercasing letters, was done
with different RepeatMasker settings.

Please also read the notes on our hg19 overview page at:
http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
The page explains the naming scheme of unplaced contigs and haplotypes,
e.g. HSCHR6_MHC_APD_CTG1 = GL000250.1 => "chr6_apd_hap1"
and the placement of the pseudo-autosomal (PAR) regions on chrX and chrY.

Analysis set
^^^^^^^^^^^^

The GRCh37/hg19 patch13 assembly contains not only the chromosome
sequences but also a mitochondrial genome, unplaced sequences, alternate
haplotypes and fix patches. Some of these sequences can confuse modern aligners.

The subdirectory analysisSet/ contains files with optimized versions of the
genome for these aligners or similar high-throughput analysis programs. The
README.txt file in that directory provides more details.

Patches to hg19
^^^^^^^^^^^^^^^

The Genome Reference Consortium has been adding additional (short) sequences
since the initial release. We have added these patches in 2020 but keep the
updated releases in separate directories:

- The initial/ subdirectory contains files for the initial release of GRCh37,
without any patch release sequences.

- The p13.plusMT/ subdirectory contains files for GRCh37.p13 (patch release 13)
plus the rCRS mitochondrion sequence (NC_012920) as "chrMT".
GRC patch releases do not change any previously existing sequences; they
simply add new sequences for fix patches or alternate haplotypes that
correspond to specific regions of the main chromosome sequences.
The Genome Browser displays this expanded set of assembly sequences.

- The latest/ subdirectory contains files that do not include version indicators
in their names, but are symbolic links to files in the most recent version
subdirectory, i.e. p13.plusMT.

- Data files in the current directory are the same as files in the initial/
subdirectory, i.e. they are from the initial GRCh37 release and do not
include the patch sequences that are included in the Genome Browser.

Sequence names
^^^^^^^^^^^^^^

For historical reasons, what UCSC calls "chr1", Ensembl calls "1" and NCBI
calls "NC_000001.10". The sequences are identical though. To map between UCSC,
Ensembl and NCBI names, use our table "chromAlias", available via our Table
Browser or as file:
https://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromAlias.txt.gz
We also provide a Python command line tool to convert sequence names in the
most common genomics file formats:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/chromToUcsc

During genome assembly, reads are assembled into "contigs" (a few kbp long),
which are then joined into longer "scaffolds" of a few hundred kbp. These are
finally placed, often manually e.g. with FISH assays, onto chromosomes.
The .agp file below describes how these were placed onto chromosomes.

The alternate haplotype (_hap) sequences were released with the initial assembly;
subsequent patches introduced fix sequences (_fix) and novel sequences (_alt).
For more information on patches see: http://genome.ucsc.edu/blog/patches/
The following list represents all the types of sequences found in the hg19 genome:

Chromosomes:
- made from scaffolds placed onto chromosome locations, 95% of the genome file
- format: chr{chromosome number or name}
- e.g. chr1 or chrX, chrM for the (non-rCRS) mitochondrial genome.

Unlocalized scaffolds:
- a sequence found in an assembly that is associated with a specific
chromosome but cannot be ordered or oriented on that chromosome.
- format: chr{chromosome number or name}_{sequence_accession}_random
- e.g. chr17_gl000205_random

Unplaced scaffolds:
- a sequence found in an assembly that is not associated with any chromosome.
- format: chrUn_{sequence_accession}
- e.g. chrUn_gl000223

Alternative haplotypes in initial GRCh37 release:
- a sequence that provides an alternate representation of a locus found
in the primary assembly. These sequences were present in the initial hg19
assembly release. They do not represent complete chromosome sequences.
There are 9 present in the initial hg19 assembly.
For more information on the 7 chr6 alternate haplotypes see the MHC Haplotype
Project website: http://www.ucl.ac.uk/cancer/medical-genomics/mhc
- format: chr{chromosome number or name}_{haplotype_name}_hap{haplotype_number_in_chromosome}
- e.g. chr6_cox_hap2

Alternate loci scaffolds from patch releases:
- a scaffold that provides an alternate representation of a locus found
in the primary assembly. These sequences do not represent a complete
chromosome sequence although there is no hard limit on the size of the
alternate locus; currently most are less than 1 Mb. In the context of
hg19, all these sequences have been added through patch releases.
- these sequences are not part of the files in the initial/ directory
- format: chr{chromosome number or name}_{sequence_accession}_alt
- e.g. chr12_gl877876_alt

Fix loci scaffolds:
- a patch that corrects sequence or reduces an assembly gap in a given
major release. FIX patch sequences are meant to be incorporated into
the primary or existing alt-loci assembly units at the next major
release.
- these sequences are not part of the files in the initial/ directory
- format: chr{chromosome number or name}_{sequence_accession}_fix
- e.g. chrX_kb021648_fix
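
The naming scheme listed above can be turned into a small classifier. The
following Python sketch is illustrative only: the function name
classify_hg19_sequence and the category labels are assumptions made for this
example, and the rules rely solely on the prefixes and suffixes described
above.

```python
import re


def classify_hg19_sequence(name: str) -> str:
    """Return a rough category for a UCSC hg19 sequence name.

    The category labels are hypothetical names chosen for this sketch.
    """
    if not name.startswith("chr"):
        return "unknown"          # e.g. Ensembl "1" or an NCBI accession
    if name.startswith("chrUn_"):
        return "unplaced"         # not associated with any chromosome
    if name.endswith("_random"):
        return "unlocalized"      # chromosome known, position unknown
    if re.search(r"_hap\d+$", name):
        return "haplotype"        # alternate haplotype, initial release
    if name.endswith("_alt"):
        return "alt"              # alternate locus from a patch release
    if name.endswith("_fix"):
        return "fix"              # fix patch from a patch release
    if "_" not in name:
        return "chromosome"       # chr1 .. chr22, chrX, chrY, chrM, chrMT
    return "unknown"


# The example names are the ones quoted in the list above.
for example in ["chr1", "chrM", "chr17_gl000205_random", "chrUn_gl000223",
                "chr6_cox_hap2", "chr12_gl877876_alt", "chrX_kb021648_fix"]:
    print(example, classify_hg19_sequence(example))
```

A filter built on such a function could, for instance, keep only the
"chromosome" category before running an analysis, although the files in
analysisSet/ are the recommended starting point for aligners.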

Files
^^^^^

Files included in this directory are from the initial 2009 release of the genome.
Files for the most current patch version of the genome are in the "latest/" subdirectory:

hg19.fa.gz - "Soft-masked" assembly sequence in one file.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case. Again, the most current version of this
file is latest/hg19.fa.gz
For many types of analysis that include sequence comparisons,
the files in the directory analysisSet are recommended, as these
include fewer duplicates.

hg19.fa.masked.gz - based on hg19.fa.gz, "hard-masked" assembly sequence in
one file. Repeats are masked by capital Ns; non-repeating sequence is shown in
upper case.
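
The relationship between the soft-masked and hard-masked conventions can be
sketched in a few lines of Python. This is an illustration of the two
conventions described above, not the procedure used to build the files; the
function names hard_mask and soft_masked_fraction are assumptions made for
this example.

```python
def hard_mask(sequence: str) -> str:
    """Replace every lower-case (soft-masked) base with a capital N."""
    return "".join("N" if base.islower() else base for base in sequence)


def soft_masked_fraction(sequence: str) -> float:
    """Fraction of bases that are lower case, i.e. marked as repeats."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if base.islower()) / len(sequence)


# A made-up toy sequence: upper case = unique, lower case = repeat.
soft = "ACGTacgtNNacGT"
print(hard_mask(soft))
print(soft_masked_fraction(soft))
```

Because soft masking only changes letter case, tools that compare sequences
case-insensitively see the same bases in both the soft-masked and the
unmasked form.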

hg19.fa.out.gz - RepeatMasker .out file. RepeatMasker was run with the
-s (sensitive) setting.
Jan 29 2009 (open-3-2-7) version of RepeatMasker
RepBase library: RELEASE 20090120

hg19.fa.align.gz - RepeatMasker .align file. RepeatMasker was run with the
-s (sensitive) setting.
Jan 29 2009 (open-3-2-7) version of RepeatMasker
RepBase library: RELEASE 20090120

hg19.trf.bed.gz - Tandem Repeats Finder locations, filtered to keep repeats
with period less than or equal to 12, and translated into UCSC's BED
format.

hg19.2bit - contains the complete human/hg19/GRCh37 genome sequence
in the 2bit file format. Repeats from RepeatMasker and Tandem Repeats
Finder (with period of 12 or less) are shown in lower case; non-repeating
sequence is shown in upper case. The utility program, twoBitToFa (available
from the kent src tree), can be used to extract .fa file(s) from
this file. A pre-compiled version of the command line tool can be
found at:
http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
See also:
http://genome.ucsc.edu/admin/git.html
https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/userApps/README

hg19.agp.gz - Description of how the assembly was generated from
fragments.

chromAgp.tar.gz - Description of how the assembly was generated from
fragments, unpacking to one file per chromosome.
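
As a rough guide to reading these files, the Python sketch below parses
single AGP lines. It assumes the files follow the standard tab-separated AGP
layout (object, start, end, part number, component type, then either
component or gap columns); the function name parse_agp_line and the two
sample lines are made up for illustration and are not taken from
hg19.agp.gz.

```python
GAP_TYPES = {"N", "U"}  # component_type codes that denote gaps in AGP


def parse_agp_line(line: str) -> dict:
    """Parse one tab-separated AGP line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "object": fields[0],            # e.g. a chromosome name
        "object_beg": int(fields[1]),   # 1-based start on the object
        "object_end": int(fields[2]),   # inclusive end on the object
        "part_number": int(fields[3]),
        "component_type": fields[4],
    }
    if fields[4] in GAP_TYPES:
        record.update(is_gap=True,
                      gap_length=int(fields[5]),
                      gap_type=fields[6],
                      linkage=fields[7])
    else:
        record.update(is_gap=False,
                      component_id=fields[5],
                      component_beg=int(fields[6]),
                      component_end=int(fields[7]),
                      orientation=fields[8])
    return record


# Hypothetical sample lines, invented for this example.
sample = [
    "chrEx\t1\t10000\t1\tN\t10000\ttelomere\tno",
    "chrEx\t10001\t15000\t2\tF\tEX000001.1\t1\t5000\t+",
]
for line in sample:
    print(parse_agp_line(line))
```

When reading a real file, lines starting with "#" would need to be skipped
before calling the parser.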

chromFa.tar.gz - The assembly sequence in one file per chromosome.
Repeats from RepeatMasker and Tandem Repeats Finder (with period
of 12 or less) are shown in lower case; non-repeating sequence is
shown in upper case.

chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
Repeats are masked by capital Ns; non-repeating sequence is shown in
upper case.

chromOut.tar.gz - RepeatMasker .out files (one file per chromosome).
RepeatMasker was run with the -s (sensitive) setting.
Using: Jan 29 2009 (open-3-2-7) version of RepeatMasker and
RELEASE 20090120 of library RepeatMaskerLib.embl

chromTrf.tar.gz - Tandem Repeats Finder locations, filtered to keep repeats
with period less than or equal to 12, and translated into UCSC's BED 5+
format (one file per chromosome).

est.fa.gz - Human ESTs in GenBank. This sequence data is updated
regularly via automatic GenBank updates.

md5sum.txt - checksums of files in this directory
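
A download can be checked against md5sum.txt with a short Python sketch such
as the one below. It assumes md5sum.txt uses the usual md5sum layout of one
checksum and one file name per line; the function names are hypothetical
helpers written for this example.

```python
import hashlib
import os


def md5_of_stream(handle, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a binary stream, reading in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: handle.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum_lines(lines) -> dict:
    """Map file name -> expected checksum from md5sum-style lines."""
    expected = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        checksum, filename = parts
        # md5sum marks binary mode with a leading "*" on the file name.
        expected[filename.lstrip("*")] = checksum.lower()
    return expected


def verify(path: str, expected: dict) -> bool:
    """Return True if the file's MD5 matches the listed checksum."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected.get(os.path.basename(path))


# Example use after downloading both files into the current directory:
# with open("md5sum.txt") as listing:
#     expected = parse_md5sum_lines(listing)
# print(verify("hg19.2bit", expected))
```

Reading in chunks keeps memory use small, which matters for the larger
files in this directory.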

mrna.fa.gz - Human mRNA from GenBank. This sequence data is updated
regularly via automatic GenBank updates.

refMrna.fa.gz - RefSeq mRNA from the same species as the genome.
This sequence data is updated regularly via automatic GenBank
updates.

upstream1000.fa.gz - Sequences 1000 bases upstream of annotated
transcription starts of RefSeq genes with annotated 5' UTRs.
This file is updated weekly so it might be slightly out of sync with
the RefSeq data which is updated daily for most assemblies.

upstream2000.fa.gz - Same as upstream1000, but 2000 bases.

upstream5000.fa.gz - Same as upstream1000, but 5000 bases.

xenoMrna.fa.gz - GenBank mRNAs from species other than that of
the genome.

hg19.chrom.sizes - Two-column tab-separated text file containing assembly
sequence names and sizes.
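
Because the file is a plain two-column table, it is easy to load. The Python
sketch below is one possible way to do so; the function name
parse_chrom_sizes is an assumption made for this example, and the two sample
lines stand in for the real file.

```python
def parse_chrom_sizes(lines) -> dict:
    """Map sequence name -> size from two-column tab-separated lines."""
    sizes = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, size = line.split("\t")
        sizes[name] = int(size)
    return sizes


# Two sample lines in the same layout as hg19.chrom.sizes.
example = ["chr1\t249250621\n", "chrM\t16571\n"]
sizes = parse_chrom_sizes(example)
print(sizes["chr1"], sum(sizes.values()))

# With the real file:
# with open("hg19.chrom.sizes") as handle:
#     sizes = parse_chrom_sizes(handle)
```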

hg19.gc5Base.wigVarStep.gz - ASCII wiggle variable-step data values used
to construct the GC Percent track.

hg19.gc5Base.wig.gz - wiggle database table for the GC Percent track.
This is an older standard alternative to the current bigWig format
of the track, sometimes useful for analysis.

hg19.gc5Base.wib - binary data corresponding to the gc5Base.wig file.
See also: http://genome.ucsc.edu/goldenPath/help/wiggle.html
and http://genomewiki.ucsc.edu/index.php/Using_hgWiggle_without_a_database
for a discussion of how to use the wig.gz and .wib files for
interaction with the GC percent data values.

hg19.chromAlias.txt - sequence name alias file, one line
for each sequence name. First column is sequence name followed by
tab separated alias names.
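
Given that layout, a lookup table from any alias back to the UCSC name can
be built as in the Python sketch below. The function name build_alias_map is
hypothetical, the sample line is illustrative (the column order of the real
file is not assumed), and skipping lines that start with "#" is an
assumption about possible header lines.

```python
def build_alias_map(lines) -> dict:
    """Map every alias (and the UCSC name itself) to the UCSC name."""
    alias_to_ucsc = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # assumed: blank lines and "#" headers carry no data
        ucsc_name, *aliases = line.split("\t")
        alias_to_ucsc[ucsc_name] = ucsc_name
        for alias in aliases:
            if alias:  # tolerate empty columns
                alias_to_ucsc[alias] = ucsc_name
    return alias_to_ucsc


# Illustrative line using the chr1 names mentioned in this document.
example = ["chr1\t1\tNC_000001.10\n"]
aliases = build_alias_map(example)
print(aliases["NC_000001.10"], aliases["1"])

# With the real file:
# with open("hg19.chromAlias.txt") as handle:
#     aliases = build_alias_map(handle)
```

For converting whole files between naming schemes, the chromToUcsc tool
mentioned in the section "Sequence names" is the supported route.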

------------------------------------------------------------------
How to download
^^^^^^^^^^^^^^^

If you plan to download a large file or multiple files from this
directory, we recommend that you use ftp rather than downloading the
files via our website. To do so, ftp to hgdownload.soe.ucsc.edu
[username: anonymous, password: your email address], then cd to the
directory goldenPath/hg19/bigZips. To download multiple files, use
the "mget" command:

mget <filename1> <filename2> ...
- or -
mget -a (to download all the files in the directory)

Alternative methods to ftp access:

Using an rsync command to download the entire directory:
rsync -avzP rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ .
For a single file, e.g. chromFa.tar.gz:
rsync -avzP rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz .

Or with wget, all files:
wget --timestamping 'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/*'
With wget, a single file:
wget --timestamping 'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz' -O chromFa.tar.gz

To unpack the *.tar.gz files:
tar xvzf <file>.tar.gz
To uncompress the fa.gz files:
gunzip <file>.fa.gz
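
The fa.gz files can also be read without uncompressing them to disk first.
The Python sketch below streams a gzipped FASTA file and reports the length
of each sequence; the function names fasta_lengths and fasta_gz_lengths are
assumptions made for this example, and only the standard gzip module is
used.

```python
import gzip


def fasta_lengths(handle):
    """Yield (sequence name, length) pairs from a FASTA text stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            header = line[1:].split()
            # Keep only the first word of the header as the name.
            name, length = (header[0] if header else ""), 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def fasta_gz_lengths(path: str):
    """Stream a gzipped FASTA file and yield (name, length) pairs."""
    with gzip.open(path, "rt") as handle:
        yield from fasta_lengths(handle)


# Example use after downloading the file:
# for name, length in fasta_gz_lengths("hg19.fa.gz"):
#     print(name, length)
```

Only one line is held in memory at a time, so this approach works for the
whole-genome file as well as for the per-chromosome files.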

All the files in this directory are freely available for public use.

      Name                       Last modified      Size  Description
      Parent Directory                                -   
      analysisSet/               2020-03-13 17:39    -   
      chromAgp.tar.gz            2009-03-20 09:02  538K  
      chromFa.tar.gz             2009-03-20 09:21  905M  
      chromFaMasked.tar.gz       2009-03-20 09:30  477M  
      chromOut.tar.gz            2009-03-20 09:03  163M  
      chromTrf.tar.gz            2009-03-20 09:30  7.6M  
      est.fa.gz                  2019-10-14 15:08  1.5G  
      est.fa.gz.md5              2019-10-14 15:08   44   
      genes/                     2024-07-31 12:33    -   
      hg19.2bit                  2009-03-08 15:29  778M  
      hg19.agp.gz                2009-05-06 15:22  532K  
      hg19.chrom.sizes           2009-03-08 14:56  1.9K  
      hg19.chromAlias.bb         2023-02-23 12:59   38K  
      hg19.chromAlias.txt        2023-02-22 11:47  4.8K  
      hg19.fa.align.gz           2009-03-08 22:08  2.2G  
      hg19.fa.gz                 2018-08-21 12:56  905M  
      hg19.fa.masked.gz          2018-09-12 10:33  477M  
      hg19.fa.out.gz             2009-03-08 21:55  163M  
      hg19.gc5Base.wib           2019-01-17 14:49  571M  
      hg19.gc5Base.wig.gz        2019-01-17 14:49   11M  
      hg19.gc5Base.wigVarStep.gz 2018-09-28 15:21  1.5G  
      hg19.trf.bed.gz            2009-03-08 15:00  7.6M  
      initial/                   2024-02-29 14:20    -   
      latest/                    2020-03-25 13:33    -   
      md5sum.txt                 2019-01-17 15:55  967   
      mrna.fa.gz                 2019-10-14 14:50  370M  
      mrna.fa.gz.md5             2019-10-14 14:50   45   
      p13.plusMT/                2024-07-23 16:43    -   
      refMrna.fa.gz              2019-10-14 15:08   80M  
      refMrna.fa.gz.md5          2019-10-14 15:08   48   
      upstream1000.fa.gz         2019-10-14 15:09  9.7M  
      upstream1000.fa.gz.md5     2019-10-14 15:09   53   
      upstream2000.fa.gz         2019-10-14 15:10   18M  
      upstream2000.fa.gz.md5     2019-10-14 15:10   53   
      upstream5000.fa.gz         2019-10-14 15:10   47M  
      upstream5000.fa.gz.md5     2019-10-14 15:10   53   
      xenoMrna.fa.gz             2019-10-14 15:00  6.4G  
      xenoMrna.fa.gz.md5         2019-10-14 15:00   49   
      xenoRefMrna.fa.gz          2019-10-14 15:08  250M  
      xenoRefMrna.fa.gz.md5      2019-10-14 15:08   52