Overview
^^^^^^^^

This directory and the subdirectory "initial/" contain the genome sequence files
from the original release in 2009 by NCBI (GCA_000001405.1) and some related
files. This original version and all subsequent changes are called "hg19" at
UCSC. All coordinates from the initial assembly will always be valid on the
"hg19" UCSC Genome Browser, as no changes were made to existing sequences.

In 2020 we added a few sequences: new sequences from GRC patch
release GRCh37.p13 (GCA_000001405.14) plus the revised Cambridge Reference
Sequence (rCRS) mitochondrial sequence. These can be found in the subdirectory
"p13.plusMT/" or its alias "latest/".  See the section "Patches" below.  Most
users looking at this text are looking for the file "latest/hg19.fa.gz". 

There is one exception: if you need a file for a genome aligner, like BWA,
bowtie2, hisat2 or similar, please read the section "Analysis Set" below and
look at the directory "analysisSet/".

The subdirectory "genes/" contains select gene transcript sets in GFF format.  

GRCh37 was produced and is updated by the Genome Reference Consortium:
	https://www.ncbi.nlm.nih.gov/grc

Differences from the NCBI files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two main differences compared to the NCBI files:

- the mitochondrial genome: since the release of the UCSC hg19
assembly, the Homo sapiens mitochondrion sequence (represented as "chrM" in the
Genome Browser) has been replaced in GenBank with the record NC_012920, the
revised Cambridge Reference Sequence (rCRS).  We have not replaced the original
sequence, NC_001807, as chrM in the hg19 Genome Browser.  However, files in the
subdirectory p13.plusMT include NC_012920 as "chrMT", in addition to the original
"chrM".

- also, the FASTA files of NCBI's GCA_000001405.1 distributed at
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/
have different sequence identifiers ("NC_000001.10" for NCBI instead of "chr1"
for UCSC) and the repeatmasking, expressed by lowercasing letters, was done
with different RepeatMasker settings.

Please also read the notes on our hg19 overview page at:
   http://genome.ucsc.edu/cgi-bin/hgGateway?db=hg19
The page explains the naming scheme of unplaced contigs and haplotypes, 
e.g. HSCHR6_MHC_APD_CTG1 = GL000250.1 => "chr6_apd_hap1"
and the placement of the pseudo-autosomal (PAR) regions on chrX and chrY.

Analysis set
^^^^^^^^^^^^

The GRCh37/hg19 patch 13 assembly contains not only the chromosome
sequences but also a mitochondrial genome, unplaced sequences, alternate
haplotypes and fix patches. Some of these sequences can confuse modern aligners.

The subdirectory analysisSet/ contains files with optimized versions of the
genome for these aligners or similar high-throughput analysis programs. The
README.txt file in that directory provides more details.

Patches to hg19
^^^^^^^^^^^^^^^

The Genome Reference Consortium has been adding (short) sequences
since the initial release.  We added these patches in 2020 but keep the
updated releases in separate directories:

- The initial/ subdirectory contains files for the initial release of GRCh37,
without any patch release sequences.

- The p13.plusMT/ subdirectory contains files for GRCh37.p13 (patch release 13)
plus the rCRS mitochondrion sequence (NC_012920) as "chrMT".
GRC patch releases do not change any previously existing sequences; they
simply add new sequences for fix patches or alternate haplotypes that
correspond to specific regions of the main chromosome sequences.
The Genome Browser displays this expanded set of assembly sequences.

- The latest/ subdirectory contains files that do not include version indicators
in their names, but are symbolic links to files in the most recent version
subdirectory, i.e. p13.plusMT.

- Data files in the current directory are the same as files in the initial/
subdirectory, i.e. they are from the initial GRCh37 release and do not
include the patch sequences that are included in the Genome Browser.

Sequence names
^^^^^^^^^^^^^^

For historical reasons, what UCSC calls "chr1", Ensembl calls "1" and NCBI
calls "NC_000001.10". The sequences are identical though. To map between UCSC,
Ensembl and NCBI names, use our table "chromAlias", available via our Table
Browser or as a file:
https://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/chromAlias.txt.gz
We also provide a Python command line tool to convert sequence names in the
most common genomics file formats:
http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/chromToUcsc
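
As a rough illustration of such a conversion, the shell function below rewrites
the first column of a tab-separated file (e.g. BED) to UCSC names. It is only a
sketch, not a replacement for chromToUcsc: the function name to_ucsc_names is
made up for this example, and it assumes an alias file laid out like
hg19.chromAlias.txt (UCSC name in the first column, aliases in the following
tab-separated columns).

```shell
# to_ucsc_names ALIAS_FILE < input.tsv > output.tsv
# (hypothetical helper, not part of any UCSC tool)
# Rewrites column 1 of tab-separated input to the UCSC sequence name.
# Names that have no entry in the alias file pass through unchanged.
to_ucsc_names() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR {                      # first input: the alias table
            if ($0 ~ /^#/) next          # skip comment or header lines
            for (i = 2; i <= NF; i++)
                if ($i != "") ucsc[$i] = $1
            next
        }
        ($1 in ucsc) { $1 = ucsc[$1] }   # second input: the data on stdin
        { print }
    ' "$1" -
}
```

Example use:
    to_ucsc_names hg19.chromAlias.txt < input.bed > output.bed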

During genome assembly, reads are assembled into "contigs" (a few kbp long),
which are then joined into longer "scaffolds" of a few hundred kbp. These are
finally placed, often manually e.g. with FISH assays, onto chromosomes.
The .agp file below describes how these were placed onto chromosomes.

The alternate haplotype (_hap) sequences were released with the initial assembly;
subsequent patches introduced fix sequences (_fix) and novel sequences (_alt).
For more information on patches see: http://genome.ucsc.edu/blog/patches/
The following list represents all the types of sequences found in the hg19 genome:

Chromosomes:
- made from scaffolds placed onto chromosome locations, 95% of the genome file
- format: chr{chromosome number or name}
- e.g. chr1 or chrX, chrM for the (non-rCRS) mitochondrial genome.

Unlocalized scaffolds:
- a sequence found in an assembly that is associated with a specific
chromosome but cannot be ordered or oriented on that chromosome.
- format: chr{chromosome number or name}_{sequence_accession}_random
- e.g. chr17_gl000205_random

Unplaced scaffolds:
- a sequence found in an assembly that is not associated with any chromosome.
- format: chrUn_{sequence_accession}
- e.g. chrUn_gl000223

Alternative haplotypes in initial GRCh37 release:
- a sequence that provides an alternate representation of a locus found
  in the primary assembly. These sequences were present in the initial hg19
  assembly release. They do not represent complete chromosome sequences. 
  There are 9 present in the initial hg19 assembly.
  For more information on the 7 chr6 alternate haplotypes see the MHC Haplotype
  Project website: http://www.ucl.ac.uk/cancer/medical-genomics/mhc
- format: chr{chromosome number or name}_{haplotype_name}_hap{haplotype_number_in_chromosome}
- e.g. chr6_cox_hap2

Alternate loci scaffolds from patch releases:
- a scaffold that provides an alternate representation of a locus found
  in the primary assembly. These sequences do not represent a complete
  chromosome sequence although there is no hard limit on the size of the
  alternate locus; currently most are less than 1 Mb. In the context of 
  hg19, all these sequences have been added through patch releases.
- these sequences are not part of the files in the initial/ directory
- format: chr{chromosome number or name}_{sequence_accession}_alt
- e.g. chr12_gl877876_alt

Fix loci scaffolds:
- a patch that corrects sequence or reduces an assembly gap in a given
  major release. FIX patch sequences are meant to be incorporated into
  the primary or existing alt-loci assembly units at the next major
  release.
- these sequences are not part of the files in the initial/ directory
- format: chr{chromosome number or name}_{sequence_accession}_fix
- e.g. chrX_kb021648_fix
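
The naming patterns above can be turned into a small classifier. This is a
sketch based only on the name formats listed in this section; the function name
classify_seq and the category labels it prints are chosen for this example.

```shell
# classify_seq NAME
# (hypothetical helper) Prints the sequence type implied by an hg19
# sequence name, following the name formats listed in this section.
classify_seq() {
    case "$1" in
        chrUn_*)         echo "unplaced" ;;
        chr*_random)     echo "unlocalized" ;;
        chr*_hap[0-9]*)  echo "haplotype" ;;
        chr*_alt)        echo "alt" ;;
        chr*_fix)        echo "fix" ;;
        chr*_*)          echo "unknown" ;;     # underscore but no known suffix
        chr?*)           echo "chromosome" ;;  # chr1, chrX, chrM, chrMT
        *)               echo "unknown" ;;
    esac
}
```

Example use, counting sequence types in hg19.chrom.sizes:
    cut -f1 hg19.chrom.sizes | while read -r n; do classify_seq "$n"; done | sort | uniq -c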

Files
^^^^^

Files included in this directory are from the initial 2009 release of the genome;
files for the most current patch version of the genome are in the "latest/" subdirectory:

hg19.fa.gz - "Soft-masked" assembly sequence in one file.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case. Again, the most current version of this
    file is latest/hg19.fa.gz.
    For many types of analysis that include sequence comparisons,
    the files in the directory analysisSet are recommended, as these
    include fewer duplicates.

hg19.fa.masked.gz - based on hg19.fa.gz, "hard-masked" assembly sequence in 
    one file. Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.

hg19.fa.out.gz - RepeatMasker .out file.  RepeatMasker was run with the
    -s (sensitive) setting.
    Jan 29 2009 (open-3-2-7) version of RepeatMasker
    RepBase library: RELEASE 20090120

hg19.fa.align.gz - RepeatMasker .align file.  RepeatMasker was run with the
    -s (sensitive) setting.
    Jan 29 2009 (open-3-2-7) version of RepeatMasker
    RepBase library: RELEASE 20090120

hg19.trf.bed.gz - Tandem Repeats Finder locations, filtered to keep repeats
    with period less than or equal to 12, and translated into UCSC's BED
    format.

hg19.2bit - contains the complete human/hg19/GRCh37 genome sequence
    in the 2bit file format.  Repeats from RepeatMasker and Tandem Repeats
    Finder (with period of 12 or less) are shown in lower case; non-repeating
    sequence is shown in upper case.  The utility program, twoBitToFa (available
    from the kent src tree), can be used to extract .fa file(s) from
    this file.  A pre-compiled version of the command line tool can be
    found at:
        http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/
    See also:
        http://genome.ucsc.edu/admin/git.html
        https://genome-source.gi.ucsc.edu/gitlist/kent.git/blob/master/src/userApps/README

hg19.agp.gz - Description of how the assembly was generated from
    fragments.

chromAgp.tar.gz - Description of how the assembly was generated from
    fragments, unpacking to one file per chromosome.

chromFa.tar.gz - The assembly sequence in one file per chromosome.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.

chromFaMasked.tar.gz - The assembly sequence in one file per chromosome.
    Repeats are masked by capital Ns; non-repeating sequence is shown in
    upper case.

chromOut.tar.gz - RepeatMasker .out files (one file per chromosome).
    RepeatMasker was run with the -s (sensitive) setting.
    Using: Jan 29 2009 (open-3-2-7) version of RepeatMasker and
    RELEASE 20090120 of library RepeatMaskerLib.embl

chromTrf.tar.gz - Tandem Repeats Finder locations, filtered to keep repeats
    with period less than or equal to 12, and translated into UCSC's BED 5+
    format (one file per chromosome).

est.fa.gz - Human ESTs in GenBank. This sequence data is updated 
    regularly via automatic GenBank updates.

md5sum.txt - checksums of files in this directory

mrna.fa.gz - Human mRNA from GenBank. This sequence data is updated
    regularly via automatic GenBank updates.

refMrna.fa.gz - RefSeq mRNA from the same species as the genome.
    This sequence data is updated regularly via automatic GenBank
    updates.

upstream1000.fa.gz - Sequences 1000 bases upstream of annotated
    transcription starts of RefSeq genes with annotated 5' UTRs.
    This file is updated weekly so it might be slightly out of sync with
    the RefSeq data which is updated daily for most assemblies.

upstream2000.fa.gz - Same as upstream1000, but 2000 bases.

upstream5000.fa.gz - Same as upstream1000, but 5000 bases.

xenoMrna.fa.gz - GenBank mRNAs from species other than that of 
    the genome. 

hg19.chrom.sizes - Two-column tab-separated text file containing assembly
    sequence names and sizes.

hg19.gc5Base.wigVarStep.gz - ASCII wiggle variable-step data values used
    to construct the GC Percent track.

hg19.gc5Base.wig.gz - wiggle database table for the GC Percent track.
    This is an older standard alternative to the current bigWig format of
    the track, sometimes useful for analysis.

hg19.gc5Base.wib - binary data to correspond with the gc5Base.wig file.
    See also:
        http://genome.ucsc.edu/goldenPath/help/wiggle.html
        http://genomewiki.ucsc.edu/index.php/Using_hgWiggle_without_a_database
    for a discussion of how to use the wig.gz and .wib files for
    interaction with the GC percent data values.

hg19.chromAlias.txt - sequence name alias file, one line
    for each sequence name.  First column is sequence name followed by
    tab separated alias names.
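
Two of the file descriptions above can be illustrated with short shell filters.
Both are sketches with made-up function names. hard_mask shows how a soft-masked
FASTA file relates to its hard-masked counterpart (lowercase bases replaced by
N); it is not necessarily how hg19.fa.masked.gz was produced. primary_size sums
the sizes of chr1-22, chrX and chrY from input in the hg19.chrom.sizes format.

```shell
# hard_mask < soft.fa > hard.fa   (hypothetical helper)
# Header lines pass through untouched; on sequence lines every
# lowercase (repeat-masked) base is replaced by a capital N.
hard_mask() {
    LC_ALL=C awk '
        /^>/ { print; next }
        { gsub(/[a-z]/, "N"); print }
    '
}

# primary_size < hg19.chrom.sizes   (hypothetical helper)
# Adds up column 2 for chr1-22, chrX and chrY only, skipping chrM,
# haplotypes, patches and unplaced or unlocalized sequences.
primary_size() {
    awk -F '\t' '
        $1 ~ /^chr([0-9]+|X|Y)$/ { total += $2 }
        END { printf "%.0f\n", total }
    '
}
```

Example use:
    gzip -dc hg19.fa.gz | hard_mask | gzip -c > my.masked.fa.gz
    primary_size < hg19.chrom.sizes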

------------------------------------------------------------------
How to download
^^^^^^^^^^^^^^^

If you plan to download a large file or multiple files from this
directory, we recommend that you use ftp rather than downloading the
files via our website. To do so, ftp to hgdownload.soe.ucsc.edu
[username: anonymous, password: your email address], then cd to the
directory goldenPath/hg19/bigZips. To download multiple files, use
the "mget" command:

    mget <filename1> <filename2> ...
    - or -
    mget -a (to download all the files in the directory)

Alternatives to FTP access:

Using an rsync command to download the entire directory:
    rsync -avzP rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ .
For a single file, e.g. chromFa.tar.gz:
    rsync -avzP \
        rsync://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz .

Or with wget, all files:
    wget --timestamping \
        'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/*'
With wget, a single file:
    wget --timestamping \
        'ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz' \
        -O chromFa.tar.gz
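
After a download it is worth comparing the file against md5sum.txt. The
function below is a sketch: verify_download is a name invented here, it relies
on GNU md5sum, and it assumes md5sum.txt uses the usual "checksum  filename"
layout with plain file names (no directory prefix).

```shell
# verify_download FILE [CHECKSUM_LIST]   (hypothetical helper)
# Picks the line for FILE out of the checksum list (default: md5sum.txt
# in the current directory) and lets md5sum check it. Returns non-zero
# if the checksum differs or FILE is not listed.
verify_download() {
    file=$1
    list=${2:-md5sum.txt}
    awk -v f="$file" '
        { n = $2; sub(/^\*/, "", n) }   # drop the binary-mode marker, if any
        n == f                          # print only the matching line
    ' "$list" | md5sum -c -
}
```

Example use:
    verify_download chromFa.tar.gz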

To unpack the *.tar.gz files:
    tar xvzf <file>.tar.gz
To uncompress the fa.gz files:
    gunzip <file>.fa.gz
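
Large .fa.gz files do not have to be uncompressed on disk to be inspected. As a
sketch (fasta_names is a name invented for this example), the function below
streams a gzipped FASTA file and prints the sequence names from its header
lines.

```shell
# fasta_names FILE.fa.gz   (hypothetical helper)
# Decompresses to a pipe and prints one sequence name per line: the
# first word after ">" on each FASTA header line.
fasta_names() {
    gzip -dc "$1" | awk '/^>/ { sub(/^>/, ""); print $1 }'
}
```

Example use:
    fasta_names latest/hg19.fa.gz | wc -l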

All the files in this directory are freely available for public use.
Name                         Last modified      Size
analysisSet/                 2020-03-13 17:39   -
chromAgp.tar.gz              2009-03-20 09:02   538K
chromFa.tar.gz               2009-03-20 09:21   905M
chromFaMasked.tar.gz         2009-03-20 09:30   477M
chromOut.tar.gz              2009-03-20 09:03   163M
chromTrf.tar.gz              2009-03-20 09:30   7.6M
est.fa.gz                    2019-10-14 15:08   1.5G
est.fa.gz.md5                2019-10-14 15:08   44
genes/                       2024-07-31 12:33   -
hg19.2bit                    2009-03-08 15:29   778M
hg19.agp.gz                  2009-05-06 15:22   532K
hg19.chrom.sizes             2009-03-08 14:56   1.9K
hg19.chromAlias.bb           2023-02-23 12:59   38K
hg19.chromAlias.txt          2023-02-22 11:47   4.8K
hg19.fa.align.gz             2009-03-08 22:08   2.2G
hg19.fa.gz                   2018-08-21 12:56   905M
hg19.fa.masked.gz            2018-09-12 10:33   477M
hg19.fa.out.gz               2009-03-08 21:55   163M
hg19.gc5Base.wib             2019-01-17 14:49   571M
hg19.gc5Base.wig.gz          2019-01-17 14:49   11M
hg19.gc5Base.wigVarStep.gz   2018-09-28 15:21   1.5G
hg19.trf.bed.gz              2009-03-08 15:00   7.6M
initial/                     2024-02-29 14:20   -
latest/                      2020-03-25 13:33   -
md5sum.txt                   2019-01-17 15:55   967
mrna.fa.gz                   2019-10-14 14:50   370M
mrna.fa.gz.md5               2019-10-14 14:50   45
p13.plusMT/                  2024-07-23 16:43   -
refMrna.fa.gz                2019-10-14 15:08   80M
refMrna.fa.gz.md5            2019-10-14 15:08   48
upstream1000.fa.gz           2019-10-14 15:09   9.7M
upstream1000.fa.gz.md5       2019-10-14 15:09   53
upstream2000.fa.gz           2019-10-14 15:10   18M
upstream2000.fa.gz.md5       2019-10-14 15:10   53
upstream5000.fa.gz           2019-10-14 15:10   47M
upstream5000.fa.gz.md5       2019-10-14 15:10   53
xenoMrna.fa.gz               2019-10-14 15:00   6.4G
xenoMrna.fa.gz.md5           2019-10-14 15:00   49
xenoRefMrna.fa.gz            2019-10-14 15:08   250M
xenoRefMrna.fa.gz.md5        2019-10-14 15:08   52