Introduction - The Human Protein Atlas

INTRODUCTION

The Human Protein Atlas portal is a publicly available database with millions of high-resolution images showing the spatial distribution of proteins in 44 different normal human tissues and 20 different cancer types, as well as 46 different human cell lines. The data is released together with application-specific validation performed for each antibody, including immunohistochemistry, Western blot analysis and, for a large fraction, a protein array assay and immunofluorescence-based confocal microscopy. The database has been developed in a gene-centric manner with the inclusion of all human genes predicted from genome efforts. Search functionalities allow for complex queries regarding protein expression profiles, protein classes and chromosome location.

Uhlen et al (2015). Tissue-based map of the human proteome. Science.
DOI: 10.1126/science.1260419

Uhlen et al (2010). Towards a knowledge-based Human Protein Atlas. Nat Biotechnol. 28(12):1248-50.
PubMed: 21139605 DOI: 10.1038/nbt1210-1248

Berglund et al (2008). A gene-centric Human Protein Atlas for expression profiles based on antibodies. Mol Cell Proteomics. 7(10):2019-27.
PubMed: 18669619 DOI: 10.1074/mcp.R800013-MCP200

Uhlen et al (2005). A human protein atlas for normal and cancer tissues based on antibody proteomics. Mol Cell Proteomics. 4(12):1920-1932.
PubMed: 16127175 DOI: 10.1074/mcp.M500279-MCP200

Ponten et al (2008). The Human Protein Atlas--a tool for pathology. J Pathology. 216(4):387-93.
PubMed: 18853439 DOI: 10.1002/path.2440

The Human Protein Atlas

The Human Protein Atlas contains information for a large majority of all human protein-coding genes regarding the expression and localization of the corresponding proteins based on both RNA and protein data. The atlas consists of four subparts: normal tissue, cancer, subcellular and cell lines, with each subpart containing images and data based on antibody-based proteomics and transcriptomics. The tissue atlas contains information on 44 different human tissues and organs with annotation data for altogether 83 different cell types. The transcriptomics data provide quantitative data on gene expression levels across the tissues and organs, while the antibody-based protein profiles show the spatial distribution on a single cell level for the corresponding protein in the various substructures and cell types of the tissues. Version 13 of the Human Protein Atlas contains RNA data for 99.9% and protein data for 83% of the predicted human genes and includes more than 11 million images with primary data from immunohistochemistry and immunofluorescence.

The normal tissue atlas

The normal tissue atlas contains information and images regarding the expression profiles of human genes on both the mRNA and protein level. The protein expression data is derived from annotation of immunohistochemical staining of cell populations in all major human tissues and organs, including the brain, liver, kidney, lymphoid tissues, heart, lung, skin, gastrointestinal tract, pancreas, endocrine tissues and the reproductive organs. In total, 44 different human tissues are included, with annotation data for altogether 83 different cell types. The antibody-based protein profiles are qualitative and describe the spatial distribution, cell type specificity and the rough relative abundance of proteins in these tissues, whereas the mRNA data provide quantitative data on gene expression levels. If data from two or more antibodies directed towards the same target protein exist, these expression profiles are manually curated to yield an "annotated protein expression". This procedure takes into account the expression profiles of all antibodies against the target and the mRNA data to derive a best estimate of "true" protein expression.

Example:

MYL7
Myosin, light chain 7, regulatory.

Selective cytoplasmic expression in heart myocytes at the protein level, highly tissue enriched in heart muscle at the mRNA level.

The cancer atlas

The cancer tissue atlas contains a multitude of human cancer specimens representing the 20 most common forms of cancer, including breast, colon, prostate, lung, urothelial, skin, endometrial and cervical cancer. Altogether 216 different cancer samples are used to generate protein expression profiles for all proteins using immunohistochemistry. The data is presented as pathology-based annotation of protein expression levels in tumor cells, along with the images underlying the annotation. This enables the identification of a potential protein signature for each given type of cancer and provides a starting point for further analyses of cancer type-specific proteins. Because the cancer atlas contains a large number of cancer samples, the available protein profiles are also well suited for identifying new potential cancer biomarkers.

Example:

KLK3
Kallikrein-related peptidase 3.

Selective cytoplasmic expression in prostate cancers. All other malignant tissues were negative.

The cell line atlas

The cell line atlas contains expression profiles from a diverse panel of human-derived cell lines on both the mRNA (n=44) and protein level (n=46). In addition, protein data is displayed for patient blood samples representing normal peripheral blood mononuclear cells (PBMC) and different types of leukemia and lymphoma; each antibody is tested on two samples of PBMC and ten samples representing AML, ALL, CML, and CLL. Protein expression has been assessed using immunohistochemistry (IHC), and profiles are currently available for 22% of all protein-coding genes based on 5436 well-validated antibodies. IHC staining positivity has been quantified using the automated image analysis software TMAx (Beecher Instruments, Sun Prairie, WI) (Strömberg et al. 2007). All underlying images of IHC stained cells and cell lines are displayed, along with transcript levels and relative IHC scores.

Example:

Emerin
EMD, LEMD5, STA.

Transcript and protein detected at same medium/high levels in almost all cell lines.

The subcellular atlas

Alongside the immunohistochemical pipeline generating the three sub-atlases above, antibodies are also used for confocal immunofluorescence analyses to generate a subcellular protein atlas. This sub-atlas contains high-resolution, multicolor images of immunofluorescently stained cells that reveal spatial expression patterns on the subcellular level. For each antibody, two suitable human cell lines are selected for the immunofluorescence analysis on the basis of RNA expression. A third human cell line, U-2 OS, is always included. The cells are stained in a standardized way where the antibody of interest is labeled green, the cytoskeleton is labeled red, the endoplasmic reticulum is labeled yellow, and nuclei are stained blue by DAPI. The images are manually annotated in terms of subcellular localization, staining intensity and staining characteristics.

Example:

Ezrin
EZR, VIL2.

Protein localized to the plasma membrane in both human and mouse cells.

Background and history

The Human Protein Atlas project was initiated in 2003 with funding from the Knut and Alice Wallenberg Foundation. Primarily based in Sweden, the HPA project involves the joint efforts of the Royal Institute of Technology in Stockholm, Uppsala University, Uppsala Akademiska University Hospital and, more recently, Science for Life Laboratory based in both Uppsala and Stockholm. International nodes of the project are based in Mumbai, India, and other formal collaborations are with groups in South Korea, Japan, China, Germany, France, Switzerland, USA, Canada, Denmark, Finland, The Netherlands, Spain and Italy.

The first version of the HPA website was launched in 2005 and contained protein expression data based on approximately 700 antibodies. Since then, each new release has added more data as well as new functionality and features to the website. Some important changes were the inclusion of cell-line data in version 2.0, and the inclusion of confocal images showing subcellular localizations in version 3.0. Version 3.0 also included a new search function that allowed for building queries. In version 4.0, the overall database structure was shifted from an antibody-centric structure to a gene-centric structure in order to include information on all genes predicted by Ensembl. The next major restructuring came in 2010 with version 7.0, when the concept of annotated protein expression for paired antibodies (two independent antibodies directed against different, non-overlapping epitopes on the same protein) was introduced. In 2013, version 12 of the protein atlas database was complemented with transcriptomics profiles from 27 normal tissues, and the format with four subatlases was introduced.

Strategy for high-throughput proteomics

The high-throughput approach to human proteomics rests on two main pillars: the streamlined production of antibodies and the use of tissue microarray (TMA) technology for immunohistochemistry. The antibody production process begins with a bioinformatics analysis of the protein-coding part of the genome. For every protein, the amino acid sequence is compared to all other putative protein-coding genes to identify a stretch of 50-150 amino acids that has as low homology as possible with respect to all other proteins. Transmembrane regions including hydrophobic and less immunogenic regions are avoided. These sequences are then cloned from cDNA libraries using specifically designed primers and transformed into E. coli bacteria that produce the corresponding peptide chain, here called a PrEST (Protein Epitope Signature Tag). The PrEST is used for various applications, including immunization to produce antibodies and affinity purification of the polyclonal antisera. Numerous quality assurance and validation steps are included throughout this production chain, and all generated antibodies undergo a validation regime and basic characterization before being approved for profiling on tissue microarrays.
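
The region-selection step described above can be illustrated with a toy sketch. It slides a fixed-length window along a target sequence and scores each window by the fraction of its k-mers that also occur in any other protein, then keeps the lowest-scoring window. The k-mer overlap score, the function names and the short example sequences are assumptions made for illustration only; the text does not state which similarity measure or software the project uses, and a real run would use windows of 50-150 residues and also mask transmembrane regions.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pick_low_homology_window(target, others, window, k=3):
    """Find the window of `target` that shares the fewest k-mers with `others`.

    Returns (start_index, score), where score is the fraction of the
    window's k-mers that occur in at least one other sequence.
    This k-mer score is a stand-in (assumption) for a real homology search.
    """
    if window < k or window > len(target):
        raise ValueError("window must be between k and len(target)")

    # Pool the k-mers of every other protein into one background set.
    background = set()
    for seq in others:
        background |= kmers(seq, k)

    best_start, best_score = None, None
    for start in range(len(target) - window + 1):
        window_kmers = kmers(target[start:start + window], k)
        shared = sum(1 for km in window_kmers if km in background)
        score = shared / len(window_kmers)
        # Strict comparison keeps the earliest window among ties.
        if best_score is None or score < best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical toy sequences, far shorter than real proteins.
target = "MKTLLVAGGWQPYHNDE"
others = ["MKTLLVAGG"]
start, score = pick_low_homology_window(target, others, window=8)
print(start, score, target[start:start + 8])
```

In this toy case the N-terminal part of the target is shared with the other sequence, so the selected window falls in the unique C-terminal part.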

The tissue microarray technology enables high-throughput immunohistochemistry on multiple samples within a single experiment. The tissue microarrays used in the Human Protein Atlas project typically consist of 72 different tissue samples, each one punched out as a 1 mm diameter core from formalin-fixed paraffin-embedded tissue blocks. These sampled cores are then arranged in a matrix on a single receiver paraffin block. The resulting receiver block (or TMA) is subsequently sectioned into 200-250 sections that are used for separate immunohistochemical staining experiments. Using this approach, many data points are generated under similar conditions, reducing intra-experimental variation and saving both time and cost compared to staining all tissues separately. The Human Protein Atlas project routinely generates protein expression profiles for each antibody by staining 9 different standardized TMA sets containing samples from 44 different normal human tissues, 20 different cancer types, 46 different human cell lines and 6 hematopoietic cell types from patients.
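
As a small illustration of the matrix arrangement, the sketch below assigns 72 cores to grid positions on a receiver block, filling row by row. The 9-column grid, the fill order and the core identifiers are assumptions; the text only states that the cores are arranged in a matrix.

```python
def tma_layout(sample_ids, n_cols):
    """Map each sample id to a (row, column) position, filling row by row."""
    return {sid: divmod(i, n_cols) for i, sid in enumerate(sample_ids)}


# Hypothetical identifiers for the 72 cores of one TMA.
samples = [f"core_{i:02d}" for i in range(72)]
layout = tma_layout(samples, n_cols=9)  # 9 columns is an assumed grid width

n_rows = max(row for row, _ in layout.values()) + 1
print(n_rows, layout["core_00"], layout["core_71"])
```

Keeping such a position map alongside each block is what allows every section cut from the TMA to be traced back to its source tissue after staining.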

The stained TMA slides are then scanned using a digital slide scanner, and all tissue cores are separated into individual image files that are uploaded to internal annotation software and annotated by pathologists. The stainings are scored with respect to the intensity of immunoreactivity, the fraction of immunostained cells and the cellular localization of immunoreactivity. The annotation output is then reviewed and compared with available data from RNA-Seq, literature, sibling antibodies (that are directed towards the same protein), and other sources of protein information before it is finally approved for publication on the Human Protein Atlas website.
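
The three scoring dimensions named above could be captured in a record like the one sketched here. The field names, the four intensity labels, the 0-1 fraction scale and the example antibody identifier are all hypothetical; the text does not describe the schema of the internal annotation software.

```python
from dataclasses import dataclass

# Assumed intensity labels; the source does not list the actual categories.
INTENSITIES = ("negative", "weak", "moderate", "strong")


@dataclass(frozen=True)
class CoreAnnotation:
    """One pathologist annotation of a single stained tissue core."""
    antibody_id: str
    tissue: str
    intensity: str           # intensity of immunoreactivity
    fraction_stained: float  # fraction of immunostained cells, 0.0-1.0
    localization: str        # cellular localization of immunoreactivity

    def __post_init__(self):
        # Reject values outside the assumed scoring scales.
        if self.intensity not in INTENSITIES:
            raise ValueError(f"unknown intensity: {self.intensity!r}")
        if not 0.0 <= self.fraction_stained <= 1.0:
            raise ValueError("fraction_stained must be between 0 and 1")


# Hypothetical example record.
note = CoreAnnotation("AB-0001", "heart muscle", "strong", 0.9, "cytoplasmic")
print(note)
```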

The Human Protein Atlas project is funded by the Knut and Alice Wallenberg Foundation.