Memorandum by Graham Cameron and Michael
Ashburner, writing as individuals
1. The activities of the European Bioinformatics
Institute (http://www.ebi.ac.uk) are marginal to the specific
questions asked by this Inquiry. However, because the EBI is the European
source of all nucleic acid sequence data, both human and otherwise, the
Inquiry needs to understand its role, and that of its sister
institutes in Japan and the USA, in distributing human genetic
sequence data.
2. GENETIC SEQUENCE DATA: THE ROLE OF THE DATA LIBRARIES
2.1 As the Committee will be aware, techniques
for sequencing DNA became available in the late 1970s and have
become increasingly facile since then. In the very early 1980s
it became apparent to researchers at the EMBL, Heidelberg that
DNA sequence data required archiving in a computer readable form.
These data are, on the one hand, not particularly suited to conventional
scientific publication and, on the other hand, can best be analysed
by computer programs. It seemed, therefore, very sensible to build
a database in which all available sequence data could be made
freely available to all, deposition of sequences in this database
obviating the need for them to be published in scientific papers.
This was the genesis of the EMBL Nucleotide Sequence Data Library,
first released in June 1982 with 585,433 base pairs of sequence
(568 database entries); in the last week of September 2000 the
Data Library (now EMBL-Bank) passed 10 billion base pairs (>8,766,800
entries).
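The scale of this growth can be illustrated with a little arithmetic. The sketch below takes the two figures quoted above and, assuming (purely for illustration) that growth was steadily exponential between June 1982 and September 2000, estimates how often the Data Library doubled in size. The function and constant names are ours and are not part of any EBI software.

```python
import math

# Figures quoted in paragraph 2.1.
FIRST_RELEASE_BP = 585_433         # base pairs in the June 1982 release
LATER_BP = 10_000_000_000          # "10 billion base pairs", late September 2000

# Whole months from June 1982 to September 2000.
MONTHS_ELAPSED = (2000 - 1982) * 12 + (9 - 6)


def doubling_time_months(start_bp: int, end_bp: int, months: int) -> float:
    """Return the doubling time in months.

    Assumption (ours, not the memorandum's): growth was steadily
    exponential over the whole period.
    """
    doublings = math.log2(end_bp / start_bp)
    return months / doublings


if __name__ == "__main__":
    estimate = doubling_time_months(FIRST_RELEASE_BP, LATER_BP, MONTHS_ELAPSED)
    print(f"Approximate doubling time: {estimate:.1f} months")
```

On these figures the Data Library has doubled roughly every 15 to 16 months; the true growth curve was of course not perfectly smooth.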
2.2 Soon after the foundation of the EMBL
Nucleotide Sequence Data Library a similar initiative was launched
by the US National Institutes of Health (leading to the foundation
of GenBank) and, in 1984, a third was launched in Japan, leading
to the foundation of the DNA Data Bank of Japan. Since the mid-1980s
these three data libraries have been very closely integrated.
Scientists may submit new data to any one of the three and these
data are then exchanged with the other two partners every day.
At present the International Nucleic Acid Sequence Data Library
includes 10,061,977,000 base-pairs of sequence, from >50,000
different organisms (from small viruses to human) submitted by
over 100,000 different scientists. Publication in the scientific
literature of work that references new DNA sequence data now
essentially requires that the sequences themselves have been
submitted to the Data Library.
2.3 All of the sequence data in the Data
Library are open and freely available to all without let or hindrance.
The data are neither secret nor copyrighted; some data
may well be covered by patent, since data appearing in the patent
literature are included in the Data Library.
2.4 All of the data from the "Human
Genome Project" are included in the Data Library. In the
UK, for example, there is, in effect, a direct pipe from the major
sequencing centre, the Sanger Centre, into the Data Library at
the EBI next door. At the time of writing, some 62 per cent of
the 10 billion base pairs is human sequence.
2.5 When submitting new sequence data to
the Data Library the scientists will "annotate" the
sequence in some way. At the very least this annotation must indicate
the source of the new sequence, for example the species. In the
majority of cases the annotation puts the sequence in its scientific
context. Sequences may well be derived from human patients (for
example there are over 40,000 different patient derived HIV sequences
in the Data Library). As providers of the Data Library neither
the EBI, nor its partners, have established standards for what
patient data may or may not be included in such submissions. Our
role is weakly analogous to that of the University Librarian: those
who submit the sequences (that is, the writers of the books) are
legally and ethically responsible for their content. Of course,
should it be brought to our attention that a submission includes
information that is unethical or libellous, then our duty would
be to inform the submitter and withdraw the sequence from the
Data Library. (In fact, this has never happened, though sequences
have been withdrawn on scientific grounds, eg that they were simply
wrong.)
2.6 Without the primary nucleic acid sequence
data library modern biological research would be impossible. The
data in the data library belong to the scientific community and
it is the determination of the EBI, and of its sister institutes,
that these data should continue to be available to all, without
constraint or restraint. We see no reason whatsoever for these
data to be subject to regulation, and every reason for the current
policies to continue. We point out that this policy of complete
openness can, and perhaps should, be exploited in the context
of public attitudes to human genetic data.
3. FUNDING THE DATA LIBRARIES
3.1 The three international institutes that,
collaboratively, collect and distribute the primary nucleic acid
sequence data are all publicly funded: the US GenBank project
by the US National Institutes of Health, the Japanese DDBJ by
the Ministry of Education, Science, Sports and Culture and the
European Bioinformatics Institute by the 16 member governments
of the European Molecular Biology Laboratory (in the case of the
UK, this is through the Medical Research Council budget).
4. ANNOTATED HUMAN GENETIC SEQUENCE DATA
4.1 The primary sequence data from the Human
Genome Project are barely annotated. This is typical of the information
in the primary "working draft" record of a 175,000 base
pair sequence from the Sanger Centre:
Feature /organism="Homo sapiens"
Feature /clone="RP11-575B7"
Feature /clone_lib="RPCI-11.2"
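To show how little such a record contains, the following minimal sketch reads the three lines exactly as printed above and extracts the name and value of each qualifier. It assumes the simplified layout shown in this memorandum (a `/name="value"` pair on each line); the names `RECORD`, `QUALIFIER` and `parse_qualifiers` are illustrative only and do not belong to any Data Library software.

```python
import re

# The three annotation lines as printed in paragraph 4.1.
RECORD = '''Feature /organism="Homo sapiens"
Feature /clone="RP11-575B7"
Feature /clone_lib="RPCI-11.2"'''

# Matches /name="value" pairs; assumes values contain no embedded quotes.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_qualifiers(text: str) -> dict[str, str]:
    """Return a mapping of qualifier name to value for the given record text."""
    return {name: value for name, value in QUALIFIER.findall(text)}


if __name__ == "__main__":
    for name, value in parse_qualifiers(RECORD).items():
        print(f"{name}: {value}")
```

The result holds only the species, the clone and the clone library: nothing about genes, proteins or neighbouring sequences.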
4.2 "Annotation" is the task of
putting this sequence into its biological context. In its best
form annotation will be the product of both computational and
expert human analysis. An "annotated" sequence will
have its context with its neighbouring sequences established and
will have been analysed with respect to regions which may (at
least) code for proteins and other features.
4.3 There are a few efforts world-wide to
automatically annotate the emerging human genome sequence. The
first and best established of these is called Ensembl, a joint
project between the EBI and the Sanger Centre at Hinxton, largely
funded by the Wellcome Trust.
4.4 Ensembl (http://www.ensembl.org) tracks
all of the primary sequence data from the Human Genome Project.
Viewing these data as a large jigsaw puzzle, Ensembl is developing
computer programs to assemble the individual sequences into a
larger whole and to analyse these for features of biological interest.
4.5 Ensembl is a part of a larger international
effort (see http://www.ensembl.org/genome/central/) to bring annotations
on the Human Genome Sequence to the public. Ensembl is, like the
nucleic acid sequence data library, open and public to all. Neither
the data nor the computer software are subject to any restriction.
4.6 Those developing and supporting Ensembl,
and similar projects in the USA, are confident that they can provide
information concerning the human genome that is at least as good as,
and probably better than, that being offered by commercial companies,
such as Celera, Incyte and DoubleTwist in the USA. We are also
convinced that such information must never become the property
of any single institution, be that institution public or private.
5. HUMAN MUTATION DATABASES
5.1 The EBI is heavily involved in an international
collaboration to make public data concerning the genetic basis
of human disease and variation. There are, internationally, nearly
100 different databases that include information concerning the
specific genetic basis of human disease. Typically, each database
is specific for one disease, eg the Haemophilia B Mutation Database
at Guy's Hospital. Each of these databases includes patient data,
since they represent the actual nucleic acid sequence of an individual
patient. For example, in the Haemophilia B Mutation Database it
is recorded that patient "UK 232" has a particular nucleotide
base pair change in his (or her) Haemophilia B gene. The way in
which patient anonymity is respected is, in these databases, a
matter for their curators; all, however, fully understand the
need for such protection; indeed we see no problems in this area
over and above the well understood needs to protect the privacy
of patients and their relatives.
5.2 In the end it may well be possible for
a very determined person to break anonymity if contextual data
for patients are public. The dilemma facing scientists, of course,
is that completely stripped of any context such data may lose
much of their value. For example it is clearly important that
scientists can analyse different mutations in the same gene in
the context of the particular phenotypes of those patients carrying
the mutations. Perhaps some classes of mutation have a much more
severe phenotype (or worse prognosis) than others. In the case
of genetic diseases that are very rare, inclusion of data
describing a clinical condition could well allow the particular
patient to be identified.
5.3 The EBI is both a member of the world-wide
HUGO administered "Mutation Database Initiative" and
the producer of an integrated resource, the Sequence Variation
Database (http://www.ebi.ac.uk/mutations). This project is funded
from EMBL sources.
5.4 There is a real danger that such information,
though usually obtained by scientists funded by public monies,
may end up in the private domain and then be subject to licence.
For example the Human Gene Mutation Database in Cardiff (http://www.uwcm.ac.uk/uwcm/mg/hgmd0.html)
has recently signed an agreement with Celera that limits access
to the data they include (see Bioinform 4(19), 18 September 2000).
This trend is very regrettable as it means access to data, the
great majority of which are discovered by publicly funded scientists,
is preferentially available to commercial interests, some of which
may (in effect) be monopolistic.
5.5 Human mutation data typically include
data associated with a particular disease. A new class of human
genetic data is now being collected: the variation in nucleotide
sequence that affects every individual in the world (other than
identical twins); two individuals chosen at random will differ
in the base pair at about three million different positions in
their genomes (about 0.1 per cent). These data (so-called SNPs,
Single Nucleotide Polymorphisms) are being collected both in the
public and private domains. In the public domain the "SNP
Consortium" (http://snp.cshl.org) and many others are providing
data on a large number of human polymorphisms. In the private
domain Celera have recently announced the release of data on 2.8
million SNPs (see http://www.pecorporation.com/press/prccorp091300.html)
(of which 400,000 are from public data).
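The figures in this paragraph can be checked against one another. The short sketch below, whose function names are ours and purely illustrative, derives the genome size implied by "three million positions" being "about 0.1 per cent", and the share of the Celera SNPs that comes from public data. Both results are approximations derived from the rounded figures quoted above, not independent measurements.

```python
# Rounded figures quoted in paragraph 5.5.
DIFFERING_POSITIONS = 3_000_000    # positions at which two individuals differ
FRACTION_DIFFERING = 0.001         # "about 0.1 per cent"
CELERA_SNPS = 2_800_000            # SNPs announced by Celera
CELERA_SNPS_FROM_PUBLIC = 400_000  # of which from public data


def implied_genome_size(differences: int, fraction: float) -> float:
    """Genome size (in base pairs) implied by the two quoted figures."""
    return differences / fraction


def public_share(public: int, total: int) -> float:
    """Fraction of the announced SNPs that derive from public data."""
    return public / total


if __name__ == "__main__":
    size = implied_genome_size(DIFFERING_POSITIONS, FRACTION_DIFFERING)
    share = public_share(CELERA_SNPS_FROM_PUBLIC, CELERA_SNPS)
    print(f"Implied genome size: about {size / 1e9:.0f} billion base pairs")
    print(f"Share of Celera SNPs from public data: about {share:.0%}")
```

The two quoted figures are consistent with a genome of about three billion base pairs, and about one in seven of the SNPs announced by Celera derives from public data.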
5.6 Scientists have handled data on genetic
polymorphisms in humans for many decades. There is, for example,
an enormous amount of information available concerning the frequency
and distribution of different alleles that code for the blood
group substances, as well as for many other human polymorphisms
(see http://human.stanford.edu/). These data have been of extraordinary
interest and importance to human biologists. There is the hope
that these classical studies will be followed by studies of SNPs
(see http://satori.stanford.edu/institute.html), although proposals
for surveys have yet to gain universal approval. Typically such
data, if they are to be used for the study of human genetic and cultural
diversity, need not be attributable to an individual; but they
do need to be attributable to a community or population. The danger
that these data will be used to discriminate between populations
is real, but no different in principle from the danger posed by
protein polymorphism data. The danger that the sampling of populations
to obtain such data will be used to exploit genetic diversity
for commercial or even public benefit is also real, but should
be obviated by well written ethical agreements.
5.7 The main justification for determining
a very large number of human SNPs is their potential for the analysis
of complex human diseases, diseases that might have a multifactorial
basis (see Nature 407:516, 28 September 2000). It is here
that there is a direct conflict between commercial and public
interests. We see neither reason nor method to prevent commercial
interests from obtaining and exploiting such data; what is, however,
vital is that the public domain is funded to compete with these
interests at a realistic level. It is only then that the public
good will be best served by the exploitation of human genetic
data, since they promise enormous benefits to human health.
Graham Cameron
Joint Head, European Molecular Biology
Laboratory - European Bioinformatics Institute
Michael Ashburner FRS
Joint Head, European Molecular Biology Laboratory - European
Bioinformatics Institute and Professor of Biology, Department
of Genetics, University of Cambridge
2 October 2000