UCSC Genomics Text Indexing

Dear Sir or Madam,

We are an academic group headquartered at the University of California, Santa Cruz, and we wish to offer our help in indexing the biomedical results in your journals using the reference sequence of the human genome. Hundreds of thousands of biomedical researchers and clinicians use an on-line genome browser like the UCSC Genome Browser (http://genome.ucsc.edu) in their daily work to inspect gene structures, genetic mutations and functional annotations that are mapped to the coordinates of the reference genome. For example, the UCSC Genome Browser receives 7 million web requests per week. The system is something like Google Maps: the position along the chromosome serves as a universal coordinate system for navigating various kinds of information relevant to genetics and genomics. Thus, when users are in a specific place in the genome, they see the kinds of local information of interest to them, e.g. genes, human disease variants, etc., the same way one would elect to see restaurants or roads on Google Maps.

The most important functional annotation that remains unlinked to the human genome at the moment is the set of publications that study the region of the genome one is viewing. We have developed a system that allows rapid and automated mapping of scientific articles to the human genome and so addresses this gap in genomic knowledge. This project depends on the availability of full-text articles for high-throughput content mining. If you agree, we will index your content and provide links from the UCSC Genome Browser to your articles. We will do this by matching genomic identifiers or snippets of DNA or protein that occur in your article or its supplementary material to the reference sequence of the human genome. You can find examples and more details about the indexing approach on the webpage http://text.soe.ucsc.edu. An article by Richard van Noorden in the journal Nature recently covered our project; see the editorial at http://goo.gl/bxIm0 and his article at http://goo.gl/uW6h3.

To obtain the information we need to build this index, we are requesting permission to crawl all articles on your site published after 1980, that is, since the advent of routine reporting of DNA sequences. Like all search engines, we will not make the full-text content of your documents available; our users will see only a snippet of ~200 characters around the match of a sequence or identifier. Users will be provided with a link that will direct them to your website. We don't need any IT support from you other than permission to access your site in a programmatic manner from specified IP addresses. We will not overload your webserver, since our crawlers are designed to request only 3 documents per minute, a rate comparable to that of a human user. If you would prefer other arrangements, such as shipping us disks, we can accommodate them. If you already allow crawling of articles on your website and this specific request for permission is not necessary, please still send us an email stating your journal's policy so that we are aware of it.

Some publishers, like Elsevier, Wiley and the Nature Publishing Group, are already collaborating with us. You can follow the progress of our project at http://text.soe.ucsc.edu/progress.html. Our search engine can be found at http://genome.ucsc.edu (go to human genome and set the "Publications" track to "pack").

A recent study by the Publishing Research Commission, an industry association of academic publishers, found that most publishers grant indexing requests from academic researchers, and thus you are likely to have received similar requests. If you have not yet received such requests, we hope that you will consider ours as the first of many that are likely to come in the future, and as one that may set an important precedent for your company. We strongly believe that granting requests like ours, which support the development of tools that help promote access to biomedical information, will provide enormous benefits to the biomedical research community and to the publishing industry alike.

Thank you so much for your consideration.

Sincerely yours,

Prof. David Haussler, PhD, University of California, Santa Cruz
Maximilian Haeussler, PhD, University of California, Santa Cruz
Casey Bergman, PhD, University of Manchester, UK
Jeff Murray, MD, University of Iowa

Email to publishers