I presented a talk, “Big data biology for pythonistas: getting in on the genomics revolution”, at PyCon AU 2016.

In the abstract, I promised the following:

In 2001 Bill Clinton unveiled “the most important, most wondrous map ever produced by humankind” - the human genome. This monumental endeavour cost $3 billion, and took hundreds of scientists from all over the world 13 years. Today, a single person can generate such a map in ~2 days for $1000. This dramatic drop in cost means that we now have data for hundreds of thousands of people - and other species - from all corners of the globe, and cohorts are available for every major disease under the sun. Petabytes of new data are also being generated every week.

Most of this data is publicly available, so anyone with an internet connection can try in silico biology from the comfort of their own home. In my talk, I’ll walk through what this data looks like, and how it’s analysed - with a special focus on where Python fits into the workflow (tl;dr: the most interesting parts!). I will also highlight some common pitfalls software engineers and developers face when getting into this space.

Finally, I’ll showcase several other facets of bioinformatics that sorely need contributions from good coders. Genomics is rapidly entering the world of health care in both the public and private hospital sectors, and in direct-to-consumer genetic testing. Understanding this data, and the challenges and limitations of its analysis, will help us all make better-informed health and medical decisions that affect our own quality of life and that of those we love.

The slides for this:

See the full video on YouTube here.

If you are trying to follow up on my talk and carry out an analysis of some data (or are looking for some data, or would like to know where to find the data from paper X), please leave a comment below, and I’ll do my best to answer your question and help! I don’t monitor the comments on YouTube or SlideShare, but I do get a ping if you leave a comment here.