Training the trainer presentation

Presentation on Carpentry Instructor Training - “Training the trainers” given at ANDS Data Seminar.

Amazon AWS workshop at the University of Sydney

On October 24th I attented a talk by Adrian White about using Amazon AWS for research. Below are my notes, not extensively annotated, in the hope that they’ll be useful for someone. If something is not clear, please ask in the comments, and I’ll try to answer to the best of my knowledge

161024_AWS workshop

  • SciTeam ~10 ppl

Overview of AWS services

  • If you need to make your data available you have availability zones. For an ultra-stable website we can make our app accessible in multiple AZs.

  • Networking - pair with AARNet in Australia

  • spark to run GATK

  • ML module - developed for ecommerce based on how people visit websites
  • supervised learning only, basic linear regression

Essential services

Elastic compute (EC2)

  • elastic load balance capability interesting for when you’re firing up a shiny analytics environment

Networking Services

Amazon VPC (virtual private cloud)

  • own network isolated, within a region, you control addressing, DNS servers, whether it connects to the internet

AWS DirectConnect

  • if you need to transfer data faster

Amazon Route 53

  • Domain Name System (DNS) web service.


    Amazon S3

  • object storage service (put/get/delete) not a file system (so can’t do byte range retrievals and other subsets of the files)
  • durability with a 10^6 objects you have 1:10k chance of losing it
  • File size - up to 5 Tb.

Amazon EBS

  • up to 16 Tb
  • disk that fires up and is mounted to a server on EC2
  • can still use magnetic disk to cut cost (default is SSD)


  • file size up to 50 Tb
  • cold storage
  • can take 3-5 hours to retrieve object
  • also 11 9s durability

    AWS Storage Gateway

  • storage on premise


    Amazon RDS

  • Standard relational database. Maintained/run by amazon so you can just use the data

    Amazon DynamoDB

  • Managed NoSQL database service
  • today need 100 reads/sec, tomorrow 1000k

Amazon ElastiCache

  • reddis
  • some there open source libraries

Big Data Services

  • Amazon EMR (Elastic Map Reduce)
  • AWS Data Pipeline Hosted Hadoop framework
    • Move data among AWS services and on premises data sources
  • Amazon Redshift
    • Petabyte-scale data warehouse service
    • OLAP style DB environment (massively parallel processing database)

Monitoring services

  • Amazon CloudWatch (free)
    • Monitor resources
    • you can make your own monitors such as how many times a function is run and send you notifications or make graphs
  • AWS IAM (Identity & Access Mgmt)
    • Manage users, groups & permissions
  • AWS OpsWorks
    • Dev-Ops framework for application lifecycle management
  • AWS CloudFormation
    • Templates to deploy & manage
    • This is useful for us. You build a cluster based on templates = users ,roles, storage, databases, etc
    • infrastructure as a configuration file
  • AWS Elastic Beanstalk
    • Automate resource management
    • can host apps with python or java or docker containers (find out more)

Accessing your resources

  • Everything you can do through the Console, you can do through the CLI or an SDK
  • SDK’s for most programming languages
    • Andriod, IOS, Java, .Net, Node.js, PHP, Python, Ruby, Go

Commercial models

  • on-demand
    • On-Demand Pay for compute capacity by the hour with no long-term commitments For spiky workloads, or to define needs
  • spot
    • Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand For time-insensitive or transient workloads
    • 60-70% savings on on demand price when it’s on Spot
    • if price goes above that cost, it gets shut down in 2 minutes
    • RNA-seq on spot to have disappearing servers on 250 k samples
  • Reserved
    • Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization
  • Dedicated - you are the only user on a particular physical piece of hardware - health care data compliancy (in the US)

Amazon won’t kill jobs for you itself if your bill goes higher than X. You can configure this yourself, specifying what should happen to storage.


  • Genome processing
  • Modeling and Simulation
  • Government and Educational Research
  • Monte Carlo Simulations
  • Transcoding and Encoding (video)
  • Computational Chemistry


  • self-paced-labs

Can also be used in class:

  • AWS blog - best place to get updates on the main stuff that changes
  • Deeper into a topic - product page + documentation

Moving data to AWS

Storing data in S3 and EC2

Amazon S3 is object based storage

  • Data is an object (treat each file as a single object)
  • Consists of data (globally unique identifier) and metadata
  • Very simple operations: (not POSIX!)
  • Cannot do an lseek, partial read or write, over-write existing data
  • Versioning is allowed (only differences are stored)
  • You control where data is located (Amazon never moves or copies it)
  • Flat hierarchy (no concept of directories)

Amazon S3 is

  • Extremely Durable (99.999999999%)
  • Extremely Available (99.99%)
  • Virtually infinite (scalable)
  • Pay only for what you use
  • You control who has access to your data (and the location of the data)
  • Data Encryption
    • not encrypted by default
    • can tick a box to get amazon to manage encryption
    • can be completely client side encryption only
  • Event notifications
  • Lifecycle Management
  • Amazon S3 is more read oriented than write oriented
  • However, write performance is still very good

QIMR: Moving data to Europe via snowball - a data store device

  • Optimising data transfer to S3
  • Examples


Before worrying about upload performance:

  1. Know file size distribution and the amount of data
    • Average file size? Largest? Smallest? Standard Deviation?
  2. Know how you are connected to Internet and AWS
    • Perform preliminary tests to S3 to understand network performance
  3. Know how you will use data in an Amazon S3 bucket
    • Will it be shared?
    • How is it organized? How do applications access the data?
    • Read vs. Write?
    • Being able to answer these questions will make things much, much easier (with faster uploads)

Non-Network Keys to the fastest upload

  • The following are general principles for the quickest data upload from outside of AWS
  • Multi-part upload
    • aws s3 create-multipart-upload
    • aws s3 sync # Better command that will try to optimise the performance for you!!!
    • All high-level commands that involve uploading objects into S3 (aws s3 cp, aws s3 mv, and aws s3 sync) automatically perform a multipart upload when the object is large
  • Parallel upload
    • each process uses a cpu
    • How you do this is very specific to the language (SDK)
    • In Python use multi-processing module
    • Can use threads (C++, Java)
      • Some tools already available:
    • s3-parallel-put ( )
  • Random prefix to key name
    • first two characters of filename are part of the key, and should be randomised (evenly) by the alphabet so that they get put onto different partitions and hence the performance is much better
  • tar
    • can put everything into a tar archive (such as images), then look at the TOC of the tar and pull out only the byte range of the image you need from AWS
  • Data compression


SKA (square kilometer) array

  • s3-parallel-upload script (python/boto2)
  • performance dropped off after ~30 parallel processes


Cluster HPC

  • can do placement groups - which means the compute nodes are close to each other physically at the server farm
  • storage is ephemeral, EBS, NFSv4 Amazon EFS or Lustre Intel cloud edition (PAYG), BeeGFS
  • is still redundant within your region (so not as durable as S3)
  • minimum use time on Amazon HPC is 1 hour (so cost is that of 1 hour)

Instance types

  • Family - optimised for a particular use
    • ex R3 - memory
    • ex G2/P2 - GPU
    • ec C3/ CPU use type
    • P2
      • machine learning genomics (tensor flow)

Data lakes

  • store datasets of different types

Research on AWS

Lab: Alces Flight & CfnCluster

Apart from Alces flight, what other launch a cluster configurations are there?

In order to relaunch a snapshot you have you can use CfnCluster.

spark R machine learning

jupiter gnu (ask him via email)

If you need to use a key on Windows


PyConAU 2016 Presentation

I presented a talk on Big data biology for pythonistas: getting in on the genomics revolution at PyCon Au 2016.

In the abstract, I promised the following:

In 2001 Bill Clinton unveiled “the most important, most wondrous map ever produced by humankind” - the human genome. This monumental endeavour cost $3 billion, and took hundreds of scientists from all over the world 13 years. Today, a single person can generate such a map in ~2 days for $1000. This dramatic drop in cost means that we now have data for hundreds of thousands of people - and other species - from all corners of the globe, and cohorts are available for every major disease under the sun. Petabytes of new data are also being generated every week.

Most of this data is publicly available, so anyone with an internet connection can try in silico biology from the comfort of their own home. In my talk, I’ll walk through what this data looks like, and how it’s analysed - with a special focus on where python fits into the workflow (;tldr the most interesting parts!). I will also highlight some common pitfalls software engineers and developers face when getting into this space.

Finally, I’ll showcase several other facets of bioinformatics that sorely need contributions from good coders. Genomics is rapidly entering the world of health care in both the public and private hospital sectors, and in direct-to-consumer genetic testing. Understanding this data, the challenges and limitations of its analytics will help us all make better-informed health and medical decisions, affecting our quality of life and those we love.

The slides for this:

See the full video on youtube here.

If you are trying to follow up on my talk and carry out an analysis of some data (or are looking for some data, or would like to know where to find the data from paper X), please leave a comment below, and I’ll do my best to answer your question and help! I don’t monitor the comments on youtube or slideshare, but I do get a ping if you leave a comment here.