
Wednesday, August 29, 2012

vsespb/mt-aws-glacier · GitHub



mt-aws-glacier

Perl Multithreaded multipart sync to Amazon AWS Glacier service.

Intro

Amazon AWS Glacier is an archive/backup service with a very low storage price, but with some caveats around usage and archive-retrieval pricing. Read more about Amazon AWS Glacier.
mt-aws-glacier is a client application for Glacier.

Version

  • Version 0.7 Beta

Features

  • Does not use any existing AWS library, so it can be flexible in implementing advanced features
  • Glacier Multipart upload
  • Multithreaded upload
  • Multipart+Multithreaded upload
  • Multithreaded retrieval, deletion and download
  • Tracking of all uploaded files with a local journal file (opened for write in append mode only)
  • Checking integrity of local files using journal
  • Ability to limit number of archives to retrieve

Coming-soon features

  • Multipart download (using HTTP Range header)
  • Ability to limit amount of archives to retrieve, by size, or by traffic/hour
  • Use journal file as flock() mutex
  • Checking integrity of remote files
  • Upload from STDIN
  • Some integration with external world, ability to read SNS topics
  • Simplified distribution for Debian/RedHat
  • Split code to re-usable modules, publish on CPAN (Currently there are great existing Glacier modules on CPAN - see Net::Amazon::Glacier by Tim Nordenfur https://metacpan.org/module/Net::Amazon::Glacier )
  • Create/Delete vault function

Planned next version features

  • Amazon S3 support

Important bugs/missed features

  • Zero length files are ignored
  • chunk size hardcoded as 2MB
  • Only multipart upload implemented, no plain upload
  • Retrieval works as a proof-of-concept, so you can't initiate a retrieve job twice (until the previous job is completed)
  • No way to specify SNS topic
  • HTTP only, no way to configure HTTPS yet (however it works fine in HTTPS mode)
  • Internal refactoring needed, no comments in source yet, unit tests not published
  • A journal file is required to restore a backup. To be fixed: file metainformation will be stored in the archive description.

Production ready

  • Not recommended for use in production until the first "Release" version. Currently Beta.

Installation

  • Install the following CPAN modules:
            LWP::UserAgent JSON::XS
    
    that's all you need.
  • In case you use HTTPS, also install:
            LWP::Protocol::https
    
  • Some CPAN modules are better installed as OS packages (example for Ubuntu/Debian):
            libjson-xs-perl liblwp-protocol-https-perl liburi-perl
    
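As a rough sketch, the dependencies can be pulled in either straight from CPAN or, on Ubuntu/Debian, via the OS packages listed above (the commands below are illustrative; adjust to your system):
            # From CPAN (include LWP::Protocol::https only if you plan to use HTTPS)
            cpan -i LWP::UserAgent JSON::XS LWP::Protocol::https
            # Or, on Ubuntu/Debian, install the packaged equivalents instead
            sudo apt-get install libjson-xs-perl liblwp-protocol-https-perl liburi-perl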

Warning

  • When playing with Glacier, make sure you will be able to delete all your archives; it's currently impossible to delete an archive or a non-empty vault in the Amazon console. Also make sure you have read all the AWS Glacier pricing/FAQ.
  • Read their pricing FAQ again, really. Beware of retrieval fee.
  • Back up your local journal file. Currently it's impossible to correctly restore a backup without the journal file.

Usage

  1. Create a directory containing the files to back up. Example: /data/backup
  2. Create config file, say, glacier.cfg
            key=YOURKEY
            secret=YOURSECRET
            region=us-east-1 # eu-west-1, us-east-1, etc.
    
  3. Create a vault in the specified region using the Amazon Console (here, myvault)
  4. Choose a filename for the Journal, for example, journal.log
  5. Sync your files
            ./mtglacier.pl sync --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --concurrency=3
    
  6. Add more files and sync again
  7. Check that your local files have not been modified since the last sync
            ./mtglacier.pl check-local-hash --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
    
  8. Delete some files from your backup location
  9. Initiate archive restore job on Amazon side
            ./mtglacier.pl restore --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --max-number-of-files=10
    
  10. Wait 4+ hours
  11. Download restored files back to backup location
            ./mtglacier.pl restore-completed --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
    
  12. Delete all your files from vault
            ./mtglacier.pl purge-vault --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log
    
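If you want the sync step to run unattended, one hedged option (not part of this README; the schedule, user, and paths below are illustrative assumptions) is a cron entry that re-runs the same sync command from step 5:
            # /etc/cron.d/mtglacier-sync (illustrative): nightly sync at 02:30 as user "backup"
            30 2 * * * backup cd /opt/mt-aws-glacier && ./mtglacier.pl sync --config=glacier.cfg --from-dir /data/backup --to-vault=myvault --journal=journal.log --concurrency=3 >> /var/log/mtglacier-sync.log 2>&1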

Test/Play with it

  1. Create an empty dir MYDIR
  2. Set vault name inside cycletest.sh
  3. Run
    ./cycletest.sh init MYDIR
    ./cycletest.sh retrieve MYDIR
    ./cycletest.sh restore MYDIR
    
OR
    ./cycletest.sh init MYDIR
    ./cycletest.sh purge MYDIR

Minimum AWS permissions

Something like this:
            {
              "Statement": [
                {
                  "Effect": "Allow",
                  "Resource": [
                    "arn:aws:glacier:eu-west-1:XXXXXXXXXXXX:vaults/test1",
                    "arn:aws:glacier:us-east-1:XXXXXXXXXXXX:vaults/test1",
                    "arn:aws:glacier:eu-west-1:XXXXXXXXXXXX:vaults/test2",
                    "arn:aws:glacier:eu-west-1:XXXXXXXXXXXX:vaults/test3"
                  ],
                  "Action": [
                    "glacier:UploadArchive",
                    "glacier:InitiateMultipartUpload",
                    "glacier:UploadMultipartPart",
                    "glacier:UploadPart",
                    "glacier:DeleteArchive",
                    "glacier:ListParts",
                    "glacier:InitiateJob",
                    "glacier:ListJobs",
                    "glacier:GetJobOutput",
                    "glacier:ListMultipartUploads",
                    "glacier:CompleteMultipartUpload"
                  ]
                }
              ]
            }

S3fs - LinodeWiki



Mounting Amazon S3 as a local filesystem via FUSE

Get your AWS account information

If you haven't already, sign up with Amazon Web Services and enable S3 for your account.
On the AWS page, hover over the "Your Account" tab, and select "Security Credentials". Find the section labeled "Your access keys" - make a note of your Access Key ID, then click on the link labeled "Show" under "Secret Access Keys" and note that information, too. You will have to provide both pieces of information to the s3 commands you'll be using later.

Set up s3cmd and create a bucket

S3 lets you create buckets to store your information; you have to use the AWS API to create buckets, and you won't be able to create buckets with s3fs. If you haven't already created a bucket in your S3 account, you can use the s3cmd program to set one up:
 $ s3cmd --configure
 
 Enter new values or accept defaults in brackets with Enter.
 Refer to user manual for detailed description of all options.
 
 Access key and Secret key are your identifiers for Amazon S3
 Access Key: YourID
 Secret Key: YourSecret
 
 Encryption password is used to protect your files from reading
 by unauthorized persons while in transfer to S3
 Encryption password: gpgpass
 Path to GPG program [/usr/bin/gpg]: 
 
 When using secure HTTPS protocol all communication with Amazon S3
 servers is protected from 3rd party eavesdropping. This method is
 slower than plain HTTP and can't be used if you're behind a proxy
 Use HTTPS protocol [No]: Yes
 
 New settings:
 Access Key: YourID
 Secret Key: YourSecret
 Encryption password: gpgpass
 Path to GPG program: /usr/bin/gpg
 Use HTTPS protocol: True
 HTTP Proxy server name: 
 HTTP Proxy server port: 0
 Test access with supplied credentials? [Y/n] n
 
 Save settings? [y/N] y
 Configuration saved to '/home/user/.s3cfg'
Note: your AWS Access ID and Secret will be stored in cleartext in .s3cfg. Make sure to set permissions on the file to be as restrictive as possible, and keep the file safe!
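With s3cmd configured, locking down the config file and creating the bucket you'll mount later looks roughly like this ("mybucket" is a placeholder; bucket names must be globally unique):
 # restrict the credentials file to your user only
 chmod 600 ~/.s3cfg
 # create the bucket
 s3cmd mb s3://mybucket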

Compile S3FS

S3FS isn't packaged as a binary with any distribution I'm aware of, but it's relatively easy to compile. On a Debian Lenny system, you'll need a few packages to compile s3fs:
 sudo apt-get install make g++ libcurl4-openssl-dev libssl-dev libxml2-dev libfuse-dev
Grab the source off Google Code:
 wget http://s3fs.googlecode.com/files/s3fs-r177-source.tar.gz
Unpack the source and build the binary
 tar xzvf s3fs-r177-source.tar.gz
 cd s3fs
 make
Running make may return a warning or two, but should end with "Ok!". If not, you probably missed one of the dependency libraries above.
Copy the resulting binary to somewhere in your path; I used /usr/local/bin
 sudo cp s3fs /usr/local/bin
If you built the binary on one system and want to run it on another system, you'll still need libcurl and fuse installed. You shouldn't need to do this if you built the binary on the same machine:
 sudo apt-get install fuse-utils libcurl3
Test the command by running it.
 s3fs
If you get warnings about missing libcurl or libfuse, review your steps to make sure all the dependent shared objects are installed.

Using S3FS

If you want a regular user to be able to mount S3 shares, they will need to be added to the fuse group so they can read and write /dev/fuse
 usermod -aG fuse username
Now, as a user with fuse access, test a simple mount:
 s3fs mybucket -o accessKeyID=youraccesskey -o secretAccessKey=yoursecret -o url=https://s3.amazonaws.com /mnt/s3
You should be able to read and write files to and from /mnt/s3. If you write a file with S3FS, try confirming it with s3cmd:
 s3cmd ls s3://mybucket

Setting up automatic mounts

You can have s3fs mount your S3 shares automatically. To do this, create a file called /etc/passwd-s3fs. Make the permissions on this file as restrictive as possible: only users who will be mounting S3 filesystems should be able to read the file. I have my /etc/passwd-s3fs file owned by root, group root, with 400 permission because I only use root to mount the shares.
The format of the file is your Access Key ID and your Secret Key separated by a colon with no spaces between:
 AccessKeyID:SecretKey
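For example, a minimal way to create the file with the root-owned, mode-400 setup described above (the AccessKeyID:SecretKey pair is of course a placeholder):
 sudo sh -c 'echo "AccessKeyID:SecretKey" > /etc/passwd-s3fs'
 sudo chown root:root /etc/passwd-s3fs
 sudo chmod 400 /etc/passwd-s3fs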
To have the share mount when your Linode boots, add it to /etc/fstab:
 s3fs#mybucket /mnt/s3 fuse url=https://s3.amazonaws.com 0 0
Now you should be able to mount the filesystem with a regular mount command:
 sudo mount /mnt/s3

More mount options

By default, Fuse will lock the access to a file down to whoever ran the Fuse command. So, if you mount a filesystem as user foo, only foo will be able to access the filesystem; even root can't get to it! If you want to put an S3 filesystem in /etc/fstab and have root mount it at boot but have a regular user or group own the filesystem, you can set uid and/or gid in /etc/fstab:
 s3fs#mybucket /mnt/s3 fuse uid=500,gid=500,url=https://s3.amazonaws.com 0 0
If you want everyone on your Linode to have access to the filesystem and use Unix permissions for security instead of Fuse, you can pass a special option in /etc/fstab:
 s3fs#mybucket /mnt/s3 fuse allow_other,url=https://s3.amazonaws.com 0 0

HTTPS

By default, s3cmd and s3fs will use HTTP to access Amazon Web Services, and they will pass your Access ID and Secret Key in plain text. You need to protect your login information against snooping. You'll notice that when I configured s3cmd, I said to use HTTPS, and in all the s3fs commands, I included an option to use https. This will ensure that your transmissions to and from S3, including your login credentials, are encrypted in transit.

Great, now what?

What good is having S3 locally mounted? I'm using it to store my MP3s and photos, currently. I stream the music back to myself with MPD and display photos with WordPress's NextGen Gallery plugin. I pay about $8/mo to store 30GB on S3 and shuffle lots of bits around.
I have tested S3 as a backing store for BoxBackup, and that REALLY doesn't work. BoxBackup expects storage to be locally attached and dislikes latency in its datastore.
I have also tested S3 as a backing store for Bacula, which works very well. Look for a new Wiki page later detailing how to best configure Bacula storage on S3.
Please note: if using CentOS, the only functional solution I have found was to compile fuse and s3fs. Directions are on this page, in comment 8.

InstallationNotes - s3fs - Installation Notes - FUSE-based file system backed by Amazon S3 - Google Project Hosting



General Instructions

From released tarball

Download: http://s3fs.googlecode.com/files/s3fs-1.61.tar.gz
Download SHA1 checksum: 8f6561ce00b41c667b738595fdb7b42196c5eee6 (you can verify it as shown after the build steps below)
Download size: 154904 bytes
  • tar xvzf s3fs-1.61.tar.gz
  • cd s3fs-1.61/
  • ./configure --prefix=/usr
  • make
  • make install (as root)
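To verify the download against the SHA1 checksum above (assuming GNU coreutils' sha1sum is available):
  sha1sum s3fs-1.61.tar.gz
  # expected: 8f6561ce00b41c667b738595fdb7b42196c5eee6  s3fs-1.61.tar.gz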

From subversion repository

Notes for Specific Operating Systems

Debian / Ubuntu

Tested on Ubuntu 10.10
Install prerequisites before compiling:
  • apt-get install build-essential
  • apt-get install libfuse-dev
  • apt-get install fuse-utils
  • apt-get install libcurl4-openssl-dev
  • apt-get install libxml2-dev
  • apt-get install mime-support

Fedora / CentOS

Tested on Fedora 14 Desktop Edition and CentOS 5.5 (note: tested on Nov 25, 2010 with s3fs version 1.16; newer versions of s3fs have not been formally tested on these platforms)
Note: See the comment below on how to get FUSE 2.8.4 installed on CentOS 5.5
Install prerequisites before compiling:
  • yum install gcc
  • yum install libstdc++-devel
  • yum install gcc-c++
  • yum install fuse
  • yum install fuse-devel
  • yum install curl-devel
  • yum install libxml2-devel
  • yum install openssl-devel
  • yum install mailcap

Thursday, August 23, 2012

Installing s3fs on Ubuntu — Zentraal



apt-get install build-essential libcurl4-openssl-dev libxml2-dev libfuse-dev \
                comerr-dev libfuse2 libidn11-dev libkrb5-dev libldap2-dev \
                libselinux1-dev libsepol1-dev pkg-config fuse-utils sshfs curl

mkdir ~/downloads
cd ~/downloads

wget http://sourceforge.net/projects/fuse/files/fuse-2.X/2.8.6/fuse-2.8.6.tar.gz/download
mv download fuse-2.8.6.tar.gz
tar -xvzf fuse-2.8.6.tar.gz 
cd fuse-2.8.6
./configure --prefix=/usr
make
make install

cd ~/downloads

wget http://s3fs.googlecode.com/files/s3fs-1.61.tar.gz
tar -xvzf s3fs-1.61.tar.gz 
cd s3fs-1.61
./configure --prefix=/usr
make
make install

Creating the password file and setting permissions

emacs ~/.passwd-s3fs
Format should be like this:
accessKeyId:secretAccessKey
s3fs requires that we have sane permissions on this file
chmod go-r .passwd-s3fs

Mounting our S3 bucket

mkdir /mnt/bucketname
s3fs bucketname /mnt/bucketname

How to mount an Amazon S3 bucket as virtual drive on CentOS 5.2 at A Waage Blog



#Note: If you are using CentOS 4, it’s the same general process. You might have more difficulty finding the packages to install fuse and dependencies.
This is a simple guide on how to mount your S3 bucket as a “virtual drive”. This is great for backing up your data to S3, or downloading a bunch of files from S3.
#First, make sure you have the fuse package installed.
#On CentOS, fuse is available from RPMforge
#http://wiki.centos.org/AdditionalResources/Repositories/RPMForge
#Now install fuse
yum install fuse
modprobe fuse
#Download s3fs and make
cd /usr/local/src
wget http://s3fs.googlecode.com/files/s3fs-r191-source.tar.gz
#Unpack the source
tar xzvf s3fs-r191-source.tar.gz
cd s3fs
make
#Copy the binary to /usr/local/bin (or wherever you prefer)
cp s3fs /usr/local/bin
#Make a mount point
mkdir /mnt/s3drive
#Mount your bucket like this:
s3fs bucketname -o accessKeyId=XXXXXXXXXXXXXXXXXXXX -o secretAccessKey=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX /mnt/s3drive
That’s it! You can change directory to your virtual drive or start copying files!
Go ahead and use a visual client such as CyberDuck or S3Hub to verify with your own eyes that this actually worked. :)
Good luck!

Thursday, August 2, 2012

Journey Through The JavaScript MVC Jungle | Smashing Coding



Frameworks: When To Use What?

To help you get started with narrowing down frameworks to explore, we would like to offer the below high-level framework summaries which we hope will help steer you towards a few specific options to try out.
I want something flexible which offers a minimalist solution to separating concerns in my application. It should support a persistence layer and RESTful sync, models, views (with controllers), event-driven communication, templating and routing. It should be imperative, allowing one to update the View when a model changes. I’d like some decisions about the architecture left up to me. Ideally, many large companies have used the solution to build non-trivial applications. As I may be building something complex, I’d like there to be an active extension community around the framework that has already tried addressing larger problems (Marionette, Chaplin, Aura, Thorax). Ideally, there are also scaffolding tools (grunt-bbb, brunch) available for the solution. Use Backbone.js.
I want something that tries to tackle desktop-level application development for the web. It should be opinionated, modular, support a variation of MVC, avoid the need to wire everything in my application together manually, support persistence, computed properties and have auto-updating (live) templates. It should support proper state management rather than the manual routing solution many other frameworks advocate being used. It should also come with extensive docs and of course, templating. It should also have scaffolding tools available (ember.gem, ember for brunch). Use Ember.js.
I want something more lightweight which supports live-binding templates, routing, integration with major libraries (like jQuery and Dojo) and is optimized for performance. It should also support a way to implement models, views and controllers. It may not be used on as many large public applications just yet, but has potential. Ideally, the solution should be built by people who have previous experience creating many complex applications. Use CanJS.
I want something declarative that uses the View to derive behavior. It focuses on achieving this through custom HTML tags and components that specify your application intentions. It should support being easily testable, URL management (routing) and a separation of concerns through a variation of MVC. It takes a different approach to most frameworks, providing an HTML compiler for creating your own DSL in HTML. It may be inspired by upcoming Web platform features such as Web Components and also has its own scaffolding tools available (angular-seed). Use AngularJS.
I want something that offers me an excellent base for building large scale applications. It should support a mature widget infrastructure, modules which support lazy-loading and can be asynchronous, simple integration with CDNs, a wide array of widget modules (graphics, charting, grids, etc) and strong support for internationalization (i18n, l10n). It should have support for OOP, MVC and the building blocks to create more complex architectures. Use Dojo.
I want something which benefits from the YUI extension infrastructure. It should support models, views and routers and make it simple to write multi-view applications supporting routing, View transitions and more. Whilst larger, it is a complete solution that includes widgets/components as well as the tools needed to create an organized application architecture. It may have scaffolding tools (yuiproject), but these need to be updated. Use YUI.
I want something simple that values asynchronous interfaces and lacks any dependencies. It should be opinionated but flexible on how to build applications. The framework should provide bare-bones essentials like model, view, controller, events, and routing, while still being tiny. It should be optimized for use with CoffeeScript and come with comprehensive documentation. Use Spine.
I want something that will make it easy to build complex dynamic UIs with a clean underlying data model and declarative bindings. It should automatically update my UI on model changes using two-way bindings and support dependency tracking of model data. I should be able to use it with whatever framework I prefer, or even an existing app. It should also come with templating built-in and be easily extensible. Use KnockoutJS.
I want something that will help me build simple Web applications and websites. I don’t expect there to be a great deal of code involved and so code organisation won’t be much of a concern. The solution should abstract away browser differences so I can focus on the fun stuff. It should let me easily bind events, interact with remote services, be extensible and have a huge plugin community. Use jQuery.

High Scalability - High Scalability - C is for Compute - Google Compute Engine (GCE)



C Is For Compute - Google Compute Engine (GCE)

After poking around the Google Compute Engine (GCE) documentation I had some trouble creating a mental model of how GCE works. Is it like AWS, GAE, or Rackspace? Just what is it? After watching Google I/O 2012 - Introducing Google Compute Engine and Google Compute Engine -- Technical Details, it turns out my initial impression, that GCE is disarmingly straightforward, is the point.

The focus of GCE is on the C, which stands for Compute, and that’s what GCE is all about: deploying lots of servers to solve computationally hard problems. What you get with GCE is a Super Datacenter on Google Steroids.

If you are wondering how you will run the next Instagram on GCE, you are missing the point. GAE is targeted at applications. GCE is targeted at:
  • Delivering a proven, pure, high performance, high scale compute infrastructure using a utility pricing model, on top of an open, secure, extensible Infrastructure-as-a-Service.
  • Delivering an experience that feels like you are in a datacenter rather than in a massively multi-tenant cloud.
  • Allowing you to become Google. Tackle the same problems Google tackles with the same infrastructure, minus all the data and people of course.
  • Standing up VM instances quickly, doing your work, and tearing them down quickly.
  • Performing better and better as the cluster gets bigger. Google considers large clusters to start at 10-20K instances.
  • Being a compute utility. You get resources affordably because of Google’s efficiency at scale.
  • Consistent performance. Google has pioneered consistent performance at scale; they make a huge deal of this and it’s mentioned several times in the demos. GCE is tuned for both high and consistent performance throughout the stack. The idea is that you don’t have to design for an unstable or inconsistent system, so you don’t have to design for the worst case. This allowed some customers to cut their number of cores in half.
  • Giving you a set of servers you can run any way you want.
  • Creating a technology you can bet your business on. Google is running Google business on the stack today.

Basic Overview Of GCE

  • Customers
    • Targeted at large compute jobs, batch workloads, or problems that require high performance real-time calculations. Not building websites. In the future they plan on adding more features like load balancing.
    • Right now it’s about work that can be parallelized. Will provide vertical scaling in the future, that is 32+ cores.
    • Seem to want enterprise customers that can make use of lots of cores, not little guys.
  • Datacenters
    • Region: for geography and routing domain.
    • Zone: for fault tolerance
    • Currently operating 3 US datacenters/zones, located on the East coast of the US.  
    • Working on adding more datacenters globally and adding more datacenters in the US.
  • API
    • JSON over HTTP API, REST-inspired, authorization is with OAuth2
    • Main resources: projects, instances, networks, firewalls, disks, snapshots, zones
    • Actions GET, POST (create), DELETE, custom verbs for updates
    • A command line tool (gsutil), a GUI, and a set of standard libraries give access to the APIs. The experience is like Amazon in that you have a UI and command line tools.
    • All Google tools use the API. There is no backdoor. The web UI is built on Google App Engine, for example. App Engine is the web facing application environment and is considered an orchestration system for GCE.
    • Partners like RightScale, Puppet, and OpsCode, also use the API to provide higher level services.
    • Want people to take their code and run it on their infrastructure. Open API. No backdoors. Can extend that stack at any level.
  • Project 
    • Everything happens within the context of a Project: team membership, group ownership, billing. A Project is a container for a set of resources that are owned by the Project and not by people. Every API action is traced back to a person instead of a credential.
  • Service Account
    • Synthetic identity acting as a user when performing operations in code. Connects seamlessly with GAE, Cloud Storage, Task Queues, and other Google services.
    • When launching a VM an OAuth2 scope is provided that is stored in a special metadata server that is used transparently between services. No configuration or password is required.
  • Virtual Machine
    • Linux virtual machines with root access. For security and performance reasons the kernel is locked down. The kernel is tuned to work with their networking environment.
    • Two stock versions of Linux: Ubuntu and CentOS. They say you can run whatever Linux distribution you want, but I’m not sure how that fits with the locked-down kernel policy.
    • Comes with gsutil installed, password authentication turned off (only SSH authentication is used), and automatic security updates turned on.
    • High performing 2.6 GHz Intel Sandy Bridge processor.
    • Available with 1, 2, 4, or 8 virtual CPUs. Each virtual CPU is mapped to a hyperthread. For a 2-CPU instance you get both halves of a real physical core.
    • 3.7GB RAM per core. 420GB local/ephemeral storage.
    • 8 core instances have dedicated spindles. You are the only one reading and writing from the disk, so you have more predictable/consistent performance.
    • Invented performance unit: the Google Compute Engine Unit (GQ). Roughly matches Amazon’s compute unit. Each virtual CPU is rated at 2.75 GQs.
    • Smaller machines will be available for prototyping and debugging.
    • Big boxes, because the focus is on high performance computing.
  • Instances
    • A combination of KVMs (Kernel Virtual Machines) and Linux cgroups are used for the underlying hypervisor technology. Linux scheduler and memory manager are reused to handle the scheduling of the machines.
    • KVM provides virtualization. Cgroups provides resource isolation. Cgroups was pioneered by Google to keep workloads isolated from each other.
    • Internally Google can run virtualized and non-virtualized workloads on the same kernel and on the same machine, which allows them to deploy and test one single kernel.
    • Located in a zone.
    • Fast boot times: 2 minutes.
  • Instance Metadata
    • Solving the configuration problem to customize VMs at boot time.
    • A dictionary of key-value pairs is available on the instance via a private HTTP metadata server just for that machine. This metadata can be set for the instance to control its boot/configuration/role process. Can be read using curl.
    • Project wide metadata is also available that is inherited by all instances. Used to push SSH keys into VM at boot time. A default image knows how to read a special bit of metadata called SSH Keys and then installs them into the VM.
  • Startup Scripts
    • Simple bootstrapping scripts, similar to rc.local, that run on boot.
    • Used to install software and start other software (see the sketch after this overview).
  • Service Orientation, not Server Orientation
    • Build across zones to deal with failure.
    • Use startup scripts and metadata for automatic configuration.
    • Use local disk as a cache or scratch area.
    • Build automation using GAE or their partners.
  • Networking - VPN
    • Google considers their network a distinguishing feature. It features high cross sectional bandwidth, that is, machines can talk more directly to each other without competing with neighboring traffic on a bus. This reduces network latency and increases the consistency of performance. They won’t publish any numbers though.
    • Each project gets its own secure VPN that is unshared with anyone else. Spans across all your VMs, no matter where they are.
    • Networking traffic does not transit the Internet. It is routed over Google’s secure, high performance private network.
    • Network is all L3 using private IP addresses that are guaranteed to come from a machine on your VPN.
    • VM name = DNS name. VMs have normal looking hostnames that you can assign and use the DNS to find. This is very convenient when bringing up an arbitrary set of hosts.
    • IPv6 in the future.
    • You can have many VPNs per project, but by default there is one called default that is used by default.
    • Broadcast and multicast are not supported, which if you have a VPN removes a lot of interesting architectures. Maybe with v6?
  • Networking - Internet
    • Traffic from the Internet to your machine is shunted on to Google’s private network as soon as they can and given a “first class” ticket to your VPN. This is like an overlay network you see on CDNs.
    • 1-to-1 NAT. Every VM can be assigned an external IP address that is rewritten as it enters and exits your VPN. It doesn’t exist on the VM when you run ifconfig.
    • IP addresses can be detached from a VM in one region and attached to a VM in another region and Google will make sure the traffic is routed properly.
    • Built in firewall to control who talks to what in the system.
    • Can’t use SMTP. Only UDP, TCP, and ICMP can be used to the Internet.
    • IP addresses are advertised with Anycast, then they encapsulate it, and then forward it to your VPN.
  • Storage
    • Focused on creating a persistent block device that offers enough performance/throughput that you don’t need to push storage local.
    • Two block storage devices: Persistent Disk and Local Disk.
  • Persistent disk
    • Off instance durably replicated storage medium. High consistency. High throughput solution. Secure. Backing store for database. Built from scratch to be highly performant and gives good 99.95 percentile performance.
    • Allocated to a zone.
    • Can be mounted read/write to a single instance or read only to a set of instances.
    • Data is transparently encrypted when it leaves your VM, before it is written to disk. Using new processors there’s very little to no overhead. It seems to use Google keys and not your keys.
    • Less than 3% variance in IO bandwidth when doing 4K random reads and writes. This is their consistency theme. Less variance than a local disk, which can vary by 13%.
    • For large block read and writes there’s triple the local bandwidth compared to local disk.
  • Local/ephemeral disk
    • Ephemeral on reboot. When the VM goes away the data goes away.
    • It’s encrypted using a VM specific key.
    • Currently all instances boot off of local disk; they are looking to boot off of persistent disk in the future.
    • 3.5TB with the 8 CPU instance.
    • With larger instances (4-8 core) you get dedicated spindles. One spindle with the 4 core instance and 2 spindles with the 8 core instance.
  • Google Cloud Storage  
    • Enterprise grade Internet object store.
    • HTTP API for getting and setting values.
    • Don’t have to worry about managing data. Replication is happening for you.
    • Publicly readable objects are cached close to where they will be used. Sounds a bit like a CDN. Data will be replicated to where it is needed and available quickly.
    • Uses Google global high performance internet backbone.
    • Read your writes consistency.
    • Bulk data. Useful for getting data in and out of Google’s cloud using Google’s high capacity pipes.
  • Pricing
    • 50% more compute for your money when compared to AWS.
    • Billed on demand by the hour.
    • SLA and support open to commercial customers.
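To make the metadata and startup-script ideas above concrete, here is a hedged sketch of what a bootstrap script might look like. The metadata endpoint path and the "role" attribute are assumptions for illustration, not taken from GCE documentation; only the general pattern (reading a per-instance HTTP metadata server with curl, then installing and starting software rc.local-style) comes from the overview above.
  #!/bin/bash
  # Illustrative GCE-style startup script; the metadata URL below is a placeholder, not the documented endpoint.
  ROLE=$(curl -s "http://metadata/instance/attributes/role" || echo "worker")
  # Install and start software based on the role, rc.local style.
  case "$ROLE" in
    frontend)
      apt-get -y install nginx
      service nginx start
      ;;
    *)
      apt-get -y install openjdk-6-jre-headless
      ;;
  esac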

Examples Of GCE Usage

Invite Media

Runs a real-time ad exchange that has a very high volume of traffic, 400K QPS, and as with all real-time markets requires consistently low latencies, 150ms end-to-end, in order to calculate the best deals. For each ad request they have a time budget of 10ms to find a backend server to serve the request and establish a connection.

Found the GCE model familiar. You have Linux VMs, you have disks, you can assign static IPs, create startup scripts, and have a nice API. Took two weeks to port their system to GCE.

Comparing existing provider with GCE, using 8 core instances:
  • 350 QPS vs 650 QPS (while respecting latency requirements)
  • 284 machines vs 140 machines
  • 5% connection errors vs < .05%
  • 11% of requests timed out vs 6%, which means 5 percent more ad requests they can buy for advertisers

Decided to migrate entire operation to GCE.

Hadoop On GCE

This is example code created by someone at Google and will be released in the future.
  • Can run from command line or GAE.
  • Launch a coordinator that has an API to set up all the other VMs in the cluster (100 nodes), monitor them, etc.
  • Booting from a fresh Ubuntu image the setup was pretty fast. The coordinator installs Hadoop and launches nodes. Took a while, but relatively quick.
  • Launched a job on Hadoop master to process 60GB of compressed wikipedia revision history. Slices data in CSV format. Took 1.5 minutes writing 70GB of data.
  • The CSV is piped into Big Query to answer questions like which wikipedia article had the most edits, who are the top editors, and other interactive questions.

Video Transcoding

This is a very common cloud demo.
  1. Video loaded into a job queue.
  2. Consumers, and you can run a lot of them on GCE, take a job and perform the transcode.
  3. Transcoded video is sent to the Google storage service.

MapR On Terasort

MapR ran the Terasort benchmark on a 1250 node cluster in 1:20 minutes at a cost of $16. This was near record performance, and they estimate that buying the same hardware to run the test locally would cost nearly $6 million.

They found GCE blazing fast, with great disk and network bandwidth. They were able to provision thousands of VMs in minutes.

BuildFax

Put their database and production servers on GCE. They are very pleased with the consistent performance. Their service delivers insurance related data points to customers at the time they write policies. Results were returned in less than 4 seconds with a very low variance. Again, this is the consistent performance claim.

Observations

  1. With GCE Google has designed an experience familiar to Amazon users, with some nice second system improvements in configuration and operations, and a lot of special Google sauce in performance.
  2. Better late than never. GCE is late to the game, but it has a strong performance, pricing, and development model story that often helps win customers over first-to-market entrants. If you need huge scale and/or great performance then why wouldn’t you consider GCE? Performance requires careful design from the start. It’s hard to add in later. And after all of Google’s bragging about their cool infrastructure this is your chance to give it a spin and see what it is made of.
  3. Kind of bummed that it’s not targeted more at front facing websites. There’s no reason you can’t run a website in GCE it seems, but unlike AWS you won’t get a lot of help. Like in the early days of EC2 it’s all up to you, but that’s probably OK for a lot of people.
  4. As Google deals with more and more customers can they maintain quality? As we’ve seen, most things go bad when problems occur and a lot of traffic is flowing through the system. Shared state is the system killer and Google still has plenty of that. Google has yet to test their cloud infrastructure in this way.
  5. Where will egress pricing end up once the low promotional pricing ends? Google lockin will occur if it’s expensive to transfer your data out of Google’s cloud. Google pricing in general is a bit scary.
  6. Will AWS Direct Connect be available to GCE?
  7. Is GCE a target for migration or integration? BigData jobs are an obvious target for GCE, but we've also seen examples where real-time services benefit from GCE, so running a few select services in GCE might be a good toe in the water strategy. Concerns over data transfer costs are part of the ecosystem lockin play. Resilience alone however argues for implementing systems in more than one cloud.
  8. Amazon has a huge advantage in services. Will Google go upstack as Amazon has done? Or is this your cloud equivalent of a chance to tap the Android market while everyone else is creating apps for the iPhone?