SurgeCon 2012 Notes

Mainly for my own reference….

* godaddy: culture of fear – not open – need to talk about failure
* “choice and voice” driving change
* big data – churn
* experience / availability / performance / scalability / adaptability / security / economy
* wearable computing = a big thing.  really?


[Scaling Pinterest]
* it will fail, keep it simple
* amazon ec2/s3 worked well
* pro/con limited choice means working within confines/planning
* mysql, very good support from Percona
* memcache + redis
* clustering vs sharding
* few failure modes = simplicity
* failure modes: data rebalance breaks, data corruption replicates, improper balance, data authority failure
* don't approach sharding too early; you end up denormalizing, and it gets harder to add features
* objects and mappings – object tables and mapping tables, queries are pk lookups or index lookups: no joins
* data does not move
* no schema changes required, index = new table
* Service Oriented Architecture: conn limits, isolate functionality, isolate access (security)
* Scaling the team
* search / feeds / followers <- services
* kafka / hadoop – log every action, snapshot dbs, run analytics.
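A rough sketch of the "data does not move" sharding idea above: if the shard lives inside the object id, any query can be routed without a directory lookup. The bit layout here is my own illustration, not Pinterest's actual format.

```python
# Pack shard, object type, and shard-local id into one 64-bit id.
# Bit widths are illustrative assumptions, not Pinterest's real layout.
SHARD_BITS, TYPE_BITS, LOCAL_BITS = 16, 10, 38

def make_id(shard, type_id, local_id):
    """Build a globally unique id that encodes its own shard."""
    return (shard << (TYPE_BITS + LOCAL_BITS)) | (type_id << LOCAL_BITS) | local_id

def parse_id(obj_id):
    """Recover (shard, type, local_id) -- routing needs no lookup table."""
    local = obj_id & ((1 << LOCAL_BITS) - 1)
    type_id = (obj_id >> LOCAL_BITS) & ((1 << TYPE_BITS) - 1)
    shard = obj_id >> (TYPE_BITS + LOCAL_BITS)
    return shard, type_id, local

# Because the shard id is baked into the object id, data never has to move:
pin_id = make_id(shard=12, type_id=1, local_id=42)
assert parse_id(pin_id) == (12, 1, 42)
```

All queries then become primary-key lookups on the object table (or index lookups on a mapping table) within a single known shard, which is why no joins are needed.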


[they own the pipes]
* round trip latency is the killer
* reduce number of requests to 1
* cert size (1024 vs 2048); buying a “better” certificate can mean a shorter chain
* optimize tcp stacks / ack[nowledgement] segments
* don't run or allow sslv2 / old ciphers
* DTLS: SSL w/o the TCP mess – rfc6347
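A minimal sketch of the "don't allow ssl2 / old ciphers" advice, using Python's `ssl` module as a stand-in (in 2012 you'd set the equivalent OpenSSL options in your web server config; the cipher string is an illustrative choice, not from the talk):

```python
import ssl

# Refuse legacy protocol versions on a server-side TLS context.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
ctx.options |= ssl.OP_NO_SSLv2 | ssl.OP_NO_SSLv3   # no SSLv2/SSLv3 handshakes

# Drop weak and anonymous cipher suites (OpenSSL cipher-list syntax).
ctx.set_ciphers("HIGH:!aNULL:!MD5:!RC4")
```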


[arch behind fast dns]
* ttfb (time to first byte)
* ttfdnsqr (time to first dns query response)
* many http optimizations ignore dns impact
* reduce multiple domains
* dns needs: better.. redundancy, routing, design.
* down/slow server in delegation = bad (need a healthcheck/load balancer for roundtrip banding)
* unicast (bad) vs anycast (good)
* delegation uptime matters
* hot potato routing
* backbones and routing: ospf/igp (link state / metrics) and bgp (distance vector)
* mix igp (ospf) with bgp. ospf floods routes, ibgp stacks adjacencies formed in ospf
* depend on igp route metrics
* data sync and monitoring (same ips in different spaces)
* two networks: anycast facing out towards users, unicast data replication in between the application
* enemy = complexity, avoid multi-level delegation, pick right ttls, put dns hardware close to userbase
* automated detection, manual changes to route around congestion in asia
* openbfd echo daemon on github
* ecmp: multiple links between routers, load balance at layer 3 level.
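To make "ttfdnsqr" concrete, here's a crude way to time name resolution from the client side. This goes through the stub resolver (so it includes cache and /etc/hosts effects); a real measurement would query the authoritative server directly. My own sketch, not from the talk.

```python
import socket
import time

def dns_response_time(hostname):
    """Rough time-to-first-DNS-query-response via the system resolver."""
    start = time.perf_counter()
    socket.getaddrinfo(hostname, None)   # triggers resolution (or cache hit)
    return time.perf_counter() - start

# 'localhost' usually resolves from /etc/hosts, so this measures the floor:
elapsed = dns_response_time("localhost")
```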


[scaling in the cloud at cost and sla]
* make solution as cloud provider agnostic as possible.
* SOA architecture, multi-datacenter, minimal cross-DC traffic
* config mgmt (puppet,chef,etc)
* monitoring (nagios, cacti, graphite, etc) w/ basic alerts (cpu, load, memory, qps, latency)
* service transport (http, thrift-rpc, native: memcache, redis, rabbitmq)
* load balancers (haproxy, varnish,nginx/apache) and ELB (AWS Specific)
* GSLB global server loadbalancing (between datacenters) route users to nearest DC
* DNS/BGP/Anycast/cookies/user-config
* instance sizing (go 64bit)
* tweak garbage collection and memory settings
* deploys (phased) w/ health-checks
* failure detection, retries, fuzzing, queuing
* tagged deploy units (easy rollbacks)
* keep expensive operations local to region/ datacenter
* nodejs “hiveway” caching proxy + triggered updates (look for BrightTag to open source soon?)
* do aggregate roll-ups at once (1 min, 5 min, 1hr, etc)
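The "do all roll-ups at once" point can be sketched as a single pass over raw samples that feeds every window size simultaneously, instead of re-reading the data per granularity (window names and the averaging policy are my own illustration):

```python
from collections import defaultdict

WINDOWS = {"1min": 60, "5min": 300, "1hr": 3600}   # window name -> seconds

def roll_up(samples):
    """One pass over (timestamp, value) pairs; fills every window at once."""
    sums = {w: defaultdict(float) for w in WINDOWS}
    counts = {w: defaultdict(int) for w in WINDOWS}
    for ts, value in samples:
        for name, width in WINDOWS.items():
            bucket = ts - ts % width          # start of the window for this sample
            sums[name][bucket] += value
            counts[name][bucket] += 1
    return {name: {b: sums[name][b] / counts[name][b] for b in sums[name]}
            for name in WINDOWS}

averages = roll_up([(0, 1.0), (30, 3.0), (90, 5.0)])
```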


[mysteries of a CDN Explained]
* find/send user to closest node (beat speed of light)
* ip geotarget / anycast dns
* cdn shield / middle tier between cdn origin and cdn pop
* check out ganglia2 / arista switches
* DSA (non caching cdn / http keepalives voodoo)
* fastly github version of apache
* sysctl values (2 to adjust – ref slides)
* short ttl = survive flash traffic
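The short-TTL trade-off above in miniature: a cache with a short expiry absorbs flash traffic but still picks up origin changes quickly. A toy sketch (the injectable clock is just to make it testable):

```python
import time

class TTLCache:
    """Tiny expiring cache: a miss means the caller refetches from origin."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl, self.clock, self.store = ttl_seconds, clock, {}

    def get(self, key):
        hit = self.store.get(key)
        if hit and self.clock() - hit[1] < self.ttl:
            return hit[0]
        return None                      # expired or missing

    def put(self, key, value):
        self.store[key] = (value, self.clock())
```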




[who needs clouds? ha in your datacenter]
* simple not easy, open not closed, logic not magic
* linux-ha (load balancer)
* IPVS IP Virtual Server
* litmus_paper (open source)
* big_brother
* pacemaker
* “thundering herd” and adding resources
* load balancing the load balancers
* OCF (open cluster framework)
* IPaddr2 script
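On the "thundering herd" point: when a resource comes back, every client retries at once. A common mitigation (my own addition, not from the talk's slides) is exponential backoff with full jitter so retries spread out:

```python
import random

def backoff_with_jitter(attempt, base=0.5, cap=30.0, rng=random.random):
    """Delay before retry `attempt` (0-based): uniform in [0, min(cap, base*2^n))."""
    return rng() * min(cap, base * (2 ** attempt))
```

With jitter, a thousand clients that all failed at the same instant reconnect over a spread-out window instead of hammering the recovering server simultaneously.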


[realtime web – dirt in production]
* find video on web about shouting at disk drives
* voxer PTT (push-to-talk) app
* DIRTy apps tend to have a human in the loop
* mobile devices at the edge – network transience causes connection state issues.
* illumOS (Joyent)
* Application Restarts = cascading latency bubbles
* if app can't keep up with tcp backlog, syn packets get dropped
* how to measure / monitor how close we are to the end of the tcp backlog queue?
* github dtrace / tcplistendrop.d
* slow disk i/o – cloud multi-tenancy (running backups, benchmarks) – no insight
* zfsslower.d tool
* dtrace scripts to drill down into particular areas
* heatmap tool
* identify latency outliers
* memory growth: whats in the heap? leak (and then where!)? app growth?
* libumem
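For the "what's in the heap?" question, CPython's `tracemalloc` gives a snapshot-diff workflow roughly analogous to the libumem approach mentioned above (my own sketch; the talk was about node/illumos):

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

leak = [list(range(1000)) for _ in range(100)]   # simulated memory growth

after = tracemalloc.take_snapshot()
# Biggest allocation delta between snapshots -> where the growth came from.
top = after.compare_to(before, "lineno")[0]
print(top.traceback, top.size_diff)
tracemalloc.stop()
```

Diffing snapshots taken minutes apart separates a genuine leak (one call site growing without bound) from normal application growth.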


[zero to 500k qps – scaling appnexus]
* servers + perl + zen = cloud
* auction-based ad-serving
* batch-based change processing
* simple http communication between processes
* packrat log streaming app
* netezza (tupac/icecube), hadoop/hbase (wutag/quest), graphite (real-time)
* netezza vs vertica vs hadoop
* keepalive dns over vendor loadbalancers
* maestro / api driven infrastructure / runbooks


[scaling facebook]
* site architecture:
– load balancer (assigns a webserver)
– webserver (hiphop assembles data)
– services / cache / databases on the backend
* scaling meta tips
– scale horizontally
– iterate quickly
– ‘gatekeeper’ to stage and rollout code / select what subset of users see
– instrument the world
– “claspin” tool – high-density heatmap viewer for large services, find needle in a haystack, drill down.
– “scuba” tool – in memory datastore – key/value pairs, slice across different sets
– rtwatch tool (realtime watch) – select specific data points, see connections, identify outliers
– ODS tool (operations data store) – site health / key metrics
* scale your tools
* take ownership
* automation: sophisticated systems / failures are common / automate carefully
* faster deployments: central server is bottleneck,  use bittorrent (opentracker)
* challenge: how to restart services quickly enough w/o impacting users
* distributed shell system to run commands on multiple systems at the same time
* fbar tool – facebook auto remediation (write ‘recipes’ or scripts to fix an issue) (think playbooks for NOC engineers)
* fbar uses API + plugins (monitor, config, hardware, etc)
* automation pitfalls: masks systemic problems, cascading failures, unknown actors, cultural fear
* culture wrs: keep teams small, work on most leveraged problem, move people around, constantly prioritize
* scale ops: invest in the people
* ops focus areas: availability, reliability, operation,  efficiency
* roles  to watch out for: the butler, the adversary,  the beautiful mind, the godfather
* fix more, whine less.
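The "distributed shell" idea above, in miniature: fan a command out to many hosts concurrently and collect results. This sketch substitutes a local `echo` for the real `ssh host cmd` so it stays runnable; host names and the command are illustrative.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_everywhere(hosts, cmd, timeout=10):
    """Run `cmd` on every host in parallel, return {host: output}."""
    def run(host):
        # Real version would exec ["ssh", host] + cmd; echo keeps this local.
        out = subprocess.run(["echo", f"{host}:", *cmd],
                             capture_output=True, text=True, timeout=timeout)
        return host, out.stdout.strip()
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return dict(pool.map(run, hosts))

results = run_everywhere(["web1", "web2"], ["uptime"])
```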


[changing architectural foundation w/ continuous deployment]
* arch needs to change with the business over time
* “boxed” software only gives you 1 chance to deploy
* CI not good for infrequent changes / hardware changes / etc
* “culture before tools”
* other people always around to help
* unit tests + functional tests + manual tests
* nagios + naglite2
* “super grep”  -> tail -f | grep
* “deployinator” on github
* graphs: ganglia / graphite
* overlay deployments / code changes with graph stats
* statsd (github)
* logster (github)
* “feature flags” allow deploy w/ subset of users to see/use
*  a/b testing for interface changes and to prove interest
* ci pattern: change in small steps.
* dark launch by config. iterate in prod while dark
* maintain old and new in parallel
* ramp up new arch, remove old
* minimize bug hours, trash the schedule, iterate on the tools.
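A minimal sketch of the "feature flags" pattern above: hash the user id so the same user consistently sees the same variant while you ramp a dark launch from 0% to 100% (flag names and percentages are illustrative, not Etsy's actual config):

```python
import hashlib

FLAGS = {"new_checkout": 10}   # feature -> percent of users enabled

def flag_enabled(feature, user_id):
    """Deterministic percentage rollout: same user, same answer every request."""
    pct = FLAGS.get(feature, 0)
    digest = hashlib.sha1(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < pct
```

Ramping is then just editing the config number, and rollback is setting it back to 0 with no code deploy.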


[monitoring and debugging nosql in production]
* blekko (new search engine)
* conventional monitoring can be too noisy
* monitoring tips: subject = most important info, and body needs enough info to log in
* ‘turds’ info in /tmp/ (e.g. output of ps/free/etc)
* be aware that problems clear up
* system hangs destroy evidence on reboot
* wrapper cron jobs
* roll up alerts (ex: 57 alerts that are the same = 1 email)
* “audit” the monitoring system
* and saturnalia db
* write your own oom killer and trigger before oom fires
* am i thrashing (swap monitoring)
* identify broken, move out of rotation, flag for follow-up
* automate with scripts common failure recovery tasks
* “beach certify” common tasks
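The "trigger before oom fires" and swap-monitoring points might look like this: watch MemAvailable and swap usage from /proc/meminfo and act (page, shed load, kill a chosen process) before the kernel OOM killer picks for you. Parsing is factored out so the policy is testable; the thresholds are illustrative assumptions.

```python
def meminfo_fields(text):
    """Parse /proc/meminfo-style 'Key: value kB' lines into {key: kB}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields[key] = int(rest.split()[0])
    return fields

def should_act(fields, min_available_kb=200_000, max_swap_used_frac=0.5):
    """True when memory is low or swap usage says we're thrashing."""
    swap_used = fields["SwapTotal"] - fields["SwapFree"]
    thrashing = fields["SwapTotal"] and swap_used / fields["SwapTotal"] > max_swap_used_frac
    return fields["MemAvailable"] < min_available_kb or thrashing

sample = "MemAvailable: 150000 kB\nSwapTotal: 1000000 kB\nSwapFree: 900000 kB"
assert should_act(meminfo_fields(sample))     # low available memory -> act early
```

In a real watchdog this would poll /proc/meminfo on a timer; acting on your own terms beats letting the kernel OOM killer destroy the evidence.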