SitePerformanceMonitoringTools

Revision as of 19:42, 6 September 2007

What (summary)

Instrumentation that provides a history of performance statistics for each part of the page load pipeline.

Why this is important

The responsiveness and performance of the site make a big difference in how many pages visitors view and how often they come back. A poorly performing site will also wear out our active members, causing some of them to leave.

DoneDone

Define good and acceptable times for pieces of the Performance pipe

  • Max cold-request to fully rendered time for normal pages (including especially front page)
    • 3 seconds is unacceptable
  • Max Load Time for special goodness (Batch Patrol ... etc)

Record a history of how long it takes to

  • Look up DNS for www.aboutus.org, images.aboutus.org, ... from different parts of the world
  • Set up a port 80 TCP connection with each of the squal boxes from different parts of the world
  • Load the frontpage without any client caching
  • Retrieve a memcache item from each combination of two squal boxes (one client, one memcached server)
  • Load the core css files
  • Load the core js files
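
A minimal sketch of how such a history could be recorded, written in Python (a language this page does not itself name). The `timed` helper, the `History` class, and the `dns:` probe names are illustrative assumptions; only the hostnames come from the list above. The lookup goes through whatever resolver the measuring machine is configured to use.

```python
import socket
import time
from collections import defaultdict


def timed(fn, *args, **kwargs):
    """Run fn and return (elapsed_seconds, result)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return time.perf_counter() - start, result


class History:
    """In-memory history of (timestamp, seconds) samples per probe name."""

    def __init__(self):
        self.samples = defaultdict(list)

    def record(self, name, seconds, when=None):
        stamp = time.time() if when is None else when
        self.samples[name].append((stamp, seconds))

    def worst(self, name):
        """Slowest sample recorded so far for one probe."""
        return max(seconds for _, seconds in self.samples[name])


def resolve(host):
    """Resolve a hostname through the locally configured resolver."""
    return socket.getaddrinfo(host, 80, proto=socket.IPPROTO_TCP)


def run_dns_probes(history, hosts=("www.aboutus.org", "images.aboutus.org")):
    """Time one DNS lookup per host and append it to the history."""
    for host in hosts:
        elapsed, _ = timed(resolve, host)
        history.record("dns:" + host, elapsed)
```

Calling `run_dns_probes(History())` on a schedule from each measuring location would build up the per-probe history; the other items in the list could be wrapped with `timed` in the same way.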

...

Performance Priorities

  1. View normal page
  2. View random page
  3. Edit click until available
  4. Save click until rendered
  5. Render invalidated frontpage

Instrumentation Steps

  1. End to end on each
  2. Deploy instrumentation boxes in various locations
  3. Determine and instrument the pieces
    • MediaWiki profiling
    • Raw database queries

Steps to DoneDone

  • Articulate the request pipeline
    • Identify each request in the pipeline
    • How to perform / retrieve data for each request
    • Define acceptable benchmarks for each request
  • Aggregate pipeline benchmarks
  • Push XML results to central HTTPS server
    • Remote location / benchmark results stored in database
  • Analyze pipeline benchmarks
    • Graph performance for each location
    • Detailed graph view on each request
    • XML output for monitoring integration
  • Integrate into monitoring
    • Dashboard to identify overall health
    • Notifications via email/paging when a critical problem arises
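
The "Push XML results to central HTTPS server" step could look roughly like the sketch below. The XML element names (`benchmarks`, `benchmark`), the `location` attribute, and the collector URL passed to `push_report` are all assumptions made for illustration; the page does not define a schema or an endpoint.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_report(location, results):
    """Serialize {benchmark name: seconds} into a small XML document (bytes)."""
    root = ET.Element("benchmarks", {"location": location})
    for name, seconds in results.items():
        item = ET.SubElement(root, "benchmark", {"name": name})
        item.text = f"{seconds:.4f}"
    return ET.tostring(root, encoding="utf-8")


def push_report(url, payload, timeout=10):
    """POST the XML payload to the central server and return the HTTP status.

    `url` is a hypothetical HTTPS collector endpoint supplied by the caller.
    """
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping serialization separate from the upload lets the same payload be written to disk and retried later if the central server is unreachable.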

Pipeline

1. DNS request - www.aboutus.org & images.aboutus.org

  • Local resolver / cache
Queries against the local resolver at the remote location provide little insight into the health of the www.aboutus.org site. If the record is not in the local resolver's cache (or its TTL has expired), the resolver works its way down from the DNS root servers to the authoritative servers. If the record is already cached, the resolver responds immediately. If the local resolver does not reply as expected, the issue likely lies with the remote location, or possibly with the authoritative name servers or somewhere in between.
  • Authoritative name server
 ns1.dnscloud.com
 ns2.dnscloud.com
Response time of the authoritative servers is critical. It can be measured from any location, though network latency and connectivity will be a factor.
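
One way to time an authoritative server directly, bypassing the local resolver cache, is to send it a raw DNS query over UDP. The sketch below uses only the Python standard library and builds a minimal A-record query by hand (RFC 1035 wire format). The function names, the fixed query id, and the 3-second timeout are assumptions for illustration.

```python
import socket
import struct
import time


def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: id, flags (0 = standard query, recursion not requested),
    # 1 question, 0 answer / authority / additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0000, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in hostname.rstrip(".").split(".")
    )
    # The name ends with a zero-length label; QTYPE 1 = A, QCLASS 1 = IN.
    return header + labels + b"\x00" + struct.pack("!HH", 1, 1)


def query_seconds(server, hostname, timeout=3.0):
    """Seconds between sending the query and receiving the server's reply."""
    packet = build_query(hostname)
    # Resolve the name server's own address first so it is not timed.
    address = (socket.gethostbyname(server), 53)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        sock.sendto(packet, address)
        reply, _ = sock.recvfrom(512)
        elapsed = time.perf_counter() - start
    if reply[:2] != packet[:2]:
        raise ValueError("reply id does not match query id")
    return elapsed
```

For example, `query_seconds("ns1.dnscloud.com", "www.aboutus.org")` would time one round trip to the first server listed above. A timeout raises an exception instead of returning a number, so the caller can record it as a failure.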

1. IP connectivity [R]

Network connectivity and latency can be measured using the ping and traceroute utilities. Most connectivity issues will likely be caused by network problems between the two locations, over which we have no control. In some cases the cause could be a router, switch, or load balancer on the AboutUs side, but such problems will affect all remote locations.
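
Where ICMP ping is unavailable to a script (raw sockets usually need elevated privileges), the time to complete a TCP handshake on port 80 is a rough stand-in for round-trip latency. This is a sketch of that substitute technique, not of ping itself; the function names and default values are assumptions.

```python
import socket
import statistics
import time


def tcp_connect_seconds(host, port=80, timeout=5.0):
    """Seconds taken to open a TCP connection to host:port.

    If `host` is a name rather than an IP address, DNS resolution time
    is included in the measurement.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        return time.perf_counter() - start


def summarize(samples):
    """Reduce a list of latency samples to min / median / max."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

Taking several samples per run and storing the summary smooths out one-off spikes that originate on the measuring side.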

2. HTTP request - /index.php

Response time of a single /index.php GET request is critical. It depends on a number of factors:
  • Physical server load
    • CPU
    • Disk I/O
    • Available memory - too little memory causes swapping, which degrades disk I/O performance
    • Network throughput
  • Apache process performance
    • CPU usage
    • Available threads
  • Memcached
  • Database query
    • physical DB (slave) server loads
    • replication
    • MySQL performance
  • DNS request - images.aboutus.org
  • Image GET requests
    • image size
    • number of images per page
    • NFS server load (disk I/O)
    • network throughput
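
An end-to-end timing of the /index.php request could be sketched as below. It measures from the outside only and cannot attribute time to the individual factors listed above; those need server-side instrumentation. The `fetch_timings` name, the returned keys, and the cache-bypassing request headers are assumptions, and intermediate caches are free to ignore such headers.

```python
import http.client
import time


def fetch_timings(host, path="/index.php", port=80, timeout=10.0):
    """Time a single GET request without client-side caching."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        start = time.perf_counter()
        # request() opens the connection, so connect time is included.
        conn.request(
            "GET",
            path,
            headers={"Cache-Control": "no-cache", "Pragma": "no-cache"},
        )
        response = conn.getresponse()  # returns once the headers are read
        first_byte = time.perf_counter() - start
        body = response.read()
        total = time.perf_counter() - start
        return {
            "status": response.status,
            "first_byte": first_byte,
            "total": total,
            "bytes": len(body),
        }
    finally:
        conn.close()
```

This fetches the HTML document only. Images, CSS, and JS referenced by the page would each need their own request to approximate a fully rendered load.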

3. HTTP login

  • HTTPS request

Potential Hurdles

  • False positives
  • Caching

