High Value Transaction Processing
Mark Callaghan

What do I mean by value?
▪ Low price?
▪ High price/performance?
▪ Valuable data

OLTP in the datacenter
▪ Sharding
▪ Availability
▪ Legacy applications
▪ Used by many applications
  • 4. Sharding ▪ Sharding is easy, resharding is hard ▪ Joins within a shard are still frequent and useful ▪ Some all-shards joins must use Hive ▪ Provides some fault-isolation benefits
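One way to see why resharding is harder than sharding is to count how many rows have to move when the shard count changes. The sketch below assumes a simple modulo placement rule; `shard_for` and `fraction_moved` are hypothetical helpers for illustration, not something described in the talk.

```python
def shard_for(user_id: int, num_shards: int) -> int:
    """Hypothetical placement rule: shard = id mod shard count."""
    return user_id % num_shards


def fraction_moved(ids, old_shards: int, new_shards: int) -> float:
    """Fraction of ids whose shard changes when the shard count changes."""
    ids = list(ids)
    moved = sum(
        1 for i in ids if shard_for(i, old_shards) != shard_for(i, new_shards)
    )
    return moved / len(ids)


ids = range(1000)
moved_4_to_5 = fraction_moved(ids, 4, 5)  # add one shard
moved_4_to_8 = fraction_moved(ids, 4, 8)  # double the shard count
print(moved_4_to_5, moved_4_to_8)
```

Under this assumed rule most ids change shard when a single shard is added, which is one reason the initial split is the easy part.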
  • 5. Availability ▪ Sources of downtime ▪ Schema change (but now we have OSC) ▪ Manual failover ▪ Misbehaving applications ▪ Oops
  • 6. Used by many applications If your company is successful then ▪ Your database will be accessed by many different applications ▪ Application authors might not be MySQL experts ▪ Application owners might have different priorities than the DB team
  • 7. Legacy applications If your company is successful then you will have ▪ Applications written many years ago by people who are gone ▪ Design decisions that are not good for your current size ▪ Not enough resources or time to rewrite applications
  • 8. Our busy OLTP deployment ▪ Query response time ▪ 4 ms reads, 5 ms writes ▪ Network bytes per second ▪ 38 GB peak ▪ Queries per second ▪ 13M peak ▪ Rows read per second ▪ 450M peak ▪ Rows changed per second ▪ 3.5M peak ▪ InnoDB page IO per second ▪ 5.2M peak
  • 9. Recent improvements ▪ Joint work by Facebook, Percona and Oracle/MySQL ▪ Prevent InnoDB stalls ▪ Stalls from caches ▪ Stalls from mutexes ▪ IO efficiency ▪ Improve monitoring ▪ Improve XtraBackup
  • 10. How do you measure performance? ▪ Response time variance leads to bad user experiences ▪ Optimizations that defer work must handle steady-state loads ▪ When designing a server the choices are: ▪ No concurrency (and no mutexes) ▪ One mutex ▪ More than one mutex
  • 11. This has good average performance
  • 12. Which metric matters?
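A small sketch of why the average can hide stalls. The latency sample is made up for illustration (it is not data from the talk), and `percentile` is a hypothetical nearest-rank helper.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile (pct is an integer 1..100) of a non-empty list."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


# Made-up sample: 98 fast requests and 2 requests that hit a stall.
latencies_ms = [4] * 98 + [500] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms} ms  p50={p50_ms} ms  p99={p99_ms} ms")
```

In this sample the mean stays close to the fast case while the 99th percentile exposes the stall, which is the variance that users notice.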
  • 13. Stalls from caches Caches that defer expensive operations must eventually complete them at the same rate at which they are deferred. ▪ InnoDB purge ▪ InnoDB insert buffer ▪ Async writes are not async ▪ Fuzzy checkpoint constraint enforcement
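The rate argument above can be sketched with a toy model of a bounded cache of deferred work. This is an assumption-laden illustration, not InnoDB code: `simulate_backlog` and all of its parameters are hypothetical, and the numbers are arbitrary.

```python
def simulate_backlog(arrivals_per_tick, drain_per_tick, capacity, ticks):
    """Toy model of a cache that defers work.

    Returns (final_backlog, stalled_ops), where stalled_ops counts operations
    that could not be deferred because the cache was already full.
    """
    backlog = 0
    stalled = 0
    for _ in range(ticks):
        backlog += arrivals_per_tick             # foreground defers work
        backlog -= min(backlog, drain_per_tick)  # background completes some of it
        if backlog > capacity:
            stalled += backlog - capacity        # cache full: work is no longer deferred
            backlog = capacity
    return backlog, stalled


# Drain keeps up with arrivals: nothing accumulates.
balanced = simulate_backlog(100, 100, capacity=1000, ticks=50)
# Drain runs at 5% of the arrival rate: the cache fills, then foreground work stalls.
too_slow = simulate_backlog(100, 5, capacity=1000, ticks=50)
print(balanced, too_slow)
```

Once the cache is full, the deferral no longer helps and every further operation pays the full cost in the foreground.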
  • 14. InnoDB purge stalls ▪ InnoDB purge removes delete-marked rows ▪ Done by the main background thread in 5.1 plugin ▪ Optionally done by a separate thread in 5.5 ▪ Purge is single-threaded and might be stalled by disk reads ▪ The further it gets behind, the more likely it is that it won’t catch up ▪ Multiple purge threads are needed: the main background thread can become a dedicated purge thread, and that isn’t enough ▪ do { n_pages_purged = trx_purge(); } while (n_pages_purged);
  • 15. InnoDB insert buffer stalls ▪ The insert buffer is not drained as fast as it can get full ▪ Drain rate is 5% of innodb_io_capacity ▪ Fixed in the Facebook patch and XtraDB ▪ Patch pending for MySQL 5.5
  • 16. Performance drops when ibuf is full
  • 17. Otherwise, the insert buffer is awesome
  • 18. Fuzzy checkpoint constraint ▪ TotalLogSize = #log_files X innodb_log_file_size ▪ AsyncLimit = 0.70 X TotalLogSize ▪ SyncLimit = 0.75 X TotalLogSize ▪ OldestDirtyLSN is the smallest oldest_modification LSN of all dirty pages in the buffer pool ▪ Age = CurrentLSN – OldestDirtyLSN Fuzzy Checkpoint Constraint ▪ If Age > SyncLimit then flush_dirty_pages_synch() ▪ Else if Age > AsyncLimit then flush_dirty_pages_async()
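The constraint above can be written out as a short sketch. The 0.70 and 0.75 factors and the age formula come from the slide; the function name `checkpoint_action`, its return values, and the example log sizes are assumptions for illustration and do not mirror InnoDB's internals.

```python
def checkpoint_action(current_lsn, oldest_dirty_lsn, num_log_files, log_file_size):
    """Decide which flush the fuzzy checkpoint constraint would trigger."""
    total_log_size = num_log_files * log_file_size
    async_limit = 0.70 * total_log_size
    sync_limit = 0.75 * total_log_size
    age = current_lsn - oldest_dirty_lsn
    if age > sync_limit:
        return "sync"    # stands in for flush_dirty_pages_synch()
    if age > async_limit:
        return "async"   # stands in for flush_dirty_pages_async()
    return "none"


# Hypothetical configuration: 2 log files of 256 MiB each.
LOG_FILES = 2
LOG_FILE_SIZE = 256 * 1024 * 1024
NOW = 1_000_000_000  # arbitrary current LSN

for age in (100, 380_000_000, 410_000_000):
    print(age, checkpoint_action(NOW, NOW - age, LOG_FILES, LOG_FILE_SIZE))
```

With these assumed sizes the async limit is about 376 MB of log and the sync limit about 403 MB, so the window between "flush in the background" and "flush now" is narrow.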
  • 19. Async page writes are not async ▪ Async page write requests submitted per fuzzy checkpoint constraint are not async ▪ User transactions may do this via log_preflush_pool_modified_pages ▪ Caller does large write for doublewrite buffer ▪ Caller then submits in-place write requests for background write threads ▪ Caller then waits for background write threads to finish ▪ Fixed in the Facebook patch
  • 20. Fuzzy checkpoint constraint enforcement Prior to InnoDB plugin 5.1.38, page writes done to enforce the fuzzy checkpoint constraint were not submitted by the main background thread. ▪ The InnoDB plugin added innodb_adaptive_flushing in 5.1.38 ▪ Percona added innodb_adaptive_checkpoint ▪ Facebook patch added innodb_background_checkpoint
  • 21. Sysbench QPS at 20 second intervals with checkpoint stalls
  • 22. Stalls from mutexes ▪ Extending InnoDB files ▪ Opening InnoDB tables ▪ Purge/undo lock conflicts ▪ TRUNCATE table and LOCK_open ▪ DROP table and LOCK_open ▪ Buffer pool invalidate ▪ LOCK_open and kernel_mutex ▪ Excessive calls to fcntl ▪ Deadlock detection overhead ▪ innodb_thread_concurrency
  • 23. Stalls from extending InnoDB files ▪ When an InnoDB table is extended, a global mutex is held while the writes that extend the file are done ▪ All reads on the file are blocked until the writes are done ▪ To be fixed real soon in the Facebook patch
  • 24. Stalls from opening InnoDB tables ▪ Opening table handler instances is serialized on LOCK_open. Index cardinality stats might then be computed using random reads ▪ Fixed in the Facebook patch and MySQL 5.5 ▪ When stats are recomputed many uses of that table will stall ▪ Fixed in the Facebook patch ▪ Index stats could be recomputed too frequently ▪ Fixed in the Facebook patch, MySQL 5.1 and MySQL 5.5
  • 25. Stalls from purge/undo lock conflicts ▪ Purge and undo are not concurrent on the same InnoDB table ▪ Purge gets a share lock on the table ▪ Undo gets an exclusive lock on the table ▪ REPLACE statements that use insert-then-undo can generate undo ▪ Fixed in MySQL 5.1.55 and MySQL 5.5
  • 26. TRUNCATE table and LOCK_open ▪ LOCK_open is held when the truncate is done by InnoDB ▪ When file-per-table is used the file must be removed and that can take too long ▪ The InnoDB buffer pool LRU must be scanned ▪ New queries cannot be started ▪ Fixed in MySQL 5.5 courtesy of meta-data locking
  • 27. DROP table and LOCK_open ▪ LOCK_open is held when the drop is done by InnoDB ▪ When file-per-table is used the file must be removed and that can take too long ▪ The InnoDB buffer pool LRU must be scanned ▪ New queries cannot be started ▪ Fixed in the Facebook patch ▪ Do most InnoDB processing in the background drop queue ▪ Fixed in MySQL 5.5 courtesy of meta-data locking
  • 28. TRUNCATE/DROP table and invalidate ▪ Pages for table removed from buffer pool and adaptive hash ▪ InnoDB buffer pool mutex locked while the LRU is scanned ▪ This is slow with a large buffer pool ▪ Most threads in InnoDB will block waiting for the buffer pool mutex ▪ I hope Yasufumi can fix it
  • 29. LOCK_open and kernel_mutex conflicts ▪ Thread A ▪ Gather table statistics while holding LOCK_open ▪ Block on kernel_mutex while starting a transaction ▪ Thread B ▪ Hold kernel_mutex while doing deadlock detection ▪ All other threads block on LOCK_open or kernel_mutex ▪ Fixed in MySQL 5.5
  • 30. Stalls from excessive calls to fcntl ▪ My Linux kernels get the big kernel lock on fcntl calls ▪ MySQL called fcntl too often ▪ Doubled peak QPS by hacking MySQL to call fcntl less ▪ Almost 200,000 QPS without using HandlerSocket ▪ Fixed in Facebook patch, then reverted because it broke SSL tests ▪ Not sure where or when this will be fixed
  • 31. Sysbench read-only with fcntl fix
  • 32. Stalls from deadlock detection overhead ▪ InnoDB deadlock detection was very inefficient. The worst case was when all threads waited on the same row lock. ▪ Added option to disable it in the Facebook patch and rely on lock wait timeout ▪ MySQL made it more efficient in MySQL 5.1
  • 33. Stalls from innodb_thread_concurrency ▪ When there are 1000+ sleeping threads it can take too long to wake up a specific thread ▪ Change innodb_thread_concurrency to use FIFO scheduling in addition to the existing use of LIFO: FIFO + LIFO = FLIFO ▪ Fixed in the Facebook patch
  • 34. Sysbench TPS with FLIFO
  • 35. IO efficiency High priority problems for me are: ▪ Reducing IOPs used for my workload ▪ Supporting very large databases Significant improvements: ▪ Switch from mysqldump to XtraBackup ▪ Run innosim to confirm storage performance ▪ Tune InnoDB ▪ Improve schemas and queries
  • 36. mysqldump vs XtraBackup ▪ mysqldump is slower for backup ▪ Clustered index is scanned row-at-a-time in key order (lots of random reads) ▪ Backup accounts for half of the disk reads for servers I watch ▪ Single-table restore is easy with mysqldump ▪ Possible with XtraBackup thanks to work by Vamsi from Facebook ▪ Incremental backup ▪ Not possible with mysqldump ▪ XtraBackup has incremental (scan all data, write only the changed blocks) ▪ Vamsi from Facebook added support for really incremental, scan & write only the changed blocks
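The "scan all data, write only the changed blocks" style of incremental backup can be sketched as follows. This assumes that each page records the LSN of its last modification and that "changed" means an LSN newer than the previous backup; the dictionary layout and the `changed_pages` helper are hypothetical, not XtraBackup's actual code.

```python
def changed_pages(pages, last_backup_lsn):
    """Scan every page, but keep only those modified after the last backup.

    pages maps page_no -> (page_lsn, data).
    """
    return {
        page_no: data
        for page_no, (page_lsn, data) in pages.items()
        if page_lsn > last_backup_lsn
    }


# Hypothetical tablespace with four pages.
pages = {
    0: (100, b"a"),
    1: (250, b"b"),
    2: (90, b"c"),
    3: (300, b"d"),
}
delta = changed_pages(pages, last_backup_lsn=200)
print(sorted(delta))
```

Every page is still read here, so the read cost matches a full scan. The "really incremental" variant mentioned on the slide also avoids the scan, which this sketch does not attempt.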
  • 37. innosim storage benchmark ▪ InnoDB IO simulator that models ▪ Doublewrite buffer ▪ Dirty page writes ▪ Transaction log and binlog fsync and IO ▪ User transactions that do read, write and commit ▪ Search for “facebook innosim” ▪ Source code on launchpad
  • 38. Tune InnoDB ▪ It is not easy to support many concurrent disk reads ▪ innodb_thread_concurrency tickets are not released when waiting for a read ▪ If innodb_thread_concurrency is too high then writers suffer ▪ If innodb_thread_concurrency is too low then readers suffer ▪ Smaller pages are better for some but not all tables ▪ A large log file can reduce the dirty page flush rate ▪ A large buffer pool can reduce the page read rate
  • 39. IOPs is a function of size and concurrency
  • 40. Smaller pages aren’t always better
  • 41. Checkpoint IO rate by log file size
  • 42. Page read rate by buffer pool size
  • 43. Improve schemas ▪ Make your performance critical queries index only ▪ Primary key columns are included in the secondary index ▪ Understand how the insert buffer makes index maintenance cheaper ▪ Figure out how to do schema changes with minimal downtime ▪ We used the Online Schema Change tool (thanks Vamsi) ▪ You can also do the schema change on a slave first and then promote it
  • 44. Monitoring ▪ Per table, index, account via information_schema tables ▪ Efficient and always enabled ▪ Easy to use ▪ Enhanced slow query log ▪ Facebook patch added options to do sampling for the slow query log ▪ Sample from all queries and from all queries that have an error ▪ Error is limited to errno, error text must wait for 5.5 plugin ▪ Aggregate by query text and URL from query comment
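A minimal sketch of sampling "from all queries and from all queries that have an error". The class name `SlowLogSampler`, the counter-based scheme, and both rate parameters are assumptions for illustration; the talk does not describe how the Facebook patch implements its sampling options.

```python
class SlowLogSampler:
    """Log 1 in query_every of all queries and 1 in error_every of failed ones."""

    def __init__(self, query_every, error_every):
        self.query_every = query_every
        self.error_every = error_every
        self._queries = 0
        self._errors = 0

    def should_log(self, errno):
        """errno is 0 on success and nonzero on failure."""
        self._queries += 1
        log = self._queries % self.query_every == 0
        if errno != 0:
            self._errors += 1
            log = log or self._errors % self.error_every == 0
        return log


sampler = SlowLogSampler(query_every=100, error_every=2)
logged_ok = sum(sampler.should_log(0) for _ in range(1000))
logged_err = sum(sampler.should_log(1) for _ in range(10))
print(logged_ok, logged_err)
```

Sampling errors at a higher rate than ordinary queries keeps rare failures visible without logging every statement.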
  • 45. Open Problems ▪ Parallel replication apply ▪ Support max concurrent queries ▪ Automate slave failover when a master fails ▪ Use InnoDB compression for OLTP ▪ Multi-master replication with conflict resolution
  • 46. Parallel replication apply ▪ Replication apply is single-threaded. This causes lag on IO-bound slaves even when SQL is simple ▪ mk-slave-prefetch can help but something better is needed ▪ Is a thread running BEGIN; replay-slave-sql; ROLLBACK better? ▪ I want: ▪ N replay queues ▪ Binlog events (SBR or RBR) hashed to queues by database names ▪ Each queue replayed in parallel
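The wish list above (N replay queues, events hashed to queues by database name) can be sketched like this. Events are modeled as hypothetical `(database, statement)` tuples, and `zlib.crc32` is used only as a convenient stable hash; neither detail comes from the talk.

```python
import zlib


def queue_for(database: str, num_queues: int) -> int:
    """Stable hash so that all events for one database land in one queue."""
    return zlib.crc32(database.encode("utf-8")) % num_queues


def partition_events(events, num_queues):
    """Split binlog-like events into per-queue lists, preserving event order."""
    queues = [[] for _ in range(num_queues)]
    for database, statement in events:
        queues[queue_for(database, num_queues)].append((database, statement))
    return queues


events = [
    ("db1", "UPDATE t SET a = 1"),
    ("db2", "INSERT INTO t VALUES (1)"),
    ("db1", "DELETE FROM t WHERE a = 1"),
    ("db3", "INSERT INTO t VALUES (2)"),
    ("db2", "UPDATE t SET b = 2"),
]
queues = partition_events(events, 4)
for number, queue in enumerate(queues):
    print(number, queue)
```

Because each database maps to exactly one queue, events for a database are replayed in their original order, while different queues could be applied by separate threads. Transactions that span databases would need extra handling that this sketch ignores.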
  • 47. Max concurrent queries ▪ Use large values for max concurrent connections per account ▪ Enforce smaller values for max concurrent queries ▪ We have begun testing an implementation. ▪ Enforce at statement entry ▪ Account for threads that block (row lock, disk IO, network IO)
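Enforcing a per-account limit at statement entry could look roughly like the sketch below. `QueryLimiter` and its methods are hypothetical names, and the sketch deliberately leaves out the accounting for threads that block on row locks, disk IO, or network IO.

```python
import threading


class QueryLimiter:
    """Per-account cap on the number of statements running at once."""

    def __init__(self, max_running):
        self.max_running = max_running
        self._running = {}
        self._lock = threading.Lock()

    def try_enter(self, account):
        """Call at statement entry; returns False if the account is at its cap."""
        with self._lock:
            running = self._running.get(account, 0)
            if running >= self.max_running:
                return False
            self._running[account] = running + 1
            return True

    def leave(self, account):
        """Call when a statement that was admitted finishes."""
        with self._lock:
            self._running[account] -= 1


limiter = QueryLimiter(max_running=2)
admitted = [limiter.try_enter("app") for _ in range(3)]
print(admitted)
```

The connection limit can stay large because idle connections cost nothing here; only statements that are actually running count against the cap.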
  • 48. Automate slave failover ▪ Global transaction IDs from the Google patch are awesome ▪ But I don’t have the skills to port or support it ▪ A unique ID per binlog group or event might be sufficient ▪ Add an attribute to binlog event metadata ▪ Preserve on the slave similar to server ID
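To illustrate how a unique ID per binlog event could help failover, the sketch below finds where a slave should resume on a newly promoted master. The binlog is modeled as a hypothetical list of `(event_id, payload)` tuples and `resume_index` is an invented helper; real binlog positions and event formats are not modeled.

```python
def resume_index(new_master_binlog, last_applied_id):
    """Index of the first event the slave has not applied, or None if unknown.

    new_master_binlog is a list of (event_id, payload) in commit order.
    """
    for index, (event_id, _payload) in enumerate(new_master_binlog):
        if event_id == last_applied_id:
            return index + 1
    return None  # the ID is not in this binlog, so a manual decision is needed


# Hypothetical binlog on the promoted master.
binlog = [("e1", "..."), ("e2", "..."), ("e3", "..."), ("e4", "...")]
start = resume_index(binlog, "e2")
print(binlog[start:])
```

Because the ID is preserved when the event is re-logged on each slave, the same lookup works on any server that has applied the event, which file-and-offset positions do not allow.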
  • 49. InnoDB compression for OLTP ▪ Change InnoDB to not log page images for compressed pages ▪ Logging them increases the log IO rate ▪ Increasing the log IO rate then increases the checkpoint IO rate ▪ Change InnoDB to use QuickLZ instead of zlib for compression ▪ Add an option to limit compression to the PK index ▪ Add per-table compression statistics
  • 50. MySQL in the datacenter ▪ Previously dominated the market ▪ Now it must learn to share ▪ PostgreSQL continues to improve for OLTP ▪ HBase, Cassandra, MongoDB are getting transactions today
  • 51. Why NoSQL ▪ Do less, but do it better ▪ Some offer write-optimized data stores ▪ Some don’t require sharding ▪ Interesting HA models ▪ Cassandra doesn’t have the notion of failover ▪ HBase doesn’t require failover when a server dies ▪ Healthy development communities improve code quickly
  • 52. What comes next ▪ Batch extraction is not the answer for MySQL/NoSQL integration ▪ NoSQL deployments will be reminded that ▪ Some of your problems are independent of technology ▪ You need better monitoring ▪ There is downtime when you need to modify the clustered index ▪ Database ops is hard with legacy apps and multi-user deployments ▪ In a few years someone will document the many stalls in HBase
  • 53. The End Thank you
