
A/B Testing – Design, Analysis and Pitfalls

Transcript
  • 1. A/B Testing – Design, Analysis and Pitfalls slavabo@gmail.com
  • 2. Business Package Test
  • 3. Additional Advertising Test
  • 4. Agenda • Design the Experiment ▫ 2 main questions – how many users and how long to run the test ▫ Define a reasonable number of KPIs ▫ Pay attention to seasonality/weekday effects • Analyze the Experiment ▫ Statistical methods for checking significance ▫ Non-parametric methods ▫ Outliers/bots/fraud • Data-driven culture • Pitfalls • Open Discussion
  • 5. Design • Test Duration & Sample Size ▫ Duration needs to be defined before the experiment is started! ▫ Depends on the distribution of the main KPIs ▪ 80% have a Binomial distribution (Conversion Rate, CTR, etc.) + the CLT can help ▪ 20% are something else (count events, revenue) ▪ Power calculations for defining N (sample size) and t (duration), OR use rules of thumb ▪ General rule – the smaller the difference you want to detect, the more data you'll need to collect
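A minimal sketch of the power calculation mentioned above, assuming a binomial KPI such as conversion rate and a standard two-sided, two-proportion test; the baseline rate, minimum detectable effect, significance level and power below are placeholder values, not numbers from the talk.

```python
import math
from scipy.stats import norm

def sample_size_per_group(p_base, mde, alpha=0.05, power=0.80):
    """Users needed in each group to detect an absolute lift `mde`
    over a baseline rate `p_base` with a two-sided test."""
    p_alt = p_base + mde
    z_alpha = norm.ppf(1 - alpha / 2)    # critical value for the significance level
    z_power = norm.ppf(power)            # critical value for the desired power
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# Placeholder example: 5% baseline conversion, detect a 0.5 pp absolute lift
print(sample_size_per_group(p_base=0.05, mde=0.005))
```

The slide's general rule is visible directly in the formula: halving the detectable difference roughly quadruples the required sample size.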
  • 6. Design ▫ Example – # of searches per user (SweetIM) ▪ Poisson assumption for count events ▪ Not appropriate when variance >> mean ▪ NB (Negative Binomial) was found appropriate ▪ Power limitations of NB
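A quick way to check the "variance >> mean" condition above is the variance-to-mean (dispersion) ratio of the per-user counts; the data below is simulated with the NB parameters quoted on the next slide (α = 0.31, μ = 0.69), purely for illustration.

```python
import numpy as np
from scipy.stats import nbinom

def dispersion_ratio(counts):
    """Variance-to-mean ratio of a count KPI: values near 1 are consistent
    with Poisson, values well above 1 indicate overdispersion and suggest
    a Negative Binomial model instead."""
    counts = np.asarray(counts)
    return counts.var(ddof=1) / counts.mean()

# Simulated searches-per-user counts; p = alpha / (alpha + mu) for scipy's nbinom
searches = nbinom.rvs(0.31, 0.31 / (0.31 + 0.69), size=10_000, random_state=0)
print(dispersion_ratio(searches))   # prints a ratio well above 1
```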
  • 7. Statistical Power and Sensitivity
    Sample size as a function of sensitivity (rows) and statistical power (columns); Negative Binomial parameter α = 0.31, average μ = 0.69, length of the test t = 30.

    Sensitivity | 50% | 55% | 60% | 65% | 70% | 75% | 80% | 85% | 90% | 95%
    0.5%  | 1,001,243 | 1,133,747 | 1,276,816 | 1,433,622 | 1,608,696 | 1,808,942 | 2,045,743 | 2,340,142 | 2,738,670 | 3,386,960
    1.0%  | 251,556 | 284,847 | 320,792 | 360,188 | 404,175 | 454,485 | 513,980 | 587,946 | 688,074 | 850,953
    1.5%  | 112,359 | 127,228 | 143,284 | 160,880 | 180,527 | 202,998 | 229,572 | 262,609 | 307,332 | 380,083
    2.0%  | 63,516 | 71,922 | 80,998 | 90,945 | 102,052 | 114,755 | 129,777 | 148,453 | 173,734 | 214,860
    2.5%  | 40,853 | 46,259 | 52,097 | 58,494 | 65,638 | 73,808 | 83,470 | 95,482 | 111,743 | 138,194
    3.0%  | 28,511 | 32,284 | 36,358 | 40,823 | 45,809 | 51,511 | 58,254 | 66,637 | 77,985 | 96,446
    3.5%  | 21,051 | 23,837 | 26,845 | 30,142 | 33,823 | 38,033 | 43,011 | 49,201 | 57,580 | 71,210
    4.0%  | 16,197 | 18,341 | 20,655 | 23,192 | 26,024 | 29,264 | 33,094 | 37,857 | 44,304 | 54,791
    4.5%  | 12,861 | 14,564 | 16,401 | 18,416 | 20,665 | 23,237 | 26,279 | 30,060 | 35,180 | 43,507
    5.0%  | 10,470 | 11,855 | 13,351 | 14,991 | 16,821 | 18,915 | 21,391 | 24,470 | 28,637 | 35,416
    5.5%  | 8,696 | 9,846 | 11,089 | 12,451 | 13,971 | 15,710 | 17,767 | 20,324 | 23,785 | 29,415
    6.0%  | 7,343 | 8,315 | 9,364 | 10,514 | 11,798 | 13,267 | 15,003 | 17,162 | 20,085 | 24,839
    6.5%  | 6,288 | 7,120 | 8,018 | 9,003 | 10,103 | 11,360 | 12,847 | 14,696 | 17,199 | 21,270
    7.0%  | 5,449 | 6,170 | 6,948 | 7,801 | 8,754 | 9,844 | 11,132 | 12,735 | 14,903 | 18,431
    7.5%  | 4,770 | 5,401 | 6,083 | 6,830 | 7,664 | 8,618 | 9,746 | 11,148 | 13,047 | 16,135
    8.0%  | 4,213 | 4,771 | 5,373 | 6,032 | 6,769 | 7,612 | 8,608 | 9,847 | 11,524 | 14,252
    8.5%  | 3,750 | 4,247 | 4,783 | 5,370 | 6,026 | 6,776 | 7,663 | 8,766 | 10,259 | 12,687
    9.0%  | 3,362 | 3,807 | 4,287 | 4,814 | 5,402 | 6,074 | 6,869 | 7,858 | 9,196 | 11,373
    9.5%  | 3,032 | 3,434 | 3,867 | 4,342 | 4,872 | 5,478 | 6,196 | 7,087 | 8,294 | 10,258
    10.0% | 2,750 | 3,114 | 3,507 | 3,938 | 4,419 | 4,969 | 5,619 | 6,428 | 7,523 | 9,303
  • 8. Design • Define a Reasonable Number of KPIs ▫ It's impossible to draw conclusions from 20 KPIs • Project your KPIs onto the main business (lead) indicators • Consider weighted KPIs or a GPI (General Performance Indicator) • Seasonality ▫ Weekends may show different user behavior than weekdays ▫ Holidays can be unpredictable • 7-day rule of thumb
  • 9. Analysis • Statistical Parametric Methods • Non-Parametric Methods • Permutation Tests • Outliers/Bots/Fraud
  • 10. Analysis • Statistical Parametric Methods ▫ Use confidence intervals based on the KPI distribution ▫ T-test, Chi-square test, etc. will work, but… ▪ The t-test assumes a normal distribution of the test statistic ▪ Chi-square can be weak when low frequencies are observed ▫ Try hypothesis testing based on the KPI distribution – it's not simple but worth it
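For the binomial KPIs mentioned earlier (conversion rate, CTR), the standard parametric route is a chi-square test on the 2x2 contingency table or the equivalent two-proportion z-test; this is a generic sketch with placeholder counts, not code from the talk.

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Placeholder results: (conversions, non-conversions) for groups A and B
table = np.array([[480, 9520],
                  [540, 9460]])

# Chi-square test on the contingency table (weak when cell counts are low)
chi2, p_chi2, dof, expected = chi2_contingency(table)
print(f"chi-square p-value: {p_chi2:.4f}")

# Equivalent two-proportion z-test with a pooled variance estimate
conv, n = table[:, 0], table.sum(axis=1)
p_pool = conv.sum() / n.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = (conv[0] / n[0] - conv[1] / n[1]) / se
print(f"z-test p-value:     {2 * norm.sf(abs(z)):.4f}")
```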
  • 11. • Can be used as a generalization of Poisson in overdispersed cases (Var >> Mean). • Has been used before in other domains to analyze count data (genetics, traffic modeling). • Fits the real data well. [Figure: distribution of the number of searches per user – real data vs. fitted Negative Binomial vs. fitted Poisson]
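The kind of fit behind that figure can be reproduced with a small maximum-likelihood routine; scipy has no built-in fit for discrete distributions, so the sketch below (using simulated counts as placeholders) optimizes the Negative Binomial log-likelihood directly and compares it with the Poisson fit.

```python
import numpy as np
from scipy import optimize, stats

def fit_negative_binomial(counts):
    """Maximum-likelihood estimate of scipy's (r, p) NB parametrization."""
    def neg_log_lik(params):
        r, p = params
        return -stats.nbinom.logpmf(counts, r, p).sum()
    res = optimize.minimize(neg_log_lik, x0=[1.0, 0.5],
                            bounds=[(1e-6, None), (1e-6, 1 - 1e-6)])
    return res.x

# Simulated searches-per-user counts (illustration only)
counts = stats.nbinom.rvs(0.31, 0.31, size=5_000, random_state=1)

mu = counts.mean()                      # Poisson MLE is just the sample mean
r, p = fit_negative_binomial(counts)

# The model with the higher log-likelihood tracks the data more closely
print("Poisson log-likelihood:", stats.poisson.logpmf(counts, mu).sum())
print("NB log-likelihood:     ", stats.nbinom.logpmf(counts, r, p).sum())
```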
  • 12. Analysis • Non-parametric tests ▫ When it's hard to estimate the distribution ▫ As a QA / sanity check for parametric tests • Mann-Whitney, Kolmogorov-Smirnov ▫ Pros: ▪ Can be appropriate for unknown or non-normal distributions ▪ More robust than the t-test ▫ Cons: ▪ Less sensitive and less powerful than parametric tests (median as the parameter) ▪ Assume that both samples come from the same distribution ▪ Rely on a normal approximation of the test statistic in large samples
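Both tests are available in scipy; a minimal comparison of a skewed, revenue-like KPI between two groups might look like this (the data is simulated and purely illustrative).

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu

rng = np.random.default_rng(0)
# Simulated, heavily skewed revenue-per-user samples
revenue_a = rng.lognormal(mean=1.00, sigma=1.0, size=5_000)
revenue_b = rng.lognormal(mean=1.05, sigma=1.0, size=5_000)

# Mann-Whitney U: does one group tend to produce larger values?
u_stat, p_u = mannwhitneyu(revenue_a, revenue_b, alternative="two-sided")

# Kolmogorov-Smirnov: do the two samples come from different distributions?
ks_stat, p_ks = ks_2samp(revenue_a, revenue_b)

print(f"Mann-Whitney p-value:       {p_u:.4f}")
print(f"Kolmogorov-Smirnov p-value: {p_ks:.4f}")
```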
  • 13. Analysis • Permutation tests 1. Calculate the test statistic 2. Shuffle and resample 2 random groups 3. Calculate the test statistic again 4. Compare it to your original statistic; if it is more extreme, k = k + 1 5. Return to step 2, N times 6. Calculate the probability of getting a result more extreme than your original: k/N – this is your p-value
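A direct implementation of those six steps, assuming the two groups' KPI values are in numpy arrays and using the difference in means as the test statistic (a two-sided comparison).

```python
import numpy as np

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Steps 1-6 from the slide: shuffle the pooled data N times and count
    how often a random split is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([group_a, group_b])
    observed = abs(group_a.mean() - group_b.mean())        # step 1

    k = 0
    for _ in range(n_permutations):                        # steps 2-5
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(perm_a.mean() - perm_b.mean()) >= observed:
            k += 1                                         # step 4
    return k / n_permutations                              # step 6: the p-value

# Illustrative data only
rng = np.random.default_rng(1)
a = rng.normal(10.0, 2.0, size=1_000)
b = rng.normal(10.2, 2.0, size=1_000)
print(permutation_p_value(a, b))
```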
  • 14. Analysis • Check for outliers ▫ Plot your data at the daily/hourly level ▫ Descriptive statistics can help (variance) • Try to filter bots and crawlers ▫ It is almost impossible to filter all non-human activity on the web. ▫ Automated bots and crawlers can bias the results and lead to wrong conclusions. • Continuous A/A test as a sanity check for the whole system ▫ What difference do you observe between the A groups, and is it insignificant? ▫ Technical and tracking issues
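One way to operationalize the daily-level check is to aggregate the KPI by day and flag days that sit far from the rest, which often exposes bot spikes or tracking breakage; the column names `timestamp` and `kpi_value` below are hypothetical.

```python
import pandas as pd

def flag_outlier_days(events: pd.DataFrame, z_threshold: float = 3.0) -> pd.Series:
    """Aggregate a KPI by calendar day and return the days whose mean value
    is more than `z_threshold` standard deviations from the overall mean.
    Assumes hypothetical columns 'timestamp' and 'kpi_value'."""
    daily = (events
             .assign(day=pd.to_datetime(events["timestamp"]).dt.date)
             .groupby("day")["kpi_value"]
             .mean())
    z_scores = (daily - daily.mean()) / daily.std(ddof=1)
    return daily[z_scores.abs() > z_threshold]
```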
  • 15. Data-Driven Culture • Avoid HiPPO (Highest Paid Person's Opinion) that is not supported by data • Be clear about your KPIs & how they affect your business • Fight your ego – numbers don't lie • 80%-90% of tests won't give a positive result • Learn from failed tests
  • 16. Pitfalls • Picking an easy-to-beat KPI with no relation to the lead business metrics ▫ Example – focusing on increasing the click-through rate of banners/buttons while ignoring other metrics like user retention or revenue. • Using incorrect statistical methods or violating their assumptions ▫ Example 1 – assuming that a KPI has a Normal distribution without actually checking it. ▫ Example 2 – using online significance calculators without understanding the data distribution
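Following Example 1, a cheap guard against silently assuming normality is to look at the skewness and a formal normality test before choosing a method; the lognormal sample below is a placeholder standing in for a revenue-like KPI.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
revenue = rng.lognormal(mean=1.0, sigma=1.5, size=2_000)   # placeholder KPI sample

print("skewness:", stats.skew(revenue))        # far from 0 -> strongly asymmetric
stat, p = stats.normaltest(revenue)            # D'Agostino-Pearson normality test
print("normality test p-value:", p)            # tiny p-value -> reject normality
```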
  • 17. Pitfalls • Combining ratios from different proportions over time – Simpson's Paradox ▫ Example: (see the illustration below) • Ignoring outliers and bots | not plotting data on a timeline ▫ Example: one outlier can change the test results
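The original slide's example is not included in the transcript, so here is a generic illustration with made-up numbers: group B wins on both weekdays and weekends, yet pooling the days makes A look better, because A happened to collect most of its traffic on the high-converting weekdays.

```python
# Made-up numbers purely to illustrate Simpson's paradox.
data = {
    #            A: (conversions, users)   B: (conversions, users)
    "weekday": ((200, 2000),               (44, 400)),     # A 10.0%  B 11.0%
    "weekend": ((20, 400),                 (120, 2000)),   # A  5.0%  B  6.0%
}

totals = {"A": [0, 0], "B": [0, 0]}
for segment, (a, b) in data.items():
    print(f"{segment}: A {a[0] / a[1]:.1%}  B {b[0] / b[1]:.1%}")
    for name, (conv, users) in (("A", a), ("B", b)):
        totals[name][0] += conv
        totals[name][1] += users

# Pooled over time the ranking flips: A 9.2% vs B 6.8%
print(f"pooled : A {totals['A'][0] / totals['A'][1]:.1%}  "
      f"B {totals['B'][0] / totals['B'][1]:.1%}")
```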
  • 18. Pitfalls • Starting a test without validation (an A/A test is the solution) • Changing the control group during the test (solution – change them both!) • Technical issues with the experiment group ▫ Example – redirects, caching, new technology • Running your experiment “until it reaches a significant difference” • Not “anchoring” users to one group only (also cookie problems)
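On the last point, a common way to anchor every user to a single group for the whole experiment is deterministic hashing of a stable user ID together with the experiment name, so repeat visits land in the same bucket even if the assignment is recomputed; this is a generic sketch, not the setup described in the talk, and the identifiers are hypothetical.

```python
import hashlib

def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map (user_id, experiment) to 'A' or 'B'.
    The same pair always produces the same group across visits and servers."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF    # roughly uniform value in [0, 1]
    return "B" if bucket < treatment_share else "A"

# Hypothetical identifiers, for illustration only
print(assign_group("user-123", "business-package-test"))
```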
  • 19. References ▫ How Not To Run An A/B Test – http://www.evanmiller.org/how-not-to-run-an-ab-test.html ▫ Microsoft Experimentation Platform – http://www.exp-platform.com/Pages/ExPpitfalls.aspx ▫ Simpson's Paradox – http://vudlab.com/simpsons/