A/B test best practices

Question, hypothesize, then test 🤔

A/B tests are like science experiments. You should ask yourself what you are trying to accomplish, hypothesize how you can achieve that goal, and then test whether your solution actually works.

For example:

  1. Question — You might wonder why your high daily active user (DAU) count isn't translating into a high lifetime value (LTV) per user. Your goal is to encourage more monetization events.
  2. Hypothesis — You hypothesize that a more noticeable “purchase” CTA button will lead to more purchases.
  3. Test — You A/B test whether changing the button’s color (to bright blue) or location (to the top of the display) increases LTV per user.

Having a clear idea of these steps will help you set up worthwhile tests and make sense of your results.

One test at a time ☝️

You can test a nearly unlimited number of things in Leanplum, but we recommend testing just one change at a time (within a single test or across multiple tests).

There are two main reasons for this:

1. More accurate results — To evaluate how effective a change is, you need to measure its performance while everything else stays the same. Otherwise, you won’t know which change is driving your results.

For example, if you change an email’s copy, color, and subject line all at the same time, and your subscriptions drop by 30%, how will you know which element(s) didn’t work for your users? Was it all of them, or just one in particular?

Testing one increment at a time will give you more reliable data, which can more accurately inform your company’s future marketing and design decisions.

2. Consistent user experience 👩‍💻 — Your users want a consistent experience when they open your app. If elements are constantly changing, it can confuse them and drive them to uninstall or unsubscribe.

A user might experience inconsistencies if they enter multiple A/B tests at the same time. (Users can qualify for multiple A/B tests if they fit the target audience for each.) Be diligent with how many changes you expose users to — this will build trust between your brand and your users.

Set it and forget it ⏰

Use Leanplum's time estimate

When you create an experiment with a goal, Leanplum will estimate the recommended length of time you should run the test to achieve your goal.

To get the best results, let your test run for the duration of Leanplum's time estimate, which is calculated using your test goal, DAU, and the number and complexity of variants.

Running the test for the right amount of time ensures your data is reliable — you should be able to say with 95% statistical confidence that the difference you see between variants is real, not random noise. This way, you can confidently make marketing decisions based on your winning variant.
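To see roughly where such an estimate comes from, here is a standard two-proportion power calculation (illustrative only, not Leanplum's actual formula). It shows how baseline conversion rate, the lift you want to detect, and your DAU combine into a recommended run length; the function names and the `eligible` parameter are assumptions for this sketch:

```python
import math
from statistics import NormalDist


def required_sample_size(baseline, mde, alpha=0.05, power=0.8):
    """Per-variant sample size for a two-proportion z-test.

    baseline: current conversion rate (e.g. 0.05 = 5%)
    mde: minimum detectable effect, absolute (e.g. 0.01 = +1 point)
    alpha=0.05 corresponds to the 95% confidence level.
    """
    p1, p2 = baseline, baseline + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(baseline, mde, dau, n_variants=2, eligible=1.0):
    """Rough test length: total users needed / eligible users per day."""
    total = required_sample_size(baseline, mde) * n_variants
    return math.ceil(total / (dau * eligible))
```

With a 5% baseline conversion rate and a goal of detecting a 1-point lift, this works out to roughly 8,000 users per variant; dividing the total across your eligible DAU gives the run length in days.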


Timing is important

Stay abreast of any outside factors that might skew your results. For example, your retail app might get different traffic around the December holidays than it typically does the rest of the year. When in doubt, test your hypothesis multiple times in different months.

Don’t make changes mid-test 🙅‍♀️

Once you begin a test, let it run its course. Resist the temptation to tweak the test parameters after you publish.

Making changes mid-test will skew your results and make it difficult to identify how a variant influenced your metrics. This is especially relevant to the audience distribution: if the distribution changes mid-experiment, already-enrolled users will remain in the same variant group. Only new users will enter the experiment at the new distribution.

Remember, you can always create a follow-up test — after the current one is finished.
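One reason enrolled users keep their variant is that A/B systems typically assign variants deterministically from a stable user ID rather than at random on each visit. Here is a minimal sketch of that idea; `assign_variant` and its split logic are hypothetical, not Leanplum's implementation:

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, weights: dict) -> str:
    """Deterministically map a user to a variant.

    Hashing the stable user ID (plus the experiment ID) means the same
    user always lands in the same bucket for a given experiment, so the
    assignment never flips mid-test.
    """
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform in [0, 1)
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variant  # fallback for floating-point rounding


# The same user gets the same variant on every call:
split = {"control": 0.5, "blue_cta": 0.5}
v1 = assign_variant("user-42", "cta-color-test", split)
v2 = assign_variant("user-42", "cta-color-test", split)
assert v1 == v2
```

Because only the enrollment-time weights determine the bucket boundaries, changing the weights mid-test would reshuffle new users while leaving anyone already bucketed where they are, which is why the distributions no longer match.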

Use machine learning 🤖

Let Leanplum do the work for you. We have several machine-learning features to help you get the most out of your A/B tests:

  • Stickiness — Stickiness ensures users who fit your A/B test audience when the test starts (say, city = Los Angeles) will remain in the campaign even if their attributes change mid-campaign (say, if they travel to New York and open your app there).
  • Significant changes — In your test analytics, Leanplum will automatically surface any metric that undergoes a statistically significant change as a result of this test. We do this to help you avoid unforeseen consequences — for example, if your test increases monetization events by 20% but also increases uninstall rates by 15%.
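Statistical significance for a conversion-style metric is commonly checked with a two-proportion z-test. As a sketch of the general technique (not Leanplum's internal method), here is how the 95% confidence threshold maps to a p-value below 0.05:

```python
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for 'variant B's rate differs from A's'.

    conv_a/conv_b: number of conversions; n_a/n_b: users per variant.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# A 5% -> 6% lift over 10,000 users per variant is significant at 95%:
p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
print(p < 0.05)  # prints True
```

A change counts as statistically significant at the 95% level when this p-value falls below 0.05; the same test applied to an unrelated metric (like uninstalls) is what surfaces the unforeseen side effects mentioned above.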