Video Summary and Transcription
Visual Regression Tests are like unit or integration tests but focus on the visual part, allowing developers and QA personnel to identify and address any changes. Challenges in detecting UI changes include elements that are not visible to the human eye and misalignment of elements. Use cases for Visual Regression Tests include testing design system components, responsive designs, and browser renderings. Building a Visual Regression Test Tool involves handling animations, network requests, and flakiness. Docker is the best solution for resolving visual regression issues, and finding the baseline for comparison can be challenging but is handled by the testing tool.
1. Introduction to Visual Regression Tests
Visual Regression Tests are a method of detecting unintended changes to your app's UI. They are like unit or integration tests but focus on the visual part. By comparing screenshots of the current and previous versions, a machine can highlight the differences, which may not be noticeable to the human eye. This allows developers and QA personnel to identify and address any changes.
Welcome. Let's talk about Visual Regression Tests. So what are Visual Regression Tests anyway? They're a method of detecting changes to your website, app, UI that were not intended. Think of them as unit or integration tests, but dealing more with the visual part of your app.
There's a long description to that, but that's kind of boring, so as we're talking about Visual Regression Tests, let me just show you. So imagine you're working in a company where there is a website, and you have a marketing department, and the marketing department wants to introduce a small change. We have the change implemented by the department, and now we take a screenshot of that page. This is something that we call a current shot or a changed shot: this is after the change. And then we pull up a version from before, what the website looked like before we made the change. This is called a baseline image. Now if you look at both versions, before and after, you might be able to spot the differences, but it's not that easy, because the human eye is not good at it. So what we need is a machine to show us a difference mask. If you look at this, it becomes really obvious what changed. Let me pull this up a little bit so you can see: at the top we added a new blog link, the change in the marketing header is there as well, and at the bottom some marketing copy changed too. Not that obvious to the human eye, but it's there. Now the visual regression test would catch that and tell somebody like the developer, the QA person, or the product manager: hey, there's a change. What do you want to do about it?
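The difference mask the machine produces comes from a straightforward idea: compare the two images pixel by pixel and record where they differ. A minimal sketch in plain JavaScript; real tools such as pixelmatch additionally handle anti-aliasing and perceptual color distance:

```javascript
// Minimal sketch of a difference mask: compare two equally sized images
// pixel by pixel. Each image is a flat RGBA array (4 values per pixel),
// like the data you get from a canvas or a decoded PNG.
function diffMask(baseline, current) {
  const mask = new Array(baseline.length / 4).fill(false);
  let changed = 0;
  for (let i = 0; i < baseline.length; i += 4) {
    const same =
      baseline[i] === current[i] &&         // R
      baseline[i + 1] === current[i + 1] && // G
      baseline[i + 2] === current[i + 2] && // B
      baseline[i + 3] === current[i + 3];   // A
    if (!same) {
      mask[i / 4] = true; // this pixel would be highlighted in the mask
      changed++;
    }
  }
  return { mask, changed };
}
```

In a real tool the `mask` array is rendered back out as an image, which is exactly the difference mask shown in the demo.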
Now you might be saying that, well, it's just a small text change, nothing harmful. What can go wrong? It's not that important. In this case, maybe you're right, but let me show you an example where it's not that harmless. But before we get to that, let me quickly introduce myself. Hi, my name is Chris. I'm a full-stack engineer. I love open source and I'm a co-founder at Lost Pixel. My passion for visual regression tests is so big that I started building a product with a friend of mine and we built even a company around it. If you want to catch up with me, you can find me on social media, on X, Twitter, YouTube, LinkedIn. Just look for the handle Chris Calmer. It would be nice to meet you. All right, let me get back to the example that I was talking about before.
2. Challenges of Detecting UI Changes
Integration tests can miss UI changes that are not visible to the machine-driven checks. For example, a button may disappear due to a color change that blends it with the background. This can lead to negative user experiences and potential losses. Another issue is the misalignment of elements, such as a squished title or overlapping content.
Here's what a shop usually has, right? Somebody made a change, and suddenly the sales department goes into panic mode because sales numbers are dropping like crazy. Nobody's buying anything anymore. What is going on? Okay, the engineering team has to check what's up, and well, they look at the checks. Everything is passing green. They look deeper, really closely at every single build step. And well, the unit tests passed. Even the Playwright integration tests passed. So, what happened? If you look closely, you'll find out: the buy button is missing.
So how could that happen? We had integration tests there, right. They should have caught that problem that somebody removed the button. The truth is a bit more complicated. The button was never gone because this is what the integration test sees. Playwright or Cypress in this case, sees the button. It's still there and can click it. So your tests are passing. But if you go back, the customer doesn't see the button. After some research, you finally find out what happened. Somebody changed the primary color of the action button in another place. And that caused the button to disappear because now the color of the button blends in perfectly with the background of the page. The button is still there, but you can't see it as a user. Only the machine can do it.
Well, that's really not good. And this will cost you. Let me show you another example. Here, the title of the product description is squished because somebody made changes to the line height somewhere else. And this doesn't look good. Or another case where we have content overlapping: the image is on top of the description, which makes it really hard to read.
3. Use Cases for Visual Regression Tests
Having elements such as images overlapping with descriptions can lead to negative user experiences and loss of trust. Other use cases for visual regression tests include testing design system components in different states, responsive designs, and browser renderings. Manually checking every component, page, state, browser, and screen resolution is not feasible, so automation tools are necessary.
So chances are good that the customer loses trust in your ability to provide a good service, gets frustrated, and you've lost the customer.
There are also other use cases that I would like to show you. For example, if you have a design system, you would like to test all those design system components, not only in one state, but in multiple different states: a button hovered, clicked, and so on. There are a lot of tests you can do there. Responsive designs: you want to see what your website looks like on desktop, on a tablet, or on a mobile phone. And also browser renderings: you want to make sure that your website or application looks good on Firefox, Chrome, Safari, and Internet Explorer. This is a good use case for visual regression tests.
Or maybe you have just a designer that loves pixel-perfect designs and you need to make sure that the result looks still perfect. Now regardless of your use case, manual checks don't scale. You would have to check every component, every page, every state, every browser, every screen resolution on every code change. This doesn't work. You need to automate that process and you need to use a tool for that. And luckily, there's plenty of tools out there for visual regression tests. Some of them are open source. Some of them are paid solutions. There's plenty to choose from.
4. Building a Visual Regression Test Tool
Building a visual regression test tool involves taking screenshots and facing challenges such as handling animations and stopping them to ensure consistency. Networking is another issue to address, with network requests potentially causing delays. Mocking out requests and keeping assets local can help mitigate this. Extending the waiting period before taking screenshots is also an option.
As I mentioned before, my passion for visual regression tests led me to create a product and build a company around it. You know, the usual story. You see a problem, you try to solve it, and you end up with a product. But don't worry, I'm not talking about the product today. I just want to share with you what I learned while building this tool and what actually happens behind the scenes of a visual regression test. Because from the outside, it looks like a simple operation. You have screenshots that you compare, and that's basically it. But there's more to that.
The first step is taking the screenshot. And this already has some challenges in it. Animations. You know, you have things like a loading spinner, for example. This is something that really makes a visual regression test problematic, because every time you take a screenshot in the build process, the spinner might be in a different position. This is what you don't want to end up with, and therefore you need to stop those animations. Some animations are JavaScript-based; luckily, most of them are CSS-based, which makes them really easy to stop. Another example would be carousels and similar things. All of those have to be stopped so they don't look different every time you take a screenshot.
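One common way to freeze CSS animations is to inject an override stylesheet right before the shot. The sketch below assumes you drive the browser with Playwright, but the CSS itself works with any tool:

```javascript
// CSS that freezes animations and transitions (and hides the blinking text
// caret) so every screenshot of the page looks identical.
const FREEZE_CSS = `
  *, *::before, *::after {
    animation: none !important;
    transition: none !important;
    caret-color: transparent !important;
  }
`;

// Playwright usage (assumed setup):
//   await page.addStyleTag({ content: FREEZE_CSS });
//   await page.screenshot({ path: 'home.png', fullPage: true });
//
// Recent Playwright versions can also do this for you:
//   await page.screenshot({ animations: 'disabled' });
```

JavaScript-driven animations (carousels, canvas spinners) still need to be stopped in the app itself, for example behind a test-mode flag.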
The next problem that you have to deal with is networking. When you load your page, there are network requests happening, and you don't know when they're finished, right? It could take 200 milliseconds, it could take three seconds, but you still need to take a screenshot. In this case, the best thing is to mock out your requests with tools like Mock Service Worker. Keeping your assets local is a good thing too. And if you don't have any network requests at all, because you're just testing your components, that's the best case; that makes it easy. There's also the option of extending the waiting period for each test, giving it, say, five more seconds before you take a screenshot.
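A sketch of the mocking idea: answer API calls with a fixed fixture so every run renders the same data. The URL pattern and fixture here are made up for illustration; with Playwright you would wire the decision function into `page.route()`, and Mock Service Worker gives you the same pattern inside the app:

```javascript
// Decide what canned response (if any) to return for a request URL.
// Fixture data and the '/api/products' path are hypothetical examples.
function mockResponseFor(url) {
  if (url.includes('/api/products')) {
    // Always return the same fixture so every screenshot sees identical data.
    return {
      status: 200,
      contentType: 'application/json',
      body: JSON.stringify([{ id: 1, name: 'Demo product', price: 9.99 }]),
    };
  }
  return null; // let everything else (local assets etc.) through
}

// Playwright wiring (assumed setup):
//   await page.route('**/api/**', route => {
//     const mock = mockResponseFor(route.request().url());
//     mock ? route.fulfill(mock) : route.continue();
//   });
```

With the data pinned down like this, there is nothing left to wait for, and the screenshot timing stops being a guessing game.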
5. Handling Flakiness and Taking Screenshots
Visual regression tests with a large number of screenshots can cause significant wait time on your CI pipeline. To address this, your tool should wait for incoming requests before taking screenshots. Dealing with flakiness is crucial, and stopping animations, mocking network requests, and masking problematic areas can help. It's important to question the value of tests with strong flakiness. Figuring out the cause of flakiness is essential to prevent potential bugs. Taking screenshots can be done using a web server for full pages or tools like Storybook or Ladle for components.
This works okay if you have a few screenshots to take. But if your test suite has 500 to 1,000 screenshots to take, this can quickly add up to a lot of wait time on your CI pipeline. You don't want to waste that time. So your visual regression tool needs to be clever enough to understand, hey, there are more requests coming in. And I need to wait and only then take the screenshots.
Another thing that is important with visual regression tests is dealing with flakiness. Just like with unit tests, the results don't always come back green; sometimes they go red without any obvious reason. So flakiness needs to be combated. One of the things you have to do is definitely stopping animations. As I mentioned before, mocking out your network requests is a very good thing to do. Also, giving it a bit of wait time, as I mentioned before, but be cautious with this; it doesn't always end up in a good result. Then you can also mask problematic areas. For example, if you have external libraries that you don't control and they are flaky, this is a good thing to do. Here's an example where you can see a page with a chart displayed at the top right. This chart comes from a third-party library that renders things a little bit differently every time, and you can't control it. The best solution is just to mask this area. When you mask it, you will always have the same reproducible result, making your test less flaky.
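Masking boils down to ignoring any differing pixels that fall inside the masked rectangles. A minimal sketch of that idea follows; note that Playwright supports this natively via the `mask` option of `screenshot()`, which paints the masked elements over with a solid color:

```javascript
// Is the pixel at (x, y) covered by any of the mask rectangles?
function isMasked(x, y, masks) {
  return masks.some(
    m => x >= m.x && x < m.x + m.width && y >= m.y && y < m.y + m.height
  );
}

// Count only the differing pixels that are NOT masked; these are the ones
// that should actually trip the visual regression check.
function countRelevantDiffs(diffPixels, masks) {
  // diffPixels: [{ x, y }] positions where baseline and current differ
  return diffPixels.filter(p => !isMasked(p.x, p.y, masks)).length;
}
```

Because the same mask is applied to both the baseline and the current shot, the flaky region always compares equal and only real changes remain.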
And then you can also question whether the test still makes sense. Because if you have really strong flakiness, maybe there's something wrong with your test. There is no value in running the same test over and over and getting different results; it will just give you more headaches and not the results you want. In general, it's a good idea to try to figure out what causes the flakiness, because there may be something else going on that you're not aware of, and this could later become a potential bug.
Then taking the screenshot itself. You have different options. So for example, if you're taking shots of your full pages, then usually you will run a web server. Or if it's a static build, it's even simpler. And in the other case where you want to check your components, then usually you would use something like Storybook or Ladle.
6. Utilizing Integration Tests and Dealing with Fonts
You can configure the visual regression testing tool to take screenshots in different browsers and resolutions. Integration tests can be included in visual regression testing to save time. Playwright allows you to take screenshots of specific sections using selectors. Fonts can be problematic in visual regression testing, as differences may not be easily detectable to the human eye.
Or a similar tool where you just show your components in different states. And yeah, you point your visual regression testing tool in that direction, and it will take care of the screenshots. You can also configure there what browsers you want to take screenshots with and what resolutions to handle. Tablets, desktop, mobile, whatever.
There's something that I would like to point out as well, which might be helpful. If you already have integration tests, you can utilize them for this. Because instead of having to run the web server, or generate the static build, or run Storybook, you might also include your visual regression test in your integration test. So basically you get two for the price of one. For example, if your integration test is in Playwright, once the integration steps are passing, at the end of your test you could just create a screenshot. Or in Cypress, different commands, just create a screenshot and that's it. Those screenshots you can then provide to your visual regression testing tool.
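A sketch of the two-for-one idea, assuming Playwright; the naming helper and the file-naming scheme are made up for illustration, but deriving the shot name from the test title is what lets the comparison tool match each shot to its baseline:

```javascript
// Build a stable screenshot file name from the test title and browser,
// e.g. 'Checkout Flow' + 'chromium' -> 'checkout-flow--chromium.png'.
// (Naming scheme is a hypothetical example.)
function shotName(testTitle, browserName) {
  return `${testTitle.toLowerCase().replace(/\s+/g, '-')}--${browserName}.png`;
}

// In a Playwright test (assumed setup):
//   test('Checkout Flow', async ({ page, browserName }) => {
//     await page.goto('/checkout');
//     await expect(page.getByRole('button', { name: 'Buy' })).toBeVisible();
//     // integration steps passed, now capture the shot for visual comparison
//     await page.screenshot({
//       path: shotName('Checkout Flow', browserName),
//       fullPage: true,
//     });
//   });
```

In Cypress the equivalent last step would be a `cy.screenshot()` call at the end of the spec.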
There's also something that you can do with Playwright: if you don't want to take the whole page, but rather just a section of it, you can use selectors to pick elements, and the screenshot will cover only those items. Then another thing that we need to deal with is fonts. And fonts are problematic. I need to show you an example to give you a better understanding. These three screenshots have been taken with Chrome, Firefox, and Safari. Each one of them has a different rendering engine. And to the human eye, again, it's quite impossible to detect any differences. Everything looks the same to me. But now, if I use the machine to generate difference images, it becomes obvious. Here in the first one we have Chrome versus Firefox.
7. Resolving Visual Regression and Finding Baseline
Docker is the best solution for resolving visual regression issues across different browsers and operating systems. Running tests on your CI pipeline ensures consistent results for everyone. Finding the baseline for comparison in visual regression testing can be challenging, but your testing tool will handle it by following the Git graph and identifying approved changes.
The second one is Firefox versus Safari, and so on. We see that in the first one there are already huge differences. There are fewer differences on the right side, but all of them are font-related. And this causes not only the font to look different, but it will also cause some layout shifts down the line, and so on. This is something that is not really good. And there are ways to mitigate those issues. If you want to test across different browsers, then that's a good thing anyway.
So you would keep track of your visual regression tests for Chrome, for Firefox, and for Safari. But if you want to resolve it for just one browser, the best thing is to run it in Docker. Because if you have a team with Mac users, Linux users, and Windows users, all of them will generate different screenshots, and these are not comparable. You would basically be dealing with a visual regression all the time that you don't want to deal with. But the best solution of all would be to run everything on your CI pipeline, because then you get consistent results over and over again for everyone, no matter where the push came from.
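For example, with Playwright you could run the whole suite inside the official Playwright Docker image, so every developer machine and CI agent renders with the same fonts and browser builds. The image tag below is an example; pin it to the Playwright version your project actually uses:

```shell
# Run the test suite inside the official Playwright image so macOS, Linux,
# and Windows machines all produce byte-identical screenshots.
docker run --rm \
  -v "$(pwd)":/work -w /work \
  mcr.microsoft.com/playwright:v1.44.0-jammy \
  npx playwright test
```

Running the same container on CI then makes the pipeline the single source of truth for what a screenshot looks like.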
Good. Now that we have the screenshots, let's find the baseline, because we need to compare against something. And this is a very challenging topic, and something that your visual regression testing tool will have to take care of for you. So let's imagine a Git graph. This is our current code change, in this commit here. And we want to find our baseline so we can compare against it. So we need to go back in time and follow the Git graph. We jump to the green dots, which are changes that were approved in the visual regression tests. They're the ones we're interested in. And we can pull our version from there and try to compare it. But the truth is a bit more complicated, because the Git graph never looks that linear, right? Unless you're working by yourself. But even then, I don't think that's the case. Usually a Git graph looks a little bit different. You have branches, you have PRs, teams working on it, merges are happening, rebases are happening, and so on.
8. Finding Baseline and Comparing Shots
To find the baseline, you need to follow all the parents of your branch's change. The process becomes more complicated when dealing with merges and multiple parents. Your tool will handle the complexity by collecting the history from different builds. Once you have the baseline for each shot, you can compare them using the difference image. Flaky areas can be masked to avoid marking consistent changes. Thresholds are used to accept or reject changes, with even a small area of 40x24 pixels considered significant for visual regression tests.
So it's a bit more complicated. So in this case, if you look at it again, if we have our change on our branch, we have to follow all the parents to find out what our baseline is. So in this case, for example, if you go like this, you have to follow both parents. Because sometimes if you have a merge, you don't have just one parent in your commit, but there are two parents or more. To make it a bit easier to understand, let's look at it in a tabular form here. Let's say we have only four page shots that we want to take, and this is the history in GitHub. Not always do we apply changes to every page, right?
Sometimes we just make one page change, sometimes we apply changes to multiple pages. So therefore, you can't just take the last build that is there and use it as your baseline. You have to put together your whole history from different builds during time. And this is what's making the process so complicated. So, for example, for the page one, if we go back in time, the last change that was approved was on this commit. For the page two, we can find it here. For the page three, we have to go back a little bit further in time because there was no change done in some time. And this is getting more complicated if you have a monorepo setup with multiple frontends. But again, this is something that your tool will have to take care of for you.
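The baseline lookup described above can be sketched as a walk over the commit graph: starting from the current commit's parents, follow all parents (a merge has two or more) until you reach a build where that particular shot was approved. The commit and build shapes here are hypothetical:

```javascript
// Find the most recent approved version of a shot by walking back through
// the commit graph. `commits` maps sha -> { parents, approvedShots }, where
// approvedShots maps shot names to stored baseline images (shape assumed).
function findBaseline(commits, currentSha, shot) {
  const start = commits[currentSha];
  const queue = start ? [...start.parents] : []; // skip the unapproved HEAD
  const seen = new Set();
  while (queue.length > 0) {
    const sha = queue.shift();
    if (seen.has(sha)) continue;
    seen.add(sha);
    const commit = commits[sha];
    if (!commit) continue;
    if (commit.approvedShots && shot in commit.approvedShots) {
      return commit.approvedShots[shot]; // approved version found
    }
    queue.push(...commit.parents); // merges contribute multiple parents
  }
  return null; // no baseline yet: this shot is brand new
}
```

A production tool has to do more (ordering parents by recency, caching, monorepo scoping), but the core of assembling a baseline per shot from different historical builds looks like this.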
Good. Now that we have the baseline for each of our shots, we can start comparing them. As I showed you before, there is the current shot and the baseline shot, and we can create the difference image. What we can also do is mask some areas that are flaky. In this case, I want to mask out the header copy, and the baseline will always have the same masked area. So in the end, if you look at the result, the difference image will no longer mark changes inside the mask, only the ones outside of it. Then we need to think about thresholds. Let's imagine we have a screenshot of a page or a user interface that is 1,200 by 800 pixels in size. This translates to 960,000 pixels. If I now apply a threshold of, say, not 1% but 0.1%, meaning I accept 0.1% of pixels changing without tripping the visual regression check, that 0.1% would be 960 pixels at this resolution. To give you a better idea of what that looks like, it's more or less an area of 40 by 24 pixels. Not that big, you might think, but it is actually quite a lot for a visual regression test.
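The threshold arithmetic above as a tiny helper:

```javascript
// How many pixels does a percentage threshold actually allow to change
// at a given resolution?
function allowedDiffPixels(width, height, thresholdPercent) {
  return Math.round(width * height * (thresholdPercent / 100));
}

// 1200 x 800 = 960,000 pixels, so a 0.1% threshold already allows 960 of
// them to change without failing the check: roughly a 40 x 24 pixel area.
```

Tools typically expose this as either a ratio or an absolute pixel count; Playwright's `toHaveScreenshot`, for instance, has `maxDiffPixelRatio` and `maxDiffPixels` options.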
9. Reviewing Changes and Closing the Cycle
Keeping thresholds low is important to avoid unnoticed changes over time. Start with 0% tolerance and adjust for smaller flaky issues. Review changes using the difference mask or side-by-side comparison. Approve or reject changes based on intended or visual regression. Closing the cycle, approved baselines are used for future tests. Thank you for listening and find me on social media. See you at React Amsterdam!
Almost 1,000 pixels of difference is a lot. If you accepted this as a threshold, it would mean that a lot of changes would bleed in over time and you wouldn't even be aware of it. So you want to keep your thresholds really low. Ideally, you start with 0% tolerance and then adjust it for some smaller flaky issues. Most of the tools also provide an absolute value, so instead of percentages you can define that you want to allow up to 20 pixels of change, which is fine. Sometimes you will find that rendering borders, especially rounded borders, gets you different results; the anti-aliasing algorithm doesn't always kick in properly, so those are acceptable changes. 20 pixels, nothing serious.
Good. Now that we have all those things in place, we need to review those changes, because now we have some differences. And as I showed you before already, you can use the difference mask where it will tell you exactly where the changes are, but there's as well another way of looking at it side by side, but this is not so good because the human eye is really bad at it. And another option is, like you can see here, you can use the slider where you can find out yourself what the differences are, so whatever works for you, just use that.
Good. And in the end, you need to approve or reject the change. Was it an intended change, or is it a visual regression? You have to decide. In the end, you will see the checks all passing. When you complete this, you close the loop, the cycle ends, and everything starts from the beginning. Somebody opens a new PR, or you open a new PR, and the baselines that you approved are already part of the system and will be used the next time the process starts all over. Thank you for listening. I hope you enjoyed it, I hope you learned something about visual regression testing, and I hope you're as excited about it as I am. And yeah, find me on social media, and maybe we'll see each other at React Amsterdam this year. Bye!