Input lag experiment Blog
Thesis released
2022/02/26 Hello everyone,
HalfwayDead here to report that I am publishing my thesis today! In case you forgot (which
is totally expected with how long it's been), it's called "Effects of induced latency on performance
and perception in video games" and uses the data from the Rocket League experiment. First
of all, I'd like to apologize to you all for stopping the updates. I am not the fastest writer, so each update took as much as an entire afternoon, and I also felt it wouldn't be responsible to release results that I hadn't triple-checked yet. It just didn't feel worthwhile to report on results I wasn't sure about yet.
The thesis overall still required a much greater effort than what I had expected at the
time of the last update. Thankfully, COVID rules gave me extra time. I ended up spending
most of it on teaching myself advanced statistical methods and reading 150+ papers related to the subject area. Originally I wasn't aware of the majority of the research, due to the way it is labeled and categorized, but once I started following every cited paper, the number of relevant sources expanded quite a bit. Unfortunately, that doesn't mean any of those sources gives a satisfactory, complete model of latency (and neither have I). I was also shocked that some of the research published at renowned conferences contains grave errors. Multiple articles confused network and local latency.
Although historically some games have had network handling where the network latency
causes local latency, this wasn't always the case for the games in the research. Aside
from the related work summary that appears in the thesis, I originally wrote a longer
chronological summary of the research. If you are interested in reading it, just email
me. I haven't double-checked it, but it has sources attached.
On November 8, 2021 I handed in my thesis and had to wait quite a while to hear back about the result. This week it was finally ready: I got a 1.3 for my effort. The German grading system allows for grades from 1.0 (best) to 4.0 (worst) in steps of 0.3; anything not passing is considered a 5.0. I am very happy with this.
The thesis is over 100 pages with all material included, so I totally understand if you
are not interested in reading that much. I am absolutely planning on releasing the
results in video form. However, in order to make it work for a YouTube audience, the
content will obviously be adapted, and any implementation details will be left out. I
also don't have an ETA for when I will get that done. I will not provide a written summary
(beyond the abstract).
This is not actually latency related, but I did already make a video about shot power using the data from the experiment. I forgot to make a post about that, but I'm assuming that most of the people reading this are subscribed to my YouTube anyway.
I will post the next update when the video summarizing the thesis is out. Beyond that, I will likely kill the email "newsletter", as it has served its purpose. If you think it would be worth keeping it running for specific kinds of updates, feel free to email me and tell me what you want to see on it.
If you want to get a mail on new updates you can still send me a mail with the subject "Notify
me" to experiment@rocketscience.fyi to get added to the mailing list. You can find the previous
update posts below this one.
Update 2
2021/01/08 A month and a half has passed since the last update and there has been some interesting
progress. Still not the final data, which I might not provide at all before I publish
the thesis itself, but at least something. If you want to get a mail on new updates you
can still send me a mail with the subject "Notify me" to experiment@rocketscience.fyi to get added to the mailing list. You can find the previous
update post below this one.
Progress
Progress was slow over the Christmas days but other than that, quite a bit happened.
I've been especially productive in the last 2 weeks. For a while now I've been working
on a list of your monitors. If you recall, during the experiment many things were
collected, like your monitor, your input device (controllers, etc.), personal info, and
even graphics and performance data. Some of those things were standard survey
information, but the other information serves mainly one purpose: estimating the
baseline input lag of your system. The most important pieces are the delay of the
monitor, input device, and the time the game needs to calculate and render everything. I
can only figure out the input lag of your monitor by using data from trusted reviewers. Different measurement methods can give different results, which further restricts the usable sources.
The most common lag testing method is the Leo Bodnar Input Lag Tester, a great all-in-one device: it generates a video signal and measures the response with a sensor for reliable single-spot measurements. Unfortunately, it can only generate a 60 Hz signal. Even though a monitor could technically have the same signal delay at 60 Hz as at 144 Hz or more, measurements often show otherwise: on top of the extra delay you get from refreshing the screen only 60 times a second, many monitors do not scan the image out immediately at a fixed 60 Hz but wait for a few milliseconds, adding further delay. This doesn't happen if you use FreeSync or G-Sync and the refresh rate drops to 60 because of the framerate; it only happens when you select 60 Hz in the control panel.
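To put rough numbers on just the refresh-rate part of that (a back-of-the-envelope sketch assuming the panel scans top to bottom over one refresh period, not a measurement of any specific monitor):

```python
# On average, a pixel change waits about half a refresh period before it is
# drawn, assuming a plain top-to-bottom scanout with no extra buffering.

def avg_scanout_delay_ms(refresh_hz: float) -> float:
    return 0.5 * 1000.0 / refresh_hz

print(f"60 Hz:  {avg_scanout_delay_ms(60):.1f} ms")   # ~8.3 ms
print(f"144 Hz: {avg_scanout_delay_ms(144):.1f} ms")  # ~3.5 ms
```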
The solution: rtings. They have created their own very similar tool, which connects to a PC, and a PC with a proper graphics card can then drive the monitor at any refresh rate a user could select. Hardware Unboxed have had something of the same kind for a bit more than a year. So these two sources were my gold standard when available, and their variance for the same monitors is rather minor. For 60 and 75 Hz monitors, I allowed any reputable site using a Leo Bodnar testing tool. Beyond that, I'm using tftcentral.co.uk and pcmonitors.info, who determine input lag with SMTT 2.0. Their numbers should not be compared 1-to-1 to those of rtings, etc.! So why am I doing that anyway? Because I know how both measuring technologies, pixel response times, and monitor scanouts work. That allows me to compensate the data, and additionally, I'm not trying to get 1 ms accuracy.
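Whether that compensation works out depends on the details of each method, but the rough idea looks something like this (a purely hypothetical sketch with made-up parameters, not the actual correction from the thesis):

```python
# Hypothetical normalisation of lag numbers from different review methods to a
# common reference point. The adjustments and values are assumptions for
# illustration only.

def normalise_lag_ms(reported_ms: float, refresh_hz: float,
                     includes_full_scanout: bool,
                     includes_response_time: bool,
                     response_time_ms: float = 0.0) -> float:
    period_ms = 1000.0 / refresh_hz
    lag = reported_ms
    if includes_full_scanout:       # measured near the bottom of the screen -> shift to the middle
        lag -= 0.5 * period_ms
    if includes_response_time:      # measurement waited for the pixel transition to finish
        lag -= response_time_ms
    return lag

# Example: a made-up 144 Hz measurement taken at the bottom of the screen,
# including a 4 ms pixel transition.
print(normalise_lag_ms(14.0, 144, includes_full_scanout=True,
                       includes_response_time=True, response_time_ms=4.0))
```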
To get back to the main point of the progress: at the end of the day, looking this up is manual work, and with 763 submissions, how much work could that possibly be... Turns out you guys use 449 unique monitors. That was more than I expected, but worse was the problem of identifying the true model based on the information collected. During the experiment you either chose your monitor from a list or submitted a name yourself, and there was additional info collected, which is where the name in the list came from. There is something called a DeviceID, e.g. BNQ7F51,
which will tell me that you're using a BenQ ZOWIE XL 2540 on DisplayPort. So much for the
theory. In reality, many manufacturers don't seem to care too much about making these unique
or anything. I even occasionally found monitors of completely different sizes using the same
ID. There is also no database that will just give you the names of the monitors by ID. The
name field can be equally useless. The Asus VG279 could be a 144 Hz, 165 Hz, or 280 Hz monitor.
Those models will obviously be significantly different, but the name is just a few letters
different, and those letters are missing in the name that I could read out from the monitor's
info. Regardless, I was able to identify the majority of monitors or narrow them down to seemingly
identical versions, which is still useful.
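Mechanically, the lookup ends up being a table from IDs to models and lag numbers, roughly like this (an illustrative fragment with made-up values, not my actual database):

```python
# Hypothetical monitor lookup: map collected DeviceID strings to a model and a
# reviewer-measured input lag. As described above, a single ID does not always
# map to a single model, so a fallback on the user-entered name is needed.

MONITOR_DB = {
    # device_id: (model_name, input_lag_ms) -- lag value here is made up
    "BNQ7F51": ("BenQ ZOWIE XL 2540", 2.0),
}

def lookup_monitor(device_id: str, free_text_name: str):
    if device_id in MONITOR_DB:
        return MONITOR_DB[device_id]
    # otherwise try to match the user-entered name, or mark the entry unknown
    return None

print(lookup_monitor("BNQ7F51", ""))
```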
At the end of the day, here is where I stand: I have 361 data points for which I know the monitor's input lag. That is only half, but still a few hundred. More importantly, the vast majority of you are using 120 Hz+ gaming monitors. Of those monitors that have been tested, only one has more than 10 ms of input lag (and even that is still less than 11). The plan is to take the monitors for which I have good data (and which have low input lag) and compare the results of the people using them to the results of the people using the untested gaming monitors. If the two groups show no statistical difference, then it can be assumed that including the group of people on untested gaming monitors doesn't skew the results.
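A minimal sketch of what that group comparison could look like, with made-up example scores standing in for the two groups (whether a t-test or a non-parametric test is more appropriate depends on the data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Made-up example scores; in reality these come from the two participant groups.
scores_known_low_lag = rng.normal(605, 122, size=200)   # verified low-lag monitors
scores_untested = rng.normal(605, 122, size=160)        # untested 120 Hz+ gaming monitors

# Welch's t-test, which does not assume equal variances between the groups.
t, p = stats.ttest_ind(scores_known_low_lag, scores_untested, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")
```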
Ok, enough about monitors for now. Let's do the same with controllers. 90% of you used a
controller and only 10% keyboard/mouse. I don't have the full statistics on the exact input devices yet, because that data is still in a raw state, but it should be over 3/4 DS4s and various Xbox One controllers, nearly all of which I have tested. There were still 152 unique controller entries, but it turns out that the majority of the manually entered ones were DS4s that didn't show up in the HID devices. I hope that most people who do not have an original Xbox One controller entered that fact and didn't just select the Xbox option, because a manufacturer can imitate it 100% at the driver level, so I have no way of knowing for certain. Many of the other distinct versions were just due to different connection methods, which show up as different product IDs. If you call those different controllers, I apparently own 29 different controllers :D.
The data
Now for the juicy part. I have made some progress on the processing, and thankfully,
when I tell the computer to add up 160 shots each for 763 people, it takes less than a
second. First of all though, out of 763 submissions, 16 people submitted twice, and nobody more than twice. The second submission of those people will obviously be removed, which leaves 747. Someone also fiddled with the graphics settings despite being told not to, so that submission will have to be removed as well. I haven't run all the tiny checks yet, but I'll continue with those after I remove people for input lag reasons.
This time around I focused on calculating the scores for everyone. I only collected raw data, so all the processing is done on my end rather than on the user's, where it would be easier to cheat. With that done, I can now show a highscore list:
| Place | Score |
|---|---|
| 1 | 972 |
| 2 | 864 |
| 3 | 864 |
| 4 | 861 |
| 5 | 858 |
| 6 | 847 |
| 7 | 834 |
| 8 | 833 |
| 9 | 826 |
| 10 | 810 |
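For the technically curious, the aggregation behind that list looks roughly like this (hypothetical column names and a tiny made-up example, not my actual processing code):

```python
import pandas as pd

# Tiny made-up example standing in for the raw per-shot data. In reality there
# are up to 160 rows per participant.
shots = pd.DataFrame({
    "participant_id":  ["a", "a", "b", "b", "b"],
    "submission_id":   [1, 1, 1, 1, 2],
    "submission_time": [10, 10, 20, 20, 30],
    "shot_score":      [5, 7, 6, 4, 9],
})

# For anyone who submitted twice, keep only shots from their earliest submission.
first_sub = (shots.sort_values("submission_time")
                  .drop_duplicates("participant_id")[["participant_id", "submission_id"]])
shots = shots.merge(first_sub, on=["participant_id", "submission_id"])

# Sum per-shot scores into one total per participant and sort for the top list.
scores = (shots.groupby("participant_id")["shot_score"]
               .sum()
               .sort_values(ascending=False))
print(scores.head(10))
```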
I'm sorry if you expected any names, but my privacy policy forbids that. The elephant in
the room is the top score. So far, all my checks have failed to find anything out of the
ordinary, well aside from what you can see already. There wasn't any dribbling, no
unusual resetting. The player had the highest average accuracy, but not by an incredible
margin, the highest average shot power, again not by an incredible margin, and missed
only 2 shots out of 160, despite the added input lag. All of that in combination creates the huge margin. Due to the large amount of data it would've required, I did not save every
physics tick of every shot. I cannot rule out cheating, but I think it's probably just
as likely that they found a smart and easy way to take each of the shots. That was
specifically against the stated request though. I asked players to challenge themselves
by shooting at the earliest opportunity they considered possible.
Most of you will have forgotten your score by now. If you haven't reinstalled BakkesMod since then, it should be at the top of the player_data.json file that you can find in the bakkesmod/data/inputlagexperiment/results folder. The average score was 605 and the standard deviation was 122, so the top score of 972 sits roughly three standard deviations above the mean. The average rating of participants was 1350, aka Champ 2.
Not sure if any of them are reading this, but shoutout to the 4 people with a negative score who still finished the experiment. If you didn't touch the ball at all, the shot reset and points were subtracted, so those people had to have played longer than anyone else. I am aware that the experiment was targeted towards very experienced players, so I appreciate it all the more when people push through to give me data on the lower end.
I think that's all the data I will show for now. In the next update I will probably dive
into some fun things like ranks vs hours, ranks vs score etc. Of course rank vs score
will be tainted by the fact that players had input lag. It's why I'm calling them "fun
things". They wouldn't be findings that should be taken as scientific fact, but
something worth researching and verifying further. As I already said, it would be
inappropriate to share any of the scientific experiment results before I have triple
checked them and am certain about their validity.
Update 1
2020/11/24 It's been quite some time since I made the original video about my input lag experiment. I wanted to give an update on progress and share some extra information that I was deliberately withholding until the experiment was finished. I'll also
talk about the methodology I'll be using to analyze the dataset. If you want to get a mail
on new updates you can still send me a mail with the subject "Notify me" to experiment@rocketscience.fyi to get added to the mailing list.
Progress
Overall, the experiment has been a success. There were over 700 submitted results which
is more than what you'd reasonably be able to get in any laboratory setting. Thank you
for your participation! Unfortunately, there isn't too much I can report on the
progress. It's been going pretty slow, mostly due to personal reasons, but I also don't
want to proclaim anything too early. As such, I cannot even tell you yet how many of the submitted results are usable data points. More on what that entails is in the
section about the methodology. Without that part being done, I have of course not
calculated any of the interesting tests on the results.
Experiment details
While the experiment was ongoing, I was asked many times about how much input lag I
added, as well as whether the user could be given a score on how well they guessed input
lag rather than just how well they did. I kept those values a secret, as I didn't want
users to be influenced in any way. There are those that claim they can sense 1 ms of
input lag, so if they knew how much I added, they might end up just choosing only 0 or 6
(on the 0-6 scoring scale). There are those that claim anything below 100 ms is
irrelevant and might only select 0 because of what they know. And lastly, if some people knew how poorly they were doing and told others, those others might conclude that the amount is unnoticeable anyway and end up choosing something random or not even trying.
So how much did I add? I added 0, 1, 2, 4, or 6 physics ticks of input lag, which is 0, 8.3, 16.7, 33.3, or 50 ms. If you recall, the way the experiment was structured, there were 3 different graphics settings in random order: your own, minimum, and maximum, with 50 shots each. One scenario consisted of the 5 shots of the training pack done in random order with the same input lag, after which you were asked what you thought the input lag was. So that's 10 scenarios per graphics setting, and there were 5 different input lag amounts; each showed up exactly twice for everyone, but the order is completely random, and getting the same input lag twice in a row is very possible. The idea is, of course, that all of these random orders balance each other out over the large dataset, so it isn't biased towards any graphics setting, input lag scenario, or shot. An individual's experiment does not have this measure of safety, which is another reason I shied away from giving any individual statistics besides a total score.
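For illustration, here is my reading of that structure as a small randomization sketch (not the actual plugin code):

```python
import random

LATENCY_TICKS = [0, 1, 2, 4, 6]          # 0, 8.3, 16.7, 33.3, 50 ms at 120 Hz physics
GRAPHICS = ["own", "minimum", "maximum"]
SHOTS = [1, 2, 3, 4, 5]                  # the 5 shots of the training pack

def build_schedule(rng: random.Random):
    """Per graphics setting, each latency level appears in exactly 2 scenarios
    of 5 shots; graphics settings, scenarios and shots are all shuffled."""
    schedule = []
    graphics_order = GRAPHICS[:]
    rng.shuffle(graphics_order)
    for graphics in graphics_order:
        scenarios = LATENCY_TICKS * 2    # each latency level twice -> 10 scenarios
        rng.shuffle(scenarios)
        for latency in scenarios:
            shot_order = SHOTS[:]
            rng.shuffle(shot_order)
            schedule.append((graphics, latency, shot_order))
    return schedule

print(build_schedule(random.Random(42))[:3])
```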
There were a couple of suggestions and criticisms made about the experiment. The most
common was that the test is not optimal for noticing input lag. I was very well aware of that and discussed it with my supervisor. Unfortunately, there is no such thing as a perfect experiment, and the primary goal of this one is not whether or not you can notice the lag. The goal is to see how performance degrades and whether different levels of visual effects (graphics) change how input lag affects the player. The task was chosen so that the player has a clear goal to execute and the extent to which they're able to do it is easily measurable. This is not really possible with dribbles or something of the kind.
Other points were regarding the length. Too little warmup, too few shots per input lag scenario,
etc. Again, unfortunately it's not really possible to design the perfect experiment. The longer I make it, the fewer people are going to participate, which is also problematic in many ways.
The randomization is supposed to take care of the learning that goes on over the course of
the experiment. Although a single person's performance will change over time, each scenario
has the same chance of being affected by that. This does of course not get rid of the problem
of constantly changing input lag. A player is likely going to perform better at a constant
input lag that they have time to get used to. To set up an experiment that takes this into
account, however, is almost impossible. I'd argue that if your input lag changes, it could
take days of readjusting your muscle memory until you're at the peak performance that you
could be at that input lag. Although I would love to see such a long term study with players
subjecting themselves to different input lag over time, I doubt it will ever happen. This
experiment just focuses on short-term effects, and I acknowledge that players can likely perform better if they have time to adjust to a given input lag.
Explaining the analysis
This will contain quite a few terms that you won't know if you don't have a scientific
background, so you might have to do some googling. I don't think that I would do a
particularly good job at explaining those anyway, so I'll leave it like that.
There are a couple of things that have to be taken into account for the analysis.
Because we are not in a laboratory setting, we cannot control for every variable and
thus the baseline input lag that everyone has on their system is not going to be the
same. Furthermore, we don't exactly know what each player's input lag is, as we can't
measure it for everyone. The baseline input lag is still very important. As you can
imagine, if I play with 1 second of lag and add a further 50 ms, it will not decrease my
likely terrible performance much further. On a system with an incredibly low input lag
of 5 ms, an extra 50 ms will make a huge difference.
So we have to estimate baseline input lag in some form. For that, I've collected multiple
performance metrics of the game while the experiment is going. These should, with my additional
testing, allow me to very accurately determine the amount of in-game/in-engine lag there is.
That's only one aspect though. Monitors and input devices are a large part of input lag too.
That's why I tracked those too. Through my own testing and a couple of other high-quality sources, I hope to get an accurate estimate for many of the devices used. A large number of results will still have to be thrown out because there is no way to reliably estimate their input lag.
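Conceptually, the baseline estimate is just the sum of the component delays, something like this (made-up numbers and field names, purely for illustration):

```python
from dataclasses import dataclass

@dataclass
class BaselineLag:
    input_device_ms: float   # controller / mouse delay
    game_ms: float           # in-engine time from input sample to finished frame
    monitor_ms: float        # monitor signal delay from reviewer data

    def total_ms(self) -> float:
        return self.input_device_ms + self.game_ms + self.monitor_ms

# Made-up example values, not real measurements.
example = BaselineLag(input_device_ms=4.0, game_ms=25.0, monitor_ms=3.0)
print(f"Estimated baseline: {example.total_ms():.1f} ms")
```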
With the input lag out of the way, let's talk about the things that are going to be
evaluated. First the players will be split into groups of different skill levels. How
large of a range one skill group is will depend on the final count of usable results at
each skill level. Then we'll run a mixed-design ANOVA on the three factors we want to check for "impact" on the player: latency, graphics, and FOV. FOV here refers to the real-life field of view of the player, which depends on how large the monitor is and how far you sit away from it. Since it's constant for a player, this is our between-subject variable. Graphics and input lag are within-subject variables, since every subject plays with every setting. The "impact" on the player will be measured by four dependent variables: the score (which is largely accuracy with a bonus for power), accuracy alone, power alone, and the subjective input lag perception. A significance level of 0.05 will be used.
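As a rough sketch of what such a test looks like in code, here is a simplified version with only latency as the within-subject factor and made-up example data, using the pingouin library (column names are assumptions, not the real dataset):

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Made-up data in the shape the analysis needs: one row per participant and
# latency level, with the participant's FOV group as the between-subject factor.
rng = np.random.default_rng(0)
latencies = [0.0, 8.3, 16.7, 33.3, 50.0]
rows = []
for p in range(60):
    fov_group = "small" if p % 2 else "large"
    for lat in latencies:
        rows.append({"participant_id": p,
                     "fov_group": fov_group,
                     "latency": lat,
                     "score": rng.normal(605 - 1.5 * lat, 122)})
df = pd.DataFrame(rows)

# Mixed-design ANOVA: latency within-subject, FOV group between-subject.
# (The full analysis also includes graphics as a second within-subject factor.)
aov = pg.mixed_anova(data=df, dv="score", within="latency",
                     subject="participant_id", between="fov_group")
print(aov)   # p-values are compared against the 0.05 significance level
```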
There is a chance that I will run further analysis beyond my thesis. However, there is
obviously a good scientific reason for why anything beyond should be taken with 2 grains
of salt. When you throw a bunch of tests at a dataset, you're bound to find something that isn't actually there and is just random variance. It's also not a good idea to try to extract conclusions about things the experiment was never designed to measure. So anything extra that I do check should not be treated as scientific proof, but as an indicator of what a proper experiment in that direction might return.