My Misadventures with OpenCV, Tesseract and Among Us
Online tutorial used: Simple OCR with Tesseract by Andreas M M
Among Us is a simplistic social deduction game where a group of ten bean-like astronauts get dropped into a map and have to run around performing maintenance tasks, while trying to avoid two impostors who are tasked with killing off the crewmates. If a body is found or an emergency button is hit, a meeting is called where all the surviving players discuss and potentially vote out another player, the only way to remove impostors from the game.
This game came out over two years ago now, but recently exploded in popularity when members of the livestreaming community picked it up and ran with it. Players quickly moved to their own Discord servers to communicate during meetings. Given that these lobbies required ten people, streamers eventually began intermingling between their communities, giving rise to a tide of enjoyable crossovers that would never have happened otherwise.
There are some extremely archival-minded fans of the streamer Northernlion that have compiled a wealth of information on his streams and collaborations, such as a list of every guest and game played on the tri-weekly show the NLSS, details on which content-creator won over the course of a Let’s Play, or a collection of inside jokes within the community. I wish to contribute to this.
The goal: create a script that takes a VoD or recording of an Among Us stream, extracts the meeting screens, and parses the names of the players there to find out which streamers have played with each other.
Breaking it down
There will be two discrete parts to this program: the part that parses the VoD itself to find these pictures, and the optical character recognition (OCR) component. I planned to use the OpenCV library to process the video into images, and Tesseract to perform the OCR job itself.
Processing the Video
Even though the majority of the game rules within Among Us are variable, the streamer lobbies eventually settled on more or less the same rules, with a discussion time usually between 90 and 120 seconds. Also, given that participants will drop in and out between games over the course of a play session, it isn’t imperative to snag every in-game meeting (of which there are typically three or four per game). To be conservative, I initially set the timer to save an image every 90 seconds. This is what a meeting screen would look like:
From here, it was simple to use this as a base image to compare the other screenshots against and collate all the meetings in the VoD. For this, we are using the Structural Similarity Index (SSIM), which compares local patterns of luminance, contrast, and structure between two images. This is a little slower than something like the Mean Squared Error (MSE), which simply sums the squared differences in pixel intensities. Given how much pixel intensities can vary, especially when there are dead players in the lobby, such as in the bottom right of the image above, the MSE method may introduce false negatives.
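Here is how that comparison could look using scikit-image’s SSIM implementation. The 0.6 cutoff is a placeholder I made up, not a tuned value:

```python
import numpy as np
from skimage.metrics import structural_similarity  # scikit-image


def is_meeting_frame(frame_gray, base_gray, threshold=0.6):
    """Compare a grayscale frame against the base meeting screenshot.

    Returns (matched, score); the threshold is a guess to be tuned on
    real frames, since compression artifacts lower the score a bit.
    """
    score = structural_similarity(frame_gray, base_gray)
    return score >= threshold, score
```

Identical frames score 1.0, and unrelated gameplay frames fall well below any sensible cutoff, so a single threshold is usually enough to separate meetings from everything else.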
My First mediOCRe Attempt
Unfortunately, running the above picture through Tesseract did not go over well. Much more image processing was required.
I initially started by grayscaling and thresholding the whole image, but quickly realized that since the meeting layout is always the same, I could crop out the individual nameplates to remove much of the image’s noise. However, the output still wasn’t great. Further training was probably needed.
With some googling around, I found a program called jTessBoxEditor, which will help me refine ground truth data for Tesseract to train itself. The below picture is a peek under the hood of the characters Tesseract detected for “shubble”.
While going through this for all the other cropped images, a couple of things became apparent to me. Tesseract was struggling with letters that sat too close together, detecting them either as one whole character or treating the connected pixels as a new character. Also, for some letters it was only picking up the whitespace inside the letter and discounting much of the outer border.
We can kill two birds with one stone by using floodFill to fill the outer area of the image and then inverting it. This eliminates the problem of touching characters, though it generates some additional noise, like in the ‘b’ in the example below.
With a full complement of training data, it was relatively easy to fine-tune the ground truth in jTessBoxEditor and begin training Tesseract further. I had been using jTessBoxEditor at various points in my preprocessing exploration and found its initial detections highly inaccurate, forcing me to touch up every image; at this point, though, only a small amount of fine-tuning was needed to progress further in the tutorial mentioned at the beginning. It seems prudent to exhaust all your preprocessing options before turning to training the model, if only to save yourself the headache of slowly moving a box one pixel at a time to fit a letter.
There was a noticeable improvement in accuracy after the first round of training. I could have continued to add to the ground truth and refine it further, but that didn’t seem like a great use of my time; I decided a little human contextual help would bring this project the rest of the way. It’s also worth mentioning that some streamers enjoy changing their name between games, so human oversight would be needed regardless. The OCR model is better trained, but there are still inaccuracies, such as ‘TonyDataCo’ vs ‘TonyDaTaco’. To group these together, I’d like to introduce an old friend from my second internship: Levenshtein.
The Levenshtein distance is, in essence, the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into another, and it is a natural first step in measuring string similarity. At the time, Levenshtein was too simplistic for my needs, but as a quick and dirty method here it will do nicely.
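A quick and dirty version of that: the classic dynamic-programming Levenshtein distance, plus a greedy grouping pass. The `max_dist` cutoff of 2 is my own guess:

```python
def levenshtein(a, b):
    """Minimum insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from a[:0] to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute ca -> cb
            ))
        prev = cur
    return prev[-1]


def group_names(names, max_dist=2):
    """Greedily bucket OCR'd names that are within max_dist edits."""
    groups = []
    for name in names:
        for group in groups:
            if levenshtein(name.lower(), group[0].lower()) <= max_dist:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups
```

With a cutoff of two edits, ‘TonyDataCo’ and ‘TonyDaTaco’ land in the same bucket, which a human can then confirm at a glance.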
This was an engaging first foray into the world of computer vision and OCR models. I didn’t realize the extent to which image preprocessing plays into OCR accuracy. Ironically, I bought into the buzzword belief that one could simply train a model and use the training data itself as a hammer against any perceived nail I could find. Doing as much of the processing as possible before calling in the model seems to be the better option in this case.
Ideally, to take this further I’d investigate a way of downloading the VoDs directly off Twitch. The site offers a way of skipping to the portion of a VoD where the game being played changes, and if there is an API request that exposes that information, I could automate both the download and the analysis.