How to Quickly Build a Gesture Recognition System
Gesture recognition is a key part of the future of design, and is poised to become the next inflection point in how we interact with devices.
Gesture-based interactions are already prevalent in AR and VR devices; for example, here are some available interactions from Microsoft HoloLens. But gestures have the potential to make a far-reaching impact beyond these specialized use cases: imagine interacting with everyday objects and machines with just a motion of your hand, instead of pushing buttons or turning knobs. This future may not be as far off as it seems. The principal driver behind the progress in this space is state-of-the-art computer vision technology that enables machines to recognize human gestures.
At Manifold, some of our clients are exploring opportunities to introduce gestures into their products in novel ways and across various industries. We have collaborated with multiple client companies to deliver gesture interactions quickly and reliably, and have learned some things along the way that may be useful to others hoping to experiment with these capabilities.
Building a gesture recognition system may sound hard, as if you need hundreds of researchers, but that's not the case. In this post, we share how to build a system quickly and with a small team. We hope you go on to experiment with gesture recognition on your own, and learn how to approach new product development that relies on computer vision technology.
Challenges in Gesture Recognition
Gesture recognition is a challenging problem for a few reasons. First, the human hand has a complex shape, with fingers that move independently, so segmenting the hand out correctly is difficult. Second, we must overcome self-occlusion: fingers may hide each other in different gestures. Finally, the motion of the hand can vary significantly for the same gesture; one person's swipe, for example, is very different from someone else's. In fact, even your own swipe may differ from one day to the next.
In natural use, gestures flow into one another without pauses, and handling these fluid transitions is an important aspect of gesture recognition that is often ignored by the research community. In most datasets, gestures are performed with a clear beginning and end. In a real-world situation, identifying the beginning and end of a gesture can be a real challenge, which can sometimes be mitigated by good UX design.
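To make the boundary problem concrete, here is a minimal sketch of one simple way to find where a gesture starts and ends: threshold the speed of a tracked hand position, with a higher threshold to start a segment than to end it. This is an illustration, not the method we or any vendor use. The function name, the thresholds, and the assumption that you already have normalized (x, y) hand positions per frame are all hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed: normalized (x, y) hand position per frame


def segment_gestures(
    positions: Sequence[Point],
    start_speed: float = 0.05,   # hypothetical: per-frame speed that opens a segment
    stop_speed: float = 0.02,    # hypothetical: speed below which the hand counts as still
    min_still_frames: int = 3,   # still frames required before a segment is closed
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) pairs for spans where the hand is moving."""
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first frame of the open segment, if any
    still = 0     # consecutive still frames seen inside the open segment
    for i in range(1, len(positions)):
        (x0, y0), (x1, y1) = positions[i - 1], positions[i]
        speed = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        if start is None:
            if speed >= start_speed:
                start, still = i - 1, 0
        elif speed < stop_speed:
            still += 1
            if still >= min_still_frames:
                # Close the segment at the last frame that was still moving.
                segments.append((start, i - still))
                start = None
        else:
            still = 0
    if start is not None:
        # The stream ended while a gesture was still in progress.
        segments.append((start, len(positions) - 1 - still))
    return segments
```

Using two thresholds keeps a brief slowdown in the middle of a swipe from splitting it into two gestures. A real system would likely need smoothing and per-user tuning on top of this.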
Our System Requirements
Below are high-level system requirements for the gesture recognition system we have in mind:
- User is facing the camera, standing three to six feet away from the camera. This short clip shows an example of the video stream input into our gesture module. Note that this is the opposite of most AR/VR setups, where the camera typically is on the headset.
- The system can use advanced compute and hardware specs for better performance. This is also different from the typical AR/VR headsets, which are generally constrained by compute and power efficiency.
- The application requires real-time gesture detection since the user interactions must be fluid and feel natural.
- There are two types of defined gestures:
  - Static gestures: Examples include thumbs-up and palm open or closed. Often, a single frame is sufficient to produce inference. It's not necessary to process motion to classify a static gesture.
  - Dynamic gestures: Examples include swiping left/right and making a clicking motion. Motion analysis is required to classify a dynamic gesture.
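The split between static and dynamic gestures suggests a simple structure: run a per-frame classifier on every frame, and run a motion classifier over a sliding window of recent frames. The sketch below shows that routing with toy classifiers. The class name, the window size, the (hand_x, openness) frame format, and the thresholds are assumptions made for illustration only.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Sequence, Tuple

Frame = Tuple[float, float]  # assumed per-frame features: (hand_x, palm_openness)


class GestureRouter:
    """Feeds each frame to a static classifier and a window of frames to a dynamic one."""

    def __init__(
        self,
        static_fn: Callable[[Frame], Optional[str]],
        dynamic_fn: Callable[[Sequence[Frame]], Optional[str]],
        window: int = 16,  # hypothetical window length, in frames
    ) -> None:
        self.static_fn = static_fn
        self.dynamic_fn = dynamic_fn
        self.buffer: Deque[Frame] = deque(maxlen=window)

    def push(self, frame: Frame) -> List[str]:
        """Process one frame and return any gestures detected on it."""
        self.buffer.append(frame)
        found: List[str] = []
        static = self.static_fn(frame)  # a single frame is enough here
        if static is not None:
            found.append(static)
        if len(self.buffer) == self.buffer.maxlen:
            dynamic = self.dynamic_fn(list(self.buffer))  # needs motion over time
            if dynamic is not None:
                found.append(dynamic)
                self.buffer.clear()  # avoid reporting the same motion twice
        return found


def toy_static(frame: Frame) -> Optional[str]:
    """Toy rule: call the palm open when the openness score is high."""
    return "palm_open" if frame[1] > 0.8 else None


def toy_dynamic(window: Sequence[Frame]) -> Optional[str]:
    """Toy rule: classify a swipe from net horizontal displacement."""
    dx = window[-1][0] - window[0][0]
    if dx > 0.25:
        return "swipe_right"
    if dx < -0.25:
        return "swipe_left"
    return None
```

In a real system the two toy functions would be replaced by trained models; the routing around them stays the same.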
Our Approach
At Manifold, we always take a lean, iterative approach, and suggest you do the same. This often starts with a set of open-source components that get us to a baseline system. We also explore opportunities to fork these and, only then, do we evaluate components that could be improved by licensing paid alternatives. Having a baseline system also enables us to (1) compare the performance and trade-offs among different choices and (2) design the interfaces with other modules.
Having an end-to-end system also makes it easy for us to plug in other components if needed and evaluate performance from an overall system perspective. Our approach is similar in spirit to the stories in Creative Selection by Ken Kocienda about how the original iPhone team built the first smartphone web browser as a fork of the open-source Konqueror browser with a small team.
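One way to keep components swappable is to fix a small interface early and have every candidate, open-source or licensed, implement it. The sketch below is a hypothetical interface, not our production API: the names GestureModule, Detection, and run_pipeline, and the choice of raw bytes for a frame, are all assumptions.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Protocol


@dataclass
class Detection:
    label: str         # e.g. "thumbs_up" or "swipe_left"
    confidence: float  # score in [0, 1]


class GestureModule(Protocol):
    """Anything with this method can be plugged into the pipeline."""

    def process(self, frame: bytes) -> Optional[Detection]:
        ...


class AlwaysIdleBaseline:
    """Trivial baseline: never reports a gesture. Useful for wiring up the system."""

    def process(self, frame: bytes) -> Optional[Detection]:
        return None


class FixedLabelStub:
    """Stand-in for a real model: reports the same gesture on every frame."""

    def __init__(self, label: str) -> None:
        self.label = label

    def process(self, frame: bytes) -> Optional[Detection]:
        return Detection(self.label, 1.0)


def run_pipeline(module: GestureModule, frames: Iterable[bytes]) -> List[Detection]:
    """Run any module over a stream of frames and collect its detections."""
    detections: List[Detection] = []
    for frame in frames:
        result = module.process(frame)
        if result is not None:
            detections.append(result)
    return detections
```

Because run_pipeline depends only on the interface, swapping a baseline for a vendor SDK wrapper does not change the code around it, which makes side-by-side comparison straightforward.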
Build Options
Here are two simple but promising paths to building your own system.
- OpenCV-based solutions
  - You can quickly build a basic prototype using OpenCV, such as this one. This works well for static gestures, such as detecting open palm or closed fist, under relatively clean conditions like below.
  - The limitations of the above approach are:
    - Unsatisfactory performance when the background is cluttered
    - Lighting has a significant impact on accuracy
    - It's limited to static gestures; we would need to build a separate solution for dynamic gestures
- Pose-based solutions
  - Another approach is based on pose estimation. The system includes a pose module that tracks body movements. You can leverage that to build a classical ML model that accepts a time series of keypoint positions as input, and outputs a detected gesture.
  - The system block diagram is shown below.
  - There are open-source trial versions of this approach; see this one for an example.
  - This approach also has limitations. While initial results may look promising, you will have to collect a large training dataset to meet high accuracy requirements. There are also many hyperparameters that must be carefully tuned for this approach to work well.
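As a rough illustration of the pose-based path, the sketch below turns a time series of keypoint positions into a few summary features and classifies them with a nearest-centroid model, one of the simplest classical ML methods. It assumes a pose module already supplies one normalized (x, y) wrist position per frame. The feature set and class names are hypothetical, and a production model would need far more data and richer features.

```python
from math import dist
from typing import Dict, List, Sequence, Tuple

Keypoint = Tuple[float, float]  # assumed: normalized (x, y) wrist position per frame


def featurize(track: Sequence[Keypoint]) -> Tuple[float, float, float]:
    """Summarize a keypoint time series as (net dx, net dy, path length)."""
    dx = track[-1][0] - track[0][0]
    dy = track[-1][1] - track[0][1]
    path = sum(dist(a, b) for a, b in zip(track, track[1:]))
    return (dx, dy, path)


class NearestCentroid:
    """Classifies a track by the closest per-class mean feature vector."""

    def __init__(self) -> None:
        self.centroids: Dict[str, Tuple[float, ...]] = {}

    def fit(self, tracks: Sequence[Sequence[Keypoint]], labels: Sequence[str]) -> "NearestCentroid":
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = {}
        for track, label in zip(tracks, labels):
            features = featurize(track)
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] = counts.get(label, 0) + 1
        # Each centroid is the mean feature vector of one gesture class.
        self.centroids = {
            label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()
        }
        return self

    def predict(self, track: Sequence[Keypoint]) -> str:
        features = featurize(track)
        return min(self.centroids, key=lambda label: dist(features, self.centroids[label]))
```

A model this small makes the data problem visible quickly: with only a handful of training tracks per gesture, the centroids shift noticeably from person to person, which is why a large, varied dataset matters.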
Buy Options
There are many off-the-shelf options available in the market for gesture recognition. We've studied most of them; here is a quick overview:
- Gestoos is a San Francisco-based technology company developing computer-vision-based gesture systems. They supply gesture systems for automotive, retail, and consumer electronics products. They have a C++ SDK that can be integrated into products.
- TwentyBN is an AI technology provider with offices in Berlin and Toronto. They provide pre-trained Deep Learning models for a number of use cases. They also have a crowd-sourcing data platform which can be used to gather large training datasets for custom use cases.
- Motion Gestures is a technology provider. Initially they provided gesture systems based on the motion sensors in mobile phones, but they recently added a computer-vision-based gesture system to their catalog.
- Leap Motion is a San Francisco-based company that is well known in the AR/VR space. They provide hand and gesture tracking systems that are very power-efficient.
We analyzed solutions along a few different dimensions, including: How close is their training dataset to our use case? What is the highest frame rate (FPS) at which we can run inference? What are the compute resource requirements of the gesture module?
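The frame rate question is easy to answer empirically once each candidate sits behind a single inference call. Here is a minimal timing sketch; the measure_fps helper and its warmup parameter are our own illustration, and infer stands in for whatever call a given SDK or model exposes.

```python
import time
from typing import Any, Callable, Iterable


def measure_fps(infer: Callable[[Any], Any], frames: Iterable[Any], warmup: int = 5) -> float:
    """Return the average frames per second that `infer` sustains on `frames`."""
    frames = list(frames)
    # Warm-up calls are not timed; first calls often include one-time setup cost.
    for frame in frames[:warmup]:
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

Run it on the target hardware with recorded frames from your own camera setup, since throughput on a development machine may say little about the deployed system.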
In our experience, TwentyBN's system most closely matches the requirements described above. Their data collection methodology is also promising, in case requirements change in the future. We've also found TwentyBN's support team to be responsive, and have previously worked closely with them to make customizations to match requirements more closely.
Beyond the specifics of this project, we strongly recommend the general principle described above: build an end-to-end baseline early to clarify your product requirements and interfaces.
Use the formula above to get up and running with gesture recognition. While this is the fastest path to get a usable system, we're sure there are ways it can be improved. If you find some of those ways, let us know!
Microsoft — Gestures
Touchless Sensing & Gesture Recognition