In my last Artificial Intelligence (AI) column, we reviewed challenges that arise either when there is insufficient data to learn from or when the data supplied has been intentionally pre-sorted in a way that prevents the Machine Learning (ML) algorithm from recognizing all the patterns it needs to classify the data successfully. In this post we will continue that thread with an eye toward the impact of ML implementations on power consumption and latency.
ML is now part of our everyday life. ‘Appliances’ such as Amazon’s Alexa use ML to find and play our favorite music, and generative AI tools like ChatGPT can instantly draft a letter, with relevant references, to help a physician negotiate payment from a healthcare insurer for a test or service that may have been declined. While these are excellent use-cases for ML, it is important to understand both the capabilities and the limitations of this technology in order to apply it effectively in medical products, in a way that meets all functional requirements while delivering the desired user experience.
Let’s consider a “smart” toothbrush that connects with a mobile app that shows how much time the user has spent brushing each area of their teeth. To track the location of the toothbrush in the mouth, we will evaluate two approaches, and will consider each with and without ML.
The IMU Approach
The first approach uses a multi-axis Inertial Measurement Unit (IMU) chip with an accelerometer and gyroscope to provide feedback on the attitude and position of the toothbrush as the user brushes their teeth. An algorithm could be written to correlate the sensor data with the location of the toothbrush, but this may take significant time: it would need to be tested with many different users, capturing data and watching each user during brushing, to ensure that it works well for lefties, righties, and folks of all ages.
If we instead try an ML approach to determine the position in the mouth, users would once again be observed while brushing, with their real-time IMU data captured and the brushing location labeled by the observer. However, instead of a programmer analyzing the IMU data to find associations between the data and the location of the toothbrush in the mouth, the data would be used to train an ML model so that it learns the IMU characteristics associated with the toothbrush being in each quadrant of the mouth.
While this also requires labeling datasets for many users, the development effort is very different. Under the ML approach, no hand-written algorithm is needed, because the classification of which quadrant is being brushed comes from the ML model. It is important to understand that this will require a large number of labeled datasets to ensure that the ML model has ‘seen’ examples of every variety of tooth-brushing, from those who hold the toothbrush in a vertical orientation to a child who dances while clenching the brush between their teeth.
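To make the idea of learning from labeled IMU data concrete, here is a minimal sketch using a nearest-centroid classifier. This is only one simple technique chosen for illustration, not necessarily what a real product would use. The feature choice (mean pitch and roll per window), the quadrant names, and every number in `TRAINING` are assumptions invented for this example.

```python
import math
from collections import defaultdict

# Hypothetical labeled windows: (mean pitch, mean roll) in degrees -> quadrant.
# All values are invented; real data would come from observed brushing sessions.
TRAINING = [
    ((30.0, 40.0), "upper-left"), ((34.0, 44.0), "upper-left"),
    ((28.0, -42.0), "upper-right"), ((33.0, -38.0), "upper-right"),
    ((-31.0, 41.0), "lower-left"), ((-27.0, 37.0), "lower-left"),
    ((-30.0, -40.0), "lower-right"), ((-35.0, -43.0), "lower-right"),
]


def train_centroids(samples):
    """Average the feature vectors for each label (the 'learning' step)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (pitch, roll), label in samples:
        acc = sums[label]
        acc[0] += pitch
        acc[1] += roll
        acc[2] += 1
    return {label: (p / n, r / n) for label, (p, r, n) in sums.items()}


def classify(centroids, features):
    """Return the label whose centroid is closest to the new feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


centroids = train_centroids(TRAINING)
prediction = classify(centroids, (29.0, 39.0))  # a new, unlabeled window
```

Notice that nobody wrote a rule such as "positive pitch means upper teeth." The mapping from sensor values to quadrant comes entirely from the labeled examples, which is why the breadth of those examples matters so much.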
The size and breadth of this dataset are very important. If our ML dataset does not contain multiple samples from a sufficient cross-section of users, then we will have an overfit situation: the model will have been trained on only a subset of the user-types that it will encounter in the world, and it will perform poorly on the rest. This is reminiscent of the language barrier I encountered during my travels to Paris with my limited French vocabulary. Since my internal French-English dictionary was limited, this ‘overfit’ restricted my ability to decode words and phrases and to converse effectively with those who spoke French fluently.
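One way to expose this problem before a product ships is to hold out an entire user group during training and then test on it. The sketch below does this with a 1-nearest-neighbor classifier. The data are synthetic, and the assumption that left-handed users produce the particular angles shown is invented purely to illustrate the effect.

```python
import math

# Synthetic (pitch, roll) features in degrees; every value is hypothetical.
RIGHT_HANDED_TRAIN = [
    ((30.0, 40.0), "left side"), ((32.0, 38.0), "left side"),
    ((30.0, -40.0), "right side"), ((28.0, -42.0), "right side"),
]
LEFT_HANDED_TRAIN = [
    ((-60.0, -10.0), "left side"), ((-58.0, -12.0), "left side"),
    ((-60.0, 80.0), "right side"), ((-62.0, 78.0), "right side"),
]
LEFT_HANDED_TEST = [
    ((-59.0, -11.0), "left side"), ((-61.0, 79.0), "right side"),
]


def nearest_label(training, features):
    """1-nearest-neighbor: copy the label of the closest training sample."""
    _, label = min(training, key=lambda item: math.dist(item[0], features))
    return label


def accuracy(training, test):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(nearest_label(training, f) == label for f, label in test)
    return hits / len(test)


# Trained on right-handed users only, then tested on unseen left-handed users.
accuracy_narrow = accuracy(RIGHT_HANDED_TRAIN, LEFT_HANDED_TEST)
# Trained on both groups, tested on the same left-handed samples.
accuracy_broad = accuracy(RIGHT_HANDED_TRAIN + LEFT_HANDED_TRAIN, LEFT_HANDED_TEST)
```

With these made-up numbers, the narrowly trained model gets every left-handed sample wrong, while the broadly trained one gets them right. Real data will not be this clean, but the same held-out-group test can reveal which user types are missing from the training set.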
The Image Recognition Approach
An alternative to the IMU method would be to add a camera to the toothbrush so that snapshots of teeth can be used to identify which tooth is being brushed. While this may seem silly or inefficient, it is a more direct approach: identifying the teeth in view identifies the quadrant of the mouth, so no observer needs to watch users brush in order to label data. The challenges here are the poor image quality caused by toothpaste in the mouth and the wide variety of dental conditions, including braces, caps, partials, and more. As with the IMU approach, either an algorithm could be developed or an ML model could be built to identify the tooth in each image by comparing it to a library of tooth data. But what might this data look like for each of these approaches?
For the algorithm approach, we might write a program to analyze many different photos of teeth and find unique features that can be used to identify a specific tooth. This might be done in a manner similar to that used to detect a biometric like a fingerprint to gain access to a smartphone or a building, where the full image is not used; instead, only key characteristics of the image are analyzed to match a person to their fingerprint. This approach serves as a shorthand: comparison against a large set of full images is not required, which can reduce the time and power needed to make a match. Note that with fingerprint detection, fewer features are required to uniquely identify my fingerprint amongst only my coworkers than to identify it amongst all people in the U.S. Following this train of thought, we could limit the scope of tooth identification: the key characteristics needed to identify teeth for users up to age 20 might be significantly fewer than those required to cover all tooth-situations across an entire population.
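The relationship between population size and the number of features needed can be illustrated with some idealized arithmetic. The sketch below assumes each feature is a perfectly reliable, independent yes/no measurement, which real fingerprint or tooth features are not, so treat the results as a lower bound rather than a design figure. The office size and the rough U.S. population are assumptions chosen for the example.

```python
def min_binary_features(population_size):
    """Idealized lower bound on the number of yes/no features needed to give
    every member of a population a distinct code (ceil(log2(N)), at least 1)."""
    if population_size < 1:
        raise ValueError("population_size must be at least 1")
    # (N - 1).bit_length() equals ceil(log2(N)) for N >= 2, using integers only.
    return max(1, (population_size - 1).bit_length())


coworkers = min_binary_features(50)          # hypothetical office of 50 people
national = min_binary_features(330_000_000)  # rough U.S. population (assumption)
```

Under these assumptions, 6 ideal features separate 50 coworkers, while 29 are needed for roughly 330 million people. A practical system would need more in both cases to tolerate noise, but the gap shows why narrowing the target population can shrink the feature set.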
For the ML approach, the model will require ‘training’ using a large number of tooth photos, some with a toothpaste slurry and some without. Note that a classifier such as the Support Vector Machine (SVM) in effect compares each captured image against examples retained from training (our ‘stack’ of images in memory), so a captured image that differs substantially from everything in that ‘stack’ is likely to fail to match.
An example of how literal this comparison can be was demonstrated by a DeepMind executive during a live presentation, where a dumbbell from a gym was incorrectly identified (classified) until he put his hand on the dumbbell, because all of the labeled training photos of dumbbells had included a person’s hand. So, in our toothbrush example, a metal wire from braces surrounding a tooth may cause a photo of that tooth not to match the standard photo, which underscores the need for a large number of photos to use this approach successfully.
While this is a direct approach to learning the location of the toothbrush in the mouth, it suffers challenges from both the likely poor image quality and from another factor: latency, which is the time required to determine which tooth is in view. Battery-powered products are unlikely to be able to do rapid pattern matching against a library of tooth photos, thus the user may have moved to a new location in the mouth before the algorithm has completed its identification of the last tooth. Although the processing could be done in the cloud, the power required to transmit every image frame may also be burdensome on the battery-life of the toothbrush, and there are latency as well as privacy and security concerns if a toothbrush is streaming video from our homes.
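A quick budget calculation can make these latency and power trade-offs tangible early in an architecture review. The sketch below is a back-of-the-envelope model only. Every number in it (inference time, how long the brush dwells on a tooth, frame size, frame rate, session length, and radio energy per bit) is a hypothetical placeholder that would need to be replaced with measurements from real hardware.

```python
def identification_keeps_up(inference_ms, dwell_ms):
    """True if a tooth can be identified before the brush is likely to move on."""
    return inference_ms < dwell_ms


def transmit_energy_mj(frame_bytes, frames_per_second, seconds, nanojoules_per_bit):
    """Energy in millijoules needed to radio every frame off the device."""
    bits = frame_bytes * 8 * frames_per_second * seconds
    return bits * nanojoules_per_bit / 1_000_000  # convert nJ to mJ


# Hypothetical on-device case: 800 ms per identification, 500 ms dwell per tooth.
on_device_ok = identification_keeps_up(inference_ms=800, dwell_ms=500)

# Hypothetical cloud case: 20 kB frames at 5 frames/s for a 2-minute session,
# with a placeholder radio cost of 100 nJ per transmitted bit.
session_energy_mj = transmit_energy_mj(
    frame_bytes=20_000, frames_per_second=5, seconds=120, nanojoules_per_bit=100
)
```

With these placeholder values, the on-device path cannot keep up with the user, and the cloud path spends 9,600 mJ (9.6 J) per session on radio transmission alone, before counting the camera or the round-trip delay. Swapping in measured values turns the same few lines into a first-pass feasibility check.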
This example illustrates some challenges and considerations related to product architectures that might include ML.