This project was an assignment for the "Hello, Computer: Unconventional Uses of Voice Technology" course at ITP in Fall 2020.
The assignment was to create something that takes non-speech input from a person and responds with speech synthesis.
I decided to create an app that observes the user's face through the computer's camera, and speaks about it.
Here's the Live Demo on Glitch.
How it works
When the app's page loads, it asks for your permission to access the camera on your computer, which you need to grant to use the app. After loading is done, you will be asked to click on the page to start the app. This interaction mainly exists to initialize the audioContext; without it, the user may not hear any sound from the browser.
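The click-to-start step can be sketched roughly like this (a minimal sketch; `attachAudioUnlock` is a hypothetical helper, not the app's actual code — browsers suspend audio until a user gesture, and `AudioContext.resume()` lifts that):

```javascript
// Hypothetical helper: resume a suspended AudioContext on the first
// click, since browsers block audio output until a user gesture.
function attachAudioUnlock(audioContext, target) {
  const unlock = () => {
    if (audioContext.state === 'suspended') {
      audioContext.resume();
    }
    // Only needed once, so remove the listener afterwards.
    target.removeEventListener('click', unlock);
  };
  target.addEventListener('click', unlock);
}

// In the browser: attachAudioUnlock(new AudioContext(), document.body);
```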
When the app starts, the computer gives a brief greeting and introduction; the user hears its voice and sees the text on the page at the same time. Then it begins observing the user's face through the camera and speaking about it.
The observation follows the steps below. At each step, the computer speaks about the result and, depending on it, moves to the next step or falls back to the previous one:
Check if the user's face is present in the camera.
Check if the user's face is close enough to the camera.
Check if the user is looking at the camera. (Or, more accurately, if the user is directly facing the camera)
Check if the user's mouth is closed.
After all these conditions are met, the computer will start praising the user for following its requests. (e.g. "You are the best human I've ever seen")
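The step logic above can be sketched as a tiny state machine (the names and shape here are illustrative assumptions, not the app's actual code):

```javascript
// Illustrative step order; reaching index 4 means every check has
// passed and the computer switches to praise mode.
const STEPS = ['facePresent', 'faceCloseEnough', 'facingCamera', 'mouthClosed'];

// Advance one step when the current check passes, fall back one step
// when it fails; clamp to the valid range.
function nextStep(step, passed) {
  return passed
    ? Math.min(step + 1, STEPS.length)
    : Math.max(step - 1, 0);
}
```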
Process & Thoughts
For the face tracking part, I reused the code I wrote for the Face DJ app I created last semester. It uses the TensorFlow Facemesh library, which I think is pretty accurate and performant. So my work this time was basically implementing the speech part on top of the face detection data (e.g. face position) that I already had.
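For a sense of how a signal like "is the mouth closed" can be derived from the mesh Facemesh returns, here is a sketch (landmark indices 13 and 14 are the inner upper/lower lip points in the 468-point mesh; the threshold is an illustrative guess, not the value the app uses):

```javascript
// Each mesh point from Facemesh is an [x, y, z] triple in pixels.
// The vertical distance between the inner lips approximates how
// open the mouth is.
function mouthOpenness(mesh) {
  const upperLip = mesh[13];
  const lowerLip = mesh[14];
  return Math.abs(lowerLip[1] - upperLip[1]);
}

// Illustrative threshold; a real value would be scaled by face size.
function isMouthClosed(mesh, threshold = 5) {
  return mouthOpenness(mesh) < threshold;
}
```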
The implementation of the speech part ended up with many conditions (over 200 lines) checking what the computer previously said against the currently observed state. I'm sure there's a simpler, cleaner way to do the same thing, but I didn't spend much time organizing the structure of the code since it's a small project anyway.
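One way that "what did I say last" bookkeeping could be shrunk is a closure that remembers the previous message (a sketch, not the project's actual structure; `speak` stands in for the Web Speech API call):

```javascript
// Wrap a speak function so the same message is never repeated twice
// in a row, which keeps the surrounding conditional logic smaller.
function makeSpeaker(speak) {
  let lastMessage = null;
  return (message) => {
    if (message === lastMessage) return false; // already said this
    lastMessage = message;
    speak(message);
    return true;
  };
}

// In the browser, speak could be:
// (text) => speechSynthesis.speak(new SpeechSynthesisUtterance(text));
```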
The purpose of this project was to explore Text-To-Speech (TTS) technology and create something fun with it, rather than something useful. So I hope people enjoy using the app; I doubt anyone would want to be observed by a computer in a serious manner.
I think an interesting use case for this app would be helping people who are easily distracted (e.g. people with ADHD) stay focused while working in front of the computer, since the app notifies the user as soon as they look away from the screen or try to move away from it.