The problem with speech to text is background noise and the many variations of speech. I’ve played around with a couple of models. I can get one to work with my voice with little effort in training, but when my window AC kicks in or my computer fan hits its highest setting, it becomes a problem because the training is very dependent on the noise floor. I suspect these systems are limited by the audio gear available in combination with the compute hardware needed to make them viable. Human hearing has a relatively large dynamic range, and we do natural analog filtering. A machine just doing math struggles with things like clipping from someone speaking too loudly, or with the periodicity of vehicle and background noise like wind, birds, and other people nearby. Everything that humans can contextualize is like a small learned program and alignment that took many years to train.
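To make the noise-floor point concrete: one common workaround is to mix recorded background noise into the training clips at random signal-to-noise ratios, so the model never locks onto a single fixed noise floor. This is only a minimal NumPy sketch; the `mix_at_snr` helper and the stand-in arrays are illustrative, not from any particular toolkit.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a speech clip at a target SNR (in dB)."""
    # Loop or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so the mixed signal hits the requested SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noise = noise * np.sqrt(target_noise_power / noise_power)

    mixed = speech + noise
    # Guard against clipping after the mix.
    peak = np.max(np.abs(mixed))
    if peak > 1.0:
        mixed = mixed / peak
    return mixed

# Example: augment a (synthetic) 1-second clip with "AC hum" at a random SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000).astype(np.float32) * 0.1    # stand-in for real speech
ac_noise = rng.standard_normal(8000).astype(np.float32) * 0.05  # stand-in for recorded AC noise
augmented = mix_at_snr(speech, ac_noise, snr_db=float(rng.uniform(0, 20)))
```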
You will not see the full use cases of AI for quite a while. The publicly facing tools are nowhere near the actual capabilities of present AI. If you read the introductory documentation for the Transformers library, which underlies almost all the AI tools you see in public spaces, it clearly states that it is a simplified tool that trades away complexity to make the codebase approachable to people in various fields. It is in no way a comprehensive implementation. People are forming opinions based on projects that are hacked together using Transformers. The real shakeups are happening in business, where companies like OpenAI are not peddling the simple public API; they are demonstrating the full implementations directly.
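For a sense of what that simplification looks like in practice, Transformers exposes a one-line `pipeline()` wrapper on top of its lower-level model and processor classes. A rough sketch (the Whisper checkpoint name is just an example, and actually running it requires downloading the model and having an audio backend installed):

```python
from transformers import pipeline, AutoProcessor, AutoModelForSpeechSeq2Seq

# High-level wrapper: preprocessing, decoding, and generation settings are hidden.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
# result = asr("clip.wav")  # path to a local audio file

# Lower-level classes: you handle feature extraction and generation yourself.
processor = AutoProcessor.from_pretrained("openai/whisper-small")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-small")
```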
OpenAI seems to be functioning.