Speech recognition program - need conceptual approach

I currently have in test operation a program written in VB that uses SAPI for speech and the speech recognition facility. The PC is connected to a hardware device that the user will occasionally send hardware commands to by speaking into a mic. The program reports on the results of the operations by speaking to the user. All that works just fine. There's a limited grammar of about 25 words that are defined to the speech recognition engine. The issue is that the user in this situation is speaking continually to another person, and in the course of his speaking would say one of the words defined in the grammar and thus cause the program to react. What's needed is a keyword that wouldn't turn up in the user's normal speech that he would say prior to entering a spoken command. It would function much like the Alexa device, where the user first says "Alexa" before saying a command. So, the question is how to implement that function. (Don't need the code, just the approach.)

One approach might be to alter the grammar available to the speech recognition engine. On startup, the engine grammar would contain only the Alexa-style keyword. When that word was recognized, the program would add to the grammar all the other words that could be recognized. But it's not clear to me how I could then return the grammar to just the single keyword after the detailed processing was finished. There are ways to add to the grammar, but I couldn't find a way to subtract from the grammar. I suppose that I could just stop the engine and then restart it with just the single word in the grammar.

Another approach might be to start two separate speech recognition engines, one with a one-word grammar and one with the full grammar. The full-grammar engine could be stopped and started by the code associated with the one-word grammar. But it's not clear whether you can have two speech recognition engines operating at one time, nor whether the same mic input could be used for both.

Any suggestions on the best solution for this?
 
I have not worked with speech recognition, but have read some articles from time to time. I found this with a search now: Voice Recognition - Speech Recognition with .NET Desktop Applications
The example here is similar to what you ask: it has start-stop commands in one grammar and other functionality in another, and multiple grammars can be loaded and listened to at the same time. In SpeechRecognized it checks for start-stop first, toggles a Boolean field (speechOn), and uses this to determine whether to process other recognized speech. Similarly, you could have a single start keyword that enables recognizing commands, and following a recognized command you would set the Boolean back to False so that only the start word would be processed.
In short: one SpeechRecognitionEngine with multiple active grammars, where one of them is just the start command.
What should happen if the start word is detected but no command follows is up to you, but I would probably also use a Date field and set speechOn back to False if too much time passed between the start word and a command.
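Roughly like this, as an untested sketch (the wake word and the commandsArmed/armedAt names are just placeholders):
VB.NET:
Imports System.Speech.Recognition

' Sketch of the wake-word toggle: commands are only processed for a
' short window after the wake word has been recognized.
Module WakeWordSketch
    Private commandsArmed As Boolean = False    ' set when the wake word is heard
    Private armedAt As DateTime                 ' when it was heard
    Private Const ArmWindowSeconds As Integer = 5

    Sub sre_SpeechRecognized(sender As Object, e As SpeechRecognizedEventArgs)
        Dim txt As String = e.Result.Text

        If txt = "computer" Then                ' hypothetical wake word
            commandsArmed = True
            armedAt = DateTime.Now
            Return
        End If

        ' Ignore commands unless the wake word was heard recently.
        If Not commandsArmed OrElse
           (DateTime.Now - armedAt).TotalSeconds > ArmWindowSeconds Then
            commandsArmed = False
            Return
        End If

        ProcessCommand(txt)         ' handle the hardware command
        commandsArmed = False       ' disarm until the next wake word
    End Sub

    Sub ProcessCommand(txt As String)
        ' send the appropriate hardware command here
    End Sub
End Module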
 
Thanks for the info. Conceptually, that's what I'm trying to accomplish, but the mechanics of doing the grammar changes are apparently quite different between C and Visual Basic. I'm just not able to decipher the C code and effectively translate it into Visual Basic. While some of the terms are the same, the structures appear to be quite different. So far, I've found that removing one of the grammars and then stopping/starting the speech engine doesn't prevent the engine from recognizing words that were only in the deleted grammar. I'll keep experimenting and see what I can figure out.
 
The example is in C#; there are some semicolons and curly brackets, but it's not very different from VB. Look up a C#-to-VB converter online and you can get VB sample code. For instance, using the first result, Code Converter C# to VB and VB to C# – Telerik, I get this:
VB.NET:
Imports System
Imports Microsoft.Speech.Recognition
Imports Microsoft.Speech.Synthesis
Imports System.Globalization

Namespace ConsoleSpeech
    Class ConsoleSpeechProgram
        Shared ss As SpeechSynthesizer = New SpeechSynthesizer()
        Shared sre As SpeechRecognitionEngine
        Shared done As Boolean = False
        Shared speechOn As Boolean = True

        Private Shared Sub Main(ByVal args As String())
            Try
                ss.SetOutputToDefaultAudioDevice()
                Console.WriteLine(vbLf & "(Speaking: I am awake)")
                ss.Speak("I am awake")
                Dim ci As CultureInfo = New CultureInfo("en-us")
                sre = New SpeechRecognitionEngine(ci)
                sre.SetInputToDefaultAudioDevice()
                AddHandler sre.SpeechRecognized, AddressOf sre_SpeechRecognized ' VB uses AddHandler; the converter left the C# +=
                Dim ch_StartStopCommands As Choices = New Choices()
                ch_StartStopCommands.Add("speech on")
                ch_StartStopCommands.Add("speech off")
                ch_StartStopCommands.Add("klatu barada nikto")
                Dim gb_StartStop As GrammarBuilder = New GrammarBuilder()
                gb_StartStop.Append(ch_StartStopCommands)
                Dim g_StartStop As Grammar = New Grammar(gb_StartStop)
                Dim ch_Numbers As Choices = New Choices()
                ch_Numbers.Add("1")
                ch_Numbers.Add("2")
                ch_Numbers.Add("3")
                ch_Numbers.Add("4")
                Dim gb_WhatIsXplusY As GrammarBuilder = New GrammarBuilder()
                gb_WhatIsXplusY.Append("What is")
                gb_WhatIsXplusY.Append(ch_Numbers)
                gb_WhatIsXplusY.Append("plus")
                gb_WhatIsXplusY.Append(ch_Numbers)
                Dim g_WhatIsXplusY As Grammar = New Grammar(gb_WhatIsXplusY)
                sre.LoadGrammarAsync(g_StartStop)
                sre.LoadGrammarAsync(g_WhatIsXplusY)
                sre.RecognizeAsync(RecognizeMode.Multiple)

                ' Busy-wait until "klatu barada nikto" sets done (kept from the article's example).
                While done = False
                End While

                Console.WriteLine(vbLf & "Hit <enter> to close shell" & vbLf)
                Console.ReadLine()
            Catch ex As Exception
                Console.WriteLine(ex.Message)
                Console.ReadLine()
            End Try
        End Sub

        Private Shared Sub sre_SpeechRecognized(ByVal sender As Object, ByVal e As SpeechRecognizedEventArgs)
            Dim txt As String = e.Result.Text
            Dim confidence As Single = e.Result.Confidence
            Console.WriteLine(vbLf & "Recognized: " & txt)
            If confidence < 0.60 Then Return

            If txt.IndexOf("speech on") >= 0 Then
                Console.WriteLine("Speech is now ON")
                speechOn = True
            End If

            If txt.IndexOf("speech off") >= 0 Then
                Console.WriteLine("Speech is now OFF")
                speechOn = False
            End If

            If speechOn = False Then Return

            If txt.IndexOf("klatu") >= 0 AndAlso txt.IndexOf("barada") >= 0 Then
                CType(sender, SpeechRecognitionEngine).RecognizeAsyncCancel()
                done = True
                Console.WriteLine("(Speaking: Farewell)")
                ss.Speak("Farewell")
            End If

            If txt.IndexOf("What") >= 0 AndAlso txt.IndexOf("plus") >= 0 Then
                Dim words As String() = txt.Split(" "c)
                Dim num1 As Integer = Integer.Parse(words(2))
                Dim num2 As Integer = Integer.Parse(words(4))
                Dim sum As Integer = num1 + num2
                Console.WriteLine("(Speaking: " & words(2) & " plus " & words(4) & " equals " & sum & ")")
                ss.SpeakAsync(words(2) & " plus " & words(4) & " equals " & sum)
            End If
        End Sub
    End Class
End Namespace
 
Thanks, JohnH. I have everything working now, but one part of my original question is still unanswered: once a grammar has been established and recognizable words added, is there any way to remove words from the grammar? I don't see anything in the language that would provide for that. I suppose I could just shut down the speech recognition engine, restart it (or another instance) and then load the grammar minus the words I wanted to remove. At any rate, that's a moot point now, as I've just added code in the speech recognition routine to ignore certain words if they are not valid in the current context.
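In essence it's just a lookup against the words that are valid in the current state, something like this (simplified; the validInContext name is illustrative):
VB.NET:
' Simplified version of the context filter in the recognition handler.
Private validInContext As New HashSet(Of String)

Private Sub sre_SpeechRecognized(sender As Object, e As SpeechRecognizedEventArgs)
    Dim txt As String = e.Result.Text
    If Not validInContext.Contains(txt) Then Return ' ignore out-of-context words
    ' ...handle the command, then update validInContext for the new state
End Sub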

Your example raises another important question: this application has a rather limited set of words to recognize - around 20. I've been having problems with miscellaneous sounds like a cough or sneeze being recognized as one of the 20 words. I see that your example looked at the confidence property of the recognized speech and ignored responses with a confidence factor below 0.6. Question is: what's a reasonable value to check for, and where did the 0.6 in your example come from?

I've been monitoring the confidence factor as I speak my 20 words, and most of the time the value is above 0.93. But one of the words - "Australia" - rarely gets a confidence of more than 0.65. I'd expect words with fewer or softer consonant sounds to get a lower confidence, but being a novice in this speech recognition game, that's just my guess. In the case of this application, a high confidence number is important, as each spoken word results in a request for physical action of an attached mechanical device.

I've also read several comments in other forums criticizing the accuracy of the recognition in the engine used in Visual Studio. I don't know whether those are valid or not, but I wonder if there's any way to use other speech recognition engines with Visual Basic and Visual Studio. Haven't been able to find anything on that yet.
 
It's not my example, it's from the MSDN Magazine article I posted.

Read about Confidence here: RecognizedPhrase.Confidence Property (System.Speech.Recognition) | Microsoft Docs
As explained, Confidence is not an absolute value but a relative one that can be compared with the Confidence of other possible phrases recognized (the probability of alternates) or with the relative "level" of the same phrase recognized previously.
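So rather than trusting an absolute threshold, you could also compare the top result against its alternates and only act when it is a clear winner, something like this (untested sketch; the 0.2 margin is arbitrary):
VB.NET:
Private Sub sre_SpeechRecognized(sender As Object, e As SpeechRecognizedEventArgs)
    ' Alternates are ordered by descending confidence; the first entry
    ' is normally the phrase that was recognized.
    Dim alts = e.Result.Alternates
    If alts.Count > 1 AndAlso
       alts(0).Confidence - alts(1).Confidence < 0.2F Then
        Return ' the runner-up is too close; treat the result as ambiguous
    End If
    ' ...act on e.Result.Text
End Sub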

I have not worked with speech recognition, but I can see the engine supports both LoadGrammar and UnloadGrammar methods, so that would be a way to add and remove things that it will listen for.
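So the wake-word flow could swap the command grammar in and out instead of restarting the engine, along these lines (untested; wakeGrammar and commandGrammar are assumed to be the two Grammar objects, and with a running engine the async variants or RequestRecognizerUpdate may be needed):
VB.NET:
sre.LoadGrammar(wakeGrammar)            ' startup: wake word only

' ...when the wake word is recognized:
sre.LoadGrammarAsync(commandGrammar)    ' arm the full command set

' ...after a command is handled (or a timeout):
sre.UnloadGrammar(commandGrammar)       ' back to wake word only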
 
If this is still needed, this is a great start in the right direction: SrgsToken.Pronunciation Property (System.Speech.Recognition.SrgsGrammar)
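With it you can give the wake word an explicit pronunciation in an SRGS grammar, roughly like this (untested sketch; the wake word and phone string are made up, and the exact phone labels depend on the phonetic alphabet you choose):
VB.NET:
Imports System.Speech.Recognition
Imports System.Speech.Recognition.SrgsGrammar

' Rough sketch: a one-word wake grammar with an explicit pronunciation.
Dim doc As New SrgsDocument()
doc.PhoneticAlphabet = SrgsPhoneticAlphabet.Sapi

Dim wakeRule As New SrgsRule("wake")
Dim token As New SrgsToken("zoltan")        ' made-up wake word
token.Pronunciation = "z ow l t aa n"       ' illustrative SAPI phones
wakeRule.Add(token)

doc.Rules.Add(wakeRule)
doc.Root = wakeRule

sre.LoadGrammarAsync(New Grammar(doc))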
 