
    Windows Vista is the first Microsoft operating system with speech recognition built in. Using this feature, you can perform various tasks by voice, such as launching and closing programs, saving and deleting files, dictating text to be entered verbatim, and editing that text. Deb Shinder, IT consultant, walks you through the details of using this feature.

    Ever since Star Trek, many computer users have dreamed of throwing out the keyboard and mouse and controlling their computer by voice. Programs that let you speak commands and dictate text to a computer have existed for many years and were very useful for people physically unable to use other input methods, but they never became broadly popular.

    Windows Vista is the first Microsoft operating system that can recognize speech out of the box. Previously, speech recognition was available in Microsoft Office XP and Office 2003, and through third-party programs such as Dragon NaturallySpeaking. Microsoft has also added speech recognition to Windows Mobile.

    You don't have to buy anything extra to start talking to your computer; Vista already includes everything you need. The feature is disabled by default, but you can easily enable it in the Control Panel, as shown in Figure A.

    You can also launch this feature from the Start menu by selecting All Programs | Accessories | Ease Of Access, as shown in Figure B.

    How it works

    You can choose one of two speech recognition modes:

    • To manage programs: Start and close programs, switch between them, save and delete files, and so on.
    • To dictate text, which is entered verbatim, and then edit it.

    Software developers can add support for this feature to their programs. Annoyingly, speech recognition currently only supports a few languages: English (US and UK), German, French, Spanish, Japanese, and Chinese (Traditional and Simplified).

    Setting up speech recognition

    Before you can use speech recognition, you will have to complete the following steps:

    • Turn on speech recognition.
    • Set up the microphone.
    • Read the manual (optional).
    • Practice speaking clearly (also optional).

    After double-clicking on Speech Recognition in the Control Panel or selecting Speech Recognition from the menu, you will be presented with a setup window as shown in Figure C.


    When you click on Start Speech Recognition, a voice control panel will appear at the top of your screen, as shown in Figure D.


    If you have already configured this feature, it will be registered to run at startup and will launch every time Windows boots. A blue voice-control icon will also appear in the system tray.
    You can open the settings context menu by right-clicking the tray icon or the voice control panel, as shown in Figure E.


    In the menu you will see the following settings:

    • Turn Speech On: The computer will listen to everything you say and will execute the commands it recognizes.
    • Sleep (standby mode): The computer listens to your speech but will not respond to any command until you say "Start listening".
    • Off: The computer does not listen to you, no matter what you tell it.
    • Open Speech Reference Card: A handy cheat sheet with basic commands and additional information.
    • Start Speech Tutorial: A video tutorial that walks you through everything.
    • Help: Opens a help file about this function.
    • Options: Here you can set up the program to load with Windows, automatic text correction, etc.
    • Configuration: Here you can set up your microphone, improve speech recognition, and open the control panel.
    • Open The Speech Dictionary: You can add new words (very useful for names and words that are difficult to recognize), and you can also exclude words that you never say.
    • Dictation Topic: Only Narrative can be selected here.
    • Go To The Speech Recognition Web Site: Opens the feature's web site.
    • Get Information About Speech Recognition: The familiar Windows About dialog box showing the program's name, version, and license.
    • Open Speech Recognition.
    • Exit: Completely closes the program.

    Since deep learning entered the speech recognition scene, word error rates have dropped dramatically. But despite all the articles you may have read, we still do not have human-level speech recognition. Speech recognizers have many failure modes, and to improve them further you need to identify and eliminate those failures. This is the only way to go from recognition that works for some people most of the time to recognition that works for all people all of the time.

    Improvements in word error rate over time on Switchboard, a test set assembled in 2000 from 40 random telephone conversations between pairs of native English speakers.

    To claim that we have reached human-level conversational speech recognition based only on Switchboard, a benchmark of telephone conversations, is like claiming a self-driving car drives as well as a human after testing it in a single city on a sunny day with no traffic. The recent progress in speech recognition is amazing, but claims of human-level performance are too bold. Here are a few areas where improvements are still needed.

    Accents and noise

    One obvious weakness of speech recognition is handling accents and background noise. The main reason is that most training data consists of American-accented English with high signal-to-noise ratios. The telephone-conversation benchmark, for example, contains only native English speakers (mostly Americans) with little background noise.

    But increasing the training data by itself will most likely not solve the problem. There are many languages with many dialects and accents, and it is unrealistic to collect labeled data for every case: building a high-quality speech recognizer for American English alone takes up to 5,000 hours of transcribed audio.


    Comparison of human transcribers with Baidu's Deep Speech 2 on different types of speech. Humans are worse at recognizing non-American accents, perhaps because most of the transcribers were American. I suspect transcribers who grew up in a given region would recognize that region's accent with far fewer errors.

    With background noise in a moving car, the signal-to-noise ratio can be as low as -5 dB. People easily recognize another person's speech in such conditions; automatic recognizers degrade much faster as noise increases. The graph shows how much the gap between humans and models widens at low SNR (signal-to-noise ratio).
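    To make the SNR figures above concrete, here is a minimal sketch in plain Python; the function and the numbers are illustrative only, not part of any recognizer:

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10 * math.log10(signal_power / noise_power)

# A noise power about 3.16x the signal power gives roughly -5 dB,
# the in-car figure cited above.
print(round(snr_db(1.0, 10 ** 0.5), 1))  # -5.0
```

    Negative SNR simply means the noise carries more power than the speech, which is exactly the regime where automatic recognizers fall behind humans.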

    Semantic errors

    Often the raw count of misrecognized words is not what a speech recognition system really cares about. What matters is the semantic error rate: the proportion of utterances whose meaning is misrecognized.

    An example of a semantic error is when someone says "let's meet up Tuesday" and the recognizer returns "let's meet up today". There can also be word errors without semantic errors: if the recognizer drops "up" and returns "let's meet Tuesday", the meaning of the sentence is unchanged.

    We should use word error rate as a yardstick with care. To illustrate, here is a worst case. A 5% word error rate corresponds to one wrong word in 20. If each sentence has 20 words (about average for English), the sentence error rate could approach 100%. We can only hope that misrecognized words do not change the meaning of sentences; otherwise the recognizer could misinterpret every sentence even at 5% word error.
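    The arithmetic above is easy to reproduce. Here is a minimal sketch of the standard word error rate computation (word-level Levenshtein distance); the example sentences come from this article, the helper itself is mine:

```python
def word_error_rate(reference, hypothesis):
    """WER = edit distance (substitutions + insertions + deletions)
    between word sequences, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("let's meet up tuesday", "let's meet up today"))  # 0.25

# The worst case from the text: at 5% WER with 20-word sentences,
# if every error lands in a different sentence, every sentence is wrong.
print(0.05 * 20)  # 1.0 -> up to 100% of sentences affected
```

    The point of the worst-case calculation is that WER says nothing about how the errors are distributed across sentences, which is why semantic error rate is the better target.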

    When comparing models with humans, it is important to examine the nature of the errors, not just count misrecognized words. In my experience, human transcribers make fewer errors, and their errors are less serious than the computer's.

    Researchers at Microsoft recently compared the errors of human and machine recognizers of similar accuracy. One difference they found is that the model confuses "uh" with "uh huh" much more often than people do. The two have very different semantics: "uh" is a filler, while "uh huh" is an acknowledgment from the listener. That said, models and humans also made many errors of the same types.

    Many voices in one channel

    Recognizing recorded telephone conversations is also easier because each speaker was recorded on a separate microphone, so there is no overlap of multiple voices in one audio channel. People, by contrast, can understand several speakers, sometimes even talking at the same time.

    A good speech recognizer should be able to segment the audio stream by speaker (diarization) and extract meaning from a recording with two overlapping voices (source separation). And it should do this without a microphone at each speaker's mouth, that is, it should work well wherever it happens to be placed.
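    Real diarization and source separation are hard research problems. As a toy illustration only, here is the most naive possible first step: splitting a stream of samples into voiced segments at silences (the threshold and the sample values are invented):

```python
def split_on_silence(samples, threshold=0.1):
    """Toy segmentation: split a sample stream into voiced segments
    wherever the absolute amplitude drops below `threshold`.
    Real diarization must also decide WHO speaks in each segment,
    which this sketch does not attempt."""
    segments, current = [], []
    for s in samples:
        if abs(s) >= threshold:
            current.append(s)
        elif current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments

print(len(split_on_silence([0.5, 0.4, 0.0, 0.0, 0.3, 0.6])))  # 2
```

    Note what this cannot do: when two voices overlap in one channel there is no silence to split on, which is exactly why source separation is needed.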

    Recording quality

    Accents and background noise are just two factors a speech recognizer must be robust to. Here are a few more:

    • Reverberation in different acoustic environments.
    • Hardware artifacts.
    • Artifacts of the codec used to record and compress the signal.
    • Sampling rate.
    • The speaker's age.

    Most people cannot tell an mp3 file from a wav file. Before claiming human-level performance, recognizers must become robust to these sources of variation as well.

    Context

    Note that the error rate humans achieve on the telephone-conversation tests is actually quite high. If you were talking to a friend who misunderstood 1 word out of 20, you would find it very hard to communicate.

    One reason is that the recognition happens without context. In real life we use many additional cues to understand what another person is saying. Some examples of context that humans use and speech recognizers ignore:

    • The history of the conversation and the topic being discussed.
    • Visual cues about the speaker: facial expressions, lip movement.
    • Prior knowledge about the person we are talking to.

    Android's speech recognizer now uses your contact list, so it can recognize your friends' names. Voice search in maps uses geolocation to narrow down the destinations you might want directions to.
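    As a hedged sketch of how a contact list might bias recognition: one simple approach is to rescore the recognizer's n-best hypotheses, boosting those that contain a known name. The names, scores, and boost value below are all invented for illustration:

```python
def rescore(hypotheses, contacts, boost=0.1):
    """Pick the best hypothesis after adding a small score boost
    for each known contact name it contains (toy contextual biasing)."""
    def biased(hyp_score):
        hyp, score = hyp_score
        bonus = sum(boost for w in hyp.lower().split() if w in contacts)
        return score + bonus
    return max(hypotheses, key=biased)[0]

contacts = {"aneesh", "marta"}
nbest = [("call a niece", 0.52), ("call aneesh", 0.48)]
print(rescore(nbest, contacts))  # call aneesh
```

    Without the contact list, the acoustically likelier but wrong "call a niece" would win; the context flips the decision.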

    Recognition accuracy improves when such signals are included in the data. But we are only beginning to explore what kinds of context we could include in processing and how to use them.

    Deployment

    Many recent advances in conversational speech recognition are not deployable. When thinking about deploying a speech recognition algorithm, keep latency and computing power in mind. The two are related, since algorithms that increase compute requirements usually increase latency too, but for simplicity I will discuss them separately.

    Latency: The time from the end of the user's speech to the completion of the transcript. Low latency is a typical product requirement and strongly affects the user's experience; limits of tens of milliseconds are common. That may seem too strict, but remember that producing the transcript is usually only the first step in a series of expensive computations. For example, with voice web search, the actual search still has to run after speech recognition.

    Bidirectional recurrent layers are a typical example of an improvement that hurts latency. All the latest high-quality transcription results use them. The problem is that nothing can be computed past the first bidirectional layer until the person has finished speaking, so the delay grows with the length of the utterance.


    Left: a forward-only recurrence can start transcribing immediately. Right: a bidirectional recurrence must wait until the end of the speech before it can start transcribing.
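    The difference between the two diagrams can be sketched with a toy scalar "recurrence" (not a real RNN; the weight and frame values are invented):

```python
def forward_pass(frames, w=0.5):
    """Unidirectional recurrence: each output depends only on past
    frames, so it can be emitted as soon as its frame arrives."""
    h, outputs = 0.0, []
    for x in frames:
        h = w * h + x        # toy recurrence
        outputs.append(h)    # available immediately (streaming)
    return outputs

def backward_pass(frames, w=0.5):
    """The backward half of a bidirectional layer: its FIRST output
    depends on the LAST frame, so nothing can be emitted until the
    speaker stops. Latency therefore grows with utterance length."""
    h, outputs = 0.0, []
    for x in reversed(frames):
        h = w * h + x
        outputs.append(h)
    return list(reversed(outputs))

print(forward_pass([1.0, 0.0, 0.0]))   # [1.0, 0.5, 0.25]
print(backward_pass([0.0, 0.0, 1.0]))  # [0.25, 0.5, 1.0]
```

    In the backward pass, the output at position 0 already reflects the frame at position 2, which is precisely the "wait until the end" behavior in the right-hand diagram.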

    A good way to efficiently incorporate future information into speech recognition is still being sought.

    Computing power: This constraint is economic. You must weigh the cost of each improvement in recognizer accuracy against its benefit: if an improvement does not clear the economic threshold, it cannot be deployed.

    A classic example of a research improvement that never gets deployed is a model ensemble. A 1-2% reduction in errors rarely justifies a 2-8x increase in computing power. Modern recurrent language models also fall into this category, since they are very expensive to use in a beam search, though I expect this to change in the future.
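    The ensemble cost argument in miniature, with invented toy models: averaging the scores of k models requires k forward passes, so inference cost scales linearly with ensemble size while the accuracy gain is usually small:

```python
def ensemble_predict(models, x):
    """Average the (toy) per-class scores of k models.
    Inference cost grows linearly with the number of models,
    which is why ensembles are often too expensive to deploy."""
    scores = [m(x) for m in models]  # k forward passes
    k = len(scores)
    return [sum(col) / k for col in zip(*scores)]

# Three hypothetical models emitting scores for two classes.
models = [lambda x: [0.6, 0.4], lambda x: [0.5, 0.5], lambda x: [0.7, 0.3]]
print([round(s, 2) for s in ensemble_predict(models, None)])  # [0.6, 0.4]
```

    Three models here means three times the compute per utterance; whether that buys enough accuracy is exactly the economic threshold discussed above.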

    To be clear, I am not saying that improving accuracy at a serious computational cost is useless. We have seen the "first slow but accurate, then fast" pattern work before. The point is simply that until an improvement is fast enough, it cannot be used.

    In the next five years

    There are still many unsolved and complex problems in the field of speech recognition. Among them:

    • Expanding capabilities to new domains, new accents, and speech against strong background noise.
    • Incorporating context into the recognition process.
    • Diarization and source separation.
    • Semantic error rates and innovative methods for evaluating recognizers.
    • Very low latency.

    I look forward to the progress that will be made over the next five years on these and other fronts.


    Touch-screen control is already standard, and the latest systems such as Windows 8 "understand" voice commands. Speech recognition should make our communication with the computer even easier, more intuitive and more natural. Here is what it looks like today.

    A bit of history - how communication with the machine developed

    Ways of communicating with a computer have evolved over the years. The first interface through which a person could issue commands was the punched card, dating back to 1832, when it was used in cloth-weaving machines. The keyboard came into use in 1960. Two decades later it was joined by the mouse, which is still standard today; although the mouse now shares power with the trackpad, it remains the most popular pointing device. Thanks to smartphones and tablets, touch interfaces and gestures became very popular, used, among other things, to control the Xbox 360 with Kinect. After touch screens and gestures comes voice control, but so far this solution has been so underdeveloped that you rarely hear about it.

    Setting up speech recognition in Windows 8

    Unfortunately, voice control is not yet available in Russian. English, French, German, Japanese, Korean, Chinese and Spanish are currently supported. Microsoft decided to focus on the largest markets first, but it may well add the feature for our country in time. If you try to run it in an unsupported language, it displays an error message like this:

    If you still want to try this solution, you need to reconfigure the system (change its language) and learn a few words of English. Go to the Control Panel and select Language. If you have no language other than Russian, click the "Add language" button and select one of the supported languages; in our case, "English (United States)". At first only the keyboard layout for that language is available: double-click the language, let Windows check whether an interface pack is available, then click "Download and install the language pack" and wait patiently for the download to finish. Once the process is complete, set English as the default language.

    Now you need to go to the Windows 8 start screen (tiled), type “Windows Speech Recognition” in the search box and press Enter.

    This launches the speech recognition tool. On first start it will prompt you to configure the microphone; after choosing one, say something to test it.

    It will then offer training lessons. They take 15-20 minutes, but they are very useful and cover the basics of using the feature. If your English is weak, though, I don't think you should spend the time; it will be hard to make anything out, so go straight into battle.

    How to work

    For the computer to start recognizing your speech, say "start listening" or press the microphone button to enter listening mode. Now you can open applications or simply dictate words into a text editor, browser or search bar.

    What can we do

    In principle, the possibilities are huge: in addition to the standard commands, you can create your own. The main features are shown in the table.

    Action | What to say
    Select any item by its name | Click File; Start; View
    Select any item or icon | Click Recycle Bin; Click Computer; Click (file name)
    Double-click any item | Double-click Recycle Bin; Double-click Computer
    Switch between open applications | Switch to Paint; Switch to WordPad
    Scroll in a direction | Scroll up; Scroll down; Scroll left; Scroll right
    Insert a new paragraph or new line into a document | New paragraph; New line
    Select a word in a document | Select word
    Correct a word | Correct word
    Select and delete a word | Delete word
    Show a list of applicable commands | What can I say?
    Update the list of available speech commands | Refresh speech commands
    Turn on listening mode | Start listening
    Turn off listening mode | Stop listening
    Minimize the Speech Recognition bar | Minimize speech recognition
    Open Windows Help and Support | How do I do something? (for example: How do I install a printer?)

    If you do not know how to pronounce a phrase, I suggest using Google Translate or http://tutor.ru (the recognizer understood pronunciations learned from that site better).

    I wanted to record my own commands made up of simple English words that I can actually pronounce, but it wouldn't let me: the command editor would not start. It did, however, understand my pronunciation of the words One, Two and Open perfectly. With that set you can launch an application by its number on the home screen: first say the number, then say OPEN. Not much, of course, but I consider the experiment a success. It would be nice if Microsoft added Russian; this would make a good replacement for the remote control.

    The Windows 7 operating system is equipped with many features that give its users more and more possibilities. One very interesting function built into it is called speech recognition. But what is this system? Let's discuss.

    The option in question gives applications across the system a completely new way for the user to interact with the computer: Windows 7 speech recognition lets you control the computer without a keyboard, mouse or other input devices.

    Note that this capability also appears in other Microsoft products. The feature appeared a little earlier, in Windows Vista, but in the seventh version of the operating system voice control works at a higher level than in its predecessor. Simply put, Windows 7 speech recognition has become even more functional.

    In addition, it has a fairly wide range of applications. With speech recognition, Windows 7 users can run programs, convert audio fragments to text, and execute all kinds of commands on the computer using just their voice and the necessary devices. But what does it take to make Windows 7 speech recognition work?

    First of all, you will need a microphone connected to your computer. In addition, you must obtain the speech application published by the manufacturer itself, that is, by Microsoft. Once the necessary components are installed and the microphone is connected, a short work plan follows:

    • You need to execute test voice commands and convert them to text.
    • After you train the recognition program, you will need to create voice templates for different commands. It is on the basis of this work that the computer will be able to accept and execute the commands you give.
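    The command-template idea above can be sketched as a simple phrase-to-action table. The phrases and actions below are invented examples, not actual Windows commands:

```python
def make_dispatcher(commands):
    """Map recognized phrases to actions (toy command templates).
    `commands` maps a spoken phrase to a callable; unknown phrases
    are ignored, much as a recognizer in command mode would do."""
    def dispatch(phrase):
        action = commands.get(phrase.strip().lower())
        return action() if action else None
    return dispatch

# Hypothetical actions standing in for "launch program", "save file", etc.
dispatch = make_dispatcher({
    "open notepad": lambda: "launching notepad",
    "save file": lambda: "saving current file",
})
print(dispatch("Open Notepad"))  # launching notepad
print(dispatch("sing a song"))   # None
```

    A real system matches against acoustic hypotheses rather than exact strings, but the table-of-templates structure is the same idea.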

    The Windows 7 speech recognition feature is used in Microsoft's WordPad text editor. It functions flawlessly when filling out various forms and also performs well in Internet Explorer.

    In addition, this option makes it easy to edit previously recorded text using special voice commands. Of course, typical errors occur during recognition (when certain sounds are misrecognized); in that case the program offers a list of candidate words.

    The function is, of course, phenomenal, but there is one "but": Russian speech recognition is simply not available yet. There are excellent versions of the program for English, French, German and Japanese, as well as versions for Chinese, Spanish and Italian.

    So this novelty is not adapted for Russian speech. Your computer will not understand the tasks you give it in Russian, which means it will be easier to type with the keyboard or work with the mouse.

    Of course, you can try similar Russian-language programs or switch to English, but we can only hope that high-quality Russian speech recognition will arrive soon. Then you will be able to try this unique function in practice. It undoubtedly simplifies work on a personal computer and is a huge breakthrough. All that remains is to wait.

    No program can completely replace the manual work of transcribing recorded speech. However, there are solutions that can significantly speed up and facilitate the translation of speech into text, that is, simplify transcription.

    Transcription is converting an audio or video file into text form. There are paid transcription tasks on the Internet, where the performer is paid a certain amount for transcribing a text.

    Speech-to-text translation is useful for:

    • students, to turn recorded audio or video lectures into text,
    • bloggers running websites and blogs,
    • writers and journalists, to write books and articles,
    • information businessmen who need a transcript of their webinar, speech, etc.,
    • people who find it hard to type, who can dictate a letter and send it to relatives or friends,
    • and others.

    We will describe the most effective tools available on PC, mobile applications and online services.

    1 Site speechpad.ru

    This is an online service that converts speech to text in the Google Chrome browser. It works with a microphone and with ready-made files. The quality is, of course, much higher if you use an external microphone and dictate yourself, but the service does a decent job even with YouTube videos.

    Click "Enable recording", answer the question about "Using a microphone" - for this, click "Allow".

    The long instruction on how to use the service can be collapsed by clicking on button 1 in fig. 3. You can get rid of advertising by going through a simple registration.

    Fig. 3. The speechpad service

    The finished result is easy to edit. To do this, you either need to manually correct the highlighted word or dictate it again. The results of the work are saved in your personal account, they can also be downloaded to your computer.

    List of video tutorials on working with speechpad:

    You can transcribe videos from Youtube or from your computer, however, you will need a mixer, more details:

    Video "audio transcription"

    The service operates in seven languages. There is one small minus: if you need to transcribe a ready-made audio file, its sound plays through the speakers, which creates extra interference in the form of an echo.

    2 Service dictation.io

    A wonderful online service that will allow you to translate speech into text for free and easily.

    Fig. 4. The dictation.io service

    1 in Fig. 4 - the Russian language can be selected at the bottom of the page. In Google Chrome the language can be selected, but in Mozilla Firefox for some reason there is no such option.

    Notably, autosave of the result is implemented, which prevents accidental loss when a tab or the browser is closed. The service does not recognize ready-made files; it works with a microphone. You need to name punctuation marks aloud as you dictate.

    The text is recognized quite correctly, there are no spelling errors. You can insert punctuation marks yourself from the keyboard. The finished result can be saved on your computer.
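    Since you have to name punctuation marks while dictating, a service of this kind does post-processing roughly like the following sketch. The mapping table is my assumption; real services have richer, language-specific rules:

```python
# Spoken-name to symbol table (assumed; real services support more marks).
PUNCT = {"comma": ",", "period": ".", "question mark": "?"}

def insert_punctuation(dictated):
    """Replace spoken punctuation names with symbols and attach them
    to the preceding word, as dictation services do."""
    text = dictated
    # Longest names first, so "question mark" wins over any shorter name.
    for name, symbol in sorted(PUNCT.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(" " + name, symbol)
    return text

print(insert_punctuation("hello comma how are you question mark"))
# hello, how are you?
```

    The obvious limitation, which real services share, is ambiguity: there is no way to dictate the literal word "comma" without it becoming a symbol.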

    3 RealSpeaker

    This program makes it easy to convert human speech to text. It is designed to work on different systems: Windows, Android, Linux, Mac. It can convert speech spoken into a microphone (for example, one built into a laptop) as well as speech recorded in audio files.

    It can recognize 13 languages. There is a beta version of the program that works as an online service:

    Follow the link above, select Russian, upload your audio or video file to the online service and pay for its transcription. After transcription you can copy the resulting text. The larger the file, the more time processing takes; more details:

    In 2017 RealSpeaker offered a free transcription option; as of 2018 it no longer does. It is very awkward that transcribed files are available for any user to download; perhaps this will be fixed.

    The developer's contacts (VKontakte, Facebook, Youtube, Twitter, e-mail, phone) can be found on the program's website (more precisely, in the site's footer):

    4 Speechlogger

    An alternative to the previous application for mobile devices running Android. It is available for free in the app store:

    The text is edited automatically, punctuation marks are placed in it. Great for dictating notes or making lists. As a result, the text will turn out to be of very decent quality.

    5 Dragon Dictation

    This is an application that is distributed free of charge for mobile devices from Apple.

    The program works with 15 languages. It lets you edit the result and choose the right words from a list. You need to pronounce all sounds clearly, avoid unnecessary pauses, and avoid intonation. There are sometimes mistakes in word endings.

    Owners use the Dragon Dictation application, for example, to dictate a shopping list while moving around the apartment. At the store, they can then read the text in the note instead of having to listen to it.

    Whatever program you use in your practice, be prepared to double-check the result and make certain adjustments. This is the only way to get a flawless text without errors.
