
[Repost] Open Source Dictation: Demo Time

Posted by XChinux on 2013-07-13
Original article: http://grasch.net/node/22



Open Source Dictation: Demo Time        


Wed, 07/10/2013 - 19:39
Over the last couple of weeks, I've been working towards a demo of open source speech recognition. I did a review of existing resources and managed to improve both the acoustic and the language model. That left turning Simon into a real dictation system.

Making Simon work with large-vocabulary models


First of all, I needed to hack Simond a bit to accept and use an n-gram based language model instead of the scenario grammar whenever the former was available. With this little bit of trickery, Simon was already able to use the models I built over the last few weeks.
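
Simond itself is C++/Qt, but the idea is easy to sketch with the pocketsphinx Python bindings of that era: the decoder is configured with either a statistical n-gram language model or a restrictive grammar, and the trick is simply to prefer the former when it exists. All model paths below are hypothetical placeholders.

```python
from pocketsphinx import Decoder

# Hypothetical paths; substitute your own models.
config = Decoder.default_config()
config.set_string('-hmm', 'models/acoustic')            # acoustic model
config.set_string('-dict', 'models/pronunciation.dict') # pronunciation dictionary

have_ngram_model = True
if have_ngram_model:
    # Large-vocabulary dictation: use the statistical n-gram language model.
    config.set_string('-lm', 'models/dictation.lm')
else:
    # Fall back to a restrictive JSGF grammar (akin to a scenario grammar).
    config.set_string('-jsgf', 'scenarios/commands.jsgf')

decoder = Decoder(config)
```
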


Sadly, I immediately noticed a big performance issue: up until now, Simon basically recorded one sample until the user stopped speaking and then started recognizing. While not a problem when the "sentences" are constrained to simple, short commands, this would cause significant lag as the length of the sentences, and therefore the time required for recognition, increased. Even when recognizing faster than real time, this essentially meant that you had to wait for ~2 seconds after saying a ~3 second sentence.

To keep Simon snappy, I implemented continuous recognition in Simond (for pocketsphinx): Simon now feeds data to the recognizer engine as soon as the initial buffer is filled, making the whole system much more responsive.
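
As a rough Python sketch (not Simond's actual C++ code): pocketsphinx supports exactly this kind of incremental decoding - you open an utterance, push small raw-audio buffers into the decoder as they arrive, and can query a hypothesis at any time. The pyaudio capture and the buffer size are assumptions for illustration.

```python
import pyaudio  # assumed audio source; any 16 kHz, 16-bit mono stream works
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'models/acoustic')
config.set_string('-dict', 'models/pronunciation.dict')
config.set_string('-lm', 'models/dictation.lm')
decoder = Decoder(config)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=1024)

decoder.start_utt()
try:
    while True:
        # Push each small buffer into the decoder as soon as it fills,
        # instead of waiting for the user to stop speaking.
        decoder.process_raw(stream.read(1024), False, False)
        hyp = decoder.hyp()
        if hyp is not None:
            print('partial hypothesis:', hyp.hypstr)
finally:
    decoder.end_utt()
    stream.close()
    pa.terminate()
```
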

Revisiting the Dictation plugin


Even before this project started, Simon already had a "Dictation" command plugin. Basically, this plugin would just write out everything that Simon recognizes. But that's far from everything there is to dictation from a software perspective.


First of all, I needed to take care of replacing the special words used for punctuation, like ".period", with their associated signs. To do that, I implemented a configurable list of string replacements in the dictation plugin.

An already existing option to add a given text at the end of a recognition result takes care of adding spaces after sentences if configured to do so. I also added the option to uppercase the first letter of every new spoken sentence.
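
A minimal sketch of what such post-processing could look like (the replacement table and function name are hypothetical; in Simon the list of string replacements is user-configurable):

```python
# Hypothetical spoken forms for punctuation and their replacements.
REPLACEMENTS = {
    ".period": ".",
    ",comma": ",",
    "?question-mark": "?",
}

def postprocess(hypothesis, trailing=" ", capitalize=True):
    """Turn a raw recognition result into dictation-ready text."""
    text = hypothesis
    # Replace spoken punctuation words with the actual signs.
    for spoken, sign in REPLACEMENTS.items():
        text = text.replace(spoken, sign)
    # Drop the space the recognizer leaves before a punctuation word.
    for sign in REPLACEMENTS.values():
        text = text.replace(" " + sign, sign)
    # Uppercase the first letter of the new sentence, if configured.
    if capitalize and text:
        text = text[0].upper() + text[1:]
    # Append configured trailing text (e.g. a space after the sentence).
    return text + trailing

print(postprocess("this is a test .period"))  # -> "This is a test. "
```
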
Then, I set up some shortcut commands that would be useful while dictating ("Go to the end of the document" for Ctrl+End or "Delete that" for Backspace, for example).
To deal with incorrect recognition results, I also wanted to be able to modify already written text. To do that, I made Simon aware of the currently focused text input field by using AT-SPI 2. I then implemented a special "Select x" command that would search through the current text field and select the text "x" if found. This enables the user to select the offending word(s) to either remove them or simply dictate the correction.
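
As a rough illustration of the idea - not Simon's actual C++ implementation - here is what a "Select x" handler could look like with the pyatspi bindings for AT-SPI 2. The traversal strategy and names are assumptions; as noted further below, real widgets expose the Text interface inconsistently.

```python
import pyatspi

def select_text(target):
    """Search the focused text field for `target` and select it if found."""
    desktop = pyatspi.Registry.getDesktop(0)
    for app in desktop:
        if app is None:
            continue
        # Find the descendant widget that currently has keyboard focus.
        focused = pyatspi.findDescendant(
            app, lambda a: a.getState().contains(pyatspi.STATE_FOCUSED))
        if focused is None:
            continue
        try:
            text = focused.queryText()  # raises if no AT-SPI Text interface
        except NotImplementedError:
            continue
        content = text.getText(0, text.characterCount)
        start = content.find(target)
        if start < 0:
            return False
        # Select the match so a follow-up command can delete or replace it.
        text.addSelection(start, start + len(target))
        return True
    return False

select_text("hear")  # e.g. triggered by dictating "Select hear"
```
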

Demonstration


So without further ado, this is the end result:

http://youtu.be/uItCqkpMU_k



What's next?


Of course, this is just the beginning. If we want to build a real, competitive open source speech recognition offering, we have to tackle - among others - the following challenges:
  • Turning the adaptation I did manually into an integrated, guided setup procedure for Simon (enrollment).
  • Continuing to work towards better language and acoustic models in general. There's a lot to do there.
  • Improving the user interface for dictation: We should show off the current (partial) hypothesis even while the user is speaking. That would make the system feel even more responsive.
  • Better accounting for spontaneous input: Simon should be aware of (and ignore) filler words, and support mid-sentence corrections, false starts, etc.
  • Integrating semantic logic into the language model; for example, in the current prototype, recognizing "Select x" is pretty tricky because e.g. "Select hear" is not a sentence that makes sense according to the language model - it does in the application, though (select the text "hear" in the written text for correction / deletion).
  • Better incorporating dictation with traditional command & control: When not dictating texts, we should still exploit the information we do have (available commands) to keep recognition accuracy as high as it is for the limited-vocabulary use case we have now. A mixture of (or switching between) grammar and language model should be explored.
  • Better integration in other apps: The AT-SPI information used for correcting mistakes is sadly not consistent across toolkits and widgets. Many KDE widgets are in fact not accessible through AT-SPI (e.g. the document area of Calligra Words does not report being a text field). This is mostly down to the fact that no other application currently requires the kind of information Simon does.

Even this rather long list is just a tiny selection of what I can think of right off the top of my head - and I'm not even touching on improvements in e.g. CMU SPHINX.
There's certainly still a lot left to do, but all of it is very exciting and meaningful work.
I'll be at the Akademy conference for the coming week, where I'll also be giving a talk about the future of open source speech recognition. If you want to get involved in the development of an open source speech recognition system capable of dictation: get in touch with me - either in person, or - if you can't make it to Akademy - write me an email!



