Yantorno, Robert E.; Picone, Joseph; Silage, Dennis (Temple University. Libraries, 2012)
      Abstract A keyword spotting system (KWS) recognizes predefined keywords in spoken utterances or written documents. The objective is to obtain the highest possible keyword detection rate without increasing the number of false detections. The common approach to keyword spotting uses Hidden Markov Models (HMMs). These are usually complex systems that require training speech data. The typical HMM approach uses garbage templates or HMMs to match non-keyword speech and non-speech sounds. The purpose of this research is to design a simple keyword spotting system. The system is designed to spot English words and should be easily adaptable to other languages. Designing a keyword spotting system poses many challenges: variations in speech, such as pitch, loudness, and timbre, make recognition difficult, and utterances can vary widely even for the same speaker. In this research, the use of cross-correlation as an alternative means of detecting keywords in an utterance was investigated. The research also involves modeling a global keyword using a quantized dynamic time warping algorithm, which can function effectively with multiple speakers. The global keyword is an aggregation of the features from several occurrences of the same keyword. The effect of pitch normalization on keyword detection was also investigated. Cross-correlation as a method for keyword spotting was investigated in both the time domain and the mel-frequency cepstral coefficient (MFCC) domain. In the time domain, the global keyword was cross-correlated with a pitch-normalized utterance. A zero lag ratio (the ratio of the power around the zero lag of the cross-correlation to the power in the rest of the signal) was computed for each speech frame, and a threshold was then used to determine whether the keyword is present.
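The zero lag ratio described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the `half_width` window around the zero lag, and the threshold value are hypothetical, and "the rest of the signal" is assumed here to mean the remaining lags of the cross-correlation.

```python
import numpy as np

def zero_lag_ratio(frame, template, half_width=2):
    """Power near the zero lag of the cross-correlation divided by the power
    at all other lags. `half_width` is an assumed window size."""
    frame = np.asarray(frame, dtype=float)
    template = np.asarray(template, dtype=float)
    xcorr = np.correlate(frame, template, mode="full")
    zero = len(template) - 1  # index of the zero lag in 'full' mode
    lo = max(zero - half_width, 0)
    hi = min(zero + half_width + 1, len(xcorr))
    power = xcorr ** 2
    near = power[lo:hi].sum()
    rest = power.sum() - near
    return near / rest if rest > 0 else float("inf")

def keyword_present(frame, template, threshold=0.3):
    """Threshold decision; the threshold value is illustrative only."""
    return zero_lag_ratio(frame, template) > threshold

# Synthetic check: a frame that matches the template scores much higher
# than an unrelated frame.
rng = np.random.default_rng(0)
template = rng.standard_normal(256)
unrelated = rng.standard_normal(256)
ratio_match = zero_lag_ratio(template, template)
ratio_other = zero_lag_ratio(unrelated, template)
```

In this sketch a matching frame concentrates its cross-correlation power at the zero lag, so the ratio is large, while an unrelated frame spreads its power across all lags and the ratio stays small.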
In the MFCC domain, the MFCC features of each keyword were computed, normalized, and cross-correlated with the normalized MFCC features of portions of the utterance that are the same size as the keyword. Cross-correlating the MFCC features of the keyword with those of each portion of the utterance yields a single value between 0 and 1. The portion with the highest value is usually the location of the keyword. Results in the time domain varied from keyword to keyword: some keywords showed a 60% hit rate, while the average across various keywords from the Call Home database was 41%. Cross-correlation of the keywords and utterances in the MFCC domain yielded a 66% hit rate in tests conducted on all the different keywords in the Call Home and Switchboard corpora. The system accuracy is keyword dependent, with some keywords reaching an 85% hit rate.
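The MFCC-domain search can be sketched in the same way. The abstract does not specify the normalization, so this example assumes mean removal followed by scaling to unit norm, and takes the absolute value of the zero-lag correlation so that each score lies between 0 and 1. MFCC extraction is not shown: the inputs are assumed to be precomputed arrays of shape (frames, coefficients), and `normalize`, `spot_keyword`, and `hop` are hypothetical names.

```python
import numpy as np

def normalize(features):
    """Remove the mean and scale to unit Euclidean norm (assumed normalization)."""
    features = np.asarray(features, dtype=float)
    centered = features - features.mean()
    norm = np.linalg.norm(centered)
    return centered / norm if norm > 0 else centered

def spot_keyword(keyword_mfcc, utterance_mfcc, hop=1):
    """Score every keyword-sized portion of the utterance and return the
    start frame of the best-scoring portion together with all scores."""
    utterance_mfcc = np.asarray(utterance_mfcc, dtype=float)
    keyword = normalize(keyword_mfcc)
    n_frames = keyword.shape[0]
    if utterance_mfcc.shape[0] < n_frames:
        raise ValueError("utterance is shorter than the keyword")
    scores = []
    for start in range(0, utterance_mfcc.shape[0] - n_frames + 1, hop):
        window = normalize(utterance_mfcc[start:start + n_frames])
        # Zero-lag correlation of two unit-norm arrays, folded into [0, 1].
        scores.append(abs(float(np.sum(keyword * window))))
    scores = np.array(scores)
    best_start = int(np.argmax(scores)) * hop
    return best_start, scores

# Synthetic check: the keyword is cut from frames 80..109 of the utterance.
rng = np.random.default_rng(1)
utterance = rng.standard_normal((200, 13))
keyword = utterance[80:110].copy()
best_start, scores = spot_keyword(keyword, utterance)
```

Because both arrays are scaled to unit norm, an exact match scores 1 and the Cauchy-Schwarz inequality bounds every other score by 1. A detection threshold on the best score could be added on top of this search.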