Abstract
When deployed in embedded systems, speech recognizers are necessarily reduced from the large-vocabulary continuous speech recognizers (LVCSR) found on desktops or servers to fit the limited hardware. However, embedded hardware continues to evolve in capability; today’s smartphones are vastly more powerful than their recent ancestors. This raises a new question: which hardware features not found on today’s embedded platforms, but potential additions to tomorrow’s devices, are most likely to improve recognition performance? Put differently: how sensitive is the recognizer to fine-grained details of the embedded hardware resources? To answer this question rigorously and quantitatively, we offer results from a detailed study of LVCSR performance as a function of microarchitecture options on an embedded ARM11 and an enterprise-class Intel Core2Duo. We estimate speed and energy consumption, and show, feature by feature, how hardware resources impact recognizer performance.

Index Terms: speech recognition, software performance, hardware profiling

1. Introduction
Speech is an ideal input modality for resource-constrained environments, so a great deal of effort has been expended to implement speech recognition in mobile devices. However, these devices have limited processing capability due to power, size, and price constraints. Embedded recognizers necessarily pare back features from the large-vocabulary continuous speech recognition (LVCSR) codes found on workstations to achieve usable performance. These changes, such as smaller acoustic models or limited language models [1], improve recognition speed but also reduce accuracy, sometimes dramatically [2].

Technology trends suggest that quality compromises for the sake of speed will become less prevalent in the future. The computational capabilities of embedded devices have greatly improved, to the point where processors found on high-end smartphones now resemble single-core workstation processors, adding capabilities such as decoding multiple instructions per cycle and hardware floating-point operations. Future hardware improvements may allow best-quality speech recognizers to be ported to mobile devices with less reduction of critical features. Still, it is unclear exactly how and where recognition performance might be affected by potential hardware upgrades. Our goal in this paper is to answer these questions in a rigorous way.

When designing an embedded speech recognizer one normally chooses algorithms and models that balance the speed and accuracy constraints against a fixed set of hardware resources. In this paper, we turn this equation around: the recognizer is fixed, but the hardware is variable. We use modern architecture analysis techniques (i.e., expensive cycle simulation) across a wide range of both realistic and idealized hardware mechanisms. We measure the sensitivity of recognizer performance to a large set of possible hardware improvements we may see in the near future.

To extract the sensitivity of speech recognition to hardware, we profiled a typical LVCSR engine on a variety of processor configurations based on an ARM11 core [3-4] and an Intel Core2Duo [5]. We measured both execution time and energy consumption, the two metrics which most affect mobile recognition. From these we determine not only which hardware resources have the largest effect on run time and energy, but also some general guidelines on how to optimize for performance. To our knowledge, this is the first study that (1) measures the sensitivity of recognizer performance across a range of modern architectural features; (2) compares these sensitivities across both embedded and enterprise hardware; and (3) extracts sensitivities for not only recognizer execution time, but also energy consumption.

This paper also helps optimize speech recognition by determining the application’s resource utilization and the limitations of current hardware environments. This is especially important in the mobile space because of the large variety of processors, memory and storage devices, and batteries. Some decisions are simple, like changing arithmetic to fixed-point when no floating-point unit is available, but questions like how cache sizes impact recognition speed or what recognizer is the most power hungry require deeper analysis.

Although there have been previous efforts to characterize speech recognition on processors, the recognizers studied were primitive [6-7], or sacrificed significant accuracy for speed [8]. The previous work also only tested processor architectures found in workstations, so their conclusions are at best a weak fit to embedded devices, whose resource constraints lead to a different set of performance bottlenecks.

2. Profiling Methodology
Processing time and energy consumption are the two essential performance metrics in the mobile domain. Thus we profiled speech recognition using both metrics to determine how the processor architecture affects performance.

To measure processing time and find architectural bottlenecks, we used SimpleScalar [9], a cycle-accurate simulator for program performance analysis. We chose to model the ARM11 processor with a floating-point coprocessor to simulate the computing resources of a very powerful mobile device, and a single core of an Intel Core2Duo to represent a workstation processor. Important SimpleScalar values for both baseline processor configurations are listed in Table 1.

Total power consumption is the sum of the dynamic and static (leakage) power of the CPU. To estimate dynamic power, we used 65nm technology parameters based on BSIM3 models [10] and Wattch [11], an architectural-level power analysis tool that runs on top of SimpleScalar. For our statistics we used aggressive conditional gating, which disables unused components so that they consume less power. While Wattch gives a good estimate of dynamic power, measuring static power is very difficult because leakage is extremely sensitive to temperature (which fluctuates) and to random process variations. To estimate the static power, we multiplied the dynamic power consumption from Wattch by the nominal ratio of static-to-dynamic power at the ambient inside-the-box temperature (45°C). For 65nm technology, we assumed a ratio of 30% [12].
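
The static-power estimate above reduces to a simple scaling of the dynamic power. The following is a minimal sketch of that arithmetic, not the tooling used in the study; the function names and the 0.5 W dynamic power figure are hypothetical and chosen only for illustration.

```python
# Sketch of the total-power estimate described in the text.
# Only the 30% static-to-dynamic ratio comes from the paper.
STATIC_TO_DYNAMIC_RATIO = 0.30  # assumed ratio for 65nm at 45 C


def total_power(dynamic_power_w, ratio=STATIC_TO_DYNAMIC_RATIO):
    """Return (static, total) power in watts for a given dynamic power."""
    static_power_w = ratio * dynamic_power_w
    return static_power_w, dynamic_power_w + static_power_w


def energy_joules(total_power_w, execution_time_s):
    """Energy is average total power multiplied by execution time."""
    return total_power_w * execution_time_s


# Hypothetical dynamic power; NOT a measured value from this study.
static_w, total_w = total_power(0.5)
energy_j = energy_joules(total_w, 10.0)
```

Because the static term is a fixed multiple of the dynamic term, this estimate changes absolute power numbers but not the relative ranking of configurations.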

Table 1. Key baseline processor configuration parameters.

<table>
<thead>
<tr>
<th>Processor Type</th>
<th>ARM11</th>
<th>Intel</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clock frequency</td>
<td>500 MHz</td>
<td>2 GHz</td>
</tr>
<tr>
<td>Issue policy</td>
<td>In-order</td>
<td>Out-of-order</td>
</tr>
<tr>
<td>Instruction issue</td>
<td>1 per cycle</td>
<td>4 per cycle</td>
</tr>
<tr>
<td>Branch predictor</td>
<td>128-entry bimodal</td>
<td>Comb. of bimodal and gshare</td>
</tr>
<tr>
<td>IL1 cache</td>
<td>16 KB, 4-way</td>
<td>64 KB, 8-way</td>
</tr>
<tr>
<td>DL1 cache</td>
<td>16 KB, 4-way</td>
<td>64 KB, 8-way</td>
</tr>
<tr>
<td>L2-cache</td>
<td>None</td>
<td>2 MB, 4-way</td>
</tr>
<tr>
<td>Cache replacement policy</td>
<td>FIFO</td>
<td>LRU</td>
</tr>
<tr>
<td>Memory latency</td>
<td>L1: 1, Main: 24</td>
<td>L1: 3, L2: 14, Main: 100</td>
</tr>
<tr>
<td>Integer functional units (ALU/mult)</td>
<td>2/1</td>
<td>4/1</td>
</tr>
<tr>
<td>FP functional units (ALU/mult)</td>
<td>2/1</td>
<td>2/1</td>
</tr>
</tbody>
</table>

We chose to profile Sphinx 3.0 [13], a large-vocabulary, speaker-independent, continuous speech recognizer from Carnegie Mellon University. Sphinx 3.0 can be divided into three separate stages: feature extraction, feature scoring, and search. Feature extraction takes the input speech and computes 13 MFCCs for every 10 ms of speech. Next, feature scoring builds a 39-dimensional feature vector from the MFCCs and their first and second time derivatives, and computes the probabilities of tied-triphone states, each represented with a Gaussian mixture model (GMM). Tied-triphone state probabilities are computed for eight frames at a time to increase the amount of computation performed per memory access. Finally, during the search stage, a single-pass flat lexical search is performed. While there are many other speech recognizers with different algorithms, Sphinx 3.0 contains the major elements of an enterprise-class recognizer and provides a good testbed for understanding the parameters to which recognition performance is most sensitive.
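
To make the feature scoring stage concrete, the sketch below scores a block of eight frames against a single GMM. It is an illustration under stated assumptions, not Sphinx 3.0 code: the diagonal covariances, the simple clamped differences used for the time derivatives, and all function names are assumptions made for this example.

```python
import math


def add_deltas(mfcc_frames):
    """Extend 13-dim MFCC frames to 39 dims with first and second differences.

    Assumption: simple central differences with clamping at the edges.
    """
    n = len(mfcc_frames)

    def diff(seq):
        return [
            [a - b for a, b in zip(seq[min(t + 1, n - 1)], seq[max(t - 1, 0)])]
            for t in range(n)
        ]

    d1 = diff(mfcc_frames)
    d2 = diff(d1)
    return [list(c) + v + a for c, v, a in zip(mfcc_frames, d1, d2)]


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score_block(frames, weights, means, variances):
    """Log-likelihood of each frame under one diagonal-covariance GMM.

    The outer loop walks the mixture components so that each Gaussian's
    parameters are read once and reused for every frame in the block.
    """
    per_frame = [[] for _ in frames]
    for w, mu, var in zip(weights, means, variances):
        log_norm = math.log(w) - 0.5 * sum(math.log(2 * math.pi * v) for v in var)
        for i, x in enumerate(frames):
            dist = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mu, var))
            per_frame[i].append(log_norm - 0.5 * dist)
    return [log_sum_exp(c) for c in per_frame]


# Toy data: eight silent frames and an 8-component GMM (hypothetical values).
frames = add_deltas([[0.0] * 13 for _ in range(8)])
weights = [1.0 / 8] * 8
means = [[0.0] * 39 for _ in range(8)]
variances = [[1.0] * 39 for _ in range(8)]
scores = score_block(frames, weights, means, variances)
```

Ordering the loops by component first is what raises the computation performed per memory access: the model parameters dominate the data volume, while the eight feature vectors are small enough to stay resident.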

3. Results

The speech corpus used in these experiments was the Wall Street Journal 5000 word task [14]. The acoustic model had 4147 tied-triphone states, each of which was represented with a Gaussian mixture model with 8 mixture components for a total of 33,176 Gaussians. 3-state hidden Markov models were used to represent triphones, and the language model had 4989 unigrams, 1.64 million bigrams, and 2.68 million trigrams. The test set comprised 40 minutes of speech and Sphinx 3.0's word error rate was 6.707%.

As a practical matter, we note that while a larger corpus would be desirable, we are running a rather large set of benchmarks in “emulation”, that is, on top of a cycle-by-cycle hardware simulator, and not on real hardware. This allows us to explore a wide range of hardware and architecture options. But total runtime here was a serious concern. The WSJ 5K corpus represents a compromise between LVCSR realism and our ability to complete this study. The experiments shown herein required roughly 3 months of CPU time to collect.

3.1. Baseline Configuration

The execution time and energy consumption breakdowns for Sphinx 3.0 on the baseline ARM11 and Intel configurations are shown in Table 2. The bulk of the execution time is spent in feature scoring and search, which is consistent with the results from earlier, simpler recognizers. Feature extraction is composed of many digital signal processing algorithms that require very few operations and memory accesses. Feature scoring takes a long time because it needs to read in a large acoustic model and perform many computations. Although search has very little arithmetic, it is constantly accessing the active HMMs and the language model, so it requires considerable memory bandwidth. Any performance optimization must focus on feature scoring or search, so the rest of this paper focuses on these two stages.

Table 2. Percentage breakdown for baseline configurations.

<table>
<thead>
<tr>
<th>Stage</th>
<th>Feature Extract.</th>
<th>Feature Scoring</th>
<th>Search</th>
</tr>
</thead>
<tbody>
<tr>
<td>ARM11 – 9.40 xRT</td>
<td>2.1%</td>
<td>66.0%</td>
<td>31.9%</td>
</tr>
<tr>
<td>Intel – 0.84 xRT</td>
<td>2.1%</td>
<td>69.2%</td>
<td>28.7%</td>
</tr>
</tbody>
</table>

Both feature extraction and feature scoring represent a larger share of execution time for the ARM11 processor than the Intel processor because the ARM11 architecture does not fully exploit the parallelism present in the two stages. Since the ARM11 is a scalar, in-order processor, operations that are processed concurrently on the Intel processor, like FFT or GMM computations, are instead serialized. To verify this, we computed the ratio of the Intel processor's instructions per cycle (IPC) to the ARM11 processor's IPC for the two stages. It is 3.76 for feature extraction and 3.66 for feature scoring, which is close to the instruction issue ratio of 4 for the two processors. Since the search stage is limited by memory accesses, its ratio is only 1.87.

Since the methodology used to estimate leakage power was rather crude, it is best to regard the exact power estimate as less useful than the relative power comparison of different architectures or different stages of speech recognition. To measure the impact on battery life, energy is a better metric than power, so in Table 2 we show the breakdown of energy used instead of power consumed. Feature scoring consumes a slightly greater fraction of the energy than of execution time; this is because the processor is constantly computing and has little idle time. Conversely, during search there are many processor components that can be turned off while waiting on memory, so the power consumed per cycle is slightly lower.

3.2. Execution Time Bottlenecks

To better understand what limits recognizer decoding speed on the ARM11 processor, we also evaluated Sphinx 3.0 with a combination of “perfect” memory, “perfect” instruction fetch/decode/issue (“perfect” instruction), and “perfect” ALUs. Removing these resource constraints allowed us to see which of them has the largest impact on execution time. To achieve perfect memory, we configured SimpleScalar to have one-cycle latencies for all cache and off-chip memory accesses. For perfect instruction and ALU, we set the number of instructions fetched/decoded/issued and the number of ALUs to the largest values SimpleScalar would accept (64 fetches, 16 decodes/issues, 8 integer ALUs, 8 integer multipliers, 8 floating-point ALUs, 8 floating-point multipliers). Figure 1 shows the cycle count of the two most important stages and the total time, normalized against the baseline model results.

![Figure 1: Effects of “perfect” memory, instruction, and ALU on ARM11 execution time, y-axis starts at 0.6.](image)

Memory latency and instruction issue have the greatest impact on execution time. It is intuitively obvious why memory is important, because both feature scoring and search require reading data sets which are too large to store on-chip. Instruction issue generally does not limit performance of applications with large amounts of parallelism, but as discussed earlier the scalar, in-order architecture limits the instruction-level parallelism exploited. Without enough instructions being issued, the functional units remain idle, so “perfect” ALU shows no improvement. When examining individual stages, one finds feature scoring benefits more from instructions issued because of its inherent parallelism, while search favors faster memory because it requires fewer operations and much more memory bandwidth. For the WSJ5K corpus, improving memory latency has a greater impact on overall run time, but for other test sets that spend a larger fraction of execution time in feature scoring, run time will benefit more from using a superscalar processor.

Embedded speech recognizers need to run near real time, so it is unlikely that the ARM11 processor will be fast enough for this corpus without major architectural changes. Even with the “perfect all” configuration, the best performance is only 21.4% faster than baseline, or 7.39 xRT. Better branch prediction alone will not solve the problem: adding perfect branch prediction to the “perfect all” configuration improves it by only 1.82%.
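
The real-time figures quoted here can be checked with a few lines of arithmetic. The sketch below assumes that "21.4% faster" means a 21.4% reduction in execution time, which is the reading consistent with the 9.40 xRT baseline from Table 2 and the reported 7.39 xRT; the function names are illustrative.

```python
def xrt_after_time_reduction(baseline_xrt, reduction_fraction):
    """Real-time factor after cutting execution time by the given fraction."""
    return baseline_xrt * (1.0 - reduction_fraction)


def reduction_needed_for_target(baseline_xrt, target_xrt=1.0):
    """Fraction of execution time that must be removed to reach target_xrt."""
    return 1.0 - target_xrt / baseline_xrt


BASELINE_ARM11_XRT = 9.40  # from Table 2

perfect_all_xrt = xrt_after_time_reduction(BASELINE_ARM11_XRT, 0.214)
needed_for_real_time = reduction_needed_for_target(BASELINE_ARM11_XRT)
```

Under this reading, reaching 1.0 xRT would require removing roughly 89% of the baseline execution time, far more than the 21.4% that the idealized configuration delivers.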

3.3. Effects of Cache Size

To see how sensitive performance is to the L1 cache size, we varied the IL1 and DL1 size from 4KB to 64KB. Instead of just using execution time and energy consumed to evaluate the processors, we also used the energy-delay product (EDP) [15], a metric used widely in the low-power digital design community. Any architectural improvement that improves decoding time may also increase energy consumed, so we need a metric that measures the trade-off. By multiplying the energy consumed and decoding time we place equal weight on energy and time, but one can easily add exponents to the variables if one metric is more important.
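
As a minimal sketch of the metric just described, the energy-delay product and its weighted variant can be written as follows. The exponent parameters and the example numbers are hypothetical; they illustrate how the weighting changes a comparison and are not results from this study.

```python
def energy_delay_product(energy, delay, energy_exp=1.0, delay_exp=1.0):
    """EDP with optional exponents; the defaults weight energy and time equally."""
    return (energy ** energy_exp) * (delay ** delay_exp)


def normalized_edp(energy, delay, base_energy, base_delay, **exponents):
    """EDP of a configuration relative to a baseline configuration."""
    return (energy_delay_product(energy, delay, **exponents)
            / energy_delay_product(base_energy, base_delay, **exponents))


# Hypothetical configuration: 10% more energy, 5% less execution time.
equal_weight = normalized_edp(1.10, 0.95, 1.0, 1.0)
delay_favored = normalized_edp(1.10, 0.95, 1.0, 1.0, delay_exp=2.0)
```

With equal weights this hypothetical change looks like a loss (normalized EDP above 1), while squaring the delay term makes it a slight win, which is why the choice of exponents should reflect whether battery life or responsiveness matters more.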

The effects of L1 cache sizes on execution time, energy consumed, and EDP are shown in Figure 2. The size of IL1 has virtually no effect on execution time because the miss rate for a 4KB IL1 is already only 0.19%. Since larger caches draw more power without improving execution time, it is not surprising to see the EDP is monotonically increasing with IL1 cache size.

![Figure 2: Normalized execution time, energy consumed, and EDP for varying cache sizes, y-axis starts at 0.8.](image)

Conversely, EDP is roughly parabolic with respect to DL1 size because of the large change in execution time from a 4KB DL1 to an 8KB DL1. The improvement lies in the feature scoring stage, where the working set does not fit in a 4KB cache without replacement. Ideally, the acoustic model parameters of a single GMM and a set of eight feature vectors would fit in DL1 without conflict, avoiding time-consuming main memory accesses. However, the malloc function allocated 4KB per GMM, so evaluating even a single feature vector required off-chip accesses. Increasing the DL1 to 8KB therefore yields a large improvement in execution time. Further increases in DL1 size lead to larger increases in energy than decreases in execution time, so the optimal DL1 size for EDP is 8KB.
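
A rough capacity estimate illustrates why 4KB is too small and 8KB suffices. This is a sketch under assumptions: it takes 4-byte floating-point values and ignores associativity, replacement policy, and any other data touched during scoring; only the 4KB allocation per GMM, the 39 dimensions, and the eight-frame block come from the text.

```python
BYTES_PER_FLOAT = 4              # assumption: single-precision values
FEATURE_DIM = 39                 # from the text
FRAMES_PER_BLOCK = 8             # from the text
GMM_ALLOCATION_BYTES = 4 * 1024  # from the text: malloc allocated 4KB per GMM


def scoring_working_set_bytes(frames=FRAMES_PER_BLOCK):
    """Bytes touched when scoring a block of frames against one GMM."""
    feature_bytes = frames * FEATURE_DIM * BYTES_PER_FLOAT
    return GMM_ALLOCATION_BYTES + feature_bytes


def fits_in_cache(cache_bytes, frames=FRAMES_PER_BLOCK):
    """Capacity-only check; conflict misses are not modeled."""
    return scoring_working_set_bytes(frames) <= cache_bytes


working_set = scoring_working_set_bytes()
fits_4kb = fits_in_cache(4 * 1024)
fits_4kb_single_frame = fits_in_cache(4 * 1024, frames=1)
fits_8kb = fits_in_cache(8 * 1024)
```

Under these assumptions the GMM allocation alone fills a 4KB cache, so even one feature vector pushes the working set over capacity, whereas the full eight-frame block fits comfortably in 8KB.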

3.4. Advanced Processor Configurations

Since embedded speech recognition needs a more powerful processor, in the following experiments we measured the sensitivity of performance to reasonable upgrades to the ARM11 architecture. We focused on four modifications that bring the ARM11 processor closer to the Intel one: (1) a 256KB UL2, (2) double the instructions fetched/decoded/issued, (3) an out-of-order (“OOO”) core, and (4) L1 caches with LRU replacement. The resulting recognition speeds and percent changes are shown in Table 3.

![Table 3. Performance of ARM11 processor configurations.](image)

Both doubling the instructions fetched/decoded/issued and switching to an out-of-order (OOO) core greatly improve recognition speed because these changes expose more of the instruction-level parallelism. Doubling the instructions fetched and decoded did not by itself impact performance; all the gains were due to issuing two instructions per cycle instead of one. When the working set is in the DL1, there are enough data and ALUs; all that is missing is the instructions. Using an OOO core leads to a 16% improvement in execution time, more than any other modification or single “perfect” configuration. Both feature scoring and search achieve individual gains of over 13% because independent instructions are no longer stalled by memory misses. Although these modifications increase the power the processor draws, the overall energy consumed decreases because the execution time decreases by a larger factor.

Adding the unified L2 cache slightly improved execution time, but the gains are offset by the extra energy consumed and the increase in chip area. For feature scoring, a cache faster than main memory would help if the working set did not fit in the DL1 cache, but the 16KB DL1 cache is large enough to prevent thrashing. Other data, like the acoustic model and active HMMs, are streamed in with little reuse, so there is little temporal locality for the UL2 to exploit. Using caches with LRU replacement had a minimal effect.

We also combined the different improvements to see what performance could be expected in the future. Using an out-of-order core capable of issuing two instructions per cycle improved recognition speed by 45.8%, which is greater than the sum of the individual speedups. Its performance is even better than the baseline “perfect all” configuration, even though it has much slower memory, fewer instructions fetched/decoded/issued, and fewer ALUs. The two upgrades are complementary because issuing more instructions helps when the data is in the cache, while out-of-order execution helps more when the processor needs to wait for main memory. Adding a UL2 cache further improves recognition speed, although the result is still 42% slower than the best possible performance an OOO core can achieve, mainly because of main memory latencies.

Another way to improve performance is to use faster processors, but power is directly proportional to clock frequency, so it is unlikely that embedded systems will reach the speeds of workstation processors. Also, faster clock frequencies do not completely translate to faster recognition because memory latencies do not scale. Using a 4 times faster clock with our baseline ARM11 configuration improves recognition speed by only 2.51 times, to 3.74 xRT, and assuming energy scales linearly with frequency, the energy used increases by over a factor of 5.
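
One way to interpret the clock-scaling result is a two-component model in which part of the execution time shrinks with the clock and the rest (memory latency) does not. The sketch below fits that model to the reported 4x clock and 2.51x speedup; the model and the function names are assumptions made for illustration, and the derived fraction is an inference from the model, not a measurement from this study.

```python
def non_scaling_fraction(clock_factor, observed_speedup):
    """Fraction m of baseline time that does not scale with the clock.

    Model (an assumption): 1 / speedup = (1 - m) / clock_factor + m.
    """
    inv_speedup = 1.0 / observed_speedup
    inv_clock = 1.0 / clock_factor
    return (inv_speedup - inv_clock) / (1.0 - inv_clock)


def modeled_speedup(clock_factor, fraction_fixed):
    """Speedup predicted by the same two-component model."""
    return 1.0 / ((1.0 - fraction_fixed) / clock_factor + fraction_fixed)


BASELINE_ARM11_XRT = 9.40  # from Table 2

fixed = non_scaling_fraction(4.0, 2.51)
xrt_at_4x = BASELINE_ARM11_XRT / 2.51
speedup_limit = 1.0 / fixed  # model's ceiling as clock_factor grows without bound
```

Under this simple model about a fifth of the baseline execution time is insensitive to clock frequency, which would cap the speedup from clock scaling alone at roughly 5x regardless of how fast the core runs.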

4. Analysis and Conclusions

By profiling Sphinx 3.0 on a resource-limited processor, we understand how hardware resources impact the speed and energy characteristics of a workstation recognizer. Although different recognizers, different acoustic/language models, and different hardware will change the specific breakdowns, the general trends will still hold true.

When designing a recognizer for a specific platform or choosing a platform for a specific recognizer, it is important to find the right balance of memory bandwidth, functional units, and instructions fetched/decoded/issued to optimize the speed and energy. For our baseline system, adding more functional units did not improve execution time and only increased energy consumed. IL1 size matters little, DL1 must be large enough to compute GMM probabilities without capacity or conflict misses, and adding a UL2 cache does not provide enough speedup to offset the power and area costs.

Even with technology scaling, it is doubtful that the ARM11 architecture will achieve workstation-level accuracy or achieve sizable gains in recognition speed unless an out-of-order core is used. The in-order core severely limits the instruction-level parallelism exploited, restricting any further gains from other hardware improvements like issuing more instructions per cycle or adding more on-chip memories. Using an out-of-order core also leads to more energy-efficient speedups than increasing the clock frequency. If further performance improvements are desired, one could remove the limitations inherent to software running on processors by using custom hardware speech recognizers, which can simultaneously improve decoding speed and energy used without impacting accuracy [16].

Of course, in any practical implementation scenario, the recognizer architecture is not fixed, and is adjusted to match the hardware platform. But by reversing these assumptions, we gain some deep insight into how future platforms might evolve – or perhaps be encouraged to evolve – to ease the task of designing the best possible embedded recognizers.

5. Acknowledgements

This research was supported by the National Science Foundation and the FCRP Focus Center for Circuit & System Solutions (C2S2). K. Yu is supported by a National Science Foundation Graduate Research Fellowship.

6. References