七、语音识别

在本章中，我们将介绍以下食谱:

读取和绘制音频数据
将音频信号转换到频域
生成带有自定义参数的音频信号
合成音乐
提取频域特征
建立隐马尔可夫模型
构建语音识别器

简介

语音识别是指识别和理解口语的过程。输入以音频数据的形式出现，语音识别器将处理这些数据，从中提取有意义的信息。这有很多实际用途，例如语音控制设备、将口语转录成单词、安全系统等等。

语音信号在本质上是多用途的。同一种语言有许多不同的说法。言语有不同的要素，如语言、情感、语气、噪音、口音等等。很难严格定义一套可以构成言语的规则。即使有所有这些变化，人类也非常善于相对容易地理解所有这些。因此，我们需要机器以同样的方式理解语音。

在过去的几十年里，研究人员致力于语音的各个方面，如识别说话者、理解单词、识别口音、翻译语音等。在所有这些任务中，自动语音识别一直是许多研究者关注的焦点。在本章中，我们将学习如何构建语音识别器。

读取和绘制音频数据

让我们来看看如何读取音频文件并可视化信号。这将是一个很好的起点，它将让我们对音频信号的基本结构有一个很好的了解。在开始之前，我们需要了解音频文件是实际音频信号的数字化版本。实际的音频信号是复杂的连续值波。为了保存数字版本，我们对信号进行采样并将其转换为数字。例如，语音通常以 44100 赫兹采样。这意味着信号的每一秒都被分解成 44100 个部分，并且存储这些时间戳的值。换句话说，每 1/44100 秒存储一个值。由于采样率高，当我们在媒体播放器上收听时，我们会感觉到信号是连续的。

怎么做…

新建一个 Python 文件，导入以下包:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile

我们将使用wavfile包从已经提供给您的input_read.wav输入文件中读取音频文件:
```
# Read the input file
sampling_freq, audio = wavfile.read('input_read.wav')
```

让我们打印出这个信号的参数:

# Print the params
print '\nShape:', audio.shape
print 'Datatype:', audio.dtype
print 'Duration:', round(audio.shape[0] / float(sampling_freq), 3), 'seconds'

音频信号存储为 16 位有符号整数数据。我们需要标准化这些值:
```
# Normalize the values
audio = audio / (2.**15)
```

让我们提取前 30 个值进行绘图，如下所示:

# Extract first 30 values for plotting
audio = audio[:30]

X 轴是时间轴。让我们构建这个轴，考虑到它应该使用采样频率因子进行缩放的事实:
```
# Build the time axis
x_values = np.arange(0, len(audio), 1) / float(sampling_freq)
```
将单位转换为秒:
```
# Convert to seconds
x_values *= 1000
```

让我们将绘制如下:

# Plotting the chopped audio signal
plt.plot(x_values, audio, color='black')
plt.xlabel('Time (ms)')
plt.ylabel('Amplitude')
plt.title('Audio signal')
plt.show()

The full code is in the read_plot.py file. If you run this code, you will see the following signal:
You will also see the following printed on your Terminal:

![How to do it…](img/B05485_07_02.jpg)

将音频信号转换到频域

音频信号由不同频率、振幅和相位的正弦波的复杂混合物组成。正弦波也被称为正弦波。音频信号的频率内容中隐藏着大量信息。事实上，音频信号的主要特征是其频率内容。整个的语言和音乐世界就是基于这个事实。在你继续之前，你需要一些关于傅立叶变换的知识。快速复习可以在http://www.thefouriertransform.com找到。现在，让我们看看如何将音频信号转换到频域。

怎么做…

新建一个 Python 文件，导入如下包:

import numpy as np
from scipy.io import wavfile
import matplotlib.pyplot as plt

阅读已经提供给你的input_freq.wav文件:

# Read the input file
sampling_freq, audio = wavfile.read('input_freq.wav')

将信号标准化，如下所示:

# Normalize the values
audio = audio / (2.**15)

音频信号只是一个 NumPy 数组。因此，您可以使用以下代码提取长度:
```
# Extract length
len_audio = len(audio)
```

让我们应用傅立叶变换。傅立叶变换信号沿中心镜像，所以我们只需要取变换信号的前半部分。我们的最终目标是提取电力信号。因此，我们对信号中的值进行平方，为这个做准备:

# Apply Fourier transform
transformed_signal = np.fft.fft(audio)
half_length = np.ceil((len_audio + 1) / 2.0)
transformed_signal = abs(transformed_signal[0:half_length])
transformed_signal /= float(len_audio)
transformed_signal **= 2

提取信号的长度:

# Extract length of transformed signal
len_ts = len(transformed_signal)

我们需要根据信号的长度加倍信号:

# Take care of even/odd cases
if len_audio % 2:
    transformed_signal[1:len_ts] *= 2
else:
    transformed_signal[1:len_ts-1] *= 2

使用以下公式提取功率信号:

# Extract power in dB
power = 10 * np.log10(transformed_signal)

X 轴是时间轴。我们需要根据采样频率进行缩放，然后将其转换为秒:

# Build the time axis
x_values = np.arange(0, half_length, 1) * (sampling_freq / len_audio) / 1000.0

绘制信号，如下所示:

```py
# Plot the figure
plt.figure()
plt.plot(x_values, power, color='black')
plt.xlabel('Freq (in kHz)')
plt.ylabel('Power (in dB)')
plt.show()
```

The full code is in the freq_transform.py file. If you run this code, you will see the following figure:

![How to do it…](img/B05485_07_03.jpg)

生成带有自定义参数的音频信号

我们可以使用 NumPy 生成音频信号。如前所述，音频信号是正弦曲线的复杂混合物。因此，当我们生成自己的音频信号时，我们会记住这一点。

怎么做…

新建一个 Python 文件，导入以下包:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io.wavfile import write

我们需要定义生成的音频将被存储的输出文件:

# File where the output will be saved
output_file = 'output_generated.wav'

让我们指定音频生成参数。我们想产生一个采样频率为44100且音调频率为587 Hz 的三秒长信号。时间轴上的值将从 -2pi* 到 2pi* :
```
# Specify audio parameters
duration = 3  # seconds
sampling_freq = 44100  # Hz
tone_freq = 587
min_val = -2 * np.pi
max_val = 2 * np.pi
```

让我们生成时间轴和音频信号。音频信号是一个简单的正弦曲线，具有前面提到的参数:

# Generate audio
t = np.linspace(min_val, max_val, duration * sampling_freq)
audio = np.sin(2 * np.pi * tone_freq * t)

让我们给信号添加一些噪声:

# Add some noise
noise = 0.4 * np.random.rand(duration * sampling_freq)
audio += noise

我们需要将这些值缩放到 16 位整数，然后才能存储它们:

# Scale it to 16-bit integer values
scaling_factor = pow(2,15) - 1
audio_normalized = audio / np.max(np.abs(audio))
audio_scaled = np.int16(audio_normalized * scaling_factor)

将此信号写入输出文件:

# Write to output file
write(output_file, sampling_freq, audio_scaled)

使用前 100 个值绘制信号:

# Extract first 100 values for plotting
audio = audio[:100]

生成时间轴:

# Build the time axis
x_values = np.arange(0, len(audio), 1) / float(sampling_freq)

将时间轴转换为秒:

```py
# Convert to seconds
x_values *= 1000
```

绘制信号，如下所示:

```py
# Plotting the chopped audio signal
plt.plot(x_values, audio, color='black')
plt.xlabel('Time (ms)')
plt.ylabel('Amplitude')
plt.title('Audio signal')
plt.show()
```

The full code is in the generate.py file. If you run this code, you will get the following figure:

![How to do it…](img/B05485_07_04.jpg)

合成音乐

既然我们知道了如何生成音频，那就用这个原理来合成一些音乐吧。你可以查看这个链接，http://www.phy.mtu.edu/~suits/notefreqs.html。该链接列出了各种音符，如 A 、 G 、 D 等等以及它们对应的频率。我们将使用它来生成一些简单的音乐。

怎么做…

新建一个 Python 文件，导入以下包:

import json
import numpy as np
from scipy.io.wavfile import write
import matplotlib.pyplot as plt

根据输入参数

# Synthesize tone
def synthesizer(freq, duration, amp=1.0, sampling_freq=44100):

，定义合成音调的功能

构建时间轴值:

    # Build the time axis
    t = np.linspace(0, duration, duration * sampling_freq)

使用输入参数构建音频样本，例如振幅和频率:

    # Construct the audio signal
    audio = amp * np.sin(2 * np.pi * freq * t)

    return audio.astype(np.int16)

我们来定义main函数。已经向您提供了一个名为tone_freq_map.json的 JSON 文件，其中包含一些音符及其频率:
```
if __name__=='__main__':
    tone_map_file = 'tone_freq_map.json'
```

加载文件:

    # Read the frequency map
    with open(tone_map_file, 'r') as f:
        tone_freq_map = json.loads(f.read())

假设我们想要生成一个持续时间为2秒的 G 音符:

    # Set input parameters to generate 'G' tone
    input_tone = 'G'
    duration = 2     # seconds
    amplitude = 10000
    sampling_freq = 44100    # Hz

用以下参数调用函数:

    # Generate the tone
    synthesized_tone = synthesizer(tone_freq_map[input_tone], duration, amplitude, sampling_freq)

将生成的信号写入输出文件:

    # Write to the output file
    write('output_tone.wav', sampling_freq, synthesized_tone)

在媒体播放器中打开此文件并收听。那就是 G 注！让我们做一些更有趣的事情。让我们按顺序生成一些音符，给它一种音乐的感觉。定义音符序列及其持续时间(以秒为单位):

```py
    # Tone-duration sequence
    tone_seq = [('D', 0.3), ('G', 0.6), ('C', 0.5), ('A', 0.3), ('Asharp', 0.7)]
```

遍历这个列表，为每个列表调用合成器函数:

```py
    # Construct the audio signal based on the chord sequence
    output = np.array([])
    for item in tone_seq:
        input_tone = item[0]
        duration = item[1]
        synthesized_tone = synthesizer(tone_freq_map[input_tone], duration, amplitude, sampling_freq)
        output = np.append(output, synthesized_tone, axis=0)
```

将信号写入输出文件:

```py
    # Write to the output file
    write('output_tone_seq.wav', sampling_freq, output)
```

完整的代码在synthesize_music.py文件中。您可以在媒体播放器中打开output_tone_seq.wav文件并收听。你能感受到音乐！

提取频域特征

我们前面讨论了如何将信号转换到频域。在大多数现代语音识别系统中，人们使用频域特征。将信号转换到频域后，需要将其转换成可用的形式。梅尔频率倒频谱系数 ( MFCC )是一个很好的方法。MFCC 提取信号的功率谱，然后使用滤波器组和离散余弦变换的组合来提取特征。如果您需要快速复习，可以查看http://practical ryphysicy . com/杂项/机器学习/指南-Mel-frequency-cepstral-coefficients-mfccs。开始前，确保安装了python_speech_features包。您可以在http://python-speech-features.readthedocs.org/en/latest找到安装说明。让我们来看看如何提取 MFCC 特征。

怎么做…

新建一个 Python 文件，导入以下包:

import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile 
from features import mfcc, logfbank

阅读已经提供给你的input_freq.wav输入文件:

# Read input sound file
sampling_freq, audio = wavfile.read("input_freq.wav")

提取 MFCC 和滤波器组特征:

# Extract MFCC and Filter bank features
mfcc_features = mfcc(audio, sampling_freq)
filterbank_features = logfbank(audio, sampling_freq)

打印参数，查看生成了多少个窗口:

# Print parameters
print '\nMFCC:\nNumber of windows =', mfcc_features.shape[0]
print 'Length of each feature =', mfcc_features.shape[1]
print '\nFilter bank:\nNumber of windows =', filterbank_features.shape[0]
print 'Length of each feature =', filterbank_features.shape[1]

让我们想象一下 MFCC 的特色。我们需要变换矩阵，使时域水平:

# Plot the features
mfcc_features = mfcc_features.T
plt.matshow(mfcc_features)
plt.title('MFCC')

让我们想象一下滤波器组的特性。同样，我们需要变换矩阵，使时域水平:

filterbank_features = filterbank_features.T
plt.matshow(filterbank_features)
plt.title('Filter bank')

plt.show()

The full code is in the extract_freq_features.py file. If you run this code, you will get the following figure for MFCC features:
The filter bank features will look like the following:
You will get the following output on your Terminal:

建立隐马尔可夫模型

我们现在准备讨论语音识别。我们将使用隐马尔可夫模型 ( HMMs )来执行语音识别。hmm 非常擅长建模时间序列数据。由于音频信号是时间序列信号，HMMs 非常适合我们的需求。隐马尔可夫模型是一种表示观测序列上概率分布的模型。我们假设输出由隐藏状态产生。所以，我们的目标是找到这些隐藏的状态，这样我们就可以对信号进行建模。你可以在https://www.robots.ox.ac.uk/~vgg/rg/slides/hmm.pdf了解更多。在继续之前，您需要安装hmmlearn包。您可以在http://hmmlearn.readthedocs.org/en/latest找到安装说明。让我们来看看如何构建 hmm。

怎么做…

创建一个新的 Python 文件。让我们定义一个类来建模 HMMs:
```
# Class to handle all HMM related processing
class HMMTrainer(object):
```
Let's initialize the class. We will use Gaussian HMMs to model our data. The n_components parameter defines the number of hidden states. The cov_type defines the type of covariance in our transition matrix, and n_iter indicates the number of iterations it will go through before it stops training:
```
    def __init__(self, model_name='GaussianHMM', n_components=4, cov_type='diag', n_iter=1000):
```
前面参数的选择取决于手头的问题。您需要了解您的数据，以便以明智的方式选择这些参数。

初始化变量:

        self.model_name = model_name
        self.n_components = n_components
        self.cov_type = cov_type
        self.n_iter = n_iter
        self.models = []

用以下参数定义模型:

        if self.model_name == 'GaussianHMM':
            self.model = hmm.GaussianHMM(n_components=self.n_components, 
                    covariance_type=self.cov_type, n_iter=self.n_iter)
        else:
            raise TypeError('Invalid model type')

输入的数据是一个 NumPy 数组，其中每个元素都是一个特征向量，由k-维度:

    # X is a 2D numpy array where each row is 13D
    def train(self, X):
        np.seterr(all='ignore')
        self.models.append(self.model.fit(X))

组成

根据模型定义一种提取分数的方法:

    # Run the model on input data
    def get_score(self, input_data):
        return self.model.score(input_data)

我们建立了一个类来处理隐马尔可夫模型的训练和预测，但是我们需要一些数据来看到它的实际应用。我们将在下一个食谱中使用它来构建一个语音识别器。完整代码在speech_recognizer.py文件中。

构建语音识别器

我们需要一个语音文件的数据库来构建我们的语音识别器。我们将使用上的数据库。它包含七个不同的单词，每个单词有 15 个与之相关的音频文件。这是一个小数据集，但这足以理解如何构建一个可以识别七个不同单词的语音识别器。我们需要为每个类建立一个 HMM 模型。当我们想在一个新的输入文件中识别这个单词时，我们需要运行这个文件中的所有模型，并选择得分最高的那个。我们将使用我们在前面的食谱中构建的 HMM 类。

怎么做…

新建一个 Python 文件，导入以下包:

import os
import argparse 

import numpy as np
from scipy.io import wavfile 
from hmmlearn import hmm
from features import mfcc

定义一个函数来解析命令行中的输入参数:

# Function to parse input arguments
def build_arg_parser():
    parser = argparse.ArgumentParser(description='Trains the HMM classifier')
    parser.add_argument("--input-folder", dest="input_folder", required=True,
            help="Input folder containing the audio files in subfolders")
    return parser

定义main函数，解析输入参数:

if __name__=='__main__':
    args = build_arg_parser().parse_args()
    input_folder = args.input_folder

启动保存所有隐马尔可夫模型的变量:
```
    hmm_models = []
```

解析包含所有数据库音频文件的输入目录:

    # Parse the input directory
    for dirname in os.listdir(input_folder):

提取子文件夹的名称:

        # Get the name of the subfolder 
        subfolder = os.path.join(input_folder, dirname)

        if not os.path.isdir(subfolder): 
            continue

子文件夹的名称就是这个类的标签。使用以下方式提取:

        # Extract the label
        label = subfolder[subfolder.rfind('/') + 1:]

初始化变量进行训练:

        # Initialize variables
        X = np.array([])
        y_words = []

遍历每个子文件夹中的音频文件列表:

        # Iterate through the audio files (leaving 1 file for testing in each class)
        for filename in [x for x in os.listdir(subfolder) if x.endswith('.wav')][:-1]:

读取每个音频文件，如下所示:

```py
            # Read the input file
            filepath = os.path.join(subfolder, filename)
            sampling_freq, audio = wavfile.read(filepath)
```

提取 MFCC 特征:

```py
            # Extract MFCC features
            mfcc_features = mfcc(audio, sampling_freq)
```

继续将其附加到X变量:

```py
            # Append to the variable X
            if len(X) == 0:
                X = mfcc_features
            else:
                X = np.append(X, mfcc_features, axis=0)
```

也附上相应的标签:

```py
            # Append the label
            y_words.append(label)
```

一旦从当前类的所有文件中提取出特征，训练并保存隐马尔可夫模型。由于隐马尔可夫模型是无监督学习的生成模型，我们不需要标签来为每个类建立隐马尔可夫模型。我们明确假设将为每个类构建单独的 HMM 模型:

```py
        # Train and save HMM model
        hmm_trainer = HMMTrainer()
        hmm_trainer.train(X)
        hmm_models.append((hmm_trainer, label))
        hmm_trainer = None
```

获取未用于培训的测试文件列表:

```py
    # Test files
    input_files = [
            'data/pineapple/pineapple15.wav',
            'data/orange/orange15.wav',
            'data/apple/apple15.wav',
            'data/kiwi/kiwi15.wav'
            ]
```

解析输入文件，如下所示:

```py
    # Classify input data
    for input_file in input_files:
```

读入每个音频文件:

```py
        # Read input file
        sampling_freq, audio = wavfile.read(input_file)
```

提取 MFCC 特征:

```py
        # Extract MFCC features
        mfcc_features = mfcc(audio, sampling_freq)
```

定义存储最高分的变量和输出标签:

```py
        # Define variables
        max_score = None
        output_label = None
```

遍历所有的模型，并通过每个模型运行输入文件:

```py
        # Iterate through all HMM models and pick 
        # the one with the highest score
        for item in hmm_models:
            hmm_model, label = item
```

提取分数并存储最高分:

```py
            score = hmm_model.get_score(mfcc_features)
            if score > max_score:
                max_score = score
                output_label = label
```

打印真实和预测标签:

```py
        # Print the output
        print "\nTrue:", input_file[input_file.find('/')+1:input_file.rfind('/')]
        print "Predicted:", output_label 
```

The full code is in the speech_recognizer.py file. If you run this code, you will see the following on your Terminal:

![How to do it…](img/B05485_07_08.jpg)

OpenDocCN / apachecn-ml-zh

七、语音识别

简介

读取和绘制音频数据

怎么做…

将音频信号转换到频域

怎么做…

生成带有自定义参数的音频信号

怎么做…

合成音乐

怎么做…

提取频域特征

怎么做…

建立隐马尔可夫模型

怎么做…

构建语音识别器

怎么做…

简介

发行版

贡献者

近期动态

OpenDocCN / apachecn-ml-zh .gitee-modal { width: 500px !important; }

七、语音识别

简介

读取和绘制音频数据

怎么做…

将音频信号转换到频域

怎么做…

生成带有自定义参数的音频信号

怎么做…

合成音乐

怎么做…

提取频域特征

怎么做…

建立隐马尔可夫模型

怎么做…

构建语音识别器

怎么做…

简介

发行版

贡献者

近期动态

搜索帮助

OpenDocCN / apachecn-ml-zh