八、剖析时间序列和序列数据

在本章中，我们将介绍以下食谱:

将数据转换为时间序列格式
切片时间序列数据
对时间序列数据进行操作
从时间序列数据中提取统计数据
为序列数据建立隐马尔可夫模型
为连续文本数据构建条件随机字段
使用隐马尔可夫模型分析股票市场数据

简介

时间序列数据基本上是随时间收集的一系列测量值。这些测量是相对于预定变量并以规则的时间间隔进行的。时间序列数据的一个主要特征就是排序很重要！

我们收集的观察列表是按时间顺序排列的，它们出现的顺序说明了很多潜在的模式。如果你改变顺序，这将完全改变数据的意义。顺序数据是一个广义的概念，它包含任何以顺序形式出现的数据，包括时间序列数据。

我们在这里的目标是建立一个模型，描述时间序列或任何一般序列的模式。这种模型用于描述时间序列模式的重要特征。我们可以用这些模型来解释过去如何影响未来。我们还可以使用它们来查看两个数据集如何关联，预测未来值，或者控制基于某个指标的给定变量。

为了可视化时间序列数据，我们倾向于使用折线图或条形图来绘制它。时间序列数据分析经常用于金融、信号处理、天气预测、轨迹预测、地震预测或我们必须处理时间数据的任何领域。我们在时间序列和顺序数据分析中构建的模型应该考虑数据的顺序，并提取相邻数据之间的关系。让我们继续查看一些用 Python 分析时间序列和顺序数据的方法。

将数据转换为时间序列格式

我们将从了解如何将观测序列转换为时间序列数据并可视化开始。我们将使用名为熊猫的库来分析时间序列数据。确保你安装熊猫，然后再继续。您可以在http://pandas.pydata.org/pandas-docs/stable/install.html找到安装说明。

怎么做…

新建一个 Python 文件，导入以下包:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

让我们定义一个函数来读取一个输入文件，该文件将顺序观察转换为时间索引数据:
```
def convert_data_to_timeseries(input_file, column, verbose=False):
```
我们将使用由四列组成的文本文件。第一列表示年份，第二列表示月份，第三和第四列表示数据。让我们把它加载到一个 NumPy 数组中:
```
    # Load the input file
    data = np.loadtxt(input_file, delimiter=',')
```

按照时间顺序排列，第一行包含开始日期，最后一行包含结束日期。让我们提取这个数据集的开始和结束日期:

    # Extract the start and end dates
    start_date = str(int(data[0,0])) + '-' + str(int(data[0,1]))
    end_date = str(int(data[-1,0] + 1)) + '-' + str(int(data[-1,1] % 12 + 1))

这个函数还有一个详细模式。所以如果设置为 true，它会打印一些东西。让我们打印出开始和结束日期:
```
    if verbose:
        print "\nStart date =", start_date
        print "End date =", end_date
```

让我们创建一个熊猫变量，它包含每月间隔的日期序列:

    # Create a date sequence with monthly intervals
    dates = pd.date_range(start_date, end_date, freq='M')

我们的下一步是把给定的列转换成时间序列数据。您可以使用月份和年份(与索引相反)来访问这些数据:
```
    # Convert the data into time series data
    data_timeseries = pd.Series(data[:,column], index=dates)
```

使用详细模式打印出前十个元素:

    if verbose:
        print "\nTime series data:\n", data_timeseries[:10]

返回时间索引变量，如下所示:
```
    return data_timeseries
```
定义main功能，如下所示:

```py
if __name__=='__main__':
```

我们将使用已经提供给您的data_timeseries.txt文件:

```py
    # Input file containing data
    input_file = 'data_timeseries.txt'
```

从该文本文件加载第三列，并将其转换为时间序列数据:

```py
    # Load input data
    column_num = 2
    data_timeseries = convert_data_to_timeseries(input_file, column_num)
```

熊猫库提供了一个很好的绘图功能，你可以直接在变量上运行:

```py
    # Plot the time series data
    data_timeseries.plot()
    plt.title('Input data')

    plt.show()
```

The full code is given in the convert_to_timeseries.py file that is provided to you. If you run the code, you will see the following image:

![How to do it…](img/B05485_08_01.jpg)

对时间序列数据进行切片

在这个食谱中，我们将学习如何使用熊猫对时间序列数据进行切片。这将有助于您从时间序列数据的不同间隔中提取信息。我们将学习如何使用日期来处理数据子集。

怎么做…

新建一个 Python 文件，导入以下包:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from convert_to_timeseries import convert_data_to_timeseries

我们将使用与上一个配方中相同的文本文件对数据进行切片和切割:
```
# Input file containing data
input_file = 'data_timeseries.txt'
```

我们将再次使用第三列:

# Load data
column_num = 2
data_timeseries = convert_data_to_timeseries(input_file, column_num)

让我们假设我们想要提取给定开始和结束年份之间的数据。让我们定义这些，如下所示:
```
# Plot within a certain year range
start = '2008'
end = '2015'
```

绘制给定年份范围之间的数据:

plt.figure()
data_timeseries[start:end].plot()
plt.title('Data from ' + start + ' to ' + end)

我们也可以根据某个月的范围对数据进行切片:

# Plot within a certain range of dates
start = '2007-2'
end = '2007-11'

绘制数据，如下所示:

plt.figure()
data_timeseries[start:end].plot()
plt.title('Data from ' + start + ' to ' + end)

plt.show()

The full code is given in the slicing_data.py file that is provided to you. If you run the code, you will see the following image:
The next figure will display a smaller time frame; hence, it looks like we have zoomed into it:

根据时间序列数据运行

现在我们知道如何对数据进行切片，提取各种子集，下面我们来讨论一下如何对时间序列数据进行操作。您可以用许多不同的方式过滤数据。熊猫库允许你以任何你想要的方式对时间序列数据进行操作。

怎么做…

新建一个 Python 文件，导入以下包:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from convert_to_timeseries import convert_data_to_timeseries

我们将使用与上一个配方中相同的文本文件:

# Input file containing data
input_file = 'data_timeseries.txt'

我们将使用这个文本文件中的第三和第四列:

# Load data
data1 = convert_data_to_timeseries(input_file, 2)
data2 = convert_data_to_timeseries(input_file, 3)

将数据转换成熊猫数据帧:

dataframe = pd.DataFrame({'first': data1, 'second': data2})

绘制给定年份范围内的数据:

# Plot data
dataframe['1952':'1955'].plot()
plt.title('Data overlapped on top of each other')

让我们假设我们想要绘制在给定年份范围内刚刚加载的两列之间的差异。我们可以使用以下几行代码来做到这一点:

# Plot the difference
plt.figure()
difference = dataframe['1952':'1955']['first'] - dataframe['1952':'1955']['second']
difference.plot()
plt.title('Difference (first - second)')

如果我们想根据第一列和第二列的不同条件过滤数据，我们可以只指定这些条件并绘制如下:

# When 'first' is greater than a certain threshold
# and 'second' is smaller than a certain threshold
dataframe[(dataframe['first'] > 60) & (dataframe['second'] < 20)].plot()
plt.title('first > 60 and second < 20')

plt.show()

The full code is in the operating_on_data.py file that is already provided to you. If you run the code, the first figure will look like the following:
The second output figure denotes the difference, as follows:
The third output figure denotes the filtered data, as follows:

![How to do it…](img/B05485_08_06.jpg)

从时间序列数据中提取统计数据

我们要分析时间序列数据的主要原因之一就是从中提取有趣的统计数据。这提供了大量关于数据性质的信息。在这个食谱中，我们将看看如何提取这些统计数据。

怎么做…

新建一个 Python 文件，导入以下包:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from convert_to_timeseries import convert_data_to_timeseries

我们将使用我们在前面的配方中使用的相同文本文件进行分析:
```
# Input file containing data
input_file = 'data_timeseries.txt'
```

加载两个数据列(第三列和第四列):

# Load data
data1 = convert_data_to_timeseries(input_file, 2)
data2 = convert_data_to_timeseries(input_file, 3)

创建一个熊猫数据结构来保存这些数据。这个数据框就像一个字典，有键和值:
```
dataframe = pd.DataFrame({'first': data1, 'second': data2})
```
让我们现在开始提取一些统计数据。要提取最大值和最小值，请使用以下代码:
```
# Print max and min
print '\nMaximum:\n', dataframe.max()
print '\nMinimum:\n', dataframe.min()
```

要打印数据的平均值或行平均值，请使用以下代码:

# Print mean
print '\nMean:\n', dataframe.mean()
print '\nMean row-wise:\n', dataframe.mean(1)[:10]

滚动平均值是时间序列处理中经常使用的一个重要统计量。最著名的应用之一是平滑信号以消除噪声。滚动平均值是指计算在时间尺度上不断滑动的窗口中信号的平均值。让我们考虑24的窗口大小，并绘制如下图:
```
# Plot rolling mean
pd.rolling_mean(dataframe, window=24).plot()
```

相关系数有助于理解数据的性质，如下所示:

# Print correlation coefficients
print '\nCorrelation coefficients:\n', dataframe.corr()

让我们用60 :

# Plot rolling correlation
plt.figure()
pd.rolling_corr(dataframe['first'], dataframe['second'], window=60).plot()

plt.show()

的窗口大小来绘制这个图

The full code is given in the extract_stats.py file that is already provided to you. If you run the code, the rolling mean will look like the following:

![How to do it…](img/B05485_08_07.jpg)

The second output figure indicates the rolling correlation:

![How to do it…](img/B05485_08_08.jpg)

In the upper half of the Terminal, you will see max, min, and mean values printed, as shown in the following image:

![How to do it…](img/B05485_08_09.jpg)

In the lower half of the Terminal, you will see the row-wise mean stats and correlation coefficients printed, as seen in the following image:

![How to do it…](img/B05485_08_10.jpg)

建立序列数据的隐马尔可夫模型

当涉及到顺序数据分析时，隐马尔可夫模型 ( HMMs )真的很强大。它们被广泛用于金融、语音分析、天气预报、单词排序等等。我们通常对揭示随时间出现的隐藏模式感兴趣。

任何产生一系列输出的数据源都可能产生模式。请注意，hmm 是生成模型，这意味着一旦他们了解了底层结构，他们就可以生成数据。hmm 不能区分基本形式的类。这与区分模型形成对比，区分模型可以学习区分类，但不能生成数据。

做好准备

例如，假设我们想预测明天天气是晴朗、寒冷还是下雨。为此，我们查看所有参数，如温度、压力等，而底层状态是隐藏的。这里，基础状态指的是三个可用的选项:晴天、冷天或雨天。如果你想了解更多关于头盔显示器的知识，请点击https://www.robots.ox.ac.uk/~vgg/rg/slides/hmm.pdf查看本教程。

我们将使用hmmlearn来构建和培训 hmm。在继续之前，请确保安装了此软件。您可以在http://hmmlearn.readthedocs.org/en/latest找到安装说明。

怎么做…

新建一个 Python 文件，导入以下包:

import datetime

import numpy as np
import matplotlib.pyplot as plt
from hmmlearn.hmm import GaussianHMM

from convert_to_timeseries import convert_data_to_timeseries

我们将使用已经提供给您的名为data_hmm.txt的文件中的数据。该文件包含逗号分隔的行。每行包含三个值:一年、一个月和一个浮点数据。让我们将它加载到一个 NumPy 数组中:
```
# Load data from input file
input_file = 'data_hmm.txt'
data = np.loadtxt(input_file, delimiter=',')
```
让我们按列堆叠数据进行分析。我们不需要在技术上对它进行列堆叠，因为它只有一列。但是，如果您有多个列要分析，您可以使用以下结构:
```
# Arrange data for training 
X = np.column_stack([data[:,2]])
```
使用四个组件创建和训练隐马尔可夫模型。成分的数量是我们必须选择的超参数。这里，通过选择四个，我们说数据是使用四个底层状态生成的。我们将很快看到性能如何随此参数变化:
```
# Create and train Gaussian HMM 
print "\nTraining HMM...."
num_components = 4
model = GaussianHMM(n_components=num_components, covariance_type="diag", n_iter=1000)
model.fit(X)
```

运行预测器获取隐藏状态:

# Predict the hidden states of HMM 
hidden_states = model.predict(X)

计算隐藏状态的均值和方差:

print "\nMeans and variances of hidden states:"
for i in range(model.n_components):
    print "\nHidden state", i+1
    print "Mean =", round(model.means_[i][0], 3)
    print "Variance =", round(np.diag(model.covars_[i])[0], 3)

正如我们之前讨论的，hmm 是生成模型。因此，让我们生成例如1000样本并绘制如下:

# Generate data using model
num_samples = 1000
samples, _ = model.sample(num_samples) 
plt.plot(np.arange(num_samples), samples[:,0], c='black')
plt.title('Number of components = ' + str(num_components))

plt.show()

The full code is given in the hmm.py file that is already provided to you. If you run the code, you will see the following figure:
You can experiment with the n_components parameter to see how the curve gets nicer as you increase it. You can basically give it more freedom to train and customize by allowing a larger number of hidden states. If you increase it to 8, you will see the following figure:
If you increase this to 12, it will get even smoother:

![How to do it…](img/B05485_08_13.jpg)

In the Terminal, you will get the following output:

![How to do it…](img/B05485_08_14.jpg)

为连续文本数据构建条件随机字段

条件随机场 ( CRFs )是用于分析结构化数据的概率模型。它们经常用于标记和分割序列数据。通用报告格式是区别模型，而 hmm 是生成模型。通用报告格式被广泛用于分析序列、股票、语音、单词等。在这些模型中，给定一个特定的标记观察序列，我们定义了这个序列的条件概率分布。这与 hmm 形成对比，在 hmm 中，我们定义了标签和观察序列的联合分布。

做好准备

hmm 假设当前输出在统计上独立于以前的输出。这是 hmm 需要的，以确保推理以健壮的方式工作。然而，这个假设不一定总是正确的！时间序列设置中的当前输出通常取决于先前的输出。相对于 hmm，CRF 的主要优势之一是它们本质上是有条件的，这意味着我们不假设输出观测之间有任何独立性。使用通用报告格式比使用 hmm 还有其他一些优势。在许多应用中，如语言学、生物信息学、语音分析等，通用报告格式往往优于 hmm。在这个食谱中，我们将学习如何使用 CRFs 来分析字母序列。

我们将使用名为pystruct的库来构建和训练 CRF。在继续之前，请确保安装了此软件。您可以在https://pystruct.github.io/installation.html找到安装说明。

怎么做…

新建一个 Python 文件，导入以下包:

import os
import argparse 
import cPickle as pickle 

import numpy as np
import matplotlib.pyplot as plt
from pystruct.datasets import load_letters
from pystruct.models import ChainCRF
from pystruct.learners import FrankWolfeSSVM

定义一个参数解析器，将C值作为输入参数。C是一个超参数，它控制你希望你的模型有多具体，同时又不会失去归纳的能力:

def build_arg_parser():
    parser = argparse.ArgumentParser(description='Trains the CRF classifier')
    parser.add_argument("--c-value", dest="c_value", required=False, type=float,
            default=1.0, help="The C value that will be used for training")
    return parser

定义一个类来处理所有与 CRF 相关的处理:
```
class CRFTrainer(object):
```

定义一个init函数来初始化值:

    def __init__(self, c_value, classifier_name='ChainCRF'):
        self.c_value = c_value
        self.classifier_name = classifier_name

我们将使用链式 CRF 来分析数据。我们需要为此添加一个错误检查，如下所示:

        if self.classifier_name == 'ChainCRF':
            model = ChainCRF()

定义我们将在通用报告格式模型中使用的分类器。我们将使用一种类型的支持向量麻吉 ne 来实现这一点:

            self.clf = FrankWolfeSSVM(model=model, C=self.c_value, max_iter=50) 
        else:
            raise TypeError('Invalid classifier type')

加载字母数据集。该数据集由分段字母及其相关特征向量组成。我们不会分析图像，因为我们已经有了特征向量。每个单词的第一个字母都被去掉了，所以我们只有小写字母:
```
    def load_data(self):
        letters = load_letters()
```

将数据和标签加载到各自的变量中:

        X, y, folds = letters['data'], letters['labels'], letters['folds']
        X, y = np.array(X), np.array(y)
        return X, y, folds

定义一种训练方法，如下所示:

    # X is a numpy array of samples where each sample
    # has the shape (n_letters, n_features) 
    def train(self, X_train, y_train):
        self.clf.fit(X_train, y_train)

定义评估模型性能的方法:

```py
    def evaluate(self, X_test, y_test):
        return self.clf.score(X_test, y_test)
```

定义新数据的分类方法:

```py
    # Run the classifier on input data
    def classify(self, input_data):
        return self.clf.predict(input_data)[0]
```

这些字母被编入一个编号数组。为了检查输出并使其可读，我们需要将这些数字转换成字母。为此定义一个函数:

```py
def decoder(arr):
```

```py
    alphabets = 'abcdefghijklmnopqrstuvwxyz'
    output = ''
    for i in arr:
        output += alphabets[i] 

    return output
```

定义函数并解析输入参数:

```py
if __name__=='__main__':
    args = build_arg_parser().parse_args()
    c_value = args.c_value
```

用类和C值初始化变量:

```py
    crf = CRFTrainer(c_value)
```

加载字母数据:

```py
    X, y, folds = crf.load_data()
```

将数据分成训练和测试数据集:

```py
    X_train, X_test = X[folds == 1], X[folds != 1]
    y_train, y_test = y[folds == 1], y[folds != 1]
```

训练通用报告格式模型，如下所示:

```py
    print "\nTraining the CRF model..."
    crf.train(X_train, y_train)
```

评估通用报告格式模型的性能:

```py
    score = crf.evaluate(X_test, y_test)
    print "\nAccuracy score =", str(round(score*100, 2)) + '%'
```

让我们取一个随机测试向量，并使用模型预测输出:

```py
    print "\nTrue label =", decoder(y_test[0])
    predicted_output = crf.classify([X_test[0]])
    print "Predicted output =", decoder(predicted_output)
```

The full code is given in the crf.py file that is already provided to you. If you run this code, you will get the following output on your Terminal. As we can see, the word is supposed to be "commanding". The CRF does a pretty good job of predicting all the letters:

![How to do it…](img/B05485_08_15.jpg)

利用隐马尔可夫模型分析股市数据

让我们使用隐马尔可夫模型分析股市数据。股市数据是时间序列数据的一个很好的例子，其中数据以日期的形式组织。在我们将要使用的数据集中，我们可以看到不同公司的股票价值是如何随着时间波动的。隐马尔可夫模型是用于分析这种时间序列数据的生成模型。在这个食谱中，我们将使用这些模型来分析股票价值。

怎么做…

新建一个 Python 文件，导入以下包:

import datetime

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.finance import quotes_historical_yahoo_ochl
from hmmlearn.hmm import GaussianHMM

从雅虎财经获取股票报价。matplotlib中有一种方法可以直接加载:

# Get quotes from Yahoo finance
quotes = quotes_historical_yahoo_ochl("INTC", 
        datetime.date(1994, 4, 5), datetime.date(2015, 7, 3))

每个报价中有六个值。让我们提取相关数据，如股票的收盘价和交易的股票数量，以及它们对应的日期:

# Extract the required values
dates = np.array([quote[0] for quote in quotes], dtype=np.int)
closing_values = np.array([quote[2] for quote in quotes])
volume_of_shares = np.array([quote[5] for quote in quotes])[1:]

让我们计算每种类型数据的收盘价的百分比变化。我们将使用这作为特征之一:

# Take diff of closing values and computing rate of change
diff_percentage = 100.0 * np.diff(closing_values) / closing_values[:-1]

dates = dates[1:]

按列堆叠两个数组进行训练:

# Stack the percentage diff and volume values column-wise for training
X = np.column_stack([diff_percentage, volume_of_shares])

使用五个组件训练隐马尔可夫模型:

# Create and train Gaussian HMM 
print "\nTraining HMM...."
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000)

model.fit(X)

使用训练好的隐马尔可夫模型生成500样本并绘制出来，如下所示:

# Generate data using model
num_samples = 500 
samples, _ = model.sample(num_samples) 
plt.plot(np.arange(num_samples), samples[:,0], c='black')

plt.show()

The full code is given in hmm_stock.py that is already provided to you. If you run this code, you will see the following figure:

OpenDocCN / apachecn-ml-zh

八、剖析时间序列和序列数据

简介

将数据转换为时间序列格式

怎么做…

对时间序列数据进行切片

怎么做…

根据时间序列数据运行

怎么做…

从时间序列数据中提取统计数据

怎么做…

建立序列数据的隐马尔可夫模型

做好准备

怎么做…

为连续文本数据构建条件随机字段

做好准备

怎么做…

利用隐马尔可夫模型分析股市数据

怎么做…

简介

发行版

贡献者

近期动态

OpenDocCN / apachecn-ml-zh .gitee-modal { width: 500px !important; }

八、剖析时间序列和序列数据

简介

将数据转换为时间序列格式

怎么做…

对时间序列数据进行切片

怎么做…

根据时间序列数据运行

怎么做…

从时间序列数据中提取统计数据

怎么做…

建立序列数据的隐马尔可夫模型

做好准备

怎么做…

为连续文本数据构建条件随机字段

做好准备

怎么做…

利用隐马尔可夫模型分析股市数据

怎么做…

简介

发行版

贡献者

近期动态

搜索帮助

OpenDocCN / apachecn-ml-zh