机器学习笔记4

03 Feb 2020

1. RNN

RNN（Recurrent Neural Network）是一类用于处理序列数据的神经网络。序列可以是时间序列、文字序列等，序列数据有一个特点——后面的数据跟前面的数据有关系。

应用：NLP情感分析，字符生成

1.1 杂记

字典统计：Counter 类 collections.Counter([iterable-or-mapping])

Counter 集成于 dict 类，因此也可以使用字典的方法，此类返回一个以元素为 key 、元素个数为 value 的 Counter 对象集合

>>> c = Counter('which')
>>> c.update('witch')           # add elements from another iterable
>>> d = Counter('watch')
>>> c.update(d)                 # add elements from another counter
>>> c['h']                      # four 'h' in which, witch, and watch
4

统计单词：
text = "hekl hekl hello my doc"
counter.update(text.split())

>>> b = [2 for i in range(5)]#生成多个数据
>>> b
[2, 2, 2, 2, 2]

Python sorted() 函数

Python 内置函数

sorted() 函数对所有可迭代的对象进行排序操作。

sort 与 sorted 区别：
sort 是应用在 list 上的方法，sorted 可以对所有可迭代的对象进行排序操作。

word_counts = sorted(counter, key=counter.get, reverse=True)
key = counter.get)将返回一个counter.get(item)的值。key是一个函数

运行Python，解释器给出如下输出：<generator object at 0x000001B66C1799A8>

  tuple1 = (1,2,4,8,16,32,64,128,256)
  list1 = [i for i in tuple1]
  print(i for i in list1) ##correct, print([i for i in list1])

numpy 插入一列

使用 np.insert

np.insert(a, 0, values=b, axis=1)
array([[1, 1, 2, 3],
	[1, 4, 5, 6],
	[1, 7, 8, 9]])
np.insert(a, 3, values=b, axis=1)
array([[1, 2, 3, 1],
	[4, 5, 6, 1],
	[7, 8, 9, 1]])

使用’column_stack’

np.column_stack((a,b))
 
array([[ 1.,  2.,  3.,  1.],
       [ 4.,  5.,  6.,  1.],
       [ 7.,  8.,  9.,  1.]])

np.concatenate数组拼接

  >>> b = np.array([2, 2, 2, 2, 2])
  >>> c = np.array([1,2,3])
  >>> a = np.concatenate((b,c)，axis=0)
  >>> a
  array([2, 2, 2, 2, 2, 1, 2, 3])
	
  插入一行
  >>> a = np.row_stack((b,c))
  >>> a
  array([[2, 2, 2, 2, 2],
         [0, 1, 2, 3, 4]])

pyprind包

功能进度展示：monitor－是否查看CPU使用

import time
import pyprind

pbar = pyprind.ProgBar(10, title='进度', monitor=True, bar_char='█')
for i in range(10):
    time.sleep(0.5)
    pbar.update()

python中List类型与numpy.array类型的互相转换

  List转numpy.array:
  temp = np.array(list) 
	
  numpy.array转List:
  arr = temp.tolist() 

pandas输入输出

  输入
  reviewdata = pd.read_csv('moveTmp.csv')
  reviewdata.values()

  输出
  data = pd.Series(mapped_reviews) ## mapped_reviews:list
  data.to_csv('moveTmp.csv')

pandas数据结构

Series 是带标签的一维数组，可存储整数、浮点数、字符串、Python 对象等类型的数据。轴标签统称为索引。调用 pd.Series 函数即可创建 Series：

》 s = pd.Series(data, index=index)

DataFrame 是由多种类型的列构成的二维标签数据结构，类似于 Excel 、SQL 表，或 Series 对象构成的字典。DataFrame 是最常用的 Pandas 对象，与 Series 一样，DataFrame 支持多种类型的输入数据：

一维 ndarray、列表、字典、Series 字典
二维 numpy.ndarray
结构多维数组或记录多维数组
Series
DataFrame

In [37]: d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [38]: df = pd.DataFrame(d)

索引 / 选择

索引基础用法如下：

操作				句法				结果
选择列			  df[col]			Series
用标签选择行		df.loc[label]	Series
用整数位置选择行   df.iloc[loc]		Series
行切片			   df[5:10]			DataFrame
用布尔向量选择行	df[bool_vec]	DataFrame

numpy:选择，是不包括最后一个索引
>>> a
array([[2, 2, 2, 2, 2],
       [0, 1, 2, 3, 4]])
>>> a[:,:-2]
array([[2, 2, 2],
       [0, 1, 2]])

2.tensorflow

安装tensorflow-1.14.0时，出现了以下错误

运行tensorflow时出现了两种错误: 我们来看看，错误分为两类： D:\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:516: FutureWarning: Passing (type, 1) or ‘1type’ as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / ‘(1,)type’. _np_qint8 = np.dtype([(“qint8”, np.int8, 1)])

D:\Anaconda3\lib\site-packages\h5py_init_.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type. from ._conv import register_converters as _register_converters

numpy的版本与tensorflow版本不匹配，更换到numpy-1.16.0
pip install h5py==2.8.0rc1

另外，其它tensorflow的版本与GPU匹配使用，需要显卡CDUA的版本匹配。

numpy切片索引：倒序表达为 b=a[::-1] # [start: stop: step]

切片还可以包括省略号 …，来使选择元组的长度与数组的维度相同。如果在行位置使用省略号，它将返回包含行中元素的 ndarray。

  import numpy as np
	 
  a = np.array([[1,2,3],[3,4,5],[4,5,6]])  
  print (a[...,1])   # 第2列元素
  print (a[1,...])   # 第2行元素
  print (a[...,1:])  # 第2列及剩下的所有元素
  输出结果为：
  [2 4 5]
  [3 4 5]
  [[2 3]
   [4 5]
   [5 6]]

tf.reshape(sensor, shape=[], name=’’)

shape=[-1], -1用来推断形状（infer）: reshape(t, [ 2, -1, 3])

与numpy.reshape中的一样。np.resize是在本身上进行操作，np.reshape返回的是修改之后的参数。

np.prod（）计算元素的乘积

空阵，返回1，默认计算所有元素的积

np.prod([])
Out[3]: 1.0
np.prod([[1,3],[4,5]])
Out[4]: 60

tf.reduce_mean()

用于计算张量tensor沿着指定的数轴（tensor的某一维度）上的的平均值，主要用作降维或者计算tensor（图像）的平均值

  x = tf.constant([[1., 1.], [2., 2.]])
  tf.reduce_mean(x)  # 1.5
  tf.reduce_mean(x, 0)  # [1.5, 1.5]
  tf.reduce_mean(x, 1)  # [1.,  2.]

	sess=tf.Session()
	sess.run(tf.global_variables_initializer())
	print("out1=",x.eval(session=sess))
	out1= 1.5
	
tensor_a=tf.convert_to_tensor(a)


类似函数还有:

tf.reduce_sum ：计算tensor指定轴方向上的所有元素的累加和;
tf.reduce_max  :  计算tensor指定轴方向上的各个元素的最大值;
tf.reduce_all :  计算tensor指定轴方向上的各个元素的逻辑和（and运算）;
tf.reduce_any:  计算tensor指定轴方向上的各个元素的逻辑或（or运算）;

tf.nn.softmax_cross_entropy_with_logits的用法

tf.nn.softmax_cross_entropy_with_logits(logits, labels, name=None)

函数功能：计算最后一层是softmax层的cross entropy，把softmax计算与cross entropy计算放到一起了，用一个函数来实现，用来提高程序的运行速度。

与方法有关的一共两个参数：

第一个参数logits：就是神经网络最后一层的输出，如果有batch的话，它的大小就是[batchsize，num_classes]，单样本的话，大小就是num_classes

第二个参数labels：实际的标签，大小同上

如果要求交叉熵，我们要再做一步tf.reduce_sum操作,就是对向量里面所有元素求和，最后才得到；
如果求loss，则要做一步tf.reduce_mean操作，对向量求均值！

计算损失：
cross_entropy_loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits(
                logits=h4, labels=tf_y_onehot),
            name='cross_entropy_loss')

tf.equal( x, y, name=None)

返回(x == y)元素的真值bool。必须是以下类型之一:bfloat16、half、float32、float64、uint8、int8、int16、int32、int64、complex64、quint8、qint8、qint32、string、bool、complex128。

tf.layers.dense(input, C)

Dense层就是全连接神经网络层，逻辑上等价于这样一个函数：

权重W为m*n的矩阵.

输入x为n维向量.

激活函数Activation.

偏置bias.

输出向量out为m维向量.

out=Activation(Wx+bias).

3.CNN

卷积神经网络（Convolutional Neural Network，简称CNN），是一种前馈神经网络，人工神经元可以响应周围单元，可以进行大型图像处理。卷积神经网络包括卷积层和池化层。

池化层

通常会在卷积层之间周期性插入一个池化层，其作用是逐渐降低数据体的空间尺寸，这样就能够减少网络中参数的数量，减少计算资源耗费，同时也能够有效地控制过拟合。

池化层和卷积层一样也有一个空间窗口，通常采用的是取这些窗口中的最大值作为输出结果，然后不断滑动窗口，对输入数据体每一个深度切片单独处理，减少它的空间尺寸

除了最大值池化之外，还有一些其他的池化函数，比如平均池化，或者L2范数池化。在实际中证明，在卷积层之间引入最大池化的效果是最好的，而平均池化一般放在卷积神经网络的最后一层。

全连接层

每个神经元与前一层所有的神经元全部连接.

卷积后的大小

W：矩阵宽，H：矩阵高，F：卷积核宽和高，P：padding（需要填充的0的个数），N：卷积核的个数，S：步长

width：卷积后输出矩阵的宽，height：卷积后输出矩阵的高

width = floor[（W - F + 2P）/ S ] + 1

height = floor[（H - F + 2P） / S ] + 1

3.1理解神经网络中的Dropout

过拟合是深度神经网（DNN）中的一个常见问题：模型只学会在训练集上分类，这些年提出的许多过拟合问题的解决方案，其中dropout具有简单性而且效果也非常良好。

dropout是指在深度学习网络的训练过程中，对于神经网络单元，按照一定的概率将其暂时从网络中丢弃。注意是暂时，对于随机梯度下降来说，由于是随机丢弃，故而每一个mini-batch都在训练不同的网络。增加一个概率系数（1/（1－p）），所以计算会慢一点。

第一种理解方式是，在每次训练的时候使用dropout，每个神经元有百分之50的概率被移除，这样可以使得一个神经元的训练不依赖于另外一个神经元，同样也就使得特征之间的协同作用被减弱。Hinton认为，过拟合可以通过阻止某些特征的协同作用来缓解。

Dropout通常使用L2归一化以及其他参数约束技术。
L1与L2正则化：

4.知识图谱

AI时代你需要知道的：知识图谱技术原理（必读）

5.目标识别

[AI开发]目标检测之素材标注

常见目标检测算法有SSD、Yolo以及Faster-RCNN等。

图像素材标注工具有很多，很多人在用的是labelimg，主要用于目标检测素材标注：

###

参考：