We have a set of real data showing measurements of different pollutants. As a first step, we want to see the visual correlation between different pollutants, for example pm2.5 and pm10.
%reset -f
from numpy import genfromtxt, savetxt
import numpy as np

# Pollution data: skip the header row; the first column holds the labels
Data = genfromtxt(open('data/pollution.csv','r'), dtype=float, delimiter=',')[1:]
Labels = Data[:,0]
Data = Data[:,1:]

# Read the header row separately (as strings) and drop the label column
header = genfromtxt(open('data/pollution.csv','r'), delimiter=',', dtype=None)[0]
header = header[1:]
header = header[np.newaxis,:]
print 'size of data set: ', Data.shape
# %reset -f
import sys
import numpy as np
import sompylib.som_structure as SOM
from matplotlib import pyplot as plt

# SOM grid size and data dimensions used later on
msz0 = 50
msz1 = 50
cd = msz0*msz1
dlen = 100*1000
dim = 3

# Scatter matrix of the first 1000 rows to eyeball pairwise correlations
from pandas.tools.plotting import scatter_matrix
from pandas import Series, DataFrame
import pandas as pd
df = DataFrame(data=Data[1:1000,:], columns=header.T)
fig = scatter_matrix(df, alpha=0.2, figsize=(10, 10), diagonal='kde')
We can either use random values to initialize the SOM weight vectors, or we can initialize them linearly. For the latter, we first need to calculate the eigenvectors of the correlation matrix of the original data. Then, we take the two largest eigenvalues with their corresponding eigenvectors and initialize the weight vectors linearly along the SOM grid based on these two principal components. For large data sets it is better to use a randomized PCA method, since it is fast and the results are good.
We use RandomizedPCA from the scikit-learn package.
As we would expect, the first two PCs show which features are mutually correlated, although only in a linear sense.
If SOM training starts from these PCA-based initial values, it needs fewer iterations.
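As an illustration, the following is a minimal sketch of such a PCA-based linear initialization. This is not sompy's internal code; the grid size, the scaling by the spread of the projections, and the variable names are assumptions made for the example.
# Minimal sketch of PCA-based linear initialization (not sompy's exact implementation).
# Assumes `Data` is the data matrix (ideally already normalized) and a 50X50 map.
import numpy as np
from sklearn.decomposition import RandomizedPCA

msz0, msz1 = 50, 50
pca = RandomizedPCA(n_components=2)
pca.fit(Data)
pc = pca.components_              # the two leading principal directions, shape (2, dim)
proj = pca.transform(Data)        # data projected onto the two PCs
scale = proj.std(axis=0)          # how far the map should stretch along each PC

# Grid coordinates in [-1, 1] for both map dimensions
coord0, coord1 = np.meshgrid(np.linspace(-1, 1, msz0),
                             np.linspace(-1, 1, msz1), indexing='ij')

# Each node's initial weight vector = data mean + a linear combination of the two PCs
init_codebook = (Data.mean(axis=0)[np.newaxis, :]
                 + coord0.ravel()[:, np.newaxis] * scale[0] * pc[0][np.newaxis, :]
                 + coord1.ravel()[:, np.newaxis] * scale[1] * pc[1][np.newaxis, :])
print init_codebook.shape         # (msz0*msz1, dim)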
reload(sys.modules['sompylib.som_structure'])

# Build the SOM with variance normalization and PCA-based (linear) initialization
sm = SOM.SOM('sm', Data, mapsize=[msz0, msz1], norm_method='var', initmethod='pca')
sm.init_map()
setattr(sm, 'compname', header)   # attach the feature names for plotting
sm.view_map(which_dim='all')      # component planes right after initialization
The default training algorithm is batch training, and the learning parameters are selected automatically.
n_job defines the number of cores to be used in parallel. It relies on the joblib multiprocessing library shipped with scikit-learn.
By default there is no shared memory, which means every additional job increases the amount of memory you need. Alternatively, you can use shared memory based on numpy.memmap, but to be honest I didn't get better performance with it, and I suspect I didn't implement it correctly.
The current implementation seems to scale linearly with the amount of memory and the number of cores, but I have the feeling that my memory allocation is not optimal.
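For reference, this is the general joblib + numpy.memmap pattern for sharing a large read-only array between worker processes. It is only a sketch of the idea, not sompy's internal implementation; the file path, the chunking, and the per-chunk function are assumptions.
# Sketch of sharing a read-only array across joblib workers via numpy.memmap.
# Not sompy's internal code; the file path and the per-chunk function are hypothetical.
import numpy as np
from joblib import Parallel, delayed, dump, load

def nearest_bmu_chunk(codebook, data_chunk):
    # Index of the closest codebook vector for each row; uses
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 and drops the constant ||x||^2 term
    d = -2 * np.dot(data_chunk, codebook.T) + (codebook ** 2).sum(axis=1)
    return d.argmin(axis=1)

big_data = np.random.rand(10*1000, 20)
codebook = np.random.rand(50*50, 20)

# Dump the array to disk once, then memory-map it so all workers share the same pages
dump(big_data, '/tmp/som_data.joblib')
shared_data = load('/tmp/som_data.joblib', mmap_mode='r')

chunks = np.array_split(np.arange(shared_data.shape[0]), 4)
bmus = Parallel(n_jobs=4)(
    delayed(nearest_bmu_chunk)(codebook, shared_data[idx]) for idx in chunks)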
# Train with a single job and no shared memory, printing progress
sm.train(n_job=1, shared_memory='no', verbose='on')

# Component planes after training (the second call packs them with smaller text)
sm.view_map(which_dim='all')
sm.view_map(which_dim='all', pack='Yes', text_size=6, save='No', save_dir='')

# Hit map: how many data points land on each node
sm.hit_map()
I used a specific data size, since it was used in a map-reduce based implementation of SOM (http://www.hicomb.org/papers/HICOMB2011-01.pdf).
The data size: 81920 X 256 on a 50X50 SOM grid.
The best result reported in that paper (using 1024 cores) is 4.02 minutes, which doesn't look acceptable, as the data is not that large:
1- Parallel map-reduce SOM: 241.2 seconds
2- Matlab: 69.621502 seconds
3- sompy: 26.1570 seconds
%reset -f
import numpy as np
import sompylib.som_structure as SOM
from matplotlib import pyplot as plt
import sys

# Benchmark setup matching the paper: 81920 X 256 binary data on a 50X50 grid
msz0 = 50
msz1 = 50
cd = msz0*msz1
dlen = 81920
# dlen = 200*1000
dim = 256
Data = np.random.randint(0, 2, size=(dlen, dim))

reload(sys.modules['sompylib.som_structure'])
sm = SOM.SOM('sm', Data, mapsize=[msz0, msz1], norm_method='var')
sm.train()
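To reproduce timings like the ones above and below, one can simply wrap the training call with Python's standard time module; this is an illustrative sketch, not necessarily how the reported numbers were measured.
# Simple wall-clock timing of the training call (illustrative only)
import time

t0 = time.time()
sm.train()
print 'training took %.2f seconds' % (time.time() - t0)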
I also compared my implementation with Matlab on larger data sizes.
For example:
1- data: 200*1000 X 20 on a 50X50 SOM grid
1-1- Matlab: 40.590471 seconds
1-2- sompy: 11.6 seconds
2- data: 400*1000 X 20 on a 50X50 SOM grid
2-1- Matlab: 81.514382 seconds
2-2- sompy: 24.992000 seconds
3- data: 800*1000 X 20 on a 50X50 SOM grid
3-1- Matlab: 166.681264 seconds
3-2- sompy: 49.350000 seconds
4- data: 200*1000 X 50 on an 80X80 SOM grid
4-1- Matlab: 191.710211 seconds
4-2- sompy: 74.932 seconds, but the CPU gets blocked for a while since memory becomes full. This is the current problem: when I checked the performance, there seems to be a memory peak at the beginning that I can't track down (see the sketch below).
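One way to look for such a peak is to sample the process memory while training runs. Here is a rough sketch using the psutil package; psutil is an assumed extra dependency for this illustration, not something sompy requires.
# Rough sketch of watching resident memory during training (psutil is assumed here)
import os
import threading
import time
import psutil

proc = psutil.Process(os.getpid())
samples = []

def sample_memory(stop_event, interval=0.5):
    # Record the resident set size (in MB) every `interval` seconds
    while not stop_event.is_set():
        samples.append(proc.memory_info().rss / (1024.0 * 1024.0))
        time.sleep(interval)

stop = threading.Event()
t = threading.Thread(target=sample_memory, args=(stop,))
t.start()
sm.train()   # the call whose memory footprint we want to observe
stop.set()
t.join()
print 'peak resident memory: %.1f MB' % max(samples)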
from pandas.tools.plotting import scatter_matrix
from pandas import Series, DataFrame
import pandas as pd

# Predict one pollutant (column `Target`) from the others, using the trained SOM
data = Data[48*1000:58*1000]
Target = 8
print 'Variable to predict: ', header[0][Target]

# K-nearest-node prediction of the target column
pred = sm.predict_by(data, Target, K=1)
real = data[:,Target]

# Accuracy expressed as a percentage of the true value
accuracy = (1-np.abs((pred-real)/real))*100
print 'median accuracy', np.median(accuracy)
print 'mean accuracy', np.mean(accuracy)
print 'std accuracy', np.std(accuracy)
print 'min accuracy', np.min(accuracy)
print 'max accuracy', np.max(accuracy)

# Plot the first 100 predictions against the true values
DF = DataFrame({'True Value': real[1:100], 'Predicted Value': pred[1:100]})
fig = plt.figure()
DF.plot(DF.index, DF.columns[:], label=header[0][Target], colormap='jet', x_compat=True, style='.-')
plt.legend(loc='best', bbox_to_anchor=(1.0, 1.0), fontsize='medium')
plt.ylabel('values')
font = {'size': 12}
plt.rc('font', **font)
fig.set_size_inches(10, 10)