
 

Optimizing Distributed Machine Learning for Large-Scale EEG Data Set

 
M Bilal Shaikh, M Abdul Rehman, Attaullah Sahito   
Department of Computer Science, Sukkur IBA, Pakistan 

bilal.shaikh@iba-suk.edu.pk, rehman@iba-suk.edu.pk
 

Abstract 

Distributed Machine Learning (DML) has gained more importance than ever in this era of Big Data, yet there are many challenges in scaling machine learning techniques on distributed platforms. When it comes to scalability, improving processor technology for high-speed computation over data is reaching its limits, whereas increasing the number of machine nodes and distributing the data along with the computation looks like a viable solution. Different frameworks and platforms are available to solve DML problems, but they provide automated, random distribution of datasets and thus miss the power of user-defined, intelligent data partitioning based on domain knowledge. We have conducted an empirical study using an EEG data set collected through the P300 Speller component of an ERP (Event-Related Potential), which is widely used in BCI problems; it helps in translating the intention of a subject while performing a cognitive task. EEG data contains noise due to waves generated by other activities in the brain, which contaminates the true P300 Speller signal. Machine learning techniques can help in detecting the errors made by the P300 Speller. We solve this classification problem by partitioning the data into different chunks and preparing distributed models using the ElasticNetCV classifier. To present a case for optimizing distributed machine learning, we propose an intelligent, user-defined data partitioning approach that can impact the average accuracy of distributed learners. Our results show a better average AUC compared to the average AUC obtained after random data partitioning, which gives the user no control over how the data is split; the domain-specific, intelligent partitioning by the user improves the average accuracy of the distributed learner. Our customized approach achieves 0.66 AUC on individual sessions and 0.75 AUC on mixed sessions, whereas random, uncontrolled data distribution records 0.63 AUC.

Keywords: Data Set, Optimizing, Machine Learning 

1. Introduction 
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn real-time scenarios from observation data. Based on those observations, models are prepared which can predict unknown outcomes, and the consistency of such a model helps humans in making decisions. Models can be categorized as supervised, unsupervised, and semi-supervised. As the size, speed, and variety of data have massively increased, we have entered an era of Big Data.




 

The scalability of the tools and techniques used for processing large-scale data sets has become an active research direction for the Big Data community. To scale machine learning techniques to large datasets, it is common practice to distribute the data over several systems, called data nodes. These distributed nodes contribute their computational power and storage to the overall data-intensive task.
During a machine learning task, an aggregate measure is computed from the different working nodes; this measure indicates the quality of the machine learning prediction.

A. Research Statement: During any cognitive process, the human brain produces activity. Such activity can be logged through the waves generated in the brain, and these brain waves can be translated into the intentions of what a person wants to do. EEG [1] records such wave data. Data collected through EEG is very noisy, with a low SNR (Signal-to-Noise Ratio). Due to this noisy nature of the collected data, it is challenging to extract event-related potentials and to interpret human intention correctly.
During a spelling task, the problem is to detect errors, which is done by analyzing the brain waves of the subject. Differentiating between a true P300 signal and noise is a difficult job. Due to complete paralysis, a patient may be unable to communicate while remaining awake and fully aware. In such a situation, using a BCI (Brain-Computer Interface) the patient can establish a communication channel directly from the brain (signals) to the computer. As EEG signals are very noisy, noise can be removed after extracting important features from the dataset. Other irrelevant information in the dataset can also be identified and removed, which helps in analyzing the important part of the collected data. A learner can then be prepared to predict spelling errors. As the data set is large and can be distributed among various nodes, we have taken up this problem to conduct an empirical study. This study presents a proof of concept for the research question below.

1) Distributed machine learning platforms provide automated, random distribution/partitioning of the data set, which neglects the advantage of user-defined, controlled partitioning of datasets. If we inculcate domain-specific intelligence while partitioning the data for the different nodes, will this impact the learner's accuracy?

2. Literature Review  
Apache Hadoop is an open-source framework for the distributed processing and storage of large datasets on commodity hardware. HDFS (Hadoop Distributed File System) is its central storage technology, designed around low-cost commodity hardware for efficient scale-out storage; it provides reliable and scalable data storage spanning large clusters of commodity servers [3]. Hadoop implements the MapReduce [4] computational paradigm and uses HDFS as its storage layer.
HDFS is a distributed file system designed to run on commodity hardware [5]. Its key feature is its highly fault-tolerant behavior, and it is designed for deployment on low-cost commodity hardware. HDFS also provides high-throughput access to application data and is well suited for applications that have large volumes of data.

GFS was designed by Google [6] to support goals similar to those of other distributed file systems such as HDFS [5]: performance, scalability, reliability, and availability. The Google File System was driven by observations Google made of its own storage needs, and it introduced new extensions to existing distributed file system designs, evaluated through both micro-benchmarks and real-world use.

MapReduce [4] is a paradigm-shifting programming model for processing




 

large datasets with a parallel, distributed algorithm. In MapReduce, the user defines the computation as map and reduce functions, and the underlying runtime system automatically parallelizes the computation across large clusters of machines, handles possible machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
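As a minimal sketch of this programming model (plain Python standing in for a real runtime such as Hadoop, and a word count as the example computation), the user writes only the map and reduce functions; the small driver below imitates what the framework would otherwise distribute across nodes:

```python
# Illustrative sketch only: user-defined map and reduce functions, simulated
# on a single machine. A real MapReduce runtime would shard the input, run
# these functions on many nodes, and handle shuffling and failures itself.
from collections import defaultdict

def map_fn(line):
    # emit (key, value) pairs for each word in one input record
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    # combine all values observed for one key
    return key, sum(values)

def run_mapreduce(lines):
    groups = defaultdict(list)
    for line in lines:                       # "map" phase
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())   # "reduce" phase

print(run_mapreduce(["EEG data is big data", "big data needs distribution"]))
```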

GraphLab [7] was developed by identifying common patterns in ML; it is a parallel abstraction that achieves higher usability, expressiveness, and performance. Unlike previous parallel abstractions, GraphLab offers representation of structured data dependencies, iterative computation, and flexible scheduling. It uses a data graph to encode the computational structure and data dependencies of the problem, and it represents local computation as update functions which transform the data on the data graph. Because these update functions can modify overlapping state, the GraphLab framework provides a set of data consistency models which allow the user to specify the minimum consistency requirements of their application.

Spark [8] is a framework that supports applications which do not fit the acyclic dataflow model, while retaining the scalability and fault tolerance of MapReduce.
Spark introduces a new layer of abstraction called Resilient Distributed Datasets (RDDs). An RDD is a read-only collection of objects partitioned across a group of machines that can be rebuilt when a partition is lost. The Spark authors claim to outperform Hadoop by a factor of ten on iterative machine learning jobs and to be far more efficient than existing frameworks for interactive query processing on large-scale datasets.
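A minimal PySpark sketch of the RDD abstraction is shown below, assuming a local Spark installation; the data and operations are illustrative only and are not taken from the paper:

```python
# Illustrative RDD sketch, assuming PySpark is installed and run locally.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

# An RDD is a read-only, partitioned collection of objects that Spark can
# rebuild from its lineage if a partition is lost.
values = sc.parallelize(range(1_000_000), numSlices=8)

# cache() keeps the partitions in memory across iterations, which is where
# the claimed speed-up over MapReduce on iterative jobs comes from.
squares = values.map(lambda x: x * x).cache()

for _ in range(3):                     # repeated passes reuse the cached RDD
    print(squares.reduce(lambda a, b: a + b))

sc.stop()
```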

Petuum is a recent framework for distributed machine learning [9]; its development is based on an ML-centric, optimization-theoretic principle. Petuum formalizes ML algorithms as iterative-convergent programs, which encompass a large scope of modern machine learning: MCMC and stochastic gradient methods for point estimation in latent variable models, coordinate descent, proximal optimization for structured sparsity problems, and variational methods for graphical models, among others. The Petuum authors claim it is better than existing ML platforms and report better performance as an alternative to single-machine implementations such as CNN training in Caffe and DML [9].

3. Research Methodology 
To set up the development environment for the proof of concept, we used the Spyder IDE (Integrated Development Environment) for Python-based development. The scikit-learn library [10] was used for the machine learning techniques; its ElasticNet API (Application Programming Interface) provides the implementation of the ElasticNetCV classifier. NumPy [11] was used to partition the two-dimensional dataset into multiple disjoint 2D datasets, and ggplot [12] was used for plotting the ROC graphs of the classifiers.

A 9.5 GB raw EEG dataset was selected for the empirical experiments. The data was originally collected to predict spelling errors from a P300 speller, as used by Perrin et al. [13]. The experiment was carried out over nine different subjects with five sessions each. Each session is treated as an epoch window, i.e., a dataset within a time frame collected after each stimulus. These epochs are then processed as the training dataset for the classifiers. Perrin [13] explains the configuration of the EEG device used with the subjects.
The dataset contains both training data and data for testing the learners. The training dataset consists of 16 subjects, while the testing dataset comprises 10 subjects; each subject attended 5 disjoint spelling sessions. In the master dataset, there are 5440 training trials and 3400 test trials. The data has two labels: (i) Target and (ii) Non-Target.




 

The preprocessing phase, shown in Figure 1, operates on the raw EEG signal data and was implemented in Python. First, the EOG (Electrooculogram) channel was removed, since it carries information introduced by eye blinks, which is noise in our case. Then a Butterworth band-pass filter between 1 and 40 Hz was applied to the EEG signals; the Butterworth filter is also known as a maximally flat magnitude filter. Epochs are limited to 1.3 seconds following the occurrence of any possible stimulus or feedback event by the subject, and feature extraction is applied before classification. Only the preferred electrodes are selected within the recommended 1.3-second window, which is later concatenated with the metadata.

 

Figure 1: Pre-processing workflow from the raw data set to session-based data partitions.
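The band-pass filtering and epoching step just described can be sketched as follows, assuming the raw recording is a (channels x samples) NumPy array; the 200 Hz sampling rate, channel count, and event positions are illustrative assumptions, not values given in the paper:

```python
# Sketch of the 1-40 Hz Butterworth band-pass filtering and 1.3 s epoching.
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass(eeg, fs=200.0, low=1.0, high=40.0, order=5):
    """Zero-phase Butterworth band-pass filter between 1 and 40 Hz."""
    b, a = butter(order, [low, high], btype="bandpass", fs=fs)
    return filtfilt(b, a, eeg, axis=-1)

def epoch(eeg, event_samples, fs=200.0, window_s=1.3):
    """Cut a 1.3 s window after every stimulus/feedback event."""
    n = int(window_s * fs)
    return np.stack([eeg[:, s:s + n] for s in event_samples])

raw = np.random.randn(56, 60 * 200)      # fake 56-channel, 1-minute recording (assumed shape)
events = [400, 2400, 5000]               # hypothetical event onsets, in samples
epochs = epoch(bandpass(raw), events)    # shape: (n_events, n_channels, n_samples)
print(epochs.shape)
```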

The data size is thereby reduced, and a master dataset of 5440 instances with 2211 dimensions became available for further processing. The EEG-based feature extraction is then done using the following methods, with a combined sketch given after the list:

1) XDAWN Covariance: Two sets of 5 XDAWN spatial filters are estimated, one for each class (Error and Correct). The grand average evoked potential of each class is then filtered by the corresponding set of spatial filters and concatenated to each epoch. The covariance matrix of each resulting epoch is then used as the feature for the next steps [14].

2) Electrode Selection: Channel selection is applied to keep only the relevant channels. The procedure consists of a backward elimination that uses the Riemannian distance between the Riemannian geometric means of the class covariance matrices as the criterion.

3) Tangent Space: The reduced covariance matrices are then projected into the tangent space [15].

4) Normalization: Feature normalization using the l1 norm.
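The four steps above can be assembled into a single transformation pipeline. The paper does not name a library for these steps; the sketch below assumes the pyriemann package, which offers XDAWN covariance estimation, Riemannian electrode selection, and tangent-space projection as scikit-learn-compatible transformers, and the parameter values shown are assumptions:

```python
# Sketch of the four feature-extraction steps as one scikit-learn pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from pyriemann.estimation import XdawnCovariances
from pyriemann.channelselection import ElectrodeSelection
from pyriemann.tangentspace import TangentSpace

feature_pipeline = make_pipeline(
    XdawnCovariances(nfilter=5),      # 1) two sets of 5 XDAWN filters, covariance per epoch
    ElectrodeSelection(nelec=35),     # 2) backward elimination by Riemannian distance (nelec assumed)
    TangentSpace(metric="riemann"),   # 3) projection into the tangent space
    Normalizer(norm="l1"),            # 4) l1 feature normalization
)

# X: epochs of shape (n_epochs, n_channels, n_samples); y: Error / Correct labels
# features = feature_pipeline.fit_transform(X, y)
```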

The epoch windows were then partitioned, on the basis of the different sessions attended by the subjects, into five disjoint datasets. This partitioning is entirely data dependent and is unknown to the machine learning learners and the underlying infrastructure; we call it user-defined data partitioning.

A new dimension was added to the dataset to divide it categorically. This dimension labels each instance with the session ID of that instance. All the labeled data was later extracted in order to obtain the different sub-datasets. These sub-datasets could be distributed to different nodes and processed in parallel to speed up the process; however, our goal is to focus on optimizing the accuracy of the learners, so speeding up the computation is not the main concern here.
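A minimal sketch of this session-based, user-defined partitioning with NumPy is given below; the array names and the assumption that the session IDs occupy the last column are illustrative:

```python
# Sketch: use the added session-ID column to split the 2-D master dataset
# into five disjoint per-session datasets.
import numpy as np

def partition_by_session(data, session_col=-1):
    """Return {session_id: rows of that session (without the session column)}."""
    sessions = np.unique(data[:, session_col])
    return {int(s): np.delete(data[data[:, session_col] == s], session_col, axis=1)
            for s in sessions}

master = np.random.randn(5440, 2211)                   # feature matrix from preprocessing
session_ids = np.repeat(np.arange(1, 6), 5440 // 5)    # one session label per instance (assumed order)
master = np.column_stack([master, session_ids])        # the extra "session" dimension

partitions = partition_by_session(master)
print({k: v.shape for k, v in partitions.items()})     # five disjoint (1088, 2211) chunks
```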

4. Classifiers Based on User-Defined Intelligent Data Partitioning
The training dataset, after partitioning, is now ready for training multiple distributed classifiers. Each learner has its own dataset, which has been partitioned according to the respective session of the subject. These disjoint, partitioned datasets are used for learning the model parameters using


 

ElasticNet. ElasticNet overcomes limitations of lasso and ridge regression; it is a linear, regularized regression algorithm that works well with numerical attributes and can formulate our problem. Five different classifiers are built, one for each disjoint dataset, since all partitions are based on the sessions taken by each subject. The complete workflow is shown in Figure 2. The user thus controls the data partitions according to prior knowledge, defining a data partition for training each classifier. Each classifier is not only trained to predict the expected error in the user's session; it also helps the observer notice the effect of user customization based on systematic knowledge of the domain, something that is not incorporated in the MapReduce paradigm [4].
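A sketch of this per-session training with scikit-learn's ElasticNetCV is shown below; the label encoding and the l1_ratio and cv settings are assumptions rather than values reported in the paper:

```python
# Sketch: train one distributed learner per session partition with ElasticNetCV.
# Labels are assumed to be 0/1 (Non-Target/Target); the regression output is
# later scored with AUC, so no thresholding is applied here.
from sklearn.linear_model import ElasticNetCV

def train_session_learners(partitions, labels):
    """partitions/labels: dicts keyed by session ID -> X and y for that session."""
    learners = {}
    for session_id, X in partitions.items():
        model = ElasticNetCV(l1_ratio=0.5, cv=5)   # assumed settings
        model.fit(X, labels[session_id])
        learners[session_id] = model
    return learners

# learners = train_session_learners(train_partitions, train_labels)
# scores   = learners[1].predict(X_test)            # continuous scores for ROC/AUC
```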

5. Classifiers Based on Traditional HDFS-like Data Partitioning

On the other side, another five learners are built which are trained on randomly partitioned datasets, mirroring the behavior of HDFS [3], where control over the data partitions is not given to the user. These learners are developed to cross-validate against our customized learners, which have been injected with user-defined, domain-specific intelligence. Although the data sizes are similar, the instances contained within these disjoint, randomly partitioned datasets differ because of the random distribution.
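A sketch of this HDFS-like baseline, in which the same data is shuffled into five uncontrolled chunks, might look as follows; the fixed seed is only for reproducibility and is an assumption:

```python
# Sketch: random, equally sized partitions with no user control over which
# instances land together (the HDFS-like baseline).
import numpy as np

def random_partitions(X, y, n_parts=5, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))              # uncontrolled, random assignment
    idx_chunks = np.array_split(order, n_parts)
    return [(X[idx], y[idx]) for idx in idx_chunks]

# baseline_parts = random_partitions(X_train, y_train)
# baseline_learners = [ElasticNetCV(cv=5).fit(Xp, yp) for Xp, yp in baseline_parts]
```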

6. Results
The results observed from the experiments described in the previous sections are presented here, together with observations about the accuracy trends. Results are first shown for our proposed customized, user-defined intelligent partitioning, and are then compared with the platform-controlled, random data partitioning used by most distributed machine learning solutions, such as that of Djuric et al. [17].

7. Analysis of Results 
Figure 2: Overall workflow, from pre-processing to classifier preparation for each subject.

Experiments were run on all the data after mapping it from a 3-D space to a 2-D space to make it compatible with the ElasticNet classifier. First of all, we analyze how well our approach performs; for this we




 

have used the average of the areas under the ROC curves as the measure of accuracy, as is common in the BCI field. In a ROC curve [18], [19], the AUC (Area Under the Curve) determines the credibility of a classifier more clearly than scalar metrics alone. The ROC curves, together with their AUCs, for the individual sessions under intelligent data partitioning are shown in Figure 3.

A ROC curve plots TPR versus FPR, and the different curves, representing the different models, show the impact of the data; the variation across the ROC curves can be observed easily. Every session's data has its own effect on the creation of the respective learner. To obtain the overall effect of the session feature on accuracy, we averaged all five accuracies. The average area under the curve for the models based on our intelligently partitioned datasets is 0.66 over the whole test data set, which shows the better performance of these models. ROCs could also be combined by aggregation [17] or by collecting a global sum of accuracies [20].
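A sketch of this evaluation step with scikit-learn is given below; it assumes the per-session learners and test data from the earlier sketches and uses the continuous ElasticNetCV outputs directly as ROC scores:

```python
# Sketch: score each of the five learners on the common test set with the area
# under its ROC curve, then average the per-session AUCs.
import numpy as np
from sklearn.metrics import roc_auc_score

def average_auc(learners, X_test, y_test):
    aucs = {sid: roc_auc_score(y_test, model.predict(X_test))
            for sid, model in learners.items()}
    return aucs, float(np.mean(list(aucs.values())))

# per_session_auc, mean_auc = average_auc(learners, X_test, y_test)
# print(per_session_auc, mean_auc)
```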

8. Comparison with Traditional Approach

To explore how well our customized partitioning approach performs, we compared it against the average model accuracy obtained after random partitioning of the data, as done by HDFS, which takes control of the underlying data distribution across nodes away from the user [5]. The testing data is randomly sampled into 5 sets, each containing a similar number of trials to our custom-partitioned data sets.

A. AUC Comparison: We compared the ROCs obtained from partitioning the datasets according to user-defined, session-based intelligence against the ROCs obtained after partitioning the data set in the traditional HDFS way. In both cases, we obtained five learner accuracies, which we combined by averaging each set. After aggregation, the average accuracy with user-defined intelligent partitioning was approximately 66 percent, obtained from the individual session-based accuracies shown in Figure 3, while the average accuracy obtained with the traditional HDFS [5] way of partitioning was around 63 percent. This shows an overall improvement of 3 percentage points in the combined learner's accuracy. If the user observes other important features in the data set, or empirically tests the variation in the learner's performance on the same data set, the accuracies [21] of the machine learners could be tuned over an even wider range.

9. Conclusion   
Figure 3: Area under the ROC curve of the classifiers for the individual sessions.

This research work proposes intelligent data partitioning, with a test case taken from a BCI P300 speller error-detection problem. This approach has shown results that improve the learner's accuracy even under




 

average aggregation. The impact of the observer's intelligent data partitioning would increase with a more relevant partitioning feature, and more efficient feature engineering or a different nature of dataset could also improve the results. This type of optimization of distributed machine learning could also expose other key insights about features of the data that are specific to a domain. This entails that allowing user-controlled data partitioning will enable the analyst to dig deeper into the process of efficient machine learning. According to the observed results of our proposed approach, the system performs relatively efficiently for the classification of the selected EEG signals, with the intelligent data partitioning scenario yielding the better average AUC. There is visible evidence, based on the comparison of average ROCs, for building a combined decision model while keeping a data attribute under the user's control. Our proposed approach also demonstrates a relatively better AUC in the testing phase when supplied with a low amount of training data.
We conclude that if our proposed approach is applied in other machine learning scenarios, we could gain an even better average AUC, and it could also perform better in the presence of other inter-feature variability.

 

Acknowledgment 
The authors would like to thank the Sukkur IBA ICT Department for their laboratory services, which helped in carrying out the empirical experiments.

 

References  
[1] L. A. Farwell and E. Donchin, "Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials," Electroencephalography and Clinical Neurophysiology, vol. 70, no. 6, pp. 510–523, 1988.
[2] T. O. Zander, C. Katha, S. Welke, and M. Rötting, "Utilizing secondary input from passive brain-computer interfaces for enhancing human-machine interaction," in Foundations of Augmented Cognition. Neuroergonomics and Operational Neuroscience. Springer, 2009, pp. 759–771.
[3] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on. IEEE, 2010, pp. 1–10.
[4] J. Dean and S. Ghemawat, "MapReduce: a flexible data processing tool," Communications of the ACM, vol. 53, no. 1, pp. 72–77, 2010.
[5] D. Borthakur, "The Hadoop distributed file system: Architecture and design," Hadoop Project Website, vol. 11, no. 2007, p. 21, 2007.
[6] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
[7] Y. Low, J. E. Gonzalez, A. Kyrola, D. Bickson, C. E. Guestrin, and J. Hellerstein, "GraphLab: A new framework for parallel machine learning," arXiv preprint arXiv:1408.2041, 2014.
[8] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," HotCloud, vol. 10, pp. 10–10, 2010.
[9] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu, "Petuum: a new platform for distributed machine learning on big data," Big Data, IEEE Transactions on, vol. 1, no. 2, pp. 49–67, 2015.
[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," The Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[11] E. Jones, T. Oliphant, P. Peterson et al., "SciPy: Open source scientific tools for Python," 2001. [Online]. Available: http://www.scipy.org/
[12] M. C. Sachs, "Package plotROC," 2015.




 

[13] P. Margaux, M. Emmanuel, D. Sébastien, B. Olivier, and M. Jérémie, "Objective and subjective evaluation of online error correction during P300-based spelling," Advances in Human-Computer Interaction, vol. 2012, p. 4, 2012.
[14] B. Rivet, A. Souloumiac, V. Attina, and G. Gibert, "xDAWN algorithm to enhance evoked potentials: application to brain–computer interface," Biomedical Engineering, IEEE Transactions on, vol. 56, no. 8, pp. 2035–2043, 2009.
[15] A. Barachant, S. Bonnet, M. Congedo, and C. Jutten, "Classification of covariance matrices using a Riemannian-based kernel for BCI applications," Neurocomputing, vol. 112, pp. 172–178, 2013.
[16] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 2, pp. 301–320, 2005.
[17] N. Djuric, M. Grbovic, and S. Vucetic, "Distributed confidence-weighted classification on MapReduce," in Big Data, 2013 IEEE International Conference on. IEEE, 2013, pp. 458–466.
[18] S. Wu and P. Flach, "A scored AUC metric for classifier evaluation and selection," in Second Workshop on ROC Analysis in ML, Bonn, Germany, 2005.
[19] T. Fawcett, "An introduction to ROC analysis," Pattern Recognition Letters, vol. 27, no. 8, pp. 861–874, 2006.
[20] A. Priyadarshini et al., "A map reduce based support vector machine for big data classification," International Journal of Database Theory and Application, vol. 8, no. 5, pp. 77–98, 2015.
[21] P. Simon, Too Big to Ignore: The Business Case for Big Data, ser. Wiley and SAS Business Series. Wiley, 2013. [Online]. Available: https://books.google.com.pk/books?id=Dn-Gdoh66sgC