<p>Estadistika: Bayesian Sequential Learning, published 2020-03-01 (https://estadistika.github.io//bayesian/probability/modeling/inference/julia/2020/03/01/Bayesian%20Sequential%20Learning)</p> <p>In my previous <a href="https://estadistika.github.io/data/analyses/wrangling/julia/programming/packages/2018/10/14/Introduction-to-Bayesian-Linear-Regression.html" target="_blank">article</a>, I discussed how we can use the Bayesian approach in estimating the parameters of the model. The process revolves around solving the following conditional probability, popularly known as <i>Bayes’ Theorem</i>:</p> $\begin{equation} \mathbb{P}(\mathbf{w}|\mathbf{y})=\frac{\mathbb{P}(\mathbf{w})\mathbb{P}(\mathbf{y}|\mathbf{w})}{\mathbb{P}(\mathbf{y})}, \end{equation}$ <p>where $\mathbb{P}(\mathbf{w})$ is the <em>a priori</em> (prior distribution) for the objective parameters, $\mathbb{P}(\mathbf{y}|\mathbf{w})$ is the <em>likelihood</em>, and $\mathbb{P}(\mathbf{y})$ is the <em>normalizing constant</em> (also known as the model evidence) with the following form:</p> $\begin{equation} \mathbb{P}(\mathbf{y})=\int_{\forall\mathbf{w}\in\mathscr{P}}\mathbb{P}(\mathbf{w})\mathbb{P}(\mathbf{y}|\mathbf{w})\operatorname{d}\mathbf{w}, \end{equation}$ <p>where $\mathscr{P}$ is the parameter space.</p> <h3 id="posterior-distribution">Posterior Distribution</h3> <p>The details on the derivation of the <em>a posteriori</em> were also provided in the said article, but there were missing pieces, which I think are necessary to support our proposition, and thus we have the following result:</p> <div class="wrapper"> <div class="title-label" style="background-color: #5e5e5e; color: #fff; padding: 2px 10px; font-family: 'Saira 
Condensed', sans-serif;"><b>Proposition</b></div> <div class="proposition" style="padding: 10px; background-color: #efefef; font-family: 'Amita', cursive; text-align: left !important;"> Let $\mathscr{D}\triangleq\{(\mathbf{x}_1,y_1),\cdots,(\mathbf{x}_n,y_n)\}$ be the set of data points s.t. $\mathbf{x}\in\mathbb{R}^{p}$. If $$y_i\overset{\text{ind}}{\sim}\mathcal{N}(w_0+w_1x_i,\alpha^{-1})$$ and $\mathbf{w}\triangleq[w_0,w_1]^{\text{T}}$ s.t. $\mathbf{w}\sim\mathcal{N}_2(\mathbf{0},\beta^{-1}\mathbf{I})$, then $\mathbf{w}|\mathbf{y}\sim\mathcal{N}_2(\boldsymbol{\mu},\boldsymbol{\Sigma})$ where \begin{align} \boldsymbol{\mu}&amp;=\alpha\boldsymbol{\Sigma}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y},\\ \boldsymbol{\Sigma}&amp;=(\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I})^{-1} \end{align} and $\boldsymbol{\mathfrak{A}}\triangleq\left[(\mathbf{x}_i^{\text{T}})\right]_{\forall i}$ with $\mathbf{x}_i\triangleq[1,x_i]^{\text{T}}$. </div> </div> <div class="wrapper" style="margin-top: 10px"> <div style="text-align: left !important;"> <b><i>Proof</i></b>. Let $\hat{y}_i\triangleq w_0+w_1x_i$ be the model, then the data can be described as follows: \begin{align} y_i=\hat{y}_i+\varepsilon_i,\quad\varepsilon_i\overset{\text{iid}}{\sim}\mathcal{N}(0,1/\alpha), \end{align} where $\varepsilon_i$ is the innovation that the model can't explain, and $\mathbb{Var}(\varepsilon_i)=\alpha^{-1}$ since $\mathbb{Var}(y_i)=\alpha^{-1}$ as given above. 
Then the likelihood of the model is given by: <div style="overflow-x: auto;"> \begin{align} \mathcal{L}(\mathbf{w}|\mathbf{y})\triangleq\mathbb{P}(\mathbf{y}|\mathbf{w})&amp;=\prod_{\forall i}\frac{1}{\sqrt{2\pi\alpha^{-1}}}\text{exp}\left[-\frac{(y_i-\hat{y}_i)^2}{2\alpha^{-1}}\right]\\ &amp;=\frac{\alpha^{n/2}}{(2\pi)^{n/2}}\text{exp}\left[-\alpha\sum_{\forall i}\frac{(y_i-\hat{y}_i)^2}{2}\right], \end{align} </div> or in vector form: <div style="overflow-x: auto;"> \begin{align} \mathcal{L}(\mathbf{w}|\mathbf{y})\propto\text{exp}\left[-\frac{\alpha(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})}{2}\right], \end{align} </div> where $\boldsymbol{\mathfrak{A}}\triangleq[(\mathbf{x}_i^{\text{T}})]_{\forall i}$ is the <i>design matrix</i> given above. If the <i>a priori</i> of the parameter is assumed to be a zero-mean bivariate Gaussian distribution with precision $\beta$, i.e. $\mathbf{w}\sim\mathcal{N}_2(\mathbf{0}, \beta^{-1}\mathbf{I})$ (the standard bivariate Gaussian when $\beta=1$), then <div style="overflow-x: auto;"> \begin{align} \mathbb{P}(\mathbf{w}|\mathbf{y})&amp;\propto\mathcal{L}(\mathbf{w}|\mathbf{y})\mathbb{P}(\mathbf{w})\\ &amp;\propto\text{exp}\left[-\frac{\alpha(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})}{2}\right]\exp\left[-\frac{1}{2}\mathbf{w}^{\text{T}}\beta\mathbf{I}\mathbf{w}\right]\\ &amp;\propto\text{exp}\left\{-\frac{1}{2}\left[\alpha(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})+\mathbf{w}^{\text{T}}\beta\mathbf{I}\mathbf{w}\right]\right\}. 
\end{align} </div> Expanding the terms in the exponential function returns the following: <div style="overflow-x: auto;"> $$\alpha\mathbf{y}^{\text{T}}\mathbf{y}-2\alpha\mathbf{w}^{\text{T}}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}+\mathbf{w}^{\text{T}}(\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I})\mathbf{w},$$ </div> thus, absorbing the term $\alpha\mathbf{y}^{\text{T}}\mathbf{y}$, which does not depend on $\mathbf{w}$, into a constant $\mathcal{C}$, <div style="overflow-x: auto;"> \begin{equation} \mathbb{P}(\mathbf{w}|\mathbf{y})\propto\mathcal{C}\text{exp}\left\{-\frac{1}{2}\left[\mathbf{w}^{\text{T}}(\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I})\mathbf{w}-2\alpha\mathbf{w}^{\text{T}}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}\right]\right\}. \end{equation} </div> The terms inside the exponential function are of the form $ax^2-2bx$. This is a quadratic expression in $\mathbf{w}$ and can therefore be rewritten by completing the square. To do so, let $\mathbf{D}\triangleq\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I}$ and $\mathbf{b}\triangleq\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}$, then <div style="overflow-x: auto;"> \begin{align} \mathbb{P}(\mathbf{w}|\mathbf{y})&amp;\propto\mathcal{C}\text{exp}\left[-\frac{1}{2}\left(\mathbf{w}^{\text{T}}\mathbf{D}\mathbf{w}-2\mathbf{w}^{\text{T}}\mathbf{b}\right)\right]\\&amp;=\mathcal{C}\text{exp}\left[-\frac{1}{2}\left(\mathbf{w}^{\text{T}}\mathbf{D}\mathbf{w}-\mathbf{w}^{\text{T}}\mathbf{b}-\mathbf{b}^{\text{T}}\mathbf{w}\right)\right]. \end{align} </div> In order to proceed, the matrix $\mathbf{D}$ must be symmetric and invertible (<i>this can be proven separately</i>). 
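A brief sketch of why this requirement holds, assuming $\alpha>0$ and $\beta>0$ (the positivity of $\beta$ is an assumption made here, consistent with its role as a prior precision; the vector $\mathbf{v}$ is introduced only for this illustration): $\mathbf{D}$ is symmetric because <div style="overflow-x: auto;"> $$\mathbf{D}^{\text{T}}=\alpha(\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}})^{\text{T}}+\beta\mathbf{I}^{\text{T}}=\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I}=\mathbf{D},$$ </div> and for any nonzero vector $\mathbf{v}$, <div style="overflow-x: auto;"> $$\mathbf{v}^{\text{T}}\mathbf{D}\mathbf{v}=\alpha(\boldsymbol{\mathfrak{A}}\mathbf{v})^{\text{T}}(\boldsymbol{\mathfrak{A}}\mathbf{v})+\beta\mathbf{v}^{\text{T}}\mathbf{v}\geq\beta\mathbf{v}^{\text{T}}\mathbf{v}>0,$$ </div> so $\mathbf{D}$ is positive definite and hence invertible.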
If satisfied, then $\mathbf{I}\triangleq\mathbf{D}\mathbf{D}^{-1}=\mathbf{D}^{-1}\mathbf{D}$, so that the terms inside the exponential function above become: <div style="overflow-x: auto;"> $$\mathbf{w}^{\text{T}}\mathbf{D}\mathbf{w}-\mathbf{w}^{\text{T}}\mathbf{D}\mathbf{D}^{-1}\mathbf{b}-\mathbf{b}^{\text{T}}\mathbf{D}^{-1}\mathbf{D}\mathbf{w}+\underset{\text{constant introduced}}{\underbrace{(\mathbf{b}^{\text{T}}\mathbf{D}^{-1}\mathbf{D}\mathbf{D}^{-1}\mathbf{b}-\mathbf{b}^{\text{T}}\mathbf{D}^{-1}\mathbf{D}\mathbf{D}^{-1}\mathbf{b})}}.$$ </div> Finally, let $\boldsymbol{\Sigma}\triangleq\mathbf{D}^{-1}$ and $\boldsymbol{\mu}\triangleq\mathbf{D}^{-1}\mathbf{b}$, then <div style="overflow-x: auto;"> \begin{align} \mathbb{P}(\mathbf{w}|\mathbf{y})&amp;\propto\mathcal{C}\text{exp}\left[-\frac{1}{2}\left(\mathbf{w}^{\text{T}}\boldsymbol{\Sigma}^{-1}\mathbf{w}-\mathbf{w}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}-\boldsymbol{\mu}^{\text{T}}\boldsymbol{\Sigma}^{-1}\mathbf{w}+\boldsymbol{\mu}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\right)\right]\\ &amp;=\mathcal{C}\text{exp}\left[-\frac{(\mathbf{w}-\boldsymbol{\mu})^{\text{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\boldsymbol{\mu})}{2}\right], \end{align} </div> where $-\mathbf{b}^{\text{T}}\mathbf{D}^{-1}\mathbf{D}\mathbf{D}^{-1}\mathbf{b}$ becomes part of $\mathcal{C}$, and that proves the proposition. $\blacksquare$ </div> </div> <h3 id="simulation-experiment">Simulation Experiment</h3> <p>The above result can be applied to any linear models (<em>cross-sectional</em> or <em>time series</em>), and I’m going to demonstrate how we can use it to model the following simulated data. I will be using Julia in <a href="https://nextjournal.com/">Nextjournal</a> (be sure to head over to <a href="https://nextjournal.com/al-asaad/bayesian-sequential-learning">this link</a> for reproducibility), which already has an image available for version 1.3.1. 
Having said that, some of the libraries are already pre-installed in the said image, for example <a href="https://github.com/JuliaPlots/Plots.jl" target="_blank">Plots.jl</a>. Thus, we only have to install the remaining libraries that we will need for this experiment. <script src="https://gist.github.com/alstat/b444c7054c0f119f38ce773e6040ba07.js"></script> Load the libraries as follows: <script src="https://gist.github.com/alstat/5eb8fba973b4527b578bba9874642818.js"></script> The <code>theme</code> above simply sets the theme of the plots below. Further, for reproducibility purposes, I provided a seed as well. The following function will be used to simulate cross-sectional data with population parameters $w_0\triangleq-.3$ and $w_1\triangleq-.5$ and a sample size of 20. <script src="https://gist.github.com/alstat/3c4e971a5aa2bd40e019b5c2f500f585.js"></script> From the above results, the parameters of the <i>a posteriori</i> can be implemented as follows: <script src="https://gist.github.com/alstat/00fb06b4b55c23a85b0c05801c5c4afb.js"></script> One feature of Julia is that it supports unicode, making it easy to relate the code to the math above, i.e. <code>Σ</code> and <code>μ</code> in the code are $\boldsymbol{\Sigma}$ and $\boldsymbol{\mu}$ above, respectively. The vector operations in Julia are also cleaner compared to those in R and in Python, for example we can encode Julia’s <code>A'A</code> above as <code>t(A) %*% A</code> in R, and <code>A.T.dot(A)</code> in Python.</p> <p>Finally, we simulate the data as follows: <script src="https://gist.github.com/alstat/2664db066600c54ea2e31edffe17831a.js"></script> We don’t really have to write down the <code>wtrue</code> variable above since those are the default values of the <code>w0</code> and <code>w1</code> arguments, but we do so just for emphasis.</p> <p>While the main subject here is Bayesian Statistics, it would be better if we have an idea as to what the Frequentist solution would be. 
As most of you are aware, the solution to the weights above is given by the following <i>normal equation</i>: $$\hat{\mathbf{w}}=(\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}})^{-1}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y},$$ this is a known result, and we can prove this in a separate article. The above equation is implemented as follows: <script src="https://gist.github.com/alstat/a13f83140f692ca94fe67c8569c0ba31.js"></script> Therefore, for the current sample data, the estimate we got when assuming the weights as fixed and unknown is $\hat{\mathbf{w}}=[-0.32264, -0.59357]^{\text{T}}$. The figure below depicts the corresponding fitted line. <script src="https://gist.github.com/alstat/42d38f37c05b4cc6249bd10f4a81a433.js"></script> <img src="https://drive.google.com/uc?export=view&amp;id=1LpAcFwC3pCOzCvsMsINmR-d52D83X8bi" /> Not bad given that we have a very small dataset. Now let’s proceed and see how we can infer the parameters in the Bayesian framework. The prior distribution can be implemented as follows: <script src="https://gist.github.com/alstat/ab32e9a8f171975532910c6305a4f95f.js"></script> As stated in the above proposition, the parameters are jointly modelled by a bivariate Normal distribution, as reflected in the dimension of the hyperparameter <code>μ</code> above. 
Indeed, the true parameter we are interested in is the weight vector, $\mathbf{w}$, but because we considered it to be random, the parameters of the model we assign to it are called <i>hyperparameters</i>, in this case the vector $\boldsymbol{\mu}$ and the identity matrix $\mathbf{I}$ of the prior distribution.</p> <p>Moreover, the likelihood of the data can be implemented as follows: <script src="https://gist.github.com/alstat/b0fa92ec05c0a03c2d91bb82c1c5ca5e.js"></script> The mean of the likelihood is the specified linear model itself, which in vector form is the inner product of the $i$th row of the <i>design matrix</i>, $\boldsymbol{\mathfrak{A}}$, and the weights vector, $\mathbf{w}$, i.e. <code>A[i, :]'w</code>. This assumption is valid since the fitted line must be at the center of the data points, and the error should be random. One of my favorite features of the Julia language is <i>multiple dispatch</i>. For example, the two <code>likelihood</code> functions defined above are not in conflict since Julia selects which one to call based on the types of the function arguments. The same is true for the posterior distribution implemented below. In R and in Python, by contrast, I usually have to write this as a separate helper function, e.g. <code>likelihood_helper</code>. <script src="https://gist.github.com/alstat/1eb7917d65287d6726cb800955c99077.js"></script> Finally, the prediction is done by sampling the weights from the posterior distribution. The center of these weights is of course the mean of the <i>a posteriori</i>. 
<script src="https://gist.github.com/alstat/cfc5d39166922ca852c781517beed170.js"></script> For example, sampling 30 weights from the posterior distribution using all the sampled data, and returning the corresponding predictions, is done as follows: <script src="https://gist.github.com/alstat/f284bacb094b5f4eca795a6606ef0fc6.js"></script> The prediction function returns both <i>x</i> and <i>y</i> values, which is why we indexed the above result at 2, to show the predicted <i>y</i>s only. Further, using only the first 10 observations of the data for calculating the $\hat{y}$ is done as follows: <script src="https://gist.github.com/alstat/17861ce8cea5cdb67cd0104c4d74c794.js"></script> So you might be wondering, what’s the motivation for using only the first 10 and not all observations? Well, we want to demonstrate how Bayesian inference learns the weights or the parameters of the model sequentially.</p> <h3 id="visualization">Visualization</h3> <p>At this point, we are ready to generate our main visualization. The first function (<code>dataplot</code>) plots the generated fitted lines from the <i>a priori</i> or <i>a posteriori</i>, both of which are plotted using the extended method of <code>contour</code> we defined below. <script src="https://gist.github.com/alstat/dda95d054b6cc702dd78ed61062d56de.js"></script> Tying all the code together gives us this beautiful grid plot. <script src="https://gist.github.com/alstat/8fca90fe8e0cbb84b36f59367fe130aa.js"></script> <img src="https://drive.google.com/uc?export=view&amp;id=1rKM8d9ROHw63MvBf83EGoNMOoHe6vx__" /> Since we went with style before comprehension, let me guide you then with the axes. All figures have unit square space, with contour plots having the following axes: $w_0$ (the <i>x</i>-axis) and $w_1$ (the <i>y</i>-axis). 
Obviously, the data space has the following axes: predictor (the <i>x</i>-axis) and response (the <i>y</i>-axis).</p> <h3 id="discussions">Discussions</h3> <p>We commenced the article with emphasis on the approach of Bayesian Statistics to modeling, whereby the estimation of the parameters as mentioned is based on <i>Bayes’ Theorem</i>, which is a conditional probability with the following form: \begin{equation} \mathbb{P}(\mathbf{w}|\mathbf{y})=\frac{\mathbb{P}(\mathbf{w})\mathbb{P}(\mathbf{y}|\mathbf{w})}{\mathbb{P}(\mathbf{y})}. \end{equation} Now we will relate this to the above figure using an analogy, to show how the model sequentially learns the optimal estimate for the parameter of the linear model.</p> <p>Consider the following: say you forgot where you left your phone, and for some reason you can’t ring it, because it could be dead or can’t pick up a signal. Further, suppose you don’t want to look for it yourself, and instead you let Mr. Bayes, your assistant, do the task. How would he then proceed? Well, let us consider the weight vector, <span>$\mathbf{w}\triangleq[w_0,w_1]^{\text{T}}$</span>, to be the location of your phone. In order to find or at least best approximate the exact location, we need to first consider some prior knowledge of the event. In this case, we need to ask the following questions: where were you the last time you had the phone? Were you in the living room? Or in the kitchen? Or in the garage? And so on. In the context of modeling, this prior knowledge about the true location can be described by a probability distribution, and we refer to this as the <i>a priori</i> (or the prior distribution). These possible distributions are themselves models with parameters, referred to as the <i>hyperparameters</i> as mentioned above, which we can tweak to describe our prior knowledge of the event. For example, you might consider the kitchen as the most probable place where you left your phone. 
So we adjust the <i>location</i> parameter of our <i>a priori</i> model to where the kitchen is. Hence Mr. Bayes should be in the kitchen already, assessing the coverage of his search area. Of course, you need to help Mr. Bayes on the extent of the coverage. This coverage or domain can be described by the <i>scale</i> parameter of your <i>a priori</i>. If we relate this to the main plot, we assumed the prior distribution over the weight vector to be a standard bivariate Gaussian distribution, centered at the zero vector with identity variance-covariance matrix. Since the prior knowledge can have a broad domain or coverage on the possible values of the weight vector, the samples we get generate random fitted lines as we see in the right-most plot of the first row of the figure above.</p> <p>Once we have the prior knowledge in place, that is, we are already in the kitchen and we know how wide the search area is likely to be, we can start looking for evidence. The evidence is the realization of your true model. Relating to the math above, these realizations are the <span>$y_i$</span>s, coming from <span>$y_i=f(x|\mathbf{w})$</span>, where <span>$f(x|\mathbf{w})$</span> is the link function of the <i>true</i> model, which we attempt to approximate with our hypothesized link function, <span>$h(x|\hat{\mathbf{w}})$</span>, that generated the predicted <span>$\hat{y}_i$</span>s. For example, you may not know where exactly your phone is, but you are sure of your habits. So you inform Mr. Bayes that the last time you had your phone in the kitchen, you were drinking some coffee. Mr. Bayes will then use this as his first evidence, and assess the likelihood of each suspected location in the kitchen. That is, what is the likelihood that a particular location formed (or realized, generated or is connected to) the first evidence (having some coffee)? For example, is it even comfortable to drink coffee in the sink? 
Obviously not, so that location has a very low likelihood, whereas the dining table or somewhere close to the coffee maker is more likely. If we assess all possible locations within our coverage using the first evidence, we get the <i>profile likelihood</i>, which is what we have in the first column of the grid plot above, <i>profile likelihood for the i</i>th <i>evidence</i>. Further, with the first evidence observed, the prior knowledge of Mr. Bayes needs to be updated to obtain the posterior distribution. The new distribution will have updated location and scale parameters. If we relate to the above figure, we can see the samples of the fitted lines in the <i>data space</i> plot (third column, second row), starting to make guesses of possible lines given the first evidence observed. Moving on, you inform Mr. Bayes of the second evidence, that you were reading a newspaper while having some coffee. At this point, the prior belief of Mr. Bayes, for the next posterior, will be the posterior of the first evidence, and so the coverage becomes more restrictive and has a new location, which further helps Mr. Bayes in managing the search area. The second evidence, as mentioned, will then return a new posterior. You do this again and again, informing Mr. Bayes of other pieces of evidence sequentially until the last one. The final evidence will end up with the final posterior distribution, which we expect to have a new <i>location</i> parameter, closer to the exact location, and a small <i>scale</i> parameter, covering a small circle around the exact solution. 
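In the notation of the proposition, this recursion can be sketched as follows. The subscript $k$, denoting the posterior after the $k$th evidence, is notation introduced here only for illustration, with $\mathbf{x}_k\triangleq[1,x_k]^{\text{T}}$, $\boldsymbol{\mu}_0\triangleq\mathbf{0}$ and $\boldsymbol{\Sigma}_0\triangleq\beta^{-1}\mathbf{I}$ assumed as the starting prior: $$\boldsymbol{\Sigma}_k^{-1}=\boldsymbol{\Sigma}_{k-1}^{-1}+\alpha\mathbf{x}_k\mathbf{x}_k^{\text{T}},\qquad\boldsymbol{\mu}_k=\boldsymbol{\Sigma}_k\left(\boldsymbol{\Sigma}_{k-1}^{-1}\boldsymbol{\mu}_{k-1}+\alpha\mathbf{x}_ky_k\right).$$ Unrolling this over all $n$ observations gives $\boldsymbol{\Sigma}_n^{-1}=\beta\mathbf{I}+\alpha\sum_{k=1}^{n}\mathbf{x}_k\mathbf{x}_k^{\text{T}}=\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I}$ and $\boldsymbol{\mu}_n=\alpha\boldsymbol{\Sigma}_n\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}$, which should agree with the batch result in the proposition, so updating one evidence at a time loses nothing relative to using all the data at once.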
The final posterior will then be your best guess that would describe the exact location of your phone.</p> <p>This may not be the best analogy, but that is how the above figure sequentially learns the optimal estimate for the weight vector in the Bayesian framework.</p> <h3 id="bayesian-deep-learning">Bayesian Deep Learning</h3> <p>This section deserves a separate article, but I will briefly give some motivation on how we can generalize the above discussion into complex modeling.</p> <p>The intention of the article is to give the reader a low-level understanding of how Bayes’ theorem works, and without loss of generality, I decided to go with simple linear regression to demonstrate the above subject. However, this can be applied to any model indexed by or a function of some parameters or weights $\mathbf{w}$, with the assumption that the solution is random but governed by some probability distribution.</p> <p>Complex modeling such as in Deep Learning is usually based on the assumption that the weights are fixed and unknown, which in Statistics is the Frequentist approach to inference, but without assuming some probability distribution on the error of the model. Therefore, if we are to assume some randomness on the weights, we can then use Bayesian inference to derive or at least approximate (for models with no closed-form solution) the posterior distribution. Approximate Bayesian inference is done via Markov Chain Monte Carlo (MCMC) or Variational Inference, which we can tackle in a separate post.</p> <h3 id="libraries">Libraries</h3> <p>There are several libraries for doing Bayesian inference; the classic and still one of the most powerful libraries is <a href="https://mc-stan.org/">Stan</a>. For Python, we have <a href="https://docs.pymc.io/">PyMC3</a>, <a href="http://pyro.ai/examples/index.html">Pyro</a> (based on <a href="https://pytorch.org/">PyTorch</a>), and <a href="https://www.tensorflow.org/probability">TensorFlow Probability</a>. 
For Julia, we have <a href="https://turing.ml/dev/">Turing.jl</a>, <a href="https://mambajl.readthedocs.io/en/latest/index.html">Mamba.jl</a>, <a href="http://probcomp.csail.mit.edu/">Gen.jl</a>, and <a href="https://mc-stan.org/users/interfaces/julia-stan">Stan.jl</a>. I will have a separate article for these libraries.</p> <h3 id="next-steps">Next Steps</h3> <p>The obvious next step for readers to try out is to model the variance as well, since in the above result the variance of the innovation (the error) is assumed known and equal to $\alpha^{-1}$. Further, one might consider Frequentist sequential learning as well, or proceed with other nonlinear complex models, such as Neural Networks. We can have these in a separate article.</p> <h3 id="software-versions">Software Versions</h3> <script src="https://gist.github.com/alstat/f7b473b4919e3d76681c97dee178fcbc.js"></script> <p>Author: Al-Ahmadgaid B. Asaad</p> <p>Estadistika: TensorFlow 2.0: Building Simple Classifier using Low Level APIs, published 2019-11-01 (https://estadistika.github.io//python/tensorflow/2019/11/01/TensorFlow%202.0-Building%20Classifier%20using%20Low%20Level%20APIs)</p> <p>At the end of my comparison (<a href="https://estadistika.github.io/julia/python/packages/knet/flux/tensorflow/machine-learning/deep-learning/2019/06/20/Deep-Learning-Exploring-High-Level-APIs-of-Knet.jl-and-Flux.jl-in-comparison-to-Tensorflow-Keras.html">TensorFlow 1.14 Keras’ API versus Julia’s Flux.jl and Knet.jl high level APIs</a>), I indicated some future write-ups I plan to do, one of which is to compare (obviously) the low level APIs. However, with the release of the much anticipated TensorFlow 2.0, I decided not to wait for the next comparison and instead dedicate a separate article to the said release. 
The goal of this blog post then is to simply replicate the modeling in my <a href="https://estadistika.github.io/julia/python/packages/knet/flux/tensorflow/machine-learning/deep-learning/2019/06/20/Deep-Learning-Exploring-High-Level-APIs-of-Knet.jl-and-Flux.jl-in-comparison-to-Tensorflow-Keras.html">previous article</a>, without using the Keras API. Specifically, I’ll discuss the steps on constructing a custom layer, how we can chain multiple layers as in Keras’ <code>Sequential</code>, do a feedforward, a backward by updating the gradient, and batching the training datasets.</p> <h3 id="load-the-libraries">Load the Libraries</h3> <p>To start with, load the necessary libraries by running the following codes: <script src="https://gist.github.com/alstat/6ca5094612c2031fa80a6ca42fac34b7.js"></script> I should emphasize that Line 7 is optional, and I’m using this for annotation purposes on function parameters.</p> <h3 id="define-the-constants">Define the Constants</h3> <p>The following are the constants that we will be using on the specification of the succeeding codes: <script src="https://gist.github.com/alstat/454aa49e0ccf7d3ca86d65c35a1002c7.js"></script> As you notice, these are the hyperparameters that we configured for the model below. Specifically, we use ReLU (<code>tf.nn.relu</code>) as the activation for the first hidden layer and softmax (<code>tf.nn.softmax</code>) for the output layer. Other parameters include batch-size, the number of epochs, and the optimizer which in this case is Adam (<code>tf.keras.optimizers.Adam</code>). 
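For reference, these two activations have the following standard definitions, written here for a generic pre-activation vector $\mathbf{z}\triangleq[z_1,\cdots,z_K]$ (the symbols $\mathbf{z}$ and $K$ are introduced only for this illustration and do not appear in the code): $$\operatorname{ReLU}(z_k)=\max(0,z_k),\qquad\operatorname{softmax}(\mathbf{z})_k=\frac{\exp(z_k)}{\sum_{j=1}^{K}\exp(z_j)}.$$ The softmax outputs are positive and sum to one, so they can be read as class probabilities, which is why it is a natural choice for the output layer of a classifier.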
Feel free to change the items above, for example, you can explore on the different optimizers, or on the activation functions.</p> <h3 id="load-the-data">Load the Data</h3> <p>The Iris dataset is available in Python’s <a href="https://scikit-learn.org/">Scikit-Learn</a> library and can be loaded as follows: <script src="https://gist.github.com/alstat/80e748cd186d94d51736348d607efc03.js"></script> <code>xdat</code> and <code>ydat</code> are both <code>np.ndarray</code> objects, and we are going to partition these into training and testing datasets using the <code>Data</code> class defined in the next section.</p> <h3 id="define-the-data-class">Define the Data Class</h3> <p>To organize the data processing, below is the class with methods on partitioning, tensorizing, and batching the input data: <script src="https://gist.github.com/alstat/e24d40807d6816b92076c300b57a4bce.js"></script></p> <h3 id="data-cleaningprocessing">Data Cleaning/Processing</h3> <p>We then apply the above codes to the iris dataset as shown below: <script src="https://gist.github.com/alstat/0284a46ed09b92bcf19de4e5cee527d2.js"></script></p> <!-- The unit that TensorFlow crunches during computation is of type <code>tf.Tensor</code>. Thus, we need to convert the <code>np.ndarray</code> to <code>tf.Tensor</code> objects, as in Lines 2-5 above. --> <h3 id="defining-the-dense-layer">Defining the Dense Layer</h3> <p>The model as mentioned in my <a href="https://estadistika.github.io/julia/python/packages/knet/flux/tensorflow/machine-learning/deep-learning/2019/06/20/Deep-Learning-Exploring-High-Level-APIs-of-Knet.jl-and-Flux.jl-in-comparison-to-Tensorflow-Keras.html">previous article</a> is a MultiLayer Perceptron (MLP) with one hidden layer, please refer to the said article for more details. In order to use the low level APIs, we need to define the <code>Dense</code> layer from scratch. 
As a guide, below is the mathematical formulation of the said layer: \begin{equation} \mathbf{y} = \sigma\left(\mathbf{x}\cdot\mathbf{W} + \mathbf{b}\right), \end{equation} where $\sigma$ is the activation function, which can be ReLU, sigmoid, and so on. Now in order for the above dot product to work, $\mathbf{x}$ must be a row-vector, i.e.: \begin{equation} \mathbf{x} = \left[ x_1,x_2,\cdots,x_n \right] \end{equation} where $n$ is the number of features. Thus, $\mathbf{W}$ is an $n\times m$-matrix, with $n$ as the dimension of the input layer and $m$ as the dimension of the next layer (it could be hidden or output), and $\mathbf{b}$ is a vector of length $m$. With regards to the weights matrix and the bias vector, we need to initialize them to some random starting values. There are several ways to do this, one of which is to use samples from the standard Gaussian distribution, i.e.</p> \begin{align} \mathbf{W} = [(w_{ij})]\quad &amp;\textrm{and}\quad \mathbf{b} = [(b_j)],\nonumber\\ w_{ij}, b_j \sim \mathcal{N}(0,1), \;i = &amp;\{1,\cdots,n\}, \;j=\{1,\cdots,m\}\nonumber \end{align} <p>Translating all these into code, we have the following: <script src="https://gist.github.com/alstat/6b743375c1b3b5a468de559f77408aa7.js"></script> Lines 3-6 initialize the <code>Dense</code> layer by assigning starting values to the weights, and Lines 8-12 define the feedforward. Further, if there is something special in the above code, it’s the <code>tf.Variable</code>. This class tells TensorFlow that the object is learnable during model optimization. Lastly, I want to emphasize that prior to TensorFlow 2.0, <code>tf.random.normal</code> was defined as <code>tf.random_normal</code>.</p> <h3 id="feedforward">Feedforward</h3> <p>We can now use the <code>Dense</code> layer to perform feedforward computation as follows: <script src="https://gist.github.com/alstat/7ade32577acd6c5f22007eaa6e3ac894.js"></script> The above code uses the first row and the first five rows, respectively, of <code>data.xtrn</code> as input. 
Of course, these values may change since we did not specify any seed for the initial weights and biases.</p> <h3 id="defining-the-chain">Defining the Chain</h3> <p>Now that we know how to initialize a single <code>Dense</code> layer, we need to extend it to multiple layers to come up with a MLP architecture. Hence, what we need to have now is a <code>Chain</code> for layers. In short, we need to replicate (not all attributes and methods) the Keras’ <code>Sequential</code>, and this is done as follows: <script src="https://gist.github.com/alstat/832bf930c99beeae550c93c3d8fbb0e8.js"></script> Let me walk you through, method-by-method. To initialize this class, we need to have a list of <code>Dense</code> layers (referring to <code>self.layers</code> at Line 4 above) as input. Once we have that, we can do feedforward computation using the untrained weights by simply calling the object (as defined by <code>__call__</code>).</p> <p>To train the weights, we need to call the <code>backward</code> method, which does backpropagation. This is done by <code>optimize</code>-ing the weights, using the <code>grad</code>ient obtained by differentiating the <code>loss</code> function of the model, with respect to the learnable parameters (referring to <code>self.params</code> in Line 12 above). In TensorFlow 2.0, it is recommended to use the Keras optimizers (<code>keras.optimizers</code>) instead of the <code>tf.train.GradientDescentOptimizer</code>. One thing to note as well, is the loss function which is a <code>.categorical_crossentropy</code> as opposed to <code>.sparse_categorical_crossentropy</code> which is used in my <a href="https://estadistika.github.io/julia/python/packages/knet/flux/tensorflow/machine-learning/deep-learning/2019/06/20/Deep-Learning-Exploring-High-Level-APIs-of-Knet.jl-and-Flux.jl-in-comparison-to-Tensorflow-Keras.html">previous article</a>. 
The difference is due to the fact that the target variable we have here is encoded as a one-hot vector (you can confirm this under the <code>to_tensor</code> method of the <code>Data</code> class above), and thus <i>cross entropy</i> is used, as opposed to integer encoding, in which case <i>sparse cross entropy</i> is more appropriate.</p> <p>Referring back to the code, I think Lines 33-34 are self-explanatory and should redirect us to the <code>grad</code> function where we find the <code>tf.GradientTape</code>. What exactly it does is not obvious at first glance, apart from the high level understanding that it has something to do with the gradient computation (or it could really be the main moving part). Well, from the name itself, we can think of it as a “Tape Recorder”, which records the operations applied to the trainable variables (referring to <code>tf.Variable</code> in Lines 4-5 of the <code>Dense</code> layer) during the forward pass. That is, when we do the forward operation (referring to <code>self(inputs)</code> at Line 28 above) inside the tape's context, TensorFlow keeps track of these operations so that the loss can afterwards be automatically differentiated with respect to the parameters. In order then to update the weights under the Adam (referring to <code>opt</code> in Line 34 above) algorithm, we need to have these gradients at the given iteration. 
Thus we have a recorder (<code>tf.GradientTape</code> in this case), meant to extract the recorded gradients.</p> <p>Finally, we chain the layers as follows: <script src="https://gist.github.com/alstat/94cbf6f01918a7d19d370cbc7b4cc83b.js"></script> This extends the previous model (referring to the <code>layer</code> object above) into a chain of two layers, and we can call it as follows: <script src="https://gist.github.com/alstat/f71a1f966ede34a473ee4308ca6a0dc6.js"></script> The output above is the prediction of the model using the untrained weights.</p> <h3 id="training">Training</h3> <p>At this point, we can now optimize the parameters by calling the <code>backward</code> method; and because we are not using Keras’ <code>Sequential([...]).fit</code>, we can customize our training procedure to suit our needs, starting with a custom definition of the model accuracy: <script src="https://gist.github.com/alstat/757b6796c9cd62c9b6bbadec179c254f.js"></script> Here’s the code for estimating the weights: <script src="https://gist.github.com/alstat/6e515094992afecb508bfd3ec5e0d7bf.js"></script> As you can see, we have three loops, two of which are inner loops over the minibatches of the training and testing datasets. Needless to say, the minibatches used in the testing dataset above are not really necessary, since we can have a single batch for validation. However, we have them for the purpose of comparing the performance of the optimization algorithm on both a single batch and three minibatches. Finally, the following tabulates the statistics we obtained from model estimation. <script src="https://gist.github.com/alstat/35a1f774a548e60198e9773c79edca6a.js"></script> The following plot shows the loss of the model across 500 epochs.
Since I did not specify any seed for the initial values of the weights, you will likely get a different curve: <img src="https://drive.google.com/uc?export=view&amp;id=1PPMJVt2RPtj7OYnTlPGbpOqS9ffzv89O" /> The corresponding accuracy is depicted below: <img src="https://drive.google.com/uc?export=view&amp;id=1ROu_mLT7t2D4RFj79YF17g9b-GXXIQAM" /> For the code of the above figures, please refer to this <a href="https://gist.github.com/alstat/a2f7f2725a2456ddfe86b83f9e6c1df6">link</a>. With regard to the series above, the optimization using three minibatches overfitted the data after 400+ epochs. This is evident in both figures, where we find a decrease in accuracy. Thus, it is recommended to have model checkpointing and early stopping during calibration, and I’ll leave that to the reader.</p> <h3 id="end-note">End Note</h3> <p>That’s it, I have shown you how to do modeling using TensorFlow 2.0’s Core APIs. As an end note, I want to highlight two key points on the importance of using the low-level APIs. The first is having full control over your end-to-end modeling process. Flexible tools enable the user to solve problems with custom models, custom objective functions, custom optimization algorithms, and whatnot. The second is the appreciation of the theory. It is simply fulfilling to see how the theory works in practice, and it gives the user the confidence to experiment, for example on the gradients and other internals.</p> <p>Lastly, I am pleased with the clean API of TF 2.0 as opposed to the redundant APIs we had in the previous versions; and having eager execution as the default configuration makes the library even more pythonic. Feel free to share your thoughts if you have comments or suggestions.</p> <h3 id="next-steps">Next Steps</h3> <p>In my next article, I will likely start on TensorFlow Probability, which extends the TF core APIs by incorporating a Bayesian approach to modeling and statistical analyses.
Otherwise, I will touch on modeling image datasets, or present a new topic.</p> <h3 id="complete-codes">Complete Codes</h3> <p>If you are impatient, here is the complete code excluding the plots. This should work after installing the required libraries: <script src="https://gist.github.com/alstat/8e5af440bc199b8ee8dfc53056c848ec.js"></script></p> <h3 id="references">References</h3> <ul> <li><a href="https://www.tensorflow.org/api/stable">TensorFlow 2.0 API Documentation</a></li> <li><a href="https://datascience.stackexchange.com/questions/41921/sparse-categorical-crossentropy-vs-categorical-crossentropy-keras-accuracy">Difference of Sparse Categorical Crossentropy and Categorical Crossentropy</a></li> </ul> <h3 id="software-versions">Software Versions</h3> <script src="https://gist.github.com/alstat/662ddb50fbe5c61bbebd6dbb6bf21e20.js"></script>Al-Ahmadgaid B. AsaadAt the end of my comparison (TensorFlow 1.14 Keras’ API versus Julia’s Flux.jl and Knet.jl high level APIs), I indicated some future write-ups I plan to do, one of which is to compare (obviously) on the low level APIs. However, with the release of the much anticipated TensorFlow 2.0, I decided not to wait for the next comparison and instead dedicate a separate article for the said release. The goal of this blog post then is to simply replicate the modeling in my previous article, without using the Keras API. Specifically, I’ll discuss the steps on constructing a custom layer, how we can chain multiple layers as in Keras’ Sequential, do a feedforward, a backward by updating the gradient, and batching the training datasets.Model Productization: Crafting a Web Application for Iris Classifier2019-07-25T04:00:00+00:002019-07-25T04:00:00+00:00https://estadistika.github.io//julia/python/software-engineering/ui-ux/model-deployment/productization/2019/07/25/Model%20Productization-Creating%20a%20Web%20Application%20for%20Iris%20Classifier<p>Any successful data science project must end with productization.
This is the stage where trained models are deployed as an application that can be easily accessed by the end users. The application can either be part of an already existing system, or it could be a standalone application working in the back-end of any interface. In any case, this part of the project deals mainly with software engineering, a task that involves front-end programming.</p> <p>In my previous article, we talked about data engineering by <a href="https://estadistika.github.io/julia/python/packages/relational-databases/2019/07/07/Interfacing-with-Relational-Database-using-MySQL.jl-and-PyMySQL.html">interfacing with relational database using MySQL.jl and PyMySQL</a>; and we’ve done <a href="https://estadistika.github.io/data/analyses/wrangling/julia/programming/packages/2018/06/08/Julia-Introduction-to-Data-Wrangling.html">data wrangling</a> as well. Moreover, we also touched on modeling using a <a href="https://estadistika.github.io/julia/python/packages/knet/flux/tensorflow/machine-learning/deep-learning/2019/06/20/Deep-Learning-Exploring-High-Level-APIs-of-Knet.jl-and-Flux.jl-in-comparison-to-Tensorflow-Keras.html">Multilayer Perceptron (MLP) for classifying the Iris species</a>. Hence, to cover the end-to-end spectrum of a typical data science project (excluding the business part), we’re going to deploy our Iris model as a web application.</p> <p>There are of course several options for the user interface other than the web, for example desktop and mobile front-ends; but I personally prefer a web-based one since you can access the application without having to install anything other than the browser.</p> <h3 id="final-output">Final Output</h3> <p>The final interface of our application is given below. The app is powered by two back-end servers, one running a Julia model and the other a Python model.
Try playing with the app to see how it works.</p> <p class="codepen" data-height="747" data-theme-id="dark" data-default-tab="result" data-user="alstat" data-slug-hash="YoMMOY" style="height: 550px; box-sizing: border-box; display: flex; align-items: center; justify-content: center; border: 2px solid; margin: 1em 0; padding: 1em;" data-pen-title="Tutorial"> <span>See the Pen <a href="https://codepen.io/alstat/pen/YoMMOY/"> Tutorial</a> by Al Asaad (<a href="https://codepen.io/alstat">@alstat</a>) on <a href="https://codepen.io">CodePen</a>.</span> </p> <script async="" src="https://static.codepen.io/assets/embed/ei.js"></script> <p>As indicated in the status pane of the application, the UI above is not powered by the back-end servers. That is, the prediction is not coming from the model, but is rather randomly sampled from the set of possible Iris species, i.e. Setosa, Versicolor, and Virginica. The same is true when estimating the model: the app simply mimics the model training. This is because I don’t have a cloud server, so the full application only works with the local servers on my machine. The application above, however, does send requests to the servers (localhost on ports 8081 and 8082, in this case), but if these are inaccessible, it randomly samples from the set of species when classifying, and randomly generates a test accuracy when training. This is why we still get a prediction despite having no communication with the back-end servers. In succeeding sections though, I will show you (via screencast) how the application works with the back-end servers.</p> <h3 id="project-source-codes">Project Source Codes</h3> <p>The complete codes for this article are available <a href="https://github.com/estadistika/projects/tree/master/2019-07-25-Iris-Web-App/model-deployment">here</a> as a GitHub repo, and I encourage you to have it on your machine if you want to follow the succeeding discussions.
The repo has the following folder structure: <script src="https://gist.github.com/alstat/1c5b9f777d05373f1be8801dc2bb100c.js"></script></p> <h3 id="software-architecture">Software Architecture</h3> <p>As mentioned earlier, the application is powered by back-end servers that are instances of Julia and Python. The communication between the user-interface (client) and the servers is done via HTTP (HyperText Transfer Protocol). The following is the architecture of our application: <img src="https://drive.google.com/uc?export=view&amp;id=1zRt5mOCgy5hY2F23QVHpFcvzA4TtCsIS" /> As shown in the figure, the client side handles responses asynchronously. That is, we can send multiple requests without waiting for the response to the preceding request. On the other hand, the server side processes the requests synchronously, that is, the requests from the client are handled one at a time. These servers are built with <a href="https://github.com/JuliaWeb/HTTP.jl">HTTP.jl</a> for Julia, and <a href="https://palletsprojects.com/p/flask/">Flask</a> for Python. If you are interested in an asynchronous back-end server, check out Python’s <a href="https://klein.readthedocs.io/en/latest/">Klein</a> library (Flask handled requests synchronously at the time of writing); and for Julia you can set HTTP.jl to work asynchronously. I should mention though that HTTP.jl is a bit lower in terms of API level compared to Flask. In fact, HTTP.jl is more comparable to Python’s <a href="https://2.python-requests.org/en/master/">Requests</a> library. As a counterpart to Flask, however, Julia offers <a href="https://github.com/GenieFramework/Genie.jl">Genie.jl</a> and I encourage you to check that out as well.</p> <h3 id="hypertext-transfer-protocol-http">HyperText Transfer Protocol (HTTP)</h3> <p>We’ve emphasized in the previous section that the communication between the interface and the model is done via HTTP. In this section, we’ll attempt to briefly cover the basics of this protocol.
To do this, we’ll setup a client and server instances both in Julia and in Python. This is done by first running the server in the Terminal/CMD, before running the client in a separate Terminal/CMD (follow the order, execute the server first before the client). Here are the codes:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-072019-1', 'tabcontent-1')">Julia (Client)</button> <button class="tablinks" onclick="openCity(event, 'julia-072019-2', 'tabcontent-1')">Julia (Server)</button> <button class="tablinks" onclick="openCity(event, 'python-072019-1', 'tabcontent-1')">Python (Client)</button> <button class="tablinks" onclick="openCity(event, 'python-072019-2', 'tabcontent-1')">Python (Server)</button> </div> <div id="julia-072019-1" class="tabcontent-1 first"> <script src="https://gist.github.com/alstat/efd32b484f5c3066db4aba073d7232e5.js"></script> </div> <div id="julia-072019-2" class="tabcontent-1" style="display: none;"> <script src="https://gist.github.com/alstat/14ac41a9aa0e4f84636dfe0d225e40ab.js"></script> </div> <div id="python-072019-1" class="tabcontent-1" style="display: none;"> <script src="https://gist.github.com/alstat/4a1b41e4ea3cf56dcbbd627245c079fd.js"></script> </div> <div id="python-072019-2" class="tabcontent-1" style="display: none;"> <script src="https://gist.github.com/alstat/47b77733fb514036c1c7483ff4e2eb33.js"></script> </div> <p>For your reference, here are the outputs of the above codes.</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-072019-output-1', 'tabcontent-1-out')">Julia (Output)</button> <button class="tablinks" onclick="openCity(event, 'python-072019-output-1', 'tabcontent-1-out')">Python (Output)</button> </div> <div id="julia-072019-output-1" class="tabcontent-1-out first"> <img id="julia-output" src="https://drive.google.com/uc?export=view&amp;id=1gpJIaqpP7Y4dye0-2dtGLEglcjvcezG0" style="margin-top: 
16px" /> </div> <div id="python-072019-output-1" class="tabcontent-1-out" style="display: none;"> <img id="python-output" src="https://drive.google.com/uc?export=view&amp;id=1vqYNskzKAJiZGfYTmh0gQoshjUI6F88h" style="margin-top: 16px" /> </div> <p>As shown in the screenshots above, the codes were executed at the root folder of the project repo (see the Project Source Codes section for the folder tree structure). The server we set up is running at localhost:8081 or 127.0.0.1:8081, waiting (or listening) for any incoming request from the client. Thus, when we ran the client codes, which send a POST request with the data <code>{"Hello" : "World"}</code> to localhost:8081, the server immediately responded to the client by echoing back the data received. The header specified in the response, <code>"Access-Control-Allow-Origin" =&gt; "*"</code>, simply tells the browser that a page from any (<code>*</code>) origin is allowed to read the response. For more details on HTTP, I think <a href="https://www.youtube.com/watch?v=eesqK59rhGA">this video</a> is useful.</p> <h3 id="server-iris-classifier">Server: Iris Classifier</h3> <p>At this point, you should have at least an idea of how HTTP works. To set this up in Julia and in Python, here are the codes for the REST API:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-072019-3', 'tabcontent-2')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-072019-3', 'tabcontent-2')">Python</button> </div> <div id="julia-072019-3" class="tabcontent-2 first"> <script src="https://gist.github.com/alstat/0b6a33e478979665e9adf10c17969f92.js"></script> </div> <div id="python-072019-3" class="tabcontent-2" style="display: none;"> <script src="https://gist.github.com/alstat/d1099efc4d6071488b16f5dbf17bcaee.js"></script> </div> <p>The requests are received at line 38 of the Julia code, and are handled depending on the target URL.
If the target URL of the client’s request is <code>/classify</code>, then the <code>classifier</code> router will handle the request via the <code>classify</code> function defined in line 7. Similarly, when we receive a training request, the <code>train</code> function defined in line 19 will handle it. Each of these functions then returns a response back to the client with the data. On the other hand, we see that Python’s Flask library uses a decorator for specifying the router, with two main helper functions defined in lines 32 and 37. The two approaches are indeed easy to follow, despite the difference in implementation. I should emphasize though that the above codes refer to other files to perform the training and classification. These files include <i>model-training.jl</i> and <i>model_training.py</i>, etc. I will leave it to the reader to explore these scripts in the project repo.</p> <h3 id="client-web-interface">Client: Web Interface</h3> <p>Now that we have an idea of how requests are processed on the servers, in this section we need to understand how the client prepares and sends requests to the server, and how it processes the response. In the CodePen embedded above, the client codes are in the JS (JavaScript) tab. The following is a copy of it: <script src="https://gist.github.com/alstat/b53164002475b06c559a0b9cc1177365.js"></script> Lines 23-28 define the event listeners for the buttons in the application. Line 24, for example, attaches to the <kbd>Classify</kbd> button the function defined in line 44. This function returns nothing, but it sends an HTTP request in line 79 to the URL defined in lines 63-67. The <code>httpRequest</code> function in line 5 is the one that handles the communication with the servers.
It takes three arguments: the <code>URL</code> (the address of the server), the <code>data</code> (the request data to be sent to the server), and the <code>callback</code> (the function that will handle the response from the server). The request, as mentioned, is handled asynchronously, and this is specified by the third argument, <code>true</code> (asynchronous), of the <code>xhr.open</code> method called in line 8. The rest of the codes are functions defining the functionalities of the buttons, status pane, output display, etc. of the app.</p> <h3 id="htmlcss">HTML/CSS</h3> <p>The other codes that are part of the application are the HTML and CSS files. I will leave these to the reader, since these codes are simply declarations of how the layout and style of the app should look.</p> <h3 id="screencast">Screencast</h3> <p>The following video shows how the application works with the back-end servers running.</p> <iframe width="100%" height="400px" src="https://www.youtube.com/embed/jxM_U9USkv4" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen=""></iframe> <h3 id="conclusion">Conclusion</h3> <p>The aim of this exercise is to demonstrate the capability of Julia in production, and one of my concerns was precompilation, specifically whether it occurs for every client’s request. Fortunately, it only happens at the first request; the succeeding ones reuse the compiled code and are therefore fast. That said, just as Python is stable for production, I think Julia is already stable as a back-end language for creating a fully featured web application, and is therefore capable enough for an end-to-end project.</p> <h3 id="software-versions">Software Versions</h3> <script src="https://gist.github.com/alstat/65dab0d062ea0fd229b4aa23c18fcd21.js"></script>Al-Ahmadgaid B.
AsaadAny successful data science project must end with productization. This is the stage where trained models are deployed as an application that can be easily accessed by the end users. The application can either be part of an already existing system, or it could be a standalone application working in the back-end of any interface. In any case, this part of the project deals mainly with software engineering, a task that involves front-end programming.Interfacing with Relational Database using MySQL.jl and PyMySQL2019-07-07T04:00:00+00:002019-07-07T04:00:00+00:00https://estadistika.github.io//julia/python/packages/relational-databases/2019/07/07/Interfacing%20with%20Relational%20Database%20using%20MySQL.jl%20and%20PyMySQL<p>Prior to the advent of computing, a relational database could be thought of as a log book of the kind typically used for inventory and for visitors’ time-in and time-out. These books contain rows that define the item/transaction, and columns describing the features of each row. Indeed, these are the core attributes of any relational database. Unlike spreadsheets, which are used for handling small datasets, databases are mostly used for storing huge transactional data for later use. They run on a server and often at the backend of any user (client) interface such as websites and mobile applications. These applications communicate with the database via processing servers (e.g. <a href="http://flask.pocoo.org/">Flask</a> and <a href="https://www.djangoproject.com/">Django</a>). The figure below illustrates the request and response communications between client and servers. <img src="https://drive.google.com/uc?export=view&amp;id=1cedn62AXe6LS-jxCjXBxCYmL1iDFRlYQ" /> As mentioned earlier, databases are meant to store data for <i>later use</i>, in the sense that we can use it as a response to client’s requests, such as viewing or data extraction for insights. In this article, we are interested in data extraction from the database.
In particular, the objective is to illustrate how to send requests to a MySQL server, and how to process the responses, both from Julia and Python.</p> <h3 id="mysql-server-setup">MySQL Server Setup</h3> <p>To start with, we need to set up a MySQL server on our machine. Click one of the following links to download and install the application.</p> <ul> <li><a href="https://docs.google.com/document/d/1B3ol7-Hte08mqzB5J8dBn1wFlpkxZ1Lgb2W7OUpIu60/edit?usp=sharing">MySQL Server Download and Installation on macOS</a></li> <li><a href="https://docs.google.com/document/d/1GaZ5dPOH9o5rQPmxFWjGS4ZmUjEEeDM497UdXwcEgco/edit?usp=sharing">MySQL Server Download and Installation on Windows</a></li> </ul> <p>Note that I recommend downloading the latest version of MySQL, since the guides above use an older version.</p> <h3 id="query-creating-database">Query: Creating Database</h3> <p>In order to appreciate what we are aiming for in this article, we need to go through some basic SQL queries to understand what type of request to send to the MySQL server. I’m using macOS, but the following should work on Windows as well. For macOS users, open the MySQL Server Shell by running <code>mysql -u root -p</code> in the terminal (hit <kbd>return</kbd> or <kbd>enter</kbd>, and type in the MySQL root password you specified during the installation in the previous section). For Windows, try to look for it in the Start Menu. <img src="https://drive.google.com/uc?export=view&amp;id=1hcAnM6KYuASiBhu5AzHqZpf2P1EBneYb" style="margin: -4px auto -30px auto;" /> From here, we are going to check the available databases in the MySQL server.
To do this, run the following: <script src="https://gist.github.com/alstat/1dbad1187130a31091aead6145dc0151.js"></script> Indeed, there are already four databases defined out of the box, and we don’t want to touch those. Instead, we are going to create our own database; let’s call it <code>tutorial</code>. To do this, run the following codes: <script src="https://gist.github.com/alstat/3d62031b3a0f2f2236568ffe0b9ec189.js"></script> The best thing about SQL syntax is that everything is self-explanatory, except maybe for line 19, which simply confirms that we are using <code>tutorial</code> as our database.</p> <h3 id="query-creating-table">Query: Creating Table</h3> <p>Next is to create a table for our database. We are going to use the <a href="https://halalanresults.abs-cbn.com/">2019 Philippine Election results</a> with columns: Last Name, First Name, Party, Votes. Further, for the purpose of illustration, we are going to use the top 5 senators only. <script src="https://gist.github.com/alstat/79d2c2c420a781d14834ef2307413045.js"></script></p> <h3 id="query-inserting-values">Query: Inserting Values</h3> <p>The following codes will insert the top five senators from the 2019 Philippine election results. <script src="https://gist.github.com/alstat/771051ecfca7075229a965e5861353f8.js"></script></p> <h3 id="query-show-data">Query: Show Data</h3> <p>To view the encoded data, we simply select all (<code>*</code>) the columns from the table. <script src="https://gist.github.com/alstat/dd620c5151a14261d8095d614d80b81a.js"></script></p> <h2 id="mysql-clients-on-julia-and-python">MySQL Clients on Julia and Python</h2> <p>In this section, we are going to interface with the MySQL server from Julia and Python. This is possible using the <a href="https://github.com/JuliaDatabases/MySQL.jl">MySQL.jl</a> and <a href="https://pymysql.readthedocs.io/en/latest/index.html">PyMySQL</a> libraries.
To start with, install the necessary libraries as follows:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-062819-1', 'tabcontent-1')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-062819-1', 'tabcontent-1')">Python</button> </div> <div id="julia-062819-1" class="tabcontent-1 first"> <script src="https://gist.github.com/alstat/844cee7187081181baea0aeb13efafa7.js"></script> </div> <div id="python-062819-1" class="tabcontent-1" style="display: none;"> <script src="https://gist.github.com/alstat/f38f27e23bc74549930bd439af5075f9.js"></script> </div> <p>For this exercise, our goal is to save the <a href="https://cran.r-project.org/web/packages/nycflights13/nycflights13.pdf">NYC Flights (2013)</a> data into the database and query it from Julia and Python.</p> <h3 id="downloading-nyc-flights-data">Downloading NYC Flights Data</h3> <p>I have a copy of the dataset on Github, and so the following code will download the said data:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-062819-2', 'tabcontent-2')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-062819-2', 'tabcontent-2')">Python</button> </div> <div id="julia-062819-2" class="tabcontent-2 first"> <script src="https://gist.github.com/alstat/c0cf42053baa058cb3336867d9040d1d.js"></script> </div> <div id="python-062819-2" class="tabcontent-2" style="display: none;"> <script src="https://gist.github.com/alstat/47d181a0efd1a63b829eded83cfe7402.js"></script> </div> <h3 id="connect-to-mysql-server">Connect to MySQL Server</h3> <p>In order for the client to send request to MySQL server, the user/client needs to connect to it using the credentials set in the installation.</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-062819-3', 'tabcontent-3')">Julia</button> <button class="tablinks" 
onclick="openCity(event, 'python-062819-3', 'tabcontent-3')">Python</button> </div> <div id="julia-062819-3" class="tabcontent-3 first"> <script src="https://gist.github.com/alstat/ef8e9274a0abf17055fa1cd35e343b02.js"></script> </div> <div id="python-062819-3" class="tabcontent-3" style="display: none;"> <script src="https://gist.github.com/alstat/e7b04fe16d4ae8f324ba2eab2fe3a47e.js"></script> </div> <p>Note that you need to have a strong password, and this configuration should not be exposed to the public. The above snippets are meant for illustration.</p> <h3 id="first-request">First Request</h3> <p>To test the connection, let’s send our first request, which is to show the tables in the database:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-062819-4', 'tabcontent-4')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-062819-4', 'tabcontent-4')">Python</button> </div> <div id="julia-062819-4" class="tabcontent-4 first"> <script src="https://gist.github.com/alstat/ab4d2403017c7b0a2e57e87590b202ad.js"></script> </div> <div id="python-062819-4" class="tabcontent-4" style="display: none;"> <script src="https://gist.github.com/alstat/84b1c60e618d93b00ab2294c13438c35.js"></script> </div> <p>In Julia, the response is received as a MySQL.Query object and can be viewed using DataFrame. For Python, however, you will get a tuple object.</p> <h3 id="create-nyc-flights-table">Create NYC Flights Table</h3> <p>At this point, we can now create the table for our dataset.
To do this, run the following:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-062819-5', 'tabcontent-5')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-062819-5', 'tabcontent-5')">Python</button> </div> <div id="julia-062819-5" class="tabcontent-5 first"> <script src="https://gist.github.com/alstat/edd6aff9b0d5fc9a45b808b36a2d3f95.js"></script> </div> <div id="python-062819-5" class="tabcontent-5" style="display: none;"> <script src="https://gist.github.com/alstat/890fe4cdc2e50f694a749448594cb248.js"></script> </div> <p>As shown in the previous section, sending request to the server both in Julia and in Python is done by simply using a string of SQL queries as input to MySQL.jl and PyMySQL APIs. Hence, the <code>query</code> object (in line 3 of Julia code and line 4 of Python code) above, simply automates the concatenation of SQL query for creating a table. Having said, you can of course write the query manually. 
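</p> <p>As a rough illustration of the concatenation idea, the following minimal Python sketch builds a <code>CREATE TABLE</code> statement from a list of column names and SQL types. The helper <code>create_table_query</code>, the table name, and the column types are hypothetical choices made for illustration; they are not taken from the gists above.</p>

```python
def create_table_query(table, columns):
    """Build a CREATE TABLE statement from (name, sql_type) pairs.

    Hypothetical helper: identifiers are interpolated directly into the
    string, so it should only be used with trusted table/column names.
    """
    # Backticks quote identifiers in MySQL, e.g. `year` INT.
    definitions = ", ".join(
        "`{}` {}".format(name, sql_type) for name, sql_type in columns
    )
    return "CREATE TABLE `{}` ({});".format(table, definitions)


# Assumed subset of the NYC Flights columns, with assumed SQL types.
columns = [("year", "INT"), ("carrier", "VARCHAR(10)"), ("dep_delay", "FLOAT")]
query = create_table_query("nycflights", columns)
print(query)
```

<p>The resulting string is what would then be passed to the library’s query function, in the same way as a manually written query.</p> <p>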
To check if we have indeed created the table, run the following codes:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-062819-6', 'tabcontent-6')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-062819-6', 'tabcontent-6')">Python</button> </div> <div id="julia-062819-6" class="tabcontent-6 first"> <script src="https://gist.github.com/alstat/3694845948496741bd3256729a1d8469.js"></script> </div> <div id="python-062819-6" class="tabcontent-6" style="display: none;"> <script src="https://gist.github.com/alstat/789b5aae723486aee01b6c018e61c60a.js"></script> </div> <p>As you can see, we’ve created it already, but with no entry yet.</p> <h3 id="populating-the-table">Populating the Table</h3> <p>Finally, we are going to populate the table in the database by inserting the values row by row.</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-062819-7', 'tabcontent-7')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-062819-7', 'tabcontent-7')">Python</button> </div> <div id="julia-062819-7" class="tabcontent-7 first"> <script src="https://gist.github.com/alstat/c8ee6c05a99d9cd0f4270dd8a8beb984.js"></script> </div> <div id="python-062819-7" class="tabcontent-7" style="display: none;"> <script src="https://gist.github.com/alstat/2f22ad0b1dd5f3ed39f360d2244c32f7.js"></script> </div> <p>From the above Julia code, the result of the <code>stmt</code> is an SQL <code>INSERT</code> query with placeholder values indicated by <code>?</code>. The timed (<code>@time</code> in Julia code) loop in line 9 above maps the values of the vector, one-to-one, to the elements (<code>?</code>) of the tuple in <code>stmt</code>. Having said, <code>MySQL.Stmt</code> has no equivalent in PyMySQL. 
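</p> <p>The one-to-one mapping of values to <code>?</code> placeholders can be sketched in Python with the standard-library <code>sqlite3</code> module, which happens to use the same <code>?</code> marker. This is a stand-in chosen so that the snippet runs without a MySQL server; it is neither the MySQL.jl nor the PyMySQL API, and the table and rows below are made up for illustration.</p>

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL server.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE flights (year INTEGER, carrier TEXT, dep_delay REAL)")

# INSERT statement with one placeholder (?) per column.
stmt = "INSERT INTO flights VALUES (?, ?, ?)"

# Made-up rows; each tuple is mapped, in order, onto the placeholders.
rows = [(2013, "UA", 2.0), (2013, "AA", 4.0), (2013, "B6", -1.0)]
for row in rows:
    cur.execute(stmt, row)

con.commit()  # persist the inserted rows

# Read the table back to confirm that the rows were inserted in order.
count = cur.execute("SELECT COUNT(*) FROM flights").fetchone()[0]
carriers = [r[0] for r in cur.execute("SELECT carrier FROM flights ORDER BY rowid")]
print(count, carriers)
con.close()
```

<p>PyMySQL supports a similar client-side pattern through <code>cursor.execute(sql, args)</code>, where the marker is <code>%s</code> instead of <code>?</code>.</p> <p>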
Further, one major difference between these libraries is that PyMySQL will not persist any changes to the table, no matter how many SQL queries are executed, unless you commit them (<code>con.commit</code>), as shown above. This is contrary to MySQL.jl, which automatically commits every execution of the SQL queries. I do like the idea of having <code>con.commit</code> in PyMySQL, since this helps avoid accidental deletion or modification in the database, thus adding a layer of safety. To check if we have indeed populated the table, run the following:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-062819-8', 'tabcontent-8')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-062819-8', 'tabcontent-8')">Python</button> </div> <div id="julia-062819-8" class="tabcontent-8 first"> <script src="https://gist.github.com/alstat/40c790d0216614cdcbf41de31dfa4e1a.js"></script> </div> <div id="python-062819-8" class="tabcontent-8" style="display: none;"> <script src="https://gist.github.com/alstat/09cf619f91773599b9902ba77fde7d76.js"></script> </div> <p>To disconnect from the server, run <code>MySQL.disconnect(con)</code> (Julia) or <code>con.close()</code> (Python).</p> <h3 id="benchmark">Benchmark</h3> <p>For the benchmark, I timed the populating and reading of the table in the previous section. The figure below summarizes the results. <img src="https://drive.google.com/uc?export=view&amp;id=1fhMJg3qIPupf3xhvyCW1p5Ph7tzn7UAH" /> The figure was plotted using <a href="http://gadflyjl.org/stable/index.html">Gadfly.jl</a>. Install this package using <code>Pkg</code> as described above (see the first code block under the <i>MySQL Clients on Julia and Python</i> section), along with <a href="https://github.com/JuliaGraphics/Cairo.jl">Cairo.jl</a> and <a href="https://github.com/JuliaGraphics/Fontconfig.jl">Fontconfig.jl</a>. The latter two packages are used to save the plot in PNG format.
See the code below to reproduce: <script src="https://gist.github.com/alstat/370b6b9eb33089f52c3f2f721e10e5d2.js"></script></p> <h3 id="conclusion">Conclusion</h3> <p>The aim of this article was simply to illustrate the usage of the MySQL.jl APIs in comparison to those of PyMySQL. I would say both libraries have similar APIs (as expected) and are stable for the tasks shown. I should emphasize, though, that I like the <code>con.commit</code> of PyMySQL since it adds a level of security, and I think it would be a good addition to MySQL.jl in the future.</p> <h3 id="complete-codes">Complete Codes</h3> <p>If you are impatient, here are the complete codes excluding the MySQL codes and the plots. These should work after installing the required libraries shown above:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-knet-060319-nn', 'tabcontent-nn')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-060319-nn', 'tabcontent-nn')">Python</button> </div> <div id="julia-knet-060319-nn" class="tabcontent-nn first"> <script src="https://gist.github.com/alstat/eda562ebbd22f3de61385ec79dad2373.js"></script> </div> <div id="python-060319-nn" class="tabcontent-nn" style="display: none;"> <script src="https://gist.github.com/alstat/69d25cb0a6210b3e702fe582c2127ba4.js"></script> </div> <h3 id="references-and-resources">References and Resources</h3> <ul> <li>MySQL.jl GitHub Repo: https://github.com/JuliaDatabases/MySQL.jl</li> <li>PyMySQL GitHub Repo: https://github.com/PyMySQL/PyMySQL</li> <li>Flaticon: https://www.flaticon.com/</li> </ul> <h3 id="software-versions">Software Versions</h3> <script src="https://gist.github.com/alstat/65dab0d062ea0fd229b4aa23c18fcd21.js"></script> Al-Ahmadgaid B. Asaad Prior to the advent of computing, relational databases could be thought of as the log books typically used for inventory and for visitors’ time-in and time-out records. 
These books contain rows that define the item or transaction, and columns that describe the features of each row. Indeed, these are the core attributes of any relational database. Unlike spreadsheets, which are used for handling small datasets, databases are mostly used for storing huge amounts of transactional data for later use. They run on a server, often at the backend of a user (client) interface such as a website or a mobile application. These applications communicate with the database via processing servers (e.g. Flask and Django). The figure below illustrates the request and response communications between client and servers. As mentioned earlier, databases are meant to store data for later use, in the sense that we can use it as a response to a client’s requests, such as viewing or data extraction for insights. In this article, we are interested in data extraction from the database. In particular, the objective is to illustrate how to send requests to a MySQL server, and how to process the responses, from both Julia and Python. MySQL Server Setup To start with, we need to set up a MySQL server on our machine. Click the following link to download and install the application. Deep Learning: Exploring High Level APIs of Knet.jl and Flux.jl in comparison to Tensorflow-Keras 2019-06-20T04:00:00+00:00 2019-06-20T04:00:00+00:00 https://estadistika.github.io//julia/python/packages/knet/flux/tensorflow/machine-learning/deep-learning/2019/06/20/Deep%20Learning:%20Exploring%20High%20Level%20APIs%20of%20Knet.jl%20and%20Flux.jl%20in%20comparison%20to%20Tensorflow-Keras <p>When it comes to complex modeling, specifically in the field of deep learning, the go-to tool for <a href="https://towardsdatascience.com/which-deep-learning-framework-is-growing-fastest-3f77f14aa318">most researchers</a> is <a href="https://www.tensorflow.org/">Google’s TensorFlow</a>. 
There are a number of good reasons why, one of which is that it provides both high- and low-level APIs that suit the needs of beginners and advanced users, respectively. I have used it in some of my projects, and indeed it was powerful enough for the task. This is also due to the fact that TensorFlow is one of the <a href="https://github.com/tensorflow/tensorflow/graphs/contributors">most actively</a> developed deep learning frameworks, with Bayesian inference or probabilistic reasoning as a recent extension (see <a href="https://www.tensorflow.org/probability/">TensorFlow Probability</a>; another extension is <a href="https://www.tensorflow.org/js">TensorFlow.js</a>). While the library is written mostly in C++ for performance, the main API is served in Python for ease of use. This design works around a static computational graph that needs to be defined declaratively before it is executed. The static nature of this graph, however, made the models difficult to debug, since the code is itself data for defining the computational graph. Hence, you cannot use a debugger to check the results of the model line by line. Thankfully, it’s 2019 already and we have a stable <a href="https://www.tensorflow.org/guide/eager">Eager Execution</a> that allows users to immediately check the results of any TensorFlow operation. Indeed, this is more intuitive and more pythonic. In this article, however, we’ll explore what else we have in 2019. In particular, let’s take a look at Julia’s deep learning libraries and compare them to the high-level APIs of TensorFlow, i.e. Keras’ model specification.</p> <p>As a language that leans towards numerical computation, it’s no surprise that Julia offers a number of choices for doing deep learning. Here are the stable libraries:</p> <ol> <li> <a href="https://github.com/FluxML/Flux.jl">Flux.jl</a> - The Elegant Machine Learning Stack. 
</li> <li> <a href="https://github.com/denizyuret/Knet.jl">Knet.jl</a> - Koç University deep learning framework. </li> <li> <a href="https://github.com/alan-turing-institute/MLJ.jl">MLJ.jl</a> - Julia machine learning framework by Alan Turing Institute. </li> <li> <a href="https://github.com/apache/incubator-mxnet/tree/master/julia#mxnet">MXNet.jl</a> - Apache MXNet Julia package. </li> <li> <a href="https://github.com/malmaud/TensorFlow.jl">TensorFlow.jl</a> - A Julia wrapper for TensorFlow. </li> </ol> <p>Other related packages are maintained in <a href="https://github.com/JuliaML">JuliaML</a>. For this article, we are going to focus on the usage of Flux.jl and Knet.jl, and we are going to use the <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris dataset</a> for classification task using <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">Multilayer Perceptron</a>. To start with, we need to install the following packages. I’m using Julia 1.1.0. and Python 3.7.3.</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-060319-1', 'tabcontent-1')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-060319-1', 'tabcontent-1')">Python</button> </div> <div id="julia-060319-1" class="tabcontent-1 first"> <script src="https://gist.github.com/alstat/0f696956e8856bbd40c4364c8bd526b8.js"></script> </div> <div id="python-060319-1" class="tabcontent-1" style="display: none;"> <script src="https://gist.github.com/alstat/980a3dd113dea6774a8ff9c9d4b65f2b.js"></script> </div> <h3 id="loading-and-partitioning-the-data">Loading and Partitioning the Data</h3> <p>The Iris dataset is available in the <a href="https://github.com/JuliaStats/RDatasets.jl">RDatasets.jl</a> Julia package and in Python’s <a href="https://scikit-learn.org/">Scikit-Learn</a>. 
The following codes load the libraries and the data itself.</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-060319-knet-2', 'tabcontent-2')">Julia (Knet.jl)</button> <button class="tablinks" onclick="openCity(event, 'julia-060319-flux-2', 'tabcontent-2')">Julia (Flux.jl)</button> <button class="tablinks" onclick="openCity(event, 'python-060319-2', 'tabcontent-2')">Python</button> </div> <div id="julia-060319-knet-2" class="tabcontent-2 first"> <script src="https://gist.github.com/alstat/e8dcf372308bf96df39b098a4b443d33.js"></script> </div> <div id="julia-060319-flux-2" class="tabcontent-2" style="display: none;"> <script src="https://gist.github.com/alstat/d8bee345413942db404ab30609c91170.js"></script> </div> <div id="python-060319-2" class="tabcontent-2" style="display: none;"> <script src="https://gist.github.com/alstat/9e763b0ddbe0e010da6191322b79a394.js"></script> </div> <p>The random seed set above is meant for reproducibility as it will give us the same random initial values for model training. The <code>iris</code> variable in line 11 (referring to Julia code) contains the data, and is a data frame with 150 × 5 dimensions, where the columns are: Sepal Length, Sepal Width, Petal Length, Petal Width, and Species. There are several ways to partition this data into training and testing datasets, one procedure is to do stratified sampling, with simple random sampling without replacement as the sampling selection within each stratum — the species. 
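</p> <p>The sampling design described above (stratify by species, then draw a simple random sample without replacement within each stratum) can be sketched in a few lines. This is a minimal plain-Python illustration, not the function defined in the article’s gists; the name <code>stratified_split</code>, the 70% training fraction, and the seed are assumptions chosen for the example.</p>

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=123):
    """Return (train_idx, test_idx), sampling without replacement within each label."""
    rng = random.Random(seed)

    # Group the row indices by stratum (here: the species label).
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        n_train = round(train_fraction * len(idx))
        # Simple random sample without replacement inside the stratum.
        chosen = set(rng.sample(idx, n_train))
        train_idx.extend(i for i in idx if i in chosen)
        test_idx.extend(i for i in idx if i not in chosen)
    return train_idx, test_idx


# Toy labels mimicking Iris: 50 observations for each of the three species.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

<p>With 50 observations per species, a 70% fraction gives 35 training rows per stratum, that is, 105 training and 45 testing observations in total.</p> <p>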
The following codes define the function for partitioning the data with the mentioned sampling design:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-060319-3', 'tabcontent-3')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-060319-3', 'tabcontent-3')">Python</button> </div> <div id="julia-060319-3" class="tabcontent-3 first"> <script src="https://gist.github.com/alstat/f1441c844ef4a5f465b00f60aa11ec85.js"></script> </div> <div id="python-060319-3" class="tabcontent-3" style="display: none;"> <script src="https://gist.github.com/alstat/ef9c6e6fdd78dd4931458e1c7a644644.js"></script> </div> <p>Extract the training and testing datasets using the function above as follows:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-knet-060319-4', 'tabcontent-4')">Julia (Knet.jl)</button> <button class="tablinks" onclick="openCity(event, 'julia-flux-060319-4', 'tabcontent-4')">Julia (Flux.jl)</button> <button class="tablinks" onclick="openCity(event, 'python-060319-4', 'tabcontent-4')">Python</button> </div> <div id="julia-knet-060319-4" class="tabcontent-4 first"> <script src="https://gist.github.com/alstat/b49d14f563a43b2e2d25b1b70860539c.js"></script> </div> <div id="julia-flux-060319-4" class="tabcontent-4 first" style="display: none;"> <script src="https://gist.github.com/alstat/54a248104146c7e94340510b0a1c26eb.js"></script> </div> <div id="python-060319-4" class="tabcontent-4" style="display: none;"> <script src="https://gist.github.com/alstat/9c5c3549794b0918172205235137ad25.js"></script> </div> <p>All three codes above extract <code>xtrn</code>, the training data (feature) matrix of size 105 × 4 (105 observations by 4 features) dimensions; <code>ytrn</code>, the corresponding training target variable with 105 × 1 dimension; <code>xtst</code>, the feature matrix for testing dataset with 45 × 4 dimensions; and <code>ytst</code>, 
the target variable with 45 × 1 dimension for testing dataset. Moreover, contrary to TensorFlow-Keras, Knet.jl and Flux.jl need further data preparation from the above partitions. In particular, Knet.jl takes minibatch object as input data for model training, while Flux.jl needs one-hot encoding for the target variables <code>ytrn</code> and <code>ytst</code>. Further, unlike Knet.jl which ships with minibatch function, Flux.jl gives the user the flexibility to create their own.</p> <h3 id="specify-the-model">Specify the Model</h3> <p>The model that we are going to use is a <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">Multilayer Perceptron</a> with the following architecture: 4 neurons for the input layer, 10 neurons for the hidden layer, and 3 neurons for the output layer. The first two layers contain bias, and the neurons of the last two layers are activated with Rectified Linear Unit (ReLU) and softmax functions, respectively. The diagram below illustrates the architecture described: <img src="https://drive.google.com/uc?export=view&amp;id=1oYnD8KZ1NQqbJccw8NYugj0kqihQZBJX" /> <!-- https://drive.google.com/file/d/1jqASeBjmbSImp5hEoEZxHMizKUkSLyC8/view?usp=sharing --> The codes below specify the model:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-knet-060319-5', 'tabcontent-5')">Julia (Knet.jl)</button> <button class="tablinks" onclick="openCity(event, 'julia-flux-060319-5', 'tabcontent-5')">Julia (Flux.jl)</button> <button class="tablinks" onclick="openCity(event, 'python-060319-5', 'tabcontent-5')">Python (TensorFlow-Keras)</button> </div> <div id="julia-knet-060319-5" class="tabcontent-5 first"> <script src="https://gist.github.com/alstat/e5ebdd980786a552511ceafcc70905f5.js"></script> </div> <div id="julia-flux-060319-5" class="tabcontent-5 first" style="display: none;"> <script src="https://gist.github.com/alstat/b84fe0bc6c5bfba22d903f1bc8b1f774.js"></script> </div> <div 
id="python-060319-5" class="tabcontent-5 first" style="display: none;"> <script src="https://gist.github.com/alstat/b99baeaf255f8704e64a0ff9077e0283.js"></script> </div> <p>For users coming from TensorFlow-Keras, Flux.jl provides a Keras-like API for model specification, with <code>Flux.Chain</code> as the counterpart of Keras’ <code>Sequential</code>. This is different from Knet.jl, where the highest-level API you get is the nuts and bolts for constructing the layers. That said, <code>Flux.Dense</code> is defined almost exactly like the Dense struct in the Knet.jl code above (check the source code <a href="https://github.com/FluxML/Flux.jl/blob/1902c0e7c568a1bdb0cda7dca4d69f3896c023c7/src/layers/basic.jl#L82-L100">here</a>). In addition, since both Flux.jl and Knet.jl are written purely in Julia, the source code under the hood is accessible to beginners, giving the user a full understanding of not just the code but also the math. Check the screenshots below for the distribution of file types in the GitHub repos of the three frameworks: <img src="https://drive.google.com/uc?export=view&amp;id=1hmWiYy6C01q_X8HGl5xDSjQClJz5H7Ym" /> <img src="https://drive.google.com/uc?export=view&amp;id=17VAf7wOT9Ej47OZQu9B6o4kOCCs4G_tw" /> <img src="https://drive.google.com/uc?export=view&amp;id=18RmeurpIX0uzBP9sKwiN24QI_VVLicYF" /> From the above figures, it’s clear that Flux.jl is 100% Julia. Knet.jl, on the other hand, is effectively 100% Julia as well, although this is not apparent from the screenshot. The 41.4% of Jupyter Notebooks and the other small percentages account for the tutorials, tests and examples, not the <a href="https://github.com/denizyuret/Knet.jl/tree/master/src">source code</a>. 
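</p> <p>To make the architecture specified in this section concrete (4 inputs, 10 hidden neurons with ReLU, 3 outputs with softmax, and a bias in each dense layer), here is a minimal forward pass in plain Python. It is only a sketch of the arithmetic, not how Knet.jl, Flux.jl or Keras implement their layers; the helper names, the random untrained weights, and the sample input are assumptions for illustration.</p>

```python
import math
import random


def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
w1, b1 = init_layer(10, 4, rng)  # input (4) -> hidden (10)
w2, b2 = init_layer(3, 10, rng)  # hidden (10) -> output (3)


def forward(x):
    hidden = relu(dense(x, w1, b1))
    return softmax(dense(hidden, w2, b2))


probs = forward([5.1, 3.5, 1.4, 0.2])  # four illustrative flower measurements
predicted_class = probs.index(max(probs))  # argmax gives the integer class code
print(probs, predicted_class)
```

<p>The output is a vector of three probabilities that sum to one. Since the weights are untrained, the predicted class here carries no meaning; training is what adjusts <code>w1</code>, <code>b1</code>, <code>w2</code> and <code>b2</code>.</p> <p>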
<!-- There are several --> <!-- <img src="http://drive.google.com/uc?export=view&id=1HKlC04oVF_3ggraWE7qnMNvW-Atxc89N"> --> <!-- https://drive.google.com/file/d/1HKlC04oVF_3ggraWE7qnMNvW-Atxc89N/view?usp=sharing --></p> <h3 id="train-the-model">Train the Model</h3> <p>Finally, train the model as follows for 100 epochs:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-knet-060319-6', 'tabcontent-6')">Julia (Knet.jl)</button> <button class="tablinks" onclick="openCity(event, 'julia-flux-060319-6', 'tabcontent-6')">Julia (Flux.jl)</button> <button class="tablinks" onclick="openCity(event, 'python-060319-6', 'tabcontent-6')">Python (TensorFlow-Keras)</button> </div> <div id="julia-knet-060319-6" class="tabcontent-6 first"> <script src="https://gist.github.com/alstat/e06a44c5d748f2e0d266f0882b623b2a.js"></script> </div> <div id="julia-flux-060319-6" class="tabcontent-6 first" style="display: none;"> <script src="https://gist.github.com/alstat/c007ed68a89d4974419ad4ffcea2ff81.js"></script> </div> <div id="python-060319-6" class="tabcontent-6" style="display: none;"> <script src="https://gist.github.com/alstat/82525c52d17b5a5b470c392598a99e58.js"></script> </div> <p>The codes (referring to Julia codes) above save both loss and accuracy for every epoch into a data frame and then into a CSV file. These will be used for visualization. Moreover, unlike Flux.jl and Knet.jl which require minibatch preparation prior to training, TensorFlow-Keras specifies this on <code>fit</code> method as shown above. Further, it is also possible to train the model in Knet.jl using a single function without saving the metrics. 
This is done as follows:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-knet-060319-6-a', 'tabcontent-6-a')">Julia (Knet.jl)</button> <button class="tablinks" onclick="openCity(event, 'julia-flux-060319-6-a', 'tabcontent-6-a')">Julia (Flux.jl)</button> </div> <div id="julia-knet-060319-6-a" class="tabcontent-6-a first"> <script src="https://gist.github.com/alstat/cc3882f36e47ef8c289392626131af9d.js"></script> </div> <div id="julia-flux-060319-6-a" class="tabcontent-6-a first" style="display: none;"> <script src="https://gist.github.com/alstat/5002d261a67eccebd174ad0ea43969e4.js"></script> </div> <p>The Flux.jl code above simply illustrates the use of the <code>Flux.@epochs</code> macro for looping instead of the <code>for</code> loop. The loss of the model for 100 epochs is visualized below across frameworks: <img src="https://drive.google.com/uc?export=view&amp;id=11y4RGyrY1e62cNl9H6Kce6tdcVug8AHt" /> From the above figure, one can observe that Flux.jl started from poor initial values set by the random seed earlier. Fortunately, <a href="https://arxiv.org/abs/1412.6980">Adam</a> rapidly moves the parameters toward a minimum of the loss. The figure was plotted using <a href="http://gadflyjl.org/stable/index.html">Gadfly.jl</a>. Install this package using <code>Pkg</code> as described in the first code block, along with <a href="https://github.com/JuliaGraphics/Cairo.jl">Cairo.jl</a> and <a href="https://github.com/JuliaGraphics/Fontconfig.jl">Fontconfig.jl</a>. The latter two packages are used to save the plot in PNG format. See the code below to reproduce: <script src="https://gist.github.com/alstat/f0d63cf8cd0b41125f9bc072e0fc451b.js"></script></p> <h3 id="evaluate-the-model">Evaluate the Model</h3> <p>The output of the model ends with a vector of three neurons. 
The index or location of the neurons in this vector defines the corresponding integer encoding, with 1st index as setosa, 2nd as versicolor, and 3rd as virginica. Thus, the codes below take the argmax of the vector to get the integer encoding for evaluation.</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-knet-060319-7', 'tabcontent-7')">Julia (Knet.jl)</button> <button class="tablinks" onclick="openCity(event, 'julia-flux-060319-7', 'tabcontent-7')">Julia (Flux.jl)</button> <button class="tablinks" onclick="openCity(event, 'python-060319-7', 'tabcontent-7')">Python (TensorFlow-Keras)</button> </div> <div id="julia-knet-060319-7" class="tabcontent-7 first"> <script src="https://gist.github.com/alstat/668465ba90240dc071a71c8132f07268.js"></script> </div> <div id="julia-flux-060319-7" class="tabcontent-7 first" style="display: none;"> <script src="https://gist.github.com/alstat/59a20104dcdce45045e28a5d60ffd86d.js"></script> </div> <div id="python-060319-7" class="tabcontent-7" style="display: none;"> <script src="https://gist.github.com/alstat/59d17503d149abbe4ec338b4190affec.js"></script> </div> <p>The figure below shows the traces of the accuracy during training: <img src="https://drive.google.com/uc?export=view&amp;id=1L2d8mfkC-9zl3KeLXdVpynXR_OuyyzC1" /> TensorFlow took 25 epochs before surpassing 50% again. 
To reproduce the figure, run the following codes (make sure to load Gadfly.jl and other related libraries mentioned earlier in generating the loss plots): <script src="https://gist.github.com/alstat/9539e00aef208062b1c6b900efa6c258.js"></script></p> <h3 id="benchmark">Benchmark</h3> <p>At this point, we are going to record the training time of each framework.</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-knet-060319-8', 'tabcontent-8')">Julia (Knet.jl)</button> <button class="tablinks" onclick="openCity(event, 'julia-flux-060319-8', 'tabcontent-8')">Julia (Flux.jl)</button> <button class="tablinks" onclick="openCity(event, 'python-060319-8', 'tabcontent-8')">Python (TensorFlow-Keras)</button> </div> <div id="julia-knet-060319-8" class="tabcontent-8 first"> <script src="https://gist.github.com/alstat/0381d47e59cc9dd401fc9fb1342f2374.js"></script> </div> <div id="julia-flux-060319-8" class="tabcontent-8 first" style="display: none;"> <script src="https://gist.github.com/alstat/d738aeeaa2cddb62ee41478748d4446b.js"></script> </div> <div id="python-060319-8" class="tabcontent-8" style="display: none;"> <script src="https://gist.github.com/alstat/cf2bf0fac9fe9e55051f524d353aaf97.js"></script> </div> <p>The benchmark was done by running the above code about 10 times for each framework; I then took the lowest elapsed time out of the results. In addition, I gave my machine a fresh start before running the code for each framework. 
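</p> <p>The “run it about 10 times and keep the lowest time” procedure described above can be sketched with Python’s standard <code>timeit</code> module. This is an illustration of the measurement idea only, not the benchmark code from the gists; the <code>workload</code> function is a hypothetical stand-in for a full training run.</p>

```python
import timeit


def workload():
    """Placeholder for a training run (hypothetical stand-in)."""
    return sum(i * i for i in range(10_000))


# Run the workload 10 separate times and record each elapsed time in seconds.
timings = timeit.repeat(workload, repeat=10, number=1)

# The minimum is the run least disturbed by other processes on the machine.
best = min(timings)
print(f"best of {len(timings)} runs: {best:.6f} s")
```

<p>Taking the minimum rather than the mean is a common choice for this kind of benchmark, since slower runs mostly reflect interference from the rest of the system rather than the code being timed.</p> <p>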
<img src="https://drive.google.com/uc?export=view&amp;id=1u0selT5l8vP7n-LNdvac_bzyMGC1vSDD" /> The code of the above figure is given below (make sure to load Gadfly.jl and other related libraries mentioned earlier in generating the loss plots): <script src="https://gist.github.com/alstat/c1bc13ab6e772a4104b51d164f4e172d.js"></script></p> <h3 id="conclusion">Conclusion</h3> <p>In conclusion, I would say Julia is worth investing in, even for deep learning, as illustrated in this article. The two frameworks, Flux.jl and Knet.jl, provide a clean API that introduces a new way of defining models, as opposed to the object-oriented approach of TensorFlow-Keras. One thing to emphasize here is the <code>for</code> loop, which I added in training the model simply to save the accuracy and loss metrics. The <code>for</code> loop did not compromise the speed (though Knet.jl is much faster without it). This is crucial since it lets the user spend more time on solving the problem and less on optimizing the code. Further, between the two Julia frameworks, I find Knet.jl to be <a href="https://www.youtube.com/watch?v=ijI0BLf-AH0">Julia + little-else</a>, as described by <a href="http://www.denizyuret.com/">Professor Deniz Yuret</a> (the main developer): there are no special APIs for Dense, Chains, etc., so you have to code them yourself. Although this is also possible in Flux.jl, Knet.jl does not provide these out of the box; it ships only with the nuts and bolts, and those are the highest-level APIs the user gets. That said, I think Flux.jl is a better recommendation for beginners coming from TensorFlow-Keras. This is not to say that Knet.jl is hard; it’s not if you know Julia already. In addition, I do love the extent of flexibility that Knet.jl offers by default, which I think is best for advanced users. 
Lastly, just like the different extensions of TensorFlow, Flux.jl is flexible enough that it works well with <a href="https://turing.ml/">Turing.jl</a> for doing Bayesian deep learning, which is a good alternative to <a href="https://www.tensorflow.org/probability/">TensorFlow Probability</a>. For Neural Differential Equations, Flux.jl works well with <a href="https://github.com/JuliaDiffEq/DifferentialEquations.jl">DifferentialEquations.jl</a>; check out <a href="https://julialang.org/blog/2019/01/fluxdiffeq">DiffEqFlux.jl</a>.</p> <h3 id="next-steps">Next Steps</h3> <p>In my next article, we will explore the low-level APIs of Flux.jl and Knet.jl in comparison to the low-level APIs of TensorFlow. One thing that’s also missing from the above exercise is the use of a GPU for model training, and I hope to tackle this in future articles. Finally, I plan to test these Julia libraries on real deep learning problems, such as computer vision and natural language processing (check out the <a href="https://www.youtube.com/watch?v=21_wokgnNog">workshop</a> on these from JuliaCon 2018).</p> <h3 id="complete-codes">Complete Codes</h3> <p>If you are impatient, here are the complete codes excluding the benchmarks and the plots. 
These should work after installing the required libraries shown above:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-knet-060319-n', 'tabcontent-n')">Julia (Knet.jl)</button> <button class="tablinks" onclick="openCity(event, 'julia-flux-060319-n', 'tabcontent-n')">Julia (Flux.jl)</button> <button class="tablinks" onclick="openCity(event, 'python-060319-n', 'tabcontent-n')">Python (TensorFlow-Keras)</button> </div> <div id="julia-knet-060319-n" class="tabcontent-n first"> <script src="https://gist.github.com/alstat/e51343935f90c972aa6dcf18b60aefe2.js"></script> </div> <div id="julia-flux-060319-n" class="tabcontent-n first" style="display: none;"> <script src="https://gist.github.com/alstat/004cc6d457bd22fce99148d14f37dc32.js"></script> </div> <div id="python-060319-n" class="tabcontent-n" style="display: none;"> <script src="https://gist.github.com/alstat/1bdf9cc7019ca0e5684c991fae4715ec.js"></script> </div> <h3 id="references">References</h3> <ul> <li>Yuret, Deniz (2016). <a href="https://pdfs.semanticscholar.org/28ee/845420b8ba275cf1d695fbf383cc21922fbd.pdf">Knet: beginning deep learning with 100 lines of Julia</a>. 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.</li> <li>Innes, Mike (2018). <a href="http://joss.theoj.org/papers/10.21105/joss.00602">Flux: Elegant machine learning with Julia</a>. Journal of Open Source Software, 3(25), 602, https://doi.org/10.21105/joss.00602</li> <li>Abadi, Martin et al. (2016). <a href="https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf">TensorFlow: A system for large-scale machine learning</a>. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.</li> </ul> <h3 id="software-versions">Software Versions</h3> <script src="https://gist.github.com/alstat/65dab0d062ea0fd229b4aa23c18fcd21.js"></script> Al-Ahmadgaid B. 
Asaad Julia: Introduction to Web Scraping (PHIVOLCS’ Seismic Events) 2018-10-30T04:00:00+00:00 2018-10-30T04:00:00+00:00 https://estadistika.github.io//web/scraping/philippines/julia/programming/packages/2018/10/30/Introduction-to-Web-Scraping-Julia <p>Data nowadays are almost everywhere, often stored in as simple as traditional log books, to as complex as multi-connected-databases. 
Efficient collection of these datasets is crucial for analytics since data processing takes almost 50% of the overall workflow. One case where manual data collection can be automated is that of datasets published on websites, where the providers are usually government agencies. For example, in the Philippines, there is a website dedicated to <a href="http://openstat.psa.gov.ph/" target="_blank">Open Stat</a>, initiated by the <a href="https://psa.gov.ph/" target="_blank">Philippine Statistics Authority (PSA)</a>. The site hosts public datasets for researchers to use, well prepared in CSV format, so consumers can simply download the files. Unfortunately, for some agencies this feature is not yet available. That is, users need to either copy-paste the data from the website, or request it from the agency directly (this also takes time). A good example of this is the seismic events data of the <a href="https://www.phivolcs.dost.gov.ph/" target="_blank">Philippine Institute of Volcanology and Seismology (PHIVOLCS)</a>.</p> <p>Data encoded in HTML can be parsed and saved into formats that are workable for doing analyses (e.g. CSV, TSV, etc.). The task of harvesting and parsing data from the web is called <strong>web scraping</strong>, and PHIVOLCS’ <a href="https://www.phivolcs.dost.gov.ph/html/update_SOEPD/EQLatest.html" target="_blank">Latest Seismic Events</a> is a good playground for beginners. There are several tutorials available, especially for Python (see <a href="https://www.dataquest.io/blog/web-scraping-tutorial-python/" target="_blank">this</a>) and R (see <a href="https://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on-knowledge/" target="_blank">this</a>), but not many for Julia. Hence, this article is primarily for Julia users. 
However, this work also introduces web tools, and how to use them for inspecting the components of a website, which can be useful for non-Julia users as well.</p> <h3 id="why-julia">Why Julia?</h3> <p>The creators of the language described it well in their first announcement (I suggest you read the full post): <a href="https://julialang.org/blog/2012/02/why-we-created-julia" target="_blank">Why we created Julia?</a> Here’s part of it:</p> <blockquote> <p><em>We are greedy: we want more.</em></p> </blockquote> <blockquote> <p><em>We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.</em></p> </blockquote> <blockquote> <p><em>(Did we mention it should be as fast as C?)</em></p> </blockquote> <p>I used Julia in my master’s thesis for my <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo" target="_blank">MCMC simulations</a> and benchmarked it against R. It took seconds in Julia while R took more than an hour (sampling over the posterior distribution). I could have optimized my R code using <a href="http://www.rcpp.org/" target="_blank">Rcpp</a> (writing the performance-critical part in C++ for speed and calling it from R), but I had no time for that. 
Hence, Julia solves the <a href="https://www.quora.com/What-is-the-2-language-problem-in-data-science" target="_blank">two-language problem</a>.</p> <h3 id="getting-to-know-html">Getting to know HTML</h3> <p>Since the data published in the websites are usually encoded as a table, it is therefore best to understand the structure of the HTML document before performing web scraping. HTML (Hypertext Markup Language) is a standardized system for tagging text files to achieve font, color, graphic, and hyperlink effects on World Wide Web pages [<a href="https://www.google.com/search?q=what+is+HTML&amp;ie=utf-8&amp;oe=utf-8&amp;client=firefox-b-ab" target="_blank">1</a>]. For example, bold text in HTML is enclosed inside the <code class="language-plaintext highlighter-rouge">&lt;b&gt;</code> tag, e.g. <code class="language-plaintext highlighter-rouge">&lt;b&gt;</code>text<code class="language-plaintext highlighter-rouge">&lt;/b&gt;</code>, the result is <b>text</b>. A webpage is a HTML document that can be structured in several ways, one possible case is as follows: <img src="https://drive.google.com/uc?export=view&amp;id=1WW3yUzJ5ZhGRNPYolM_S4dsJ5ts5urPU" /> Scrapers must be familiar with the hierarchy of the HTML document as this will be the template for the frontend source code of every website. Following the structure of the above figure, data encoded in HTML table are placed inside the <code class="language-plaintext highlighter-rouge">td</code> (table data) tag, where <code class="language-plaintext highlighter-rouge">td</code> is under <code class="language-plaintext highlighter-rouge">tr</code> (table row), <code class="language-plaintext highlighter-rouge">tr</code> is under <code class="language-plaintext highlighter-rouge">tbody</code> (table body), and so on. <code class="language-plaintext highlighter-rouge">td</code> is the lowest level tag (sorting by hierarchy) from the figure above that can contain data. 
However, <code class="language-plaintext highlighter-rouge">td</code> can itself contain other tags such as <code class="language-plaintext highlighter-rouge">p</code> (paragraph), <code class="language-plaintext highlighter-rouge">a</code> (hyperlink), <code class="language-plaintext highlighter-rouge">b</code> (bold), <code class="language-plaintext highlighter-rouge">i</code> (italic), <code class="language-plaintext highlighter-rouge">span</code> (span), and even <code class="language-plaintext highlighter-rouge">div</code> (division). So expect to encounter these under <code class="language-plaintext highlighter-rouge">td</code> as well.</p> <p>As indicated in the figure, each HTML tag can have attributes, such as <code class="language-plaintext highlighter-rouge">id</code> and <code class="language-plaintext highlighter-rouge">class</code>. To understand how the two differ, consider <code class="language-plaintext highlighter-rouge">id="yellow"</code> and <code class="language-plaintext highlighter-rouge">id="orange"</code>: these are unique identities (<code class="language-plaintext highlighter-rouge">id</code>s) of colors. Tags with these <code class="language-plaintext highlighter-rouge">id</code>s can be grouped into a class, e.g. <code class="language-plaintext highlighter-rouge">class="colors"</code>. HTML tags are not required to have these attributes, but the attributes are useful for adding custom styles and behavior when doing web development. This article does not dive into the details of the HTML document, but rather gives the reader a high-level understanding. There are many resources available on the web; a quick search will find them.</p> <h3 id="inspecting-the-source-of-the-website">Inspecting the Source of the Website</h3> <p>To help in understanding the structure of a website, browsers such as Google Chrome and Mozilla Firefox include tools for web developers. For the purpose of illustration, but without loss of generality, this article will only scrape a portion (why? 
read on and see the explanation below) of the <a href="https://www.phivolcs.dost.gov.ph/html/update_SOEPD/EQLatest-Monthly/2018/2018_September.html" target="_blank">September 2018 earthquake events</a>. The web developer tools can be accessed from <b>Tools &gt; Web Developer</b> in Firefox, and from <b>View &gt; Developer</b> in Google Chrome. The following video shows how to use the inspector tool of Mozilla Firefox.</p> <iframe width="100%" height="400px" src="https://www.youtube.com/embed/RJEnugditnA" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen=""></iframe> <h3 id="scraping-using-julia">Scraping using Julia</h3> <p>To perform web scraping, Julia offers three libraries for the job: <a href="https://github.com/Algocircle/Cascadia.jl" target="_blank">Cascadia.jl</a>, <a href="https://github.com/JuliaWeb/Gumbo.jl" target="_blank">Gumbo.jl</a> and <a href="https://github.com/JuliaWeb/HTTP.jl" target="_blank">HTTP.jl</a>. <a href="https://github.com/JuliaWeb/HTTP.jl" target="_blank">HTTP.jl</a> is used to download the frontend source code of the website, which is then parsed by <a href="https://github.com/JuliaWeb/Gumbo.jl" target="_blank">Gumbo.jl</a> into a hierarchically structured object; and <a href="https://github.com/Algocircle/Cascadia.jl" target="_blank">Cascadia.jl</a> provides a CSS selector API for easy navigation.</p> <p>To start with, the following code downloads the frontend source code of PHIVOLCS’ Seismic Events for September 2018. <script src="https://gist.github.com/alstat/4e5bbbb9587b6506c4341a8097804c69.js"></script> Extract the HTML source code and parse it as follows: <script src="https://gist.github.com/alstat/deb9ef2abe52af58fc03be63f0482ccf.js"></script> Now, to extract the header of the HTML table, use the Web Developer Tools for a preliminary inspection of the components of the website. 
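</p> <p>For readers who cannot load the embedded gists, the download-and-parse steps can be sketched roughly as follows. This is a minimal sketch and not necessarily the code in the gists: the variable names <code class="language-plaintext highlighter-rouge">url</code>, <code class="language-plaintext highlighter-rouge">res</code>, <code class="language-plaintext highlighter-rouge">body</code> and <code class="language-plaintext highlighter-rouge">doc</code> are assumptions chosen for illustration, and the request may need extra options depending on how the server is configured.</p> <pre><code class="language-julia">using HTTP   # HTTP client, used to download the page
using Gumbo  # HTML parser, used to build the document tree

# URL of the PHIVOLCS September 2018 page linked above
url = "https://www.phivolcs.dost.gov.ph/html/update_SOEPD/EQLatest-Monthly/2018/2018_September.html"

# download the frontend source code of the page
res = HTTP.get(url)

# convert the raw bytes of the response body into a String
body = String(res.body)

# parse the HTML text; the tree is then reachable from doc.root
# (the names res, body and doc are illustrative assumptions)
doc = parsehtml(body)
</code></pre> <p>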
As shown in the screenshot below, the header of the table is enclosed inside the <code class="language-plaintext highlighter-rouge">p</code> tag of the <code class="language-plaintext highlighter-rouge">td</code>. Further, the <code class="language-plaintext highlighter-rouge">p</code> tag is of class <code class="language-plaintext highlighter-rouge">auto-style33</code>, which can be accessed via CSS selector by simply prefixing it with <code class="language-plaintext highlighter-rouge">.</code>, i.e. <code class="language-plaintext highlighter-rouge">.auto-style33</code>. <img src="https://drive.google.com/uc?export=view&amp;id=1LVNFZRHdT-o-vX-0dLiBMiVv3872LXpE" /> <script src="https://gist.github.com/alstat/63c8855a48c9a33669c9bdd4d3ae2e9a.js"></script> <code class="language-plaintext highlighter-rouge">qres</code> contains the HTML tags that matched the CSS selector’s query. The result is further cleaned by removing the tabs, spaces and line breaks via <a href="https://en.wikipedia.org/wiki/Regular_expression" target="_blank">Regular Expressions</a>, and is done as follows: <script src="https://gist.github.com/alstat/94e5af21b303b995d93af24a8ae69841.js"></script> Having the header names, next is to extract the data from the HTML table. Upon inspection, the <code class="language-plaintext highlighter-rouge">td</code>s containing the data next to the header rows seem to have the following classes (see screenshot below): <code class="language-plaintext highlighter-rouge">auto-style21</code> for first column (Date-Time), <code class="language-plaintext highlighter-rouge">auto-style81</code> for second column (Latitude), <code class="language-plaintext highlighter-rouge">auto-style80</code> for third and fourth columns (Longitude and Depth), <code class="language-plaintext highlighter-rouge">auto-style74</code> for fifth column (Magnitude), and <code class="language-plaintext highlighter-rouge">auto-style79</code> for sixth column (Location). 
Unfortunately, these classes are not consistent across rows (<code class="language-plaintext highlighter-rouge">tr</code>s), so it is best not to rely on them with <a href="https://github.com/Algocircle/Cascadia.jl" target="_blank">Cascadia.jl</a>. Instead, use <a href="https://github.com/JuliaWeb/Gumbo.jl" target="_blank">Gumbo.jl</a> to navigate down the hierarchy of the <a href="" target="_blank">Document Object Model</a> of the HTML. <img src="https://drive.google.com/uc?export=view&amp;id=1XEkudOAB7o4Cix5CY4tqSJe2J0BpoJ_G" /> Starting with the <code class="language-plaintext highlighter-rouge">table</code> tag, which is of class <code class="language-plaintext highlighter-rouge">MsoNormalTable</code> (see screenshot below), the extraction proceeds down to <code class="language-plaintext highlighter-rouge">tbody</code>, then to <code class="language-plaintext highlighter-rouge">tr</code>, and finally to <code class="language-plaintext highlighter-rouge">td</code>. <img src="https://drive.google.com/uc?export=view&amp;id=1x3N58LE_5ENyzMbyyhrx8Zrlgp40EBNI" /> The following code shows how the parsing is done (read the comments): <script src="https://gist.github.com/alstat/b4fbfe5dc8330ef16ad1abc05d44056f.js"></script></p> <h3 id="complete-code-for-phivolcs-september-2018-portion-seismic-events">Complete Code for PHIVOLCS’ September 2018 (Portion) Seismic Events</h3> <p>The September 2018 Seismic Events are encoded in two separate HTML tables of the same class, named <code class="language-plaintext highlighter-rouge">MsoNormalTable</code>. For simplicity, this article will only scrape the first portion (3rd-indexed, see line 14 below: <code class="language-plaintext highlighter-rouge">tbody = html;</code>) of the table (581 rows). 
The second portion (4th-indexed, change line 14 below to: <code class="language-plaintext highlighter-rouge">tbody = html;</code>) is left for the reader to scrape as an exercise.</p> <p>The following code wraps the parsers into functions, namely <code class="language-plaintext highlighter-rouge">htmldoc</code> (downloads and parses the HTML source code of the site), <code class="language-plaintext highlighter-rouge">scraper</code> (scrapes the downloaded HTML document), and <code class="language-plaintext highlighter-rouge">firstcolumn</code> (logic for parsing the first column of the table, used inside the <code class="language-plaintext highlighter-rouge">scraper</code> function). <script src="https://gist.github.com/alstat/698125867d853b941ab4284de34d9362.js"></script> <script src="https://gist.github.com/alstat/5b7463964703db410b05214f124bf028.js"></script> With the data in hand, the analyst can now proceed to exploratory analyses. For example, the following gives the descriptive statistics of the variables: <script src="https://gist.github.com/alstat/8e7506524552f00bf8d8cdb690bda27b.js"></script> <code class="language-plaintext highlighter-rouge">describe</code> is clever enough not only to skip <code class="language-plaintext highlighter-rouge">mean</code> and <code class="language-plaintext highlighter-rouge">median</code> for non-continuous variables, but also to determine the <code class="language-plaintext highlighter-rouge">min</code>, <code class="language-plaintext highlighter-rouge">max</code> and <code class="language-plaintext highlighter-rouge">nunique</code> (number of unique values) for these variables (date and location).</p> <h3 id="end-note">End Note</h3> <p>I use Python primarily at work with <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup</a> as my go-to library for web scraping. 
Compared to <a href="https://github.com/Algocircle/Cascadia.jl" target="_blank">Cascadia.jl</a> and <a href="https://github.com/JuliaWeb/Gumbo.jl" target="_blank">Gumbo.jl</a>, <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">BeautifulSoup</a> offers comprehensive documentation and other resources that are useful for figuring out bugs and understanding how the module works. Having said that, I hope this article contributes to the documentation of these Julia libraries. Further, I am confident in saying that <a href="https://github.com/Algocircle/Cascadia.jl" target="_blank">Cascadia.jl</a> and <a href="https://github.com/JuliaWeb/Gumbo.jl" target="_blank">Gumbo.jl</a> are stable enough for the job.</p> <p>Lastly, as a precaution for beginners, make sure to read the privacy policy (if any) of any website you want to scrape.</p> <h3 id="software-versions">Software Versions</h3> <script src="https://gist.github.com/alstat/12f52ef5b2c76fdeb5f927fc3239c613.js"></script> Al-Ahmadgaid B. Asaad Data nowadays are almost everywhere, often stored in as simple as traditional log books, to as complex as multi-connected-databases. Efficient collection of these datasets is crucial for analytics since data processing takes almost 50% of the overall workflow. An example where manual data collection can be automated is in the case of datasets published in the website, where providers are usually government agencies. For example in the Philippines, there is a website dedicated to Open Stat initiated by the Philippine Statistics Authority (PSA). The site hoards public datasets for researchers to use and are well prepared in CSV format, so consumers can simply download the file. Unfortunately, for some agencies this feature is not yet available. That is, users need to either copy-paste the data from the website, or request it to the agency directly (this also takes time). 
A good example of this is the seismic events of the Philippine Institute of Volcanology and Seismology (PHIVOLCS). Julia, Python, R: Introduction to Bayesian Linear Regression 2018-10-14T04:00:00+00:00 2018-10-14T04:00:00+00:00 https://estadistika.github.io//data/analyses/wrangling/julia/programming/packages/2018/10/14/Introduction-to-Bayesian-Linear-Regression <p><a href="https://en.wikipedia.org/wiki/Thomas_Bayes" target="_blank">Reverend Thomas Bayes</a> (see Bayes, 1763) is known to be the first to formulate Bayes’ theorem, but the comprehensive mathematical formulation of this result is credited to the works of <a href="https://en.wikipedia.org/wiki/Pierre-Simon_Laplace" target="_blank">Laplace</a> (1986). Bayes’ theorem has the following form:</p> <div class="math"> \begin{equation} \label{eq:bayes-theorem} \mathbb{P}(\mathbf{w}|\mathbf{y}) = \frac{\mathbb{P}(\mathbf{w})\mathbb{P}(\mathbf{y}|\mathbf{w})}{\mathbb{P}(\mathbf{y})} \end{equation} </div> <p>where $\mathbf{w}$ is the weight vector and $\mathbf{y}$ is the data. This simple formula is the main foundation of Bayesian modeling. Any model estimated using Maximum Likelihood can be estimated using the above conditional probability. What makes it different is that Bayes’ theorem considers uncertainty not only in the observations but also in the weights, i.e. the objective parameters.</p> <p>As an illustration of Bayesian inference in basic modeling, this article discusses the Bayesian approach to linear regression. Let <span>$\mathscr{D}\triangleq\{(\mathbf{x}_1,y_1),\cdots,(\mathbf{x}_n,y_n)\}$</span> where $\mathbf{x}_i\in\mathbb{R}^{d}, y_i\in \mathbb{R}$ be the paired dataset. 
Suppose the response values, $y_1,\cdots,y_n$, are independent given the parameter $\mathbf{w}$, and are distributed as $y_i\overset{\text{iid}}{\sim}\mathcal{N}(\mathbf{w}^{\text{T}}\mathbf{x}_i,\alpha^{-1})$, where $\alpha$ (assumed to be known in this article) is referred to as the <i>precision</i> parameter, so that $\alpha^{-1}$ is the variance; this parameterization is useful for the later derivation. From the Bayesian perspective, the weights are assumed to be random and are governed by some <i>a priori</i> distribution. The choice of this distribution is subjective, but an arbitrary <i>a priori</i> can often result in an intractable integration, especially for interesting models. For simplicity, a conjugate prior is used for the latent weights. Specifically, assume that <span>${\mathbf{w}\overset{\text{iid}}{\sim}\mathcal{N}(\mathbf{0},\beta^{-1}\mathbf{I})}$</span> such that $\beta&gt;0$ is a hyperparameter assumed in this experiment to be known. The posterior distribution based on Bayes’ rule is given by</p> <div style="overflow-x: auto;"> \begin{equation}\label{eq:bayesrulepost} \mathbb{P}(\mathbf{w}|\mathbf{y})=\frac{\mathbb{P}(\mathbf{w})\mathbb{P}(\mathbf{y}|\mathbf{w})}{\mathbb{P}(\mathbf{y})}, \end{equation} </div> <p>where $\mathbb{P}(\mathbf{w})$ is the <i>a priori</i> distribution of the parameter, $\mathbb{P}(\mathbf{y}|\mathbf{w})$ is the likelihood, and $\mathbb{P}(\mathbf{y})$ is the normalizing factor. 
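</p> <p>To make the model concrete, the generative process described so far can be sketched in Julia as follows. This is only an illustrative sketch under stated assumptions: the sample size, the values of $\alpha$ and $\beta$, the variable names, and the choice of an intercept column plus one random input are all hypothetical and are not taken from the simulation code used later in this article.</p> <pre><code class="language-julia">using Random

Random.seed!(123)      # fix the seed so the draws are reproducible

n, d = 20, 2           # hypothetical sample size and number of weights
α, β = 25.0, 2.0       # hypothetical known precisions (noise and prior)

# inputs stacked row-wise: a column of ones (intercept) and one random input
A = hcat(ones(n), 2 .* rand(n) .- 1)

# draw the weights from the prior N(0, β⁻¹ I): standard deviation 1/√β
w = randn(d) ./ sqrt(β)

# draw the responses from N(wᵀxᵢ, α⁻¹): standard deviation 1/√α
y = A * w .+ randn(n) ./ sqrt(α)
</code></pre> <p>Dividing the standard normal draws by $\sqrt{\alpha}$ and $\sqrt{\beta}$ is what turns the precisions into the variances $\alpha^{-1}$ and $\beta^{-1}$.</p> <p>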
The likelihood is given by</p> <div style="overflow-x: auto;"> \begin{align} \mathbb{P}(\mathbf{y}|\mathbf{w})&amp;=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\alpha^{-1}}}\exp\left[-\frac{\alpha(y_i-\mathbf{w}^{\text{T}}\mathbf{x}_i)^2}{2}\right]\nonumber\\ &amp;=\left(\frac{\alpha}{2\pi}\right)^{n/2}\exp\left[-\sum_{i=1}^n\frac{\alpha(y_i-\mathbf{w}^{\text{T}}\mathbf{x}_i)^2}{2}\right].\label{eq:likelihood:blreg} \end{align} </div> <p>In matrix form, this can be written as</p> <div style="overflow-x: auto;"> \begin{equation} \mathbb{P}(\mathbf{y}|\mathbf{w})\propto\exp\left[-\frac{\alpha}{2}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})\right] \end{equation} </div> <p>where $\boldsymbol{\mathfrak{A}}\triangleq\left[(\mathbf{x}_i^{\text{T}})\right]$ is the matrix whose $i$th row is $\mathbf{x}_i^{\text{T}}$, i.e. $\boldsymbol{\mathfrak{A}}\in\mathbb{R}^{n\times d}$; this matrix is known as the <i>design matrix</i>. Since $\mathbf{w}$ has the prior distribution</p> <div style="overflow-x: auto;"> \begin{equation}\label{eq:wpriori} \mathbb{P}(\mathbf{w})=\frac{1}{\sqrt{(2\pi)^{d}|\beta^{-1}\mathbf{I}|}}\exp\left[-\frac{1}{2}\mathbf{w}^{\text{T}}\beta\mathbf{I}\mathbf{w}\right], \end{equation} </div> <p>the posterior has the following form:</p> <div style="overflow-x: auto;"> \begin{align} \mathbb{P}(\mathbf{w}|\mathbf{y})&amp;\propto\exp\left[-\frac{\alpha}{2}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})\right]\exp\left[-\frac{1}{2}\mathbf{w}^{\text{T}}\beta\mathbf{I}\mathbf{w}\right]\nonumber\\ &amp;=\exp\left\{-\frac{1}{2}\left[\alpha(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})^{\text{T}}(\mathbf{y}-\boldsymbol{\mathfrak{A}}\mathbf{w})+\mathbf{w}^{\text{T}}\beta\mathbf{I}\mathbf{w}\right]\right\}. 
\end{align} </div> <p>Expanding the terms in the exponent gives</p> <div style="overflow-x: auto;"> \begin{equation}\label{eq:expterms} \alpha\mathbf{y}^{\text{T}}\mathbf{y}-2\alpha\mathbf{w}^{\text{T}}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}+\mathbf{w}^{\text{T}}(\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I})\mathbf{w}. \end{equation} </div> <p>The next step is to complete the square of the above expression such that it resembles the inner terms of the exponential factor of the Gaussian distribution. That is, the quadratic form in the exponential term of a $\mathcal{N}(\mathbf{w}|\boldsymbol{\mu},\boldsymbol{\Sigma})$ is given by</p> <div style="overflow-x: auto;"> \begin{align} (\mathbf{w}-\boldsymbol{\mu})^{\text{T}}\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\boldsymbol{\mu})&amp;=(\mathbf{w}-\boldsymbol{\mu})^{\text{T}}(\boldsymbol{\Sigma}^{-1}\mathbf{w}-\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu})\nonumber\\ &amp;=\mathbf{w}^{\text{T}}\boldsymbol{\Sigma}^{-1}\mathbf{w}- 2\mathbf{w}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}+\boldsymbol{\mu}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}.\label{eq:expnorm} \end{align} </div> <p>The terms in Equation (\ref{eq:expterms}) that involve $\mathbf{w}$ are matched with those in (\ref{eq:expnorm}); the remaining terms, $\alpha\mathbf{y}^{\text{T}}\mathbf{y}$ and $\boldsymbol{\mu}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}$, do not depend on $\mathbf{w}$ and only affect the normalizing constant. Matching the quadratic and linear terms gives</p> <div style="overflow-x: auto;"> \begin{equation}\label{eq:sigmablrgauss} \boldsymbol{\Sigma}^{-1}=\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\boldsymbol{\mathfrak{A}}+\beta\mathbf{I} \end{equation} </div> <p>and</p> <div style="overflow-x: auto;"> \begin{align} \mathbf{w}^{\text{T}}\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}&amp;=\alpha\mathbf{w}^{\text{T}}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}\nonumber\\ \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}&amp;=\alpha\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}\nonumber\\ \boldsymbol{\mu}&amp;=\alpha\boldsymbol{\Sigma}\boldsymbol{\mathfrak{A}}^{\text{T}}\mathbf{y}.\label{eq:mublrgauss} \end{align} </div> <p>Thus the <i>a posteriori</i> is a 
Gaussian distribution with location parameter in Equation (\ref{eq:mublrgauss}) and scale parameter given by the inverse of Equation (\ref{eq:sigmablrgauss}). I’ll leave to the reader the proper mathematical derivation of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ without matching like what we did above.</p> <h3 id="simulation-experiment">Simulation Experiment</h3> <p>In this section, we are going to apply the theory above using simulated data. I will use Julia as the primary programming language for this article, but I also provided codes for R and Python. To start with, load the following libraries:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-1', 'tabcontent-1')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-1', 'tabcontent-1')">Python</button> </div> <div id="julia-1" class="tabcontent-1 first"> <script src="https://gist.github.com/alstat/00ac3ea439baddddab166ca40902f4b0.js"></script> </div> <div id="python-1" class="tabcontent-1" style="display: none;"> <script src="https://gist.github.com/alstat/e814d09d53a8c3cba1e27d7be4c46d02.js"></script> </div> <p>Next, define the following functions for data simulation and parameter estimation. 
The estimates of the parameters are governed by the <i>a posteriori</i>, which from the derivation above is a multivariate Gaussian distribution with mean given by Equation (\ref{eq:mublrgauss}) and variance-covariance matrix given by the inverse of Equation (\ref{eq:sigmablrgauss}).</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-2', 'tabcontent-2')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-2', 'tabcontent-2')">Python</button> <button class="tablinks" onclick="openCity(event, 'r-2', 'tabcontent-2')">R</button> </div> <div id="julia-2" class="tabcontent-2 first"> <script src="https://gist.github.com/alstat/df66b766a478aac49c45c2d792184534.js"></script> </div> <div id="python-2" class="tabcontent-2" style="display: none;"> <script src="https://gist.github.com/alstat/42c43fe8cbf482e192da1283c0e7756c.js"></script> </div> <div id="r-2" class="tabcontent-2" style="display: none;"> <script src="https://gist.github.com/alstat/a100a97eaf25659490a01121d1da8fa3.js"></script> </div> <p>Execute the above functions and return the necessary values as follows:</p> <div class="tab" style="margin-bottom: -16px;"> <button class="tablinks" onclick="openCity(event, 'julia-3', 'tabcontent-3')">Julia</button> <button class="tablinks" onclick="openCity(event, 'python-3', 'tabcontent-3')">Python</button> <button class="tablinks" onclick="openCity(event, 'r-3', 'tabcontent-3')">R</button> </div> <div id="julia-3" class="tabcontent-3 first"> <script src="https://gist.github.com/alstat/0a60ea652e1caca60544cea239ccae4b.js"></script> </div> <div id="python-3" class="tabcontent-3" style="display: none;"> <script src="https://gist.github.com/alstat/5dfa29ebb09275b961806f67e89e5530.js"></script> </div> <div id="r-3" class="tabcontent-3" style="display: none;"> <script src="https://gist.github.com/alstat/5defe8880d40bdbf35ae36688bbcf98a.js"></script> </div> <p>Finally, plot the fitted lines whose weights are samples from 
the <i>a posteriori</i>. The red line in the plot below is the Maximum <i>A Posteriori</i> (MAP) estimate of the parameter of interest. Note, however, that the code provided for the animated plot below is in Julia only. Python and R users can use <a href="https://matplotlib.org/index.html" target="_blank">matplotlib.pyplot</a> (which is also available as a backend for Julia’s Plots) and <a href="https://github.com/thomasp85/gganimate" target="_blank">gganimate</a>, respectively. <script src="https://gist.github.com/alstat/023ff855025d0da2fa50b7923b834fd8.js"></script> <img src="https://drive.google.com/uc?export=view&amp;id=1XhKHztWM4OpxL1t_KzPxeU1kd40czUvK" /></p> <h3 id="end-note">End Note</h3> <p>There are many libraries available for Bayesian modeling. For Julia we have <a href="https://github.com/JuliaStats/Klara.jl" target="_blank">Klara.jl</a>, <a href="https://github.com/brian-j-smith/Mamba.jl" target="_blank">Mamba.jl</a>, <a href="https://github.com/goedman/Stan.jl" target="_blank">Stan.jl</a>, <a href="https://github.com/TuringLang/Turing.jl" target="_blank">Turing.jl</a> and <a href="https://juliaobserver.com/categories/Bayesian" target="_blank">more related</a>; for Python, my favorite is <a href="https://docs.pymc.io/" target="_blank">PyMC3</a>; and for R, I prefer <a href="http://mc-stan.org/users/interfaces/rstan" target="_blank">RStan</a>.</p> <p>As always, coding from scratch is a good exercise, and it helps you appreciate the math. Further, I found Julia to be quite easy to use as a tool for statistical problems. In fact, Julia’s linear algebra API is very close to the mathematical formulae above.</p> <h3 id="references">References</h3> <ul> <li>Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. <em>Philosophical Transactions</em>, 53, 370-418. URL: http://www.jstor.org/stable/105741</li> <li>Laplace, P. S. (1986). Memoir on the probability of the causes of events. <em>Statist. Sci.</em>, 1(3), 364–378. 
URL: http://dx.doi.org/10.1214/ss/1177013621 doi: 10.1214/ss/1177013621</li> </ul> <h3 id="software-versions">Software Versions</h3> <script src="https://gist.github.com/alstat/53a54b8e96ec1f45883d1447efeab0ff.js"></script>Al-Ahmadgaid B. AsaadReverend Thomas Bayes (see Bayes, 1763) is known to be the first to formulate the Bayes’ theorem, but the comprehensive mathematical formulation of this result is credited to the works of Laplace (1986). The Bayes’ theorem has the following form: \begin{equation} \label{eq:bayes-theorem} \mathbb{P}(\mathbf{w}|\mathbf{y}) = \frac{\mathbb{P}(\mathbf{w})\mathbb{P}(\mathbf{y}|\mathbf{w})}{\mathbb{P}(\mathbf{y})} \end{equation} where $\mathbf{w}$ is the weight vector and $\mathbf{y}$ is the data. This simple formula is the main foundation of Bayesian modeling. Any model estimated using Maximum Likelihood can be estimated using the above conditional probability. What makes it different, is that the Bayes’ theorem considers uncertainty not only on the observations but also uncertainty on the weights or the objective parameters.Julia: Data Wrangling using JuliaDB.jl and JuliaDBMeta.jl2018-06-08T04:00:00+00:002018-06-08T04:00:00+00:00https://estadistika.github.io//data/analyses/wrangling/julia/programming/packages/2018/06/08/Julia-Introduction-to-Data-Wrangling<p>I’m a heavy user of Python’s <a href="https://pandas.pydata.org/">pandas</a> and R’s <a href="https://cran.r-project.org/web/packages/dplyr/index.html">dplyr</a> both at work and when I was taking my master’s degree. Hands down, both of these tools are very good at handling the data. So what about Julia? It’s a fairly new programming language that’s been around for almost 6 years already with a very active community. If you have no idea, I encourage you to visit <a href="http://julialang.org/">Julialang.org</a>. 
In summary, it’s a programming language that walks like <a href="https://www.python.org/">Python</a>, but runs like <a href="https://en.wikipedia.org/wiki/C_%28programming_language%29">C</a>.</p> <p>For data wrangling, there are two packages that we can use: <a href="https://github.com/JuliaData/DataFrames.jl">DataFrames.jl</a> and <a href="http://juliadb.org/latest/">JuliaDB.jl</a>. Let me reserve a separate post for <a href="https://github.com/JuliaData/DataFrames.jl">DataFrames.jl</a>, and instead focus on the <a href="http://juliadb.org/latest/">JuliaDB.jl</a> and <a href="https://piever.github.io/JuliaDBMeta.jl/latest/">JuliaDBMeta.jl</a> (an alternative for querying the data, like that of R’s <a href="https://cran.r-project.org/web/packages/dplyr/index.html">dplyr</a>) packages.</p> <h3 class="section">Package Installation</h3> <p>The libraries I mentioned above are not built into Julia, and hence we need to install them: <script src="https://gist.github.com/alstat/78138748ba87580653416a6181693caa.js"></script></p> <h3 class="section">Data: nycflights13</h3> <p>In order to compare Julia’s data wrangling capability with that of R’s <a href="https://cran.r-project.org/web/packages/dplyr/index.html">dplyr</a>, we’ll reproduce the example in this <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html">site</a>. It uses all 336,776 flights that departed from New York City in 2013. I have a copy of it on GitHub, and the following will download and load the data: <script src="https://gist.github.com/alstat/c0c2bc4e5355ac55ad83fc07fa8561c8.js"></script> The rows of the data are not displayed when we execute <code>nycflights</code> in line 7. That’s because the data has a lot of columns, and by default <a href="http://juliadb.org/latest/">JuliaDB.jl</a> will not print all of them unless you have a big display (unfortunately, I’m using my 13 inch laptop screen). 
Hence, for the rest of the article, we’ll be using selected columns only: <script src="https://gist.github.com/alstat/2cde6bb6e7ede38ddcdba7d47fb1fed7.js"></script></p> <h3 class="section">Filter Rows</h3> <p>Filtering is a row-wise operation and is done using the <code>Base.filter</code> function, which is extended with a method for <code>JuliaDB.IndexedTables</code>. For example, filtering the data for month equal to 1 (January) and day equal to 1 (first day of the month) is done as follows: <script src="https://gist.github.com/alstat/fe17e7133a3de644bfc853b624bb6af3.js"></script> To see the output for line 2 using <code>Base.filter</code>, simply remove the semicolon and you’ll have the same output as that of line 5 (using <code>JuliaDBMeta.@filter</code>).</p> <h3 class="section">Arrange Rows</h3> <p>To arrange the rows of the table, use the <code>Base.sort</code> function: <script src="https://gist.github.com/alstat/1211792bac2febc1d7c4ba058107e2d9.js"></script></p> <h3 class="section">Select Columns</h3> <p>We’ve seen above how to select the columns, but we can also use ranges of columns for selection. <script src="https://gist.github.com/alstat/785e35fe4535c84cc8f60dafa9b39e69.js"></script></p> <h3 class="section">Rename Column</h3> <p>To rename a column, use the <code>JuliaDB.renamecol</code> function: <script src="https://gist.github.com/alstat/048463d348450873dba81f3a96a473d1.js"></script></p> <h3 class="section">Add New Column</h3> <p>To add a new column, use <code>insertcol</code>, <code>insertcolafter</code> and <code>insertcolbefore</code> of <a href="http://juliadb.org/latest/">JuliaDB.jl</a>. 
<script src="https://gist.github.com/alstat/a5a2df1fbdb3feaad408a2ca92244e30.js"></script> or use the <code>@transform</code> macro of the <a href="https://piever.github.io/JuliaDBMeta.jl/latest/">JuliaDBMeta.jl</a>: <script src="https://gist.github.com/alstat/ee7f0ab8405473aa88c5f52193ede352.js"></script></p> <h3 class="section">Summarize Data</h3> <p>The data can be summarized using the <code>JuliaDB.summarize</code> function <script src="https://gist.github.com/alstat/3891fec973a923dcc0f6cc451ead4859.js"></script> <code>@with</code> macro is an alternative from <a href="https://piever.github.io/JuliaDBMeta.jl/latest/">JuliaDBMeta.jl</a>.</p> <h3 class="section">Grouped Operations</h3> <p>For grouped operations, we can use the <code>JuliaDB.groupby</code> function or the <code>JuliaDBMeta.@groupby</code>: <script src="https://gist.github.com/alstat/523976efd34a747f8fe6211b16ad6bf0.js"></script> We’ll use the summarized data above and plot the flight delay in relation to the distance travelled. We’ll use the <a href="http://gadflyjl.org/stable/">Gadfly.jl</a> package for plotting and <a href="https://github.com/JuliaData/DataFrames.jl">DataFrames.jl</a> for converting <a href="http://juliadb.org/latest/">JuliaDB.jl</a>’s IndexedTable objects to DataFrames.DataFrame object, that’s because Gadfly.plot has no direct method for JuliaDB.IndexedTables. 
<script src="https://gist.github.com/alstat/c8485c39992d82c9129ccd2e5e2745c2.js"></script> To plot, run the following: <script src="https://gist.github.com/alstat/2d6322571f78ec940af76c6011ed9f1f.js"></script> <img src="https://raw.githubusercontent.com/estadistika/assets/master/imgs/2018-6-8-p2.svg?sanitize=true" /> To find the number of planes and the number of flights that go to each possible destination, run: <script src="https://gist.github.com/alstat/6a78c1dc19914326c39a4c47eecb7b8e.js"></script></p> <h3 class="section">Piping Multiple Operations</h3> <p>For multiple operations, it is convenient to use piping and that is the reason why we have tools like <a href="https://piever.github.io/JuliaDBMeta.jl/latest/">JuliaDBMeta.jl</a>. The following example using <a href="https://cran.rstudio.com/web/packages/dplyr/vignettes/dplyr.html">R’s dplyr</a>: <script src="https://gist.github.com/alstat/1ef5992f368ebdb4be5e8b95678e6021.js"></script> is equivalent to the following Julia code using <a href="https://piever.github.io/JuliaDBMeta.jl/latest/">JuliaDBMeta.jl</a>: <script src="https://gist.github.com/alstat/a91f46846a8bc6ef0ac2992293734f90.js"></script></p> <h3 class="section">Conclusion</h3> <p>I’ve demonstrated how easy it is to use Julia for doing data wrangling, and I love it. In fact, there is a library that can query any table-like data structure in Julia, and is called <a href="https://github.com/davidanthoff/Query.jl">Query.jl</a> (will definitely write a separate article for this in the future).</p> <p>For more on <a href="http://juliadb.org/latest/">JuliaDB.jl</a>, watch the <a href="https://www.youtube.com/watch?v=d5SzUh2_ono">Youtube tutorial</a>.</p>Al-Ahmadgaid B. AsaadI’m a heavy user of Python’s pandas and R’s dplyr both at work and when I was taking my master’s degree. Hands down, both of these tools are very good at handling the data. So what about Julia? 
It’s a fairly new programming language that’s been around for almost 6 years already with a very active community. If you have no idea, I encourage you to visit Julialang.org. In summary, it’s a programming language that walks like a Python, but runs like a C.