Estadistika

I've been a heavy user of Python's pandas and R's dplyr, both at work and during my master's degree. Hands down, both tools are very good at handling data. So what about Julia? It's a fairly new programming language, around for almost six years now, with a very active community. If you've never come across it, I encourage you to visit Julialang.org. In short, it's a programming language that walks like Python but runs like C.

For data wrangling, there are two packages we can use: DataFrames.jl and JuliaDB.jl. I'll reserve a separate post for DataFrames.jl and focus here on JuliaDB.jl and JuliaDBMeta.jl (an alternative way of querying the data, similar to R's dplyr).

Package Installation

The libraries mentioned above do not ship with Julia, so we need to install them:
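A minimal sketch of the installation step using Julia's built-in package manager. It assumes a Julia version that both packages still support and that they are available from the default registry under these names:

```julia
using Pkg

# Add both packages in one call; this downloads and precompiles them once.
Pkg.add(["JuliaDB", "JuliaDBMeta"])
```

After this, `using JuliaDB, JuliaDBMeta` should make the functions and macros discussed below available in the session.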

Data: nycflights13

To compare Julia's data wrangling capability with that of R's dplyr, we'll reproduce the example in this site. It uses all 336,776 flights that departed from New York City in 2013. I have a copy of it on GitHub, and the following will download and load the data:

The rows of the data are not displayed when we evaluate nycflights because the table has a lot of columns, and by default JuliaDB.jl will not print all of them unless you have a big display (unfortunately, I'm using my 13-inch laptop screen). Hence, for the rest of the article, we'll be using selected columns only:
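A minimal sketch of the load and column-selection steps. Instead of downloading the real file, it writes a tiny stand-in CSV to a temporary directory; the file name, the three rows, and the particular columns kept are all made up for illustration (only the column names follow nycflights13):

```julia
using JuliaDB

# Hypothetical stand-in for the downloaded file: three invented rows.
csv = joinpath(mktempdir(), "flights_sample.csv")
write(csv, """
year,month,day,dep_delay,arr_delay,carrier,tailnum,dest,distance
2013,1,1,2,11,AA,N1,IAH,1400
2013,1,1,4,20,AA,N2,MIA,1089
2013,1,2,-3,5,UA,N3,IAH,1400
""")

# Parse the CSV into a table.
flights = loadtable(csv)

# Keep only a handful of columns by passing a tuple of column names.
small = select(flights, (:year, :month, :day, :dep_delay, :dest))
```

With the real data, the same two calls would apply to the downloaded CSV path, with whichever columns you want to keep.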

Filter Rows

Filtering is a row-wise operation, done with the Base.filter function, which JuliaDB.jl extends with a method for its IndexedTable type. To keep only the rows with month equal to 1 (January) and day equal to 1 (the first day of the month), proceed as follows:

The Base.filter call ends with a semicolon, which suppresses its output. Remove the semicolon and you'll get the same output as the JuliaDBMeta.@filter call.
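A minimal sketch of both filtering styles on a small made-up table (column names follow nycflights13, values are invented). It assumes JuliaDB.jl and JuliaDBMeta.jl are installed, and that the predicate receives each row as a named tuple:

```julia
using JuliaDB, JuliaDBMeta

# Toy table with invented values.
flights = table((
    month     = [1, 1, 1, 2],
    day       = [1, 1, 2, 1],
    dep_delay = [2, 4, -3, 10],
))

# Base.filter: the anonymous function is called once per row.
jan1 = filter(row -> row.month == 1 && row.day == 1, flights);

# JuliaDBMeta's @filter: symbols such as :month refer to columns of the row.
jan1_meta = @filter flights :month == 1 && :day == 1
```

In the REPL, the trailing semicolon on the first call hides its output, while the second call prints the filtered table.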

Arrange Rows

To arrange the rows by the values of one or more columns, use the Base.sort function:
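A minimal sketch of sorting on a made-up table. It assumes that `sort` accepts a column selection (a single name or a tuple of names) as its second argument and forwards keyword arguments such as `rev` to the underlying sort:

```julia
using JuliaDB

# Toy table with invented values, deliberately out of order.
flights = table((
    month     = [2, 1, 1, 1],
    day       = [1, 2, 1, 1],
    dep_delay = [10, -3, 4, 2],
))

# Sort by month, then day.
by_date = sort(flights, (:month, :day))

# Sort by departure delay, largest first.
most_delayed = sort(flights, :dep_delay, rev = true)
```

The first call plays the role of dplyr's `arrange(flights, month, day)`, and the second that of `arrange(flights, desc(dep_delay))`.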

Select Columns

We’ve seen above how to select the columns, but we can also use ranges of columns for selection.
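A minimal sketch of range-based selection on a made-up table, assuming the `Between` selector is exported by JuliaDB.jl and includes both endpoints:

```julia
using JuliaDB

# Toy table with invented values.
flights = table((
    year      = [2013, 2013],
    month     = [1, 1],
    day       = [1, 2],
    dep_delay = [2, -3],
    dest      = ["IAH", "MIA"],
))

# Select every column from :year through :day, endpoints included.
dates = select(flights, Between(:year, :day))
```

This corresponds to dplyr's `select(flights, year:day)`.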

Rename Column

To rename a column, use the JuliaDB.renamecol function:
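A minimal sketch on a made-up table, assuming a JuliaDB.jl release that still provides `renamecol(table, old, new)`. Newer releases may prefer a differently named function, so check the documentation for the version you have installed:

```julia
using JuliaDB

# Toy table with invented values.
flights = table((
    tailnum = ["N1", "N2"],
    dest    = ["IAH", "MIA"],
))

# Returns a new table; the original keeps its column names.
renamed = renamecol(flights, :tailnum, :tail_num)
```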

Add New Column

To add a new column, use the insertcol, insertcolafter and insertcolbefore functions of JuliaDB.jl, or the @transform macro of JuliaDBMeta.jl:
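A minimal sketch of both approaches on a made-up table. It assumes the signature `insertcolafter(table, after, name, column)` and JuliaDBMeta's curly-brace syntax for naming new columns; the `gain` column is just an illustrative choice:

```julia
using JuliaDB, JuliaDBMeta

# Toy table with invented values.
flights = table((
    dep_delay = [2, 4, -3],
    arr_delay = [11, 20, 5],
    dest      = ["IAH", "MIA", "IAH"],
))

# Compute the new column as a plain vector, then place it after :arr_delay.
gain = select(flights, :arr_delay) .- select(flights, :dep_delay)
with_gain = insertcolafter(flights, :arr_delay, :gain, gain)

# Same column with @transform, evaluated row by row and appended to the table.
with_gain_meta = @transform flights {gain = :arr_delay - :dep_delay}
```

Both play the role of dplyr's `mutate(gain = arr_delay - dep_delay)`. The function form gives control over where the column lands, the macro form is shorter.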

Summarize Data

The data can be summarized using the JuliaDB.summarize function. The @with macro from JuliaDBMeta.jl is an alternative.
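A minimal sketch of both options on a made-up table without missing values. It assumes that `summarize` accepts a `select` keyword naming the column to reduce and that it can be called on a table without a grouping key; the exact shape of its return value may differ between versions, so none is shown here:

```julia
using JuliaDB, JuliaDBMeta, Statistics

# Toy table with invented values.
flights = table((
    dep_delay = [2, 4, -3, 10],
    dest      = ["IAH", "MIA", "IAH", "ORD"],
))

# Apply `mean` to the :dep_delay column.
avg = summarize(mean, flights, select = :dep_delay)

# @with works on whole columns: :dep_delay stands for the full vector.
avg_meta = @with flights mean(:dep_delay)
```

With real data that contains missing delays, wrapping the column in `skipmissing` before taking the mean is the usual remedy.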

Grouped Operations

For grouped operations, we can use the JuliaDB.groupby function or the JuliaDBMeta.@groupby macro:

We'll use the summarized data above and plot the flight delay in relation to the distance travelled. We'll use the Gadfly.jl package for plotting and DataFrames.jl to convert JuliaDB.jl's IndexedTable objects into DataFrames.DataFrame objects, because Gadfly.plot has no direct method for IndexedTable. To plot, run the following:

To find the number of planes and the number of flights that go to each possible destination, run:
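A minimal sketch of the planes-and-flights count per destination, using only `JuliaDB.groupby` on a made-up table. It assumes that a named tuple of functions yields one output column per name, and that with `select = :tailnum` each function receives that group's tail numbers as a vector. The output column names are my own choice:

```julia
using JuliaDB

# Toy table with invented tail numbers and destinations.
flights = table((
    tailnum = ["N1", "N2", "N4", "N3"],
    dest    = ["IAH", "MIA", "IAH", "ORD"],
))

# One row per destination: distinct planes and total flights.
per_dest = groupby(
    (planes = x -> length(unique(x)), n_flights = length),
    flights, :dest, select = :tailnum,
)
```

This mirrors dplyr's `group_by(dest)` followed by `summarise(planes = n_distinct(tailnum), flights = n())`.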

Piping Multiple Operations

For multiple operations, it is convenient to use piping, and that is the reason why we have tools like JuliaDBMeta.jl. The following example using R's dplyr:

is equivalent to the following Julia code using JuliaDBMeta.jl:
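A minimal sketch of a piped group, summarise, and filter chain using JuliaDBMeta's `@apply`, on a made-up table. It assumes that `@apply` passes the result of each line as the first argument of the next macro, and that `@groupby` takes the grouping columns followed by a curly-brace expression. The thresholds and output names are illustrative:

```julia
using JuliaDB, JuliaDBMeta, Statistics

# Toy table with invented values.
flights = table((
    month     = [1, 1, 1, 2],
    day       = [1, 1, 2, 1],
    arr_delay = [40, 50, 5, 10],
    dep_delay = [35, 45, 0, 60],
))

delayed = @apply flights begin
    # Average arrival and departure delay per day.
    @groupby (:month, :day) {arr = mean(:arr_delay), dep = mean(:dep_delay)}
    # Keep only the days where either average exceeds 30 minutes.
    @filter :arr > 30 || :dep > 30
end
```

Each line inside the `begin ... end` block reads like one step of a dplyr `%>%` chain, with the table threaded through implicitly.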

Conclusion

I've demonstrated how easy it is to do data wrangling in Julia, and I love it. In fact, there is also a library that can query any table-like data structure in Julia, called Query.jl (I will definitely write a separate article on it in the future).

For more on JuliaDB.jl, watch the YouTube tutorial.

Julia: Data Wrangling using JuliaDB.jl and JuliaDBMeta.jl