Julia: Data Wrangling using JuliaDB.jl and JuliaDBMeta.jl
I’m a heavy user of Python’s pandas and R’s dplyr, both at work and back when I was taking my master’s degree. Hands down, both tools are very good at handling data. So what about Julia? It’s a fairly new programming language, around for almost 6 years now, with a very active community. If you have no idea what it is, I encourage you to visit Julialang.org. In summary, it’s a programming language that walks like Python but runs like C.
For data wrangling, there are two packages we can use: DataFrames.jl and JuliaDB.jl. Let me reserve a separate post for DataFrames.jl and focus here on JuliaDB.jl and JuliaDBMeta.jl (an alternative way of querying the data, similar to R’s dplyr).
Package Installation
The libraries I mentioned above are not built into Julia, so we need to install them:
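A minimal sketch of the installation step, assuming a Julia version that ships the `Pkg` standard library (on older releases `Pkg` is available without the `using` line):

```julia
using Pkg

# Install both packages from the General registry.
Pkg.add(["JuliaDB", "JuliaDBMeta"])
```

After this, `using JuliaDB, JuliaDBMeta` should load them in any session.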
Data: nycflights13
To compare Julia’s data wrangling capability with that of R’s dplyr, we’ll reproduce the example in this site. It uses all 336,776 flights that departed from New York City in 2013. I have a copy of it on GitHub, and the following will download and load the data:
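The download link is not reproduced here, so this is only a runnable stand-in for the loading step. It writes a three-row CSV to a temporary folder and reads it with `loadtable`. The file name, the variable `flights_sample` and the rows themselves are made up for illustration; only the column names follow nycflights13.

```julia
using JuliaDB

# Hypothetical stand-in file so the sketch runs without the real download.
csv_path = joinpath(mktempdir(), "flights_sample.csv")
write(csv_path, """
year,month,day,dep_delay,arr_delay,carrier
2013,1,1,2,11,UA
2013,1,1,4,20,UA
2013,1,2,-3,5,AA
""")

# loadtable parses the CSV and infers the column types.
flights_sample = loadtable(csv_path)
```

With the real file, the same `loadtable` call on its local path gives you the full table.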
The rows of the data are not displayed when we evaluate nycflights because the table has a lot of columns, and by default JuliaDB.jl will not print them all unless you have a big display (unfortunately, I’m on a 13 inch laptop screen). Hence, for the rest of the article, we’ll be using selected columns only:
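Here is a small sketch of picking a subset of columns with `select`. The table is a hypothetical two-row stand-in for nycflights, built by hand so the snippet runs on its own:

```julia
using JuliaDB

# Hypothetical stand-in table with a few nycflights13 column names.
nycflights = table([2013, 2013], [1, 1], [1, 2], [2, -3], ["UA", "AA"],
                   names = [:year, :month, :day, :dep_delay, :carrier])

# Passing a tuple of column names returns a narrower table.
flights = select(nycflights, (:year, :month, :day, :dep_delay))
```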
Filter Rows
Filtering is a row-wise operation and is done using the Base.filter function, which has an extended method for JuliaDB.jl’s IndexedTable.
Therefore, filtering the data for month equal to 1 (January) and day equal to 1 (first day of the month) is done as follows:
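A minimal sketch of the `Base.filter` form on a hypothetical four-row table (the data is invented; the `JuliaDBMeta.@filter` form is not shown here). Each row reaches the predicate as a named tuple, so columns are read with dot syntax:

```julia
using JuliaDB

# Hypothetical stand-in for the flights table.
flights = table([1, 1, 1, 2], [1, 1, 2, 1], [2, 4, -3, 10],
                names = [:month, :day, :dep_delay])

# Keep only the flights on January 1st.
jan1 = filter(r -> r.month == 1 && r.day == 1, flights)
```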
Ending the Base.filter call with a semicolon suppresses its output in the REPL; remove the semicolon to print the result, which is the same as the output of JuliaDBMeta.@filter.
Arrange Rows
To arrange the rows by the values of one or more columns, use the Base.sort function:
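A short sketch on invented data, assuming `sort` accepts either a single column name or a tuple of names as the sort key, plus the usual `rev` keyword:

```julia
using JuliaDB

# Hypothetical stand-in for the flights table.
flights = table([2, 1, 1], [1, 2, 1], [5, 20, -3],
                names = [:month, :day, :arr_delay])

# Order by month, then day.
by_date = sort(flights, (:month, :day))

# Order by arrival delay, largest first.
latest_first = sort(flights, :arr_delay; rev = true)
```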
Select Columns
We’ve seen above how to select the columns, but we can also use ranges of columns for selection.
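As a sketch of range selection, assuming the `Between` selector is exported by your JuliaDB version, the following picks every column from `year` through `day` on a made-up table:

```julia
using JuliaDB

# Hypothetical stand-in for the flights table.
flights = table([2013, 2013], [1, 1], [1, 2], [2, -3],
                names = [:year, :month, :day, :dep_delay])

# Between(first, last) selects the inclusive range of columns.
dates = select(flights, Between(:year, :day))
```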
Rename Column
To rename a column, use the JuliaDB.renamecol function:
Add New Column
To add a new column, use insertcol, insertcolafter or insertcolbefore from JuliaDB.jl, or use the @transform macro of JuliaDBMeta.jl:
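The sketch below swaps in `pushcol`, which appends the new column at the end, rather than the positional functions or `@transform` named above. I am assuming `pushcol(table, name, values)` is available in your JuliaDB version. The `gain` column mirrors the dplyr tutorial and the data is invented:

```julia
using JuliaDB

# Hypothetical stand-in for the flights table.
flights = table([2, 4], [11, 20], names = [:dep_delay, :arr_delay])

# Compute the new column from existing ones, element by element.
gain = select(flights, :arr_delay) .- select(flights, :dep_delay)

# Append it as the last column; the original table is left untouched.
with_gain = pushcol(flights, :gain, gain)
```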
Summarize Data
The data can be summarized using the JuliaDB.summarize function. The @with macro from JuliaDBMeta.jl is an alternative.
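A minimal sketch of `summarize` over the whole table, on invented data. I am assuming a Julia version where `mean` and `std` live in the `Statistics` standard library, and that an ungrouped `summarize` with a tuple of functions returns a named tuple keyed by function name. The `@with` alternative is not shown.

```julia
using JuliaDB, Statistics

# Hypothetical stand-in for the flights table.
flights = table([1, 1, 1, 2], [2, 4, -3, 10], names = [:month, :dep_delay])

# Apply each function to the selected column.
stats = summarize((mean, std), flights; select = :dep_delay)
```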
Grouped Operations
For grouped operations, we can use the JuliaDB.groupby function or the JuliaDBMeta.@groupby macro:
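Here is a sketch of the `JuliaDB.groupby` form on a made-up table. Passing a named tuple of functions gives one output column per name; the names `delay` and `n` are my own choice:

```julia
using JuliaDB, Statistics

# Hypothetical stand-in for the flights table.
flights = table(["UA", "UA", "AA"], [11, 21, 5],
                names = [:carrier, :arr_delay])

# One row per carrier: mean arrival delay and number of flights.
by_carrier = groupby((delay = mean, n = length), flights, :carrier;
                     select = :arr_delay)
```

The result is keyed by `carrier`, so its rows come back in sorted key order.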
We’ll use the summarized data above to plot the flight delay in relation to the distance travelled. We’ll use the Gadfly.jl package for plotting and DataFrames.jl to convert JuliaDB.jl’s IndexedTable objects into DataFrames.DataFrame objects, because Gadfly.plot has no direct method for JuliaDB.jl’s IndexedTable.
To plot, run the following:
To find the number of planes and the number of flights that go to each possible destination, run:
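One way to sketch this count with `groupby`, assuming planes are identified by the `tailnum` column as in nycflights13. The rows and the output names `planes` and `n_flights` are invented for illustration:

```julia
using JuliaDB

# Hypothetical stand-in for the flights table.
flights = table(["IAH", "IAH", "MIA", "IAH"], ["N1", "N1", "N2", "N3"],
                names = [:dest, :tailnum])

# Per destination: distinct tail numbers and total number of flights.
per_dest = groupby((planes = x -> length(unique(x)), n_flights = length),
                   flights, :dest; select = :tailnum)
```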
Piping Multiple Operations
For multiple operations, piping is convenient, and that is the reason why we have tools like JuliaDBMeta.jl: a chain of dplyr verbs in R can be written as an equivalent pipeline in Julia using JuliaDBMeta.jl.
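To show the shape of such a pipeline without depending on macro syntax, the sketch below deliberately uses Base’s `|>` operator with core JuliaDB functions. It is a stand-in for a JuliaDBMeta.jl pipeline, not a reproduction of one. The data, the 10-minute threshold and the names `arr` and `n` are invented:

```julia
using JuliaDB, Statistics

# Hypothetical stand-in for the flights table.
flights = table(["UA", "UA", "AA", "AA", "DL"], [11, 20, 5, -2, 40],
                names = [:carrier, :arr_delay])

# Group, summarize, then keep only carriers with a mean delay above 10.
late = flights |>
    (t -> groupby((arr = mean, n = length), t, :carrier; select = :arr_delay)) |>
    (t -> filter(r -> r.arr > 10, t))
```

Each stage is an anonymous function wrapped in parentheses, so the table flows left to right much like `%>%` in dplyr.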
Conclusion
I’ve demonstrated how easy it is to use Julia for data wrangling, and I love it. In fact, there is a library called Query.jl that can query any table-like data structure in Julia (I will definitely write a separate article on it in the future).
For more on JuliaDB.jl, watch the YouTube tutorial.