Julia: Data Wrangling using JuliaDB.jl and JuliaDBMeta.jl
I’ve been a heavy user of Python’s pandas and R’s dplyr, both at work and while taking my master’s degree. Hands down, both of these tools are very good at handling data. So what about Julia? It’s a fairly new programming language that has been around for almost 6 years, with a very active community. If you’re not familiar with it, I encourage you to visit Julialang.org. In summary, it’s a programming language that walks like Python but runs like C.
For data wrangling, there are two packages we can use: DataFrames.jl and JuliaDB.jl. Let me reserve a separate post for DataFrames.jl and focus here on JuliaDB.jl and JuliaDBMeta.jl (an alternative for querying the data, similar to R’s dplyr).
The libraries I mentioned above are not built into Julia, so we need to install them:
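A minimal sketch of the installation step, using Julia’s built-in package manager Pkg. I’m assuming the packages are installed under their registered names, JuliaDB and JuliaDBMeta, into the active environment:

```julia
using Pkg

# Install both packages into the active environment (needs network access).
Pkg.add("JuliaDB")
Pkg.add("JuliaDBMeta")
```

This only needs to be run once per environment; afterwards `using JuliaDB` is enough.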
In order to compare Julia’s data wrangling capability with that of R’s dplyr, we’ll reproduce the example on this site. It uses all 336,776 flights that departed from New York City in 2013. I have a copy of it on GitHub, and the following will download and load the data:
The rows of the data are not displayed when we execute nycflights in line 7. That’s because we have a lot of columns, and by default JuliaDB.jl will not print all of them unless you have a big display (unfortunately, I’m using my 13 inch laptop screen). Hence, for the rest of the article, we’ll be using selected columns only:
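As an illustration of column selection, here is a small sketch on a toy table. The table name `flights_toy` and its values are made up for the example (they are not the nycflights data), and I’m assuming `table`, `select` and `colnames` as exported by JuliaDB.jl:

```julia
using JuliaDB

# Toy stand-in for the flights data (hypothetical values).
flights_toy = table((year = [2013, 2013, 2013],
                     month = [1, 1, 2],
                     day = [1, 2, 1],
                     dep_delay = [2.0, 4.0, -1.0],
                     carrier = ["UA", "AA", "B6"]))

# Keep only the named columns, in the order given.
selected = select(flights_toy, (:year, :month, :day, :dep_delay))
```

Passing a tuple of column names returns a new table, while passing a single name returns that column as a vector.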
Filtering is a row-wise operation and is done using the Base.filter function, which JuliaDB.jl extends with a method for its tables.
Therefore, filtering the data for month equal to 1 (January) and day equal to 1 (first day of the month) is done as follows:
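A minimal sketch of such a filter on a toy table (hypothetical values, not the real flights data). I’m assuming the row-function form of `filter`, where each row is passed to the predicate as a named tuple:

```julia
using JuliaDB

# Toy stand-in for the flights data (hypothetical values).
flights_toy = table((month = [1, 1, 2, 1],
                     day = [1, 2, 1, 1],
                     carrier = ["UA", "AA", "B6", "DL"]))

# Keep the rows where both conditions hold.
# The trailing semicolon only suppresses printing at the REPL.
jan1 = filter(r -> r.month == 1 && r.day == 1, flights_toy);
```

The result is a new table; the original table is left unchanged.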
To see the output of the Base.filter call, simply remove the trailing semicolon; the output is the same as that of the alternative call.
Arranging reorders the rows of the table according to the values in one or more columns.
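One way to arrange rows, sketched on a toy table with made-up values, is the `sort` method that JuliaDB.jl provides for its tables. This is my assumption about a suitable call, not necessarily the one used in the original code:

```julia
using JuliaDB

# Toy stand-in for the flights data (hypothetical values).
flights_toy = table((carrier = ["UA", "AA", "B6"],
                     dep_delay = [10.0, -3.0, 2.0]))

# Reorder the rows by ascending departure delay.
by_delay = sort(flights_toy, :dep_delay)
```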
We’ve seen above how to select the columns, but we can also use ranges of columns for selection.
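A short sketch of range-based selection on a toy table (hypothetical values). I’m assuming the `Between` and `Not` selectors that JuliaDB.jl re-exports from IndexedTables.jl:

```julia
using JuliaDB

# Toy stand-in for the flights data (hypothetical values).
flights_toy = table((year = [2013, 2013],
                     month = [1, 2],
                     day = [1, 1],
                     dep_delay = [2.0, -1.0],
                     arr_delay = [11.0, -5.0]))

# All columns from :year through :day, inclusive.
date_cols = select(flights_toy, Between(:year, :day))

# Every column except :dep_delay.
without_dep = select(flights_toy, Not(:dep_delay))
```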
Columns of the table can also be renamed.
Add New Column
The data can be summarized with JuliaDB.jl itself, and the @with macro from JuliaDBMeta.jl is an alternative.
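As a plain-Julia sketch of a summary (not necessarily the call used in the original code), a column can be pulled out as a vector with `select` and passed to any function from the Statistics standard library. The toy table and its values are hypothetical:

```julia
using JuliaDB
using Statistics

# Toy stand-in for the flights data (hypothetical values).
flights_toy = table((month = [1, 1, 2],
                     dep_delay = [2.0, 4.0, -1.0]))

# Selecting a single column name returns that column as a vector.
delays = select(flights_toy, :dep_delay)

# Average departure delay over the whole table.
avg_delay = mean(delays)
```

With real data that contains missing values, wrapping the column in `skipmissing` before taking the mean would be the usual approach.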
For grouped operations, we can use the JuliaDB.groupby function.
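A minimal sketch of a grouped operation with `groupby`, on a toy table with made-up values. I’m assuming the `groupby(f, table, by; select = ...)` form, which applies `f` to the selected column within each group:

```julia
using JuliaDB
using Statistics

# Toy stand-in for the flights data (hypothetical values).
flights_toy = table((month = [1, 1, 2],
                     dep_delay = [2.0, 4.0, -1.0]))

# Mean departure delay for each month.
delay_by_month = groupby(mean, flights_toy, :month; select = :dep_delay)
```

The result has one row per distinct month, with the grouping column first and the aggregated value second.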
We’ll use the summarized data above and plot the flight delay in relation to the distance travelled. We’ll use the Gadfly.jl package for plotting and DataFrames.jl for converting JuliaDB.jl’s IndexedTable objects to DataFrames.DataFrame objects, because Gadfly.plot has no direct method for JuliaDB.IndexedTables.
To plot, run the following:
To find the number of planes and the number of flights that go to each possible destination, run:
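A sketch of this count on a toy table (the destinations and tail numbers are made up). I’m assuming that `groupby` accepts a named tuple of functions, so that each function produces one output column:

```julia
using JuliaDB

# Toy stand-in for the flights data (hypothetical values).
flights_toy = table((dest = ["BOS", "BOS", "BOS", "LAX"],
                     tailnum = ["N1", "N1", "N2", "N3"]))

# For each destination: distinct planes and total flights.
per_dest = groupby((planes = x -> length(unique(x)), flights = length),
                   flights_toy, :dest; select = :tailnum)
```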
Piping Multiple Operations
For multiple operations, it is convenient to use piping, and that is why we have tools like JuliaDBMeta.jl. An example written with R’s dplyr can be expressed equivalently in Julia using JuliaDBMeta.jl.
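To show the idea of chaining without relying on any particular macro syntax, here is a sketch that uses Julia’s built-in `|>` operator with anonymous functions instead of JuliaDBMeta.jl’s macros. It is a substitute technique, not the JuliaDBMeta.jl pipeline itself, and the toy table and values are hypothetical:

```julia
using JuliaDB
using Statistics

# Toy stand-in for the flights data (hypothetical values).
flights_toy = table((dest = ["BOS", "BOS", "LAX", "LAX"],
                     distance = [187, 187, 2475, 2475],
                     arr_delay = [5.0, 15.0, -2.0, 4.0]))

# Step 1: keep only the delayed flights.
# Step 2: average the arrival delay per destination.
result = flights_toy |>
    (t -> filter(r -> r.arr_delay > 0, t)) |>
    (t -> groupby(mean, t, :dest; select = :arr_delay))
```

Each step takes the table produced by the previous one, which is the same left-to-right reading that dplyr’s pipe gives in R.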
I’ve demonstrated how easy it is to use Julia for data wrangling, and I love it. In fact, there is also a library called Query.jl that can query any table-like data structure in Julia (I will definitely write a separate article on it in the future).