Note:
This is part 1 in a 2-part series. Click here to go to part 2. The complete code associated with parts 1 and 2 can be seen here.
Background
I have been accepted to Google Summer of Code 2018 as part of the Haskell.org open source organization. As part of this, I am trying to implement an add-on to the Frames library that would allow one to access Postgres from a dataframe in Haskell. This would aid in-memory analysis of data sourced from Postgres in Haskell. Over the course of this blog post I shall introduce Frames which shall be used for it’s in-memory data representation.
A Short Introduction to Frames
Frames
is a Haskell library that reads CSVs and turns them into typed dataframes. It indexes the underlying datasets (that it reads) by column names either found in the data or provided by the user. This allows for compile time checking of type safety of column access and manipulation. It uses the DataKinds extension to promote column names from the term level to the type level.
Consider the following example:
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE QuasiQuotes #-}
{-# LANGUAGE TemplateHaskell #-}
module DataFrame where
import Data.Vinyl (rcast)
import Frames
-- Let's assume we have a CSV ('users.csv') containing the fields
-- 'email'(text), 'first_name'(text), 'last_name'(text),
-- 'is_member'(bool), 'days_in_queue'(integer).
-- Let's further assume the CSV also contains column headers
-- and does not have NAs (i.e. no missing data).
-- The following will read a certain number of rows
-- and generate the types and type synonyms corresponding
-- to each column. For example it will generate:
-- @type IsMember = "is_member" :-> Bool@
-- and similarly for the other columns.
tableTypes "UserF" "data/users.csv"
loadUsers :: IO (Frame UserF)
loadUsers = inCoreAoS (readTable "data/users.csv")
columnSubset :: UserF -> Record '[Email, LastName]
columnSubset = rcast
In the above, tableTypes
will read upto 1000 rows, infer column types and generate appropriate types/type synonym declarations. Template-Haskell helps with the code generation behind the scenes. Note that to represent the row level data, we now also have a column-indexed vinyl
record declaration. This declaration is used to stream the rows from disk into an efficient vector
-backed in-memory UserF
dataframe (having the type Frame UserF
). In this case, Frame
is parameterised over a vinyl
-record declaration, UserF
is simply a type synonym for this declaration. We can confirm this fact from ghci
:
ghci> :i UserF
type UserF =
Record
'["email" :-> Text, "first_name" :-> Text, "last_name" :-> Text,
"is_member" :-> Bool, "days_in_queue" :-> Int]
Running the column subset function (columnSubset
) is as simple as:
ghci> us <- loadUsers
ghci> let miniUser = fmap columnSubset us
ghci>mapM_ print miniUser
{email :-> "james@example.com", last_name :-> "Smith"}
{email :-> "betty@example.com", last_name :-> "Jones"}
{email :-> "james@pallo.com", last_name :-> "Pallo"}
{email :-> "betty@sims.com", last_name :-> "Sims"}
{email :-> "james@oreily.com", last_name :-> "O'Reily"}
{email :-> "sam@sophitz.com", last_name :-> "Sophitz"}
{email :-> "sam@jely.com", last_name :-> "Jely"}
{email :-> "sam@example.com", last_name :-> "Taylor"}
Looking at the Frames
documentation reveals that a Frame
data type is a “finite collection of rows indexed by Int”, where “Int” is the row-number. This allows us to get the last three rows (a row-subset) by doing the following:
ghci> us <- loadUsers
ghci> mapM_ (print . frameRow us) [frameLength us - 3 .. frameLength us - 1]
{email :-> "sam@sophitz.com", first_name :-> "Sam", last_name :-> "Sophitz", is_member :-> False, days_in_queue :-> 42}
{email :-> "sam@jely.com", first_name :-> "Sam", last_name :-> "Jely", is_member :-> True, days_in_queue :-> 1}
{email :-> "sam@example.com", first_name :-> "Sam", last_name :-> "Taylor", is_member :-> False, days_in_queue :-> 42}
Type Safety
Consider the following addition to the module DataFrame
:
badColumnSubset :: UserF -> Record '[Email, Age]
badColumnSubset = rcast
In essence we state that we want a column subset of the original data, containing the columns email and age. But the underlying dataset does not have an ‘Age’ column. And therefore no column type ‘Age’ has been generated. The compiler helps us by throwing the following error:
Not in scope: type constructor or class ‘Age’
|
27 | badColumnSubset :: UserF -> Record '[Email, Age]
While the above may look a bit contrived, the point I am trying to make is that this can be very helpful, especially while working with multilpe datasets and code that processes them in some way (slicing, dicing, joining, summarizing etc.). By making the assumptions about the underlying datasets in our code explicit, we have a better chance of being able to maintain the code that processes it, with the compiler helping us when the inevitable drift between the two occurs.
Conclusion
Now we have some sense about how Frames
works under the hood to give us column-name indexed records, that are used to build up column-name indexed/row-indexed dataframes. Having column names as part of the type signature has a nice property of allowing a programmer to make his/her intent very clear while writing the code. And the compiler helps further by checking the “plausibility” of this intent, given certain input data.
I would refer you to the official tutorial in case you’re interested in learning more about Frames
.
In the next part of this series of posts, we talk about beam, which we shall use for Postgres access.