long_panel()
and widen_panel()
One of the initial challenges a data analyst is likely to face with panel data is getting it into a format suitable for analysis. Most regression analyses for panel data require the data to be in long format. That means there is a row for each entity (e.g., person) at each time point. If I conducted a 3-wave panel survey of 300 people, each of whom responded to all 3 waves, the long format of these data would have 900 rows (300 respondents x 3 waves).
For example, the following is how long data look, where
id
is the identifier for each entity, wave
is
the indicator of the time point, and Q1
/Q2
are
measures repeated at each time point.
# A tibble: 9 × 4
id wave Q1 Q2
<chr> <dbl> <dbl> <dbl>
1 1 1 1 5
2 1 2 1.5 7.5
3 1 3 2 10
4 2 1 5 14
5 2 2 4 10.5
6 2 3 3 7
7 3 1 15 8
8 3 2 12 12
9 3 3 9 16
Wide data, on the other hand, have only one row per entity and a separate column for each measure and time point. The same data above in wide format look like this:
# A tibble: 3 × 7
id Q1_W1 Q1_W2 Q1_W3 Q2_W1 Q2_W2 Q2_W3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1.5 2 5 7.5 10
2 2 5 4 3 14 10.5 7
3 3 15 12 9 8 12 16
Here you differentiate between waves by looking at the column name, which in this case end in “_W” and then the wave indicator. Some analyses prefer the data in this format, like structural equation models.
panelr
considers the native format of panel data to be
long and provides the panel_data
class to keep
your data tidy in the long format. Of course, sometimes your raw data
aren’t in long format and need to be “reshaped” from wide to long. In
other cases, you have long format data but need to get it into wide
format for some reason or another. panelr
provides tools to
help with these situations.
There are some other tools, including ones that panelr
uses internally, that can manage these situations. However, they tend to
be some combination of confusing, inflexible, or too general to be
easily used for these purposes by non-experts.
In my experience, survey contractors (i.e., the people you pay to carry out panel surveys) like to provide the data in wide format. As a general rule, the conversion of data from wide to long is much more difficult than the inverse. When preparing to reshape data from wide to long format, you’ll need to answer some questions relating to how the column/variable names distinguish the variable name from the time indicator:
W1_variable
has both prefix (W
)
and suffix (_
)).One key assumption is that variables labeled with a pattern such as
Q1_W1
, Q1_W2
, and so on refer to the same
measure at different times. I’ve encountered datasets in which
Q1
might refer to a different measure at each time point
and this is not a problem that can be handled in an automated way.
With that warning out of the way, let’s look at a couple examples.
Let’s return to the wide data we looked at earlier.
# A tibble: 3 × 7
id Q1_W1 Q1_W2 Q1_W3 Q2_W1 Q2_W2 Q2_W3
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1.5 2 5 7.5 10
2 2 5 4 3 14 10.5 7
3 3 15 12 9 8 12 16
Here we can see that the time indicators are at the end of
the variable names (_W1
), time indicators of 1, 2, and 3,
and a prefix of _W
. With that in mind, we can use
long_panel()
to convert the data to long format.
# Panel data: 9 × 4
# Entities: id [3]
# Wave variable: wave [1, 2, 3 (3 waves)]
id wave Q1 Q2
<fct> <dbl> <dbl> <dbl>
1 1 1 1 5
2 1 2 1.5 7.5
3 1 3 2 10
4 2 1 5 14
5 2 2 4 10.5
6 2 3 3 7
7 3 1 15 8
8 3 2 12 12
9 3 3 9 16
Perfect! The first argument, w
, was the wide data.
prefix
is self-explanatory. begin
and
end
refer to the range of the time indicators, since they
are consecutive. You can instead use periods = c(1, 2, 3)
if you prefer. That’s especially true if you have non-consecutive time
indicators like a biannual survey that uses the year as an indicator,
like periods = c(1990, 1992, 1994)
.
I should note that base R has a function, reshape()
that
can work in this situation without making you pull your hair out too
much:
reshape(as.data.frame(wide), sep = "_W", times = c(1, 2, 3), direction = "long",
varying = c("Q1_W1", "Q1_W2", "Q1_W3", "Q2_W1", "Q2_W2", "Q2_W3"))
id time Q1 Q2
1.1 1 1 1.0 5.0
2.1 2 1 5.0 14.0
3.1 3 1 15.0 8.0
1.2 1 2 1.5 7.5
2.2 2 2 4.0 10.5
3.2 3 2 12.0 12.0
1.3 1 3 2.0 10.0
2.3 2 3 3.0 7.0
3.3 3 3 9.0 16.0
You can see how frustrating that could be if you had many more
variables — it wouldn’t be unusual to have hundreds of columns in the
wide format, not all of which would be variables that vary over time
(e.g., race). Truth be told, long_panel()
uses
reshape()
internally, but only after a lot of processing.
Other options include the reshape2
and tidyr
packages, but they are not purpose-built for the panel setting and
therefore can have a learning curve to avoid having data that end up a
bit too long.
Here’s a wide dataset with what is usually a trickier format to
handle due to limitations of reshape()
.
# A tibble: 3 × 5
WA_Q1 WB_Q1 WC_Q1 WA_Q2 WC_Q2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.5 2 5 10
2 5 4 3 14 7
3 15 12 9 8 16
Key characteristics:
W
) and suffix (_
).While you don’t have to recognize this to use the function properly,
notice that in this case Q2
was only measured at times A
and C. This can add considerable difficulty to when trying to reshape
data “by hand.”
# Panel data: 9 × 4
# Entities: id [3]
# Wave variable: wave [A, B, C (3 waves)]
id wave Q1 Q2
<fct> <ord> <dbl> <dbl>
1 1 A 1 5
2 1 B 1.5 NA
3 1 C 2 10
4 2 A 5 14
5 2 B 4 NA
6 2 C 3 7
7 3 A 15 8
8 3 B 12 NA
9 3 C 9 16
Just what we were looking for. Note that panel_data
objects must have an ordered wave variable, but long_data()
understands how to order letters and handles that for you. The
missingness in Q2
is by design, since it wasn’t measured in
wave B.
Another issue that can come up is the treatment of constants — that
is, variables that do not change over time. The best wide data should
come labeled in a way that makes it clear the constants are constants.
For instance, a variable signifying race wouldn’t be called
race_W1
, but instead just race
.
long_panel()
automatically checks your data for variables
that are labeled as if they vary over time but actually do not.
For instance, data that start by looking like this:
# A tibble: 3 × 5
id Q1_W1 Q1_W2 Q1_W3 race_W1
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 1 1.5 2 white
2 2 4 3 2 black
3 3 15 12 9 white
Can easily end up shaped like this:
# A tibble: 9 × 4
id wave race Q1
<dbl> <dbl> <chr> <dbl>
1 1 1 white 1
2 1 2 <NA> 1.5
3 1 3 <NA> 2
4 2 1 black 4
5 2 2 <NA> 3
6 2 3 <NA> 2
7 3 1 white 15
8 3 2 <NA> 12
9 3 3 <NA> 9
But obviously just because the wide data marked race
with a wave label, that doesn’t mean it was unknown in the other waves.
You’ll get the right result with long_panel()
:
# Panel data: 9 × 4
# Entities: id [3]
# Wave variable: wave [1, 2, 3 (3 waves)]
id wave Q1 race
<fct> <dbl> <dbl> <chr>
1 1 1 1 white
2 1 2 1.5 white
3 1 3 2 white
4 2 1 4 black
5 2 2 3 black
6 2 3 2 black
7 3 1 15 white
8 3 2 12 white
9 3 3 9 white
If you have an ID variable in the wide data, you can pass the name of
that variable to long_panel()
with the id
argument, which is "id"
by default. If there is no variable
with the name you give to id
, one will be created. You can
also choose the name of the wave variable via wave
, which
is "wave"
by default.
You can also choose not to have the output of
long_panel()
be a panel_data
object by setting
as_panel_data
to FALSE
.
There are some other options available to you for tougher cases. For
instance, when use.regex
is TRUE
, the
arguments for prefix
and suffix
are treated as
regular expressions for more complicated patterns.
Internally, time-varying variables are detected by the presence of
prefix
, one of the time periods, and suffix
in
the variable name. The “root” variable without the indicator is whatever
is left. Sometimes, though, this can cause false matches. Here’s an
example I have encountered. My wide data looked like this:
# A tibble: 3 × 5
CaseID Consent A1 B1 C1
<dbl> <lgl> <dbl> <dbl> <dbl>
1 1 TRUE 5 4 3
2 2 TRUE 6 7 8
3 3 TRUE 10 8 6
My ID variable was called CaseID
and the periods — which
were A, B, and C — were labeled at the beginning of the column
names. Following the earlier examples, this will confuse
long_panel()
:
# Panel data: 9 × 4
# Entities: CaseID [3]
# Wave variable: wave [A, B, C (3 waves)]
CaseID wave `1` onsent
<fct> <ord> <dbl> <lgl>
1 1 A 5 TRUE
2 1 B 4 TRUE
3 1 C 3 TRUE
4 2 A 6 TRUE
5 2 B 7 TRUE
6 2 C 8 TRUE
7 3 A 10 TRUE
8 3 B 8 TRUE
9 3 C 6 TRUE
See what happened? The Consent
variable in the wide data
looked just like a constant variable that was measured at time point C.
This isn’t the end of the world, but errors like this can be more
confusing and damaging in other scenarios. Fortunately, I knew more
about the labeling of the time-varying variables than what I told
long_panel()
. Yes, there is A/B/C at the beginning with no
prefix/suffix, but also each time-varying item has a number
that comes after A/B/C.
long_panel()
offers the argument match
for
situations like these. This is the regular expression used to match and
then capture the variable name sans time indicator. By default,
match
is ".*"
, meaning any character any
number of times. To reflect what I know about these data, I change it to
"\\d+.*"
, meaning at least one digit following by any
number of other characters.
long_panel(wide, begin = "A", end = "C", label_location = "beginning",
id = "CaseID", match = "\\d+.*")
# Panel data: 9 × 4
# Entities: CaseID [3]
# Wave variable: wave [A, B, C (3 waves)]
CaseID wave Consent `1`
<fct> <ord> <lgl> <dbl>
1 1 A TRUE 5
2 1 B TRUE 4
3 1 C TRUE 3
4 2 A TRUE 6
5 2 B TRUE 7
6 2 C TRUE 8
7 3 A TRUE 10
8 3 B TRUE 8
9 3 C TRUE 6
Now it rightly ignores Consent
as a variable that lacks
a time indicator. In general, long_panel()
tries to protect
you from having to use or even know how to use regular expressions, but
sometimes there’s no way around it.
widen_panel()
, as you might expect, does the opposite of
long_panel()
. This is generally an easier operation,
thankfully.
widen_panel()
expects a panel_data
object.
If your long data aren’t in that format, it’s easy enough to just pass
it to panel_data()
.
To go through an example, let’s take a look at some long data.
# Panel data: 9 × 5
# Entities: person [3]
# Wave variable: time [1, 2, 3 (3 waves)]
person time Q1 Q2 race
<fct> <dbl> <dbl> <dbl> <chr>
1 1 1 1 5 white
2 1 2 1.5 7.5 white
3 1 3 2 10 white
4 2 1 5 14 black
5 2 2 4 10.5 black
6 2 3 3 7 black
7 3 1 15 8 white
8 3 2 12 12 white
9 3 3 9 16 white
Okay, so we have an ID variable (person
), wave variable
(time
), two time-varying variables (Q1
and
Q2
), and a time-invariant variable (race
). The
only difficulty here conceptually is how to automatically know, without
the domain knowledge about the substantive meaning of these variables,
which ones vary over time and which don’t. This is simply a matter of
widen_panel()
checking the variance of each (using the
panelr
function are_varying()
). Note that in
very wide datasets, or those with many individuals, this can take a
little while to happen.
# A tibble: 3 × 8
person race Q1_1 Q2_1 Q1_2 Q2_2 Q1_3 Q2_3
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 white 1 5 1.5 7.5 2 10
2 2 black 5 14 4 10.5 3 7
3 3 white 15 8 12 12 9 16
Pretty much all you need to worry about is how you want to label the
wide data. By default the separator
argument is
"_"
.
There are only two other arguments. varying
lets you
specify which variables in the long data vary over time. This can save
you time compared to having widen_panel()
check them all,
but of course requires you to pass those variable names along which can
be more work than it’s worth at times.
ignore.attributes
deals with the scenario in which you
started with wide data, used long_panel()
to convert to
long format, and now want to convert back to wide format.
long_panel()
stores information in the data frame about
which variables vary over time so that they don’t have to be checked all
over again. If you’ve made changes or think something went wrong, you
can set ignore.attributes
to TRUE
to force
those checks all over again.