Title: | Clustering Time Series While Resisting Outliers |
---|---|
Description: | Robust Clustering of Time Series (RCTS) has the functionality to cluster time series using both the classical and the robust interactive fixed effects framework. The classical framework is developed in Ando & Bai (2017) <doi:10.1080/01621459.2016.1195743>. The implementation within this package excludes the SCAD-penalty on the estimations of beta. This robust framework is developed in Boudt & Heyndels (2022) <doi:10.1016/j.ecosta.2022.01.002> and is made robust against different kinds of outliers. The algorithm iteratively updates beta (the coefficients of the observable variables), group membership, and the latent factors (which can be common and/or group-specific) along with their loadings. The number of groups and factors can be estimated if they are unknown. |
Authors: | Ewoud Heyndels [aut, cre] |
Maintainer: | Ewoud Heyndels <[email protected]> |
License: | GPL (>= 2) |
Version: | 0.2.4 |
Built: | 2024-11-08 03:58:07 UTC |
Source: | https://github.com/eh-in-r/rcts |
The PIC is calculated with a sigma2 specific to the configuration (= number of groups and factors). Because the method to estimate the number of groups and factors requires sigma2 to be equal over all configurations (see proofs of different papers of Ando/Bai) we replace sigma2 by the sigma2 of the configuration with maximum number of groups and factors (this is the last one that was executed).
adapt_pic_with_sigma2maxmodel(df, df_results, sigma2_max_model)
adapt_pic_with_sigma2maxmodel(df, df_results, sigma2_max_model)
df |
contains PIC for all candidate C's and all subsamples |
df_results |
dataframe with results for each estimated configuration |
sigma2_max_model |
sigma2 of model with maximum number of groups and factors |
data.frame of same size as df
set.seed(1) df_pic <- data.frame(matrix(rnorm(4 * 50), nrow = 4)) #4 configuration / 50 candidate values for C df_results <- data.frame(sigma2 = rnorm(4)) pic_sigma2 <- 3.945505 adapt_pic_with_sigma2maxmodel(df_pic, df_results, pic_sigma2)
set.seed(1) df_pic <- data.frame(matrix(rnorm(4 * 50), nrow = 4)) #4 configuration / 50 candidate values for C df_results <- data.frame(sigma2 = rnorm(4)) pic_sigma2 <- 3.945505 adapt_pic_with_sigma2maxmodel(df_pic, df_results, pic_sigma2)
When running the algorithm with a different number of observable variables then the number that is available, reformat X. (Mainly used for testing)
adapt_X_estimating_less_variables(X, vars_est)
adapt_X_estimating_less_variables(X, vars_est)
X |
dataframe with the observed variables |
vars_est |
number of available observed variables for which a coefficient will be estimated |
Returns a 3D-array. If vars_est is set to 0, it returns NA.
Adds the current configuration (number of groups and factors) to df_results.
add_configuration(df_results, S, k, kg)
add_configuration(df_results, S, k, kg)
df_results |
dataframe with results for each estimated configuration |
S |
estimated number of groups in current configuration |
k |
estimated number of common factors in current configuration |
kg |
vector with the estimated number of group specific factors in current configuration (augmented with NA's to reach a length of 20) |
data.frame
add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17)))
add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17)))
Adds several metrics to df_results.
add_metrics( df_results, index_configuration, pic_sigma2, beta_est, g, comfactor, lambda, factor_group, lambda_group, iteration, g_true = NA, add_rand = FALSE )
add_metrics( df_results, index_configuration, pic_sigma2, beta_est, g, comfactor, lambda, factor_group, lambda_group, iteration, g_true = NA, add_rand = FALSE )
df_results |
dataframe with results for each estimated configuration |
index_configuration |
index of the configuration of groups and factors |
pic_sigma2 |
sum of squared errors divided by NT |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
comfactor |
estimated common factors |
lambda |
loadings of the estimated common factors |
factor_group |
estimated group specific factors |
lambda_group |
loadings of the estimated group specific factors |
iteration |
number of iteration |
g_true |
vector of length NN with true group memberships |
add_rand |
adds the adjusted randindex to the df (requires the mclust package); used for simulations |
data.frame with final estimations of each configuration
df_results <- add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))) #data.frame with one configuration add_metrics(df_results, 1, 3.94, NA, round(runif(30, 1, 3)), NA, NA, NA, NA, 9)
df_results <- add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))) #data.frame with one configuration add_metrics(df_results, 1, 3.94, NA, round(runif(30, 1, 3)), NA, NA, NA, NA, 9)
Fills in df_pic: adds a row with the calculated PIC for the current configuration.
add_pic( df, index_configuration, robust, Y, beta_est, g, S, k, kg, est_errors, C_candidates, method_estimate_beta = "individual", choice_pic = "pic2017" )
add_pic( df, index_configuration, robust, Y, beta_est, g, S, k, kg, est_errors, C_candidates, method_estimate_beta = "individual", choice_pic = "pic2017" )
df |
input data frame |
index_configuration |
index of the configuration of groups and factors |
robust |
robust or classical estimation |
Y |
Y: NxT dataframe with the panel data of interest |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
S |
number of estimated groups |
k |
estimated number of common factors |
kg |
vector with the estimated number of group specific factors for each group |
est_errors |
NxT matrix with the error terms |
C_candidates |
candidates for C (parameter in PIC) |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
choice_pic |
parameter that defines which PIC is used to select the best configuration of groups and factors. Options are "pic2017" (uses the PIC of Ando and Bai (2017)), "pic2016" (Ando and Bai (2016)) weighs the fourth term with an extra factor relative to the size of the groups, and "pic2022". They differ in the penalty they perform on the number of group specific factors (and implicitly on the number of groups). They also differ in the sense that they have different NT-regions (where N is the number of time series and T is the length of the time series) where the estimated number of groups, and thus group specific factors will be wrong. Pic2022 is designed to shrink the problematic NT-region to very large N / very small T). |
data.frame
set.seed(1) original_data <- create_data_dgp2(30, 10) Y <- original_data[[1]] g <- original_data[[3]] beta_est <- matrix(rnorm(4 * nrow(Y)), nrow = 4) df_pic <- initialise_df_pic(1:5) e <- matrix(rnorm(nrow(Y) * ncol(Y)), nrow(Y)) add_pic(df_pic, 1, TRUE, Y, beta_est, g, 3, 0, c(3, 3, 3), e, 1:5)
set.seed(1) original_data <- create_data_dgp2(30, 10) Y <- original_data[[1]] g <- original_data[[3]] beta_est <- matrix(rnorm(4 * nrow(Y)), nrow = 4) df_pic <- initialise_df_pic(1:5) e <- matrix(rnorm(nrow(Y) * ncol(Y)), nrow(Y)) add_pic(df_pic, 1, TRUE, Y, beta_est, g, 3, 0, c(3, 3, 3), e, 1:5)
Calculates the PIC for the current configuration.
add_pic_parallel( Y, beta_est, g, S, k, kg, robust, est_errors, C_candidates, choice_pic, method_estimate_beta = "individual" )
add_pic_parallel( Y, beta_est, g, S, k, kg, robust, est_errors, C_candidates, choice_pic, method_estimate_beta = "individual" )
Y |
Y: NxT dataframe with the panel data of interest |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
S |
number of estimated groups |
k |
estimated number of common factors |
kg |
vector with the estimated number of group specific factors for each group |
robust |
robust or classical estimation |
est_errors |
NxT matrix with the error terms |
C_candidates |
candidates for C (parameter in PIC) |
choice_pic |
parameter that defines which PIC is used to select the best configuration of groups and factors. Options are "pic2017" (uses the PIC of Ando and Bai (2017)), "pic2016" (Ando and Bai (2016)) weighs the fourth term with an extra factor relative to the size of the groups, and "pic2022". They differ in the penalty they perform on the number of group specific factors (and implicitly on the number of groups). They also differ in the sense that they have different NT-regions (where N is the number of time series and T is the length of the time series) where the estimated number of groups, and thus group specific factors will be wrong. Pic2022 is designed to shrink the problematic NT-region to very large N / very small T). |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
numeric vector with a value for each candidate C
Helpfunction in create_true_beta() for the option beta_true_heterogeneous_groups. (This is the default option.)
beta_true_heterogroups( vars, S_true, extra_beta_factor = 1, limit_true_groups = 12 )
beta_true_heterogroups( vars, S_true, extra_beta_factor = 1, limit_true_groups = 12 )
vars |
number of observable variables |
S_true |
true number of groups: should be at least 1 and maximum limit_true_groups |
extra_beta_factor |
option to multiply the coefficients in true beta; default = 1 |
limit_true_groups |
Maximum number of true groups in a simulation-DGP for which the code in this package is implemented. Currently equals 12. For application on realworld data this parameter is not relevant. |
matrix where the number of rows equals S_true, and the number of columns equals max(1, vars)
Function that returns for each candidate C the best number of groups and factors, based on the PIC.
calculate_best_config(df_results, df_pic, C_candidates, limit_est_groups = 20)
calculate_best_config(df_results, df_pic, C_candidates, limit_est_groups = 20)
df_results |
dataframe with results for each estimated configuration |
df_pic |
dataframe with the PIC for each configuration and for each candidate C |
C_candidates |
candidates for C (parameter in PIC) |
limit_est_groups |
maximum allowed number of groups that can be estimated |
Returns a matrix with a row for each candidate value for C. The first column contains the optimized number of groups (for each candidate C). The second columns does the same for the number of common factors. Column 3 until 22 do the same for the number of group specific factors. This is set to NA if the configuration has less than 20 groups estimated.
df_results <- add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))) #data.frame with one configuration calculate_best_config(df_results, data.frame(t(1:5)), 1:5)
df_results <- add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))) #data.frame with one configuration calculate_best_config(df_results, data.frame(t(1:5)), 1:5)
Calculates the error term Y - X*beta_est - LF - LgFg.
calculate_error_term( Y, X, beta_est, g, factor_group, lambda_group, comfactor, lambda, S, k, kg, method_estimate_beta = "individual", no_common_factorstructure = FALSE, no_group_factorstructure = FALSE )
calculate_error_term( Y, X, beta_est, g, factor_group, lambda_group, comfactor, lambda, S, k, kg, method_estimate_beta = "individual", no_common_factorstructure = FALSE, no_group_factorstructure = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
factor_group |
estimated group specific factors |
lambda_group |
loadings of the estimated group specific factors |
comfactor |
estimated common factors |
lambda |
loadings of the estimated common factors |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
no_common_factorstructure |
if there is a common factorstructure being estimated |
no_group_factorstructure |
if there is a group factorstructure being estimated |
NxT matrix
X <- X_dgp3 Y <- Y_dgp3 # Set estimations for group factors and its loadings, and group membership # to the true value for this example. lambda_group <- lambda_group_true_dgp3 factor_group <- factor_group_true_dgp3 g <- g_true_dgp3 set.seed(1) beta_est <- matrix(rnorm(nrow(Y) * 4), ncol = nrow(Y)) #random values for beta comfactor <- matrix(0, ncol = ncol(Y)) lambda <- matrix(0, ncol = nrow(Y)) calculate_error_term(Y, X, beta_est, g, factor_group, lambda_group, comfactor, lambda, 3, 0, c(3, 3, 3))
X <- X_dgp3 Y <- Y_dgp3 # Set estimations for group factors and its loadings, and group membership # to the true value for this example. lambda_group <- lambda_group_true_dgp3 factor_group <- factor_group_true_dgp3 g <- g_true_dgp3 set.seed(1) beta_est <- matrix(rnorm(nrow(Y) * 4), ncol = nrow(Y)) #random values for beta comfactor <- matrix(0, ncol = ncol(Y)) lambda <- matrix(0, ncol = nrow(Y)) calculate_error_term(Y, X, beta_est, g, factor_group, lambda_group, comfactor, lambda, 3, 0, c(3, 3, 3))
During the updating of group membership, the errorterm is used as the objective function to estimate the group.
calculate_errors_virtual_groups( group, LF, virtual_grouped_factor_structure, NN, TT, k, kg, vars_est, method_estimate_beta, Y, X, beta_est, g )
calculate_errors_virtual_groups( group, LF, virtual_grouped_factor_structure, NN, TT, k, kg, vars_est, method_estimate_beta, Y, X, beta_est, g )
group |
group |
LF |
NxT-matrix of the common factorstructure |
virtual_grouped_factor_structure |
list with length the number of groups; every element of the list contains NxT-matrix |
NN |
number of time series |
TT |
length of time series |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
Y |
Y: NxT dataframe with the panel data of interest |
X |
dataframe with the observed variables |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
NxT matrix with the errorterms (=Y minus the estimated factorstructure(s) and minus X*beta)
Returns the estimated groupfactorstructure.
calculate_FL_group_estimated( lg, fg, g, NN, TT, S, k, kg, num_factors_may_vary = TRUE )
calculate_FL_group_estimated( lg, fg, g, NN, TT, S, k, kg, num_factors_may_vary = TRUE )
lg |
loadings of estimated group factors |
fg |
estimated group factors |
g |
Vector with estimated group membership for all individuals |
NN |
number of time series |
TT |
length of time series |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
num_factors_may_vary |
whether or not the number of groupfactors is constant over all groups or not |
list with NjxT matrices
Calculate the true groupfactorstructure.
calculate_FL_group_true( lgt, fgt, g_true, NN, TT, S_true, k_true, kg_true, num_factors_may_vary = TRUE, dgp1_AB_local = FALSE )
calculate_FL_group_true( lgt, fgt, g_true, NN, TT, S_true, k_true, kg_true, num_factors_may_vary = TRUE, dgp1_AB_local = FALSE )
lgt |
true group factor loadings |
fgt |
true group factors |
g_true |
vector of length NN with true group memberships |
NN |
number of time series |
TT |
length of time series |
S_true |
true number of groups |
k_true |
true number of common factors |
kg_true |
true number of group factors for each group |
num_factors_may_vary |
whether or not the number of groupfactors is constant over all groups or not |
dgp1_AB_local |
gives information about which DGP we use; TRUE of FALSE |
list with NjxT matrices
calculates factor loadings of common factors
calculate_lambda( Y, X, beta_est, comfactor, factor_group, g, lgfg_list, k, kg, robust, method_estimate_beta, method_estimate_factors, verbose = FALSE, initialise = FALSE )
calculate_lambda( Y, X, beta_est, comfactor, factor_group, g, lgfg_list, k, kg, robust, method_estimate_beta, method_estimate_factors, verbose = FALSE, initialise = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
comfactor |
common factors |
factor_group |
estimated group specific factors |
g |
Vector with group membership for all individuals |
lgfg_list |
This is a list (length number of groups) containing FgLg for every group. |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
method_estimate_factors |
defines method of robust estimaton of the factors: "macro", "pertmm" or "cz" |
verbose |
when TRUE, it prints messages |
initialise |
indicator of being in the initialisation phase |
Returns a matrix where each row contains a common factor. If the number of estimated common factors equals zero, it returns a matrix with 1 row, containing zero's.
returns object which includes group and id of the individuals
calculate_lambda_group( Y, X, beta_est, factor_group, g, lambda, comfactor, S, k, kg, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", verbose = FALSE, initialise = FALSE )
calculate_lambda_group( Y, X, beta_est, factor_group, g, lambda, comfactor, S, k, kg, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", verbose = FALSE, initialise = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
factor_group |
estimated group specific factors |
g |
Vector with estimated group membership for all individuals |
lambda |
loadings of the estimated common factors |
comfactor |
estimated common factors |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
method_estimate_factors |
defines method of robust estimaton of the factors: "macro", "pertmm" or "cz" |
verbose |
when TRUE, it prints messages |
initialise |
indicator of being in the initialisation phase |
Returns a data.frame with a row for each time series. The first number of columns contain the individual loadings to the group specific factors. Furthermore "group" (group membership) and id (the order in which the time series appear in Y) are added.
#' #example with data generated with DGP 2 data <- create_data_dgp2(30, 10) Y <- data[[1]] X <- data[[2]] g <- data[[3]] #true group membership set.seed(1) beta_est <- matrix(rnorm(4 * nrow(Y)), nrow = 4) factor_group <- data[[5]] #true values of group specific factors comfactor <- matrix(0, nrow = 1, ncol = ncol(Y)) lambda <- matrix(0, nrow = 1, ncol = nrow(Y)) calculate_lambda_group(Y, X, beta_est, factor_group, g, lambda, comfactor, 3, 0, c(3, 3, 3), TRUE)
#' #example with data generated with DGP 2 data <- create_data_dgp2(30, 10) Y <- data[[1]] X <- data[[2]] g <- data[[3]] #true group membership set.seed(1) beta_est <- matrix(rnorm(4 * nrow(Y)), nrow = 4) factor_group <- data[[5]] #true values of group specific factors comfactor <- matrix(0, nrow = 1, ncol = ncol(Y)) lambda <- matrix(0, nrow = 1, ncol = nrow(Y)) calculate_lambda_group(Y, X, beta_est, factor_group, g, lambda, comfactor, 3, 0, c(3, 3, 3), TRUE)
Returns list (with as length the number of groups) with lgfg (product of grouploadings and groupfactors). Each element of the list with the assumption that all individuals are in the same group k. This function is used to speed up code.
calculate_lgfg( lambda_group, factor_group, S, k, kg, num_factors_may_vary, NN, TT )
calculate_lgfg( lambda_group, factor_group, S, k, kg, num_factors_may_vary, NN, TT )
lambda_group |
loadings of the estimated group specific factors |
factor_group |
estimated group specific factors |
S |
number of groups |
k |
number of common factors |
kg |
vector with the number of group specific factors for each group |
num_factors_may_vary |
whether or not the number of groupfactors is constant over all groups or not |
NN |
number of time series |
TT |
length of time series |
list with S elements: each element contains a matrix with NN rows and TT columns with the estimated group factor structure of this particular group
Helpfunction in update_g(). Depends on an not yet established group k ( cannot use lgfg_list)
calculate_obj_for_g(i, k, errors_virtual, rho_parameters, robust, TT)
calculate_obj_for_g(i, k, errors_virtual, rho_parameters, robust, TT)
i |
individual |
k |
group |
errors_virtual |
list with errors for each possible group |
rho_parameters |
median and madn of the calculated error term |
robust |
robust or classical estimation |
TT |
length of time series |
numeric value
This depends on kappa1 -> kappaN, through p (=number of nonzero elements of beta_est). The parameter 'sigma2' is the non-robust sigma2. As it is only used in term 2 to 4, it does not actually matter what its value is (needs to be > 0). It could be set to 1 as well.
calculate_PIC( C, robust, S, k, kg, e2, sigma2, NN, TT, method_estimate_beta, beta_est, g, vars_est, choice_pic = "pic2017" )
calculate_PIC( C, robust, S, k, kg, e2, sigma2, NN, TT, method_estimate_beta, beta_est, g, vars_est, choice_pic = "pic2017" )
C |
determines relative contribution of the penalty terms compared to the estimation error term |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
e2 |
NxT matrix with error terms |
sigma2 |
scalar: sum of squared error terms, scaled by NT |
NN |
number of time series |
TT |
length of time series |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
choice_pic |
indicates which PIC to use to estimate the number of groups and factors: options are "pic2017" (uses the PIC of Ando and Bai (2017); works better for large N), "pic2016" (Ando and Bai (2016); works better for large T) weighs the fourth term with an extra factor relative to the size of the groups, and "pic2022" which shrinks the NT-space where the number of groups and factors would be over- or underestimated compared to pic2016 and pic2017. |
numeric
set.seed(1) NN <- 30 TT <- 10 e <- matrix(rnorm(NN * TT), nrow = NN) beta_est <- matrix(rnorm(NN * 4), ncol = NN) #random values for beta g <- round(runif(NN, 1, 3)) calculate_PIC(0.51, TRUE, 3, 0, c(3, 3, 3), e, e^2/(NN*TT), NN, TT, "individual", beta_est, g, 3)
set.seed(1) NN <- 30 TT <- 10 e <- matrix(rnorm(NN * TT), nrow = NN) beta_est <- matrix(rnorm(NN * 4), ncol = NN) #random values for beta g <- round(runif(NN, 1, 3)) calculate_PIC(0.51, TRUE, 3, 0, c(3, 3, 3), e, e^2/(NN*TT), NN, TT, "individual", beta_est, g, 3)
This is used in calculate_PIC()
calculate_PIC_term1(e, robust)
calculate_PIC_term1(e, robust)
e |
NxT matrix with the error terms |
robust |
robust or classical estimation |
numeric
Calculates sum of squared errors, divided by NT
calculate_sigma2(e, NN = nrow(e), TT = ncol(e))
calculate_sigma2(e, NN = nrow(e), TT = ncol(e))
e |
matrix with error terms |
NN |
N |
TT |
T |
numeric
Y <- Y_dgp3 set.seed(1) e <- matrix(rnorm(nrow(Y) * ncol(Y)), nrow = nrow(Y)) calculate_sigma2(e)
Y <- Y_dgp3 set.seed(1) e <- matrix(rnorm(nrow(Y) * ncol(Y)), nrow = nrow(Y)) calculate_sigma2(e)
Sigma2 is the sum of the squared errors, divided by NT. We need the sigma2 of the maxmodel to use (in term 2,3,4 of the PIC) instead of the configuration-dependent sigma2. (See paper AndoBai 2016). sigma2_max_model could actually be set to 1 as well, as it can be absorbed in parameter C of the PIC.
calculate_sigma2maxmodel(e, kg_max, S, S_cand, kg, k, k_cand)
calculate_sigma2maxmodel(e, kg_max, S, S_cand, kg, k, k_cand)
e |
NxT-matrix containing the estimated error term |
kg_max |
scalar: maximum allowed number of estimated factors for any group |
S |
estimated number of groups |
S_cand |
vector with candidate values for the number of groups |
kg |
vector with the estimated number of group specific factors for each group |
k |
estimated number of common factors |
k_cand |
vector with candidate value for the number of common factors |
numeric
Helpfunction. Calculates part of the 4th term of the PIC.
calculate_TN_factor(TT, Nj)
calculate_TN_factor(TT, Nj)
TT |
length of time series |
Nj |
number of time series in group j |
numeric
VC² depends on C (this is the scale parameter in PIC). When VC² is equal to zero, the found number of groups and factors are the same over the subsamples.
calculate_VCsquared( rcj, rc, C_candidates, indices_subset, Smax, limit_est_groups = 20 )
calculate_VCsquared( rcj, rc, C_candidates, indices_subset, Smax, limit_est_groups = 20 )
rcj |
dataframe containg the numer of groupfactors for all candidate C's and all subsamples |
rc |
dataframe containg the numer of common factors for all candidate C's and all subsamples |
C_candidates |
candidates for C (parameter in PIC) |
indices_subset |
all indices of the subsets |
Smax |
maximum allowed number of estimated groups |
limit_est_groups |
maximum allowed number of groups that can be estimated |
numeric vector with the VC2-value for each candidate C
rcj <- data.frame(X1 = rep("3_3_3", 5), X2 = rep("3_2_1", 5)) rc <- data.frame(X1 = rep(1, 5), X2 = rep(0, 5)) calculate_VCsquared(rcj, rc, 1:5, 0:1, 3)
rcj <- data.frame(X1 = rep("3_3_3", 5), X2 = rep("3_2_1", 5)) rc <- data.frame(X1 = rep(1, 5), X2 = rep(0, 5)) calculate_VCsquared(rcj, rc, 1:5, 0:1, 3)
This function calculates FgLg (the groupfactorstructure) for all possible groups where individual i can be placed. For each group were the groupfactors (Fg) estimated earlier. Now the grouploadings are needed for each group as well. In the classical case these are calculated by Fg*Y/T. In the robust case these are robust.
calculate_virtual_factor_and_lambda_group( group, solve_FG_FG_times_FG, robust, NN_local, method_estimate_factors_local, g, vars_est, number_of_group_factors_local, number_of_common_factors_local, method_estimate_beta, factor_group, lambda, comfactor, Y, X, beta_est, verbose = FALSE )
calculate_virtual_factor_and_lambda_group( group, solve_FG_FG_times_FG, robust, NN_local, method_estimate_factors_local, g, vars_est, number_of_group_factors_local, number_of_common_factors_local, method_estimate_beta, factor_group, lambda, comfactor, Y, X, beta_est, verbose = FALSE )
group |
number of groups |
solve_FG_FG_times_FG |
This is the same as groupfactor / T. It is only used in the Classical approach |
robust |
robust or classical estimation of group membership |
NN_local |
number of time series |
method_estimate_factors_local |
specifies the robust algorithm to estimate factors: default is "macro" |
g |
vector with estimated group membership for all individuals |
vars_est |
number of variables that are included in the algorithm and have their coefficient estimated. This is usually equal to vars. |
number_of_group_factors_local |
number of group factors to be estimated |
number_of_common_factors_local |
number of common factors to be estimated |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
factor_group |
estimated group specific factors |
lambda |
loadings of the estimated common factors |
comfactor |
estimated common factors |
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
verbose |
when TRUE, it prints messages |
NxT matrix containing the product of virtual groupfactors and virtual loadings
Calculates W = Y - X*beta_est. It is used in the initialization step of the algorithm, to initialise the factorstructures.
calculate_W(Y, X, beta_est, g, vars_est, method_estimate_beta)
calculate_W(Y, X, beta_est, g, vars_est, method_estimate_beta)
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with group membership for all individuals |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
NxT matrix
Calculates (the estimated value of) the matrix X*beta_est.
calculate_XB_estimated(X, beta_est, g, vars_est, method_estimate_beta, TT)
calculate_XB_estimated(X, beta_est, g, vars_est, method_estimate_beta, TT)
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
TT |
length of time series |
Returns a NxT matrix. If vars_est is set to 0, it returns NA.
Calculates the product of X*beta_true .
calculate_XB_true(X, beta_true, g, g_true, method_estimate_beta)
calculate_XB_true(X, beta_true, g, g_true, method_estimate_beta)
X |
X: NxTxp array containing the observable variables |
beta_true |
true coefficients of the observable variables |
g |
Vector with estimated group membership for all individuals |
g_true |
true group membership |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
Returns a NxT matrix (if method_estimate_beta == "individual"), and otherwise NA.
Calculates Z = Y - X*beta_est - LgFg. It is used in the estimate of the common factorstructure.
calculate_Z_common( Y, X, beta_est, g, lgfg_list, vars_est, kg, method_estimate_beta, method_estimate_factors, initialise = FALSE )
calculate_Z_common( Y, X, beta_est, g, lgfg_list, vars_est, kg, method_estimate_beta, method_estimate_factors, initialise = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with group membership for all individuals |
lgfg_list |
This is a list (length number of groups) containing FgLg for every group. |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
kg |
number of group specific factors to be estimated |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
method_estimate_factors |
defines method of robust estimaton of the factors: "macro", "pertmm" or "cz" |
initialise |
indicator of being in the initialisation phase |
NxT matrix
Calculates Z = Y - X*beta_est - LF. It is used to estimate the groupfactorstructure.
calculate_Z_group( Y, X, beta_est, g, lambda, comfactor, group, k, method_estimate_beta, initialise, vars_est )
calculate_Z_group( Y, X, beta_est, g, lambda, comfactor, group, k, method_estimate_beta, initialise, vars_est )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with group membership for all individuals |
lambda |
loadings of the estimated common factors |
comfactor |
estimated common factors |
group |
indexnumber of the group |
k |
number of common factors to be estimated |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
initialise |
indicator of being in the initialisation phase |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
NxT matrix
Checks the rules for stopping the algorithm, based on its convergence speed.
check_stopping_rules( iteration, speed, all_OF_values, speedlimit = 0.01, verbose = FALSE )
check_stopping_rules( iteration, speed, all_OF_values, speedlimit = 0.01, verbose = FALSE )
iteration |
number of iteration |
speed |
convergence speed |
all_OF_values |
vector containing the values of the objective function from previous iterations |
speedlimit |
if the convergence speed falls under this limit the algorithm stops |
verbose |
if TRUE, more information is printed |
logical
check_stopping_rules(4, 1.7, 5:1)
check_stopping_rules(4, 1.7, 5:1)
It starts with defining a robust location and scatter (based on Ma & Genton (2000): Highly robust estimation of the autocovariance function).
clustering_with_robust_distances(g, number_of_groups, Y)
clustering_with_robust_distances(g, number_of_groups, Y)
g |
Vector with group membership for all individuals |
number_of_groups |
number of groups |
Y |
Y: the panel data of interest |
numeric vector with the new clustering, now including class zero adjustments
Used to include cross-sectional dependence or serial dependence into the simulated panel data.
create_covMat_crosssectional_dependence(parameter, NN)
create_covMat_crosssectional_dependence(parameter, NN)
parameter |
amount of cross-sectional dependence |
NN |
number of time series |
NxN covariance matrix
The default has 3 groups with each 3 group specific factors. Further it contains 0 common factors and 3 observed variables. The output is a list where the first element is the simulated panel dataset (a dataframe with N (amount of time series) rows and T (length of time series) columns). The second element contains the NxTxp array with the p observed variables. The third element contains the true group membership. The fourth element contains the true beta's (this has p+1 rows and one column for each group). The fifth element contains a list with the true group specific factors. The sixth element contains a dataframe with N rows where each row contains the group specific factor loadings that corresponds to the group specific factors. Further it contains the true group membership and an index (this corresponds to the rownumber in Y and X). The seventh and eighth elements contain the true common factor(s) and its loadings respectively.
create_data_dgp2(N, TT, S_true = 3, vars = 3, k_true = 0, kg_true = c(3, 3, 3))
create_data_dgp2(N, TT, S_true = 3, vars = 3, k_true = 0, kg_true = c(3, 3, 3))
N |
number of time series |
TT |
length of time series |
S_true |
true number of groups |
vars |
number of available observed variables |
k_true |
true number of common_factors |
kg_true |
vector with the true number of group factors for each group |
list
create_data_dgp2(30, 10)
create_data_dgp2(30, 10)
Creates beta_true, which contains the true values of beta (= the coefficients of X)
create_true_beta( vars, NN, S_true, method_true_beta = "heterogeneous_groups", limit_true_groups = 12, extra_beta_factor = 1 )
create_true_beta( vars, NN, S_true, method_true_beta = "heterogeneous_groups", limit_true_groups = 12, extra_beta_factor = 1 )
vars |
number of observable variables |
NN |
number of time series |
S_true |
number of groups |
method_true_beta |
how the true values of beta are defined: "homogeneous" (equal for all individuals), "heterogeneous_groups" (equal within groups, and different between groups) or heterogeneous_individuals (different for all individuals) |
limit_true_groups |
Maximum number of true groups in a simulation-DGP for which the code in this package is implemented. Currently equals 12. For application on realworld data this parameter is not relevant. |
extra_beta_factor |
multiplies coefficients in beta_est; default = 1 |
matrix with number of rows equal to number of observable variables + 1 (the first row contains the intercept) and number of culumns equal to the true number of groups.
Defines the candidate values for C.
define_C_candidates()
define_C_candidates()
numeric vector
define_C_candidates()
define_C_candidates()
Constructs dataframe where the rows contains all configurations that are included and for which the estimators will be estimated.
define_configurations(S_cand, k_cand, kg_cand)
define_configurations(S_cand, k_cand, kg_cand)
S_cand |
candidates for S (number of groups) |
k_cand |
candidates for k (number of common factors) |
kg_cand |
candidates for kg (number of group specific factors) |
data.frame
define_configurations(2:4, 0, 2:3)
define_configurations(2:4, 0, 2:3)
Defines the set of combinations of group specific factors.
define_kg_candidates(S, kg_min, kg_max, nfv = TRUE, limit_est_groups = 20)
define_kg_candidates(S, kg_min, kg_max, nfv = TRUE, limit_est_groups = 20)
S |
number of estimated groups |
kg_min |
minimum value for number of group specific factors |
kg_max |
minimum value for number of group specific factors |
nfv |
logical; whether the number of group specific factors is allowed to change among the groups |
limit_est_groups |
maximum allowed number of groups that can be estimated |
Returns a data frame where each row contains the number of group specific factors for all the estimated groups. The number of columns is set to 20 (the current maximum amount of group that can be estimated)
define_kg_candidates(3, 2, 4)
define_kg_candidates(3, 2, 4)
Returns a vector with the indices of the subsets. Must start with zero.
define_number_subsets(n)
define_number_subsets(n)
n |
number of subsets |
numeric
define_number_subsets(3)
define_number_subsets(3)
This is a short version of define_object_for_initial_clustering() which only contains implementations for robust macropca case and classical case.
define_object_for_initial_clustering_macropca( Y, k, kg, comfactor, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", verbose = FALSE )
define_object_for_initial_clustering_macropca( Y, k, kg, comfactor, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", verbose = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
comfactor |
estimated common factors |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
method_estimate_factors |
specifies the robust algorithm to estimate factors: default is "macro". The value is not used when robust is set to FALSE. |
verbose |
when TRUE, it prints messages |
matrix with N rows and 10 columns
Robust updating of group membership is based on a rho function (instead of the non-robust quadratic function) on the norm of the errors. This requires parameters of location and scale. They are defined here (currently as median and madn). This function is applied on the estimated errors: Y - XB - FL - FgLg. This function is used in update_g().
define_rho_parameters(object = NULL)
define_rho_parameters(object = NULL)
object |
input |
list
Helpfunction in estimate_beta() for estimating beta_est.
determine_beta( string, X_special, Y_special, robust, NN, TT, S, method_estimate_beta, initialisation = FALSE, indices = NA, vars_est, sigma2, nosetting_local = FALSE, kappa_candidates = c(2^(-0:-20), 0) )
determine_beta( string, X_special, Y_special, robust, NN, TT, S, method_estimate_beta, initialisation = FALSE, indices = NA, vars_est, sigma2, nosetting_local = FALSE, kappa_candidates = c(2^(-0:-20), 0) )
string |
can have values: "homogeneous" (when one beta_est is estimated for all individuals together) or "heterogeneous" (when beta_est is estimated either groupwise or elementwise) |
X_special |
preprocessed X (2-dimensional matrix with 'var_est' observable variables) |
Y_special |
preprocessed Y |
robust |
robust or classical estimation |
NN |
number of time series |
TT |
length of time series |
S |
estimated number of groups |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
initialisation |
indicator of being in the initialisation phase |
indices |
individuals for which beta_est is being estimated |
vars_est |
number of available observed variables for which a coefficient will be estimated. As default it is equal to the number of available observed variables. |
sigma2 |
sum of squared error terms, scaled by NT |
nosetting_local |
option to remove the recommended setting in lmrob(). It is much faster. Defaults to FALSE. |
kappa_candidates |
Defines the size of the SCAD-penalty used in the classical algorithm. This vector should contain more than 1 element. |
The function returns a numeric vector (for the default setting: string == "heterogeneous") or a matrix with the estimated beta (if string == "homogeneous").
Uses the "almost classical lambda" (=matrix where the mean of each row is equal to the classical lambda) to create a robust lambda by using M estimation.
determine_robust_lambda( almost_classical_lambda, fastoption = TRUE, fastoption2 = FALSE )
determine_robust_lambda( almost_classical_lambda, fastoption = TRUE, fastoption2 = FALSE )
almost_classical_lambda |
matrix where the mean of each row is equal to the classical lambda |
fastoption |
Uses nlm() instead of optim(). This is faster. |
fastoption2 |
experimental parameter: can speed nlm() up (10%), but loses accuracy. May benefit from finetuning. |
M-estimator of location of the parameter, by minimizing sum of rho()
An example for df_results. This dataframe contains the estimators for each configuration.
df_results_example
df_results_example
Dataframe with 4 rows (one for each configuration) and 11 columns:
number of groups
number of common factors
number of group specific factors in group 1
number of group specific factors in group 2
number of group specific factors in group 3
estimated group membership
estimated beta
estimated group specific factors
estimated loadings to the group specific factors
estimated common factors
estimated loadings to the common factors
Helpfunction to shorten code: are common factors being estimated.
do_we_estimate_common_factors(k)
do_we_estimate_common_factors(k)
k |
number of common factors |
numeric: 0 or 1
Helpfunction to shorten code: are group factors being estimated.
do_we_estimate_group_factors(kg)
do_we_estimate_group_factors(kg)
kg |
number of group factors to be estimated |
numeric: 0 or 1
This function is a wrapper around the initialization and the estimation part of the algorithm, for one configuration. It is only used for the serialized algorithm.
estimate_algorithm(Y, X, S, k, kg, maxit = 30, robust = TRUE)
estimate_algorithm(Y, X, S, k, kg, maxit = 30, robust = TRUE)
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
maxit |
maximum limit for the number of iterations |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
list with
estimated beta
vector with group membership
matrix with the common factor(s) (contains zero's if there are none estimated)
loadings to the common factor(s)
list with the group specific factors for each of the groups
data.frame with loadings to the group specific factors augmented with group membership and id (to have the order of the time series)
set.seed(1) original_data <- create_data_dgp2(60, 30) Y <- original_data[[1]] X <- original_data[[2]] estimate_algorithm(Y, X, 3, 0, c(3,3,3), maxit = 2, robust = TRUE)
set.seed(1) original_data <- create_data_dgp2(60, 30) Y <- original_data[[1]] X <- original_data[[2]] estimate_algorithm(Y, X, 3, 0, c(3,3,3), maxit = 2, robust = TRUE)
Update step of algorithm to obtain new estimation for beta. Note that we call it beta_est because beta() exists in base R.
estimate_beta( Y, X, beta_est, g, lambda_group, factor_group, lambda, comfactor, method_estimate_beta = "individual", S, k, kg, vars_est, robust, num_factors_may_vary = TRUE, optimize_kappa = FALSE, nosetting = FALSE )
estimate_beta( Y, X, beta_est, g, lambda_group, factor_group, lambda, comfactor, method_estimate_beta = "individual", S, k, kg, vars_est, robust, num_factors_may_vary = TRUE, optimize_kappa = FALSE, nosetting = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
lambda_group |
loadings of the estimated group specific factors |
factor_group |
estimated group specific factors |
lambda |
loadings of the estimated common factors |
comfactor |
estimated common factors |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
num_factors_may_vary |
whether or not the number of groupfactors is constant over all groups or not |
optimize_kappa |
indicates if kappa has to be optimized or not (only relevant for the classical algorithm) |
nosetting |
option to remove the recommended setting in lmrob(). It is much faster. Defaults to FALSE. |
list: 1st element contains matrix (N columns: 1 for each time series of the panel data) with estimated beta_est's. If vars_est is set to 0, the list contains NA.
X <- X_dgp3 Y <- Y_dgp3 # Set estimations for group factors and its loadings, and group membership to the true value lambda_group <- lambda_group_true_dgp3 factor_group <- factor_group_true_dgp3 g <- g_true_dgp3 # There are no common factors to be estimated -> but needs placeholder lambda <- matrix(0, nrow = 1, ncol = 300) comfactor <- matrix(0, nrow = 1, ncol = 30) # # Choose how coefficients of the observable variables are estimated method_estimate_beta <- "individual" method_estimate_factors <- "macro" beta_est <- estimate_beta( Y, X, NA, g, lambda_group, factor_group, lambda, comfactor, S = 3, k = 0, kg = c(3, 3, 3), vars_est = 3, robust = TRUE )[[1]]
X <- X_dgp3 Y <- Y_dgp3 # Set estimations for group factors and its loadings, and group membership to the true value lambda_group <- lambda_group_true_dgp3 factor_group <- factor_group_true_dgp3 g <- g_true_dgp3 # There are no common factors to be estimated -> but needs placeholder lambda <- matrix(0, nrow = 1, ncol = 300) comfactor <- matrix(0, nrow = 1, ncol = 30) # # Choose how coefficients of the observable variables are estimated method_estimate_beta <- "individual" method_estimate_factors <- "macro" beta_est <- estimate_beta( Y, X, NA, g, lambda_group, factor_group, lambda, comfactor, S = 3, k = 0, kg = c(3, 3, 3), vars_est = 3, robust = TRUE )[[1]]
The estimator for F, see Anderson (1984), is equal to the first k eigenvectors (multiplied by sqrt(T) due to the restriction F'F/T = I) associated with first r largest eigenvalues of the matrix WW' (which is of size TxT).
estimate_factor( Y, X, beta_est, g, lgfg_list, k, kg, robust, method_estimate_beta, method_estimate_factors, initialise = FALSE, verbose = FALSE )
estimate_factor( Y, X, beta_est, g, lgfg_list, k, kg, robust, method_estimate_beta, method_estimate_factors, initialise = FALSE, verbose = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with group membership for all individuals |
lgfg_list |
This is a list (length number of groups) containing FgLg for every group. |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
method_estimate_factors |
defines method of robust estimaton of the factors: "macro", "pertmm" or "cz" |
initialise |
indicator of being in the initialisation phase |
verbose |
when TRUE, it prints messages |
Return a list. The first element contains the k x T matrix with the k estimated common factors. The second element contains either the robust MacroPCA-based loadings or NA.
Estimates group factors Fg.
estimate_factor_group( Y, X, beta_est, g, lambda, comfactor, factor_group, S, k, kg, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", initialise = FALSE, verbose = FALSE )
estimate_factor_group( Y, X, beta_est, g, lambda, comfactor, factor_group, S, k, kg, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", initialise = FALSE, verbose = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with group membership for all individuals |
lambda |
loadings of the estimated common factors |
comfactor |
estimated common factors |
factor_group |
estimated group specific factors |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
method_estimate_factors |
defines method of robust estimaton of the factors: "macro", "pertmm" or "cz" |
initialise |
indicator of being in the initialisation phase |
verbose |
when TRUE, it prints messages |
Returns a list with an element for each estimated group. Each element of the list is a matrix with the group specific factors as rows.
#example with data generated with DGP 2 data <- create_data_dgp2(30, 10) Y <- data[[1]] X <- data[[2]] g <- data[[3]] #true group membership set.seed(1) beta_est <- matrix(rnorm(4 * nrow(Y)), nrow = 4) factor_group <- data[[5]] #true values of group specific factors comfactor <- matrix(0, nrow = 1, ncol = ncol(Y)) lambda <- matrix(0, nrow = 1, ncol = nrow(Y)) estimate_factor_group(Y, X, beta_est, g, lambda, comfactor, factor_group, 3, 0, c(3, 3, 3), TRUE)
#example with data generated with DGP 2 data <- create_data_dgp2(30, 10) Y <- data[[1]] X <- data[[2]] g <- data[[3]] #true group membership set.seed(1) beta_est <- matrix(rnorm(4 * nrow(Y)), nrow = 4) factor_group <- data[[5]] #true values of group specific factors comfactor <- matrix(0, nrow = 1, ncol = ncol(Y)) lambda <- matrix(0, nrow = 1, ncol = nrow(Y)) estimate_factor_group(Y, X, beta_est, g, lambda, comfactor, factor_group, 3, 0, c(3, 3, 3), TRUE)
MacroPCA crashes Rstudio with certain dimensions of the input. Solve this by doubling every row. No information is added by this, so there is no influence on the end result, but crashes of Rstudio are evaded.
evade_crashes_macropca(object, verbose = FALSE)
evade_crashes_macropca(object, verbose = FALSE)
object |
input |
verbose |
prints messages |
matrix
Sets values that should be zero but are >0 (e.g. 1e-13) on zero.
evade_floating_point_errors(x, LIMIT = 1e-13)
evade_floating_point_errors(x, LIMIT = 1e-13)
x |
numeric input |
LIMIT |
limit under which value is set to 0 |
numeric
factor_group_true_dgp3 contains the values of the true group factors on which Y_dgp3 is based
factor_group_true_dgp3
factor_group_true_dgp3
list with length 3: each element has dimension 3 x 30
Fills in the optimized number of common factors for each C.
fill_rc(df, all_best_values, subset)
fill_rc(df, all_best_values, subset)
df |
input |
all_best_values |
data frame with the optimal number of groups, common factors and group specific factors |
subset |
index of the subsample |
data.frame
df_results <- add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))) #data.frame with one configuration all_best_values <- calculate_best_config(df_results, data.frame(t(1:5)), 1:5) rc <- fill_rc(initialise_rc(0:2, 1:5), all_best_values, 1)
df_results <- add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))) #data.frame with one configuration all_best_values <- calculate_best_config(df_results, data.frame(t(1:5)), 1:5) rc <- fill_rc(initialise_rc(0:2, 1:5), all_best_values, 1)
Fills in the optimized number of groups and group specific factors for each C.
fill_rcj(df, all_best_values, subset, S_cand, kg_cand)
fill_rcj(df, all_best_values, subset, S_cand, kg_cand)
df |
input |
all_best_values |
data frame with the optimal number of groups, common factors and group specific factors |
subset |
index of the subsample |
S_cand |
vector with candidate values for the number of estimated groups |
kg_cand |
vector with candidate values for the number of estimated group specific factors |
data.frame
df_results <- add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))) #data.frame with one configuration all_best_values <- calculate_best_config(df_results, data.frame(t(1:5)), 1:5) rcj <- fill_rcj(initialise_rcj(0:2, 1:5) , all_best_values, 1, 2:4, 2:4)
df_results <- add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))) #data.frame with one configuration all_best_values <- calculate_best_config(df_results, data.frame(t(1:5)), 1:5) rcj <- fill_rcj(initialise_rcj(0:2, 1:5) , all_best_values, 1, 2:4, 2:4)
Filters dataframe on the requested group specific factors configuration.
final_estimations_filter_kg(df, kg)
final_estimations_filter_kg(df, kg)
df |
input dataframe |
kg |
vector with number of group specific factors for each group, on which should be filtered |
data.frame
g_true_dgp3 contains the true group memberships of the elements of Y_dgp3
g_true_dgp3
g_true_dgp3
vector with 300 elements
table(g_true_dgp3)
table(g_true_dgp3)
Loadings and factors are generated by: factors ~ N(j * fgr_factor_mean, fgr_factor_sd) -> default case will be N(j, 1) loadings ~ N(lgr_factor_mean, j * lgr_factor_sd) -> default case will be N(0, j)
generate_grouped_factorstructure( S, kg_true, TT, g_true, lgr_factor_mean = 0, lgr_factor_sd = 1, fgr_factor_mean = 1, fgr_factor_sd = 1 )
generate_grouped_factorstructure( S, kg_true, TT, g_true, lgr_factor_mean = 0, lgr_factor_sd = 1, fgr_factor_mean = 1, fgr_factor_sd = 1 )
S |
true number of groups |
kg_true |
vector with as length the number of groups, where each element is the true number of groupfactors of that group. |
TT |
length of time series |
g_true |
vector of length NN with true group memberships |
lgr_factor_mean |
mean of the normal distribution from which the loadings are generated |
lgr_factor_sd |
sd of the normal distribution from which the loadings are generated (multiplied by a coefficient for each different group) |
fgr_factor_mean |
mean of the normal distribution from which the group specific factors are generated (multiplied by a coefficient for each different group) |
fgr_factor_sd |
sd of the normal distribution from which the group specific factors are generated |
list: first element contains the true group specific factors and the second element contains the corresponding loadings
Generate panel data Y for simulations.
generate_Y( NN, TT, k_true, kg_true, g_true, beta_true, lambda_group_true, factor_group_true, lambda_true, comfactor_true, eps, X )
generate_Y( NN, TT, k_true, kg_true, g_true, beta_true, lambda_group_true, factor_group_true, lambda_true, comfactor_true, eps, X )
NN |
number of time series |
TT |
length of time series |
k_true |
true number of common factors |
kg_true |
Vector of length the number of groups. Each element contains the true number of group factors for that group. |
g_true |
vector of length NN with true group memberships |
beta_true |
true coefficients of the observable variables |
lambda_group_true |
loadings of the true group specific factors |
factor_group_true |
true group specific factors |
lambda_true |
loadings of the true common factors |
comfactor_true |
true common factors |
eps |
NN x TT-matrix containing the error term |
X |
dataframe with the observed variables |
NN x TT matrix
Finds the first stable interval after the first unstable point. It then defines the value for C for the begin, middle and end of this interval.
get_best_configuration( list_vc, list_rc, list_rcj, C_candidates, S_cand, return_short = FALSE, verbose = FALSE )
get_best_configuration( list_vc, list_rc, list_rcj, C_candidates, S_cand, return_short = FALSE, verbose = FALSE )
list_vc |
list with resulting expression(VC^2) for each run |
list_rc |
list with resulting rc for each run |
list_rcj |
list with resulting rcj for each run |
C_candidates |
candidates for C |
S_cand |
candidates for S (number of groups) |
return_short |
if TRUE, the function returns the dataframe filtered for several specified potential candidates for C |
verbose |
when TRUE, it prints messages |
data.frame with the optimized configuration for each candidate C (if return_short is FALSE) and for each of the selected C's in the chosen stable interval (if return_short is TRUE).
set.seed(1) all_best_values <- calculate_best_config(add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))), data.frame(t(1:5)), 1:5) rc <- fill_rc(initialise_rc(0:1, 1:5), all_best_values, 0) rc <- fill_rc(rc, all_best_values, 1) rcj <- fill_rcj(initialise_rcj(0:1, 1:5) , all_best_values, 0, 2:4, 2:4) rcj <- fill_rcj(rcj, all_best_values, 1, 2:4, 2:4) get_best_configuration(sort(runif(5)), rc, rcj, 1:5, 2:4, return_short = FALSE)
set.seed(1) all_best_values <- calculate_best_config(add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))), data.frame(t(1:5)), 1:5) rc <- fill_rc(initialise_rc(0:1, 1:5), all_best_values, 0) rc <- fill_rc(rc, all_best_values, 1) rcj <- fill_rcj(initialise_rcj(0:1, 1:5) , all_best_values, 0, 2:4, 2:4) rcj <- fill_rcj(rcj, all_best_values, 1, 2:4, 2:4) get_best_configuration(sort(runif(5)), rc, rcj, 1:5, 2:4, return_short = FALSE)
Defines the convergence speed.
get_convergence_speed(iteration, of)
get_convergence_speed(iteration, of)
iteration |
number of iteration |
of |
objective function |
numeric if iteration > 3, otherwise NA
get_convergence_speed(5, 10:1)
get_convergence_speed(5, 10:1)
Function that returns the final clustering, based on the estimated number of groups and common and group specific factors.
get_final_estimation(df, opt_groups, k, kg, type, limit_est_groups = 20)
get_final_estimation(df, opt_groups, k, kg, type, limit_est_groups = 20)
df |
input dataframe (this will be df_results_full) |
opt_groups |
the optimal number of groups |
k |
the optimal number of common factors |
kg |
vector with the optimal number of group specific factors |
type |
defines which estimation to return: options are "clustering", "beta", "fg" (group specific factors), "lg" (loadings corresponding to fg), "f" (common factors), "l" (loadings corresponding to f), |
limit_est_groups |
maximum allowed number of groups that can be estimated |
This function returns the estimations of the chosen configuration. If type is "clustering" it returns a numeric vector with the estimated group membership for all time series. If type is "beta", "lg" the function returns a data.frame. If type is "f" or "l" the function also returns a data.frame. If no common factors were estimated in the optimized configuration, then NA is returned. If type is "fg" the function returns a list.
get_final_estimation(df_results_example, 3, 0, c(3, 3, 3), "clustering") get_final_estimation(df_results_example, 3, 0, c(3, 3, 3), "beta") get_final_estimation(df_results_example, 3, 0, c(3, 3, 3), "fg") get_final_estimation(df_results_example, 3, 0, c(3, 3, 3), "lg")
get_final_estimation(df_results_example, 3, 0, c(3, 3, 3), "clustering") get_final_estimation(df_results_example, 3, 0, c(3, 3, 3), "beta") get_final_estimation(df_results_example, 3, 0, c(3, 3, 3), "fg") get_final_estimation(df_results_example, 3, 0, c(3, 3, 3), "lg")
It is used in iterate().
grid_add_variables( grid, Y, X, beta_est, g, lambda, comfactor, method_estimate_beta, vars_est, S, limit_est_groups_heterogroups = 15 )
grid_add_variables( grid, Y, X, beta_est, g, lambda, comfactor, method_estimate_beta, vars_est, S, limit_est_groups_heterogroups = 15 )
grid |
dataframe containing values for X*beta_est and LF (product of common factor and its loadings) |
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
lambda |
loadings of the estimated common factors |
comfactor |
estimated common factors |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
S |
number of estimated groups |
limit_est_groups_heterogroups |
maximum amount of groups that can be estimated when method_estimate_beta is set to "group" |
data.frame
It handles possible thrown errors in MacroPCA.
handle_macropca_errors( object, temp, KMAX, number_eigenvectors, verbose = FALSE )
handle_macropca_errors( object, temp, KMAX, number_eigenvectors, verbose = FALSE )
object |
input |
temp |
this is the result of the trycatch block of using macropca on object |
KMAX |
parameter kmax in MacroPCA |
number_eigenvectors |
number of principal components that are needed |
verbose |
when TRUE, it prints messages |
matrix of which the columns contain the chosen amount of eigenvectors of object
Function with as input a dataframe. (this will be "Y" or "to_divide") It filters out rows with NA.
handleNA(df)
handleNA(df)
df |
input |
list with a dataframe where the rows with NA are filtered out, and a dataframe with only those rows
Removes NA's in LG (in function calculate_virtual_factor_and_lambda_group() )
handleNA_LG(df)
handleNA_LG(df)
df |
input |
matrix
Note: this needs to be called before the definition of grid.
initialise_beta( Y, X, S, robust, method_estimate_beta = "individual", nosetting_lmrob = FALSE )
initialise_beta( Y, X, S, robust, method_estimate_beta = "individual", nosetting_lmrob = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
dataframe with the observed variables |
S |
estimated number of groups |
robust |
robust or classical estimation |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
nosetting_lmrob |
option to remove the recommended setting in lmrob(). It is much faster. Defaults to FALSE. |
Matrix with number of rows equal to the number of estimated variables plus one. If method_estimate_beta is set to the default ("individual"), the number of columns is equal to the number of time series in Y. If method_estimate_beta is set to "group" or to "homogeneous" the number of columns is equal to the number of groups.
X <- X_dgp3 Y <- Y_dgp3 # Set estimations for group factors and its loadings, and group membership # to the true value for this example. lambda_group <- lambda_group_true_dgp3 factor_group <- factor_group_true_dgp3 beta_init <- initialise_beta(Y, X, S = 3, TRUE )
X <- X_dgp3 Y <- Y_dgp3 # Set estimations for group factors and its loadings, and group membership # to the true value for this example. lambda_group <- lambda_group_true_dgp3 factor_group <- factor_group_true_dgp3 beta_init <- initialise_beta(Y, X, S = 3, TRUE )
If a time series contains NA's a random cluster will be assigned to that time series.
initialise_clustering( Y, S, k, kg, comfactor, robust, max_percent_outliers_tkmeans = 0, verbose = FALSE )
initialise_clustering( Y, S, k, kg, comfactor, robust, max_percent_outliers_tkmeans = 0, verbose = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
S |
the desired number of groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
comfactor |
estimated common factors |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
max_percent_outliers_tkmeans |
the proportion of observations to be trimmed |
verbose |
when TRUE, it prints messages |
numeric vector
Y <- Y_dgp3 comfactor <- matrix(0, nrow = ncol(Y)) initialise_clustering(Y, 3, 0, c(3, 3, 3), comfactor, TRUE)
Y <- Y_dgp3 comfactor <- matrix(0, nrow = ncol(Y)) initialise_clustering(Y, 3, 0, c(3, 3, 3), comfactor, TRUE)
This is a short version of initialise_commonfactorstructure() which only contains implementations for the robust macropca case and the classical case.
initialise_commonfactorstructure_macropca( Y, X, beta_est, g, factor_group, k, kg, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", verbose = FALSE )
initialise_commonfactorstructure_macropca( Y, X, beta_est, g, factor_group, k, kg, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", verbose = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
dataframe with the observed variables |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
factor_group |
estimated group specific factors |
k |
number of estimated common factors |
kg |
vector with the number of estimated group specific factors |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
method_estimate_factors |
specifies the robust algorithm to estimate factors: default is "macro". The value is not used when robust is set to FALSE. |
verbose |
when TRUE, it prints messages |
list: 1st element contains the common factor(s) and the second element contains the factor loadings
set.seed(1) original_data <- create_data_dgp2(30, 20) Y <- original_data[[1]] X <- original_data[[2]] g <- original_data[[3]] beta_est <- matrix(rnorm(4 * ncol(Y)), nrow = 4) initialise_commonfactorstructure_macropca(Y, X, beta_est, g, NA, 0, c(3, 3, 3), TRUE)
set.seed(1) original_data <- create_data_dgp2(30, 20) Y <- original_data[[1]] X <- original_data[[2]] g <- original_data[[3]] beta_est <- matrix(rnorm(4 * ncol(Y)), nrow = 4) initialise_commonfactorstructure_macropca(Y, X, beta_est, g, NA, 0, c(3, 3, 3), TRUE)
Initialises a dataframe which will contain the PIC for each configuration and for each value of C.
initialise_df_pic(C_candidates)
initialise_df_pic(C_candidates)
C_candidates |
candidates for C (parameter in PIC) |
Returns an empty data.frame.
initialise_df_pic(1:10)
initialise_df_pic(1:10)
Initialises a dataframe that will contain an overview of metrics for each estimated configuration (for example adjusted randindex).
initialise_df_results(robust, limit_est_groups = 20)
initialise_df_results(robust, limit_est_groups = 20)
robust |
robust or classical estimation |
limit_est_groups |
maximum allowed number of groups that can be estimated |
Returns an empty data.frame.
initialise_df_results(TRUE)
initialise_df_results(TRUE)
This function initialises a data frame which will eventually be filled with the optimized number of common factors for each C and for each subset of the original dataset.
initialise_rc(indices_subset, C_candidates)
initialise_rc(indices_subset, C_candidates)
indices_subset |
all indices of the subsets |
C_candidates |
candidates for C (parameter in PIC) |
data.frame
initialise_rc(0:2, 1:5)
initialise_rc(0:2, 1:5)
This function initialises a data frame which will eventually be filled with the optimized number of groups and group specific factors for each C and for each subset of the original dataset.
initialise_rcj(indices_subset, C_candidates)
initialise_rcj(indices_subset, C_candidates)
indices_subset |
all indices of the subsets |
C_candidates |
candidates for C (parameter in PIC) |
data.frame
initialise_rcj(0:2, 1:5)
initialise_rcj(0:2, 1:5)
X is an array with dimensions N, T and number of observable variables. The variables are randomly generated with mean 0 and sd 1.
initialise_X(NN, TT, vars, scale_robust = TRUE)
initialise_X(NN, TT, vars, scale_robust = TRUE)
NN |
number of time series |
TT |
length of time series |
vars |
number of available observable variables |
scale_robust |
logical, defines if X will be scaled with robust metrics instead of with non-robust metrics |
array with dimensions N x T x number of observable variables
Wrapper around estimate_beta(), update_g(), and estimating the factorstructures.
iterate( Y, X, beta_est, g, lambda_group, factor_group, lambda, comfactor, S, k, kg, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", verbose = FALSE )
iterate( Y, X, beta_est, g, lambda_group, factor_group, lambda, comfactor, S, k, kg, robust, method_estimate_beta = "individual", method_estimate_factors = "macro", verbose = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
lambda_group |
loadings of the estimated group specific factors |
factor_group |
estimated group specific factors |
lambda |
loadings of the estimated common factors |
comfactor |
estimated common factors |
S |
number of groups to estimate |
k |
number of common factors to estimate |
kg |
vector with length S. Each element contains the number of group specific factors to estimate. |
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
method_estimate_factors |
specifies the robust algorithm to estimate factors: default is "macro". The value is not used when robust is set to FALSE. |
verbose |
when TRUE, it prints messages |
list with
estimated beta
vector with group membership
matrix with the common factor(s) (contains zero's if there are none estimated)
loadings to the common factor(s)
list with the group specific factors for each of the groups
data.frame with loadings to the group specific factors augmented with group membership and id (to have the order of the time series)
the value of the objective function
set.seed(1) original_data <- create_data_dgp2(30, 10) Y <- original_data[[1]] X <- original_data[[2]] g <- original_data[[3]] beta_est <- matrix(rnorm(4 * ncol(Y)), nrow = 4) factor_group <- original_data[[5]] lambda_group <- original_data[[6]] comfactor <- matrix(0, nrow = 1, ncol = ncol(Y)) lambda <- matrix(0, nrow = 1, ncol = nrow(Y)) iterate(Y, X, beta_est, g, lambda_group, factor_group, lambda, comfactor, 3, 0, c(3, 3, 3), TRUE, verbose = FALSE)
set.seed(1) original_data <- create_data_dgp2(30, 10) Y <- original_data[[1]] X <- original_data[[2]] g <- original_data[[3]] beta_est <- matrix(rnorm(4 * ncol(Y)), nrow = 4) factor_group <- original_data[[5]] lambda_group <- original_data[[6]] comfactor <- matrix(0, nrow = 1, ncol = ncol(Y)) lambda <- matrix(0, nrow = 1, ncol = nrow(Y)) iterate(Y, X, beta_est, g, lambda_group, factor_group, lambda, comfactor, 3, 0, c(3, 3, 3), TRUE, verbose = FALSE)
Function that returns the set of combinations of groupfactors for which the algorithm needs to run.
kg_candidates_expand(S, kg_min, kg_max, limit_est_groups = 20)
kg_candidates_expand(S, kg_min, kg_max, limit_est_groups = 20)
S |
number of groups |
kg_min |
minimum value for number of group specific factors |
kg_max |
minimum value for number of group specific factors |
limit_est_groups |
maximum allowed number of groups that can be estimated |
data.frame where each row contains a possible combination of group specific factors for each of the groups
lambda_group_true_dgp3 contains the values of the loadings to the group factors on which Y_dgp3 is based
lambda_group_true_dgp3
lambda_group_true_dgp3
dataframe with 300 rows. The first 3 columns are the loadings to the factors. The 4th column contains group membership. The fifth column contains an id of the individuals.
Desgined to make sure the following error does not happen anymore: Error in if (init$scale == 0) : missing value where TRUE/FALSE needed. KS2014 is the recommended setting (use "nosetting = FALSE").
LMROB(parameter_y, parameter_x, nointercept = FALSE, nosetting = FALSE)
LMROB(parameter_y, parameter_x, nointercept = FALSE, nosetting = FALSE)
parameter_y |
dependent variable in regression |
parameter_x |
independent variables in regression |
nointercept |
if TRUE it performs regression without an intercept |
nosetting |
option to remove the recommended setting in lmrob(). It is much faster. Defaults to FALSE. |
An object of class lmrob. If something went wrong it returns an object of class error.
Makes a dataframe with the PIC for each configuration and each candidate C.
make_df_pic_parallel(x, C_candidates)
make_df_pic_parallel(x, C_candidates)
x |
output of the parallel version of the algorithm |
C_candidates |
candidates for C |
data.frame
Makes a dataframe with information on each configuration.
make_df_results_parallel(x, limit_est_groups = 20)
make_df_results_parallel(x, limit_est_groups = 20)
x |
output of the parallel version of the algorithm |
limit_est_groups |
maximum allowed number of groups that can be estimated |
data.frame
Selects a subsample of the time series, and of the length of the time series. Based on this it returns a list with a subsample of Y, the corresponding subsample of X and of the true group membership and factorstructures if applicable.
make_subsamples(original_data, subset, verbose = TRUE)
make_subsamples(original_data, subset, verbose = TRUE)
original_data |
list containing the true data: Y, X, g_true, beta_true, factor_group_true, lambda_group_true, comfactor_true, lambda_true |
subset |
index of the subsample: this defines how many times stepsize_N is subtracted from the original N time series. Similar for stepsize_T. |
verbose |
when TRUE, it prints messages |
Y, X, g_true, comfactor_true, lambda_true, factor_group_true, lambda_group_true, sampleN, sampleT The output is a list where the first element is a subset of the panel dataset. The second element contains a subsetted 3D-array with the p observed variables. The third element contains the subsetted true group membership. The fourth and fifth elements contain the subsetted true common factor(s) and its loadings respectively. The sixth element contains a list with the subsetted true group specific factors. The seventh element contains a dataframe where each row contains the group specific factor loadings that corresponds to the group specific factors. The eighth and ninth element contain the indices of N and T respectively, which were used to create the subsets.
set.seed(1) original_data <- create_data_dgp2(30, 10) make_subsamples(original_data, 1)
set.seed(1) original_data <- create_data_dgp2(30, 10) make_subsamples(original_data, 1)
Function to calculate the norm of a matrix.
matrixnorm(mat)
matrixnorm(mat)
mat |
input matrix |
numeric
Helpfunction in OF_vectorized3()
OF_vectorized_helpfunction3( i, t, XBETA, LF, group_memberships, lgfg_list, Y, kg )
OF_vectorized_helpfunction3( i, t, XBETA, LF, group_memberships, lgfg_list, Y, kg )
i |
index of individual |
t |
index of time |
XBETA |
matrixproduct of X and beta_est |
LF |
matrixproduct of common factors and its loadings |
group_memberships |
Vector with group membership for all individuals |
lgfg_list |
product of groupfactors and their loadings; list with length the number of groups |
Y |
Y: NxT dataframe with the panel data of interest |
kg |
vector containing the number of group factors to be estimated for all groups |
numeric: contains the contribution to the objective function of one timepoint for one time series
Calculates objective function for the classical algorithm: used in iterate() and in local_search.
OF_vectorized3( NN, TT, g, grid, Y, beta_est, lc, fc, lg, fg, S, k, kg, method_estimate_beta, num_factors_may_vary = TRUE )
OF_vectorized3( NN, TT, g, grid, Y, beta_est, lc, fc, lg, fg, S, k, kg, method_estimate_beta, num_factors_may_vary = TRUE )
NN |
number of time series |
TT |
length of time series |
g |
Vector with group membership for all individuals |
grid |
dataframe containing the matrix multiplications XB, FgLg and FL |
Y |
Y: NxT dataframe with the panel data of interest |
beta_est |
estimated values of beta |
lc |
loadings of estimated common factors |
fc |
estimated common factors |
lg |
estimated grouploadings |
fg |
estimated groupfactors |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
num_factors_may_vary |
whether or not the number of groupfactors is constant over all groups or not |
numeric value of the objective function
Wrapper of the loop over the subsets which in turn use the parallelised algorithm.
parallel_algorithm( original_data, indices_subset, S_cand, k_cand, kg_cand, C_candidates, robust = TRUE, USE_DO = FALSE, choice_pic = "pic2022", maxit = 30 )
parallel_algorithm( original_data, indices_subset, S_cand, k_cand, kg_cand, C_candidates, robust = TRUE, USE_DO = FALSE, choice_pic = "pic2022", maxit = 30 )
original_data |
list containing the original data (1: Y, 2: X) |
indices_subset |
vector with indices of the subsets; starts with zero |
S_cand |
candidates for S (number of groups) |
k_cand |
candidates for k (number of common factors) |
kg_cand |
candidates for kg (number of group specific factors) |
C_candidates |
candidates for C |
robust |
robust or classical estimation |
USE_DO |
(for testing purposes) if TRUE, then a serialized version is performed ("do" instead of "dopar") |
choice_pic |
indicates which PIC to use to estimate the number of groups and factors. Options are "pic2017" (PIC of Ando and Bai (2017); works better for large N), "pic2016" (Ando and Bai (2016); works better for large T) weighs the fourth term with an extra factor relative to the size of the groups, and "pic2022" which shrinks the NT-space where the number of groups and factors would be over- or underestimated compared to pic2016 and pic2017. This is the default. This parameter can also be a vector with multiple pic's. |
maxit |
maximum limit for the number of iterations for each configuration; defaults to 30 |
Returns a list with three elements.
Data.frame with the optimal number of common factors for each candidate C in the rows. Each column contains the results of one subset of the input data (the first row corresponds to the full dataset).
Data.frame with the optimal number of groups and group specific factors for each candidate C in the rows. The structure is the same as in the above. Each entry is of the form "1_2_3_NA". This is to be interpreted as 3 groups (three non NA values) where group 1 contains 1 group specific factor, group 2 contains 2 and group 3 contains 3.
Data.frame with information about each configuration in the rows.
#Using a small dataset as an example; this will generate several warnings due to its small size. #Note that this example is run sequentially instead of parallel, # and consequently will print some intermediate information in the console. #This example uses the classical algorithm instead of the robust algorithm # to limit its running time. set.seed(1) original_data <- create_data_dgp2(30, 10) #define the number of subsets used to estimate the optimal number of groups and factors indices_subset <- define_number_subsets(2) #define the candidate values for C (this is a parameter in the information criterium # used to estimate the optimal number of groups and factors) C_candidates <- define_C_candidates() S_cand <- 3:3 # vector with candidate number of groups k_cand <- 0:0 # vector with candidate number of common factors kg_cand <- 1:2 # vector with candidate number of group specific factors #excluding parallel part from this example #cl <- makeCluster(detectCores() - 1) #registerDoSNOW(cl) output <- parallel_algorithm(original_data, indices_subset, S_cand, k_cand, kg_cand, C_candidates, robust = FALSE, USE_DO = TRUE, maxit = 3) #stopCluster(cl)
#Using a small dataset as an example; this will generate several warnings due to its small size. #Note that this example is run sequentially instead of parallel, # and consequently will print some intermediate information in the console. #This example uses the classical algorithm instead of the robust algorithm # to limit its running time. set.seed(1) original_data <- create_data_dgp2(30, 10) #define the number of subsets used to estimate the optimal number of groups and factors indices_subset <- define_number_subsets(2) #define the candidate values for C (this is a parameter in the information criterium # used to estimate the optimal number of groups and factors) C_candidates <- define_C_candidates() S_cand <- 3:3 # vector with candidate number of groups k_cand <- 0:0 # vector with candidate number of common factors kg_cand <- 1:2 # vector with candidate number of group specific factors #excluding parallel part from this example #cl <- makeCluster(detectCores() - 1) #registerDoSNOW(cl) output <- parallel_algorithm(original_data, indices_subset, S_cand, k_cand, kg_cand, C_candidates, robust = FALSE, USE_DO = TRUE, maxit = 3) #stopCluster(cl)
Plots expression(VC^2) along with the corresponding number of groups (orange), common factors (darkblue) and group factors of the first group (lightblue).
plot_VCsquared( VC_squared, rc, rcj, C_candidates, S_cand, xlim_min = 0.001, xlim_max = 100, add_true_lines = FALSE, verbose = FALSE )
plot_VCsquared( VC_squared, rc, rcj, C_candidates, S_cand, xlim_min = 0.001, xlim_max = 100, add_true_lines = FALSE, verbose = FALSE )
VC_squared |
measure of variability in the optimal configuration between the subsets |
rc |
dataframe containg the numer of common factors for all candidate C's and all subsamples |
rcj |
dataframe containg the numer of groupfactors for all candidate C's and all subsamples |
C_candidates |
candidates for C (parameter in PIC) |
S_cand |
candidate numbers for the number of groups |
xlim_min |
starting point of the plot |
xlim_max |
end point of the plot |
add_true_lines |
if set to TRUE, for each C the true number of groups, common factors, and group specific factors of group 1 will be added to the plot |
verbose |
if TRUE, more details are printed |
A ggplot object.
set.seed(1) #requires filled in dataframes rc and rcj all_best_values <- calculate_best_config(add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))), data.frame(t(1:20)), 1:20) rc <- fill_rc(initialise_rc(0:1, 1:20), all_best_values, 0) rc <- fill_rc(rc, all_best_values, 1) rcj <- fill_rcj(initialise_rcj(0:1, 1:20) , all_best_values, 0, 2:4, 2:4) rcj <- fill_rcj(rcj, all_best_values, 1, 2:4, 2:4) plot_VCsquared(c(runif(9), 0, 0, runif(9)), rc, rcj, 1:20, 2:4)
set.seed(1) #requires filled in dataframes rc and rcj all_best_values <- calculate_best_config(add_configuration(initialise_df_results(TRUE), 3, 0, c(3, 3, 3, rep(NA, 17))), data.frame(t(1:20)), 1:20) rc <- fill_rc(initialise_rc(0:1, 1:20), all_best_values, 0) rc <- fill_rc(rc, all_best_values, 1) rcj <- fill_rcj(initialise_rcj(0:1, 1:20) , all_best_values, 0, 2:4, 2:4) rcj <- fill_rcj(rcj, all_best_values, 1, 2:4, 2:4) plot_VCsquared(c(runif(9), 0, 0, runif(9)), rc, rcj, 1:20, 2:4)
It contains options to use the classical or robust covmatrix or no covariance matrix at all.
prepare_for_robpca(object, NN, TT, option = 3)
prepare_for_robpca(object, NN, TT, option = 3)
object |
this is the object of which we may take the covariance matrix and then to perform robust PCA on |
NN |
N |
TT |
T |
option |
1 (robust covmatrix), 2 (classical covmatrix), 3 (no covmatrix) |
matrix
This package is about clustering time series in a robust manner. The method of Ando & Bai (Clustering Huge Number of Financial Time Series: A Panel Data Approach With High-Dimensional Predictors and Factor Structures) is extended to make it robust against contamination, a common issue with real world data. In this package the core functions for the robust approach are included. It also contains a simulated dataset (dataset_Y_dgp3).
Randomly reassign individual(s) if there are empty groups. This can happen if the total number of time series is low compared to the number of desired groups.
reassign_if_empty_groups(g, S_true, NN)
reassign_if_empty_groups(g, S_true, NN)
g |
Vector with group membership for all individuals |
S_true |
true number of groups |
NN |
number of time series |
numeric vector with the estimated group membership for all time series
Restructures X (which is an 3D-array of dimensions (N,T,p) to a 2D-matrix of dimension (NxT,p).
restructure_X_to_order_slowN_fastT(X, vars_est)
restructure_X_to_order_slowN_fastT(X, vars_est)
X |
input |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
The function returns a 2D-array, unless the input X is NA, in which case the output will be NA as well.
Uses the almost classical lambda (this is an object of which the mean equals to the classical lambda) to create a robust lambda by using M estimation
return_robust_lambdaobject( Y_like_object, group, type, g, NN, k, kg, comfactor_rrn, factor_group_rrn, verbose = FALSE )
return_robust_lambdaobject( Y_like_object, group, type, g, NN, k, kg, comfactor_rrn, factor_group_rrn, verbose = FALSE )
Y_like_object |
this is Y_ster or W or W_j |
group |
index of group |
type |
scalar which shows in which setting this function is used |
g |
vector with group memberships |
NN |
number of time series |
k |
number of common factors |
kg |
number of group factors |
comfactor_rrn |
estimated common factors |
factor_group_rrn |
estimatied group specific factors |
verbose |
when TRUE, it prints messages |
Nxk dataframe
Contains call to MacroPCA()
robustpca(object, number_eigenvectors, KMAX = 20, verbose_robustpca = FALSE)
robustpca(object, number_eigenvectors, KMAX = 20, verbose_robustpca = FALSE)
object |
input |
number_eigenvectors |
number of eigenvectors to extract |
KMAX |
The maximal number of principal components to compute. This is a parameter in cellWise::MacroPCA() |
verbose_robustpca |
when TRUE, it prints messages: used for testing (requires Matrix-package when set to TRUE) |
Notes:
Different values for kmax give different factors, but the product lambdafactor stays constant. Note that this number needs to be big enough, otherwise eigen() will be used. Variation in k does give different results for lambdafactor
MacroPCA() crashes with specific values of dim(object). For example when dim(object) = c(193,27). This is solved with evade_crashes_macropca(), for those problematic dimensions that are already encountered during tests.
list with as the first element the robust factors and as the second element the robust factor loadings
The function estimates beta, group membership and the common and group specific factorstructures for one configuration.
run_config(robust, config, C_candidates, Y, X, choice_pic, maxit = 30)
run_config(robust, config, C_candidates, Y, X, choice_pic, maxit = 30)
robust |
TRUE or FALSE: defines using the classical or robust algorithm to estimate beta |
config |
contains one configuration of groups and factors |
C_candidates |
candidates for C (parameter in PIC) |
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
choice_pic |
indicates which PIC to use to estimate the number of groups and factors: options are "pic2017" (uses the PIC of Ando and Bai (2017); works better for large N), "pic2016" (Ando and Bai (2016); works better for large T) weighs the fourth term with an extra factor relative to the size of the groups, and "pic2022" which shrinks the NT-space where the number of groups and factors would be over- or underestimated compared to pic2016 and pic2017. |
maxit |
maximum limit for the number of iterations |
list with the estimators and metrics for this configuration
Scaling of X.
scaling_X(X, firsttime, robust, vars)
scaling_X(X, firsttime, robust, vars)
X |
input |
firsttime |
Scaling before generating Y and before adding outliers: this is always with mean and sd. If this is FALSE, it indicates that we are using the function for a second time, after adding the outliers. In the robust case it uses median and MAD, otherwise again mean and sd. |
robust |
logical, scaling with robust metrics instead of with non-robust measures |
vars |
number of observable variables |
3D-array with the same dimensions as X
Helpfunction in update_g(), to calculate solve(FG x t(FG)) x FG
solveFG(TT, S, kg, factor_group, testing = FALSE)
solveFG(TT, S, kg, factor_group, testing = FALSE)
TT |
length of time series |
S |
number of groups |
kg |
vector with the estimated number of group specific factors for each group |
factor_group |
estimated group specific factors |
testing |
variable that determines if we are in 'testing phase'; defaults to FALSE (requires Matrix-package if set to TRUE) |
list: the number of elements in this list is equal to S (the number of groups). Each of the elements in this list has a number rows equal to the number of group specific factors, and TT columns.
Shows the configurations for potential C's of the first stable interval (beginpoint, middlepoint and endpoint)
tabulate_potential_C( df, runs, beginpoint, middlepoint_log, middlepoint, endpoint, S_cand )
tabulate_potential_C( df, runs, beginpoint, middlepoint_log, middlepoint, endpoint, S_cand )
df |
input dataframe |
runs |
number of panel data sets for which the algorithm has run. If larger than one, the median VC2 is used to determine C. |
beginpoint |
first C of the chosen stable interval |
middlepoint_log |
middle C (on a logscale) of the chosen stable interval |
middlepoint |
middle C of the chosen stable interval |
endpoint |
last C of the chosen stable interval |
S_cand |
candidate number for the number of groups |
data.frame
Function that estimates group membership.
update_g( Y, X, beta_est, g, factor_group, lambda, comfactor, S, k, kg, vars_est, robust, method_estimate_factors, method_estimate_beta, verbose = FALSE )
update_g( Y, X, beta_est, g, factor_group, lambda, comfactor, S, k, kg, vars_est, robust, method_estimate_factors, method_estimate_beta, verbose = FALSE )
Y |
Y: NxT dataframe with the panel data of interest |
X |
X: NxTxp array containing the observable variables |
beta_est |
estimated values of beta |
g |
Vector with estimated group membership for all individuals |
factor_group |
estimated group specific factors |
lambda |
loadings of the estimated common factors |
comfactor |
estimated common factors |
S |
number of estimated groups |
k |
number of common factors to be estimated |
kg |
number of group specific factors to be estimated |
vars_est |
number of variables that will be included in the algorithm and have their coefficient estimated. This is usually equal to the number of observable variables. |
robust |
robust or classical estimation of group membership |
method_estimate_factors |
defines method of robust estimaton of the factors: "macro", "pertmm" or "cz" |
method_estimate_beta |
defines how beta is estimated. Default case is an estimated beta for each individual. Default value is "individual." Possible values are "homogeneous", "group" or "individual". |
verbose |
when TRUE, it prints messages |
Returns a list. The first element contains a vector with the estimated group membership for all time series. The second element contains the values which were used to determine the group membership. The third element is only relevant if method_estimate_factors is set to "cz" (non-default) and contains the group membership before moving some of the time series to class zero.
X <- X_dgp3 Y <- Y_dgp3 # Set estimations for group factors and its loadings, and group membership to the true value lambda_group <- lambda_group_true_dgp3 factor_group <- factor_group_true_dgp3 g_true <- g_true_dgp3 # true values of group membership g <- g_true # estimated values of group membership; set in this example to be equal to true values # There are no common factors to be estimated -> use placeholder with values set to zero lambda <- matrix(0, nrow = 1, ncol = 300) comfactor <- matrix(0, nrow = 1, ncol = 30) # Choose how coefficients of the observable are estimated beta_est <- estimate_beta( Y, X, NA, g, lambda_group, factor_group, lambda, comfactor, S = 3, k = 0, kg = c(3, 3, 3), vars_est = 3, robust = TRUE )[[1]] g_new <- update_g( Y, X, beta_est, g, factor_group, lambda, comfactor, S = 3, k = 0, kg = c(3, 3, 3), vars_est = 3, robust = TRUE, "macro", "individual" )[[1]]
X <- X_dgp3 Y <- Y_dgp3 # Set estimations for group factors and its loadings, and group membership to the true value lambda_group <- lambda_group_true_dgp3 factor_group <- factor_group_true_dgp3 g_true <- g_true_dgp3 # true values of group membership g <- g_true # estimated values of group membership; set in this example to be equal to true values # There are no common factors to be estimated -> use placeholder with values set to zero lambda <- matrix(0, nrow = 1, ncol = 300) comfactor <- matrix(0, nrow = 1, ncol = 30) # Choose how coefficients of the observable are estimated beta_est <- estimate_beta( Y, X, NA, g, lambda_group, factor_group, lambda, comfactor, S = 3, k = 0, kg = c(3, 3, 3), vars_est = 3, robust = TRUE )[[1]] g_new <- update_g( Y, X, beta_est, g, factor_group, lambda, comfactor, S = 3, k = 0, kg = c(3, 3, 3), vars_est = 3, robust = TRUE, "macro", "individual" )[[1]]
The dataset X_dgp3 contains the values of the 3 observable variables on which Y_dgp3 is based.
X_dgp3
X_dgp3
array with 300 x 30 x 3 elements
head(X_dgp3[,,1]) hist(X_dgp3[,,1])
head(X_dgp3[,,1]) hist(X_dgp3[,,1])
Y = XB + LgFg. It has 3 groups and each group has 3 groupfactors. At last there were 3 observable variables generated into it.
Y_dgp3
Y_dgp3
300 x 30 matrix. Each row is one time series.
plot(Y_dgp3[,1:2], col = g_true_dgp3, xlab = "First column of Y", ylab = "Second column of Y", main = "Plot of the first two columns of the dataset Y. \nColors are the true groups.")
plot(Y_dgp3[,1:2], col = g_true_dgp3, xlab = "First column of Y", ylab = "Second column of Y", main = "Plot of the first two columns of the dataset Y. \nColors are the true groups.")