CRAN Package Check Results for Package dataPreparation

Last updated on 2025-12-20 04:49:51 CET.

Flavor                             Version  Tinstall (s)  Tcheck (s)  Ttotal (s)  Status  Flags
r-devel-linux-x86_64-debian-clang  1.1.2           11.44      109.78      121.22  OK
r-devel-linux-x86_64-debian-gcc    1.1.2            8.28       74.49       82.77  ERROR
r-devel-linux-x86_64-fedora-clang  1.1.2           20.00      161.58      181.58  ERROR
r-devel-linux-x86_64-fedora-gcc    1.1.2           19.00      156.19      175.19  ERROR
r-devel-windows-x86_64             1.1.2           13.00      127.00      140.00  OK
r-patched-linux-x86_64             1.1.2           10.83       99.08      109.91  OK
r-release-linux-x86_64             1.1.2           10.91       98.52      109.43  OK
r-release-macos-arm64              1.1.2                                          OK
r-release-macos-x86_64             1.1.2            8.00      105.00      113.00  OK
r-release-windows-x86_64           1.1.2           13.00      124.00      137.00  OK
r-oldrel-macos-arm64               1.1.2                                          OK
r-oldrel-macos-x86_64              1.1.2            8.00       98.00      106.00  OK
r-oldrel-windows-x86_64            1.1.2           18.00      156.00      174.00  OK
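
The same summary can be pulled into an R session, which is handy for tracking when the ERROR flavors clear. A minimal sketch, assuming a recent R installation where the CRAN repository tools in the tools package (CRAN_check_results, summarize_CRAN_check_status) are available; the column names used below mirror this page's table and are an assumption about what those functions currently return:

    # Fetch CRAN check results for all packages, then keep dataPreparation.
    results <- tools::CRAN_check_results()
    subset(results, Package == "dataPreparation",
           select = c("Flavor", "Version", "Status"))

    # Compact per-package status summary across all flavors.
    tools::summarize_CRAN_check_status("dataPreparation")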

Check Details

Version: 1.1.2
Check: examples
Result: ERROR Running examples in ‘dataPreparation-Ex.R’ failed The error most likely occurred in: > base::assign(".ptime", proc.time(), pos = "CheckExEnv") > ### Name: build_encoding > ### Title: Compute encoding > ### Aliases: build_encoding > > ### ** Examples > > # Get a data set > data(adult) > encoding <- build_encoding(adult, cols = "auto", verbose = TRUE) [1] "age" "fnlwgt" "education_num" "capital_gain" [5] "capital_loss" "hr_per_week" [1] "build_encoding: c(\"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor or character i do nothing for those variables." [1] "build_encoding: I will compute encoding on 9 character and factor columns." [1] "build_encoding: it took me: 0s to compute encoding for 9 character and factor columns." > > print(encoding) $type_employer $type_employer$new_cols type_employer.? type_employer.Federal-gov "type.employer.." "type.employer.Federal.gov" type_employer.Local-gov type_employer.Never-worked "type.employer.Local.gov" "type.employer.Never.worked" type_employer.Private type_employer.Self-emp-inc "type.employer.Private" "type.employer.Self.emp.inc" type_employer.Self-emp-not-inc type_employer.State-gov "type.employer.Self.emp.not.inc" "type.employer.State.gov" type_employer.Without-pay "type.employer.Without.pay" $type_employer$values [1] "?" "Federal-gov" "Local-gov" "Never-worked" [5] "Private" "Self-emp-inc" "Self-emp-not-inc" "State-gov" [9] "Without-pay" $education $education$new_cols education.10th education.11th education.12th "education.10th" "education.11th" "education.12th" education.1st-4th education.5th-6th education.7th-8th "education.1st.4th" "education.5th.6th" "education.7th.8th" education.9th education.Assoc-acdm education.Assoc-voc "education.9th" "education.Assoc.acdm" "education.Assoc.voc" education.Bachelors education.Doctorate education.HS-grad "education.Bachelors" "education.Doctorate" "education.HS.grad" education.Masters education.Preschool education.Prof-school "education.Masters" "education.Preschool" "education.Prof.school" education.Some-college "education.Some.college" $education$values [1] "10th" "11th" "12th" "1st-4th" "5th-6th" [6] "7th-8th" "9th" "Assoc-acdm" "Assoc-voc" "Bachelors" [11] "Doctorate" "HS-grad" "Masters" "Preschool" "Prof-school" [16] "Some-college" $marital $marital$new_cols marital.Divorced marital.Married-AF-spouse "marital.Divorced" "marital.Married.AF.spouse" marital.Married-civ-spouse marital.Married-spouse-absent "marital.Married.civ.spouse" "marital.Married.spouse.absent" marital.Never-married marital.Separated "marital.Never.married" "marital.Separated" marital.Widowed "marital.Widowed" $marital$values [1] "Divorced" "Married-AF-spouse" "Married-civ-spouse" [4] "Married-spouse-absent" "Never-married" "Separated" [7] "Widowed" $occupation $occupation$new_cols occupation.? occupation.Adm-clerical "occupation.." 
"occupation.Adm.clerical" occupation.Armed-Forces occupation.Craft-repair "occupation.Armed.Forces" "occupation.Craft.repair" occupation.Exec-managerial occupation.Farming-fishing "occupation.Exec.managerial" "occupation.Farming.fishing" occupation.Handlers-cleaners occupation.Machine-op-inspct "occupation.Handlers.cleaners" "occupation.Machine.op.inspct" occupation.Other-service occupation.Priv-house-serv "occupation.Other.service" "occupation.Priv.house.serv" occupation.Prof-specialty occupation.Protective-serv "occupation.Prof.specialty" "occupation.Protective.serv" occupation.Sales occupation.Tech-support "occupation.Sales" "occupation.Tech.support" occupation.Transport-moving "occupation.Transport.moving" $occupation$values [1] "?" "Adm-clerical" "Armed-Forces" [4] "Craft-repair" "Exec-managerial" "Farming-fishing" [7] "Handlers-cleaners" "Machine-op-inspct" "Other-service" [10] "Priv-house-serv" "Prof-specialty" "Protective-serv" [13] "Sales" "Tech-support" "Transport-moving" $relationship $relationship$new_cols relationship.Husband relationship.Not-in-family "relationship.Husband" "relationship.Not.in.family" relationship.Other-relative relationship.Own-child "relationship.Other.relative" "relationship.Own.child" relationship.Unmarried relationship.Wife "relationship.Unmarried" "relationship.Wife" $relationship$values [1] "Husband" "Not-in-family" "Other-relative" "Own-child" [5] "Unmarried" "Wife" $race $race$new_cols race.Amer-Indian-Eskimo race.Asian-Pac-Islander race.Black "race.Amer.Indian.Eskimo" "race.Asian.Pac.Islander" "race.Black" race.Other race.White "race.Other" "race.White" $race$values [1] "Amer-Indian-Eskimo" "Asian-Pac-Islander" "Black" [4] "Other" "White" $sex $sex$new_cols sex.Female sex.Male "sex.Female" "sex.Male" $sex$values [1] "Female" "Male" $country $country$new_cols country.? country.Cambodia "country.." "country.Cambodia" country.Canada country.China "country.Canada" "country.China" country.Columbia country.Cuba "country.Columbia" "country.Cuba" country.Dominican-Republic country.Ecuador "country.Dominican.Republic" "country.Ecuador" country.El-Salvador country.England "country.El.Salvador" "country.England" country.France country.Germany "country.France" "country.Germany" country.Greece country.Guatemala "country.Greece" "country.Guatemala" country.Haiti country.Holand-Netherlands "country.Haiti" "country.Holand.Netherlands" country.Honduras country.Hong "country.Honduras" "country.Hong" country.Hungary country.India "country.Hungary" "country.India" country.Iran country.Ireland "country.Iran" "country.Ireland" country.Italy country.Jamaica "country.Italy" "country.Jamaica" country.Japan country.Laos "country.Japan" "country.Laos" country.Mexico country.Nicaragua "country.Mexico" "country.Nicaragua" country.Outlying-US(Guam-USVI-etc) country.Peru "country.Outlying.US.Guam.USVI.etc." "country.Peru" country.Philippines country.Poland "country.Philippines" "country.Poland" country.Portugal country.Puerto-Rico "country.Portugal" "country.Puerto.Rico" country.Scotland country.South "country.Scotland" "country.South" country.Taiwan country.Thailand "country.Taiwan" "country.Thailand" country.Trinadad&Tobago country.United-States "country.Trinadad.Tobago" "country.United.States" country.Vietnam country.Yugoslavia "country.Vietnam" "country.Yugoslavia" $country$values [1] "?" 
"Cambodia" [3] "Canada" "China" [5] "Columbia" "Cuba" [7] "Dominican-Republic" "Ecuador" [9] "El-Salvador" "England" [11] "France" "Germany" [13] "Greece" "Guatemala" [15] "Haiti" "Holand-Netherlands" [17] "Honduras" "Hong" [19] "Hungary" "India" [21] "Iran" "Ireland" [23] "Italy" "Jamaica" [25] "Japan" "Laos" [27] "Mexico" "Nicaragua" [29] "Outlying-US(Guam-USVI-etc)" "Peru" [31] "Philippines" "Poland" [33] "Portugal" "Puerto-Rico" [35] "Scotland" "South" [37] "Taiwan" "Thailand" [39] "Trinadad&Tobago" "United-States" [41] "Vietnam" "Yugoslavia" $income $income$new_cols income.<=50K income.>50K "income...50K" "income..50K" $income$values [1] "<=50K" ">50K" > > # To limit the number of generated columns, one can use min_frequency parameter: > build_encoding(adult, cols = "auto", verbose = TRUE, min_frequency = 0.1) [1] "age" "fnlwgt" "education_num" "capital_gain" [5] "capital_loss" "hr_per_week" [1] "build_encoding: c(\"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor or character i do nothing for those variables." [1] "build_encoding: I will compute encoding on 9 character and factor columns." Error in `[.data.table`(data_set, , `:=`(c("freq"), (.N/nrow(data_set))), : attempt access index 15/15 in VECTOR_ELT Calls: build_encoding -> [ -> [.data.table Execution halted Flavor: r-devel-linux-x86_64-debian-gcc

Version: 1.1.2
Check: tests
Result: ERROR Running ‘testthat.R’ [12s/13s] Running the tests in ‘tests/testthat.R’ failed. Complete output: > if (requireNamespace("testthat", quietly = TRUE)) { + library(testthat) + library(dataPreparation) + test_check("dataPreparation") + } dataPreparation 1.1.2 Type data_preparation_news() to see new features/changes/bug fixes. [1] "aggregate_by_key: I start to aggregate" [1] "aggregate_by_key: 6 columns have been constructed. It took 0.05 seconds. " [1] "find_and_transform_dates: It took me 0.74s to identify formats" [1] "find_and_transform_dates: It took me 0.12s to transform 4 columns to a Date format." [1] "find_and_transform_dates: It took me 0.01s to identify formats" [1] "find_and_transform_dates: There are no dates to transform.\n (If i missed something please provide the date format in inputs or\n consider using set_col_as_date to transform it)." [1] "identify_dates: column date_col seems to have an ambiguity, I try to solve it." [1] "V2" [1] "fast_discretization: V2 aren't columns of types numeric i do nothing for those variables." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_freq method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 0 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 0 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "equal_width_splits: constant_col can't provide 10 equal width bins; instead you will have 0 bins." [1] "fast_discretization: column constant_col seems to be constant, I do nothing." [1] "fast_discretization: it took me: 0s to build splits for 0 numeric columns." [1] "equal_width_splits: data_set can't provide 10 equal width bins; instead you will have 0 bins." [1] "equal_freq_splits: data_set can't provide 10 equal freq bins; instead you will have 2 bins." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will discretize 1 numeric columns using, bins." [1] "fast_discretization: it took me: 0s to transform 1 numeric columns into, binary columns." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor false_factor." [1] "un_factor: It took me 0s to un-factor 1 column(s)." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor true_factor." [1] "un_factor: I un-factor false_factor." [1] "un_factor: It took me 0s to un-factor 2 column(s)." [1] "fast_filter_variables: I check for constant columns." [1] "fast_filter_variables: I delete 1 constant column(s) in data_set." [1] "fast_filter_variables: I check for columns in double." [1] "fast_filter_variables: I delete 1 column(s) that are in double in data_set." [1] "fast_filter_variables: I check for columns that are bijections of another column." [1] "fast_filter_variables: I delete 3 column(s) that are bijections of another column in data_set." 
[1] "fast_filter_variables: I check for columns that are included in another column." [1] "fast_filter_variables: I delete 1 column(s) that are bijections of another column in data_set." [1] "string_column" [1] "fast_round: string_column aren't columns of types numeric or integer i do nothing for those variables." [1] "string_column" [1] "fast_round: string_column aren't columns of types numeric or integer i do nothing for those variables." Saving _problems/test_generate_from_character-13.R Saving _problems/test_generate_from_character-26.R Saving _problems/test_generate_from_character-40.R [1] "generate_factor_from_date: I will create a factor column from each date column." [1] "generate_factor_from_date: It took me 0s to transform 1 column(s)." [1] "ID" [1] "generate_date_diffs: ID aren't columns of types date i do nothing for those variables." [1] "generate_date_diffs: I will generate difference between dates." [1] "generate_date_diffs: It took me 0s to create 3 column(s)." [1] "date1" "date2" "date3" "date4" [5] "num1" "num2" "constant" "num3" [9] "age" "fnlwgt" "education_num" "capital_gain" [13] "capital_loss" "hr_per_week" [1] "generate_from_factor: c(\"date1\", \"date2\", \"date3\", \"date4\", \"num1\", \"num2\", \"constant\", \"num3\", \"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor i do nothing for those variables." Saving _problems/test_generate_from_factor-14.R Saving _problems/test_generate_from_factor-27.R [1] "one_hot_encoder: Since you didn't provide encoding, I compute them with build_encoding." [1] "build_encoding: I will compute encoding on 1 character and factor columns." [1] "build_encoding: it took me: 0s to compute encoding for 1 character and factor columns." [1] "one_hot_encoder: I will one hot encode some columns." [1] "one_hot_encoder: I am doing column: character_col" [1] "one_hot_encoder: It took me 0s to transform 1 column(s)." [1] "build_encoding: I will compute encoding on 1 character and factor columns." [1] "build_encoding: it took me: 0s to compute encoding for 1 character and factor columns." [1] "build_encoding: I will compute encoding on 1 character and factor columns." Saving _problems/test_generate_from_factor-80.R [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: grades." [1] "target_encode: Start to encode columns according to target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: grades." [1] "target_encode: Start to encode columns according to target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "real_cols: col_2 aren't columns of the table, i do nothing for those variables" [1] "col_2" [1] "real_cols: col_2 aren't columns of types numeric i do nothing for those variables." [1] "find_and_transform_numerics: It took me 0s to identify 2 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: It took me 0s to transform 2 column(s) to a numeric format." 
[1] "find_and_transform_numerics: It took me 0s to identify 0 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: There are no numerics to transform.(If i missed something consider using set_col_as_numeric to transform it)" V 1.1.2 (September 2025) ================== - DOC : - Update documentation according to new \link standards. - TECH : - Update CI tested R versions : removing 4.0 & 4.1, adding 4.4 and 4.5 V 1.1.1 (June 2023) ================== - FEAT: - Speed up examples by providing and using a `tiny_messy_adult` data set. - FIX: - Fix typos - TECH: - Speed up CI for MACOS V 1.1.0 ======= - FEAT: - Stop supporting R strictly before 3.6, and support R 4.2 and 4.3 - BUGFIX: - FIX documentation - TECH: - Upgrade package install in CI V 1.0.5 (July 2022) ================== FEAT: - New functions *compute_probability_ratio* and *compute_weight_of_evidence* to be used for target encoding - New function *get_most_frequent_element* to identify most frequent element in a list V 1.0.4 ======= BUGFIX: Fix *generate_from_character*, when there were some NAs in the column it would drop the line. It is not the case anymore. V 1.0.3 ======= BUGFIX: Fix bud on *fast_is_bijection* when column has multiple class FEAT: Harmonize logging levels between functions V 1.0.2 ======= Remove useless dependencies. Make sure library works on windows, macos, ubuntu, and R versions from 3.3 to 4.1. V 1.0.1 ======= Based on CRAN feedbacks removed problematic vignettes. V 1.0.0 ======= For this version 1.0.0 there are a lot of changes, and version is not compatible with previous version of the package. Also there might be some rework to do on code using previous version of this package (and we are sorry about it), we strongly believe that this version will be easier to use, faster, and more maintanable in time. In this version: - All function names and variables are snake_case (there used to be a mix of camel case and snake case) - We remove a lost of useless code that was slowing done the package (particularly garbage collection) - We made the code more readable so that it is easier to contribute to this package - Logging is more explicit and cleaner. - We took into account linting. - A few more functions are availables. We hope that you will like even more this new version of the package. Please don't hesitate to provide feedback, warn us about bug, suggest improvements or even better developp some improvements on this package. To do so please go to github (https://github.com/ELToulemonde/dataPreparation/). V 0.4.3 ======= - Fix : - In *same_shape*: there was a future bug due to change in class "matrix". Fixed it by implementing 2 functions to check class V 0.4.2 ======= - Fix test: - Case in *build_encoding*: min_frequency allows to drop rare values" was not built correctly. V 0.4.1 ======= - New features: - New functions: - Functions *target_encode* and *build_target_encoding* have been implemented to provide target encoding which is the process of replacing a categorical value with the aggregation of the target variable. - Function *remove_sd_outlier* helps to remove rows that have numerical values to extreme. - Function *remove_percentile_outlier* helps to remove rows that have numerical values to extreme (based on percentile analysis). - Function *remove_rare_categorical* helps to remove rows that have categorical values to rare. - New features in existing functions: - Function *prepare_set* integrate *target_encode* function. 
It is called by providing *target_col* and *target_encoding_functions*. V 0.4.0 ======= - New features: - New features in existing functions: - To avoid issues based on column names, we will check and rename columns that have same names. - In *aggregate_by_key* generated column names are changed to be more explicit. - In *aggregate_by_key* generated from character column with more than \code{thresh} values is now count of unique instead of count. - Added missing *auto* default values on cols - Bug fixes: - *which_are_bijection* and *which_are_in_double* are using *bi_col_test* which was not working with 2 column data set. It is fixed. - *prepare_set* optional argument *factor_date_type* was not working. It is fixed. - Other changes: - Changed *which_are_included* example since it was to slow for CRAN. Also it might be a little bit more explicit now. - Changed *aggregate_by_key* example since it was to slow for CRAN. - Integration: - Rewrite all tests to make them more readable - Code coverage is improved, dependencies on *messy_adult* set is lowered WARNING: - In *aggregate_by_key* generated column names are changed. - In *aggregate_by_key* generated column for character is different. V 0.3.9 ======= - Integration: - Matching new devtools requirements - Starting to rewrite unittest to make it more readable V 0.3.8 ======= - New features: - New features in existing functions: - Identification of bijection through internal function *fast_is_bijection* is way faster (up to 40 times faster in case of bijection). So *whichArebijection* and *fastFiltervariables* are also improved. - Remove remaining *gc* to save time. - In *one_hot_encoder* added parameter *type* to choose between logical or numerical results. V 0.3.7 ======= - New features: - New functions: - Function *as.POSIXct_fast* is now available. It helps to transform to POSIXct way faster (if the same date value is present multiple times in the column). - New features in existing functions: - In dates identifications, we make it faster by computing search of format only on unique values. - In date transformation, we made it faster by using *as.POSIXct_fast* when it is necessary. - Functions *findAndTransFormDates*, *find_and_transform_numerics* and *un_factor* now accept argument *cols* to limit search. - Bug fixes: - Control that over-allocate option is activated on every data.table to avoid issues with set. Package should be more robust. - In bijection search (internal function *fast_is_bijection*) there was a bug on some rare cases. Fixed but slower. -Code quality: - Improving code quality using lintr - Suppressing some useless code - Meeting new covr standard - Improve log of setColAsXXX V 0.3.6 ======= - Bug fixes: - *identify_dates* had a weird bug. Solved - Integration: - Making dataPreparation compatible with testthat 2.0.0 V 0.3.5 ======= - New features: - New features in existing functions: - *findAndTransFormDates* now as an *ambiguities* parameter, IGNORE to work as before, WARN to check for ambiguities and print them, SOLVE to try to solve ambiguities on more lines. - *one_hot_encoder* now uses a *build_encoding* functions to be able to build same encoding on train and on test. - *aggregate_by_key* is now way faster on numerics. But it changed the way it gets input functions. - *fast_scale* now as a *way* parameter which allow you to either scale or unscale. Unscaling numeric values can be very useful for most post-model analysis. - *set_col_as_date* now accept multiple formats in a single call. 
- New functions: - *build_encoding* build a list of encoding to be used by *one_hot_encoder*, it also has a parameter *min_frequency* to control that rare values doesn't result in new columns. - Previously private function *identify_dates* is now exported. To be able to perform same transformation on train and on test. - Adding *dataPreparationNews* function to open NEWS file (inspired from rfNews() of randomForest package) - Bug fixes: - *findAndTransFormDates*: bug fixed: user formats weren't used. - *identify_dates*: some formats where tested but would never work. They have been removed. - Refactoring: - Unit test partly reviewed to be more readable and more efficient. Unit test time as been divided by 3. - Improving input control for more robust functions WARNING: - *one_hot_encoder* now requires you to run *build_encoding* first. - *aggregate_by_key* now require functions to be passed by character name This version is making (as much as possible) transformation reproducible on train and test set. This is to prepare future pipeline feature. V 0.3.4 ======== - Improvement of function - *which_are_bijection*: It is 2 to 15 time faster than previous version. - *which_are_included*: It is a bit faster. - Bug fixes: - *generate_factor_from_date*: default value was missing. Fixed. - New features: - New features in existing functions: - *fast_filter_variables* has a new parameter (level) to choose which types of filtering to perform WARNING: - *which_are_included*: in case of bijection (col1 is a bijection of col2), they are both included in the other, but the choice of the one to drop might have changed in this version. V 0.3.3 ======== - New features: - New features in existing functions: - *findAndTransFormDates* now recognize date character even if there are multiple separator in date (ex: "2016, Jan-26"). - *findAndTransFormDates* now recognize date character even if there are leading and tailing white spaces. WARNING: - *date3* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even if there are leading and/or trailing white spaces. - *date4* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even if there are multiple separator. V 0.3.2 ======== - Change URLs to meet CRAN requirement v 0.3.1 ======= - Fix bug in Latex documentation v 0.3 ===== - New features: - New features in existing functions: - *findAndTransFormDates* now recognize date character even if "0" are not present in month or day part and month as lower strings. - *findAndTransFormDates* and *set_col_as_date* now work with *factors*. - New functions: - *fast_discretization*: to perform equal freq or equal width discretization on a data set using *data.table* power. - *fast_scale*: to perform scaling on a data set using *data.table* power. - *one_hot_encoder*: to perform one_hot encoding on a data set using *data.table* power. - New documentation: - A new vignette to illustrate how to build a correct *train* and *test* set using data preparation - Minor changes in log (in particular regarding progress bars and typos) - Due to dependencies issues with *tcltk*, we stop using it and start using *progress* - Refactoring: - Private function *real_cols* take more importance to control that columns have the correct types and handling " auto" value. 
- Making code faster: some functions are up to **30% faster** - Review unit testing to be faster - Unit test evolution to be more readable WARNING: - *date1* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even if "0" are not present in month or day part. v 0.2 ===== - Improving unit testing and code coverage - Improving documentation - Solving minor bug in date conversion and in which functions - New features: - New functions: - *un_factor* to un-factor columns, when reading wasn't performed in expected way. - *same_shape* to make ure that train and test set have exactly the same shape. - generate new columns from existing columns (generate functions) - generate factor from dates: *generate_factor_from_date* - diffDates becomes *generate_date_diffs* (for better name understanding). - generate numerics and booleans from character of factors (using *generate_from_factor* and *generate_from_character*) - *set_col_as_factor* a function to make multiple columns as factor and controlling number of unique elements - New features in existing functions: - which functions: add *keep_cols* argument to make sure that they are not dropped - fast_filter_variables: *verbose* can be T/F or 0, 1, 2 in order to control level of verbosity - *findAndTransFormDates* and *set_col_as_dates* now recognize and accept timestamp. WARNING: - If you were using *diffDates*, it is now called *generate_date_diffs* - *date2* column in *messy_adult* data set have changed in order to illustrate new timestamp features - *set_col_as_factorOrLogical* doesn't exist anymore: it as been split between *set_col_as_factor* and *generateFromCat* - Considering all those changes: *shape_set* and *prepare_set* don't give the same result anymore. v 0.1: release on CRAN July 2017 ================================ [1] "prepare_set: step one: correcting mistakes." [1] "fast_filter_variables: I check for constant columns." [1] "fast_filter_variables: I check for columns in double." [1] "fast_filter_variables: I check for columns that are bijections of another column." [1] "fast_filter_variables: I delete 1 column(s) that are bijections of another column in data_set." [1] "age" "fnlwgt" "capital_gain" "capital_loss" "hr_per_week" [1] "un_factor: c(\"age\", \"fnlwgt\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor i do nothing for those variables." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor education." [1] "un_factor: I un-factor occupation." [1] "un_factor: I un-factor country." [1] "un_factor: It took me 0s to un-factor 3 column(s)." [1] "find_and_transform_numerics: It took me 0s to identify 0 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: There are no numerics to transform.(If i missed something consider using set_col_as_numeric to transform it)" [1] "find_and_transform_dates: It took me 0.43s to identify formats" [1] "find_and_transform_dates: There are no dates to transform.\n (If i missed something please provide the date format in inputs or\n consider using set_col_as_date to transform it)." [1] "prepare_set: step two: transforming data_set." 
[1] "age" "type_employer" "fnlwgt" "education" [5] "marital" "occupation" "relationship" "race" [9] "sex" "capital_gain" "capital_loss" "hr_per_week" [13] "country" "income" [1] "prepare_set: c(\"age\", \"type_employer\", \"fnlwgt\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"country\", \"income\") aren't columns of types date i do nothing for those variables." [1] "generate_date_diffs: I will generate difference between dates." [1] "generate_date_diffs: It took me 0s to create 0 column(s)." [1] "generate_factor_from_date: I will create a factor column from each date column." [1] "generate_factor_from_date: It took me 0s to transform 0 column(s)." [1] "age" "type_employer" "fnlwgt" "marital" [5] "relationship" "race" "sex" "capital_gain" [9] "capital_loss" "hr_per_week" "income" [1] "prepare_set: c(\"age\", \"type_employer\", \"fnlwgt\", \"marital\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"income\") aren't columns of types character i do nothing for those variables." Saving _problems/test_prepare_set-15.R [1] "remove_sd_outlier: I start to filter categorical rare events" [1] "remove_sd_outlier: dropped 1 row(s) that are rare event on num_col." [1] "remove_sd_outlier: 1 have been dropped. It took 0 seconds. " [1] "remove_sd_outlier: I start to filter categorical rare events" [1] "remove_sd_outlier: dropped 0 row(s) that are rare event on num_col." [1] "remove_sd_outlier: 0 have been dropped. It took 0 seconds. " [1] "remove_rare_categorical: I start to filter categorical rare events" [1] "remove_rare_categorical: dropped 1 row(s) that are rare event on cat_col." [1] "remove_rare_categorical: 1 have been dropped. It took 0.03 seconds. " [1] "remove_percentile_outlier: I start to filter categorical rare events" [1] "remove_percentile_outlier: dropped 2 row(s) that are rare event on num_col." [1] "remove_percentile_outlier: 2 have been dropped. It took 0 seconds. " [1] "remove_percentile_outlier: I start to filter categorical rare events" [1] "remove_percentile_outlier: dropped 2 row(s) that are rare event on num_col." [1] "remove_percentile_outlier: 2 have been dropped. It took 0 seconds. " [1] "same_shape: verify that every column is present." [1] "same_shape: columns col_2 are missing, I create them." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_2 class was logical i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: the following columns are in data_set but not in reference_set: I drop them: " [1] "col_2" [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was character i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was character i set it to c(\"POSIXct\", \"POSIXt\")." 
[1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: col_1 class had different levels than in reference_set I change it." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: col_1 class had different levels than in reference_set I change it." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was numeric i set it to weird_class." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was numeric i set it to weird_class." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: columns type_employer?, type_employerFederal-gov, type_employerLocal-gov, type_employerNever-worked, type_employerPrivate, type_employerSelf-emp-inc, type_employerSelf-emp-not-inc, type_employerState-gov, type_employerWithout-pay, education11th, education12th, education1st-4th, education5th-6th, education7th-8th, education9th, educationAssoc-acdm, educationAssoc-voc, educationBachelors, educationDoctorate, educationHS-grad, educationMasters, educationPreschool, educationProf-school, educationSome-college, maritalMarried-AF-spouse, maritalMarried-civ-spouse, maritalMarried-spouse-absent, maritalNever-married, maritalSeparated, maritalWidowed, occupationAdm-clerical, occupationArmed-Forces, occupationCraft-repair, occupationExec-managerial, occupationFarming-fishing, occupationHandlers-cleaners, occupationMachine-op-inspct, occupationOther-service, occupationPriv-house-serv, occupationProf-specialty, occupationProtective-serv, occupationSales, occupationTech-support, occupationTransport-moving, relationshipNot-in-family, relationshipOther-relative, relationshipOwn-child, relationshipUnmarried, relationshipWife, raceAsian-Pac-Islander, raceBlack, raceOther, raceWhite, sexMale, capital_loss1408, capital_loss1564, capital_loss1573, capital_loss1719, capital_loss1762, capital_loss1887, capital_loss1902, capital_loss2042, capital_loss2179, countryCambodia, countryCanada, countryChina, countryColumbia, countryCuba, countryDominican-Republic, countryEcuador, countryEl-Salvador, countryEngland, countryFrance, countryGermany, countryGreece, countryGuatemala, countryHaiti, countryHoland-Netherlands, countryHonduras, countryHong, countryHungary, countryIndia, countryIran, countryIreland, countryItaly, countryJamaica, countryJapan, countryLaos, countryMexico, countryNicaragua, countryOutlying-US(Guam-USVI-etc), countryPeru, countryPhilippines, countryPoland, countryPortugal, countryPuerto-Rico, 
countryScotland, countrySouth, countryTaiwan, countryThailand, countryTrinadad&Tobago, countryUnited-States, countryVietnam, countryYugoslavia, income>50K are missing, I create them." [1] "same_shape: drop unwanted columns." [1] "same_shape: the following columns are in data_set but not in reference_set: I drop them: " [1] "type_employer" "education" "marital" "occupation" [5] "relationship" "race" "sex" "capital_loss" [9] "country" "income" [1] "same_shape: verify that every column is in the right type." [1] "same_shape: age class was integer i set it to numeric." [1] "same_shape: fnlwgt class was integer i set it to numeric." [1] "same_shape: education_num class was integer i set it to numeric." [1] "same_shape: capital_gain class was integer i set it to numeric." [1] "same_shape: hr_per_week class was integer i set it to numeric." [1] "same_shape: type_employer? class was logical i set it to numeric." [1] "same_shape: type_employerFederal-gov class was logical i set it to numeric." [1] "same_shape: type_employerLocal-gov class was logical i set it to numeric." [1] "same_shape: type_employerNever-worked class was logical i set it to numeric." [1] "same_shape: type_employerPrivate class was logical i set it to numeric." [1] "same_shape: type_employerSelf-emp-inc class was logical i set it to numeric." [1] "same_shape: type_employerSelf-emp-not-inc class was logical i set it to numeric." [1] "same_shape: type_employerState-gov class was logical i set it to numeric." [1] "same_shape: type_employerWithout-pay class was logical i set it to numeric." [1] "same_shape: education11th class was logical i set it to numeric." [1] "same_shape: education12th class was logical i set it to numeric." [1] "same_shape: education1st-4th class was logical i set it to numeric." [1] "same_shape: education5th-6th class was logical i set it to numeric." [1] "same_shape: education7th-8th class was logical i set it to numeric." [1] "same_shape: education9th class was logical i set it to numeric." [1] "same_shape: educationAssoc-acdm class was logical i set it to numeric." [1] "same_shape: educationAssoc-voc class was logical i set it to numeric." [1] "same_shape: educationBachelors class was logical i set it to numeric." [1] "same_shape: educationDoctorate class was logical i set it to numeric." [1] "same_shape: educationHS-grad class was logical i set it to numeric." [1] "same_shape: educationMasters class was logical i set it to numeric." [1] "same_shape: educationPreschool class was logical i set it to numeric." [1] "same_shape: educationProf-school class was logical i set it to numeric." [1] "same_shape: educationSome-college class was logical i set it to numeric." [1] "same_shape: maritalMarried-AF-spouse class was logical i set it to numeric." [1] "same_shape: maritalMarried-civ-spouse class was logical i set it to numeric." [1] "same_shape: maritalMarried-spouse-absent class was logical i set it to numeric." [1] "same_shape: maritalNever-married class was logical i set it to numeric." [1] "same_shape: maritalSeparated class was logical i set it to numeric." [1] "same_shape: maritalWidowed class was logical i set it to numeric." [1] "same_shape: occupationAdm-clerical class was logical i set it to numeric." [1] "same_shape: occupationArmed-Forces class was logical i set it to numeric." [1] "same_shape: occupationCraft-repair class was logical i set it to numeric." [1] "same_shape: occupationExec-managerial class was logical i set it to numeric." 
[1] "same_shape: occupationFarming-fishing class was logical i set it to numeric." [1] "same_shape: occupationHandlers-cleaners class was logical i set it to numeric." [1] "same_shape: occupationMachine-op-inspct class was logical i set it to numeric." [1] "same_shape: occupationOther-service class was logical i set it to numeric." [1] "same_shape: occupationPriv-house-serv class was logical i set it to numeric." [1] "same_shape: occupationProf-specialty class was logical i set it to numeric." [1] "same_shape: occupationProtective-serv class was logical i set it to numeric." [1] "same_shape: occupationSales class was logical i set it to numeric." [1] "same_shape: occupationTech-support class was logical i set it to numeric." [1] "same_shape: occupationTransport-moving class was logical i set it to numeric." [1] "same_shape: relationshipNot-in-family class was logical i set it to numeric." [1] "same_shape: relationshipOther-relative class was logical i set it to numeric." [1] "same_shape: relationshipOwn-child class was logical i set it to numeric." [1] "same_shape: relationshipUnmarried class was logical i set it to numeric." [1] "same_shape: relationshipWife class was logical i set it to numeric." [1] "same_shape: raceAsian-Pac-Islander class was logical i set it to numeric." [1] "same_shape: raceBlack class was logical i set it to numeric." [1] "same_shape: raceOther class was logical i set it to numeric." [1] "same_shape: raceWhite class was logical i set it to numeric." [1] "same_shape: sexMale class was logical i set it to numeric." [1] "same_shape: capital_loss1408 class was logical i set it to numeric." [1] "same_shape: capital_loss1564 class was logical i set it to numeric." [1] "same_shape: capital_loss1573 class was logical i set it to numeric." [1] "same_shape: capital_loss1719 class was logical i set it to numeric." [1] "same_shape: capital_loss1762 class was logical i set it to numeric." [1] "same_shape: capital_loss1887 class was logical i set it to numeric." [1] "same_shape: capital_loss1902 class was logical i set it to numeric." [1] "same_shape: capital_loss2042 class was logical i set it to numeric." [1] "same_shape: capital_loss2179 class was logical i set it to numeric." [1] "same_shape: countryCambodia class was logical i set it to numeric." [1] "same_shape: countryCanada class was logical i set it to numeric." [1] "same_shape: countryChina class was logical i set it to numeric." [1] "same_shape: countryColumbia class was logical i set it to numeric." [1] "same_shape: countryCuba class was logical i set it to numeric." [1] "same_shape: countryDominican-Republic class was logical i set it to numeric." [1] "same_shape: countryEcuador class was logical i set it to numeric." [1] "same_shape: countryEl-Salvador class was logical i set it to numeric." [1] "same_shape: countryEngland class was logical i set it to numeric." [1] "same_shape: countryFrance class was logical i set it to numeric." [1] "same_shape: countryGermany class was logical i set it to numeric." [1] "same_shape: countryGreece class was logical i set it to numeric." [1] "same_shape: countryGuatemala class was logical i set it to numeric." [1] "same_shape: countryHaiti class was logical i set it to numeric." [1] "same_shape: countryHoland-Netherlands class was logical i set it to numeric." [1] "same_shape: countryHonduras class was logical i set it to numeric." [1] "same_shape: countryHong class was logical i set it to numeric." [1] "same_shape: countryHungary class was logical i set it to numeric." 
[1] "same_shape: countryIndia class was logical i set it to numeric." [1] "same_shape: countryIran class was logical i set it to numeric." [1] "same_shape: countryIreland class was logical i set it to numeric." [1] "same_shape: countryItaly class was logical i set it to numeric." [1] "same_shape: countryJamaica class was logical i set it to numeric." [1] "same_shape: countryJapan class was logical i set it to numeric." [1] "same_shape: countryLaos class was logical i set it to numeric." [1] "same_shape: countryMexico class was logical i set it to numeric." [1] "same_shape: countryNicaragua class was logical i set it to numeric." [1] "same_shape: countryOutlying-US(Guam-USVI-etc) class was logical i set it to numeric." [1] "same_shape: countryPeru class was logical i set it to numeric." [1] "same_shape: countryPhilippines class was logical i set it to numeric." [1] "same_shape: countryPoland class was logical i set it to numeric." [1] "same_shape: countryPortugal class was logical i set it to numeric." [1] "same_shape: countryPuerto-Rico class was logical i set it to numeric." [1] "same_shape: countryScotland class was logical i set it to numeric." [1] "same_shape: countrySouth class was logical i set it to numeric." [1] "same_shape: countryTaiwan class was logical i set it to numeric." [1] "same_shape: countryThailand class was logical i set it to numeric." [1] "same_shape: countryTrinadad&Tobago class was logical i set it to numeric." [1] "same_shape: countryUnited-States class was logical i set it to numeric." [1] "same_shape: countryVietnam class was logical i set it to numeric." [1] "same_shape: countryYugoslavia class was logical i set it to numeric." [1] "same_shape: income>50K class was logical i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to scale 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to scale 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to unscale 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "set_col_as_numeric: I will set some columns as numeric" [1] "set_col_as_numeric: I am doing the column char_col_1." [1] "set_col_as_numeric: 0 NA have been created due to transformation to numeric." [1] "set_col_as_numeric: I am doing the column char_col_2." [1] "set_col_as_numeric: 0 NA have been created due to transformation to numeric." [1] "set_col_as_character: I will set some columns as character" [1] "set_col_as_character: I am doing the column numCol." [1] "set_col_as_character: I am doing the column factorCol." [1] "set_col_as_character: I am doing the column charcol." 
[1] "set_col_as_character: charcol is a character, i do nothing." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date1." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: I am doing the column date2." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0.02s to transform 2 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date2." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date1." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: it took me: 0s to transform 0 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: Since i generated only NAs i set ID as it was before." [1] "set_col_as_date: it took me: 0s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: ID doesn't seem to be a date, if it really is please provide format." [1] "set_col_as_date: it took me: 0s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time." [1] "set_col_as_date: it took me: 0s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time_stamp_s." [1] "set_col_as_date: it took me: 0s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time_stamp_ms." [1] "set_col_as_date: it took me: 0s to transform 1 column(s) to Dates." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: it took me: 0s to transform 1 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: it took me: 0s to transform 1 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: col has more than 2 values, i don't transform it." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." [1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." 
[1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." [1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming logical into binaries.\n" [1] "shape_set: Previous distribution of column types:" col_class_init logical 1 [1] "shape_set: Current distribution of column types:" col_class_end integer 1 [1] "which_are_constant: constantCol is constant." [1] "which_are_constant: it took me 0s to identify 1 constant column(s)" [1] "which_are_in_double: it took me 0s to identify 2 column(s) to drop." [1] "which_are_in_double: it took me 0s to identify 1 column(s) to drop." [1] "which_are_in_double: it took me 0s to identify 1 column(s) to drop." [1] "which_are_in_double: it took me 0s to identify 0 column(s) to drop." [1] "which_are_bijection: it took me 0.12s to identify 1 column(s) to drop." [1] "which_are_bijection: education is a bijection of education_num. I put it in drop list." [1] "which_are_bijection: it took me 0.02s to identify 1 column(s) to drop." [1] "which_are_bijection: it took me 0s to identify 0 column(s) to drop." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column education." [1] "which_are_included: are_50_or_more is included in column age." [1] "which_are_included: constant is included in column sex." [1] "which_are_included: sex is included in column fnlwgt." [1] "which_are_included: income is included in column id." [1] "which_are_included: race is included in column fnlwgt." [1] "which_are_included: relationship is included in column id." [1] "which_are_included: type_employer is included in column fnlwgt." [1] "which_are_included: marital is included in column id." [1] "which_are_included: occupation is included in column id." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column id." [1] "which_are_included: capital_gain is included in column fnlwgt." [1] "which_are_included: capital_loss is included in column fnlwgt." [1] "which_are_included: country is included in column fnlwgt." [1] "which_are_included: hr_per_week is included in column id." [1] "which_are_included: age is included in column id." [1] "which_are_included: mail is included in column id." [1] "which_are_included: date2 is included in column id." [1] "which_are_included: date1 is included in column id." [1] "which_are_included: date3 is included in column date4." [1] "which_are_included: date4 is included in column id." [1] "which_are_included: num1 is included in column num3." [1] "which_are_included: num3 is included in column id." [1] "which_are_included: num2 is included in column id." [1] "which_are_included: fnlwgt is included in column id." [1] "which_are_included: constant is included in column sex." [1] "which_are_included: sex is included in column fnlwgt." 
[1] "which_are_included: income is included in column id." [1] "which_are_included: race is included in column fnlwgt." [1] "which_are_included: relationship is included in column id." [1] "which_are_included: type_employer is included in column fnlwgt." [1] "which_are_included: marital is included in column id." [1] "which_are_included: occupation is included in column id." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column id." [1] "which_are_included: capital_gain is included in column fnlwgt." [1] "which_are_included: capital_loss is included in column fnlwgt." [1] "which_are_included: country is included in column fnlwgt." [1] "which_are_included: hr_per_week is included in column id." [1] "which_are_included: age is included in column id." [1] "which_are_included: mail is included in column id." [1] "which_are_included: date2 is included in column id." [1] "which_are_included: date1 is included in column id." [1] "which_are_included: date3 is included in column date4." [1] "which_are_included: date4 is included in column id." [1] "which_are_included: num1 is included in column num3." [1] "which_are_included: num3 is included in column id." [1] "which_are_included: num2 is included in column id." [1] "which_are_included: fnlwgt is included in column id." [ FAIL 7 | WARN 0 | SKIP 1 | PASS 322 ] ══ Skipped tests (1) ═══════════════════════════════════════════════════════════ • empty test (1): ══ Failed tests ════════════════════════════════════════════════════════════════ ── Error ('test_generate_from_character.R:13:5'): generate_from_character: don't drop so generate 3 new cols ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 3/3 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, cols = "character_col") at test_generate_from_character.R:13:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_character.R:26:5'): generate_from_character: drop generate 3 col and suppress one ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, drop = TRUE) at test_generate_from_character.R:26:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_character.R:40:5'): generate_from_character: don't reduce number of rows even with NA ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, cols = "character_col") at test_generate_from_character.R:40:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:14:5'): generate_from_factor: drop: functionnal test on reference set ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 25/25 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_factor(...) at test_generate_from_factor.R:14:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:27:5'): generate_from_factor: test don't drop => keep original col ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. 
└─dataPreparation::generate_from_factor(...) at test_generate_from_factor.R:27:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:80:5'): build_encoding: min_frequency allows to drop rare values ── Error in ``[.data.table`(data_set, , `:=`(c("freq"), (.N/nrow(data_set))), by = col)`: attempt access index 1/1 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::build_encoding(...) at test_generate_from_factor.R:80:5 2. ├─data_set[, `:=`(c("freq"), (.N/nrow(data_set))), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_prepare_set.R:14:5'): prepare_set: functionnal test: test full pipeline. Should give result with as many rows as unique key. ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 15/15 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::prepare_set(...) at test_prepare_set.R:14:5 2. └─dataPreparation::generate_from_character(...) 3. ├─data_set[, `:=`(c(new_col), .N), by = col] 4. └─data.table:::`[.data.table`(...) [ FAIL 7 | WARN 0 | SKIP 1 | PASS 322 ] Error: ! Test failures. Execution halted Flavor: r-devel-linux-x86_64-debian-gcc
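
All seven test failures above share one call shape: generate_from_character, generate_from_factor and build_encoding assign the per-group row count .N to a new column with a grouped `:=`, and on this r-devel build `[.data.table` stops with "attempt access index N/N in VECTOR_ELT". The sketch below reproduces only that call pattern, as named in the backtraces, so it can be run in isolation against a given R and data.table build; the table, column and new-column names are illustrative, and whether the error appears is build-dependent rather than something the sketch is expected to trigger everywhere.

library(data.table)

# Minimal stand-in for the grouped assignment the backtraces point at:
# a new column receives the per-group row count .N, grouped by a character column.
# Table, column and new-column names are illustrative, not the package's own.
data_set <- data.table(character_col = c("a", "a", "b", NA))
new_col <- "character_col.count"

# Same shape as the failing call `data_set[, `:=`(c(new_col), .N), by = col]`
data_set[, (new_col) := .N, by = "character_col"]

print(data_set)
# "a" rows get count 2, the "b" row gets 1, and the NA row forms its own group with count 1.
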

Version: 1.1.2
Check: examples
Result: ERROR Running examples in ‘dataPreparation-Ex.R’ failed The error most likely occurred in: > ### Name: build_encoding > ### Title: Compute encoding > ### Aliases: build_encoding > > ### ** Examples > > # Get a data set > data(adult) > encoding <- build_encoding(adult, cols = "auto", verbose = TRUE) [1] "age" "fnlwgt" "education_num" "capital_gain" [5] "capital_loss" "hr_per_week" [1] "build_encoding: c(\"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor or character i do nothing for those variables." [1] "build_encoding: I will compute encoding on 9 character and factor columns." [1] "build_encoding: it took me: 0.03s to compute encoding for 9 character and factor columns." > > print(encoding) $type_employer $type_employer$new_cols type_employer.? type_employer.Federal-gov "type.employer.." "type.employer.Federal.gov" type_employer.Local-gov type_employer.Never-worked "type.employer.Local.gov" "type.employer.Never.worked" type_employer.Private type_employer.Self-emp-inc "type.employer.Private" "type.employer.Self.emp.inc" type_employer.Self-emp-not-inc type_employer.State-gov "type.employer.Self.emp.not.inc" "type.employer.State.gov" type_employer.Without-pay "type.employer.Without.pay" $type_employer$values [1] "?" "Federal-gov" "Local-gov" "Never-worked" [5] "Private" "Self-emp-inc" "Self-emp-not-inc" "State-gov" [9] "Without-pay" $education $education$new_cols education.10th education.11th education.12th "education.10th" "education.11th" "education.12th" education.1st-4th education.5th-6th education.7th-8th "education.1st.4th" "education.5th.6th" "education.7th.8th" education.9th education.Assoc-acdm education.Assoc-voc "education.9th" "education.Assoc.acdm" "education.Assoc.voc" education.Bachelors education.Doctorate education.HS-grad "education.Bachelors" "education.Doctorate" "education.HS.grad" education.Masters education.Preschool education.Prof-school "education.Masters" "education.Preschool" "education.Prof.school" education.Some-college "education.Some.college" $education$values [1] "10th" "11th" "12th" "1st-4th" "5th-6th" [6] "7th-8th" "9th" "Assoc-acdm" "Assoc-voc" "Bachelors" [11] "Doctorate" "HS-grad" "Masters" "Preschool" "Prof-school" [16] "Some-college" $marital $marital$new_cols marital.Divorced marital.Married-AF-spouse "marital.Divorced" "marital.Married.AF.spouse" marital.Married-civ-spouse marital.Married-spouse-absent "marital.Married.civ.spouse" "marital.Married.spouse.absent" marital.Never-married marital.Separated "marital.Never.married" "marital.Separated" marital.Widowed "marital.Widowed" $marital$values [1] "Divorced" "Married-AF-spouse" "Married-civ-spouse" [4] "Married-spouse-absent" "Never-married" "Separated" [7] "Widowed" $occupation $occupation$new_cols occupation.? occupation.Adm-clerical "occupation.." 
"occupation.Adm.clerical" occupation.Armed-Forces occupation.Craft-repair "occupation.Armed.Forces" "occupation.Craft.repair" occupation.Exec-managerial occupation.Farming-fishing "occupation.Exec.managerial" "occupation.Farming.fishing" occupation.Handlers-cleaners occupation.Machine-op-inspct "occupation.Handlers.cleaners" "occupation.Machine.op.inspct" occupation.Other-service occupation.Priv-house-serv "occupation.Other.service" "occupation.Priv.house.serv" occupation.Prof-specialty occupation.Protective-serv "occupation.Prof.specialty" "occupation.Protective.serv" occupation.Sales occupation.Tech-support "occupation.Sales" "occupation.Tech.support" occupation.Transport-moving "occupation.Transport.moving" $occupation$values [1] "?" "Adm-clerical" "Armed-Forces" [4] "Craft-repair" "Exec-managerial" "Farming-fishing" [7] "Handlers-cleaners" "Machine-op-inspct" "Other-service" [10] "Priv-house-serv" "Prof-specialty" "Protective-serv" [13] "Sales" "Tech-support" "Transport-moving" $relationship $relationship$new_cols relationship.Husband relationship.Not-in-family "relationship.Husband" "relationship.Not.in.family" relationship.Other-relative relationship.Own-child "relationship.Other.relative" "relationship.Own.child" relationship.Unmarried relationship.Wife "relationship.Unmarried" "relationship.Wife" $relationship$values [1] "Husband" "Not-in-family" "Other-relative" "Own-child" [5] "Unmarried" "Wife" $race $race$new_cols race.Amer-Indian-Eskimo race.Asian-Pac-Islander race.Black "race.Amer.Indian.Eskimo" "race.Asian.Pac.Islander" "race.Black" race.Other race.White "race.Other" "race.White" $race$values [1] "Amer-Indian-Eskimo" "Asian-Pac-Islander" "Black" [4] "Other" "White" $sex $sex$new_cols sex.Female sex.Male "sex.Female" "sex.Male" $sex$values [1] "Female" "Male" $country $country$new_cols country.? country.Cambodia "country.." "country.Cambodia" country.Canada country.China "country.Canada" "country.China" country.Columbia country.Cuba "country.Columbia" "country.Cuba" country.Dominican-Republic country.Ecuador "country.Dominican.Republic" "country.Ecuador" country.El-Salvador country.England "country.El.Salvador" "country.England" country.France country.Germany "country.France" "country.Germany" country.Greece country.Guatemala "country.Greece" "country.Guatemala" country.Haiti country.Holand-Netherlands "country.Haiti" "country.Holand.Netherlands" country.Honduras country.Hong "country.Honduras" "country.Hong" country.Hungary country.India "country.Hungary" "country.India" country.Iran country.Ireland "country.Iran" "country.Ireland" country.Italy country.Jamaica "country.Italy" "country.Jamaica" country.Japan country.Laos "country.Japan" "country.Laos" country.Mexico country.Nicaragua "country.Mexico" "country.Nicaragua" country.Outlying-US(Guam-USVI-etc) country.Peru "country.Outlying.US.Guam.USVI.etc." "country.Peru" country.Philippines country.Poland "country.Philippines" "country.Poland" country.Portugal country.Puerto-Rico "country.Portugal" "country.Puerto.Rico" country.Scotland country.South "country.Scotland" "country.South" country.Taiwan country.Thailand "country.Taiwan" "country.Thailand" country.Trinadad&Tobago country.United-States "country.Trinadad.Tobago" "country.United.States" country.Vietnam country.Yugoslavia "country.Vietnam" "country.Yugoslavia" $country$values [1] "?" 
"Cambodia" [3] "Canada" "China" [5] "Columbia" "Cuba" [7] "Dominican-Republic" "Ecuador" [9] "El-Salvador" "England" [11] "France" "Germany" [13] "Greece" "Guatemala" [15] "Haiti" "Holand-Netherlands" [17] "Honduras" "Hong" [19] "Hungary" "India" [21] "Iran" "Ireland" [23] "Italy" "Jamaica" [25] "Japan" "Laos" [27] "Mexico" "Nicaragua" [29] "Outlying-US(Guam-USVI-etc)" "Peru" [31] "Philippines" "Poland" [33] "Portugal" "Puerto-Rico" [35] "Scotland" "South" [37] "Taiwan" "Thailand" [39] "Trinadad&Tobago" "United-States" [41] "Vietnam" "Yugoslavia" $income $income$new_cols income.<=50K income.>50K "income...50K" "income..50K" $income$values [1] "<=50K" ">50K" > > # To limit the number of generated columns, one can use min_frequency parameter: > build_encoding(adult, cols = "auto", verbose = TRUE, min_frequency = 0.1) [1] "age" "fnlwgt" "education_num" "capital_gain" [5] "capital_loss" "hr_per_week" [1] "build_encoding: c(\"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor or character i do nothing for those variables." [1] "build_encoding: I will compute encoding on 9 character and factor columns." Error in `[.data.table`(data_set, , `:=`(c("freq"), (.N/nrow(data_set))), : attempt access index 15/15 in VECTOR_ELT Calls: build_encoding -> [ -> [.data.table Execution halted Flavors: r-devel-linux-x86_64-fedora-clang, r-devel-linux-x86_64-fedora-gcc

Version: 1.1.2
Check: tests
Result: ERROR Running ‘testthat.R’ [21s/37s] Running the tests in ‘tests/testthat.R’ failed. Complete output: > if (requireNamespace("testthat", quietly = TRUE)) { + library(testthat) + library(dataPreparation) + test_check("dataPreparation") + } dataPreparation 1.1.2 Type data_preparation_news() to see new features/changes/bug fixes. [1] "aggregate_by_key: I start to aggregate" [1] "aggregate_by_key: 6 columns have been constructed. It took 0.03 seconds. " [1] "find_and_transform_dates: It took me 2.05s to identify formats" [1] "find_and_transform_dates: It took me 0.26s to transform 4 columns to a Date format." [1] "find_and_transform_dates: It took me 0.02s to identify formats" [1] "find_and_transform_dates: There are no dates to transform.\n (If i missed something please provide the date format in inputs or\n consider using set_col_as_date to transform it)." [1] "identify_dates: column date_col seems to have an ambiguity, I try to solve it." [1] "V2" [1] "fast_discretization: V2 aren't columns of types numeric i do nothing for those variables." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_freq method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 0 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 0 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "equal_width_splits: constant_col can't provide 10 equal width bins; instead you will have 0 bins." [1] "fast_discretization: column constant_col seems to be constant, I do nothing." [1] "fast_discretization: it took me: 0s to build splits for 0 numeric columns." [1] "equal_width_splits: data_set can't provide 10 equal width bins; instead you will have 0 bins." [1] "equal_freq_splits: data_set can't provide 10 equal freq bins; instead you will have 2 bins." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will discretize 1 numeric columns using, bins." [1] "fast_discretization: it took me: 0s to transform 1 numeric columns into, binary columns." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor false_factor." [1] "un_factor: It took me 0s to un-factor 1 column(s)." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor true_factor." [1] "un_factor: I un-factor false_factor." [1] "un_factor: It took me 0s to un-factor 2 column(s)." [1] "fast_filter_variables: I check for constant columns." [1] "fast_filter_variables: I delete 1 constant column(s) in data_set." [1] "fast_filter_variables: I check for columns in double." [1] "fast_filter_variables: I delete 1 column(s) that are in double in data_set." [1] "fast_filter_variables: I check for columns that are bijections of another column." [1] "fast_filter_variables: I delete 3 column(s) that are bijections of another column in data_set." 
[1] "fast_filter_variables: I check for columns that are included in another column." [1] "fast_filter_variables: I delete 1 column(s) that are bijections of another column in data_set." [1] "string_column" [1] "fast_round: string_column aren't columns of types numeric or integer i do nothing for those variables." [1] "string_column" [1] "fast_round: string_column aren't columns of types numeric or integer i do nothing for those variables." Saving _problems/test_generate_from_character-13.R Saving _problems/test_generate_from_character-26.R Saving _problems/test_generate_from_character-40.R [1] "generate_factor_from_date: I will create a factor column from each date column." [1] "generate_factor_from_date: It took me 0s to transform 1 column(s)." [1] "ID" [1] "generate_date_diffs: ID aren't columns of types date i do nothing for those variables." [1] "generate_date_diffs: I will generate difference between dates." [1] "generate_date_diffs: It took me 0s to create 3 column(s)." [1] "date1" "date2" "date3" "date4" [5] "num1" "num2" "constant" "num3" [9] "age" "fnlwgt" "education_num" "capital_gain" [13] "capital_loss" "hr_per_week" [1] "generate_from_factor: c(\"date1\", \"date2\", \"date3\", \"date4\", \"num1\", \"num2\", \"constant\", \"num3\", \"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor i do nothing for those variables." Saving _problems/test_generate_from_factor-14.R Saving _problems/test_generate_from_factor-27.R [1] "one_hot_encoder: Since you didn't provide encoding, I compute them with build_encoding." [1] "build_encoding: I will compute encoding on 1 character and factor columns." [1] "build_encoding: it took me: 0.01s to compute encoding for 1 character and factor columns." [1] "one_hot_encoder: I will one hot encode some columns." [1] "one_hot_encoder: I am doing column: character_col" [1] "one_hot_encoder: It took me 0s to transform 1 column(s)." [1] "build_encoding: I will compute encoding on 1 character and factor columns." [1] "build_encoding: it took me: 0s to compute encoding for 1 character and factor columns." [1] "build_encoding: I will compute encoding on 1 character and factor columns." Saving _problems/test_generate_from_factor-80.R [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: grades." [1] "target_encode: Start to encode columns according to target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: grades." [1] "target_encode: Start to encode columns according to target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "real_cols: col_2 aren't columns of the table, i do nothing for those variables" [1] "col_2" [1] "real_cols: col_2 aren't columns of types numeric i do nothing for those variables." [1] "find_and_transform_numerics: It took me 0s to identify 2 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: It took me 0s to transform 2 column(s) to a numeric format." 
[1] "find_and_transform_numerics: It took me 0s to identify 0 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: There are no numerics to transform.(If i missed something consider using set_col_as_numeric to transform it)" V 1.1.2 (September 2025) ================== - DOC : - Update documentation according to new \link standards. - TECH : - Update CI tested R versions : removing 4.0 & 4.1, adding 4.4 and 4.5 V 1.1.1 (June 2023) ================== - FEAT: - Speed up examples by providing and using a `tiny_messy_adult` data set. - FIX: - Fix typos - TECH: - Speed up CI for MACOS V 1.1.0 ======= - FEAT: - Stop supporting R strictly before 3.6, and support R 4.2 and 4.3 - BUGFIX: - FIX documentation - TECH: - Upgrade package install in CI V 1.0.5 (July 2022) ================== FEAT: - New functions *compute_probability_ratio* and *compute_weight_of_evidence* to be used for target encoding - New function *get_most_frequent_element* to identify most frequent element in a list V 1.0.4 ======= BUGFIX: Fix *generate_from_character*, when there were some NAs in the column it would drop the line. It is not the case anymore. V 1.0.3 ======= BUGFIX: Fix bud on *fast_is_bijection* when column has multiple class FEAT: Harmonize logging levels between functions V 1.0.2 ======= Remove useless dependencies. Make sure library works on windows, macos, ubuntu, and R versions from 3.3 to 4.1. V 1.0.1 ======= Based on CRAN feedbacks removed problematic vignettes. V 1.0.0 ======= For this version 1.0.0 there are a lot of changes, and version is not compatible with previous version of the package. Also there might be some rework to do on code using previous version of this package (and we are sorry about it), we strongly believe that this version will be easier to use, faster, and more maintanable in time. In this version: - All function names and variables are snake_case (there used to be a mix of camel case and snake case) - We remove a lost of useless code that was slowing done the package (particularly garbage collection) - We made the code more readable so that it is easier to contribute to this package - Logging is more explicit and cleaner. - We took into account linting. - A few more functions are availables. We hope that you will like even more this new version of the package. Please don't hesitate to provide feedback, warn us about bug, suggest improvements or even better developp some improvements on this package. To do so please go to github (https://github.com/ELToulemonde/dataPreparation/). V 0.4.3 ======= - Fix : - In *same_shape*: there was a future bug due to change in class "matrix". Fixed it by implementing 2 functions to check class V 0.4.2 ======= - Fix test: - Case in *build_encoding*: min_frequency allows to drop rare values" was not built correctly. V 0.4.1 ======= - New features: - New functions: - Functions *target_encode* and *build_target_encoding* have been implemented to provide target encoding which is the process of replacing a categorical value with the aggregation of the target variable. - Function *remove_sd_outlier* helps to remove rows that have numerical values to extreme. - Function *remove_percentile_outlier* helps to remove rows that have numerical values to extreme (based on percentile analysis). - Function *remove_rare_categorical* helps to remove rows that have categorical values to rare. - New features in existing functions: - Function *prepare_set* integrate *target_encode* function. 
It is called by providing *target_col* and *target_encoding_functions*. V 0.4.0 ======= - New features: - New features in existing functions: - To avoid issues based on column names, we will check and rename columns that have same names. - In *aggregate_by_key* generated column names are changed to be more explicit. - In *aggregate_by_key* generated from character column with more than \code{thresh} values is now count of unique instead of count. - Added missing *auto* default values on cols - Bug fixes: - *which_are_bijection* and *which_are_in_double* are using *bi_col_test* which was not working with 2 column data set. It is fixed. - *prepare_set* optional argument *factor_date_type* was not working. It is fixed. - Other changes: - Changed *which_are_included* example since it was to slow for CRAN. Also it might be a little bit more explicit now. - Changed *aggregate_by_key* example since it was to slow for CRAN. - Integration: - Rewrite all tests to make them more readable - Code coverage is improved, dependencies on *messy_adult* set is lowered WARNING: - In *aggregate_by_key* generated column names are changed. - In *aggregate_by_key* generated column for character is different. V 0.3.9 ======= - Integration: - Matching new devtools requirements - Starting to rewrite unittest to make it more readable V 0.3.8 ======= - New features: - New features in existing functions: - Identification of bijection through internal function *fast_is_bijection* is way faster (up to 40 times faster in case of bijection). So *whichArebijection* and *fastFiltervariables* are also improved. - Remove remaining *gc* to save time. - In *one_hot_encoder* added parameter *type* to choose between logical or numerical results. V 0.3.7 ======= - New features: - New functions: - Function *as.POSIXct_fast* is now available. It helps to transform to POSIXct way faster (if the same date value is present multiple times in the column). - New features in existing functions: - In dates identifications, we make it faster by computing search of format only on unique values. - In date transformation, we made it faster by using *as.POSIXct_fast* when it is necessary. - Functions *findAndTransFormDates*, *find_and_transform_numerics* and *un_factor* now accept argument *cols* to limit search. - Bug fixes: - Control that over-allocate option is activated on every data.table to avoid issues with set. Package should be more robust. - In bijection search (internal function *fast_is_bijection*) there was a bug on some rare cases. Fixed but slower. -Code quality: - Improving code quality using lintr - Suppressing some useless code - Meeting new covr standard - Improve log of setColAsXXX V 0.3.6 ======= - Bug fixes: - *identify_dates* had a weird bug. Solved - Integration: - Making dataPreparation compatible with testthat 2.0.0 V 0.3.5 ======= - New features: - New features in existing functions: - *findAndTransFormDates* now as an *ambiguities* parameter, IGNORE to work as before, WARN to check for ambiguities and print them, SOLVE to try to solve ambiguities on more lines. - *one_hot_encoder* now uses a *build_encoding* functions to be able to build same encoding on train and on test. - *aggregate_by_key* is now way faster on numerics. But it changed the way it gets input functions. - *fast_scale* now as a *way* parameter which allow you to either scale or unscale. Unscaling numeric values can be very useful for most post-model analysis. - *set_col_as_date* now accept multiple formats in a single call. 
- New functions: - *build_encoding* build a list of encoding to be used by *one_hot_encoder*, it also has a parameter *min_frequency* to control that rare values doesn't result in new columns. - Previously private function *identify_dates* is now exported. To be able to perform same transformation on train and on test. - Adding *dataPreparationNews* function to open NEWS file (inspired from rfNews() of randomForest package) - Bug fixes: - *findAndTransFormDates*: bug fixed: user formats weren't used. - *identify_dates*: some formats where tested but would never work. They have been removed. - Refactoring: - Unit test partly reviewed to be more readable and more efficient. Unit test time as been divided by 3. - Improving input control for more robust functions WARNING: - *one_hot_encoder* now requires you to run *build_encoding* first. - *aggregate_by_key* now require functions to be passed by character name This version is making (as much as possible) transformation reproducible on train and test set. This is to prepare future pipeline feature. V 0.3.4 ======== - Improvement of function - *which_are_bijection*: It is 2 to 15 time faster than previous version. - *which_are_included*: It is a bit faster. - Bug fixes: - *generate_factor_from_date*: default value was missing. Fixed. - New features: - New features in existing functions: - *fast_filter_variables* has a new parameter (level) to choose which types of filtering to perform WARNING: - *which_are_included*: in case of bijection (col1 is a bijection of col2), they are both included in the other, but the choice of the one to drop might have changed in this version. V 0.3.3 ======== - New features: - New features in existing functions: - *findAndTransFormDates* now recognize date character even if there are multiple separator in date (ex: "2016, Jan-26"). - *findAndTransFormDates* now recognize date character even if there are leading and tailing white spaces. WARNING: - *date3* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even if there are leading and/or trailing white spaces. - *date4* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even if there are multiple separator. V 0.3.2 ======== - Change URLs to meet CRAN requirement v 0.3.1 ======= - Fix bug in Latex documentation v 0.3 ===== - New features: - New features in existing functions: - *findAndTransFormDates* now recognize date character even if "0" are not present in month or day part and month as lower strings. - *findAndTransFormDates* and *set_col_as_date* now work with *factors*. - New functions: - *fast_discretization*: to perform equal freq or equal width discretization on a data set using *data.table* power. - *fast_scale*: to perform scaling on a data set using *data.table* power. - *one_hot_encoder*: to perform one_hot encoding on a data set using *data.table* power. - New documentation: - A new vignette to illustrate how to build a correct *train* and *test* set using data preparation - Minor changes in log (in particular regarding progress bars and typos) - Due to dependencies issues with *tcltk*, we stop using it and start using *progress* - Refactoring: - Private function *real_cols* take more importance to control that columns have the correct types and handling " auto" value. 
- Making code faster: some functions are up to **30% faster** - Review unit testing to be faster - Unit test evolution to be more readable WARNING: - *date1* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even if "0" are not present in month or day part. v 0.2 ===== - Improving unit testing and code coverage - Improving documentation - Solving minor bug in date conversion and in which functions - New features: - New functions: - *un_factor* to un-factor columns, when reading wasn't performed in expected way. - *same_shape* to make ure that train and test set have exactly the same shape. - generate new columns from existing columns (generate functions) - generate factor from dates: *generate_factor_from_date* - diffDates becomes *generate_date_diffs* (for better name understanding). - generate numerics and booleans from character of factors (using *generate_from_factor* and *generate_from_character*) - *set_col_as_factor* a function to make multiple columns as factor and controlling number of unique elements - New features in existing functions: - which functions: add *keep_cols* argument to make sure that they are not dropped - fast_filter_variables: *verbose* can be T/F or 0, 1, 2 in order to control level of verbosity - *findAndTransFormDates* and *set_col_as_dates* now recognize and accept timestamp. WARNING: - If you were using *diffDates*, it is now called *generate_date_diffs* - *date2* column in *messy_adult* data set have changed in order to illustrate new timestamp features - *set_col_as_factorOrLogical* doesn't exist anymore: it as been split between *set_col_as_factor* and *generateFromCat* - Considering all those changes: *shape_set* and *prepare_set* don't give the same result anymore. v 0.1: release on CRAN July 2017 ================================ [1] "prepare_set: step one: correcting mistakes." [1] "fast_filter_variables: I check for constant columns." [1] "fast_filter_variables: I check for columns in double." [1] "fast_filter_variables: I check for columns that are bijections of another column." [1] "fast_filter_variables: I delete 1 column(s) that are bijections of another column in data_set." [1] "age" "fnlwgt" "capital_gain" "capital_loss" "hr_per_week" [1] "un_factor: c(\"age\", \"fnlwgt\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor i do nothing for those variables." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor education." [1] "un_factor: I un-factor occupation." [1] "un_factor: I un-factor country." [1] "un_factor: It took me 0.01s to un-factor 3 column(s)." [1] "find_and_transform_numerics: It took me 0.01s to identify 0 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: There are no numerics to transform.(If i missed something consider using set_col_as_numeric to transform it)" [1] "find_and_transform_dates: It took me 1.18s to identify formats" [1] "find_and_transform_dates: There are no dates to transform.\n (If i missed something please provide the date format in inputs or\n consider using set_col_as_date to transform it)." [1] "prepare_set: step two: transforming data_set." 
[1] "age" "type_employer" "fnlwgt" "education" [5] "marital" "occupation" "relationship" "race" [9] "sex" "capital_gain" "capital_loss" "hr_per_week" [13] "country" "income" [1] "prepare_set: c(\"age\", \"type_employer\", \"fnlwgt\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"country\", \"income\") aren't columns of types date i do nothing for those variables." [1] "generate_date_diffs: I will generate difference between dates." [1] "generate_date_diffs: It took me 0.01s to create 0 column(s)." [1] "generate_factor_from_date: I will create a factor column from each date column." [1] "generate_factor_from_date: It took me 0s to transform 0 column(s)." [1] "age" "type_employer" "fnlwgt" "marital" [5] "relationship" "race" "sex" "capital_gain" [9] "capital_loss" "hr_per_week" "income" [1] "prepare_set: c(\"age\", \"type_employer\", \"fnlwgt\", \"marital\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"income\") aren't columns of types character i do nothing for those variables." Saving _problems/test_prepare_set-15.R [1] "remove_sd_outlier: I start to filter categorical rare events" [1] "remove_sd_outlier: dropped 1 row(s) that are rare event on num_col." [1] "remove_sd_outlier: 1 have been dropped. It took 0.01 seconds. " [1] "remove_sd_outlier: I start to filter categorical rare events" [1] "remove_sd_outlier: dropped 0 row(s) that are rare event on num_col." [1] "remove_sd_outlier: 0 have been dropped. It took 0.01 seconds. " [1] "remove_rare_categorical: I start to filter categorical rare events" [1] "remove_rare_categorical: dropped 1 row(s) that are rare event on cat_col." [1] "remove_rare_categorical: 1 have been dropped. It took 0.01 seconds. " [1] "remove_percentile_outlier: I start to filter categorical rare events" [1] "remove_percentile_outlier: dropped 2 row(s) that are rare event on num_col." [1] "remove_percentile_outlier: 2 have been dropped. It took 0 seconds. " [1] "remove_percentile_outlier: I start to filter categorical rare events" [1] "remove_percentile_outlier: dropped 2 row(s) that are rare event on num_col." [1] "remove_percentile_outlier: 2 have been dropped. It took 0 seconds. " [1] "same_shape: verify that every column is present." [1] "same_shape: columns col_2 are missing, I create them." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_2 class was logical i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: the following columns are in data_set but not in reference_set: I drop them: " [1] "col_2" [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was character i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was character i set it to c(\"POSIXct\", \"POSIXt\")." 
[1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: col_1 class had different levels than in reference_set I change it." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: col_1 class had different levels than in reference_set I change it." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was numeric i set it to weird_class." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was numeric i set it to weird_class." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: columns type_employer?, type_employerFederal-gov, type_employerLocal-gov, type_employerNever-worked, type_employerPrivate, type_employerSelf-emp-inc, type_employerSelf-emp-not-inc, type_employerState-gov, type_employerWithout-pay, education11th, education12th, education1st-4th, education5th-6th, education7th-8th, education9th, educationAssoc-acdm, educationAssoc-voc, educationBachelors, educationDoctorate, educationHS-grad, educationMasters, educationPreschool, educationProf-school, educationSome-college, maritalMarried-AF-spouse, maritalMarried-civ-spouse, maritalMarried-spouse-absent, maritalNever-married, maritalSeparated, maritalWidowed, occupationAdm-clerical, occupationArmed-Forces, occupationCraft-repair, occupationExec-managerial, occupationFarming-fishing, occupationHandlers-cleaners, occupationMachine-op-inspct, occupationOther-service, occupationPriv-house-serv, occupationProf-specialty, occupationProtective-serv, occupationSales, occupationTech-support, occupationTransport-moving, relationshipNot-in-family, relationshipOther-relative, relationshipOwn-child, relationshipUnmarried, relationshipWife, raceAsian-Pac-Islander, raceBlack, raceOther, raceWhite, sexMale, capital_loss1408, capital_loss1564, capital_loss1573, capital_loss1719, capital_loss1762, capital_loss1887, capital_loss1902, capital_loss2042, capital_loss2179, countryCambodia, countryCanada, countryChina, countryColumbia, countryCuba, countryDominican-Republic, countryEcuador, countryEl-Salvador, countryEngland, countryFrance, countryGermany, countryGreece, countryGuatemala, countryHaiti, countryHoland-Netherlands, countryHonduras, countryHong, countryHungary, countryIndia, countryIran, countryIreland, countryItaly, countryJamaica, countryJapan, countryLaos, countryMexico, countryNicaragua, countryOutlying-US(Guam-USVI-etc), countryPeru, countryPhilippines, countryPoland, countryPortugal, countryPuerto-Rico, 
countryScotland, countrySouth, countryTaiwan, countryThailand, countryTrinadad&Tobago, countryUnited-States, countryVietnam, countryYugoslavia, income>50K are missing, I create them." [1] "same_shape: drop unwanted columns." [1] "same_shape: the following columns are in data_set but not in reference_set: I drop them: " [1] "type_employer" "education" "marital" "occupation" [5] "relationship" "race" "sex" "capital_loss" [9] "country" "income" [1] "same_shape: verify that every column is in the right type." [1] "same_shape: age class was integer i set it to numeric." [1] "same_shape: fnlwgt class was integer i set it to numeric." [1] "same_shape: education_num class was integer i set it to numeric." [1] "same_shape: capital_gain class was integer i set it to numeric." [1] "same_shape: hr_per_week class was integer i set it to numeric." [1] "same_shape: type_employer? class was logical i set it to numeric." [1] "same_shape: type_employerFederal-gov class was logical i set it to numeric." [1] "same_shape: type_employerLocal-gov class was logical i set it to numeric." [1] "same_shape: type_employerNever-worked class was logical i set it to numeric." [1] "same_shape: type_employerPrivate class was logical i set it to numeric." [1] "same_shape: type_employerSelf-emp-inc class was logical i set it to numeric." [1] "same_shape: type_employerSelf-emp-not-inc class was logical i set it to numeric." [1] "same_shape: type_employerState-gov class was logical i set it to numeric." [1] "same_shape: type_employerWithout-pay class was logical i set it to numeric." [1] "same_shape: education11th class was logical i set it to numeric." [1] "same_shape: education12th class was logical i set it to numeric." [1] "same_shape: education1st-4th class was logical i set it to numeric." [1] "same_shape: education5th-6th class was logical i set it to numeric." [1] "same_shape: education7th-8th class was logical i set it to numeric." [1] "same_shape: education9th class was logical i set it to numeric." [1] "same_shape: educationAssoc-acdm class was logical i set it to numeric." [1] "same_shape: educationAssoc-voc class was logical i set it to numeric." [1] "same_shape: educationBachelors class was logical i set it to numeric." [1] "same_shape: educationDoctorate class was logical i set it to numeric." [1] "same_shape: educationHS-grad class was logical i set it to numeric." [1] "same_shape: educationMasters class was logical i set it to numeric." [1] "same_shape: educationPreschool class was logical i set it to numeric." [1] "same_shape: educationProf-school class was logical i set it to numeric." [1] "same_shape: educationSome-college class was logical i set it to numeric." [1] "same_shape: maritalMarried-AF-spouse class was logical i set it to numeric." [1] "same_shape: maritalMarried-civ-spouse class was logical i set it to numeric." [1] "same_shape: maritalMarried-spouse-absent class was logical i set it to numeric." [1] "same_shape: maritalNever-married class was logical i set it to numeric." [1] "same_shape: maritalSeparated class was logical i set it to numeric." [1] "same_shape: maritalWidowed class was logical i set it to numeric." [1] "same_shape: occupationAdm-clerical class was logical i set it to numeric." [1] "same_shape: occupationArmed-Forces class was logical i set it to numeric." [1] "same_shape: occupationCraft-repair class was logical i set it to numeric." [1] "same_shape: occupationExec-managerial class was logical i set it to numeric." 
[1] "same_shape: occupationFarming-fishing class was logical i set it to numeric." [1] "same_shape: occupationHandlers-cleaners class was logical i set it to numeric." [1] "same_shape: occupationMachine-op-inspct class was logical i set it to numeric." [1] "same_shape: occupationOther-service class was logical i set it to numeric." [1] "same_shape: occupationPriv-house-serv class was logical i set it to numeric." [1] "same_shape: occupationProf-specialty class was logical i set it to numeric." [1] "same_shape: occupationProtective-serv class was logical i set it to numeric." [1] "same_shape: occupationSales class was logical i set it to numeric." [1] "same_shape: occupationTech-support class was logical i set it to numeric." [1] "same_shape: occupationTransport-moving class was logical i set it to numeric." [1] "same_shape: relationshipNot-in-family class was logical i set it to numeric." [1] "same_shape: relationshipOther-relative class was logical i set it to numeric." [1] "same_shape: relationshipOwn-child class was logical i set it to numeric." [1] "same_shape: relationshipUnmarried class was logical i set it to numeric." [1] "same_shape: relationshipWife class was logical i set it to numeric." [1] "same_shape: raceAsian-Pac-Islander class was logical i set it to numeric." [1] "same_shape: raceBlack class was logical i set it to numeric." [1] "same_shape: raceOther class was logical i set it to numeric." [1] "same_shape: raceWhite class was logical i set it to numeric." [1] "same_shape: sexMale class was logical i set it to numeric." [1] "same_shape: capital_loss1408 class was logical i set it to numeric." [1] "same_shape: capital_loss1564 class was logical i set it to numeric." [1] "same_shape: capital_loss1573 class was logical i set it to numeric." [1] "same_shape: capital_loss1719 class was logical i set it to numeric." [1] "same_shape: capital_loss1762 class was logical i set it to numeric." [1] "same_shape: capital_loss1887 class was logical i set it to numeric." [1] "same_shape: capital_loss1902 class was logical i set it to numeric." [1] "same_shape: capital_loss2042 class was logical i set it to numeric." [1] "same_shape: capital_loss2179 class was logical i set it to numeric." [1] "same_shape: countryCambodia class was logical i set it to numeric." [1] "same_shape: countryCanada class was logical i set it to numeric." [1] "same_shape: countryChina class was logical i set it to numeric." [1] "same_shape: countryColumbia class was logical i set it to numeric." [1] "same_shape: countryCuba class was logical i set it to numeric." [1] "same_shape: countryDominican-Republic class was logical i set it to numeric." [1] "same_shape: countryEcuador class was logical i set it to numeric." [1] "same_shape: countryEl-Salvador class was logical i set it to numeric." [1] "same_shape: countryEngland class was logical i set it to numeric." [1] "same_shape: countryFrance class was logical i set it to numeric." [1] "same_shape: countryGermany class was logical i set it to numeric." [1] "same_shape: countryGreece class was logical i set it to numeric." [1] "same_shape: countryGuatemala class was logical i set it to numeric." [1] "same_shape: countryHaiti class was logical i set it to numeric." [1] "same_shape: countryHoland-Netherlands class was logical i set it to numeric." [1] "same_shape: countryHonduras class was logical i set it to numeric." [1] "same_shape: countryHong class was logical i set it to numeric." [1] "same_shape: countryHungary class was logical i set it to numeric." 
[1] "same_shape: countryIndia class was logical i set it to numeric." [1] "same_shape: countryIran class was logical i set it to numeric." [1] "same_shape: countryIreland class was logical i set it to numeric." [1] "same_shape: countryItaly class was logical i set it to numeric." [1] "same_shape: countryJamaica class was logical i set it to numeric." [1] "same_shape: countryJapan class was logical i set it to numeric." [1] "same_shape: countryLaos class was logical i set it to numeric." [1] "same_shape: countryMexico class was logical i set it to numeric." [1] "same_shape: countryNicaragua class was logical i set it to numeric." [1] "same_shape: countryOutlying-US(Guam-USVI-etc) class was logical i set it to numeric." [1] "same_shape: countryPeru class was logical i set it to numeric." [1] "same_shape: countryPhilippines class was logical i set it to numeric." [1] "same_shape: countryPoland class was logical i set it to numeric." [1] "same_shape: countryPortugal class was logical i set it to numeric." [1] "same_shape: countryPuerto-Rico class was logical i set it to numeric." [1] "same_shape: countryScotland class was logical i set it to numeric." [1] "same_shape: countrySouth class was logical i set it to numeric." [1] "same_shape: countryTaiwan class was logical i set it to numeric." [1] "same_shape: countryThailand class was logical i set it to numeric." [1] "same_shape: countryTrinadad&Tobago class was logical i set it to numeric." [1] "same_shape: countryUnited-States class was logical i set it to numeric." [1] "same_shape: countryVietnam class was logical i set it to numeric." [1] "same_shape: countryYugoslavia class was logical i set it to numeric." [1] "same_shape: income>50K class was logical i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to scale 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to scale 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to unscale 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "set_col_as_numeric: I will set some columns as numeric" [1] "set_col_as_numeric: I am doing the column char_col_1." [1] "set_col_as_numeric: 0 NA have been created due to transformation to numeric." [1] "set_col_as_numeric: I am doing the column char_col_2." [1] "set_col_as_numeric: 0 NA have been created due to transformation to numeric." [1] "set_col_as_character: I will set some columns as character" [1] "set_col_as_character: I am doing the column numCol." [1] "set_col_as_character: I am doing the column factorCol." [1] "set_col_as_character: I am doing the column charcol." 
[1] "set_col_as_character: charcol is a character, i do nothing." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date1." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: I am doing the column date2." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0.03s to transform 2 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date2." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0.02s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date1." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0.03s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: it took me: 0.01s to transform 0 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: Since i generated only NAs i set ID as it was before." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: ID doesn't seem to be a date, if it really is please provide format." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time_stamp_s." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time_stamp_ms." [1] "set_col_as_date: it took me: 0s to transform 1 column(s) to Dates." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: it took me: 0s to transform 1 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: it took me: 0s to transform 1 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: col has more than 2 values, i don't transform it." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." [1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." 
[1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." [1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming logical into binaries.\n" [1] "shape_set: Previous distribution of column types:" col_class_init logical 1 [1] "shape_set: Current distribution of column types:" col_class_end integer 1 [1] "which_are_constant: constantCol is constant." [1] "which_are_constant: it took me 0s to identify 1 constant column(s)" [1] "which_are_in_double: it took me 0.01s to identify 2 column(s) to drop." [1] "which_are_in_double: it took me 0.01s to identify 1 column(s) to drop." [1] "which_are_in_double: it took me 0.01s to identify 1 column(s) to drop." [1] "which_are_in_double: it took me 0s to identify 0 column(s) to drop." [1] "which_are_bijection: it took me 0.09s to identify 1 column(s) to drop." [1] "which_are_bijection: education is a bijection of education_num. I put it in drop list." [1] "which_are_bijection: it took me 0.05s to identify 1 column(s) to drop." [1] "which_are_bijection: it took me 0s to identify 0 column(s) to drop." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column education." [1] "which_are_included: are_50_or_more is included in column age." [1] "which_are_included: constant is included in column sex." [1] "which_are_included: sex is included in column fnlwgt." [1] "which_are_included: income is included in column id." [1] "which_are_included: race is included in column fnlwgt." [1] "which_are_included: relationship is included in column id." [1] "which_are_included: type_employer is included in column fnlwgt." [1] "which_are_included: marital is included in column id." [1] "which_are_included: occupation is included in column id." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column id." [1] "which_are_included: capital_gain is included in column fnlwgt." [1] "which_are_included: capital_loss is included in column fnlwgt." [1] "which_are_included: country is included in column fnlwgt." [1] "which_are_included: hr_per_week is included in column id." [1] "which_are_included: age is included in column id." [1] "which_are_included: mail is included in column id." [1] "which_are_included: date2 is included in column id." [1] "which_are_included: date1 is included in column id." [1] "which_are_included: date3 is included in column date4." [1] "which_are_included: date4 is included in column id." [1] "which_are_included: num1 is included in column num3." [1] "which_are_included: num3 is included in column id." [1] "which_are_included: num2 is included in column id." [1] "which_are_included: fnlwgt is included in column id." [1] "which_are_included: constant is included in column sex." [1] "which_are_included: sex is included in column fnlwgt." 
[1] "which_are_included: income is included in column id." [1] "which_are_included: race is included in column fnlwgt." [1] "which_are_included: relationship is included in column id." [1] "which_are_included: type_employer is included in column fnlwgt." [1] "which_are_included: marital is included in column id." [1] "which_are_included: occupation is included in column id." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column id." [1] "which_are_included: capital_gain is included in column fnlwgt." [1] "which_are_included: capital_loss is included in column fnlwgt." [1] "which_are_included: country is included in column fnlwgt." [1] "which_are_included: hr_per_week is included in column id." [1] "which_are_included: age is included in column id." [1] "which_are_included: mail is included in column id." [1] "which_are_included: date2 is included in column id." [1] "which_are_included: date1 is included in column id." [1] "which_are_included: date3 is included in column date4." [1] "which_are_included: date4 is included in column id." [1] "which_are_included: num1 is included in column num3." [1] "which_are_included: num3 is included in column id." [1] "which_are_included: num2 is included in column id." [1] "which_are_included: fnlwgt is included in column id." [ FAIL 7 | WARN 0 | SKIP 1 | PASS 322 ] ══ Skipped tests (1) ═══════════════════════════════════════════════════════════ • empty test (1): ══ Failed tests ════════════════════════════════════════════════════════════════ ── Error ('test_generate_from_character.R:13:5'): generate_from_character: don't drop so generate 3 new cols ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 3/3 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, cols = "character_col") at test_generate_from_character.R:13:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_character.R:26:5'): generate_from_character: drop generate 3 col and suppress one ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, drop = TRUE) at test_generate_from_character.R:26:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_character.R:40:5'): generate_from_character: don't reduce number of rows even with NA ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, cols = "character_col") at test_generate_from_character.R:40:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:14:5'): generate_from_factor: drop: functionnal test on reference set ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 25/25 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_factor(...) at test_generate_from_factor.R:14:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:27:5'): generate_from_factor: test don't drop => keep original col ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. 
└─dataPreparation::generate_from_factor(...) at test_generate_from_factor.R:27:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:80:5'): build_encoding: min_frequency allows to drop rare values ── Error in ``[.data.table`(data_set, , `:=`(c("freq"), (.N/nrow(data_set))), by = col)`: attempt access index 1/1 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::build_encoding(...) at test_generate_from_factor.R:80:5 2. ├─data_set[, `:=`(c("freq"), (.N/nrow(data_set))), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_prepare_set.R:14:5'): prepare_set: functionnal test: test full pipeline. Should give result with as many rows as unique key. ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 15/15 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::prepare_set(...) at test_prepare_set.R:14:5 2. └─dataPreparation::generate_from_character(...) 3. ├─data_set[, `:=`(c(new_col), .N), by = col] 4. └─data.table:::`[.data.table`(...) [ FAIL 7 | WARN 0 | SKIP 1 | PASS 322 ] Error: ! Test failures. Execution halted Flavor: r-devel-linux-x86_64-fedora-clang
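All seven failures above stop in the same data.table call, data_set[, `:=`(c(new_col), .N), by = col], which assigns the per-group row count .N to a freshly named column. The minimal sketch below shows that idiom on toy data; the table, column, and new-column names are illustrative only (not taken from the package's test files), and the sketch says nothing about why the call errors on this flavor.

    # Minimal sketch of the grouped-count assignment named in the backtraces above.
    # Object and column names are illustrative, not taken from the test files.
    library(data.table)

    data_set <- data.table(character_col = c("a", "b", "a", NA, "b", "a"))
    col <- "character_col"            # column to group by
    new_col <- "character_col.count"  # hypothetical name for the generated column

    # Assign the group size .N to a new column, grouping by the column whose
    # name is stored in `col` (equivalent to the `:=`(c(new_col), .N) form).
    data_set[, c(new_col) := .N, by = col]
    print(data_set)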

Version: 1.1.2
Check: tests
Result: ERROR Running ‘testthat.R’ [21s/50s] Running the tests in ‘tests/testthat.R’ failed. Complete output: > if (requireNamespace("testthat", quietly = TRUE)) { + library(testthat) + library(dataPreparation) + test_check("dataPreparation") + } dataPreparation 1.1.2 Type data_preparation_news() to see new features/changes/bug fixes. [1] "aggregate_by_key: I start to aggregate" [1] "aggregate_by_key: 6 columns have been constructed. It took 0.11 seconds. " [1] "find_and_transform_dates: It took me 3.88s to identify formats" [1] "find_and_transform_dates: It took me 0.57s to transform 4 columns to a Date format." [1] "find_and_transform_dates: It took me 0.03s to identify formats" [1] "find_and_transform_dates: There are no dates to transform.\n (If i missed something please provide the date format in inputs or\n consider using set_col_as_date to transform it)." [1] "identify_dates: column date_col seems to have an ambiguity, I try to solve it." [1] "V2" [1] "fast_discretization: V2 aren't columns of types numeric i do nothing for those variables." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_freq method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will build splits for 0 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 0 numeric columns." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "equal_width_splits: constant_col can't provide 10 equal width bins; instead you will have 0 bins." [1] "fast_discretization: column constant_col seems to be constant, I do nothing." [1] "fast_discretization: it took me: 0s to build splits for 0 numeric columns." [1] "equal_width_splits: data_set can't provide 10 equal width bins; instead you will have 0 bins." [1] "equal_freq_splits: data_set can't provide 10 equal freq bins; instead you will have 2 bins." [1] "fast_discretization: I will build splits for 1 numeric columns using, equal_width method." [1] "fast_discretization: it took me: 0s to build splits for 1 numeric columns." [1] "fast_discretization: I will discretize 1 numeric columns using, bins." [1] "fast_discretization: it took me: 0s to transform 1 numeric columns into, binary columns." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor false_factor." [1] "un_factor: It took me 0s to un-factor 1 column(s)." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor true_factor." [1] "un_factor: I un-factor false_factor." [1] "un_factor: It took me 0s to un-factor 2 column(s)." [1] "fast_filter_variables: I check for constant columns." [1] "fast_filter_variables: I delete 1 constant column(s) in data_set." [1] "fast_filter_variables: I check for columns in double." [1] "fast_filter_variables: I delete 1 column(s) that are in double in data_set." [1] "fast_filter_variables: I check for columns that are bijections of another column." [1] "fast_filter_variables: I delete 3 column(s) that are bijections of another column in data_set." 
[1] "fast_filter_variables: I check for columns that are included in another column." [1] "fast_filter_variables: I delete 1 column(s) that are bijections of another column in data_set." [1] "string_column" [1] "fast_round: string_column aren't columns of types numeric or integer i do nothing for those variables." [1] "string_column" [1] "fast_round: string_column aren't columns of types numeric or integer i do nothing for those variables." Saving _problems/test_generate_from_character-13.R Saving _problems/test_generate_from_character-26.R Saving _problems/test_generate_from_character-40.R [1] "generate_factor_from_date: I will create a factor column from each date column." [1] "generate_factor_from_date: It took me 0s to transform 1 column(s)." [1] "ID" [1] "generate_date_diffs: ID aren't columns of types date i do nothing for those variables." [1] "generate_date_diffs: I will generate difference between dates." [1] "generate_date_diffs: It took me 0.03s to create 3 column(s)." [1] "date1" "date2" "date3" "date4" [5] "num1" "num2" "constant" "num3" [9] "age" "fnlwgt" "education_num" "capital_gain" [13] "capital_loss" "hr_per_week" [1] "generate_from_factor: c(\"date1\", \"date2\", \"date3\", \"date4\", \"num1\", \"num2\", \"constant\", \"num3\", \"age\", \"fnlwgt\", \"education_num\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor i do nothing for those variables." Saving _problems/test_generate_from_factor-14.R Saving _problems/test_generate_from_factor-27.R [1] "one_hot_encoder: Since you didn't provide encoding, I compute them with build_encoding." [1] "build_encoding: I will compute encoding on 1 character and factor columns." [1] "build_encoding: it took me: 0s to compute encoding for 1 character and factor columns." [1] "one_hot_encoder: I will one hot encode some columns." [1] "one_hot_encoder: I am doing column: character_col" [1] "one_hot_encoder: It took me 0s to transform 1 column(s)." [1] "build_encoding: I will compute encoding on 1 character and factor columns." [1] "build_encoding: it took me: 0.01s to compute encoding for 1 character and factor columns." [1] "build_encoding: I will compute encoding on 1 character and factor columns." Saving _problems/test_generate_from_factor-80.R [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: grades." [1] "target_encode: Start to encode columns according to target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: grades." [1] "target_encode: Start to encode columns according to target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "build_target_encoding: Start to compute encoding for target_encoding according to col: target." [1] "real_cols: col_2 aren't columns of the table, i do nothing for those variables" [1] "col_2" [1] "real_cols: col_2 aren't columns of types numeric i do nothing for those variables." [1] "find_and_transform_numerics: It took me 0.01s to identify 2 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: It took me 0s to transform 2 column(s) to a numeric format." 
[1] "find_and_transform_numerics: It took me 0s to identify 0 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: There are no numerics to transform.(If i missed something consider using set_col_as_numeric to transform it)" V 1.1.2 (September 2025) ================== - DOC : - Update documentation according to new \link standards. - TECH : - Update CI tested R versions : removing 4.0 & 4.1, adding 4.4 and 4.5 V 1.1.1 (June 2023) ================== - FEAT: - Speed up examples by providing and using a `tiny_messy_adult` data set. - FIX: - Fix typos - TECH: - Speed up CI for MACOS V 1.1.0 ======= - FEAT: - Stop supporting R strictly before 3.6, and support R 4.2 and 4.3 - BUGFIX: - FIX documentation - TECH: - Upgrade package install in CI V 1.0.5 (July 2022) ================== FEAT: - New functions *compute_probability_ratio* and *compute_weight_of_evidence* to be used for target encoding - New function *get_most_frequent_element* to identify most frequent element in a list V 1.0.4 ======= BUGFIX: Fix *generate_from_character*, when there were some NAs in the column it would drop the line. It is not the case anymore. V 1.0.3 ======= BUGFIX: Fix bud on *fast_is_bijection* when column has multiple class FEAT: Harmonize logging levels between functions V 1.0.2 ======= Remove useless dependencies. Make sure library works on windows, macos, ubuntu, and R versions from 3.3 to 4.1. V 1.0.1 ======= Based on CRAN feedbacks removed problematic vignettes. V 1.0.0 ======= For this version 1.0.0 there are a lot of changes, and version is not compatible with previous version of the package. Also there might be some rework to do on code using previous version of this package (and we are sorry about it), we strongly believe that this version will be easier to use, faster, and more maintanable in time. In this version: - All function names and variables are snake_case (there used to be a mix of camel case and snake case) - We remove a lost of useless code that was slowing done the package (particularly garbage collection) - We made the code more readable so that it is easier to contribute to this package - Logging is more explicit and cleaner. - We took into account linting. - A few more functions are availables. We hope that you will like even more this new version of the package. Please don't hesitate to provide feedback, warn us about bug, suggest improvements or even better developp some improvements on this package. To do so please go to github (https://github.com/ELToulemonde/dataPreparation/). V 0.4.3 ======= - Fix : - In *same_shape*: there was a future bug due to change in class "matrix". Fixed it by implementing 2 functions to check class V 0.4.2 ======= - Fix test: - Case in *build_encoding*: min_frequency allows to drop rare values" was not built correctly. V 0.4.1 ======= - New features: - New functions: - Functions *target_encode* and *build_target_encoding* have been implemented to provide target encoding which is the process of replacing a categorical value with the aggregation of the target variable. - Function *remove_sd_outlier* helps to remove rows that have numerical values to extreme. - Function *remove_percentile_outlier* helps to remove rows that have numerical values to extreme (based on percentile analysis). - Function *remove_rare_categorical* helps to remove rows that have categorical values to rare. - New features in existing functions: - Function *prepare_set* integrate *target_encode* function. 
V 0.4.0
=======
- New features:
  - New features in existing functions:
    - To avoid issues based on column names, we now check and rename columns that share the same name.
    - In *aggregate_by_key*, generated column names are changed to be more explicit.
    - In *aggregate_by_key*, the column generated from a character column with more than \code{thresh} values is now a count of unique values instead of a count.
    - Added missing *auto* default values on cols.
- Bug fixes:
  - *which_are_bijection* and *which_are_in_double* use *bi_col_test*, which was not working with a 2-column data set. It is fixed.
  - The *prepare_set* optional argument *factor_date_type* was not working. It is fixed.
- Other changes:
  - Changed the *which_are_included* example since it was too slow for CRAN. It might also be a little more explicit now.
  - Changed the *aggregate_by_key* example since it was too slow for CRAN.
- Integration:
  - Rewrite all tests to make them more readable.
  - Code coverage is improved; dependency on the *messy_adult* set is lowered.
WARNING:
- In *aggregate_by_key*, generated column names have changed.
- In *aggregate_by_key*, the column generated for character columns is different.

V 0.3.9
=======
- Integration:
  - Matching new devtools requirements.
  - Starting to rewrite unit tests to make them more readable.

V 0.3.8
=======
- New features:
  - New features in existing functions:
    - Identification of bijections through the internal function *fast_is_bijection* is much faster (up to 40 times faster in case of a bijection), so *whichArebijection* and *fastFiltervariables* are also improved.
    - Remove remaining *gc* calls to save time.
    - In *one_hot_encoder*, added parameter *type* to choose between logical or numerical results.

V 0.3.7
=======
- New features:
  - New functions:
    - Function *as.POSIXct_fast* is now available. It helps to transform to POSIXct much faster (when the same date value is present multiple times in the column).
  - New features in existing functions:
    - In date identification, we made it faster by searching for formats only on unique values.
    - In date transformation, we made it faster by using *as.POSIXct_fast* when necessary.
    - Functions *findAndTransFormDates*, *find_and_transform_numerics* and *un_factor* now accept the argument *cols* to limit the search.
- Bug fixes:
  - Control that the over-allocate option is activated on every data.table to avoid issues with set. The package should be more robust.
  - In bijection search (internal function *fast_is_bijection*) there was a bug in some rare cases. Fixed, but slower.
- Code quality:
  - Improving code quality using lintr.
  - Suppressing some useless code.
  - Meeting the new covr standard.
  - Improve logs of setColAsXXX.

V 0.3.6
=======
- Bug fixes:
  - *identify_dates* had a weird bug. Solved.
- Integration:
  - Making dataPreparation compatible with testthat 2.0.0.

V 0.3.5
=======
- New features:
  - New features in existing functions:
    - *findAndTransFormDates* now has an *ambiguities* parameter: IGNORE to work as before, WARN to check for ambiguities and print them, SOLVE to try to solve ambiguities on more lines.
    - *one_hot_encoder* now uses a *build_encoding* function to be able to build the same encoding on train and on test.
    - *aggregate_by_key* is now much faster on numerics, but it changed the way it takes input functions.
    - *fast_scale* now has a *way* parameter which allows you to either scale or unscale. Unscaling numeric values can be very useful for most post-model analysis.
    - *set_col_as_date* now accepts multiple formats in a single call.
  - New functions:
    - *build_encoding* builds a list of encodings to be used by *one_hot_encoder*; it also has a *min_frequency* parameter to control that rare values don't result in new columns.
    - The previously private function *identify_dates* is now exported, to be able to perform the same transformation on train and on test.
    - Adding a *dataPreparationNews* function to open the NEWS file (inspired by rfNews() from the randomForest package).
- Bug fixes:
  - *findAndTransFormDates*: bug fixed: user formats weren't used.
  - *identify_dates*: some formats were tested but would never work. They have been removed.
- Refactoring:
  - Unit tests partly reviewed to be more readable and more efficient. Unit test time has been divided by 3.
  - Improving input control for more robust functions.
WARNING:
- *one_hot_encoder* now requires you to run *build_encoding* first.
- *aggregate_by_key* now requires functions to be passed by character name.
This version makes transformations reproducible on train and test set (as much as possible). This is to prepare a future pipeline feature.
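The V 0.3.5 entry and its WARNING describe a build-then-apply pattern: compute an encoding once with *build_encoding*, then pass it to *one_hot_encoder* so that train and test receive identical columns. A sketch of that pattern follows; the `cols` and `encoding` argument names are assumptions inferred from the changelog and from the verbose log earlier in this output ("Since you didn't provide encoding, I compute them with build_encoding."), so check the package manual before relying on them.

    # Sketch of the build-then-apply encoding pattern from V 0.3.5.
    # Argument names `cols` and `encoding` are assumptions, not verified here.
    library(dataPreparation)

    train <- data.frame(character_col = c("a", "b", "a", "c"))
    test  <- data.frame(character_col = c("a", "b", "b"))

    enc <- build_encoding(train, cols = "character_col")    # learn encoding on train only
    train_encoded <- one_hot_encoder(train, encoding = enc)
    test_encoded  <- one_hot_encoder(test, encoding = enc)  # same columns as train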
V 0.3.4
=======
- Improvement of functions:
  - *which_are_bijection*: it is 2 to 15 times faster than the previous version.
  - *which_are_included*: it is a bit faster.
- Bug fixes:
  - *generate_factor_from_date*: a default value was missing. Fixed.
- New features:
  - New features in existing functions:
    - *fast_filter_variables* has a new parameter (level) to choose which types of filtering to perform.
WARNING:
- *which_are_included*: in case of a bijection (col1 is a bijection of col2), each is included in the other, but the choice of which one to drop might have changed in this version.

V 0.3.3
=======
- New features:
  - New features in existing functions:
    - *findAndTransFormDates* now recognizes date characters even if there are multiple separators in the date (e.g. "2016, Jan-26").
    - *findAndTransFormDates* now recognizes date characters even if there are leading and trailing white spaces.
WARNING:
- The *date3* column in the *messy_adult* data set has changed in order to illustrate the recognition of date characters even if there are leading and/or trailing white spaces.
- The *date4* column in the *messy_adult* data set has changed in order to illustrate the recognition of date characters even if there are multiple separators.

V 0.3.2
=======
- Change URLs to meet CRAN requirements.

v 0.3.1
=======
- Fix bug in LaTeX documentation.

v 0.3
=====
- New features:
  - New features in existing functions:
    - *findAndTransFormDates* now recognizes date characters even if "0" is not present in the month or day part, and even if months are lowercase strings.
    - *findAndTransFormDates* and *set_col_as_date* now work with *factors*.
  - New functions:
    - *fast_discretization*: to perform equal-freq or equal-width discretization on a data set using *data.table* power.
    - *fast_scale*: to perform scaling on a data set using *data.table* power.
    - *one_hot_encoder*: to perform one-hot encoding on a data set using *data.table* power.
  - New documentation:
    - A new vignette to illustrate how to build correct *train* and *test* sets using data preparation.
- Minor changes in logs (in particular regarding progress bars and typos).
- Due to dependency issues with *tcltk*, we stop using it and start using *progress*.
- Refactoring:
  - The private function *real_cols* takes more importance, to control that columns have the correct types and to handle the "auto" value.
  - Making code faster: some functions are up to **30% faster**.
  - Review unit testing to be faster.
  - Unit test evolution to be more readable.
WARNING:
- The *date1* column in the *messy_adult* data set has changed in order to illustrate the recognition of date characters even if "0" is not present in the month or day part.
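The v 0.3 entry above introduces *fast_discretization* with the equal-width and equal-freq split methods that also appear in the test log. The base-R sketch below shows the difference between the two kinds of splits without using the package itself.

    # Base-R illustration of equal-width vs. equal-frequency splits,
    # independent of fast_discretization itself.
    x <- c(1, 2, 2, 3, 10, 11, 12, 50, 60, 100)

    # Equal width: three bins of identical length over the range of x.
    equal_width <- cut(x, breaks = 3)

    # Equal frequency: bin edges taken from quantiles, so each bin holds
    # roughly the same number of observations.
    equal_freq <- cut(x, breaks = quantile(x, probs = seq(0, 1, length.out = 4)),
                      include.lowest = TRUE)

    table(equal_width)
    table(equal_freq)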
v 0.2
=====
- Improving unit testing and code coverage.
- Improving documentation.
- Solving minor bugs in date conversion and in the which_are functions.
- New features:
  - New functions:
    - *un_factor* to un-factor columns, when reading wasn't performed in the expected way.
    - *same_shape* to make sure that train and test set have exactly the same shape.
    - Generate new columns from existing columns (generate functions):
      - generate factors from dates: *generate_factor_from_date*;
      - diffDates becomes *generate_date_diffs* (for a more understandable name);
      - generate numerics and booleans from characters or factors (using *generate_from_factor* and *generate_from_character*).
    - *set_col_as_factor*: a function to set multiple columns as factor while controlling the number of unique elements.
  - New features in existing functions:
    - which_are functions: added a *keep_cols* argument to make sure that those columns are not dropped.
    - fast_filter_variables: *verbose* can be TRUE/FALSE or 0, 1, 2 in order to control the level of verbosity.
    - *findAndTransFormDates* and *set_col_as_dates* now recognize and accept timestamps.
WARNING:
- If you were using *diffDates*, it is now called *generate_date_diffs*.
- The *date2* column in the *messy_adult* data set has changed in order to illustrate the new timestamp features.
- *set_col_as_factorOrLogical* doesn't exist anymore: it has been split between *set_col_as_factor* and *generateFromCat*.
- Considering all those changes: *shape_set* and *prepare_set* don't give the same result anymore.

v 0.1: release on CRAN July 2017
================================
[1] "prepare_set: step one: correcting mistakes." [1] "fast_filter_variables: I check for constant columns." [1] "fast_filter_variables: I check for columns in double." [1] "fast_filter_variables: I check for columns that are bijections of another column." [1] "fast_filter_variables: I delete 1 column(s) that are bijections of another column in data_set." [1] "age" "fnlwgt" "capital_gain" "capital_loss" "hr_per_week" [1] "un_factor: c(\"age\", \"fnlwgt\", \"capital_gain\", \"capital_loss\", \"hr_per_week\") aren't columns of types factor i do nothing for those variables." [1] "un_factor: I will identify variable that are factor but shouldn't be." [1] "un_factor: I un-factor education." [1] "un_factor: I un-factor occupation." [1] "un_factor: I un-factor country." [1] "un_factor: It took me 0.01s to un-factor 3 column(s)." [1] "find_and_transform_numerics: It took me 0s to identify 0 numerics column(s), i will set them as numerics" [1] "find_and_transform_numerics: There are no numerics to transform.(If i missed something consider using set_col_as_numeric to transform it)" [1] "find_and_transform_dates: It took me 1.49s to identify formats" [1] "find_and_transform_dates: There are no dates to transform.\n (If i missed something please provide the date format in inputs or\n consider using set_col_as_date to transform it)." [1] "prepare_set: step two: transforming data_set."
[1] "age" "type_employer" "fnlwgt" "education" [5] "marital" "occupation" "relationship" "race" [9] "sex" "capital_gain" "capital_loss" "hr_per_week" [13] "country" "income" [1] "prepare_set: c(\"age\", \"type_employer\", \"fnlwgt\", \"education\", \"marital\", \"occupation\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"country\", \"income\") aren't columns of types date i do nothing for those variables." [1] "generate_date_diffs: I will generate difference between dates." [1] "generate_date_diffs: It took me 0s to create 0 column(s)." [1] "generate_factor_from_date: I will create a factor column from each date column." [1] "generate_factor_from_date: It took me 0s to transform 0 column(s)." [1] "age" "type_employer" "fnlwgt" "marital" [5] "relationship" "race" "sex" "capital_gain" [9] "capital_loss" "hr_per_week" "income" [1] "prepare_set: c(\"age\", \"type_employer\", \"fnlwgt\", \"marital\", \"relationship\", \"race\", \"sex\", \"capital_gain\", \"capital_loss\", \"hr_per_week\", \"income\") aren't columns of types character i do nothing for those variables." Saving _problems/test_prepare_set-15.R [1] "remove_sd_outlier: I start to filter categorical rare events" [1] "remove_sd_outlier: dropped 1 row(s) that are rare event on num_col." [1] "remove_sd_outlier: 1 have been dropped. It took 0.01 seconds. " [1] "remove_sd_outlier: I start to filter categorical rare events" [1] "remove_sd_outlier: dropped 0 row(s) that are rare event on num_col." [1] "remove_sd_outlier: 0 have been dropped. It took 0.01 seconds. " [1] "remove_rare_categorical: I start to filter categorical rare events" [1] "remove_rare_categorical: dropped 1 row(s) that are rare event on cat_col." [1] "remove_rare_categorical: 1 have been dropped. It took 0.03 seconds. " [1] "remove_percentile_outlier: I start to filter categorical rare events" [1] "remove_percentile_outlier: dropped 2 row(s) that are rare event on num_col." [1] "remove_percentile_outlier: 2 have been dropped. It took 0.01 seconds. " [1] "remove_percentile_outlier: I start to filter categorical rare events" [1] "remove_percentile_outlier: dropped 2 row(s) that are rare event on num_col." [1] "remove_percentile_outlier: 2 have been dropped. It took 0 seconds. " [1] "same_shape: verify that every column is present." [1] "same_shape: columns col_2 are missing, I create them." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_2 class was logical i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: the following columns are in data_set but not in reference_set: I drop them: " [1] "col_2" [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was character i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was character i set it to c(\"POSIXct\", \"POSIXt\")." 
[1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: col_1 class had different levels than in reference_set I change it." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: col_1 class had different levels than in reference_set I change it." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was numeric i set it to weird_class." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: col_1 class was numeric i set it to weird_class." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: columns type_employer?, type_employerFederal-gov, type_employerLocal-gov, type_employerNever-worked, type_employerPrivate, type_employerSelf-emp-inc, type_employerSelf-emp-not-inc, type_employerState-gov, type_employerWithout-pay, education11th, education12th, education1st-4th, education5th-6th, education7th-8th, education9th, educationAssoc-acdm, educationAssoc-voc, educationBachelors, educationDoctorate, educationHS-grad, educationMasters, educationPreschool, educationProf-school, educationSome-college, maritalMarried-AF-spouse, maritalMarried-civ-spouse, maritalMarried-spouse-absent, maritalNever-married, maritalSeparated, maritalWidowed, occupationAdm-clerical, occupationArmed-Forces, occupationCraft-repair, occupationExec-managerial, occupationFarming-fishing, occupationHandlers-cleaners, occupationMachine-op-inspct, occupationOther-service, occupationPriv-house-serv, occupationProf-specialty, occupationProtective-serv, occupationSales, occupationTech-support, occupationTransport-moving, relationshipNot-in-family, relationshipOther-relative, relationshipOwn-child, relationshipUnmarried, relationshipWife, raceAsian-Pac-Islander, raceBlack, raceOther, raceWhite, sexMale, capital_loss1408, capital_loss1564, capital_loss1573, capital_loss1719, capital_loss1762, capital_loss1887, capital_loss1902, capital_loss2042, capital_loss2179, countryCambodia, countryCanada, countryChina, countryColumbia, countryCuba, countryDominican-Republic, countryEcuador, countryEl-Salvador, countryEngland, countryFrance, countryGermany, countryGreece, countryGuatemala, countryHaiti, countryHoland-Netherlands, countryHonduras, countryHong, countryHungary, countryIndia, countryIran, countryIreland, countryItaly, countryJamaica, countryJapan, countryLaos, countryMexico, countryNicaragua, countryOutlying-US(Guam-USVI-etc), countryPeru, countryPhilippines, countryPoland, countryPortugal, countryPuerto-Rico, 
countryScotland, countrySouth, countryTaiwan, countryThailand, countryTrinadad&Tobago, countryUnited-States, countryVietnam, countryYugoslavia, income>50K are missing, I create them." [1] "same_shape: drop unwanted columns." [1] "same_shape: the following columns are in data_set but not in reference_set: I drop them: " [1] "type_employer" "education" "marital" "occupation" [5] "relationship" "race" "sex" "capital_loss" [9] "country" "income" [1] "same_shape: verify that every column is in the right type." [1] "same_shape: age class was integer i set it to numeric." [1] "same_shape: fnlwgt class was integer i set it to numeric." [1] "same_shape: education_num class was integer i set it to numeric." [1] "same_shape: capital_gain class was integer i set it to numeric." [1] "same_shape: hr_per_week class was integer i set it to numeric." [1] "same_shape: type_employer? class was logical i set it to numeric." [1] "same_shape: type_employerFederal-gov class was logical i set it to numeric." [1] "same_shape: type_employerLocal-gov class was logical i set it to numeric." [1] "same_shape: type_employerNever-worked class was logical i set it to numeric." [1] "same_shape: type_employerPrivate class was logical i set it to numeric." [1] "same_shape: type_employerSelf-emp-inc class was logical i set it to numeric." [1] "same_shape: type_employerSelf-emp-not-inc class was logical i set it to numeric." [1] "same_shape: type_employerState-gov class was logical i set it to numeric." [1] "same_shape: type_employerWithout-pay class was logical i set it to numeric." [1] "same_shape: education11th class was logical i set it to numeric." [1] "same_shape: education12th class was logical i set it to numeric." [1] "same_shape: education1st-4th class was logical i set it to numeric." [1] "same_shape: education5th-6th class was logical i set it to numeric." [1] "same_shape: education7th-8th class was logical i set it to numeric." [1] "same_shape: education9th class was logical i set it to numeric." [1] "same_shape: educationAssoc-acdm class was logical i set it to numeric." [1] "same_shape: educationAssoc-voc class was logical i set it to numeric." [1] "same_shape: educationBachelors class was logical i set it to numeric." [1] "same_shape: educationDoctorate class was logical i set it to numeric." [1] "same_shape: educationHS-grad class was logical i set it to numeric." [1] "same_shape: educationMasters class was logical i set it to numeric." [1] "same_shape: educationPreschool class was logical i set it to numeric." [1] "same_shape: educationProf-school class was logical i set it to numeric." [1] "same_shape: educationSome-college class was logical i set it to numeric." [1] "same_shape: maritalMarried-AF-spouse class was logical i set it to numeric." [1] "same_shape: maritalMarried-civ-spouse class was logical i set it to numeric." [1] "same_shape: maritalMarried-spouse-absent class was logical i set it to numeric." [1] "same_shape: maritalNever-married class was logical i set it to numeric." [1] "same_shape: maritalSeparated class was logical i set it to numeric." [1] "same_shape: maritalWidowed class was logical i set it to numeric." [1] "same_shape: occupationAdm-clerical class was logical i set it to numeric." [1] "same_shape: occupationArmed-Forces class was logical i set it to numeric." [1] "same_shape: occupationCraft-repair class was logical i set it to numeric." [1] "same_shape: occupationExec-managerial class was logical i set it to numeric." 
[1] "same_shape: occupationFarming-fishing class was logical i set it to numeric." [1] "same_shape: occupationHandlers-cleaners class was logical i set it to numeric." [1] "same_shape: occupationMachine-op-inspct class was logical i set it to numeric." [1] "same_shape: occupationOther-service class was logical i set it to numeric." [1] "same_shape: occupationPriv-house-serv class was logical i set it to numeric." [1] "same_shape: occupationProf-specialty class was logical i set it to numeric." [1] "same_shape: occupationProtective-serv class was logical i set it to numeric." [1] "same_shape: occupationSales class was logical i set it to numeric." [1] "same_shape: occupationTech-support class was logical i set it to numeric." [1] "same_shape: occupationTransport-moving class was logical i set it to numeric." [1] "same_shape: relationshipNot-in-family class was logical i set it to numeric." [1] "same_shape: relationshipOther-relative class was logical i set it to numeric." [1] "same_shape: relationshipOwn-child class was logical i set it to numeric." [1] "same_shape: relationshipUnmarried class was logical i set it to numeric." [1] "same_shape: relationshipWife class was logical i set it to numeric." [1] "same_shape: raceAsian-Pac-Islander class was logical i set it to numeric." [1] "same_shape: raceBlack class was logical i set it to numeric." [1] "same_shape: raceOther class was logical i set it to numeric." [1] "same_shape: raceWhite class was logical i set it to numeric." [1] "same_shape: sexMale class was logical i set it to numeric." [1] "same_shape: capital_loss1408 class was logical i set it to numeric." [1] "same_shape: capital_loss1564 class was logical i set it to numeric." [1] "same_shape: capital_loss1573 class was logical i set it to numeric." [1] "same_shape: capital_loss1719 class was logical i set it to numeric." [1] "same_shape: capital_loss1762 class was logical i set it to numeric." [1] "same_shape: capital_loss1887 class was logical i set it to numeric." [1] "same_shape: capital_loss1902 class was logical i set it to numeric." [1] "same_shape: capital_loss2042 class was logical i set it to numeric." [1] "same_shape: capital_loss2179 class was logical i set it to numeric." [1] "same_shape: countryCambodia class was logical i set it to numeric." [1] "same_shape: countryCanada class was logical i set it to numeric." [1] "same_shape: countryChina class was logical i set it to numeric." [1] "same_shape: countryColumbia class was logical i set it to numeric." [1] "same_shape: countryCuba class was logical i set it to numeric." [1] "same_shape: countryDominican-Republic class was logical i set it to numeric." [1] "same_shape: countryEcuador class was logical i set it to numeric." [1] "same_shape: countryEl-Salvador class was logical i set it to numeric." [1] "same_shape: countryEngland class was logical i set it to numeric." [1] "same_shape: countryFrance class was logical i set it to numeric." [1] "same_shape: countryGermany class was logical i set it to numeric." [1] "same_shape: countryGreece class was logical i set it to numeric." [1] "same_shape: countryGuatemala class was logical i set it to numeric." [1] "same_shape: countryHaiti class was logical i set it to numeric." [1] "same_shape: countryHoland-Netherlands class was logical i set it to numeric." [1] "same_shape: countryHonduras class was logical i set it to numeric." [1] "same_shape: countryHong class was logical i set it to numeric." [1] "same_shape: countryHungary class was logical i set it to numeric." 
[1] "same_shape: countryIndia class was logical i set it to numeric." [1] "same_shape: countryIran class was logical i set it to numeric." [1] "same_shape: countryIreland class was logical i set it to numeric." [1] "same_shape: countryItaly class was logical i set it to numeric." [1] "same_shape: countryJamaica class was logical i set it to numeric." [1] "same_shape: countryJapan class was logical i set it to numeric." [1] "same_shape: countryLaos class was logical i set it to numeric." [1] "same_shape: countryMexico class was logical i set it to numeric." [1] "same_shape: countryNicaragua class was logical i set it to numeric." [1] "same_shape: countryOutlying-US(Guam-USVI-etc) class was logical i set it to numeric." [1] "same_shape: countryPeru class was logical i set it to numeric." [1] "same_shape: countryPhilippines class was logical i set it to numeric." [1] "same_shape: countryPoland class was logical i set it to numeric." [1] "same_shape: countryPortugal class was logical i set it to numeric." [1] "same_shape: countryPuerto-Rico class was logical i set it to numeric." [1] "same_shape: countryScotland class was logical i set it to numeric." [1] "same_shape: countrySouth class was logical i set it to numeric." [1] "same_shape: countryTaiwan class was logical i set it to numeric." [1] "same_shape: countryThailand class was logical i set it to numeric." [1] "same_shape: countryTrinadad&Tobago class was logical i set it to numeric." [1] "same_shape: countryUnited-States class was logical i set it to numeric." [1] "same_shape: countryVietnam class was logical i set it to numeric." [1] "same_shape: countryYugoslavia class was logical i set it to numeric." [1] "same_shape: income>50K class was logical i set it to numeric." [1] "same_shape: verify that every factor as the right number of levels." [1] "same_shape: verify that every column is present." [1] "same_shape: drop unwanted columns." [1] "same_shape: verify that every column is in the right type." [1] "same_shape: verify that every factor as the right number of levels." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to scale 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to scale 1 numeric columns." [1] "fast_scale: I will scale 1 numeric columns." [1] "fast_scale: it took me: 0s to unscale 1 numeric columns." [1] "build_scales: I will compute scale on 1 numeric columns." [1] "build_scales: it took me: 0s to compute scale for 1 numeric columns." [1] "set_col_as_numeric: I will set some columns as numeric" [1] "set_col_as_numeric: I am doing the column char_col_1." [1] "set_col_as_numeric: 0 NA have been created due to transformation to numeric." [1] "set_col_as_numeric: I am doing the column char_col_2." [1] "set_col_as_numeric: 0 NA have been created due to transformation to numeric." [1] "set_col_as_character: I will set some columns as character" [1] "set_col_as_character: I am doing the column numCol." [1] "set_col_as_character: I am doing the column factorCol." [1] "set_col_as_character: I am doing the column charcol." 
[1] "set_col_as_character: charcol is a character, i do nothing." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date1." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: I am doing the column date2." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0.03s to transform 2 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date2." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0.02s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column date1." [1] "set_col_as_date:1 NA have been created due to transformation to Date." [1] "set_col_as_date: it took me: 0.05s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: it took me: 0s to transform 0 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: Since i generated only NAs i set ID as it was before." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column ID." [1] "set_col_as_date: ID doesn't seem to be a date, if it really is please provide format." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time_stamp_s." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_date: I will set some columns as Date." [1] "set_col_as_date: I am doing the column time_stamp_ms." [1] "set_col_as_date: it took me: 0.01s to transform 1 column(s) to Dates." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: it took me: 0s to transform 1 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: it took me: 0s to transform 1 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: I am doing the column col." [1] "set_col_as_factor: col has more than 2 values, i don't transform it." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." [1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." 
[1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming numerical variables into factors when length(unique(col)) <= 10." [1] "shape_set: Previous distribution of column types:" col_class_init factor integer 9 6 [1] "shape_set: Current distribution of column types:" col_class_end factor integer 9 6 [1] "set_col_as_factor: I will set some columns to factor." [1] "set_col_as_factor: it took me: 0s to transform 0 column(s) to factor." [1] "shape_set: Transforming logical into binaries.\n" [1] "shape_set: Previous distribution of column types:" col_class_init logical 1 [1] "shape_set: Current distribution of column types:" col_class_end integer 1 [1] "which_are_constant: constantCol is constant." [1] "which_are_constant: it took me 0.02s to identify 1 constant column(s)" [1] "which_are_in_double: it took me 0s to identify 2 column(s) to drop." [1] "which_are_in_double: it took me 0s to identify 1 column(s) to drop." [1] "which_are_in_double: it took me 0.01s to identify 1 column(s) to drop." [1] "which_are_in_double: it took me 0.01s to identify 0 column(s) to drop." [1] "which_are_bijection: it took me 0.11s to identify 1 column(s) to drop." [1] "which_are_bijection: education is a bijection of education_num. I put it in drop list." [1] "which_are_bijection: it took me 0.14s to identify 1 column(s) to drop." [1] "which_are_bijection: it took me 0s to identify 0 column(s) to drop." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column education." [1] "which_are_included: are_50_or_more is included in column age." [1] "which_are_included: constant is included in column sex." [1] "which_are_included: sex is included in column fnlwgt." [1] "which_are_included: income is included in column id." [1] "which_are_included: race is included in column fnlwgt." [1] "which_are_included: relationship is included in column id." [1] "which_are_included: type_employer is included in column fnlwgt." [1] "which_are_included: marital is included in column id." [1] "which_are_included: occupation is included in column id." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column id." [1] "which_are_included: capital_gain is included in column fnlwgt." [1] "which_are_included: capital_loss is included in column fnlwgt." [1] "which_are_included: country is included in column fnlwgt." [1] "which_are_included: hr_per_week is included in column id." [1] "which_are_included: age is included in column id." [1] "which_are_included: mail is included in column id." [1] "which_are_included: date2 is included in column id." [1] "which_are_included: date1 is included in column id." [1] "which_are_included: date3 is included in column date4." [1] "which_are_included: date4 is included in column id." [1] "which_are_included: num1 is included in column num3." [1] "which_are_included: num3 is included in column id." [1] "which_are_included: num2 is included in column id." [1] "which_are_included: fnlwgt is included in column id." [1] "which_are_included: constant is included in column sex." [1] "which_are_included: sex is included in column fnlwgt." 
[1] "which_are_included: income is included in column id." [1] "which_are_included: race is included in column fnlwgt." [1] "which_are_included: relationship is included in column id." [1] "which_are_included: type_employer is included in column fnlwgt." [1] "which_are_included: marital is included in column id." [1] "which_are_included: occupation is included in column id." [1] "which_are_included: education is included in column education_num." [1] "which_are_included: education_num is included in column id." [1] "which_are_included: capital_gain is included in column fnlwgt." [1] "which_are_included: capital_loss is included in column fnlwgt." [1] "which_are_included: country is included in column fnlwgt." [1] "which_are_included: hr_per_week is included in column id." [1] "which_are_included: age is included in column id." [1] "which_are_included: mail is included in column id." [1] "which_are_included: date2 is included in column id." [1] "which_are_included: date1 is included in column id." [1] "which_are_included: date3 is included in column date4." [1] "which_are_included: date4 is included in column id." [1] "which_are_included: num1 is included in column num3." [1] "which_are_included: num3 is included in column id." [1] "which_are_included: num2 is included in column id." [1] "which_are_included: fnlwgt is included in column id." [ FAIL 7 | WARN 0 | SKIP 1 | PASS 322 ] ══ Skipped tests (1) ═══════════════════════════════════════════════════════════ • empty test (1): ══ Failed tests ════════════════════════════════════════════════════════════════ ── Error ('test_generate_from_character.R:13:5'): generate_from_character: don't drop so generate 3 new cols ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 3/3 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, cols = "character_col") at test_generate_from_character.R:13:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_character.R:26:5'): generate_from_character: drop generate 3 col and suppress one ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, drop = TRUE) at test_generate_from_character.R:26:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_character.R:40:5'): generate_from_character: don't reduce number of rows even with NA ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_character(data_set, cols = "character_col") at test_generate_from_character.R:40:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:14:5'): generate_from_factor: drop: functionnal test on reference set ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 25/25 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::generate_from_factor(...) at test_generate_from_factor.R:14:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:27:5'): generate_from_factor: test don't drop => keep original col ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 2/2 in VECTOR_ELT Backtrace: ▆ 1. 
└─dataPreparation::generate_from_factor(...) at test_generate_from_factor.R:27:5 2. ├─data_set[, `:=`(c(new_col), .N), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_generate_from_factor.R:80:5'): build_encoding: min_frequency allows to drop rare values ── Error in ``[.data.table`(data_set, , `:=`(c("freq"), (.N/nrow(data_set))), by = col)`: attempt access index 1/1 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::build_encoding(...) at test_generate_from_factor.R:80:5 2. ├─data_set[, `:=`(c("freq"), (.N/nrow(data_set))), by = col] 3. └─data.table:::`[.data.table`(...) ── Error ('test_prepare_set.R:14:5'): prepare_set: functionnal test: test full pipeline. Should give result with as many rows as unique key. ── Error in ``[.data.table`(data_set, , `:=`(c(new_col), .N), by = col)`: attempt access index 15/15 in VECTOR_ELT Backtrace: ▆ 1. └─dataPreparation::prepare_set(...) at test_prepare_set.R:14:5 2. └─dataPreparation::generate_from_character(...) 3. ├─data_set[, `:=`(c(new_col), .N), by = col] 4. └─data.table:::`[.data.table`(...) [ FAIL 7 | WARN 0 | SKIP 1 | PASS 322 ] Error: ! Test failures. Execution halted Flavor: r-devel-linux-x86_64-fedora-gcc