Hi John,

I've been using the new version for a little while now, and I'm very
happy with the changes.  The caching is a fantastic addition; I've
been using it to store my (somewhat intensive) SQL queries.  Also, the
global config file makes it easier to see what is going on in the
project.  All in all, I've found no major regressions and plenty of
improvements.

However, there is one tiny problem I've had, and I'd like your input
on how to handle it.  Some libraries I load (such as tikzDevice) have
very verbose output, and I like to suppress that using the package
loading option "quietly=TRUE".  Is there a way to pass options to the
loading of packages in the global config file?
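
In case it helps frame the question, the stopgap I've been using outside the config file is just a small wrapper (a sketch; `load.quietly` is a made-up name, not anything ProjectTemplate provides):

```r
# Load a package while suppressing its startup chatter.
# suppressPackageStartupMessages() is base R; quietly and warn.conflicts
# are standard library() arguments.
load.quietly <- function(pkg) {
  suppressPackageStartupMessages(
    library(pkg, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)
  )
}

# e.g. load.quietly("tikzDevice")
```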

In a somewhat related vein, I load packages, and depending on their
availability, perform some action, like so:
# Register the multi-core backend for the "foreach" package, if available.
if (require(doMC, quietly = TRUE)) {
  registerDoMC()
}


I was thinking lib/ would be the place to put this sort of thing, but
I notice that only utilities.R is loaded from that directory.  Is
there a proper place for that kind of loading that I'm missing?

Thank you so much for your help, and this great package!

===

I'm sorry, I'm a bit concerned about the new version.

Some things that really pop out at me:

1) File names determine execution ordering.  This restricts my ability
to choose informative filenames in e.g. munge.

2) Reader functions are defined within load.project() which means I
can't use them from elsewhere.  Believe it or not, I might just need
to load in specific data from other locations.  I could always symlink
them into data in linux, but why make me do that?

3) Severely limits customizability.  This places major restrictions on
my ability to customize my workflow, outside of simply replacing the
entire inst/defaults directory.  For example, with the CRAN version, in
about ten minutes I edited the various scripts to use a custom data
directory to load datasets.  In a bit longer, I changed things so that
after every file was loaded, I had the opportunity to extract whatever
I wanted from it using a separately defined function.  Now, I can't do
any of that because it's built into the core logic of load.project().
I'd have to reinstall my own version of ProjectTemplate rather than
just replace the inst/defaults.


In general, I think that you're taking away flexibility from the user
and making important assumptions about their workflow that may not be
correct.  Quite frankly, if I had installed ProjectTemplate from CRAN
and gotten this, I would've thought it was interesting but not for
me.  It was the flexibility of the current release that really made me
start to think about it and ways I could use it in my workflow.


I do really like some of the new things, though.  In particular, I'm
glad to see some new workflow concepts, especially tests, cache and
config.  The setup for the cache is just awesome.  I think that a
place to put configuration options is great, but I'm not sure how I
feel about yaml.  I just don't quite see why plain ole R objects
weren't good enough.  Yaml doesn't particularly scare me, but it
certainly might scare non-programmers.  It's just another thing you're
forcing people to learn and I don't quite see what the added value is.


I just want to reiterate that I love this project.  I think it's
really helpful to develop a structure that promotes best practices.
The only reason I'm being so critical is because I think there's so
much potential (and need) for a great product.

A thought on the problem you say you're trying to solve:
"make it much easier to make revisions to ProjectTemplate in the
future without having to worry about existing projects falling behind
because of vestigial code that's not being automatically updated when
you install a new version of ProjectTemplate."
It seems like some of these issues could be solved by adding some kind
of project DESCRIPTION file to the template.  If a project contained
information about the template that generated it then you could use
that information to ensure backwards compatibility by loading projects
according to the version of ProjectTemplate that the project is
compatible with.  In the process, this would also result in built-in
support for multiple project templates.  As great as the default
template is, I'm probably always going to want to define my own.  It
would be great if I could add a template and choose the template I
want as an argument to create.project(...) rather than having to
replace the defaults as I am now.
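
To make that idea concrete, a hypothetical sketch of such a DESCRIPTION-style file (every field name and value here is invented):

```
Template: default
TemplateVersion: 0.3-5
Compatibility: >= 0.3
```

load.project() could then read the version fields and dispatch to the appropriate loading logic, and create.project() could use the Template field to pick a directory skeleton.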

====

Hi John, all,

I think Jamie's and Antonio's comments are really important, and represent the difficulty of defining a convention that's not so rigid as to alienate. In my opinion, ProjectTemplate definitely takes away some flexibility from the statistician in exchange for ease of use, code readability for the statistician's fans, and guidance toward 'good' software-carpentry practice, which isn't necessarily something a statistician comes armed with.

I'd like to specifically ask about Jamie's questions 2 and 3 (I'm sorry for referring to Jamie in the 3rd person like this - I'm not sure s/he's on the list?):

Question 2 is about the reader functions not being usable from elsewhere. Is it possible to provide some examples of what's meant by this? Does it mean Jamie wants to load files outside of the project's structure, in a way that's not taken care of by the database or URL readers, or does it mean Jamie doesn't want to load the data (or cleaned/cached data) all at once?

Question 3 is about limiting customizability, and the ability to shape one's own workflow. The customised workflow described uses a different data directory, and has an automatic preprocessing step after the loading of the data. Now that load.project hard-encourages use of the data / munge / cache structure (more about 'munge' below btw), this flexibility has been removed. Would it be possible to explain the dislike of the auto-preprocessing that's built into the new project template, and maybe what level of customisability would be required for the directory structures?

For example, if there were a config file, read at the instantiation of the project, that allowed the specification of where the data directory lived, (and all the other directories) would that address the criticism? I guess this is something along the lines of the DESCRIPTION file - something that maps the concepts of data science (or whatever all this is called) onto an actual file structure.

The last thing I wanted to chat about is that a few people, Jamie and Antonio included, have referred to things like YAML as programmer-y and not data-analyst-y. As someone who is definitely afraid of R objects, I was initially surprised at this. But, and especially after seeing Jeffery Horner's rRack presentation, maybe it does make sense to stick to a pure-R implementation. I'm not terribly enamoured with .Rprofile and all those option('') commands etc, but the yaml stuff should be replaceable with that no? While we lose the read- and write-ability of yaml, it's much better to lose that than to lose R programmers. 

Is there anything else about ProjectTemplate that's considered too focused at 'general' programmers, and alienating to specialised R programmers? 

Mike Dewar

P.S. I'm totally in favour of a config file that lets me change the directory names. That 'munge' directory still sounds faintly disgusting to my British ears.

===

Hi all,
I'm on the list, I just didn't realize I was replying only to John
rather than to the list.  I am also a 'he'.

Question 2 is about the reader functions not being usable from elsewhere. Is it possible to provide some examples of what's meant by this? Does it mean Jamie wants to load files outside of the project's structure, in a way that's not taken care of by the database or URL readers, or does it mean Jamie doesn't want to load the data (or cleaned/cached data) all at once?


I guess I hadn't really thought through the use cases too clearly.
They had primarily just seemed a great collection of easy to use data
processing tools and it seemed like it'd be nice to have them
accessible rather than locked up.  One possible reason might be if I
want to wait until after reading in either the configuration file or
one of the other datasets before deciding what else to load.  This
would make the most sense with something like a database query, but you
could also just change the yaml file containing the query when you
need to.  I'm not really sure about this, but I just thought I might
want to use them directly at some point.  I guess it really comes back
to the assumptions about data organization.  For me, I prefer to keep
many of my larger datasets rsynced in a separate directory, out of
version control, for each of the projects I use them for.  This
means I certainly need a configurable data directory, and may not even
be able to put all of my data in a single directory.  I've played
around with changing a data_dir configuration into a list, but that
makes the variable naming a bit more complicated.  One option would
then be for me to just manually load in whatever data I want whenever
I want with those handy little functions.
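
As a rough sketch of the kind of direct use I mean (`data.dir`, `PROJECT_DATA_DIR` and `load.one` are all made-up names):

```r
# Hypothetical: read one dataset on demand from a directory that lives
# outside the project, e.g. an rsynced, non-version-controlled store.
data.dir <- Sys.getenv("PROJECT_DATA_DIR", unset = path.expand("~/datasets"))

load.one <- function(name) {
  # Build the path from the configurable directory and assign the
  # resulting data frame into the global environment by name.
  path <- file.path(data.dir, paste(name, "csv", sep = "."))
  assign(name, read.csv(path), envir = .GlobalEnv)
}
```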

Question 3 is about limiting customizability, and the ability to shape one's own workflow. The customised workflow described uses a different data directory, and has an automatic preprocessing step after the loading of the data. Now that load.project hard-encourages use of the data / munge / cache structure (more about 'munge' below btw), this flexibility has been removed. Would it be possible to explain the dislike of the auto-preprocessing that's built into the new project template, and maybe what level of customisability would be required for the directory structures?


So for large files or even for lots of little ones, I might only be
interested in a subset of the data.  Sure, I can just load it all in
and extract what I'm interested in in data processing, but that means
I'm necessarily loading Everything into memory.  That may be
impossible or at least annoying if your operating system starts
swapping/caching everything other than R to disk.  So I made a little
data_extraction function that gets called every time a dataset gets
loaded in.  In my case right now, it just checks the variable name and
if it matches a query, subsets the time interval I'm interested in.
This leaves me with R using about 700-800 MB after load.project()
instead of more like 2.5 GB if I wait to do any subsetting until
afterward.  And sure, I could use a bigger machine, but I didn't
really want to.  I should note that I did end up making this work in
the new version as well.  You can still do anything with the new
ProjectTemplate, but you just might have to build and install your own
version.
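
For concreteness, the hook looks roughly like this (the function name, dataset name and cutoff are placeholders from my project, not anything ProjectTemplate provides):

```r
# Called once per dataset right after it is read in; returns the
# (possibly subsetted) value that actually gets kept in memory.
data.extraction <- function(name, value) {
  if (name == "queries" && "timestamp" %in% names(value)) {
    # Keep only the time interval I'm interested in.
    keep <- value$timestamp >= as.POSIXct("2011-01-01", tz = "UTC")
    value <- value[keep, ]
  }
  value
}
```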


For example, if there were a config file, read at the instantiation of the project, that allowed the specification of where the data directory lived, (and all the other directories) would that address the criticism? I guess this is something along the lines of the DESCRIPTION file - something that maps the concepts of data science (or whatever all this is called) onto an actual file structure.


I think that that would go a good way toward alleviating some of my
concerns, and if everything that I want to do was incorporated into
the package as an option, I might even be happy.  But in the end, I
still fully expect at some point to realize I want to do things a bit
differently, like add in a new configuration option like a project
version, to maintain multiple output and cache directories.  Now I can
certainly do that, and build and install my own version of
ProjectTemplate, but if, e.g. I've version-controlled it and I push my
changes to a server and try to run the project on the server, it'll
fail because I changed the ProjectTemplate code.  Really, what I want
is the ability to change anything I want to about how the project is
run without changing the ProjectTemplate code.  For me, the value in
ProjectTemplate is exactly as a ProjectTemplate, an initial template
for setting up a project, and Not as a built-in set of tools that I
have to use to run to load/setup the project.  The ideal setup for me
was when load just sourced a script in the project folder.  It was
completely transparent to me, I could change anything and it gave me
good initial project setup.  As far as I can tell, everything in the
new version could be changed in very simple ways to go back to that
setup without removing any of the awesome new features.  You could
move the logic of "load.project()" back into "lib/boot.R" or whatever
you want to call it and have things work exactly the same and I'd be
thrilled.

The last thing I wanted to chat about is that a few people, Jamie and Antonio included, have referred to things like YAML as programmer-y and not data-analyst-y. As someone who is definitely afraid of R objects, I was initially surprised at this. But, and especially after seeing Jeffery Horner's rRack presentation, maybe it does make sense to stick to a pure-R implementation. I'm not terribly enamoured with .Rprofile and all those option('') commands etc, but the yaml stuff should be replaceable with that no? While we lose the read- and write-ability of yaml, it's much better to lose that than to lose R programmers.


I'd probably fall into the programmer camp, so I don't really know how
non-programmers would react to yaml.  It's reasonable and it's simple
and intuitive, but it's still another tool for people to learn.  I
don't know the right way to setup config options.  I agree in that I
don't really care for the options("") approach either.  I frankly think
yaml is better than that.  I have to say, I'm coming around to yaml.
The basics of it are just so simple and intuitive that I would think
people wouldn't have too much trouble with it.  People might have more
experience with xml, but xml lists are not concise.


P.S. I'm totally in favour of a config file that lets me change the directory names. That 'munge' directory still sounds faintly disgusting to my British ears.


I'm from the US (as well as a grad student), and that name still
bothers me too.

===


(1) The older scaffold code approach to setting a project up clearly has virtues that both Jamie and Antonio have touched upon. But are people not concerned that generating lots of static code makes it difficult for projects to be updated when ProjectTemplate gets updated? As far as I can see, the older version of ProjectTemplate encouraged users to effectively write their own fork of the inst/defaults for every project. I'm inclined to think that a better architecture is to pull out some of the guts from load.project() into a special namespace and then allow users to override specific features by redefining certain readers.


Honestly, I'm not sure how much I really want any of my projects'
behavior to automatically change when I update ProjectTemplate.
Without knowing what kind of changes you're thinking, I see that as
surprising me down the road and not leaving me a happy camper.

I think in some ways each project is sort of fundamentally a fork of
inst/defaults in that you're copying inst/defaults and then making
project-specific changes.  I think it's good to think of ways to
minimize the amount of work people have to do in making project-
specific changes and I think the configurations will do a lot for
that.

You say that 'a better architecture is to pull out some of the guts
from load.project() into a special namespace and then allow users to
override specific features by redefining certain readers.'  One thing
to think about is that in doing this, most people will be making
changes to the default readers rather than writing their own from
scratch.  If nothing else, I don't think people are especially
familiar with some of the more reflexive functions (e.g. assign).  If
that's the case, I think it's worth thinking about where the reader
code is most accessible to people, in the package hidden inside a
function, or explicitly copied into their project where they're
already looking.  Sure, they can get the code by print(load.project)
but you lose formatting, comments, etc.

(2) I like YAML quite a lot, but we've had trouble with it on Windows, so I'm quite open to using an alternative configuration system. Unfortunately, I don't see any obviously superior options. We need a new alternative that works across all possible platforms. I'd like to hear suggestions for a configuration system that is easy for humans to read and that has a parser that's already been implemented in R.


I do have to say that I think I was wrong about yaml.  I was mainly
just concerned about making people learn another technology.  I think
some alternatives might be json and pure R.  json would have the same
disadvantage of making people learn a new technology.  Pure R might be
workable.  Just sourcing in a file with some variables defined could,
among other things, lead to name collisions.  The
options("") version would be overly verbose and prone to error
(forgetting to assign something in options()).  It might be okay to
source a file with some variables defined into a local environment and
then use that:
in config/configurations.R:

    data_loading <- TRUE

in the loading script:

    config <- new.env()
    sys.source('config/configurations.R', envir = config)
    ...
    if (config[['data_loading']]) {
      ...
    }

===

I'm sympathetic to wanting the "right" architecture, but I think it's
more important to think about how people get things done.  Example:
right now, the CSVReader mostly uses the default options of read.csv,
which, despite the documentation that says quote="\"'", has a value of
quote="\"".  This is a problem for me since one of my files uses '
quotes.  This is a simple but more importantly common problem!  How do
I fix it for my project with the current setup?  I can't.  You
mentioned a way that you might some day be able to do it by creating a
special namespace, but for me, today, my only way to change a single
option for read.csv is to build my own version of ProjectTemplate.

I just wanted to bring this up because it's an extremely simple
example of a way load.project didn't quite work for me, and right now,
the solution is to build my own version of ProjectTemplate.  Sure, you
can add this as an option to the configuration file, but I really
don't think you can ever cover all the bases.
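
To show the issue concretely in plain R, without ProjectTemplate involved:

```r
# read.csv defaults to quote = "\"", so a single-quoted field containing
# a comma gets split; passing quote = "\"'" (read.table's default) fixes it.
csv  <- "name,comment\nalice,'hello, world'\n"
bad  <- read.csv(text = csv, stringsAsFactors = FALSE)
good <- read.csv(text = csv, quote = "\"'", stringsAsFactors = FALSE)
good$comment[1]  # "hello, world"
```

With the current setup there is no place to pass that one extra argument through to read.csv.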

===

John,

I was checking out ProjectTemplate in more detail and had some initial thoughts. Since I'm not at all a user/analyst, my feedback could be totally off base, so feel free to discard liberally :-)

I like the idea of reducing initial clutter and conceptual space via minimal mode; however, once a project has been created in minimal mode there isn't a good way for users to get into full mode. There also isn't the suggestive power of empty directories (e.g. reports, graphs) to drive behavior. This means that minimal mode projects could easily get out of sync with full projects, thereby somewhat undermining the goal of strong shared conventions.

My initial thoughts about minimal mode were:

- Include lib/ in minimal mode
- Include tests/ in minimal mode
- Make sure that when logging: on is enabled the log/ directory is auto-created

This would leave the following directories out of minimal mode:

- diagnostics/
- profiling/
- reports/
- graphs/

Then I was thinking, what if we used one additional level of hierarchy to include everything all the time:

project/
  cache/
  config/
  data/
  lib/
  munge/
  output/
  log/
  reports/
  graphs/
  src/
  tests/
    testthat/
  diagnostics/
  profiling/
  README
  TODO

Anyway, just a thought. Let me know how this jibes with your own intuition and your observation about how users grok the system. The right answer could of course be to scrap minimal mode altogether.

In terms of multiple files in directories like lib/, diagnostics/, etc. What about a convention of having a root file which sources other files? For example, you currently source lib/helpers.R automatically and if the user wants other files sourced they need to add source statements to lib/helpers.R -- this seems good as the user can control order of execution, etc. 

However for some of the other directories you use 1.R, 2.R as a convention. What about using a similar convention as lib/, so diagnostics/run.R or profiling/run.R could source sub-files in a user-controlled order, or the user could source individual diagnostics or profiling files as needed? So something like this:

diagnostics/
  run.R
  validateFactors.R
  validateNames.R
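
A minimal sketch of what that root file might contain (`run.diagnostics` and the file list are hypothetical):

```r
# diagnostics/run.R: source each diagnostic in an explicit,
# user-controlled order; individual files remain runnable on their own.
run.diagnostics <- function(dir = "diagnostics",
                            files = c("validateFactors.R", "validateNames.R")) {
  for (f in files) {
    source(file.path(dir, f))
  }
}
```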

Question about caching: is it based on file time/date stamps or hashes? Does it encompass both changes to data files as well as to the munge scripts?

Okay, just some quick feedback with the proviso that I am not a user so realize that it could all be completely off-base!

Last thought: if we are going to bake this directly into RStudio we may want to solicit some broader feedback from users before it's frozen in place. Rather than try to make changes to the existing ProjectTemplate (as I'm suggesting above) we could:

- Keep it exactly as-is
- Add the hooks for ProjectTemplate to become an RStudio project type if the user has already installed ProjectTemplate.
- Do a blog post saying that we'd like to bake ProjectTemplate or something very much like it directly into RStudio, but want to make sure that it meets everyone's needs, strikes the right simplicity/flexibility balance, etc. 

We could then collect all of this input and either "bless" the existing ProjectTemplate design or use the feedback to create a new project called "RProjectTemplate" (so as not to have breaking changes for existing users of ProjectTemplate) which could be baked directly into the default installation of RStudio.

Anyway, look forward to discussing all of this on Tuesday!

J.J.
