The people involved in putting together the Cross Validated website contacted me recently for feedback. Cross Validated is a new site brought to you by the people who host the excellent Stack Overflow website.
I've been a regular user of Stack Overflow. It's a Q&A website where programmers ask and answer each other's questions. You want to find the best way to compute the union of two Python lists? You can find lots of people offering up their solutions; see here for instance. An initial question leads to a multi-threaded answer in which people suggest code and others critique and vote on the suggestions. One learns different solutions, as well as different solution criteria (e.g. speed of the code, style of the programming, conciseness, elegance). Most importantly, one can lift the code and place it directly into one's work.
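To make the union-of-lists example concrete, here is a minimal sketch of two answers such a thread might contain. It is not taken from any particular thread; the function names `union_ordered` and `union_unordered` are made up for illustration, and both versions assume the list elements are hashable.

```python
def union_ordered(a, b):
    """Union of two lists, dropping duplicates and keeping first-seen order."""
    # dict.fromkeys keeps insertion order (guaranteed from Python 3.7 on)
    return list(dict.fromkeys(a + b))


def union_unordered(a, b):
    """Union via sets: concise, but the order of the result is not guaranteed."""
    return list(set(a) | set(b))


first = [1, 2, 3, 2]
second = [3, 4, 1, 5]

print(union_ordered(first, second))            # [1, 2, 3, 4, 5]
print(sorted(union_unordered(first, second)))  # [1, 2, 3, 4, 5]
```

The two versions trade off along exactly the kinds of criteria mentioned above: the set version is the more concise, while the dict version preserves the order in which items first appear.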
Cross Validated is their attempt to do this for the statistics/data mining community. It's still in its infancy, with very few threads right now. Given my positive experience with Stack Overflow, I'd like to see this new site grow and prosper, but I see some challenges ahead:
- For clear-cut programming questions like "how do I turn the axis labels sideways in an R graphic?", this site could easily replicate the success of Stack Overflow. But lots of statistical questions are not so clear-cut. Something like "What should k be in a k-fold cross-validation?" is unlikely to generate anything like a majority opinion, and instead will likely degenerate into a skirmish over "Does cross-validation work?", which of course also does not have an answer.
- For general programming questions, it is often possible to extract little pieces of a whole program at which to shine a light. For instance, the union-of-lists question could be a tiny part of a very large program; other users need not know anything about that program to contribute to the thread. Many statistical problems, including some programming-related ones, do not have this modular property.
- Solutions to statistical problems are often data-specific. It is usually difficult to answer such questions without having a data set (at least a toy one) to play with. The same question may have different answers depending on which data is being analyzed. Is there a way to share data with other users?
- Let's say data sharing is solved. How do we encourage people to participate? It's easy to put up some sample Python/Java/C code, but it's quite a bit of effort to write code, test it on a data set, and then write up one's answers.
- Something like finding the union of Python lists will have the same solution for any user and any problem. But I don't think it helps me to know that someone else had success with some particular methodology on an unrelated data set.
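For readers unfamiliar with the k-fold question raised in the first point, the mechanics can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the helper name `k_fold_indices` is hypothetical, the folds are contiguous, and no shuffling or stratification is done, which a real analysis would usually want.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds; yield (train, test) index lists."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    base, extra = divmod(n, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional item so sizes differ by at most 1
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


# With 10 observations, compare how k changes the train/test sizes per fold
for k in (2, 5, 10):
    sizes = [(len(train), len(test)) for train, test in k_fold_indices(10, k)]
    print(k, sizes[0])
```

The split itself is as mechanical as the union of two lists. What the code cannot say is which k is appropriate; that depends on the data and the model being assessed.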
PS. See also this Gelman post about R-help, which is a mailing list for R users. As I said there in a comment, I think R-help should be thought of as "tech support", where users can get answers from developers, while Stack Overflow is a "user community" in which users help each other. At least that's how I feel when I interact with these two services.