The CHILDES Treebank
The CHILDES Treebank corpus is derived from several corpora from the American English section of CHILDES (MacWhinney, B. 2000. The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum Associates). The goal was to annotate child-directed speech utterance transcriptions with phrase structure tree information. For each corpus included, the following was done:
- The selected utterances (either all child-directed speech utterances, or only those containing wh-words) were first parsed automatically with the Charniak parser (pre-2012; available here) or the Stanford parser (post-2012; available here).
- Then, the output from the parser was hand-checked by an undergraduate annotator who had been trained in syntactic tree structure.
- The output from the first annotator was hand-checked by a second undergraduate annotator who had been trained in syntactic tree structure.
- Where trace, animacy, or thematic role annotation is included, it was added to the previously corrected versions of the output.
- Avery Andrews: bates-wh.parsed, bernstein-wh.parsed, brown-adam.parsed, brown-eve.parsed, brown-sarah.parsed, soderstrom.parsed, valian.parsed
- Alandi Bates: valian.parsed
- Bob Berwick & Aline Villavicencio: brown-adam.parsed
- Kyle Gorman: bates-wh.parsed, bernstein-wh.parsed, brown-adam.parsed, brown-eve.parsed, brown-sarah.parsed, suppes.parsed, valian.parsed, vanhouten-threes-wh.parsed, vanhouten-twos-wh.parsed, vankleeck-wh.parsed
- Spencer Perry: brown-eve.parsed
- Abbie Thornton: brown-eve.parsed, brown-eve+animacy+theta.parsed, brown-adam3to4+animacy+theta.parsed, brown-adam4up+animacy+theta.parsed
PLEASE NOTE: The hope was that this process would remove the errors introduced by automatic parsing. However, errors may remain, and we strongly suggest that users of these data review any automated extraction results to make sure they are accurate. If you find syntactic annotation errors, please email Lisa Pearl (email address on the side bar) with what they are and where you found them, and we'll happily update the files.
Recommended tool for automatically searching through CHILDES Treebank trees:
Tregex, from the Stanford Natural Language Processing Group
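For a quick look at the data before setting up Tregex, the following minimal Python sketch shows the kind of search such a tool automates: it reads Penn Treebank-style bracketed trees and keeps those containing a node whose label starts with WH. The two sample trees, the function names, and the assumption that a WH label prefix marks wh-phrases (as in Penn Treebank notation) are illustrative only and are not taken from the corpus files; please check the included readme for the labels actually used.

```python
def tokenize(text):
    """Split bracketed tree text into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse_trees(text):
    """Parse every top-level bracketed tree into nested lists.

    A node becomes [label, child, ...]; words are plain strings.
    """
    trees, stack = [], []
    for tok in tokenize(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            node = stack.pop()
            if stack:
                stack[-1].append(node)
            else:
                trees.append(node)
        elif stack:
            stack[-1].append(tok)
        # Tokens outside any bracket are ignored.
    if stack:
        raise ValueError("unbalanced opening bracket")
    return trees


def labels(node):
    """Yield every node label in a parsed tree, depth-first."""
    if isinstance(node, str):
        return
    children = node
    if node and isinstance(node[0], str):
        yield node[0]
        children = node[1:]
    for child in children:
        yield from labels(child)


def contains_label(tree, prefix):
    """True if any node label in the tree starts with the given prefix."""
    return any(label.startswith(prefix) for label in labels(tree))


# Hypothetical sample input, not copied from the CHILDES Treebank files.
sample = """
(ROOT (SBARQ (WHNP (WP what)) (SQ (VBP do) (NP (PRP you)) (VP (VB want))) (. ?)))
(ROOT (S (NP (PRP you)) (VP (VBP want) (NP (DT the) (NN ball))) (. .)))
"""

trees = parse_trees(sample)
wh_trees = [tree for tree in trees if contains_label(tree, "WH")]
print(len(wh_trees), "of", len(trees), "sample trees contain a WH-labeled node")
```

As with any automated extraction, the matches from a sketch like this should be reviewed by hand before they are used in an analysis.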
Download the current CHILDES Treebank here (.zip file) (updated January 3, 2017)
July 2016 (.zip file)
January 2016 (.zip file)
May 2015 (.zip file)
Dec 2014, v2 (.zip file)
Dec 2014, v1 (.zip file)
Jul 2014 (.zip file)
Mar 2014 (.zip file)
Mar 2013 (.zip file)
Aug 2012 (.zip file)
Feb 2012 (.zip file)
Oct 2011 (.zip file)
Sep 2011 (.zip file)
The corpora currently included in the CHILDES Treebank are as follows:
All child-directed speech utterances
- Brown/Adam (includes trace annotation; the 3to4 and 4up subsections also include animacy and thematic role annotation)
- Brown/Eve (includes trace, animacy, and thematic role annotation)
- Valian (includes trace, animacy, and thematic role annotation)
Only child-directed speech utterances containing wh-words
The phrase structure tree annotation is similar to the Penn Treebank II notation, with a few exceptions noted in the included readme file. The trace annotation, animacy annotation, and thematic role annotation are also documented in the included readme file.
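To illustrate what Penn Treebank-style bracketing looks like in practice, here is a small Python sketch that pulls the part-of-speech tag and word out of each innermost bracket pair. The sample tree is hypothetical and written by analogy with Penn Treebank II notation; the exceptions and the trace, animacy, and thematic role annotations described in the readme may require adjusting the pattern.

```python
import re

# Matches a preterminal such as "(WP what)": a tag followed by one word,
# with no nested brackets inside.
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def tagged_words(tree_text):
    """Return (tag, word) pairs in left-to-right order."""
    return PRETERMINAL.findall(tree_text)


# Hypothetical sample tree, not copied from the CHILDES Treebank files.
sample_tree = (
    "(ROOT (SBARQ (WHNP (WP what)) "
    "(SQ (VBP do) (NP (PRP you)) (VP (VB want))) (. ?)))"
)

pairs = tagged_words(sample_tree)
words = [word for _tag, word in pairs]
print(" ".join(words))  # prints: what do you want ?
```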
If using these corpora in published materials, please cite one or more of the following:
- Pearl, L. & Sprouse, J. 2013. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20, 23-68. [lingbuzz]
- Pearl, L. & Sprouse, J. 2013. Computational Models of Acquisition for Islands. In J. Sprouse & N. Hornstein (eds), Experimental Syntax and Island Effects. Cambridge University Press, 109-131.