CHILDES Treebank

Current Version

The CHILDES Treebank is derived from several corpora in the American English section of CHILDES (MacWhinney, B. 2000. The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum Associates). The goal was to annotate transcriptions of child-directed speech utterances with phrase structure trees. Each included corpus was processed as follows:

The selected utterances (either all child-directed speech utterances, or only those containing wh-words) were first parsed automatically, using the Charniak parser (pre-2012) or the Stanford parser (post-2012).
The parser output was then hand-checked by an undergraduate annotator trained in syntactic tree structure.
The first annotator's output was hand-checked by a second undergraduate annotator, also trained in syntactic tree structure.
Where a corpus has trace, animacy, or thematic role annotation, these annotations were added to the previously corrected output.

These files have additionally benefited from some error-checking from the following researchers:

Avery Andrews: bates-wh.parsed, bernstein-wh.parsed, brown-adam.parsed, brown-eve.parsed, brown-sarah.parsed, soderstrom.parsed, valian.parsed
Alandi Bates: valian.parsed, hslld-*/*.parsed
Bob Berwick & Aline Villavicencio: brown-adam.parsed
Alex Clark: all *.parsed files
Kyle Gorman: bates-wh.parsed, bernstein-wh.parsed, brown-adam.parsed, brown-eve.parsed, brown-sarah.parsed, suppes.parsed, valian.parsed, vanhouten-threes-wh.parsed, vanhouten-twos-wh.parsed, vankleeck-wh.parsed
Spencer Perry: brown-eve.parsed
Abbie Thornton: brown-eve.parsed, brown-eve+animacy+theta.parsed, brown-adam3to4+animacy+theta.parsed, brown-adam4up+animacy+theta.parsed

PLEASE NOTE: This checking process was intended to remove the errors introduced by automatic parsing. However, errors may remain, and we strongly suggest that users of these data review any automated extraction results to make sure they are accurate. If you find syntactic annotation errors, please email Lisa Pearl (email address on the sidebar) with what they are and where you found them, and we'll happily update the files.

Recommended tool for automatically searching through CHILDES Treebank trees:
Tregex, from the Stanford Natural Language Processing Group
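
As a rough illustration of the kind of automated search this refers to, the Python sketch below counts wh-phrase labels in bracketed tree strings. It is not Tregex and does not use Tregex syntax; it is a plain regular-expression stand-in, and Tregex itself can also express structural relations between nodes that this cannot. The example trees are invented for illustration, and the assumption that wh-phrase labels start with "WH" (as in Penn Treebank-style labels such as WHNP) should be checked against the included readme file.

```python
import re
from collections import Counter

# Assumption: wh-phrase labels follow the Penn Treebank-style convention
# of starting with "WH" (e.g. WHNP, WHADVP). Any suffix after the
# capital letters (such as an index) is ignored by this pattern.
WH_LABEL = re.compile(r"\((WH[A-Z]+)")


def count_wh_labels(trees):
    """Count wh-phrase labels across an iterable of bracketed tree strings."""
    counts = Counter()
    for tree in trees:
        counts.update(WH_LABEL.findall(tree))
    return counts


# Invented example trees, not taken from the corpus files.
example_trees = [
    "(ROOT (SBARQ (WHNP (WP what)) (SQ (VBZ is) (NP (DT that))) (. ?)))",
    "(ROOT (S (NP (PRP you)) (VP (VBP like) (NP (NNS cookies))) (. .)))",
]

wh_counts = count_wh_labels(example_trees)
print(wh_counts)
```

Counts produced this way are only as reliable as the pattern and the annotation, so spot-checking a sample of the matched trees by hand is still advisable.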

Download the current CHILDES Treebank here (.zip file) (updated Aug 3, 2020)
Previous versions:
April 2019 (.zip file)
March 2018 (.zip file)
January 2017 (.zip file)
July 2016 (.zip file)
January 2016 (.zip file)
May 2015 (.zip file)
Dec 2014, v2 (.zip file)
Dec 2014, v1 (.zip file)
Jul 2014 (.zip file)
Mar 2014 (.zip file)
Mar 2013 (.zip file)
Aug 2012 (.zip file)
Feb 2012 (.zip file)
Oct 2011 (.zip file)
Sep 2011 (.zip file)

The corpora currently included in the CHILDES Treebank are as follows:

All child-directed speech utterances

Brown/Adam (**Includes trace annotation; ***3to4 and 4up subsections additionally include animacy and thematic role annotation)
Brown/Eve (**Includes trace, animacy, and thematic role annotation)
Brown/Sarah
HSLLD: HV1-ER and HV1-MT subsections (**Includes trace annotation)
Soderstrom
Suppes
Valian (**Includes trace, animacy, and thematic role annotation)

Only child-directed speech utterances containing wh-words

Bates
Bernstein
VanHouten/Threes
VanHouten/Twos
VanKleeck

The phrase structure tree annotation is similar to the Penn Treebank II notation, with a few exceptions noted in the included readme file. The trace annotation, animacy annotation, and thematic role annotation are also documented in the included readme file.
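
To make the bracketed notation concrete, here is a minimal Python sketch that reads one Penn Treebank-style bracketed tree into nested tuples and recovers the words at its leaves. The example tree is invented, and the sketch assumes plain `(LABEL child ...)` bracketing with whitespace-separated tokens; the treebank-specific exceptions and the trace, animacy, and thematic role conventions described in the readme are not handled here.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves (words) are kept as plain strings. Assumes well-formed,
    balanced brackets; malformed input raises an error.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # A node may lack a label, e.g. an outer wrapper "( (S ...) )".
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def leaves(node):
    """Return the words of a parsed tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    _, children = node
    words = []
    for child in children:
        words.extend(leaves(child))
    return words


# Invented example tree, not taken from the corpus files.
tree = parse_tree(
    "(ROOT (SBARQ (WHNP (WP what)) (SQ (VBZ is) (NP (DT that))) (. ?)))"
)
print(tree[0])
print(leaves(tree))
```

Keeping the parsed form as simple tuples makes it easy to inspect by hand, which helps when comparing extraction results against the original files.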

If using these corpora in published materials, please cite one or more of the following:

For phrase structure and trace annotation:
- Pearl, L. & Sprouse, J. 2013. Syntactic islands and learning biases: Combining experimental syntax and computational modeling to investigate the language acquisition problem. Language Acquisition, 20, 23-68. [lingbuzz]
- Pearl, L. & Sprouse, J. 2013. Computational Models of Acquisition for Islands. In J. Sprouse & N. Hornstein (eds), Experimental Syntax and Island Effects. Cambridge University Press, 109-131.
For thematic role and animacy annotation:
- Pearl, L. & Sprouse, J. 2019. Comparing solutions to the linking problem using an integrated quantitative framework of language acquisition. Language, 95(4), 583-611. [lingbuzz]