This version -- last updated March 2014

The CHILDES Treebank corpus is derived from several corpora in the American English section of CHILDES (MacWhinney, B. 2000. The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum Associates). The goal was to annotate transcribed child-directed speech utterances with phrase structure tree information. Each included corpus was processed as follows:

  1. The selected utterances (either all child-directed speech utterances, or only those containing wh-words) were first parsed automatically with the Charniak parser (available here).
  2. The parser output was then hand-checked by an undergraduate annotator who had been trained in syntactic tree structure.
  3. The first annotator's output was hand-checked by a second undergraduate annotator who had also been trained in syntactic tree structure.

PLEASE NOTE: The hope was that this process would remove the errors introduced by automatic parsing. However, errors may remain, and we strongly suggest that users of these data review any automated extraction results for accuracy. If you find syntactic annotation errors, please email Lisa Pearl (email address on the sidebar) with what they are and where you found them, and we'll happily update the files.

Recommended tool for automatically searching through CHILDES Treebank trees: Tregex, from the Stanford Natural Language Processing Group.

Download the current CHILDES Treebank here (.zip file)
Previous versions:
Mar 2013 (.zip file)
Aug 2012 (.zip file)
Feb 2012 (.zip file)
Oct 2011 (.zip file)
Sep 2011 (.zip file)

The corpora currently included in the CHILDES Treebank are as follows:

All child-directed speech utterances

  • Brown/Adam
  • Brown/Eve
  • Brown/Sarah
  • Soderstrom
  • Suppes
  • Valian

Only child-directed speech utterances containing wh-words

  • Bates
  • Bernstein
  • VanHouten/Threes
  • VanHouten/Twos
  • VanKleeck
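
For readers who want to apply a similar selection to their own transcripts, the wh-word criterion can be sketched roughly as below. This is only an illustration: the `WH_WORDS` list and the `contains_wh_word` helper are assumptions made for this example, not the criteria or code actually used to build the CHILDES Treebank.

```python
import re

# Assumed wh-word list, for illustration only. The list actually used to
# select utterances for the CHILDES Treebank is not stated in this page.
WH_WORDS = {"what", "who", "whom", "whose", "which", "where", "when", "why", "how"}


def contains_wh_word(utterance):
    """Return True if any token of the utterance is a wh-word.

    Contractions such as "what's" are matched on the part before the apostrophe.
    """
    tokens = re.findall(r"[a-z']+", utterance.lower())
    return any(tok.split("'")[0] in WH_WORDS for tok in tokens)


utterances = ["What's that?", "Look at the doggie.", "Where did it go?"]
selected = [u for u in utterances if contains_wh_word(u)]
print(selected)
```

Matching on whole tokens rather than substrings avoids false hits such as "show" matching "how".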

The phrase structure tree annotation is similar to the Penn Treebank II notation, with a few exceptions noted in the included readme file.
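
As a rough illustration of what Penn Treebank-style bracketed notation looks like and how it can be read programmatically, here is a minimal sketch in plain Python. The example tree is invented for this sketch and is not taken from the corpus, and the helper names (`parse_tree`, `leaves`, `find_label`) are hypothetical. The CHILDES Treebank's own conventions differ in the ways described in the readme, so check any results against it.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Leaves are plain strings. Malformed input raises ValueError or IndexError.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        # Some treebank files wrap each tree in an unlabeled outer bracket.
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def find_label(node, label):
    """Return every subtree whose label equals the given label."""
    if isinstance(node, str):
        return []
    found = [node] if node[0] == label else []
    for child in node[1]:
        found.extend(find_label(child, label))
    return found


# Invented example in Penn Treebank-style notation (not from the corpus).
example = "(SBARQ (WHNP (WP what)) (SQ (VBZ is) (NP (DT that))) (. ?))"
tree = parse_tree(example)
print(leaves(tree))                      # ['what', 'is', 'that', '?']
for wh_phrase in find_label(tree, "WHNP"):
    print(" ".join(leaves(wh_phrase)))   # what
```

For anything beyond simple label lookups, a dedicated tree-search tool is likely a better fit than hand-rolled traversal like this.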

If using these corpora in published materials, please cite one or more of the following: