DA, Jun: Reading news for information: How much vocabulary a CFL learner should know
Reading involves identifying words in texts and forming an understanding from them. Past research reveals that vocabulary knowledge is the single most important factor contributing to reading comprehension (Laflamme 1997). Educated native English speakers know about 20,000 word families (Goulden, Nation and Read, 1990; Zechmeister, Chronis, Cull, D’Anna and Healy, 1995), where a word family is defined as a headword, its inflected forms and its closely related derived forms (from affixation, etc.). While readers can decipher the meaning of words as they read, research shows that adequate reading comprehension depends on a person already knowing between 90 and 95 percent of the words in a text (Nagy 2000). Laufer (1989), Liu and Nation (1985) and Hu and Nation (2000) have found that unless at least 95% of the running words in a text are known, the probability of successfully guessing unknown words is severely reduced. Carver (1994) suggests a similar rate of 98-99% for native speakers for reading to be pleasurable.
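To make the notion of coverage rate concrete, the following is a minimal sketch of how it could be computed, assuming the text has already been segmented into running words (tokens). The function names `coverage_rate` and `words_needed_for_coverage` are illustrative only and are not taken from any of the studies cited above.

```python
from collections import Counter


def coverage_rate(tokens, known_words):
    """Return the fraction of running words (tokens) found in known_words."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


def words_needed_for_coverage(tokens, target):
    """Return how many of the most frequent word types are needed
    to cover at least `target` (e.g. 0.95) of the running words."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    # Add word types from most to least frequent until the target is met.
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered / total >= target:
            return n
    return len(counts)


if __name__ == "__main__":
    # Toy data: 10 running words, 4 distinct word types.
    sample = ["a", "b", "a", "c", "a", "b", "d", "a", "a", "a"]
    print(coverage_rate(sample, {"a", "b"}))
    print(words_needed_for_coverage(sample, 0.95))
```

On real texts the second function illustrates why high coverage thresholds are demanding: the last few percentage points of coverage typically require many low-frequency word types.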
Compared with the many studies on the vocabulary size and coverage rate necessary for adequate reading comprehension in English, there are far fewer comparable studies on Chinese and on Chinese native speakers or learners. One major cause of the problem is that we still do not have a comprehensive collection of the vocabulary in written Chinese or a good estimate of its size, even though there have been many statistical studies on Chinese characters (cf. Da 2004).
The lack of comprehensive information about vocabulary in written Chinese also creates problems for CFL (Chinese as a Foreign Language) learning and instruction. While general-purpose wordlists such as the HSK wordlists have been made available for CFL learners, they contain mostly high-frequency words or phrases that are more appropriate for beginning or intermediate level learners. Past research suggests that the number of words needed for the reading of technical texts such as science texts or newspapers is larger than for less formal texts (Waring and Nation 2004). In order to achieve good comprehension of authentic Chinese texts, especially technical texts or news articles, CFL learners need to know not only high-frequency words but also other academic, technical and low-frequency words or phrases.
In this paper, we report the findings of an ongoing corpus-based research project that is intended to provide a partial solution to the above-mentioned problem. The main objectives of the project are 1) to provide an estimate of the size of the vocabulary used in current Chinese news media; and 2) to build a comprehensive list of the two- and three-character words and phrases used in Chinese news reports. We chose news articles in the belief that news reading is a typical case of reading for information, where vocabulary knowledge plays an essential role in adequate reading comprehension.
Data used in this study were collected from the Internet over a period of two years. Our news corpus contains more than 20,000 news summaries from various Chinese media sources, including those in mainland China, Taiwan and Hong Kong. They cover a wide range of topics including, but not limited to, politics, culture, military affairs, society, sports and entertainment.
Both automatic methods and manual editing will be used to generate subject-based lists of two- and three-character words and phrases as well as common proper names. The identified words will in turn serve as headwords, on the basis of which the size of the vocabulary used in the Chinese news media can be estimated. It is expected that the wordlists will include not only high-frequency words and phrases but also low-frequency words and technical terms that are essential for adequate news reading comprehension.
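The abstract does not specify which automatic methods are used. As one possible illustration, the sketch below counts two- and three-character sequences in unsegmented Chinese text and keeps those that recur, yielding candidates for later manual editing. The name `candidate_ngrams`, the frequency threshold `min_count`, and the restriction to the basic CJK Unified Ideographs range are all assumptions made for this example, not details of the project itself.

```python
import re
from collections import Counter

# Runs of characters in the basic CJK Unified Ideographs block (assumed range).
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def candidate_ngrams(texts, sizes=(2, 3), min_count=2):
    """Count character n-grams of the given sizes within runs of Chinese
    characters and return those seen at least min_count times."""
    counts = Counter()
    for text in texts:
        # Punctuation and non-Chinese characters split the text into runs,
        # so no n-gram spans a punctuation mark.
        for run in HAN_RUN.findall(text):
            for n in sizes:
                for i in range(len(run) - n + 1):
                    counts[run[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


if __name__ == "__main__":
    toy_corpus = ["中国新闻报道", "新闻媒体，中国"]
    print(candidate_ngrams(toy_corpus))
```

Raw frequency alone will admit many sequences that are not words, which is one reason a manual editing pass, as described above, would still be needed.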
It is hoped that results from this study will be useful for future CFL reading research and instruction. For example, wordlists obtained from this study can be incorporated into advanced news reading instruction. Information about the vocabulary used in Chinese news media (such as its size and coverage rate) can serve as an independent measure for future research on CFL learners’ deciphering processes and difficulties in news reading.