ISA 14th World Congress Montreal: RC35: Session 'Social Sciences in the World Wide Web'

Text Analysis of Data in the World Wide Web

This paper deals with the possibilities of obtaining textual data from the World Wide Web. It first considers the different kinds of sources, such as homepages, overviews, and link pages. A more technical aspect is the range of formats in which the information appears, such as text, graphics, or animations. Before analysis, the text must be transformed into a format that can be analysed: it must be separated into text units, and external variables have to be defined. The problems arising from these prerequisites of any text analysis are discussed. Finally, the paper deals with the different standards of text encoding, such as HTML, VRML, XML, and TEI, and with the lack of text encoding standards in current text analysis software.
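As an illustration of the preparation step described above, the following minimal sketch reduces an HTML page to plain text, separates it into text units, and attaches external variables to each unit. It assumes that one HTML paragraph corresponds to one text unit, and the names TextUnitExtractor, html_to_text_units, source_type, and url are hypothetical choices made for this example; the paper itself does not prescribe them.

```python
from html.parser import HTMLParser


class TextUnitExtractor(HTMLParser):
    """Collect the text of each <p> element as one text unit (an assumption)."""

    def __init__(self):
        super().__init__()
        self.units = []
        self._buffer = None  # None while the parser is outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # close an unterminated previous paragraph
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Inline markup such as <b> is dropped; only its text is kept.
        if self._buffer is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._buffer is not None:
            # Normalise whitespace so line breaks in the markup do not matter.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.units.append(text)
        self._buffer = None


def html_to_text_units(html, external_variables):
    """Return one record per text unit, each carrying the external variables."""
    parser = TextUnitExtractor()
    parser.feed(html)
    parser.close()
    return [
        {"unit": number, "text": text, **external_variables}
        for number, text in enumerate(parser.units, start=1)
    ]


# Hypothetical sample page and external variables, for illustration only.
sample = (
    "<html><body><h1>Home</h1>"
    "<p>First <b>paragraph</b>.</p>"
    "<p>Second\n paragraph.</p>"
    "</body></html>"
)
records = html_to_text_units(
    sample, {"source_type": "homepage", "url": "http://example.org/"}
)
for record in records:
    print(record)
```

Other definitions of a text unit (sentences, whole pages) and other external variables (retrieval date, language) would fit the same record structure; the choice depends on the research question.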