BayesFor.eu

Bayesian web spidering

Application of geochemical and chemical models to word frequency data from web information flow


We have developed a method to normalize the flow of word occurrences and presented it in a working paper. This is the abstract:

In the present work we applied geochemical concepts and the normalization method from Secondary Ion Mass Spectrometry (SIMS) to analyze word frequency data from web information sources. Bayes-Swarm is a research project that aims to design and build an engine to extract information from internet sources (news portals, newspaper and news agency websites, blogs, etc.), mainly homepages and economic and political pages. Once a day, every page goes through a processing pipeline whose main steps are: (i) removal of formatting tags and punctuation, (ii) removal of “empty words” (i.e., conjunctions, articles and other function words), (iii) extraction of word roots. The number of occurrences of every word (“word frequency”) is then stored in a database, together with the web pages analyzed by the software. We show that empty words behave as conservative elements in the web information flow: ratios between empty words are constant and independent of time and of word flow volume (Figure 1). Empty words thus offer a computationally convenient way to normalize word frequency data. Some examples of how this normalization procedure works are shown and discussed.
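The three processing steps named in the abstract can be sketched roughly as follows. This is a minimal illustration, not the Bayes-Swarm implementation: the function names, the short empty-word list, and the toy root extractor (which only strips one trailing vowel) are all assumptions made for the example. Empty-word counts are kept in a separate counter here, on the assumption that they must remain available for the normalization discussed later.

```python
import re
from collections import Counter

# Illustrative subset only (assumption); the project's real empty-word list is not given.
EMPTY_WORDS = {"che", "per", "con", "del", "della", "il", "la", "di", "e"}


def strip_markup(html):
    """Step (i): drop formatting tags and punctuation, return lowercase tokens."""
    text = re.sub(r"<[^>]+>", " ", html)           # remove anything that looks like a tag
    return re.findall(r"[^\W\d_]+", text.lower())  # keep runs of letters only


def toy_root(word):
    """Step (iii) placeholder: strip one trailing vowel from words longer than 3 letters.

    A real system would use a proper stemmer; this is a hypothetical stand-in.
    """
    if len(word) > 3 and word[-1] in "aeiou":
        return word[:-1]
    return word


def count_words(html):
    """Return (content_counts, empty_counts) for one page."""
    tokens = strip_markup(html)
    # Step (ii): empty words are separated from the content words.
    empty = Counter(t for t in tokens if t in EMPTY_WORDS)
    content = Counter(toy_root(t) for t in tokens if t not in EMPTY_WORDS)
    return content, empty


content, empty = count_words("<p>Il governo dice che la crisi, per ora, continua.</p>")
```

In this sketch `content` holds the root counts that would be stored as word frequencies, while `empty` holds the counts of the function words.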

Figure 1: frequency curves of the empty word “che” and of (“che”/“emptysum”)*525, where “emptysum” = “per” + “con” + “del” + “della”, all of which are empty words. The curves extend over 80 days. During the first 40 days, half of the web pages were manually excluded from the word count procedure: this explains the sudden increase in the “che” frequency around January 24th. The ratio “che”/“emptysum” is constant and independent of the artificially created variation in word flow volume.
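The ratio plotted in Figure 1 can be illustrated with a short sketch. The function names and the daily counts below are invented for the example (they are not data from the paper), and `scale` stands in for a display factor such as the 525 used in the figure. The point is only that dividing by “emptysum” cancels a change in the volume of pages crawled.

```python
# Reference empty words whose sum is used as the normalizer, as in Figure 1.
EMPTY_REFERENCE = ("per", "con", "del", "della")


def emptysum(counts):
    """Sum of the reference empty-word counts for one day."""
    return sum(counts.get(word, 0) for word in EMPTY_REFERENCE)


def normalize(counts, word, scale=1.0):
    """Frequency of `word` relative to emptysum, times an optional display scale."""
    total = emptysum(counts)
    if total == 0:
        raise ValueError("no reference empty words counted for this day")
    return scale * counts.get(word, 0) / total


# Hypothetical daily counts: the second day covers half as many pages.
full_day = {"che": 1000, "per": 400, "con": 300, "del": 200, "della": 100}
half_day = {word: n // 2 for word, n in full_day.items()}

raw_change = full_day["che"] / half_day["che"]  # raw count doubles with volume
same_ratio = normalize(full_day, "che") == normalize(half_day, "che")  # ratio is unchanged
```

The same `normalize` call could in principle be applied to any content word, so that its normalized frequency reflects a change in usage rather than a change in how many pages were processed.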

en/bayes-swarm/norma.txt · Last modified: 2008/09/15 16:00 by paolo.brunori