Friday, March 25, 2011

Apology to Dr. Hartnett

I have discovered the nature of an error in a previous post (Quantized Redshifts. VIII. An Uncommon Power Spectral Density "Blooper") concerning the paper by Hartnett & Hirano, "Galaxy redshift abundance periodicity from Fourier analysis of number counts \$N(z)\$ using SDSS and 2dF GRS galaxy surveys", hereafter labelled HH2007.

My error was that I claimed the x-axes were incorrectly labeled when Hartnett converted them from 'frequency space' to his '\Delta_z space'.  These graphics were NOT mislabeled.

I've deleted the content of the original post and set a link to this post, since the great majority of the post was on the erroneous graphic plotting claim.  Other issues with HH2007 will be examined in a future post.

How did I make the error?
  • This error was one of those 'flashes' of 'insight' that occurred only about a week before I wrote the post.  That in itself should have made it suspect.  It did not receive the: idea - examine - walk away for a few weeks - reexamine - chat about with others - repeat (2 or more times) that other material in HH2007 had received.
  • I have read HH2007 at least 3 or 4 times.  ALL the other critiques of the HH2007 analysis have actually been under development for several years, going through the analysis cycle above several times for verification.  If the Delta_z problem had been real, the idea that I would not have noticed it earlier is a bit unlikely.  I should been bothered by that.
  • Customarily, plots are posted with the low value to the left end of the axis, whereas Hartnett (correctly) plotted the large value of Delta_z in Figure 5 of HH2007.  The prejudice of the customary axis direction was reinforced in casual examination of the graph on a small laptop screen where font antialiasing made some of the exponents in the axis scale unreadable and I failed to zoom-in on the graphic for a better look.  Several other non-standard notations for cosmology and power spectral in HH2007 contributed to confusing the underlying concepts as well.
  • Having read so many redshift-space analysis papers it is entirely possible that some of the graphics actually became confused with graphics in other papers and a quick examination of HH2007 graphics failed to fully register.
  • I hurried the analysis, trying to meet a self-imposed schedule of one post per week.  I allowed that to rush me into preparing the post without carefully checking the graphics in question beforehand.  
What changes do I intend to make to prevent the problem from happening again?
  • No more one-post-per-week schedule, especially for detailed analysis posts.  This blog is a part-time project and some of the ideas explored cannot be examined quickly but require careful examination.  Note that it is the pseudo-scientists in the comments of this blog who often try to push me to adhere to their time frame and in this case I foolishly took their bait.
  • I've wanted to do this for a time and have numerous individuals whom I have consulted on occasion for reviewing material and ideas.  I'm going to explore making that a more regular practice.
I doubt these changes will totally eliminate the possibility of an error, but it will reduce it.

I suspect there are some who will attempt to exploit this admission, so why am I doing this?  I work in a technical field were there are absolute standards of success and failure.  Those who learn from their mistakes and continue on, who don't dogmatically hang on to delusions that they are always right when the evidence is against them, have far more professional success. 

Depending on reactions, I may apply some different standards to comment moderation for a time.

Saturday, March 19, 2011

Quantized Redshifts. IX. Testing the Null Hypothesis

We're done with our detour on random number generators and now back on track to testing redshift quantization where I'll apply a PRNG.

A good many amateur cosmologists blindly apply the FFT and PSD to survey data because it is simple and 'obvious'.  Yet none seem to ask the question, "If it is so simple, why haven't others done it?"  The answer is that others have tried this, and found the 'simple solution' was riddled with false trails.  Yet this error is made repeatedly, and in spite of progress in understanding the problems, the method still manages to find its way into peer-reviewed journals.

The implicit assumption in these redshift-space analyses is that a uniform distribution of matter in these surveys would exhibit no peaks at all.  If that were true, then all peaks in any PSD would be significant, indicating a real deviation from uniformity.  But is it true that a uniform distribution would have no peaks?

Some researchers attempt to test the reality of their peaks in a PSD using what is sometimes called a 'Shuffle Test', moving the data to different bins to see if frequency peaks move.  The peaks usually do move, but it should be noted that each frequency bin in the FFT has contributions from EVERY point in the dataset (it is a weighted sum of every data point), so this is no surprise.  Sadly, few researchers seem to recognize the implications of this and often proceed to concoct more stranger tests or misapplying tests from different data assumptions.  These types of repeated contortions of the same dataset(s) are so unproductive, I have a nickname for this mental trap: "data masturbation".

The only way out of this trap is to test your analysis process with datasets of KNOWN content, subjected to data selection similar to the data in question.  In this case, we should first construct a dataset that does NOT have our hypothesized content (the Null Hypothesis), which for a cosmological model, would be a dataset with a uniform distribution of galaxies.

So how would you construct such a test dataset?

Building a Test Universe
What we will do is build a simple 'model universe' with galaxies distributed uniformly, but randomly, in 3-dimensions, x, y, and z.  We will create a test sample with numberOfGalaxies, which I've run with numberOfGalaxies = 10,000,000 and larger. in a volume with a defined radius, in parsecs (1 parsec = 3.26 light years).  To do this, we will make use of the numpy random number generator (see Mathematical Interlude: Pseudo-Random Number Generators)

import numpy as np
from numpy import random

There are a couple of ways we can do this.  One is to assign x,y,z coordinates for numberOfGalaxies using a uniform sampling random number generator  over the range -radius to +radius

x=radius*(random.uniform(-1.0,1.0,size = (numberOfGalaxies,)))
y=radius*(random.uniform(-1.0,1.0,size = (numberOfGalaxies,)))
z=radius*(random.uniform(-1.0,1.0,size = (numberOfGalaxies,)))

Once we have this dataset, we can compute the distance to each galaxy from an observer at the center of the dataset at coordinates (0,0,0).

distance = np.sqrt(x*x+y*y+z*z)  # convert cartesian coordinates to spherical

This method creates a 'cube' universe to which we can apply observational selections.  In the graphics below, the term SimpleCartesianHubbleModel indicates models run using this algorithm.

If we understand probability distributions, it is also possible to generate a table of galaxy distances directly by weighting the uniform probability distribution in a way that makes it match a uniform spherical distribution.  In this case, it is done by taking the cube-root of the uniform distribution.

distance = radius*(random.uniform(0.0,1.0,size = (numberOfGalaxies,)))**(1.0/3.0)

Models generated with this algorithm are labeled with SimpleRadialHubbleModel.

In this case, we make the simplifying assumption that all galaxies travel with the Hubble flow, so the redshift value, z, will be computed with a simple, linear Hubble relation, where H0 is the Hubble constant (km/s/Mpc), c is the speed of light in km/s and 1e6 converts parsecs to megaparsecs.

redshift = distance*H0/c/1e6

I've used H0 = 72.0 km/s/Mpc.  If we want to include some deviations from the Hubble flow, we can install a simple simulation of this by 'fuzzing' each redshift value with some Gaussian noise of width sigmaSpeed:

redshift = (distance*H0/c/1e6) \
                    + random.normal(0.0,sigmaSpeed,size = (numberOfGalaxies,))/c

Some of my simulations are generated with this model for comparison.  They will have SimpleHubbleDisperseModel in the graph title.

Next we define an absolute magnitude for all our galaxies.  In this case, we will assume a simple Gaussian distribution around a referenceMagitude with a standard deviation of magnitudeWidth for each galaxy.  We make no distinction between nearby or more distant galaxies, all galaxies will be part of the same brightness distribution.  We do this to simulate the fact that galaxies have a range of luminosities. 

absoluteMagnitude = random.normal(referenceMagnitude,magnitudeWidth,size = (numberOfGalaxies,))

Finally, we want to assign an apparentMagnitude to each galaxy, so that we know how bright the galaxies will appear to an observer on the Earth.  This way, we can determine which galaxies might be visible to instruments on the Earth.  Since we have a distance and absolute magnitude, the apparent magnitude for each galaxy is easy to compute (see Wikipedia: Absolute magnitude):

apparentMagnitude = absoluteMagnitude+5.0*(np.log10(distance)-1.0)

In one of our next steps, we will select only galaxies brighter than a given apparent magnitude (the magnitude visible from Earth) to simulate the fact that our instruments can only see objects brighter than a certain threshold.

This process creates a set of arrays where index i of any one of them corresponds to measurements of a galaxy:

x[i], y[i], z[i], distance[i], redshift[i], apparentMagnitude[i]

These arrays contain the dataset for our simulated universe.

We could look at a small subset, plotting them in 3-D.  We don't have the screen resolution to plot the full 10,000,000 galaxies - it would be a completely black ball - so we just plot 2000 galaxies.

So we have our simulation generator.  We can run it with different starting seeds in the random number generator to explore different variations of the uniform model. 

But first we might want to conduct a few tests on our test dataset, to make sure our simulation has generated a reasonable result for our input assumptions.

Next Weekend: Testing Our Designer Universe

Saturday, March 12, 2011

Mathematical Interlude: Pseudo-Random Number Generators

Many of the projects that I develop for this site require the generation of test datasets.  Because real data has some degree of uncertainty in the measurements, good test data should include a sprinkling of randomness to test the robustness of analysis tools developed.  Therefore I feel the need to describe a little about the software tools for simulating randomness in data, pseudo-random number generators, or PRNGs.

I have encountered numerous pseudo-scientists who seem to latch onto any terminology scientists actually use with 'pseudo' in the name, such as pseudo-tensor, or pseudo-random number generator and try to leverage the terminology to insinuate the methodology is unreliable. 

Since pseudo-random number generators are routinely used by scientists in generating test datasets and testing their algorithms (and I will start using them in the next post), I felt the need to do a pre-emptive strike.

Why Do Scientists Need Random Number Generators?
As mentioned above, often times scientists need to develop software for data analysis to test their algorithms.  Sometimes they might add a little random noise to a simulation run to determine how these changes influence their results.  Sometimes they just want to create datasets of random noise to determine how such data might be distinguished from a real signal.

Why Are they Called PSEUDO-Random Number Generators (PRNG)?
  • While the sequences are not truly random, they do pass many of the tests for randomness
  • Initializing a PRNG with the same 'seed' makes the PRNG generate the same sequences of random numbers.  This is valuable for the development of reproducible test processes.
  • After a certain number of numbers are drawn from the generator with a given seed, the sequence will repeat.  The  length of this sequence is called its period.  Ideally, you want a PRNG with a LONG period.

How Good are Pseudo-Random Number Generators?
  • Modern PRNGs are very good, suitable for a wide variety of problem-solving and simulation applications.
  • The primary application for which PRNGs are NOT suitable is cryptography.  This is because the underlying engine of PRNGs is deterministic and it is, in principle, possible for a cracker to determine the underlying PRNG sequence with enough samples of cypher-text.
What PRNG is Used by the Python NumPy Module?

The Random module in Numpy uses the Mersenne Twister (Wikipedia) as its underlying engine for generating random numbers (Python Documentation, Statistics in Numpy & Scipy).  This is also the generator used in the standard Python library.  At one time, these PRNGs were different and this can generate some confusion in online searches.

How many random numbers are in the period used by the Mersenne Twister algorithm?
>>> 2**19937-1
>>> 4315424797388162648055235516337919839053935043226711505165250541403330680137658

(Wow!  Isn't Python's long integer type cool?)

That's 6002 digits, in case you were wondering, far more than a googol (Wikipedia), but smaller than a googolplex(Wikipedia).

If you called this generator one billion times per second (10^9 /second), you could run  the program for a year and you would have only used about 3.1x10^7 seconds * 10^9 seconds = 3.1 x 10^16 of the numbers in the cycle.  There is no concern of repetition of a cycle when using this PRNG.

A Simple Sample
To illustrate the use of the numpy random number generator, here's a simple python program that generates and plots 1500 random x,y pairs with values between zero and one. 

import numpy as np
from numpy import random
import matplotlib.pyplot as plt
points=random.uniform(0.0,1.0,size = (1500,2))
ax = plt.subplot(111,aspect='equal')
ax.scatter(points[:,0], points[:,1],color='r',marker='o',s=2,label='random') 
How many structures can YOU see in this plot?

For those who will still want to complain...
If you want to complain about PRNGs, then you must demonstrate that a true random number generator (TRNG) will generate significantly different results than a PRNG in the particular application.  This is the standard of evidence for anyone wishing to complain about the use of pseudo-random number generators.

Additional References
Next Weekend: Quantized Redshifts - Testing the Null Hypothesis

So...What Happened?

Wow.  It's been over eight years since I last posted here... When I stepped back in August 2015,...