Saturday, March 19, 2011

Quantized Redshifts. IX. Testing the Null Hypothesis

We're done with our detour on random number generators and are now back on track to testing redshift quantization, where I'll put that PRNG to work.

A good many amateur cosmologists blindly apply the FFT and PSD to survey data because it is simple and 'obvious'.  Yet none seem to ask the question, "If it is so simple, why haven't others done it?"  The answer is that others have tried this, and found the 'simple solution' was riddled with false trails.  Yet this error is made repeatedly, and in spite of progress in understanding the problems, the method still manages to find its way into peer-reviewed journals.

The implicit assumption in these redshift-space analyses is that a uniform distribution of matter in these surveys would exhibit no peaks at all.  If that were true, then all peaks in any PSD would be significant, indicating a real deviation from uniformity.  But is it true that a uniform distribution would have no peaks?
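As a quick numerical sketch of this question (the sample size and bin count here are arbitrary choices of mine, not from any survey), we can draw 'redshifts' from a perfectly uniform distribution, histogram them, and take the power spectrum of the bin counts:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw 100,000 'redshifts' from a perfectly uniform distribution
redshifts = rng.uniform(0.0, 0.1, size=100_000)

# Bin them into an N-z style histogram and compute the power spectrum
# of the counts (subtracting the mean removes the DC term)
counts, _ = np.histogram(redshifts, bins=1024)
power = np.abs(np.fft.rfft(counts - counts.mean()))**2

# Even with no structure whatsoever in the input, shot noise produces
# nonzero power, and some frequency bins stand well above the average
print(power.max() / power[1:].mean())
```

Even this featureless input produces a power spectrum with peaks, which is exactly why peaks alone prove nothing.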

Some researchers attempt to test the reality of their peaks in a PSD using what is sometimes called a 'Shuffle Test', moving the data to different bins to see if frequency peaks move.  The peaks usually do move, but this is no surprise: each frequency bin in the FFT has contributions from EVERY point in the dataset (it is a weighted sum of every data point).  Sadly, few researchers seem to recognize the implications of this, and often proceed to concoct ever-stranger tests, or to misapply tests designed for different data assumptions.  These repeated contortions of the same dataset(s) are so unproductive that I have a nickname for this mental trap: "data masturbation".
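The weighted-sum point is easy to demonstrate with a toy example: perturb a single sample of any input signal and every frequency bin of its FFT changes, because bin k is the sum X[k] = sum_n x[n]*exp(-2j*pi*k*n/N) over ALL samples n.

```python
import numpy as np

# A toy signal: 64 samples of anything
signal = np.arange(64, dtype=float)
baseline = np.fft.rfft(signal)

# Perturb a single sample...
perturbed = signal.copy()
perturbed[10] += 1.0

# ...and every frequency bin changes, because each bin is a weighted
# sum over ALL input samples.  The change is the FFT of a unit impulse,
# which has magnitude 1.0 in every bin.
delta = np.fft.rfft(perturbed) - baseline
print(np.abs(delta))   # magnitude 1.0 in every bin
```

So of course shuffling the data moves the peaks; that tells us nothing about whether the original peaks were real.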

The only way out of this trap is to test your analysis process with datasets of KNOWN content, subjected to data selection similar to the data in question.  In this case, we should first construct a dataset that does NOT have our hypothesized content (the Null Hypothesis), which for a cosmological model, would be a dataset with a uniform distribution of galaxies.

So how would you construct such a test dataset?

Building a Test Universe
What we will do is build a simple 'model universe' with galaxies distributed uniformly, but randomly, in three dimensions, x, y, and z.  We will create a test sample of numberOfGalaxies galaxies (I've run with numberOfGalaxies = 10,000,000 and larger) in a volume with a defined radius, in parsecs (1 parsec = 3.26 light years).  To do this, we will make use of the numpy random number generator (see Mathematical Interlude: Pseudo-Random Number Generators).

import numpy as np
from numpy import random

There are a couple of ways we can do this.  One is to assign x,y,z coordinates for numberOfGalaxies galaxies using a uniform-sampling random number generator over the range -radius to +radius:

x = radius*random.uniform(-1.0, 1.0, size=(numberOfGalaxies,))
y = radius*random.uniform(-1.0, 1.0, size=(numberOfGalaxies,))
z = radius*random.uniform(-1.0, 1.0, size=(numberOfGalaxies,))

Once we have this dataset, we can compute the distance to each galaxy from an observer at the center of the dataset at coordinates (0,0,0).

distance = np.sqrt(x*x + y*y + z*z)  # radial distance from the observer at the origin

This method creates a 'cube' universe to which we can apply observational selections.  In the graphics below, the term SimpleCartesianHubbleModel indicates models run using this algorithm.

If we understand probability distributions, it is also possible to generate a table of galaxy distances directly by transforming the uniform distribution so that it matches a uniform spherical distribution.  For a uniform density, the fraction of galaxies enclosed within radius r grows as r^3, so the cumulative distribution is (r/radius)^3; inverting that relation means taking the cube root of a uniform random variable.

distance = radius*(random.uniform(0.0,1.0,size = (numberOfGalaxies,)))**(1.0/3.0)

Models generated with this algorithm are labeled with SimpleRadialHubbleModel.
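As a quick check (with a seed and sample size of my own choosing) that the cube-root weighting really reproduces a uniform-density sphere: half the radius should enclose (1/2)^3 = 1/8 of the galaxies.

```python
import numpy as np
from numpy import random

random.seed(1234)
numberOfGalaxies = 1_000_000
radius = 1.0e8   # parsecs (arbitrary value for this check)

# The cube-root transform from the text
distance = radius*(random.uniform(0.0, 1.0, size=(numberOfGalaxies,)))**(1.0/3.0)

# For uniform density, the enclosed count scales as r^3, so half the
# radius should enclose (1/2)^3 = 1/8 = 0.125 of the galaxies
fraction = np.count_nonzero(distance < 0.5*radius) / numberOfGalaxies
print(fraction)   # close to 0.125
```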

In this case, we make the simplifying assumption that all galaxies travel with the Hubble flow, so the redshift value will be computed with a simple, linear Hubble relation, where H0 is the Hubble constant (km/s/Mpc), c is the speed of light in km/s, and the factor of 1e6 converts parsecs to megaparsecs.

redshift = distance*H0/c/1e6

I've used H0 = 72.0 km/s/Mpc.  If we want to include some deviations from the Hubble flow, we can simulate them by 'fuzzing' each redshift value with Gaussian noise of width sigmaSpeed (in km/s):

redshift = (distance*H0/c/1e6) \
                    + random.normal(0.0,sigmaSpeed,size = (numberOfGalaxies,))/c

Some of my simulations are generated with this model for comparison.  They will have SimpleHubbleDisperseModel in the graph title.

Next we define an absolute magnitude for each of our galaxies.  In this case, we will assume a simple Gaussian distribution around a referenceMagnitude with a standard deviation of magnitudeWidth.  We make no distinction between nearby and more distant galaxies; all galaxies are drawn from the same brightness distribution.  We do this to simulate the fact that galaxies have a range of luminosities.

absoluteMagnitude = random.normal(referenceMagnitude,magnitudeWidth,size = (numberOfGalaxies,))

Finally, we want to assign an apparentMagnitude to each galaxy, so that we know how bright the galaxies will appear to an observer on the Earth.  This way, we can determine which galaxies might be visible to instruments on the Earth.  Since we have a distance and absolute magnitude, the apparent magnitude for each galaxy is easy to compute (see Wikipedia: Absolute magnitude):

apparentMagnitude = absoluteMagnitude+5.0*(np.log10(distance)-1.0)

In one of our next steps, we will select only galaxies brighter than a given apparent magnitude (the magnitude visible from Earth) to simulate the fact that our instruments can only see objects brighter than a certain threshold.
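A minimal sketch of that selection step, using a boolean mask (the specific values of referenceMagnitude, magnitudeWidth, and magnitudeLimit below are hypothetical choices of mine, not from the post):

```python
import numpy as np
from numpy import random

# Small self-contained setup mirroring the steps above (hypothetical values)
numberOfGalaxies = 100_000
radius = 1.0e8                # parsecs
referenceMagnitude = -20.0    # hypothetical galaxy absolute magnitude
magnitudeWidth = 1.0
magnitudeLimit = 13.0         # hypothetical instrument threshold

random.seed(42)
distance = radius*(random.uniform(0.0, 1.0, size=(numberOfGalaxies,)))**(1.0/3.0)
absoluteMagnitude = random.normal(referenceMagnitude, magnitudeWidth,
                                  size=(numberOfGalaxies,))
apparentMagnitude = absoluteMagnitude + 5.0*(np.log10(distance) - 1.0)

# Keep only galaxies brighter (numerically smaller magnitude) than the
# limit; the same mask can index every parallel array in the dataset
visible = apparentMagnitude < magnitudeLimit
print(np.count_nonzero(visible), "of", numberOfGalaxies, "galaxies visible")
```

Note how the cut removes most of the distant galaxies while keeping nearly all of the nearby ones: this is exactly the kind of selection effect a real survey imposes.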

This process creates a set of parallel arrays, where index i in each array refers to measurements of the same galaxy:

x[i], y[i], z[i], distance[i], redshift[i], apparentMagnitude[i]

These arrays contain the dataset for our simulated universe.

We can look at a small subset by plotting it in 3-D.  We don't have the screen resolution to plot the full 10,000,000 galaxies (it would be a completely black ball), so we plot just 2,000.
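A sketch of how such a subset plot might be made with matplotlib (the seed, sample sizes, and output filename are my own choices; the post doesn't show its plotting code):

```python
import numpy as np
from numpy import random
import matplotlib
matplotlib.use('Agg')   # render off-screen; no display needed
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection
import matplotlib.pyplot as plt

random.seed(7)
radius = 1.0e8          # parsecs
numberOfGalaxies = 100_000
plotCount = 2000

x = radius*random.uniform(-1.0, 1.0, size=(numberOfGalaxies,))
y = radius*random.uniform(-1.0, 1.0, size=(numberOfGalaxies,))
z = radius*random.uniform(-1.0, 1.0, size=(numberOfGalaxies,))

# Pick a random subset of 2000 galaxies to keep the plot readable
subset = random.choice(numberOfGalaxies, size=plotCount, replace=False)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x[subset], y[subset], z[subset], s=1)
ax.set_xlabel('x (pc)')
ax.set_ylabel('y (pc)')
ax.set_zlabel('z (pc)')
fig.savefig('test_universe.png')
```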

So we have our simulation generator.  We can run it with different starting seeds in the random number generator to explore different variations of the uniform model. 

But first we might want to conduct a few tests on our test dataset, to make sure our simulation has generated a reasonable result for our input assumptions.
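Two sanity checks one might run on such a dataset (a sketch with my own seed and sample size): the maximum redshift should approach the value implied by the model radius, and for a uniform-density sphere the mean distance should be (3/4) of the radius.

```python
import numpy as np
from numpy import random

random.seed(2011)
numberOfGalaxies = 1_000_000
radius = 1.0e8       # parsecs
H0 = 72.0            # km/s/Mpc
c = 299792.458       # km/s

distance = radius*(random.uniform(0.0, 1.0, size=(numberOfGalaxies,)))**(1.0/3.0)
redshift = distance*H0/c/1e6

# Check 1: the largest redshift should approach z_max = radius*H0/c/1e6
zMax = radius*H0/c/1e6
print(redshift.max(), zMax)

# Check 2: for a uniform sphere, E[r] = int r * 3r^2/R^3 dr = (3/4)*R
print(distance.mean()/radius)   # close to 0.75
```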

Next Weekend: Testing Our Designer Universe


Anonymous said...

Somewhat off-topic, but it turns out that simulating observations of galaxies isn't quite as straightforward as it might seem!

Here are some aspects of the real observations which can - and undoubtedly have - caused problems for those searching for galaxy-redshift patterns (no way intended to be a comprehensive list):

* sky distribution is distinctly anisotropic; e.g. there is a Zone of Avoidance (i.e. the plane of the Milky Way)

* as galaxies are not, usually, point sources, there will inevitably be overlaps; robustly addressing the biases that overlaps introduce is anything but trivial (SDSS' DR8 reports a particularly challenging consequence: subtle photometric offsets quite some distance (on the sky) from bright galaxies; this affects galaxy-redshift research because it introduces a non-uniform magnitude limit, of a kind guaranteed to produce spurious associations!)

* 'galaxy evolution'; if, on average, galaxies change as they age (or, somewhat equivalently, with distance from us, or redshift, or ...), then there will almost certainly be some sort of galaxy-redshift trend (beyond a Hubble relationship one). Depending on the details of how galaxies 'evolve', this may show up, in the observational data, as something resembling redshift quantisation.


W.T."Tom" Bridgman said...

There's loads of other things I could do to make this more realistic.

At some point I'd like to do an experiment of what a real quantized dataset would look like with proper power-spectra methods, but that is beginning to look like a fairly major effort and will probably be delayed to a later time.

As you'll see in the next post, even this simple simulation exhibits characteristics surprisingly similar to the N-z distributions used in many quantized-redshift analyses.