Friday, March 25, 2011

Apology to Dr. Hartnett

I have discovered the nature of an error in a previous post (Quantized Redshifts. VIII. An Uncommon Power Spectral Density "Blooper") concerning the paper by Hartnett & Hirano, "Galaxy redshift abundance periodicity from Fourier analysis of number counts \$N(z)\$ using SDSS and 2dF GRS galaxy surveys", hereafter labelled HH2007.

My error was that I claimed the x-axes were incorrectly labeled when Hartnett converted them from 'frequency space' to his '\Delta_z space'.  These graphics were NOT mislabeled.

I've deleted the content of the original post and set a link to this post, since the great majority of the post was on the erroneous graphic plotting claim.  Other issues with HH2007 will be examined in a future post.

How did I make the error?
  • This error was one of those 'flashes' of 'insight' that occurred only about a week before I wrote the post.  That in itself should have made it suspect.  It did not receive the: idea - examine - walk away for a few weeks - reexamine - chat about with others - repeat (2 or more times) that other material in HH2007 had received.
  • I have read HH2007 at least 3 or 4 times.  ALL the other critiques of the HH2007 analysis have actually been under development for several years, going through the analysis cycle above several times for verification.  If the Delta_z problem had been real, the idea that I would not have noticed it earlier is a bit unlikely.  I should been bothered by that.
  • Customarily, plots are posted with the low value to the left end of the axis, whereas Hartnett (correctly) plotted the large value of Delta_z in Figure 5 of HH2007.  The prejudice of the customary axis direction was reinforced in casual examination of the graph on a small laptop screen where font antialiasing made some of the exponents in the axis scale unreadable and I failed to zoom-in on the graphic for a better look.  Several other non-standard notations for cosmology and power spectral in HH2007 contributed to confusing the underlying concepts as well.
  • Having read so many redshift-space analysis papers it is entirely possible that some of the graphics actually became confused with graphics in other papers and a quick examination of HH2007 graphics failed to fully register.
  • I hurried the analysis, trying to meet a self-imposed schedule of one post per week.  I allowed that to rush me into preparing the post without carefully checking the graphics in question beforehand.  
What changes do I intend to make to prevent the problem from happening again?
  • No more one-post-per-week schedule, especially for detailed analysis posts.  This blog is a part-time project and some of the ideas explored cannot be examined quickly but require careful examination.  Note that it is the pseudo-scientists in the comments of this blog who often try to push me to adhere to their time frame and in this case I foolishly took their bait.
  • I've wanted to do this for a time and have numerous individuals whom I have consulted on occasion for reviewing material and ideas.  I'm going to explore making that a more regular practice.
I doubt these changes will totally eliminate the possibility of an error, but it will reduce it.

I suspect there are some who will attempt to exploit this admission, so why am I doing this?  I work in a technical field were there are absolute standards of success and failure.  Those who learn from their mistakes and continue on, who don't dogmatically hang on to delusions that they are always right when the evidence is against them, have far more professional success. 

Depending on reactions, I may apply some different standards to comment moderation for a time.

Saturday, March 19, 2011

Quantized Redshifts. IX. Testing the Null Hypothesis

We're done with our detour on random number generators and now back on track to testing redshift quantization where I'll apply a PRNG.

A good many amateur cosmologists blindly apply the FFT and PSD to survey data because it is simple and 'obvious'.  Yet none seem to ask the question, "If it is so simple, why haven't others done it?"  The answer is that others have tried this, and found the 'simple solution' was riddled with false trails.  Yet this error is made repeatedly, and in spite of progress in understanding the problems, the method still manages to find its way into peer-reviewed journals.

The implicit assumption in these redshift-space analyses is that a uniform distribution of matter in these surveys would exhibit no peaks at all.  If that were true, then all peaks in any PSD would be significant, indicating a real deviation from uniformity.  But is it true that a uniform distribution would have no peaks?

Some researchers attempt to test the reality of their peaks in a PSD using what is sometimes called a 'Shuffle Test', moving the data to different bins to see if frequency peaks move.  The peaks usually do move, but it should be noted that each frequency bin in the FFT has contributions from EVERY point in the dataset (it is a weighted sum of every data point), so this is no surprise.  Sadly, few researchers seem to recognize the implications of this and often proceed to concoct more stranger tests or misapplying tests from different data assumptions.  These types of repeated contortions of the same dataset(s) are so unproductive, I have a nickname for this mental trap: "data masturbation".

The only way out of this trap is to test your analysis process with datasets of KNOWN content, subjected to data selection similar to the data in question.  In this case, we should first construct a dataset that does NOT have our hypothesized content (the Null Hypothesis), which for a cosmological model, would be a dataset with a uniform distribution of galaxies.

So how would you construct such a test dataset?

Building a Test Universe
What we will do is build a simple 'model universe' with galaxies distributed uniformly, but randomly, in 3-dimensions, x, y, and z.  We will create a test sample with numberOfGalaxies, which I've run with numberOfGalaxies = 10,000,000 and larger. in a volume with a defined radius, in parsecs (1 parsec = 3.26 light years).  To do this, we will make use of the numpy random number generator (see Mathematical Interlude: Pseudo-Random Number Generators)

import numpy as np
from numpy import random

There are a couple of ways we can do this.  One is to assign x,y,z coordinates for numberOfGalaxies using a uniform sampling random number generator  over the range -radius to +radius

x=radius*(random.uniform(-1.0,1.0,size = (numberOfGalaxies,)))
y=radius*(random.uniform(-1.0,1.0,size = (numberOfGalaxies,)))
z=radius*(random.uniform(-1.0,1.0,size = (numberOfGalaxies,)))

Once we have this dataset, we can compute the distance to each galaxy from an observer at the center of the dataset at coordinates (0,0,0).

distance = np.sqrt(x*x+y*y+z*z)  # convert cartesian coordinates to spherical

This method creates a 'cube' universe to which we can apply observational selections.  In the graphics below, the term SimpleCartesianHubbleModel indicates models run using this algorithm.

If we understand probability distributions, it is also possible to generate a table of galaxy distances directly by weighting the uniform probability distribution in a way that makes it match a uniform spherical distribution.  In this case, it is done by taking the cube-root of the uniform distribution.

distance = radius*(random.uniform(0.0,1.0,size = (numberOfGalaxies,)))**(1.0/3.0)

Models generated with this algorithm are labeled with SimpleRadialHubbleModel.

In this case, we make the simplifying assumption that all galaxies travel with the Hubble flow, so the redshift value, z, will be computed with a simple, linear Hubble relation, where H0 is the Hubble constant (km/s/Mpc), c is the speed of light in km/s and 1e6 converts parsecs to megaparsecs.

redshift = distance*H0/c/1e6

I've used H0 = 72.0 km/s/Mpc.  If we want to include some deviations from the Hubble flow, we can install a simple simulation of this by 'fuzzing' each redshift value with some Gaussian noise of width sigmaSpeed:

redshift = (distance*H0/c/1e6) \
                    + random.normal(0.0,sigmaSpeed,size = (numberOfGalaxies,))/c

Some of my simulations are generated with this model for comparison.  They will have SimpleHubbleDisperseModel in the graph title.

Next we define an absolute magnitude for all our galaxies.  In this case, we will assume a simple Gaussian distribution around a referenceMagitude with a standard deviation of magnitudeWidth for each galaxy.  We make no distinction between nearby or more distant galaxies, all galaxies will be part of the same brightness distribution.  We do this to simulate the fact that galaxies have a range of luminosities. 

absoluteMagnitude = random.normal(referenceMagnitude,magnitudeWidth,size = (numberOfGalaxies,))


Finally, we want to assign an apparentMagnitude to each galaxy, so that we know how bright the galaxies will appear to an observer on the Earth.  This way, we can determine which galaxies might be visible to instruments on the Earth.  Since we have a distance and absolute magnitude, the apparent magnitude for each galaxy is easy to compute (see Wikipedia: Absolute magnitude):

apparentMagnitude = absoluteMagnitude+5.0*(np.log10(distance)-1.0)

In one of our next steps, we will select only galaxies brighter than a given apparent magnitude (the magnitude visible from Earth) to simulate the fact that our instruments can only see objects brighter than a certain threshold.

This process creates a set of arrays where index i of any one of them corresponds to measurements of a galaxy:

x[i], y[i], z[i], distance[i], redshift[i], apparentMagnitude[i]

These arrays contain the dataset for our simulated universe.

We could look at a small subset, plotting them in 3-D.  We don't have the screen resolution to plot the full 10,000,000 galaxies - it would be a completely black ball - so we just plot 2000 galaxies.

So we have our simulation generator.  We can run it with different starting seeds in the random number generator to explore different variations of the uniform model. 

But first we might want to conduct a few tests on our test dataset, to make sure our simulation has generated a reasonable result for our input assumptions.

Next Weekend: Testing Our Designer Universe

Saturday, March 12, 2011

Mathematical Interlude: Pseudo-Random Number Generators

Many of the projects that I develop for this site require the generation of test datasets.  Because real data has some degree of uncertainty in the measurements, good test data should include a sprinkling of randomness to test the robustness of analysis tools developed.  Therefore I feel the need to describe a little about the software tools for simulating randomness in data, pseudo-random number generators, or PRNGs.

I have encountered numerous pseudo-scientists who seem to latch onto any terminology scientists actually use with 'pseudo' in the name, such as pseudo-tensor, or pseudo-random number generator and try to leverage the terminology to insinuate the methodology is unreliable. 

Since pseudo-random number generators are routinely used by scientists in generating test datasets and testing their algorithms (and I will start using them in the next post), I felt the need to do a pre-emptive strike.

Why Do Scientists Need Random Number Generators?
As mentioned above, often times scientists need to develop software for data analysis to test their algorithms.  Sometimes they might add a little random noise to a simulation run to determine how these changes influence their results.  Sometimes they just want to create datasets of random noise to determine how such data might be distinguished from a real signal.

Why Are they Called PSEUDO-Random Number Generators (PRNG)?
  • While the sequences are not truly random, they do pass many of the tests for randomness
  • Initializing a PRNG with the same 'seed' makes the PRNG generate the same sequences of random numbers.  This is valuable for the development of reproducible test processes.
  • After a certain number of numbers are drawn from the generator with a given seed, the sequence will repeat.  The  length of this sequence is called its period.  Ideally, you want a PRNG with a LONG period.

How Good are Pseudo-Random Number Generators?
  • Modern PRNGs are very good, suitable for a wide variety of problem-solving and simulation applications.
  • The primary application for which PRNGs are NOT suitable is cryptography.  This is because the underlying engine of PRNGs is deterministic and it is, in principle, possible for a cracker to determine the underlying PRNG sequence with enough samples of cypher-text.
What PRNG is Used by the Python NumPy Module?

The Random module in Numpy uses the Mersenne Twister (Wikipedia) as its underlying engine for generating random numbers (Python Documentation, Statistics in Numpy & Scipy).  This is also the generator used in the standard Python library.  At one time, these PRNGs were different and this can generate some confusion in online searches.

How many random numbers are in the period used by the Mersenne Twister algorithm?
2**19937-1
>>> 2**19937-1
>>> 4315424797388162648055235516337919839053935043226711505165250541403330680137658
0911304513629318584665545269938257648835317902217334584413909528269154609168019007
8753437413962968019201144864809026614143184432769803000667281049840954515881760771
3296984376213462179039639134128520562761960051310664637664861599423667548653748024
1964350295935168662363909047948347692313978301377820785712419054474332844529183172
9732423108882650813216264694510777078122828294447750226804880578200287646593991647
6626520090056149580034405435369038986289406179287201112083361480844748291354732836
7277879565648307846909116945866230169702401260240187028746650033445774570315431292
9960251877807901193759028631710841496424733789862675033089613749057663409052895722
9001603800057163087519137397955504746815433325347499104624813250451634179655147057
5481459200859472614836213875557116864445789750886277996487304308450484223420629266
5185560243393391908443689210184248446770427276646018529149252772809226975384267702
5733392895440120546589561034765885538663390254628996213264328242574803578623358060
8154696546932563833327670769899439774888526687278527451002963059146963875715425735
5344759797344631006783673933274021499309687782967413915145996023742136298987206114
3141040214723899809096281891589064569393448333099416963229587799584899336674701487
1763494805549996163051541225403465297007721146231355704081493098663065733677191172
8539870957481678162560842128233801686253345864312540346708061352735432707144788768
6186198332077728064480669112571319726258176315131359642954776357636783701934983517
8462144294960757190918054625114143666384189433852576452289347652454631535740468786
2289458856546085620580424689873724369214450923153776984071681983765382377486141962
0704154810637936512319281799900662176646716711347163271548179587700538269439340040
3061700457691135349187874888923429349340145170571716181125795888889277495426977149
9145496239163940148229850253316515114312788020090568084565068188772666098316368838
8490562182226293398654864566908067219170474040889134983568566242806323119852043682
6329415290752972798343429446509992206368781367154091702655772727391329424277529349
0826005858847665231509574170778319100161684756856586731928608820701797603072698499
8735483604237173466025769434723550630174411887414129243895814154910060975221688223
0887611431996472330842380137110927449483557815037586849644585749917772869926744218
3696211376751010832785437940817490940910430840967741447084363242794768920562004272
2796163866914980548983112124467639993195537148401288636074870647956866904857478285
5217054740113945929622177502575565811067452201448981991968635965361551681273982740
7601388996388203187763036687627301575846400427988806918626402686126861808838749395
7381812502227968993026744625577395954246983163786300017127922715140603412990218157
0659650532600775823677398182129087394449859182749999007223592423334567850671186568
8391867477049600162775406253314406190191299837899147125153652003360579935086016788
0768756856237785709525554130490292719222018417250235712444991187021064269456506138
4919373474324503966267799038402386781686809962015879090586549423504699190743519551
0437225445157409678290843360259382257807308802738552615519720440756203267806244488
0349099823216123168779471561340579324954550952805251801012308725877897411581704824
5588971438596754408081313438375502988726739523375296641615501406091607983229239827
2406147832528924797165199369895191878086812211916417477109024806334910917048274412
2828118663244590714578713835123484226138007462191400481815238666604313334487506790
3582838283562688083236575482068479639546383819532174522502682372441363275765875609
1197836532983120667082171493167735643403792897243939867441398918554166122957393566
6861265827123469643837712283899804019973907806144367541567107846340467370240377765
3478173367084844734702056866636158138003692253382209909466469591930161626097920508
7421756703065051395428607508061598353575410321470950842784610567013677397949320242
0299870773101769258204621070221251412042932253043178961626704777611512359793540414
7084870985465426502772057300900333847905334250604119503030001704002887892941404603
3458699263675013550949427505525915816399805231906796107849935808966832992976812624
4231400865703342186809455174050644882903920731671130769513189229659350901862309481
0557519560305240787163809219164433754514863301000915916985856242176563624771328981
6785482462973762495302513603634127683664561750770319774575349128064331765399959943
4330811847014715871281614939442127661422826290995005574698105320661000156029578465
6616193252269412026831159508949671513845195883217147982748879261851417819979034417
2855986077272208666776804260903087546340467370240377765347817336708484473470205686
6636158138003692253382209909466469591930161626097920508742175670306505139542860750
8061598353575410321470950842784610567013677397949320242029987077310176925820462107
0221251412042932253043178961626704777611512359793540414708487098546542650277205730
0900333847905334250604119503030001704002887892941404603345869926367501355094942750
5525915816399805231906796107849935808966832992976812624423140086570334218680945517
4050644882903920731671130769513189229659350901862309481055751956030524078716380921
9164433754514863301000915916985856242176563624771328981678548246297376249530251360
3634127683664561750770319774575349128064331765399959943433081184701471587128161493
9442127661422826290995005574698105320661000156029578465661619325226941202683115950
8949671513845195883217147982748879261851417819979034417285598607727220866677680426
0903087548238033454465663056192413083744527546681430154877108777280110860043258922
6225941396828528349704557106275770142176156526272515340740762540514993198949445910
6414660534305378576709862520049864880961144869258603473714363659194013962706366851
3892996928694918051725568185082988249549548157960631695176587414201597987542734280
2672345248126356915730721315373978104162765371507859850415479728766312294671134815
8529418816432825044466692781137474494898385064375787507376496345148625306383391555
1456900878919553159944629444932352488175999071191357559333821217061914771850549366
3221115722292033114850248756330311801880568507356984158051811871077865395357129601
4372940865270407021924383167290323231567912289419486240594039074452321678019381871
2190921554607684445735785595136133042422061513564575139372709390097072378271012458
5383767833816102339758685489423069609154024998790745346131192396385295075475805820
5625956600817743007191746812655955021747670922460866747744520875607859062334750627
0983285934800677894561696024943928137634956575998474857735539909575573132008090408
3003644649221940993409694873054749430121616568675073574955588234030398987467297545
5060957736921559195480815514035915707129930057027117286252843197413312307617886797
5067842601954367603059903407084814646072789554954877421407535706212171982521929788
69786916734625618430175454903864111585429504569920905636741539030968041471L

(Wow!  Isn't Python's long integer type cool?)

That's 6002 digits, in case you were wondering, far more than a googol (Wikipedia), but smaller than a googolplex(Wikipedia).

If you called this generator one billion times per second (10^9 /second), you could run  the program for a year and you would have only used about 3.1x10^7 seconds * 10^9 seconds = 3.1 x 10^16 of the numbers in the cycle.  There is no concern of repetition of a cycle when using this PRNG.

A Simple Sample
To illustrate the use of the numpy random number generator, here's a simple python program that generates and plots 1500 random x,y pairs with values between zero and one. 

import numpy as np
from numpy import random
import matplotlib.pyplot as plt
points=random.uniform(0.0,1.0,size = (1500,2))
ax = plt.subplot(111,aspect='equal')
ax.scatter(points[:,0], points[:,1],color='r',marker='o',s=2,label='random')
plt.show() 
How many structures can YOU see in this plot?

For those who will still want to complain...
If you want to complain about PRNGs, then you must demonstrate that a true random number generator (TRNG) will generate significantly different results than a PRNG in the particular application.  This is the standard of evidence for anyone wishing to complain about the use of pseudo-random number generators.

Additional References
Next Weekend: Quantized Redshifts - Testing the Null Hypothesis

So...What Happened?

Wow.  It's been over eight years since I last posted here... When I stepped back in August 2015,...