Estimating joint probability density functions from marginals

By Mauricio N.A. Monsalve Moreno

Methodology

I created a methodology for estimating joint pdfs from marginal pdfs, which can be summarized in the following steps:

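Find functions S1..Sn such that the variables Yk = Sk(X1..Xk), for k = 1..n, are mutually independent.
Estimate each gk, the pdf of Yk.
The joint pdf of X1..Xn is then:
f(X1..Xn) = g1(Y1)..gn(Yn) |dS1/dX1|..|dSn/dXn|,
where dSk/dXk is the partial derivative of Sk with respect to Xk. Because each Sk depends only on X1..Xk, the Jacobian of the transformation is triangular, so its determinant is the product of these derivatives.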

This methodology is further explained and proven in my paper.
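
As an illustration of the steps above, here is a minimal sketch for two variables. It is not taken from the paper. It assumes a hypothetical model in which X1 is standard normal and X2 = A*X1 + E, with E normal and independent of X1; the constants A and SIGMA and all function names are made up for the example. With S1(X1) = X1 and S2(X1, X2) = X2 - A*X1, the variables Y1 and Y2 are independent and both derivatives equal 1, so the product formula can be checked against the textbook bivariate normal density.

```python
import math

# Hypothetical model (an assumption for this example only):
#   X1 ~ N(0, 1),  X2 = A * X1 + E,  E ~ N(0, SIGMA^2) independent of X1.
A = 0.8
SIGMA = 0.5


def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def s1(x1):
    """Y1 = S1(X1) = X1."""
    return x1


def s2(x1, x2):
    """Y2 = S2(X1, X2) = X2 - A * X1, which recovers the noise term E."""
    return x2 - A * x1


def joint_pdf(x1, x2):
    """Joint pdf built as g1(Y1) * g2(Y2) * |dS1/dX1| * |dS2/dX2|."""
    ds1_dx1 = 1.0  # derivative of S1 with respect to X1
    ds2_dx2 = 1.0  # partial derivative of S2 with respect to X2
    g1 = normal_pdf(s1(x1), 0.0, 1.0)
    g2 = normal_pdf(s2(x1, x2), 0.0, SIGMA)
    return g1 * g2 * abs(ds1_dx1) * abs(ds2_dx2)


def bivariate_normal_pdf(x1, x2):
    """Closed-form zero-mean bivariate normal density for the same model."""
    var1 = 1.0
    var2 = A * A + SIGMA * SIGMA  # Var(X2)
    cov = A                       # Cov(X1, X2)
    det = var1 * var2 - cov * cov
    quad = (var2 * x1 * x1 - 2.0 * cov * x1 * x2 + var1 * x2 * x2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


if __name__ == "__main__":
    for point in [(0.0, 0.0), (0.3, -0.7), (1.5, 1.0)]:
        print(point, joint_pdf(*point), bivariate_normal_pdf(*point))
```

Both functions should print the same value at every point, up to floating-point rounding.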

Software: Joint PDF Estimator

I wrote an application that estimates joint pdfs from sample data. It only works when the correlations between the variables are linear or almost linear. I wrote it in Java using NetBeans (great for creating GUIs).

Joint PDF Estimator (Java JAR, 80.4 kB).

This is what Joint PDF Estimator looks like. It has a text area that shows the generated joint pdf. From that output, anyone should be able to write the joint pdf in analytical form. (At least, this is what I believe!)
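
To give an idea of what handling linear correlations from sample data can involve, here is a minimal sketch in Python. It is not the application's code (which is written in Java), and the application may work differently. It assumes a single least-squares slope is enough to remove the dependence of X2 on X1. The function names and the synthetic sample are hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance(u, v):
    """Population covariance of two equally long samples."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def linear_residuals(x1, x2):
    """Fit x2 ~ slope * x1 by least squares; return slope and residuals.

    The residuals play the role of Y2 = S2(X1, X2) = X2 - slope * X1.
    """
    slope = covariance(x1, x2) / covariance(x1, x1)
    residuals = [b - slope * a for a, b in zip(x1, x2)]
    return slope, residuals


# Synthetic sample with a linear relation (made up for the example).
rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x2 = [0.8 * a + rng.gauss(0.0, 0.5) for a in x1]

slope, y2 = linear_residuals(x1, x2)

if __name__ == "__main__":
    print("estimated slope:", slope)
    print("covariance of X1 and residuals:", covariance(x1, y2))
```

The residuals are uncorrelated with X1 by construction, but uncorrelated does not mean independent in general. When the relation between the variables is strongly nonlinear, a linear step like this one leaves dependence behind.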

Datasets used

I used the following datasets to test my software (and my theory, in turn):

CC10B Course data. Grades of the first, second and third tests, and the final exam, of the course "Introduction to Computing" given at the School of Engineering of Universidad de Chile.
CC42A Course data. Grades of the first, second and third tests of the course "Databases" given at the Department of Computer Science of Universidad de Chile.
IN50A Course data. Six grades of the course "Organizational Behavior and Human Resources Management" given at the Department of Industrial Engineering of Universidad de Chile.
Voting intentions of the Chilean 1988 Plebiscite. Its name says it all.
Ss1: Sampled set 1, whose data is perfectly correlated. However, the pseudorandom numbers were poorly generated (which is okay; nobody cares about software that only works under ideal conditions).
Ss2: Sampled set 2, of almost linear relations. We used the same pseudorandom numbers used in Sampled set 1.
Ss3: Sampled set 3, with greater nonlinearities than Ss2, but still almost linear.
Ss4: Sampled set 4, with greater nonlinearities than Ss3. If A, B and C are independent random variables, Ss4's columns are (A+B)^2, (B-C)^2 and C^2.
Portion of the T01.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T03.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T07.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T13.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T17.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T25.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T30.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T33.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T35.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T41.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T53.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T59.1 dataset from the Andrews and Herzberg archive (StatLib).
Portion of the T60.1 dataset from the Andrews and Herzberg archive (StatLib).

I don't claim any kind of copyright over these datasets, not even over those I built myself.

For more datasets, go to:

StatLib, tons of datasets to work with. They only ask you to acknowledge the source of the data you used (StatLib and the contributor).
Statistical Science Web Datasets. Lots of links to datasets.
UCI Machine Learning Repository. They ask you to cite their repository, and they provide the full natbib/BibTeX entry already written, though some datasets require additional references.
JSE Information Service. Datasets of the Journal of Statistics Education.

Mauricio Monsalve, April 2009.