Estimating joint probability density functions
from marginals
By Mauricio N.A. Monsalve Moreno
Methodology
I created a methodology for estimating joint pdfs
from marginal pdfs, which can be summarized in the
following steps:
- Find functions S1..Sn such that the variables
Yk = Sk(X1..Xk), for k = 1..n, are mutually
independent.
- Estimate each gk, the pdf
that Yk follows.
- The joint pdf of X1..Xn is then
simply:
f(X1..Xn) = g1(Y1)..gn(Yn) |dS1/dX1|..|dSn/dXn|,
where dSk/dXk is the partial derivative of Sk
with respect to Xk.
This methodology is further explained and
proven in my paper.
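The three steps above can be illustrated with a minimal sketch for two variables. It is not the Joint PDF Estimator application, and it rests on assumptions that are not in the text: X2 is taken to depend linearly on X1, S2 is taken to be the least-squares residual X2 - b*X1, and each gk is estimated with a simple Gaussian kernel density estimate. The names gaussian_kde and fit_joint_pdf are hypothetical. Note that least-squares residuals are only guaranteed to be uncorrelated with X1, so treating them as independent is itself an approximation.

```python
import math
import random


def gaussian_kde(samples):
    """Return a 1-D Gaussian kernel density estimate built from samples."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    h = 1.06 * std * n ** (-0.2)  # rule-of-thumb bandwidth (an assumption)
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def pdf(y):
        return norm * sum(math.exp(-0.5 * ((y - s) / h) ** 2) for s in samples)

    return pdf


def fit_joint_pdf(x1, x2):
    """Estimate f(x1, x2), assuming X2 depends (almost) linearly on X1."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    # Least-squares slope of X2 on X1.
    b = (sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
         / sum((u - m1) ** 2 for u in x1))
    # Step 1: Y1 = S1(X1) = X1 and Y2 = S2(X1, X2) = X2 - b*X1.
    y1 = list(x1)
    y2 = [v - b * u for u, v in zip(x1, x2)]
    # Step 2: estimate g1 and g2, the pdfs of Y1 and Y2.
    g1, g2 = gaussian_kde(y1), gaussian_kde(y2)

    # Step 3: dS1/dX1 = 1 and dS2/dX2 = 1, so the derivative factor is 1.
    def joint(a, c):
        return g1(a) * g2(c - b * a) * 1.0 * 1.0

    return joint, b


# Synthetic data for illustration only: X2 = 2*X1 + noise.
rng = random.Random(0)
x1 = [rng.gauss(0.0, 1.0) for _ in range(500)]
x2 = [2.0 * u + rng.gauss(0.0, 0.5) for u in x1]
joint, slope = fit_joint_pdf(x1, x2)
```

Calling joint(a, c) then returns the estimated density at X1 = a, X2 = c. Points close to the fitted line should get a higher density than points far from it.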
Software: Joint PDF Estimator
I programmed an application for estimating
joint pdfs from sample data. It only works
when the relations between the variables are
linear or almost linear.
Note that I programmed it in Java using
NetBeans (great for creating GUIs).

This is what Joint PDF Estimator looks like.
It has a text area that shows the estimated
joint pdf. From that description, anyone
should be able to write the joint pdf
in analytical form. (At least, that is
what I believe!)
Datasets used
I used the following datasets to test my
software (and my theory, in turn):
- CC10B Course data.
Grades of the first, second and third tests,
and the final exam, of the course
"Introduction to Computing" given at the
School of Engineering of Universidad de Chile.
- CC42A Course data.
Grades of the first, second and third tests
of the course "Databases" given at the
Department of Computer Science of
Universidad de Chile.
- IN50A Course data.
Six grades of the course "Organizational
Behavior and Human Resources Management"
given at the Department of Industrial
Engineering of Universidad de Chile.
- Voting intentions
of the Chilean 1988 Plebiscite. Its
name says it all.
- Ss1: Sampled set 1, whose
data are perfectly correlated. However, the
pseudorandom numbers were poorly generated
(which is okay; nobody cares about software that
only works under ideal conditions).
- Ss2: Sampled set 2, with
almost linear relations. It uses the same
pseudorandom numbers as Sampled set 1.
- Ss3: Sampled set 3,
with greater nonlinearities than Ss2, but
still almost linear.
- Ss4: Sampled set 4,
with greater nonlinearities than Ss3. If
A, B and C are independent random variables,
the columns of Ss4 are (A+B)^2, (B-C)^2 and C^2.
- Portion of the T01.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T03.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T07.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T13.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T17.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T25.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T30.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T33.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T35.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T41.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T53.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T59.1
dataset from the
Andrews and Herzberg archive (StatLib).
- Portion of the T60.1
dataset from the
Andrews and Herzberg archive (StatLib).
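The construction of Ss4 described in the list above can be sketched as follows. This is only an illustration and does not reproduce the actual dataset: the text does not say which distributions A, B and C follow or which generator was used, so standard normal variables and Python's random module are assumptions here. The names make_ss4_like and pearson are hypothetical.

```python
import math
import random


def make_ss4_like(n, seed=0):
    """Build n rows with columns (A+B)^2, (B-C)^2 and C^2.

    A, B and C are drawn as independent standard normals (an assumption).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b, c = (rng.gauss(0.0, 1.0) for _ in range(3))
        rows.append(((a + b) ** 2, (b - c) ** 2, c ** 2))
    return rows


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


rows = make_ss4_like(2000)
col1, col2, col3 = (list(col) for col in zip(*rows))
r12 = pearson(col1, col2)  # both columns involve B
r23 = pearson(col2, col3)  # both columns involve C
r13 = pearson(col1, col3)  # no shared variable
```

The first two columns share B and the last two share C, so the columns are dependent, but not through a linear relation between them.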
I don't claim any kind of copyright over these
datasets, not even over the ones I built myself.
For more datasets, go to:
Mauricio Monsalve, April 2009.