Module: Ratistics::Probability

Extended by:
Probability
Included in:
Ratistics, Probability
Defined in:
lib/ratistics/probability.rb

Overview

Various probability computation functions.

Instance Method Summary

Instance Method Details

#cumulative_distribution_function(data, value, opts = {}) {|item| ... } ⇒ 0, ... Also known as: cdf, cumulative_distribution

Calculate the probability that a random variable will be at or below a given value based on the given sample (aka cumulative distribution function, CDF).

0 <= P <= 1

Accepts a block for processing individual items in a raw data sample (:from => :sample).

When a block is given the block will be applied to every element in the data set. Using a block in this way allows probability to be computed against a specific field in a data set of hashes or objects.

Parameters:

  • data (Enumerable)

    the data to perform the calculation against

  • value (Object)

    the value at or below which the probability is computed

  • opts (Hash) (defaults to: {})

    processing options

Options Hash (opts):

  • :from (Symbol)

    describes the nature of the data. :sample (the default) indicates data is a raw data sample, :frequency (or :freq) indicates data is a frequency distribution created from the #frequency function.

Yields:

  • iterates over each element in the data set

Yield Parameters:

  • item

    each element in the data set

Returns:

  • (0, Float, 1)

    the probability of a random variable being at or below the given value. Returns zero if the value is lower than the lowest value in the sample and one if the value is at or above the highest value in the sample. Returns zero for a nil or empty sample.

# File 'lib/ratistics/probability.rb', line 288

def cumulative_distribution_function(data, value, opts={})
  return 0 if data.nil? || data.empty?

  count = 0

  if opts[:from] == :frequency || opts[:from] == :freq
    size = 0
    data.each do |datum, freq|
      datum = yield(datum) if block_given?
      count = count + freq if datum <= value
      size = size + freq
    end
  else
    data.each do |datum|
      datum = yield(datum) if block_given?
      count = count + 1 if datum <= value
    end
    size = data.size
  end

  return 0 if count == 0
  return 1 if count == size
  return count / size.to_f
end
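
The counting logic above can be sketched without the library. This is a minimal illustration for a raw sample only; `cdf_sketch` is a hypothetical helper name, not part of Ratistics, and it ignores the :from option.

```ruby
# Hypothetical standalone helper (not the Ratistics API).
# Mirrors the raw-sample branch of the listing above.
def cdf_sketch(data, value)
  return 0 if data.nil? || data.empty?

  # Count the items (after the optional block transform) at or below value.
  count = data.count do |datum|
    datum = yield(datum) if block_given?
    datum <= value
  end

  return 0 if count.zero?
  return 1 if count == data.size
  count / data.size.to_f
end

sample = [13, 18, 13, 14, 13, 16, 14, 21, 13]
cdf_sketch(sample, 12)  # => 0
cdf_sketch(sample, 14)  # => 0.6666666666666666
cdf_sketch(sample, 21)  # => 1

# With a block, the probability is computed against one field of each record.
people = [{ age: 30 }, { age: 41 }, { age: 25 }, { age: 38 }]
cdf_sketch(people, 35) { |person| person[:age] }  # => 0.5
```

Six of the nine sample values are at or below 14, which gives 6/9.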

#cumulative_distribution_function_value(data, prob, opts = {}) {|item| ... } ⇒ Object Also known as: cdf_value, cumulative_distribution_value

Inverse of #cumulative_distribution_function. For the given data sample, returns the highest value for a given probability.

Accepts a block for processing individual items in a raw data sample (:from => :sample).

When a block is given the block will be applied to every element in the data set. Using a block in this way allows probability to be computed against a specific field in a data set of hashes or objects.

Will sort the data set using natural sort order unless the :sorted option is true or a block is given.

Parameters:

  • data (Enumerable)

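    the data to perform the calculation against

  • prob (Float)

    the probability to look up, between 0 and 1 inclusive (nil is returned for values outside this range)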

  • opts (Hash) (defaults to: {})

    processing options

Options Hash (opts):

  • :sorted (true, false)

    indicates whether the data is already sorted

  • :from (Symbol)

    describes the nature of the data. :sample (the default) indicates data is a raw data sample, :frequency (or :freq) indicates data is a frequency distribution created from the #frequency function.

Yields:

  • iterates over each element in the data set

Yield Parameters:

  • item

    each element in the data set

Returns:

  • (Object)

    the highest value in the sample for the given probability

# File 'lib/ratistics/probability.rb', line 349

def cumulative_distribution_function_value(data, prob, opts={}, &block)
  return nil if data.nil? || data.empty? || prob < 0 || prob > 1

  if (opts[:from].nil? || opts[:from] == :sample) && !(block_given? || opts[:sorted] == true)
    data = data.sort
  end

  if opts[:from].nil? || opts[:from] == :sample
    return (block_given? ? yield(data[0]) : data[0]) if prob == 0
    return (block_given? ? yield(data[-1]) : data[-1]) if prob == 1
  else
    return Math.min(data.keys, &block) if prob == 0
    return Math.max(data.keys, &block) if prob == 1
  end

  if opts[:from] == :freq || opts[:from] == :frequency
    ps = probability(data, :as => :array, :inc => true, :from => :freq, &block)
  else
    ps = probability(data, :as => :array, :inc => true, &block)
  end
  ps = Sort.insertion_sort!(ps){|item| item.first}
  index = Collection.bisect_left(ps, prob){|item| item.last}

  index = index-1 if prob == ps[index-1].first
  return ps[index].first
end
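
As a rough illustration of inverting a CDF, the sketch below returns the smallest sample value whose cumulative probability reaches the requested probability. This is one common quantile convention and an assumption on my part; it does not reproduce the bisect-based listing above, and `cdf_value_sketch` is a hypothetical name.

```ruby
# Hypothetical standalone helper (not the Ratistics API).
# Returns the smallest value whose cumulative probability is >= prob.
def cdf_value_sketch(data, prob)
  return nil if data.nil? || data.empty? || prob < 0 || prob > 1

  sorted = data.sort
  needed = prob * sorted.size # cumulative count required to reach prob

  # Walk runs of equal values in ascending order, accumulating their counts.
  running = 0
  sorted.chunk_while { |a, b| a == b }.each do |group|
    running += group.size
    return group.first if running >= needed
  end
  sorted.last
end

sample = [13, 18, 13, 14, 13, 16, 14, 21, 13]
cdf_value_sketch(sample, 0)    # => 13
cdf_value_sketch(sample, 0.5)  # => 14
cdf_value_sketch(sample, 1)    # => 21
```

Comparing counts rather than accumulated floating-point probabilities avoids rounding surprises at the boundaries.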

#frequency(data, opts = {}) {|item| ... } ⇒ Hash, ...

Calculates the statistical frequency.

When a block is given the block will be applied to every element in the data set. Using a block in this way allows computation against a specific field in a data set of hashes or objects.

The return value is a hash where the keys are the data elements from the sample and the values are the corresponding frequencies. When the :as option is set to :array the return value will be an array of arrays. Each element of the outer array will be a two-element array with the sample value at index 0 and the corresponding frequency at index 1.

Examples:

sample = [13, 18, 13, 14, 13, 16, 14, 21, 13]
Ratistics.frequency(sample) #=> {13=>4, 18=>1, 14=>2, 16=>1, 21=>1}
Ratistics.frequency(sample, :as => :array) #=> [[13, 4], [18, 1], [14, 2], [16, 1], [21, 1]]

Parameters:

  • data (Enumerable)

    the data to perform the calculation against

  • opts (Hash) (defaults to: {})

    processing options

Options Hash (opts):

  • :as (Symbol)

    sets the output to :hash/:map or :array/:catalog/:catalogue (default :hash)

Yields:

  • iterates over each element in the data set

Yield Parameters:

  • item

    each element in the data set

Returns:

  • (Hash, Array, nil)

    the statistical frequency of the given data set or nil if the data set is empty


# File 'lib/ratistics/probability.rb', line 40

def frequency(data, opts={})
  return nil if data.nil? || data.empty?

  freq = data.reduce({}) do |memo, datum|
    datum = yield(datum) if block_given?
    memo[datum] = memo[datum].to_i + 1
    memo
  end

  if (opts[:as] == :array || opts[:as] == :catalog || opts[:as] == :catalogue)
    freq = Collection.catalog_hash(freq)
  end

  return freq
end
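
The block form described above is easiest to see with a data set of hashes. The following is a minimal standalone sketch of the same counting idea; `frequency_sketch` and its `as:` keyword are hypothetical and stand in for the library's opts hash.

```ruby
# Hypothetical standalone helper (not the Ratistics API).
def frequency_sketch(data, as: :hash)
  return nil if data.nil? || data.empty?

  # Tally each datum, applying the optional block first.
  freq = data.each_with_object(Hash.new(0)) do |datum, memo|
    datum = yield(datum) if block_given?
    memo[datum] += 1
  end

  as == :array ? freq.to_a : freq
end

sample = [13, 18, 13, 14, 13, 16, 14, 21, 13]
frequency_sketch(sample)              # => {13=>4, 18=>1, 14=>2, 16=>1, 21=>1}
frequency_sketch(sample, as: :array)  # => [[13, 4], [18, 1], [14, 2], [16, 1], [21, 1]]

# Count against a single field of each record.
records = [{ city: "Oslo" }, { city: "Rome" }, { city: "Oslo" }]
frequency_sketch(records) { |record| record[:city] }  # => {"Oslo"=>2, "Rome"=>1}
```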

#normalize_probability(pmf, opts = {}) ⇒ Hash Also known as: normalize_pmf

Normalize a probability distribution sample.

The data set must be formatted as output by the #probability method. Specifically, a hash where each hash key is a datum from the original data set and each hash value is the probability associated with that datum. A probability hash may become denormalized when performing conditional probability.

Parameters:

  • pmf (Hash)

    the data to perform the calculation against

Returns:

  • (Hash)

    a new, normalized probability distribution.

# File 'lib/ratistics/probability.rb', line 138

def normalize_probability(pmf, opts={})
  total = pmf.values.reduce(0.0){|n, value| n + value} 

  return { pmf.keys.first => 1 } if pmf.count == 1
  return pmf if Math.delta(total, 1.0) < 0.01

  factor = 1.0 / total.to_f
  normalized = pmf.reduce({}) do |memo, pair|
    memo[pair[0]] = pair[1] * factor
    memo
  end
  return normalized
end
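
To illustrate why normalization is needed after conditioning, the sketch below rescales a probability hash whose values no longer sum to one. It is a simplified stand-in with a hypothetical name (`normalize_sketch`) and omits the single-key and near-one shortcuts in the listing above.

```ruby
# Hypothetical standalone helper (not the Ratistics API).
# Divides every probability by the total so the values sum to 1.
def normalize_sketch(pmf)
  total = pmf.values.sum.to_f
  return pmf.dup if total.zero? # nothing sensible to rescale

  pmf.each_with_object({}) { |(key, prob), memo| memo[key] = prob / total }
end

# Suppose conditioning removed every outcome except 13 and 14.
conditioned = { 13 => 0.2, 14 => 0.2 }
normalize_sketch(conditioned)  # => {13=>0.5, 14=>0.5}
```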

#probability(data, opts = {}, &block) {|item| ... } ⇒ Array, ... Also known as: pmf

Calculates the statistical probability.

When a block is given the block will be applied to every element in the data set. Using a block in this way allows computation against a specific field in a data set of hashes or objects.

Examples:

sample = [13, 18, 13, 14, 13, 16, 14, 21, 13]
Ratistics.probability(sample) #=> {13=>0.4444444444444444, 18=>0.1111111111111111, 14=>0.2222222222222222, 16=>0.1111111111111111, 21=>0.1111111111111111}
Ratistics.probability(sample, :as => :array) #=> [[13, 0.4444444444444444], [18, 0.1111111111111111], [14, 0.2222222222222222], [16, 0.1111111111111111], [21, 0.1111111111111111]]
Ratistics.probability(sample, :as => :catalog) #=> [[13, 0.4444444444444444], [18, 0.1111111111111111], [14, 0.2222222222222222], [16, 0.1111111111111111], [21, 0.1111111111111111]]

Parameters:

  • data (Enumerable)

    the data to perform the calculation against

  • opts (Hash) (defaults to: {})

    processing options

  • block (Block)

    optional block for per-item processing

Options Hash (opts):

  • :from (Symbol)

    describes the nature of the data. :sample indicates data is a raw data sample, :frequency (or :freq) indicates data is a frequency distribution created from the #frequency function. (default :sample)

  • :as (Symbol)

    sets the output to :hash or :array/:catalog/:catalogue (default :hash)

Yields:

  • iterates over each element in the data set

Yield Parameters:

  • item

    each element in the data set

Returns:

  • (Array, Hash, nil)

    the statistical probability of the given data set or nil if the data set is empty

# File 'lib/ratistics/probability.rb', line 87

def probability(data, opts={}, &block)
  return nil if data.nil? || data.empty?
  from_frequency = (opts[:from] == :frequency || opts[:from] == :freq)

  if from_frequency
    count = data.reduce(0) do |n, key, value|
      key, value = key if key.is_a? Array
      key = yield(key) if block_given?
      n + value
    end
  else
    count = data.count
    data = frequency(data, &block)
  end

  prob = data.reduce({}) do |memo, key, value|
    key, value = key if key.is_a? Array
    key = yield(key) if from_frequency && block_given?
    memo[key] = value.to_f / count.to_f
    memo
  end

  if opts[:inc] || opts[:increment] || opts[:incremental]
    base = 0
    prob.keys.sort.each do |key|
      prob[key] = base = prob[key] + base
    end
  end

  if (opts[:as] == :array || opts[:as] == :catalog || opts[:as] == :catalogue)
    prob = Collection.catalog_hash(prob)
  end

  return prob
end
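
The listing also accepts an :inc option (used by the CDF helpers) that turns the probabilities into running totals over the sorted keys. The sketch below shows both forms in plain Ruby; `pmf_sketch` and its `incremental:` keyword are hypothetical names chosen for illustration.

```ruby
# Hypothetical standalone helper (not the Ratistics API).
def pmf_sketch(data, incremental: false)
  return nil if data.nil? || data.empty?

  # Tally values, applying the optional block to each datum first.
  counts = Hash.new(0)
  data.each { |datum| counts[block_given? ? yield(datum) : datum] += 1 }

  total = data.size.to_f
  pmf = counts.map { |value, count| [value, count / total] }.to_h
  return pmf unless incremental

  # Accumulate probabilities in ascending key order (a discrete CDF).
  running = 0.0
  pmf.sort.map { |value, prob| [value, running += prob] }.to_h
end

sample = [13, 18, 13, 14, 13, 16, 14, 21, 13]
pmf_sketch(sample)                     # 13 maps to 4/9, 14 to 2/9, the rest to 1/9
pmf_sketch(sample, incremental: true)  # keys ascend; the last value is approximately 1.0
```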

#probability_mean(data, opts = {}, &block) {|item| ... } ⇒ Float, 0 Also known as: pmf_mean, frequency_mean

Calculates the statistical mean of a probability distribution. Accepts a block for processing individual items.

When a block is given the block will be applied to every element in the data set. Using a block in this way allows computation against a specific field in a data set of hashes or objects.

Parameters:

  • data (Enumerable)

    the data to perform the calculation against

  • opts (Hash) (defaults to: {})

    processing options

  • block (Block)

    optional block for per-item processing

Options Hash (opts):

  • :from (Symbol)

    describes the nature of the data. :sample indicates data is a raw data sample, :frequency (or :freq) indicates data is a frequency distribution created from the #frequency function, and :probability (or :prob) indicates the data is a probability distribution created by the #probability function. (default :sample)

Yields:

  • iterates over each element in the data set

Yield Parameters:

  • item

    each element in the data set

Returns:

  • (Float, 0)

    the statistical mean of the given data set or zero if the data set is empty

# File 'lib/ratistics/probability.rb', line 180

def probability_mean(data, opts={}, &block)
  return 0 if data.nil? || data.empty?
  
  from_probability = (opts[:from] == :probability || opts[:from] == :prob)
  unless from_probability
    data = probability(data, :from => opts[:from], &block)
  end

  mean = data.reduce(0.0) do |n, key, value|
    key, value = key if key.is_a? Array
    key = yield(key) if from_probability and block_given?
    n + (key * value)
  end

  return mean
end
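
The mean computed in the listing is the probability-weighted sum of the values, that is, the sum of x * P(x) over the distribution. A minimal sketch for data that is already a probability hash, using the hypothetical helper name `pmf_mean_sketch`:

```ruby
# Hypothetical standalone helper (not the Ratistics API).
# Expects a hash of value => probability.
def pmf_mean_sketch(pmf)
  return 0 if pmf.nil? || pmf.empty?

  pmf.reduce(0.0) { |sum, (value, prob)| sum + value * prob }
end

# Probability distribution of the sample [13, 18, 13, 14, 13, 16, 14, 21, 13].
pmf = { 13 => 4 / 9.0, 18 => 1 / 9.0, 14 => 2 / 9.0, 16 => 1 / 9.0, 21 => 1 / 9.0 }
pmf_mean_sketch(pmf)  # approximately 15.0, the arithmetic mean of the sample
```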

#probability_variance(data, opts = {}, &block) {|item| ... } ⇒ Float, 0 Also known as: pmf_variance

Calculates the statistical variance of a probability distribution. Accepts a block for processing individual items in a raw data sample (:from => :sample).

When a block is given the block will be applied to every element in the data set. Using a block in this way allows computation against a specific field in a data set of hashes or objects.

Parameters:

  • data (Enumerable)

    the data to perform the calculation against

  • opts (Hash) (defaults to: {})

    processing options

  • block (Block)

    optional block for per-item processing

Options Hash (opts):

  • :from (Symbol)

    describes the nature of the data. :sample indicates data is a raw data sample, :frequency (or :freq) indicates data is a frequency distribution created from the #frequency function, and :probability (or :prob) indicates the data is a probability distribution created by the #probability function. (default :sample)

Yields:

  • iterates over each element in the data set

Yield Parameters:

  • item

    each element in the data set

Returns:

  • (Float, 0)

    the statistical variance of the given data set or zero if the data set is empty

# File 'lib/ratistics/probability.rb', line 226

def probability_variance(data, opts={}, &block)
  return 0 if data.nil? || data.empty?

  if opts[:from] == :probability || opts[:from] == :prob
    from_probability = true
  else
    data = probability(data, :from => opts[:from], &block)
    from_probability = false
  end
    
  mean = data.reduce(0.0) do |n, key, value|
    key, value = key if key.is_a? Array
    key = yield(key) if from_probability && block_given?
    n + (key * value)
  end

  variance = data.reduce(0.0) do |n, key, value|
    key, value = key if key.is_a? Array
    key = yield(key) if from_probability && block_given?
    n + (value * ((key - mean) ** 2))
  end

  return variance
end
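
The listing makes two passes: one for the mean, then one for the probability-weighted squared deviation from that mean. The same two passes in a standalone sketch, assuming the input is already a value => probability hash (`pmf_variance_sketch` is a hypothetical name):

```ruby
# Hypothetical standalone helper (not the Ratistics API).
def pmf_variance_sketch(pmf)
  return 0 if pmf.nil? || pmf.empty?

  # First pass: probability-weighted mean.
  mean = pmf.sum(0.0) { |value, prob| value * prob }
  # Second pass: probability-weighted squared deviation from the mean.
  pmf.sum(0.0) { |value, prob| prob * (value - mean)**2 }
end

# Probability distribution of the sample [13, 18, 13, 14, 13, 16, 14, 21, 13].
pmf = { 13 => 4 / 9.0, 18 => 1 / 9.0, 14 => 2 / 9.0, 16 => 1 / 9.0, 21 => 1 / 9.0 }
pmf_variance_sketch(pmf)  # approximately 64/9, about 7.11
```

This is the population variance of the distribution; it divides by the full weight rather than applying an n - 1 correction.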

#sample_with_replacement(data, opts = {}) {|item| ... } ⇒ Array Also known as: resample_with_replacement, bootstrap

Resamples the given sample with replacement (aka bootstrap). The resample will have the same number of elements as the original sample unless the :size (or :length, :count) option is given.

Will sort the data set using natural sort order unless the :sorted option is true or a block is given.

Parameters:

  • data (Enumerable)

    the data to perform the calculation against

  • opts (Hash) (defaults to: {})

    processing options

Options Hash (opts):

  • :size (Integer)

    the size of the resample

  • :sorted (true, false)

    indicates whether the data is already sorted

  • :from (Symbol)

    describes the nature of the data. :sample (the default) indicates data is a raw data sample, :frequency (or :freq) indicates data is a frequency distribution created from the #frequency function.

Yields:

  • iterates over each element in the data set

Yield Parameters:

  • item

    each element in the data set

Returns:

  • (Array)

    the resample, or an empty array if the data set is nil or empty

# File 'lib/ratistics/probability.rb', line 408

def sample_with_replacement(data, opts={}, &block)
  return [] if data.nil? || data.empty?

  if (opts[:from].nil? || opts[:from] == :sample) && !(block_given? || opts[:sorted] == true)
    data = data.sort
  end

  if opts[:from] == :freq || opts[:from] == :frequency
    ps = probability(data, :as => :array, :inc => true, :from => :freq, &block)
    length = data.reduce(0){|length, item| length += item.last }
  else
    ps = probability(data, :as => :array, :inc => true, &block)
    length = opts[:length] || opts[:size] || opts[:count] || data.length
  end
  ps = Sort.insertion_sort!(ps){|item| item.first}

  resample = []
  length.times do
    prob = rand()
    index = Collection.bisect_left(ps, prob){|item| item.last}
    index = index-1 if prob == ps[index-1].first
    resample << ps[index].first
  end

  return resample
end
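
For a raw sample, drawing with replacement amounts to picking one element uniformly at random, repeatedly. The sketch below uses Ruby's built-in Array#sample instead of the cumulative-probability lookup in the listing; `bootstrap_sketch` and its keywords are hypothetical, and the `random:` argument is only there to make runs repeatable.

```ruby
# Hypothetical standalone helper (not the Ratistics API).
def bootstrap_sketch(data, size: nil, random: Random.new)
  return [] if data.nil? || data.empty?

  size ||= data.length
  # Each draw is independent, so the same element may appear many times.
  Array.new(size) { data.sample(random: random) }
end

sample = [13, 18, 13, 14, 13, 16, 14, 21, 13]
bootstrap_sketch(sample)                         # nine values drawn from sample
bootstrap_sketch(sample, size: 3)                # three values drawn from sample
bootstrap_sketch(sample, random: Random.new(1))  # repeatable for a fixed seed
```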

#sample_without_replacement(data, opts = {}) {|item| ... } ⇒ Array Also known as: resample_without_replacement, jackknife

Resamples the given sample without replacement (aka jackknife). The resample will have half as many elements as the original sample unless the :size (or :length, :count) option is given.

Parameters:

  • data (Enumerable)

    the data to perform the calculation against

  • opts (Hash) (defaults to: {})

    processing options

Options Hash (opts):

  • :size (Integer)

    the size of the resample

  • :from (Symbol)

    describes the nature of the data. :sample (the default) indicates data is a raw data sample, :frequency (or :freq) indicates data is a frequency distribution created from the #frequency function.

Yields:

  • iterates over each element in the data set

Yield Parameters:

  • item

    each element in the data set

Returns:

  • (Array)

    the resample, or an empty array if the data set is nil or empty

# File 'lib/ratistics/probability.rb', line 461

def sample_without_replacement(data, opts={}, &block)
  return [] if data.nil? || data.empty?

  if opts[:from] == :freq || opts[:from] == :frequency
    data = data.inject([]) do |memo, item|
      item.last.times{ memo << item.first }
      memo
    end
  else
    data = Collection.collect(data, &block).shuffle!
  end

  length = opts[:length] || opts[:size] || opts[:count] || (data.length / 2)

  return data.slice!(0, length)
end
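
The raw-sample branch above shuffles the data and keeps the first elements, so no element is drawn more often than it occurs in the original. A minimal standalone sketch of that idea, with the hypothetical name `subsample_sketch` and a `random:` keyword added only to make runs repeatable:

```ruby
# Hypothetical standalone helper (not the Ratistics API).
def subsample_sketch(data, size: nil, random: Random.new)
  return [] if data.nil? || data.empty?

  size ||= data.length / 2
  # Shuffle a copy, then take the first `size` elements; the input is untouched.
  data.shuffle(random: random).first(size)
end

sample = [13, 18, 13, 14, 13, 16, 14, 21, 13]
subsample_sketch(sample)           # four values, since 9 / 2 is 4
subsample_sketch(sample, size: 6)  # six values, none used more often than in sample
```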