Welcome to PHStatsMethods’s documentation!

Introduction

This is a Python package to support analysts in the execution of statistical methods approved for use in the production of Public Health indicators such as those presented via [Fingertips](https://fingertips.phe.org.uk/). It provides functions for the generation of Proportions, Rates, DSRs, ISRs, Funnel plots and Means including confidence intervals for these statistics, and a function for assigning data to quantiles.

Any feedback would be appreciated and can be provided using the Issues section of the [PHStatsMethods GitHub repository](https://github.com/dhsc-govuk/PHStatsMethods/issues).

Installation

This package should be installed using pip:

` pip install PHStatsMethods `

Alternatively, it can be installed from source (this still requires pip):

` pip install git+https://github.com/dhsc-govuk/PHStatsMethods.git `

Usage

PHStatsMethods should be imported and used in line with standard Python conventions. If the whole package is to be imported, it is suggested that one of the following two conventions is used:

>>> import PHStatsMethods

>>> from PHStatsMethods import *

For more information on any function, you can use:

>>> help(PHStatsMethods.function)

Examples

Below is an example using the ph_proportion() function to demonstrate the purpose of the package.

>>> import pandas as pd
>>> df = pd.DataFrame({'area': ["Area1", "Area2", "Area3", "Area4"] * 3,
                       'numerator': [None, 48, 10000, 7, 82, 6500, 10000, 750, 9, 8200, 8, 900],
                       'denominator': [100, 10000, 10000, 10000] * 3})

Ungrouped

>>> PHStatsMethods.ph_proportion(df, 'numerator', 'denominator')

Grouped

>>> PHStatsMethods.ph_proportion(df, 'numerator', 'denominator', 'area', multiplier = 100)

Licence

This project is released under the [GPL-3](https://opensource.org/licenses/GPL-3.0) licence.

Functions

PHStatsMethods.proportions.ph_proportion(df, num_col, denom_col, group_cols=None, metadata=True, confidence=0.95, multiplier=1)

Calculates proportions with confidence limits using Wilson Score method.

Parameters:

df – DataFrame containing the data to calculate proportions for.
num_col (str) – Name of column containing observed number of cases in the sample (the numerator of the population).
denom_col (str) – Name of column containing number of cases in sample (the denominator of the population).
group_cols (str | list) – A string or list of column name(s) to group the data by. Defaults to None.
metadata (bool) – Whether to include information on the statistic and confidence interval methods.
confidence (float | list) – Confidence level(s) to use, either as a float, list of float values or None. Each level must be between 0.9 and 1. Defaults to 0.95 (approximately 2 standard deviations from the mean).
multiplier (int) – Multiplier used to express the final values (e.g. 100 = percentage).

Returns:

DataFrame of calculated proportion statistics with confidence intervals.

Return type:

Pandas DataFrame

Notes

Wilson Score method (2) is applied using the internal wilson_lower and wilson_upper functions.
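The Wilson Score limits can be illustrated with a short standalone sketch. This is not the package's internal wilson_lower / wilson_upper code; the function name wilson_interval is hypothetical, and the formula is the standard Wilson Score interval written with only the Python standard library.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(count, denominator, confidence=0.95):
    """Return (lower, upper) Wilson Score limits for count / denominator."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = count / denominator
    centre = 2 * count + z**2
    spread = z * sqrt(z**2 + 4 * count * (1 - p))
    scale = 2 * (denominator + z**2)
    return (centre - spread) / scale, (centre + spread) / scale


# Hypothetical usage with one row of the example data (48 cases in 10,000).
lower, upper = wilson_interval(48, 10000)
print(f"proportion={48 / 10000:.4f}, 95% limits=({lower:.4f}, {upper:.4f})")
```

Unlike a simple normal approximation, these limits stay within 0 and 1 even when the numerator is 0 or equal to the denominator.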

References

Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc; 1927; 22. Pg 209 to 212.
Newcombe RG, Altman DG. Proportions and their differences. In Altman DG et al. (eds). Statistics with confidence (2nd edn). London: BMJ Books; 2000. Pg 46 to 48.

Examples

Below is an example using the ph_proportion() function to demonstrate the purpose of the package.

>>> import pandas as pd
>>> from PHStatsMethods import *
>>> df = pd.DataFrame({'area': ["Area1", "Area2", "Area3", "Area4"] * 3,
                       'numerator': [None, 48, 10000, 7, 82, 6500, 10000, 750, 9, 8200, 8, 900],
                       'denominator': [100, 10000, 10000, 10000] * 3})

Ungrouped

>>> ph_proportion(df, 'numerator', 'denominator')
>>> ph_proportion(df, 'numerator', 'denominator', confidence = 0.998)
>>> ph_proportion(df, 'numerator', 'denominator', confidence = [0.95, 0.998])

Grouped

>>> ph_proportion(df, 'numerator', 'denominator', 'area', multiplier = 100)

PHStatsMethods.rates.ph_rate(df, num_col, denom_col, group_cols=None, metadata=True, confidence=0.95, multiplier=100000)

Calculates rates with confidence limits using Byar’s or exact method.

Parameters:

df – DataFrame containing the data to calculate rates for.
num_col (str) – Name of the column containing the observed number of cases in the sample (numerator).
denom_col (str) – Name of the column containing the number of cases in the sample (denominator).
group_cols (str | list) – A string or list of column name(s) to group the data by. Defaults to None.
metadata (bool) – Whether to include information on the statistic and confidence interval methods.
confidence (float | list) – Confidence level(s) to use, either as a float, list of float values or None. Each level must be between 0.9 and 1. Defaults to 0.95 (approximately 2 standard deviations from the mean).
multiplier (int) – Multiplier for the calculation. Defaults to 100000 for rates per 100,000.

Returns:

DataFrame with calculated rates and confidence intervals (df).

Return type:

Pandas DataFrame

Notes

For numerators >= 10 Byar’s method (1) is applied using the internal byars_lower and byars_upper functions. For small numerators Byar’s method is less accurate and so an exact method (2) based on the Poisson distribution is used.
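As a rough illustration of the Byar's branch described above, the sketch below computes a crude rate with Byar's approximate limits for a single count. It is not the package's byars_lower / byars_upper code; the names byars_limits and crude_rate are hypothetical, and the small-count exact branch is deliberately left out.

```python
from math import sqrt
from statistics import NormalDist


def byars_limits(count, confidence=0.95):
    """Return Byar's approximate (lower, upper) limits for a Poisson count."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = count * (1 - 1 / (9 * count) - z / (3 * sqrt(count))) ** 3
    # The upper limit uses count + 1 in place of count.
    c1 = count + 1
    upper = c1 * (1 - 1 / (9 * c1) + z / (3 * sqrt(c1))) ** 3
    return lower, upper


def crude_rate(count, population, multiplier=100000, confidence=0.95):
    """Return (rate, lower, upper) expressed per `multiplier` population."""
    lower, upper = byars_limits(count, confidence)
    scale = multiplier / population
    return count * scale, lower * scale, upper * scale


# Hypothetical usage: 100 events in a population of 50,000.
print(crude_rate(100, 50000))
```

The limits are calculated on the count and then scaled by the same factor as the rate, so the interval is not symmetric around the point estimate.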

References

Breslow NE, Day NE. Statistical methods in cancer research, volume II: The design and analysis of cohort studies. Lyon: International Agency for Research on Cancer, World Health Organisation; 1987.
Armitage P, Berry G. Statistical methods in medical research (4th edn). Oxford: Blackwell; 2002.

PHStatsMethods.means.ph_mean(df, num_col, group_cols, metadata=True, confidence=0.95)

Calculates means with confidence limits using Student-t distribution.

Parameters:

df – DataFrame containing the data to calculate means for.
num_col (str) – Name of column containing the values to calculate the mean of.
group_cols (str | list) – A string or list of column name(s) to group the data by.
metadata (bool) – Whether to include information on the statistic and confidence interval methods.
confidence (float | list) – Confidence level(s) to use, either as a float, list of float values or None. Each level must be between 0.9 and 1. Defaults to 0.95 (approximately 2 standard deviations from the mean).

Returns:

DataFrame of calculated mean statistics with confidence intervals (df).

Return type:

Pandas DataFrame
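A minimal sketch of a mean with Student-t confidence limits follows. It is not the package's implementation; mean_with_ci is a hypothetical helper, and the t critical value is passed in by hand (here taken from standard tables) because the Python standard library has no t-distribution quantile function.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci(values, t_critical):
    """Return (mean, lower, upper) using mean +/- t * s / sqrt(n)."""
    n = len(values)
    centre = mean(values)
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    half_width = t_critical * stdev(values) / sqrt(n)
    return centre, centre - half_width, centre + half_width


# Two-sided 95% critical value for n - 1 = 9 degrees of freedom (from tables).
T_95_DF9 = 2.262
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(mean_with_ci(sample, T_95_DF9))
```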

PHStatsMethods.DSR.ph_dsr(df, num_col, denom_col, ref_denom_col, group_cols=None, metadata=True, confidence=0.95, multiplier=100000, euro_standard_pops=True, **kwargs)

Calculates directly standardised rates with confidence limits using Byar’s method (1) with Dobson method adjustment (2).

Parameters:

df – DataFrame containing the data to be standardised.
num_col (str) – Column name from data containing the observed number of events for each standardisation category (e.g. age band) within each grouping set (e.g. area).
denom_col (str) – Column name from data containing the population for each standardisation category (e.g. age band).
ref_denom_col (str) – The standard populations for each standardisation category (e.g. age band). This is either the column name in the main dataframe, the reference data if given, or the column name of the age bands to join to if euro_standard_pops is set to True.
group_cols (str | list) – A string or list of column name(s) to group the data by. Defaults to None.
metadata (bool) – Whether to include information on the statistic and confidence interval methods.
euro_standard_pops (bool) – Whether to use the European standard populations. You can see what these populations are with euro_standard_pop().
multiplier (int) – The multiplier used to express the final values. Defaults to 100000.
confidence (float | list) – Confidence level(s) to use, either as a float, list of float values or None. Each level must be between 0.9 and 1. Defaults to 0.95 (approximately 2 standard deviations from the mean).

Other Parameters:

ref_df – DataFrame of reference data to join.
ref_join_left (str | list) – A string or list of column name(s) in df to join on to.
ref_join_right (str | list) – A string or list of column name(s) in ref_df to join on to.

Returns:

DataFrame of calculated directly standardised rates and confidence intervals.

Return type:

Pandas DataFrame

Notes

For total counts >= 10 Byar’s method (1) is applied using the internal byars_lower and byars_upper functions. When the total count is < 10 DSRs are not reliable and will therefore not be calculated.
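The calculation can be sketched for a single grouping set as below. This is an illustrative reading of Byar's method with the Dobson adjustment, not the package's code: the names byars_limits and dsr_with_dobson are hypothetical, inputs are plain lists rather than DataFrame columns, and returning None for fewer than 10 events is an assumption about how "not calculated" might be represented.

```python
from math import sqrt
from statistics import NormalDist


def byars_limits(count, confidence=0.95):
    """Return Byar's approximate (lower, upper) limits for a Poisson count."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = count * (1 - 1 / (9 * count) - z / (3 * sqrt(count))) ** 3
    c1 = count + 1
    upper = c1 * (1 - 1 / (9 * c1) + z / (3 * sqrt(c1))) ** 3
    return lower, upper


def dsr_with_dobson(counts, pops, std_pops, multiplier=100000, confidence=0.95):
    """Return (dsr, lower, upper), or None when there are fewer than 10 events."""
    total_count = sum(counts)
    if total_count < 10:
        return None  # assumption: too few events for a reliable DSR
    total_std = sum(std_pops)
    rows = list(zip(counts, pops, std_pops))
    # Weighted sum of category-specific rates, weights = standard populations.
    dsr = sum(w * o / n for o, n, w in rows) / total_std
    variance = sum(w**2 * o / n**2 for o, n, w in rows) / total_std**2
    byar_lower, byar_upper = byars_limits(total_count, confidence)
    # Dobson: rescale the Byar's limits on the total count onto the DSR scale.
    scale = sqrt(variance / total_count)
    lower = dsr + scale * (byar_lower - total_count)
    upper = dsr + scale * (byar_upper - total_count)
    return dsr * multiplier, lower * multiplier, upper * multiplier


# Hypothetical two-category example.
print(dsr_with_dobson([10, 30], [1000, 2000], [3000, 1000]))
```

With a single standardisation category the result reduces to the crude rate with Byar's limits, which is a convenient sanity check.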

References

Breslow NE, Day NE. Statistical methods in cancer research, volume II: The design and analysis of cohort studies. Lyon: International Agency for Research on Cancer, World Health Organisation; 1987.
Dobson A et al. Confidence intervals for weighted sums of Poisson parameters. Stat Med 1991;10:457-62.

PHStatsMethods.ISRate.ph_ISRate(df, num_col, denom_col, ref_num_col, ref_denom_col, group_cols=None, metadata=True, confidence=0.95, multiplier=100000, **kwargs)

Calculates indirectly standardised rates with confidence limits using Byar’s or exact CI method.

Parameters:

df – DataFrame containing the data.
num_col (str) – Field containing observed number of events.
denom_col (str) – Field containing population at risk.
ref_num_col (str) – Observed events in the reference population.
ref_denom_col (str) – Population at risk in the reference population.
group_cols (str | list) – Columns to group data by.
metadata (bool) – Include metadata columns.
confidence (float | list) – Confidence levels, default 0.95.
multiplier (int) – The multiplier for the rate calculation, default 100000.

Other Parameters:

ref_df – DataFrame of reference data to join.
ref_join_left (str | list) – A string or list of column name(s) in df to join on to.
ref_join_right (str | list) – A string or list of column name(s) in ref_df to join on to.
obs_df – DataFrame of total observed events for each group.
obs_join_left (str | list) – A string or list of column name(s) in df to join on to.
obs_join_right (str | list) – A string or list of column name(s) in obs_df to join on to.

Returns:

DataFrame containing calculated IS Rates.

Return type:

Pandas DataFrame

Notes

For numerators >= 10 Byar’s method (1) is applied using the internal byars_lower and byars_upper functions. For small numerators Byar’s method is less accurate and so an exact method (2) based on the Poisson distribution is used.
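For orientation, the point estimate of an indirectly standardised rate can be sketched as follows. This is an assumption about the usual definition (observed over expected events, multiplied by the reference crude rate), not the package's code; indirectly_standardised_rate is a hypothetical name, inputs are plain lists for one grouping set, and confidence limits are omitted.

```python
def indirectly_standardised_rate(counts, pops, ref_counts, ref_pops, multiplier=100000):
    """Return (observed, expected, rate) for one grouping set."""
    observed = sum(counts)
    # Expected events: apply each reference category rate to the local population.
    expected = sum(n * ref_o / ref_n for n, ref_o, ref_n in zip(pops, ref_counts, ref_pops))
    # Crude rate of the reference population as a whole.
    ref_rate = sum(ref_counts) / sum(ref_pops)
    return observed, expected, observed / expected * ref_rate * multiplier


# Hypothetical two-category example.
print(indirectly_standardised_rate([6, 39], [1000, 2000], [50, 200], [10000, 10000]))
```

When observed and expected events are equal, the result is simply the reference crude rate.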

References

Breslow NE, Day NE. Statistical methods in cancer research, volume II: The design and analysis of cohort studies. Lyon: International Agency for Research on Cancer, World Health Organisation; 1987.
Armitage P, Berry G. Statistical methods in medical research (4th edn). Oxford: Blackwell; 2002.

PHStatsMethods.ISRatio.ph_ISRatio(df, num_col, denom_col, ref_num_col, ref_denom_col, group_cols=None, metadata=True, confidence=0.95, refvalue=1, **kwargs)

Calculates standardised mortality ratios (or indirectly standardised ratios) with confidence limits using Byar’s (1) or exact (2) CI method.

Parameters:

df – DataFrame containing the data to calculate IS ratios for.
num_col (str) – Field name from data containing the observed number of events for each standardisation category (e.g. age band) within each grouping set (e.g. area). If obs_df is provided, num_col refers to the observations in the obs_df DataFrame.
denom_col (str) – Field name from data containing the population for each standardisation category (e.g. age band).
ref_num_col (str) – The observed number of events in the reference population for each standardisation category (e.g. age band); field name from df or ref_df.
ref_denom_col (str) – The reference population for each standardisation category (e.g. age band).
group_cols (str | list) – A string or list of column name(s) to group the data by.
confidence (float | list) – Confidence level(s) to use, either as a float, list of float values or None. Each level must be between 0.9 and 1. Defaults to 0.95 (approximately 2 standard deviations from the mean).
refvalue (int) – The standardised reference ratio. Defaults to 1.

Other Parameters:

ref_df – DataFrame of reference data to join.
ref_join_left (str | list) – A string or list of column name(s) in df to join on to.
ref_join_right (str | list) – A string or list of column name(s) in ref_df to join on to.
obs_df – DataFrame of total observed events for each group.
obs_join_left (str | list) – A string or list of column name(s) in df to join on to.
obs_join_right (str | list) – A string or list of column name(s) in obs_df to join on to.

Returns:

DataFrame containing calculated IS Ratios.

Return type:

Pandas DataFrame

Notes

For numerators >= 10 Byar’s method (1) is applied using the internal byars_lower and byars_upper functions. For small numerators Byar’s method is less accurate and so an exact method (2) based on the Poisson distribution is used.
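A hedged sketch of the ratio with Byar's limits is shown below. It only covers the branch for 10 or more observed events and is not the package's code; standardised_ratio and byars_limits are hypothetical names, and inputs are plain lists for a single grouping set.

```python
from math import sqrt
from statistics import NormalDist


def byars_limits(count, confidence=0.95):
    """Return Byar's approximate (lower, upper) limits for a Poisson count."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    lower = count * (1 - 1 / (9 * count) - z / (3 * sqrt(count))) ** 3
    c1 = count + 1
    upper = c1 * (1 - 1 / (9 * c1) + z / (3 * sqrt(c1))) ** 3
    return lower, upper


def standardised_ratio(counts, pops, ref_counts, ref_pops, refvalue=1, confidence=0.95):
    """Return (ratio, lower, upper) as observed / expected, scaled by refvalue."""
    observed = sum(counts)
    if observed < 10:
        raise ValueError("this sketch only covers the Byar's branch (observed >= 10)")
    expected = sum(n * ref_o / ref_n for n, ref_o, ref_n in zip(pops, ref_counts, ref_pops))
    lower, upper = byars_limits(observed, confidence)
    # Limits are computed on the observed count, then divided by expected events.
    scale = refvalue / expected
    return observed * scale, lower * scale, upper * scale


# Hypothetical example: 100 observed events against 80 expected.
print(standardised_ratio([40, 60], [1000, 2000], [200, 300], [10000, 10000]))
```

Setting refvalue to 100 expresses the same result on the scale where 100 means "as expected".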

References

Breslow NE, Day NE. Statistical methods in cancer research, volume II: The design and analysis of cohort studies. Lyon: International Agency for Research on Cancer, World Health Organisation; 1987.
Armitage P, Berry G. Statistical methods in medical research (4th edn). Oxford: Blackwell; 2002.

PHStatsMethods.quantiles.ph_quantile(df, values, group_cols=None, nquantiles=10, invert=True, type='full')

Assigns data to quantiles based on numeric data rankings.

Parameters:

df – A DataFrame containing the quantitative data to be assigned to quantiles. If group_cols is used, separate sets of quantiles will be assigned for each grouping set.
values (str) – Column name from data containing the numeric values to rank data by and assign quantiles.
group_cols (str | list) – A string or list of column name(s) to group the data by. Defaults to None.
nquantiles (int) – The number of quantiles to separate each grouping set into.
invert (bool) – Whether the quantiles should be directly (False) or inversely (True) related to the numerical value order.
type (str) – Defines whether to include metadata columns in output to reference the arguments passed; can be “standard” or “full”.

Returns:

When type = “full”, returns the original DataFrame with quantile (quantile value), nquantiles (number of quantiles requested), groupvars (grouping sets quantiles assigned within) and invert (indicating direction of quantile assignment) fields appended.

Return type:

Pandas DataFrame

Notes

See OHID Technical Guide - Assigning Deprivation Categories for methodology. In particular, note that this function strictly applies the algorithm defined but some manual review, and potentially adjustment, is advised in some cases where multiple small areas with equal rank fall across a natural quantile boundary.
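To make the idea of rank-based assignment concrete, here is a deliberately simplified stand-in. It is not the algorithm from the OHID guide or the package: assign_quantiles is a hypothetical function, it works on a plain list for one grouping set, and the tie handling (tied values share the lowest rank) is an assumption.

```python
def assign_quantiles(values, nquantiles=10, invert=True):
    """Return a quantile number (1..nquantiles) for each value, in input order."""
    n = len(values)
    # With invert=True the largest value is ranked first and lands in quantile 1.
    ordered = sorted(values, reverse=invert)
    # Tied values share the rank of their first occurrence in the ordering.
    first_rank = {}
    for position, value in enumerate(ordered, start=1):
        first_rank.setdefault(value, position)
    # Integer arithmetic maps ranks 1..n onto quantiles 1..nquantiles.
    return [(first_rank[v] - 1) * nquantiles // n + 1 for v in values]


scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(assign_quantiles(scores, nquantiles=5))                 # highest values in quantile 1
print(assign_quantiles(scores, nquantiles=5, invert=False))   # lowest values in quantile 1
```

Because tied values always receive the same quantile, groups can end up uneven when ties straddle a boundary, which is the situation the note above suggests reviewing manually.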

PHStatsMethods.funnels.assign_funnel_significance(df, num_col, statistic, denom_col=None, rate=None, rate_type=None, multiplier=None)

Identifies whether each value in a dataset falls outside of 95 and/or 99.8 percent control limits based on the aggregated average value across the whole dataset as an indicator of statistically significant difference.

Parameters:

df – DataFrame containing the data to calculate control limits for.
num_col (str) – Name of column containing observed number of cases in the sample (the numerator of the population).
statistic (str) – Type of statistic to inform funnel calculations: ‘proportion’, ‘rate’, or ‘ratio’
denom_col (str) – Name of column containing number of cases in sample (the denominator of the population).
metadata (bool) – Whether to include information on the statistic and confidence interval methods.
rate (str) – Column name containing the ‘rate’.
rate_type (str) – If statistic is ‘rate’, specify either ‘dsr’ or ‘crude’.
multiplier (int) – Multiplier the rate is normalised with (i.e. per 100000) only required when statistic is ‘rate’.

Returns:

DataFrame of calculated significance levels.

Return type:

Pandas DataFrame
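One way to picture the significance check for proportions is sketched below, assuming Wilson Score limits are calculated for each area at the 95% and 99.8% levels and compared with the overall average. This is an illustration rather than the package's method; the function names and the label strings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(count, denominator, confidence):
    """Return (lower, upper) Wilson Score limits for count / denominator."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = count / denominator
    centre = 2 * count + z**2
    spread = z * sqrt(z**2 + 4 * count * (1 - p))
    scale = 2 * (denominator + z**2)
    return (centre - spread) / scale, (centre + spread) / scale


def funnel_significance(counts, denominators):
    """Label each area relative to the aggregated average proportion."""
    average = sum(counts) / sum(denominators)
    labels = []
    for count, denom in zip(counts, denominators):
        label = "Not significant"
        # The 99.8% check runs last so that it overrides a 95% result.
        for confidence, tag in ((0.95, "0.025"), (0.998, "0.001")):
            lower, upper = wilson_interval(count, denom, confidence)
            if average < lower:
                label = f"High ({tag})"
            elif average > upper:
                label = f"Low ({tag})"
        labels.append(label)
    return labels


# Hypothetical data whose overall average proportion is 0.5.
print(funnel_significance([61, 39, 500, 500], [100, 100, 1000, 1000]))
```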

PHStatsMethods.funnels.calculate_funnel_limits(df, num_col, statistic, multiplier, denom_col=None, metadata=True, rate=None, rate_type=None, ratio_type=None, years_of_data=None)

Calculates control limits adopting a consistent method as per the Fingertips Technical Guidance.

Parameters:

df – DataFrame containing the data to calculate control limits for.
num_col (str) – Name of column containing observed number of cases in the sample (the numerator of the population).
statistic (str) – Type of statistic to inform funnel calculations: ‘proportion’, ‘rate’, or ‘ratio’.
multiplier (int) – Multiplier used to express the final values (e.g. 100 = percentage).
denom_col (str) – Name of column containing number of cases in sample (the denominator of the population).
metadata (bool) – Whether to include information on the statistic and confidence interval methods.
rate (str) – Column name containing the ‘rate’.
rate_type (str) – If statistic is ‘rate’, specify either ‘dsr’ or ‘crude’.
ratio_type (str) – If statistic is ‘ratio’, specify either ‘count’ or ‘isr’ (indirectly standardised ratio).
years_of_data (int) – Number of years the data represents; this is required if statistic is ‘ratio’.

Returns:

DataFrame of calculated control limits.

Return type:

Pandas DataFrame

PHStatsMethods.funnels.calculate_funnel_points(df, num_col, rate, rate_type, denom_col=None, multiplier=100000, years_of_data=1)

For rate-based funnels: derives rate and annual population values for charting. Rates are removed where the rate type is ‘dsr’ and the number of observed events is below 10.

Parameters:

df – DataFrame containing the data to calculate funnel points for.
num_col (str) – Name of column containing observed number of cases in the sample (the numerator of the population).
statistic (str) – Type of statistic to inform funnel calculations: ‘proportion’, ‘rate’, or ‘ratio’.
denom_col (str) – Name of column containing number of cases in sample (the denominator of the population).
metadata (bool) – Whether to include information on the statistic and confidence interval methods.
years_of_data (int) – Number of years the data represents.
multiplier (int) – Multiplier the rate is normalised with (i.e. per 100000).

Returns:

DataFrame of calculated funnel points. Two columns are added: the first has the same name as the rate field with the suffix ‘_chart’, and the second is called denominator_derived.

Return type:

Pandas DataFrame

PHStatsMethods.confidence_intervals.byars(value, confidence=0.95, denominator=None, rate=None, exact_method_for_low_numbers=True)

Calculates confidence intervals using Byar’s method (1).

Parameters:

value (int | float) – Value to calculate confidence intervals, must be over 9 to calculate Byar’s, else exact method is used.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.
denominator (int | float, Optional) – Denominator to calculate Byar’s on a rate.
rate (int | float, Optional) – Rate to calculate Byar’s on.
exact_method_for_low_numbers (bool) – Boolean instruction as to whether to use exact method for low numbers. Default True.

Returns:

Either Exact method or Byar’s method confidence intervals.

Return type:

Tuple

References

Breslow NE, Day NE. Statistical methods in cancer research, volume II: The design and analysis of cohort studies. Lyon: International Agency for Research on Cancer, World Health Organisation; 1987.

PHStatsMethods.confidence_intervals.byars_lower(value, confidence=0.95)

Calculates lower confidence interval using Byar’s method (1).

Parameters:

value (int | float) – Value to calculate lower confidence interval.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Byar’s lower confidence interval.

Return type:

Float

References

Breslow NE, Day NE. Statistical methods in cancer research, volume II: The design and analysis of cohort studies. Lyon: International Agency for Research on Cancer, World Health Organisation; 1987.

PHStatsMethods.confidence_intervals.byars_upper(value, confidence=0.95)

Calculates upper confidence interval using Byar’s method (1).

Parameters:

value (int | float) – Value to calculate upper confidence interval.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Byar’s upper confidence interval.

Return type:

Float

References

Breslow NE, Day NE. Statistical methods in cancer research, volume II: The design and analysis of cohort studies. Lyon: International Agency for Research on Cancer, World Health Organisation; 1987.

PHStatsMethods.confidence_intervals.dobson_lower(value, total_count, var, confidence, multiplier)

Calculates lower confidence interval using Dobson’s method.

Parameters:

value (int | float) – The value to calculate confidence intervals over.
total_count (int | float) – The total count to calculate Dobson’s confidence interval.
var (float) – Variance to be used in the calculation.
confidence (float) – Confidence interval to be used, e.g. 0.95 for 95% confidence interval.
multiplier (int) – Multiplier to be used in the calculation.

Returns:

Dobson’s lower confidence interval.

Return type:

Float

PHStatsMethods.confidence_intervals.dobson_upper(value, total_count, var, confidence, multiplier)

Calculates upper confidence interval using Dobson’s method.

Parameters:

value (int | float) – The value to calculate confidence intervals over.
total_count (int | float) – The total count to calculate Dobson’s confidence interval.
var (float) – Variance to be used in the calculation.
confidence (float) – Confidence interval to be used, e.g. 0.95 for 95% confidence interval.
multiplier (int) – Multiplier to be used in the calculation.

Returns:

Dobson’s upper confidence interval.

Return type:

Float

PHStatsMethods.confidence_intervals.exact(value, confidence=0.95)

Calculates confidence intervals using the exact method (1).

Parameters:

value (int | float) – Value to calculate confidence intervals.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Exact lower and upper confidence intervals.

Return type:

Tuple

References

Armitage P, Berry G. Statistical methods in medical research (4th edn). Oxford: Blackwell; 2002.
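The exact Poisson limits can be illustrated without any statistical library by solving the two tail-probability equations numerically. This is a sketch under stated assumptions rather than the package's implementation: the helper names are hypothetical, bisection is a swapped-in technique (exact limits are more commonly obtained from chi-squared quantiles), and the search bound is only suitable for small counts.

```python
from math import exp


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term."""
    term = exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def solve_decreasing(func, target, lo, hi, iterations=200):
    """Bisection for func(mu) == target, where func decreases as mu grows."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if func(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_poisson_limits(count, confidence=0.95):
    """Return exact (lower, upper) limits for a small integer Poisson count."""
    alpha = 1 - confidence
    bound = 10 * count + 50  # generous search bound for small counts
    if count == 0:
        lower = 0.0
    else:
        # Lower limit: P(X >= count | mu) = alpha / 2.
        lower = solve_decreasing(lambda mu: poisson_cdf(count - 1, mu), 1 - alpha / 2, 0.0, bound)
    # Upper limit: P(X <= count | mu) = alpha / 2.
    upper = solve_decreasing(lambda mu: poisson_cdf(count, mu), alpha / 2, 0.0, bound)
    return lower, upper


print(exact_poisson_limits(5))
```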

PHStatsMethods.confidence_intervals.exact_lower(value, confidence=0.95)

Calculates lower confidence interval using the exact method (1).

Parameters:

value (int | float) – Value to calculate lower confidence interval.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Exact lower confidence interval.

Return type:

Float

References

Armitage P, Berry G. Statistical methods in medical research (4th edn). Oxford: Blackwell; 2002.

PHStatsMethods.confidence_intervals.exact_upper(value, confidence=0.95)

Calculates upper confidence interval using the exact method (1).

Parameters:

value (int | float) – Value to calculate upper confidence interval.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Exact upper confidence interval.

Return type:

Float

References

Armitage P, Berry G. Statistical methods in medical research (4th edn). Oxford: Blackwell; 2002.

PHStatsMethods.confidence_intervals.student_t_dist(value_count, st_dev, confidence=0.95)

Calculates the Student-t value to be used to create confidence intervals using the Student-t distribution.

Parameters:

value_count (int | float) – The total value to calculate confidence intervals over.
st_dev (int | float) – The standard deviation to be used in the calculation.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Student-t distribution value.

Return type:

Float

PHStatsMethods.confidence_intervals.wilson(count, denominator, confidence=0.95)

Calculates the CI using Wilson Score method (1, 2).

Parameters:

count (int | float) – Numerator to calculate Wilson’s confidence interval.
denominator (int | float) – Denominator to calculate Wilson’s confidence interval.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Wilson’s lower and upper confidence intervals.

Return type:

Tuple

References

Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc; 1927; 22. Pg 209 to 212.
Newcombe RG, Altman DG. Proportions and their differences. In Altman DG et al. (eds). Statistics with confidence (2nd edn). London: BMJ Books; 2000. Pg 46 to 48.

PHStatsMethods.confidence_intervals.wilson_lower(count, denominator, confidence=0.95)

Calculates the lower CI using Wilson Score method (1, 2).

Parameters:

count (int | float) – Numerator to calculate Wilson’s lower confidence interval.
denominator (int | float) – Denominator to calculate Wilson’s lower confidence interval.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Wilson’s lower confidence interval.

Return type:

Float

References

Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc; 1927; 22. Pg 209 to 212.
Newcombe RG, Altman DG. Proportions and their differences. In Altman DG et al. (eds). Statistics with confidence (2nd edn). London: BMJ Books; 2000. Pg 46 to 48.

PHStatsMethods.confidence_intervals.wilson_upper(count, denominator, confidence=0.95)

Calculates the upper CI using Wilson Score method (1, 2).

Parameters:

count (int | float) – Numerator to calculate Wilson’s upper confidence interval.
denominator (int | float) – Denominator to calculate Wilson’s upper confidence interval.
confidence (float) – Confidence interval to use, default 0.95 for 95% confidence interval.

Returns:

Wilson’s upper confidence interval.

Return type:

Float

References

Wilson EB. Probable inference, the law of succession, and statistical inference. J Am Stat Assoc; 1927; 22. Pg 209 to 212.
Newcombe RG, Altman DG. Proportions and their differences. In Altman DG et al. (eds). Statistics with confidence (2nd edn). London: BMJ Books; 2000. Pg 46 to 48.

PHStatsMethods.utils.euro_standard_pop()

Generates a dataframe containing the European Standard Population.

Returns:

DataFrame containing the European Standard Population.

Return type:

Pandas DataFrame