Welcome to pyreadstat’s documentation!

Metadata Object Description

Each parsing function returns a metadata object in addition to a pandas dataframe. That object contains the following fields:

  • notes: notes or documents (text annotations) attached to the file if any (spss and stata).

  • column_names : a list with the names of the columns.

  • column_labels : a list with the column labels, if any.

  • column_names_to_labels : a dictionary with column_names as keys and column_labels as values

  • file_encoding : a string with the file encoding, may be empty

  • number_columns : an int with the number of columns

  • number_rows : an int with the number of rows. If the metadataonly option was used, it may be None when the number of rows could not be determined. In that case, parse the whole file to obtain the number of rows.

  • variable_value_labels : a dict with keys being variable names, and values being a dict with values as keys and labels as values. It may be empty if the dataset did not contain such labels. For sas7bdat files it will be empty unless a sas7bcat was given. It is a combination of value_labels and variable_to_label.

  • value_labels : a dict with label name as key and a dict as value, with values as keys and labels as values. In the case of parsing a sas7bcat file this is where the formats are.

  • variable_to_label : A dict with variable name as key and label name as value. Label names are those described in value_labels. Sas7bdat files may have this member populated and its information can be used to match the information in the value_labels coming from the sas7bcat file.

  • original_variable_types : a dict of variable name to variable format in the original file. For debugging purposes.

  • readstat_variable_types : a dict of variable name to variable type in the original file as extracted by Readstat. For debugging purposes. In SAS and SPSS variables will be either double (numeric in the original app) or string (character). Stata has in addition int8, int32 and float types.

  • table_name : table name (string)

  • file_label : file label (SAS) (string)

  • missing_ranges: a dict with keys being variable names. Values are a list of dicts. Each dict contains two keys, ‘lo’ and ‘hi’ being the lower boundary and higher boundary for the missing range. Even if the value in both lo and hi are the same, the two elements will always be present. This appears for SPSS (sav) files when using the option user_missing=True: user defined missing values appear not as nan but as their true value and this dictionary stores the information about which values are to be considered missing.

  • missing_user_values: a dict with keys being variable names. Values are a list of character values (A to Z and _ for SAS, a to z for Stata) representing user defined missing values in SAS and Stata. This appears when using user_missing=True in read_sas7bdat or read_dta if user defined missing values are present.

  • variable_alignment: a dict with keys being variable names and values being the display alignment: left, center, right or unknown

  • variable_storage_width: a dict with keys being variable names and values being the storage width

  • variable_display_width: a dict with keys being variable names and values being the display width

  • variable_measure: a dict with keys being variable names and values being the measure: nominal, ordinal, scale or unknown
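
To make the relationship between the three label fields concrete, here is a minimal sketch in plain Python. The helper name combine_labels and the sample dictionaries are hypothetical. It only mimics how a variable_value_labels-style dict can be derived from value_labels and variable_to_label, and is not pyreadstat's own code.

```python
def combine_labels(value_labels, variable_to_label):
    """Build a variable_value_labels-style dict from the two simpler dicts."""
    combined = {}
    for variable, label_name in variable_to_label.items():
        # Skip variables whose label set is unknown (e.g. no catalog was read).
        if label_name in value_labels:
            combined[variable] = dict(value_labels[label_name])
    return combined


# Hypothetical metadata fields, shaped as described in the list above.
value_labels = {"YESNO": {1.0: "yes", 2.0: "no"}}
variable_to_label = {"smoker": "YESNO", "drinker": "YESNO", "city": "CITYFMT"}

variable_value_labels = combine_labels(value_labels, variable_to_label)
print(variable_value_labels)
```

In this sketch the variable city is left out because no label set named CITYFMT is available, which resembles reading a sas7bdat file without its catalog.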

There are two functions to deal with value labels: set_value_labels and set_catalog_to_sas. You can read about them in the next section.

Functions Documentation

pyreadstat.pyreadstat.read_dta()

Read a STATA dta file

Parameters
  • filename_path (str, bytes or Path-like object) – path to the file. In Python 2.7 the string is assumed to be utf-8 encoded

  • metadataonly (bool, optional) – by default False. If True, no data will be read but only metadata, so that you can get all elements in the metadata object. The data frame will be set with the correct column names but no data.

  • dates_as_pandas_datetime (bool, optional) – by default False. If true dates will be transformed to pandas datetime64 instead of date.

  • apply_value_formats (bool, optional) – by default False. If True, values in the dataframe are replaced by their value labels from the metadata, if any appropriate ones are found.

  • formats_as_category (bool, optional) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed for their formatted version will be transformed into pandas categories.

  • formats_as_ordered_category (bool, optional) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories. It has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of the value of formats_as_category.

  • encoding (str, optional) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an iconv-compatible name

  • usecols (list, optional) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!

  • user_missing (bool, optional) – by default False, in which case user defined missing values are delivered as nan. If True, the missing values will be delivered as is, and an extra piece of information will be set in the metadata (missing_user_values) to be able to interpret those values as missing.

  • disable_datetime_conversion (bool, optional) – if True pyreadstat will not attempt to convert dates, datetimes and times to python objects but those columns will remain as numbers. In order to convert them later to an appropriate python object, the user can use the information about the original variable format stored in the metadata object in original_variable_types. Disabling datetime conversion speeds up reading files. In addition it helps to overcome situations where there are datetimes that are beyond the limits of python datetime (which is limited to year 10,000; dates beyond that will raise an Overflow error in pyreadstat).

  • row_limit (int, optional) – maximum number of rows to read. The default is 0 meaning unlimited.

  • row_offset (int, optional) – start reading rows after this offset. By default 0, meaning start with the first row not skipping anything.

  • output_format (str, optional) – one of ‘pandas’ (default) or ‘dict’. If ‘dict’, a dictionary with numpy arrays as values will be returned, which the user can then convert to their preferred data format. Using ‘dict’ is faster than the other types because the conversion to a pandas dataframe is avoided.

Returns

  • data_frame (pandas dataframe) – a pandas data frame with the data. If the output_format is other than ‘pandas’ the object type will change accordingly.

  • metadata – object with metadata. Look at the documentation for more information.
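
When disable_datetime_conversion is True, date columns stay numeric and original_variable_types tells you which ones are dates. The following is a minimal sketch of a later conversion, assuming Stata daily dates (formats starting with %td) that count days from 1 January 1960. The helper names and the sample data are hypothetical, and plain lists stand in for the columns that pyreadstat would return.

```python
from datetime import date, timedelta

STATA_EPOCH = date(1960, 1, 1)  # Stata %td values count days from this date


def stata_days_to_date(days):
    """Convert one numeric %td value to a date; None or NaN becomes None."""
    if days is None or days != days:  # NaN is the only value not equal to itself
        return None
    return STATA_EPOCH + timedelta(days=int(days))


def convert_td_columns(columns, original_variable_types):
    """Return a copy of columns with %td variables converted to dates."""
    converted = {}
    for name, values in columns.items():
        fmt = original_variable_types.get(name) or ""
        if fmt.startswith("%td"):
            converted[name] = [stata_days_to_date(v) for v in values]
        else:
            converted[name] = list(values)
    return converted


# Hypothetical data as it might look with datetime conversion disabled.
columns = {"visit": [0.0, 366.0, float("nan")], "age": [31.0, 45.0, 52.0]}
original_variable_types = {"visit": "%td", "age": "%9.0g"}

result = convert_td_columns(columns, original_variable_types)
print(result["visit"])
```

Other Stata formats (for example %tc datetimes) use different units and would need their own conversion. They are left untouched here.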

pyreadstat.pyreadstat.read_file_in_chunks()

Returns a generator that reads a file in chunks.

Parameters
  • read_function (pyreadstat function) – a pyreadstat reading function

  • file_path (string) – path to the file to be read

  • chunksize (integer, optional) – size of the chunks to read

  • offset (integer, optional) – start reading the file after a certain number of rows

  • limit (integer, optional) – stop reading the file after a certain number of rows, will be added to offset

  • multiprocess (bool, optional) – use multiprocessing to read each chunk?

  • num_processes (integer, optional) – in case multiprocess is true, how many workers/processes to spawn?

  • kwargs (dict, optional) – any other keyword argument to pass to the read_function. row_limit and row_offset will be discarded if present.

Yields
  • data_frame (pandas dataframe) – a pandas data frame with the data

  • metadata – object with metadata. Look at the documentation for more information.

  • it (generator) – A generator that reads the file in chunks.
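
The chunking idea can be sketched with the row_offset and row_limit parameters that the reading functions accept. The code below is an illustration only, not pyreadstat's implementation: fake_reader is a hypothetical stand-in for a reading function backed by an in-memory list, and read_in_chunks is a hypothetical generator showing how offset, limit and chunksize could interact.

```python
FAKE_FILES = {"demo": list(range(10))}  # hypothetical in-memory "file" of 10 rows


def fake_reader(path, row_offset=0, row_limit=0, **kwargs):
    """Stand-in for a reading function; row_limit=0 means unlimited."""
    rows = FAKE_FILES[path]
    end = len(rows) if row_limit == 0 else row_offset + row_limit
    return rows[row_offset:end], {"number_rows": len(rows)}


def read_in_chunks(read_function, file_path, chunksize, offset=0, limit=0, **kwargs):
    """Yield (data, metadata) pairs of at most chunksize rows each."""
    # The generator controls these two arguments, so user-supplied ones are dropped.
    kwargs.pop("row_limit", None)
    kwargs.pop("row_offset", None)
    last_row = offset + limit if limit > 0 else None
    while True:
        size = chunksize
        if last_row is not None:
            # Shrink the final chunk so reading stops at offset + limit.
            size = min(chunksize, last_row - offset)
            if size <= 0:
                break
        data, meta = read_function(file_path, row_offset=offset, row_limit=size, **kwargs)
        if len(data) == 0:
            break  # past the end of the file
        yield data, meta
        offset += size


for data, meta in read_in_chunks(fake_reader, "demo", chunksize=4):
    print(len(data), "rows in this chunk")
```

Each chunk can be processed and discarded before the next one is read, which keeps memory use bounded for large files.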

pyreadstat.pyreadstat.read_file_multiprocessing()

Reads a file in parallel using multiprocessing.

Parameters
  • read_function (pyreadstat function) – a pyreadstat reading function

  • file_path (string) – path to the file to be read

  • num_processes (integer, optional) – number of processes to spawn, by default the minimum of 4 and the number of cores on the computer

  • kwargs (dict, optional) – any other keyword argument to pass to the read_function.

Returns

  • data_frame (pandas dataframe) – a pandas data frame with the data

  • metadata – object with metadata. Look at the documentation for more information.
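
One way to picture parallel reading is that each process receives its own window of rows. The sketch below computes such windows as (row_offset, row_limit) pairs. The function name split_rows is hypothetical, the number of rows is assumed to be known in advance (for example from number_rows in the metadata), and the exact partitioning used by pyreadstat may differ.

```python
def split_rows(number_rows, num_processes):
    """Return (row_offset, row_limit) pairs that cover every row exactly once."""
    base, extra = divmod(number_rows, num_processes)
    windows = []
    offset = 0
    for worker in range(num_processes):
        # The first `extra` workers take one additional row each.
        size = base + (1 if worker < extra else 0)
        if size == 0:
            continue  # more workers than rows: nothing left to assign
        windows.append((offset, size))
        offset += size
    return windows


print(split_rows(10, 4))
```

Each pair could then be passed to a reading function as row_offset and row_limit, and the partial results concatenated in order.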

pyreadstat.pyreadstat.read_por()

Read an SPSS por file

Parameters
  • filename_path (str, bytes or Path-like object) – path to the file. In Python 2.7 the string is assumed to be utf-8 encoded

  • metadataonly (bool, optional) – by default False. If True, no data will be read but only metadata, so that you can get all elements in the metadata object. The data frame will be set with the correct column names but no data.

  • dates_as_pandas_datetime (bool, optional) – by default False. If true dates will be transformed to pandas datetime64 instead of date.

  • apply_value_formats (bool, optional) – by default False. If True, values in the dataframe are replaced by their value labels from the metadata, if any appropriate ones are found.

  • formats_as_category (bool, optional) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed for their formatted version will be transformed into pandas categories.

  • formats_as_ordered_category (bool, optional) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories. It has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of the value of formats_as_category.

  • encoding (str, optional) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an iconv-compatible name

  • usecols (list, optional) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!

  • disable_datetime_conversion (bool, optional) – if True pyreadstat will not attempt to convert dates, datetimes and times to python objects but those columns will remain as numbers. In order to convert them later to an appropriate python object, the user can use the information about the original variable format stored in the metadata object in original_variable_types. Disabling datetime conversion speeds up reading files. In addition it helps to overcome situations where there are datetimes that are beyond the limits of python datetime (which is limited to year 10,000; dates beyond that will raise an Overflow error in pyreadstat).

  • row_limit (int, optional) – maximum number of rows to read. The default is 0 meaning unlimited.

  • row_offset (int, optional) – start reading rows after this offset. By default 0, meaning start with the first row not skipping anything.

  • output_format (str, optional) – one of ‘pandas’ (default) or ‘dict’. If ‘dict’, a dictionary with numpy arrays as values will be returned, which the user can then convert to their preferred data format. Using ‘dict’ is faster than the other types because the conversion to a pandas dataframe is avoided.

Returns

  • data_frame (pandas dataframe) – a pandas data frame with the data. If the output_format is other than ‘pandas’ the object type will change accordingly.

  • metadata – object with metadata. Look at the documentation for more information.

pyreadstat.pyreadstat.read_sas7bcat()

Read a SAS sas7bcat file. The returned dataframe will be empty. The metadata object will contain a dictionary value_labels that contains the formats. When parsing the sas7bdat file, the dictionary variable_to_label in the metadata contains a map from variable name to format name. In order to apply the catalog to the sas7bdat file use set_catalog_to_sas or pass the catalog file as an argument to read_sas7bdat directly. SAS catalog files are difficult: some of them can be read only in a specific SAS version, may contain strange encodings, etc. Therefore many catalog files may not be readable by this application.

Parameters
  • filename_path (str, bytes or Path-like object) – path to the file. The string is assumed to be utf-8 encoded

  • encoding (str, optional) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an iconv-compatible name

  • output_format (str, optional) – one of ‘pandas’ (default) or ‘dict’. If ‘dict’ a dictionary with numpy arrays as values will be returned. Notice that for this function the resulting object is always empty, this is done for consistency with other functions but has no impact on performance.

Returns

  • data_frame (pandas dataframe) – a pandas data frame with the data (no data in this case, so will be empty). If the output_format is other than ‘pandas’ then the object type will change accordingly although the object will always be empty

  • metadata – object with metadata. The member value_labels is the one that contains the formats. Look at the documentation for more information.

pyreadstat.pyreadstat.read_sas7bdat()

Read a SAS sas7bdat file. It accepts the path to a sas7bcat.

Parameters
  • filename_path (str, bytes or Path-like object) – path to the file. In python 2.7 the string is assumed to be utf-8 encoded.

  • metadataonly (bool, optional) – by default False. If True, no data will be read but only metadata, so that you can get all elements in the metadata object. The data frame will be set with the correct column names but no data.

  • dates_as_pandas_datetime (bool, optional) – by default False. If true dates will be transformed to pandas datetime64 instead of date.

  • catalog_file (str, optional) – path to a sas7bcat file. By default is None. If not None, will parse the catalog file and replace the values by the formats in the catalog, if any appropriate is found. If this is not the behavior you are looking for, use read_sas7bcat to parse the catalog independently of the sas7bdat and set_catalog_to_sas to apply the resulting formats to sas7bdat files.

  • formats_as_category (bool, optional) – Will take effect only if the catalog_file was specified. If True the variables whose values were replaced by the formats will be transformed into pandas categories.

  • formats_as_ordered_category (bool, optional) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories. It has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of the value of formats_as_category.

  • encoding (str, optional) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an iconv-compatible name

  • usecols (list, optional) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!

  • user_missing (bool, optional) – by default False, in which case user defined missing values are delivered as nan. If True, the missing values will be delivered as is, and an extra piece of information will be set in the metadata (missing_user_values) to be able to interpret those values as missing.

  • disable_datetime_conversion (bool, optional) – if True pyreadstat will not attempt to convert dates, datetimes and times to python objects but those columns will remain as numbers. In order to convert them later to an appropriate python object, the user can use the information about the original variable format stored in the metadata object in original_variable_types. Disabling datetime conversion speeds up reading files. In addition it helps to overcome situations where there are datetimes that are beyond the limits of python datetime (which is limited to year 10,000; dates beyond that will raise an Overflow error in pyreadstat).

  • row_limit (int, optional) – maximum number of rows to read. The default is 0 meaning unlimited.

  • row_offset (int, optional) – start reading rows after this offset. By default 0, meaning start with the first row not skipping anything.

  • output_format (str, optional) – one of ‘pandas’ (default) or ‘dict’. If ‘dict’, a dictionary with numpy arrays as values will be returned, which the user can then convert to their preferred data format. Using ‘dict’ is faster than the other types because the conversion to a pandas dataframe is avoided.

Returns

  • data_frame (pandas dataframe) – a pandas data frame with the data. If the output_format is other than ‘pandas’ the object type will change accordingly.

  • metadata – object with metadata. The member variable_value_labels will be empty unless a valid catalog file is supplied. Look at the documentation for more information.
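
With user_missing=True, the missing_user_values metadata lists the characters that mark user defined missing values for each variable. The sketch below shows one way to mask them afterwards. It assumes that those markers show up in the column as the same single-character strings, uses a plain list in place of a dataframe column, and the helper name mask_user_missing is hypothetical.

```python
def mask_user_missing(values, missing_tags):
    """Replace user defined missing markers (e.g. 'A', '_') with None."""
    tags = set(missing_tags)
    return [None if isinstance(v, str) and v in tags else v for v in values]


# Hypothetical column and metadata entry for one numeric variable.
column = [1.5, "A", 2.0, "_", 3.0]
missing_user_values = {"score": ["A", "_"]}

cleaned = mask_user_missing(column, missing_user_values.get("score", []))
print(cleaned)
```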

pyreadstat.pyreadstat.read_sav()

Read an SPSS sav or zsav (compressed) file

Parameters
  • filename_path (str, bytes or Path-like object) – path to the file. In Python 2.7 the string is assumed to be utf-8 encoded

  • metadataonly (bool, optional) – by default False. If True, no data will be read but only metadata, so that you can get all elements in the metadata object. The data frame will be set with the correct column names but no data.

  • dates_as_pandas_datetime (bool, optional) – by default False. If true dates will be transformed to pandas datetime64 instead of date.

  • apply_value_formats (bool, optional) – by default False. If True, values in the dataframe are replaced by their value labels from the metadata, if any appropriate ones are found.

  • formats_as_category (bool, optional) – by default True. Takes effect only if apply_value_formats is True. If True, variables with values changed for their formatted version will be transformed into pandas categories.

  • formats_as_ordered_category (bool, optional) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories. It has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of the value of formats_as_category.

  • encoding (str, optional) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an iconv-compatible name

  • usecols (list, optional) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!

  • user_missing (bool, optional) – by default False, in which case user defined missing values are delivered as nan. If True, the missing values will be delivered as is, and an extra piece of information will be set in the metadata (missing_ranges) to be able to interpret those values as missing.

  • disable_datetime_conversion (bool, optional) – if True pyreadstat will not attempt to convert dates, datetimes and times to python objects but those columns will remain as numbers. In order to convert them later to an appropriate python object, the user can use the information about the original variable format stored in the metadata object in original_variable_types. Disabling datetime conversion speeds up reading files. In addition it helps to overcome situations where there are datetimes that are beyond the limits of python datetime (which is limited to year 10,000; dates beyond that will raise an Overflow error in pyreadstat).

  • row_limit (int, optional) – maximum number of rows to read. The default is 0 meaning unlimited.

  • row_offset (int, optional) – start reading rows after this offset. By default 0, meaning start with the first row not skipping anything.

  • output_format (str, optional) – one of ‘pandas’ (default) or ‘dict’. If ‘dict’, a dictionary with numpy arrays as values will be returned, which the user can then convert to their preferred data format. Using ‘dict’ is faster than the other types because the conversion to a pandas dataframe is avoided.

Returns

  • data_frame (pandas dataframe) – a pandas data frame with the data. If the output_format is other than ‘pandas’ the object type will change accordingly.

  • metadata – object with metadata. Look at the documentation for more information.
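
When user_missing=True is used, the missing_ranges metadata is what allows the true values to be interpreted as missing again. Here is a minimal sketch of that interpretation in plain Python. The helper name is_user_missing and the sample values are hypothetical, and the structure of missing_ranges follows the description in the metadata section (a list of dicts with 'lo' and 'hi' keys per variable).

```python
def is_user_missing(value, ranges):
    """Return True if value falls inside any of the lo/hi ranges."""
    for bounds in ranges:
        lo, hi = bounds["lo"], bounds["hi"]
        try:
            if lo <= value <= hi:
                return True
        except TypeError:
            # Value and bounds are of different types (e.g. None or str vs float).
            continue
    return False


# Hypothetical metadata: one range plus one discrete value (lo equals hi).
missing_ranges = {"income": [{"lo": -99.0, "hi": -90.0}, {"lo": 9999.0, "hi": 9999.0}]}
column = [52000.0, -95.0, 9999.0, 31000.0]

mask = [is_user_missing(v, missing_ranges.get("income", [])) for v in column]
print(mask)
```

A discrete missing value is simply a range whose lower and upper bounds coincide, so the same comparison handles both cases.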

pyreadstat.pyreadstat.read_xport()

Read a SAS xport file.

Parameters
  • filename_path (str, bytes or Path-like object) – path to the file. In python 2.7 the string is assumed to be utf-8 encoded

  • metadataonly (bool, optional) – by default False. If True, no data will be read but only metadata, so that you can get all elements in the metadata object. The data frame will be set with the correct column names but no data.

  • dates_as_pandas_datetime (bool, optional) – by default False. If true dates will be transformed to pandas datetime64 instead of date.

  • encoding (str, optional) – Defaults to None. If set, the system will use the defined encoding instead of guessing it. It has to be an iconv-compatible name

  • usecols (list, optional) – a list with column names to read from the file. Only those columns will be imported. Case sensitive!

  • disable_datetime_conversion (bool, optional) – if True pyreadstat will not attempt to convert dates, datetimes and times to python objects but those columns will remain as numbers. In order to convert them later to an appropriate python object, the user can use the information about the original variable format stored in the metadata object in original_variable_types. Disabling datetime conversion speeds up reading files. In addition it helps to overcome situations where there are datetimes that are beyond the limits of python datetime (which is limited to year 10,000; dates beyond that will raise an Overflow error in pyreadstat).

  • row_limit (int, optional) – maximum number of rows to read. The default is 0 meaning unlimited.

  • row_offset (int, optional) – start reading rows after this offset. By default 0, meaning start with the first row not skipping anything.

  • output_format (str, optional) – one of ‘pandas’ (default) or ‘dict’. If ‘dict’, a dictionary with numpy arrays as values will be returned, which the user can then convert to their preferred data format. Using ‘dict’ is faster than the other types because the conversion to a pandas dataframe is avoided.

Returns

  • data_frame (pandas dataframe) – a pandas data frame with the data. If the output_format is other than ‘pandas’ the object type will change accordingly.

  • metadata – object with metadata. Look at the documentation for more information.
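
The ‘dict’ output format returns one array per column, which is convenient to reshape into other structures. As an illustration, the sketch below turns such a column-oriented dictionary into a list of row dictionaries. Plain lists stand in for the numpy arrays, and the helper name columns_to_rows and the sample data are hypothetical.

```python
def columns_to_rows(columns):
    """Turn {column_name: values} into a list of {column_name: value} rows."""
    names = list(columns)
    # zip(*...) walks all columns in parallel, one row at a time.
    return [dict(zip(names, values)) for values in zip(*(columns[n] for n in names))]


# Hypothetical result of reading with output_format='dict'.
columns = {"id": [1, 2, 3], "sex": ["M", "F", "F"]}

rows = columns_to_rows(columns)
print(rows[0])
```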

pyreadstat.pyreadstat.set_catalog_to_sas()

Changes the values in the dataframe and sas_metadata according to the formats in the catalog. It will return a copy of the dataframe and metadata. If no appropriate formats were found, the result will be an unchanged copy of the original dataframe.

Parameters
  • sas_dataframe (pandas dataframe) – resulting from parsing a sas7bdat file

  • sas_metadata (pyreadstat metadata object) – resulting from parsing a sas7bdat file

  • catalog_metadata (pyreadstat metadata object) – resulting from parsing a sas7bcat (catalog) file

  • formats_as_category (bool, optional) – defaults to True. If True the variables having formats will be transformed into pandas categories.

  • formats_as_ordered_category (bool, optional) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories. It has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of the value of formats_as_category.

Returns

  • df_copy (pandas dataframe) – a copy of the original dataframe with the values changed, if appropriate formats were found, unaltered otherwise

  • metadata (dict) – a copy of the original sas_metadata enriched with catalog information if found, otherwise unaltered

pyreadstat.pyreadstat.set_value_labels()

Changes the values in the dataframe according to the value formats in the metadata. It will return a copy of the dataframe. If no appropriate formats were found, the result will be an unchanged copy of the original dataframe.

Parameters
  • dataframe (pandas dataframe) – resulting from parsing a file

  • metadata (dictionary) – resulting from parsing a file

  • formats_as_category (bool, optional) – defaults to True. If True the variables having formats will be transformed into pandas categories.

  • formats_as_ordered_category (bool, optional) – defaults to False. If True the variables having formats will be transformed into pandas ordered categories. It has precedence over formats_as_category, meaning if this is True, it will take effect irrespective of the value of formats_as_category.

Returns

  • df_copy (pandas dataframe) – a copy of the original dataframe with the values changed, if appropriate formats were found, unaltered otherwise
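
The effect of applying value labels can be illustrated without pandas. The sketch below works on a dictionary of lists rather than a dataframe, and the helper name apply_value_labels is hypothetical. It assumes that values without a matching label are kept as they are, and it returns a new structure so the input stays untouched, mirroring the copy semantics described above.

```python
def apply_value_labels(columns, variable_value_labels):
    """Return a copy of columns with values replaced by their labels."""
    labelled = {}
    for name, values in columns.items():
        labels = variable_value_labels.get(name, {})
        # Values without a label fall back to the original value.
        labelled[name] = [labels.get(v, v) for v in values]
    return labelled


# Hypothetical data and metadata.
columns = {"smoker": [1.0, 2.0, 3.0], "age": [31.0, 45.0, 52.0]}
variable_value_labels = {"smoker": {1.0: "yes", 2.0: "no"}}

labelled = apply_value_labels(columns, variable_value_labels)
print(labelled["smoker"])
```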

pyreadstat.pyreadstat.write_dta()

Writes a pandas data frame to a STATA dta file

Parameters
  • df (pandas data frame) – pandas data frame to write to dta

  • dst_path (str or pathlib.Path) – full path to the result dta file

  • file_label (str, optional) – a label for the file

  • column_labels (list or dict, optional) – labels for columns (variables). If a list, it must be the same length as the number of columns, and variables with no labels must be represented by None. If a dict, keys must be variable names and values variable labels; in that case there is no need to include all variables, and labels for non-existent variables will be ignored with no warning or error.

  • version (int, optional) – dta file version, supported from 8 to 15, default is 15

  • variable_value_labels (dict, optional) – value labels, a dictionary with key variable name and value a dictionary with key values and values labels. Variable names must match variable names in the dataframe otherwise will be ignored. Value types must match the type of the column in the dataframe.

  • missing_user_values (dict, optional) – user defined missing values for numeric variables. Must be a dictionary with keys being variable names and values being a list of missing values. Missing values must be a single character between a and z.

  • variable_format (dict, optional) – sets the format of a variable. Must be a dictionary with keys being the variable names and values being strings defining the format. See README, setting variable formats section, for more information.
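
The two accepted shapes of column_labels (list or dict) can be reconciled as in the following sketch. It is an illustration of the rules stated above, not pyreadstat's code: the helper name normalize_column_labels is hypothetical and the choice of ValueError for a wrong list length is an assumption.

```python
def normalize_column_labels(column_names, column_labels):
    """Return one label (or None) per column, in column order."""
    if column_labels is None:
        return [None] * len(column_names)
    if isinstance(column_labels, dict):
        # Missing variables get None; unknown keys are silently ignored.
        return [column_labels.get(name) for name in column_names]
    labels = list(column_labels)
    if len(labels) != len(column_names):
        raise ValueError("column_labels must have one entry per column")
    return labels


column_names = ["id", "age", "sex"]
from_dict = normalize_column_labels(column_names, {"age": "Age in years", "height": "ignored"})
from_list = normalize_column_labels(column_names, ["Identifier", None, "Sex"])
print(from_dict)
print(from_list)
```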

pyreadstat.pyreadstat.write_por()

Writes a pandas data frame to a SPSS POR file.

Parameters
  • df (pandas data frame) – pandas data frame to write to por

  • dst_path (str or pathlib.Path) – full path to the result por file

  • file_label (str, optional) – a label for the file

  • column_labels (list or dict, optional) – labels for columns (variables). If a list, it must be the same length as the number of columns, and variables with no labels must be represented by None. If a dict, keys must be variable names and values variable labels; in that case there is no need to include all variables, and labels for non-existent variables will be ignored with no warning or error.

  • variable_format (dict, optional) – sets the format of a variable. Must be a dictionary with keys being the variable names and values being strings defining the format. See README, setting variable formats section, for more information.

pyreadstat.pyreadstat.write_sav()

Writes a pandas data frame to a SPSS sav or zsav file.

Parameters
  • df (pandas data frame) – pandas data frame to write to sav or zsav

  • dst_path (str or pathlib.Path) – full path to the result sav or zsav file

  • file_label (str, optional) – a label for the file

  • column_labels (list or dict, optional) – labels for columns (variables). If a list, it must be the same length as the number of columns, and variables with no labels must be represented by None. If a dict, keys must be variable names and values variable labels; in that case there is no need to include all variables, and labels for non-existent variables will be ignored with no warning or error.

  • compress (boolean, optional) – if true a zsav will be written, by default False, a sav is written

  • row_compress (boolean, optional) – if true it applies row compression, by default False, compress and row_compress cannot be both true at the same time

  • note (str, optional) – a note to add to the file

  • variable_value_labels (dict, optional) – value labels, a dictionary with key variable name and value a dictionary with key values and values labels. Variable names must match variable names in the dataframe otherwise will be ignored. Value types must match the type of the column in the dataframe.

  • missing_ranges (dict, optional) – user defined missing values. Must be a dictionary with keys as variable names matching variable names in the dataframe. The values must be a list. Each element in that list can be either a discrete numeric or string value (max 3 per variable) or a dictionary with keys ‘hi’ and ‘lo’ to indicate the upper and lower range for numeric values (max 1 range value + 1 discrete value per variable). hi and lo may also be the same value in which case it will be interpreted as a discrete missing value. For this to be effective, values in the dataframe must be the same as reported here and not NaN.

  • variable_display_width (dict, optional) – set the display width for variables. Must be a dictionary with keys being variable names and values being integers.

  • variable_measure (dict, optional) – sets the measure type for a variable. Must be a dictionary with keys being variable names and values being strings one of “nominal”, “ordinal”, “scale” or “unknown” (default).

  • variable_format (dict, optional) – sets the format of a variable. Must be a dictionary with keys being the variable names and values being strings defining the format. See README, setting variable formats section, for more information.
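
The constraints on missing_ranges are easy to get wrong, so it can help to check a dictionary before writing. The sketch below encodes the limits stated above (at most 3 discrete values, or at most 1 range plus 1 discrete value, per variable). The function name check_missing_ranges and the wording of the messages are hypothetical, and pyreadstat may report such problems differently.

```python
def check_missing_ranges(missing_ranges, column_names):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for variable, entries in missing_ranges.items():
        if variable not in column_names:
            problems.append(f"{variable}: not a column in the dataframe")
            continue
        ranges = 0
        discrete = 0
        for entry in entries:
            if isinstance(entry, dict):
                if set(entry) != {"lo", "hi"}:
                    problems.append(f"{variable}: range needs exactly 'lo' and 'hi'")
                elif entry["lo"] == entry["hi"]:
                    discrete += 1  # equal bounds count as a discrete value
                else:
                    ranges += 1
            else:
                discrete += 1
        too_many = ranges > 1 or (ranges == 1 and discrete > 1) or (ranges == 0 and discrete > 3)
        if too_many:
            problems.append(f"{variable}: too many missing values")
    return problems


columns = ["income", "age"]
ok = check_missing_ranges({"income": [{"lo": -99, "hi": -90}, 9999]}, columns)
bad = check_missing_ranges({"age": [1, 2, 3, 4], "height": [0]}, columns)
print(ok)
print(bad)
```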

pyreadstat.pyreadstat.write_xport()

Writes a pandas data frame to a SAS Xport (xpt) file. If no table_name is specified the dataset has by default the name DATASET (take it into account if reading the file from SAS.) Versions 5 and 8 are supported, default is 8.

Parameters
  • df (pandas data frame) – pandas data frame to write to xport

  • dst_path (str or pathlib.Path) – full path to the result xport file

  • file_label (str, optional) – a label for the file

  • column_labels (list or dict, optional) – labels for columns (variables). If a list, it must be the same length as the number of columns, and variables with no labels must be represented by None. If a dict, keys must be variable names and values variable labels; in that case there is no need to include all variables, and labels for non-existent variables will be ignored with no warning or error.

  • table_name (str, optional) – name of the dataset, by default DATASET

  • file_format_version (int, optional) – XPORT file version, either 8 or 5, default is 8

  • variable_format (dict, optional) – sets the format of a variable. Must be a dictionary with keys being the variable names and values being strings defining the format. See README, setting variable formats section, for more information.
