Search notes:

Python pandas: DataFrame

A DataFrame is a 2 dimensional data structure (similar to the R data frame).
The columns of the DataFrame are (one dimensonal) Series objects.
import pandas

df = pandas.DataFrame(
      { 'col_1': pandas.Series( [  10 ,  20 ,  30 ] ),
        'col_2': pandas.Series( ['foo','bar','baz'] ),
        'col_3': pandas.Series( [  'X',  'Y',  'Z'] )
      })

print(type(df))
# <class 'pandas.core.frame.DataFrame'>

print(df)
#    col_1 col_2 col_3
# 0     10   foo     X
# 1     20   bar     Y
# 2     30   baz     Z
Github repository about-python, path: /libraries/pandas/DataFrame/DataFrame.py

When constructing a DataFrame, index selects the rows

The Series object has an index that assigns a value (label) to its elements.
When constructing a DataFrame object, the parameter index can be passed a list of labels with which the elements for the rows are selected:
import pandas as pd

s1 = pd.Series ( [  10 ,  20 ,  30 ,  40 ,  50 ],
        index =  ['abc','def','ghi','jkl','mno'])


s2 = pd.Series ( [  1.1,       3.3 , 4.4 , 5.5 , 6.6 , 7.7 ],
        index =  ['abc',      'ghi','jkl','mno','pqr','stu'])


s3 = pd.Series ( [ 111 , 222 , 333 , 444 , 555 , 6.6 , 7.7 ],
        index =  ['jkl','def','vwx','ghi','abc','stu','pqr'])

df = pd.DataFrame(
      { 'col_1': s1,
        'col_2': s2,
        'col_3': s3
      },
      index = ['abc', 'def', 'jkl', 'stu']
)

print(df)
#
#       col_1  col_2  col_3
#  abc   10.0    1.1  555.0
#  def   20.0    NaN  222.0
#  jkl   40.0    4.4  111.0
#  stu    NaN    7.7    6.6

Column selection

A data frame's column can be selected with df['column-name'] (very similar to a dict). This syntax returns a Series.
Multiple columns can be selected with df[ ['col-three', 'col-nine' …]]. This syntax returns a DataFrame.
import pandas as pd
  
df = pd.DataFrame ({
           'x': [7, 2, 3    ],
           'y': [4, 4, 5    ],
           'z': [9, 6, 5    ]
       },
    )

print(type(df['x']))
print(     df['x'] )

print(type(df[ ['y', 'z']]))
print(     df[ ['y', 'z']] )

Row selection

Rows can be selected by using a condition like so: df[ df['col'] > x ]
Multiple conditions are and'ed or or'ed with & and | (trying to use and or or results in the Value error The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
import pandas as pd

fruits = pd.DataFrame(
   {
    'price': [0.50, 1.25, 2.00, 0.75, 1.50, 0.80, 2.50, 0.60, 3.00],
    'color': ['red', 'green', 'orange', 'yellow', 'purple', 'brown', 'black', 'pink', 'white'],
    'avg_weight': [100, 150, 200, 50, 75, 300, 120, 90, 175],
    'sweetness': [4, 7, 6, 3, 9, 2, 5, 8, 7],
    'shelf_life': [5, 3, 7, 2, 6, 8, 4, 9, 10],
   },
   index =['apple', 'grapes', 'orange', 'lemon', 'banana', 'pear', 'blackberry', 'watermelon', 'kiwi']
)

print(fruits[ fruits['sweetness'] > 6 ])
#
#              price   color  avg_weight  sweetness  shelf_life
#  grapes       1.25   green         150          7           3
#  banana       1.50  purple          75          9           6
#  watermelon   0.60    pink          90          8           9
#  kiwi         3.00   white         175          7          10

print(fruits[ (fruits['sweetness' ] > 6)    &
              (fruits['shelf_life'] > 7) ])
#
#              price   color  avg_weight  sweetness  shelf_life
#  watermelon    0.6   pink          90          8           9
#  kiwi          3.0  white         175          7          10

print(fruits[ (fruits['sweetness' ] > 8)    |
              (fruits['shelf_life'] > 7) ])
#
#              price   color  avg_weight  sweetness  shelf_life
#  banana        1.5  purple          75          9           6
#  pear          0.8   brown         300          2           8
#  watermelon    0.6    pink          90          8           9
#  kiwi          3.0   white         175          7          10


print(fruits[ fruits['color'].isin( ['yellow', 'green', 'red' ]) ])
#
#          price   color  avg_weight  sweetness  shelf_life
#  apple    0.50     red         100          4           5
#  grapes   1.25   green         150          7           3
#  lemon    0.75  yellow          50          3           2
Select rows using the index:
print(fruits.loc[ ['lemon', 'grapes'] ])
#
#          price   color  avg_weight  sweetness  shelf_life
#  lemon    0.75  yellow          50          3           2
#  grapes   1.25   green         150          7           3
Select rows using their (zero based) «row-number»:
print(fruits.iloc[ [2,3,5,7] ])
#
#              price   color  avg_weight  sweetness  shelf_life
#  orange       2.00  orange         200          6           7
#  lemon        0.75  yellow          50          3           2
#  pear         0.80   brown         300          2           8
#  watermelon   0.60    pink          90          8           9
See also filter methods

group by

import pandas as pd
import numpy  as np

np.random.seed(2807)

df = pd.DataFrame({
    'gender': np.random.choice(['M', 'F'], size=10),
    'age'   : np.random.normal(40, 10, size=10).astype(int),
    'income': np.random.normal(75000, 15000, size=10).astype(int)
   },
   index = ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank', 'Gina', 'Harry', 'Isabelle', 'Jake']
)

print(df.groupby('gender')['income'].mean())
#
#  gender
#  F    80786.333333
#  M    72413.714286
#  Name: income, dtype: float64
See also statistical methods.

read and write methods

The pandas module has a few read_…() functions which construct a data frame from data in various formats.
The data of a data frame can be written (persisted) to another format with one of the to_…() methods.

fillna

import pandas

df = pandas.DataFrame(
      { 'col_1': pandas.Series( [   10, None,   30, None ] ),
        'col_2': pandas.Series( [ None,  'Y', None,  'Q' ] )
      })


df.col_1 = df.col_1.fillna( 999 )

print(df)
#    col_1 col_2
# 0     10  None
# 1    999     Y
# 2     30  None
# 3    999     Q

df.col_2.fillna ( '?', inplace=True )

print(df)
#    col_1 col_2
# 0     10     ?
# 1    999     Y
# 2     30     ?
# 3    999     Q
Github repository about-python, path: /libraries/pandas/DataFrame/fillna.py

modify column

import pandas

df = pandas.DataFrame(
      { 'col_1': pandas.Series( [  10 ,  20 ,  30 ] ),
        'col_2': pandas.Series( ['foo','bar','baz'] ),
        'col_3': pandas.Series( [  'X',  'Y',  'Z'] )
      })


df.col_1 = df.col_1.map(lambda n: n * n)

print(df)
#    col_1 col_2 col_3
# 0    100   foo     X
# 1    400   bar     Y
# 2    900   baz     Z
Github repository about-python, path: /libraries/pandas/DataFrame/modify-column.py

pivot table

import pandas
import numpy

df = pandas.DataFrame( {
        'c1'   : pandas.Series( ['foo','bar','baz',    'foo','bar','baz',    'bar',   'foo','foo'] ),
        'c2'   : pandas.Series( [  'X',  'Y',  'Z',      'X',  'Y',  'Z',      'X',     'Z',  'Z'] ),
        'vals' : pandas.Series( [  10 ,  20 ,  30 ,       1 ,   2 ,   3 ,      17 ,    100 ,  50 ] )
      })

piv = df.pivot_table(
            index  =['c1'  ], 
            columns=['c2'  ],
            values =['vals'],
            aggfunc=numpy.sum)

print(type(piv))
# <class 'pandas.core.frame.DataFrame'>

print(piv)
#      vals
# c2      X   Y    Z
# c1
# bar    17  22  NaN
# baz   NaN NaN   33
# foo    11 NaN  150
Github repository about-python, path: /libraries/pandas/DataFrame/pivot_table.py

Members of a DataFrame

Filter methods

Filter methods return an array (of the same length as the data frame) of booleans (True or False). Such an array is used to select rows.
isna()
isnull()
isin()
notna()
notnull()

Statistical/summary methods

max()
mean()
median()
min()
sum()
See also groupby()

to_ methods

A data frame's to_…() method converts the data of a data frame to another format.
to_clipboard()
to_csv()
to_dict()
to_excel()
to_feather()
to_gbq()
to_hdf()
to_html()
to_json()
to_latex()
to_markdown()
to_numpy()
to_orc()
to_parquet()
to_period()
to_pickle()
to_records()
to_sql()
to_stata()
to_string()
to_xml()
The corresponding read_…() methods are found in the pandas module.

_repr_ methods

_repr_… methods make rich display possible in IPython.
_repr_html_()
_repr_latex_()

Other methods

abs()
_accessors set
_accum_func()
add()
_add_numeric_operations()
add_prefix()
add_suffix()
agg()
_agg_by_level()
_agg_examples_doc str
aggregate()
_agg_summary_and_see_also_doc str
align()
_align_frame()
_align_series()
all()
any()
_append()
append()
apply()
applymap()
_arith_method()
asfreq()
_as_manager()
asof()
assign()
astype()
at pandas.core.indexing._AtIndexer
at_time()
_attrs dict
attrs dict
axes list
_AXIS_LEN int
_AXIS_ORDERS list
_AXIS_TO_AXIS_NUMBER dict
backfill()
between_time()
bfill()
bool()
_box_col_values()
boxplot()
_can_fast_transpose bool
_check_inplace_and_allows_duplicate_labels()
_check_inplace_setting()
_check_is_chained_assignment_possible()
_check_label_or_level_ambiguity()
_check_setitem_copy()
_clear_item_cache()
clip()
_clip_with_one_bound()
_clip_with_scalar()
_cmp_method()
columns pandas.core.indexes.base.Index
combine()
combine_first()
_combine_frame()
compare()
_consolidate()
_consolidate_inplace()
_construct_axes_dict()
_construct_axes_from_arguments()
_constructor type
_constructor_sliced type
_construct_result()
_convert()
convert_dtypes()
copy()
corr()
corrwith()
count()
_count_level()
cov()
cummax()
cummin()
cumprod()
cumsum()
_data pandas.core.internals.managers.BlockManager
describe() Returns a DataFrame with statistical data about the columns with numerical values. The row-indexes are count, mean, std, min, 25%, 50%, 75% and max. The column (names) of the returned DataFrame correspond to the DataFrame being described. Compare with info()`
diff()
_dir_additions()
_dir_deletions()
_dispatch_frame_op()
div()
divide()
dot()
drop()
_drop_axis()
drop_duplicates()
_drop_labels_or_levels()
droplevel()
dropna()
dtypes pandas.core.series.Series
duplicated()
empty bool
_ensure_valid_index()
eq()
equals()
eval()
ewm()
expanding()
explode()
ffill()
fillna()
filter()
_find_valid_index()
first()
first_valid_index()
_flags pandas.core.flags.Flags
flags pandas.core.flags.Flags
floordiv()
_from_arrays()
from_dict()
from_records()
ge()
get()
_get_agg_axis()
_get_axis()
_get_axis_name()
_get_axis_number()
_get_axis_resolvers()
_get_block_manager_axis()
_get_bool_data()
_get_cleaned_column_resolvers()
_get_column_array()
_get_index_resolvers()
_getitem_bool_array()
_get_item_cache()
_getitem_multilevel()
_get_label_or_level_values()
_get_numeric_data()
_get_value()
_gotitem()
groupby() Returns a pandas.core.groupby.generic.DataFrameGroupBy object. After grouping a data frame, statisical methods can be applied to each group. See also here here
gt()
_HANDLED_TYPES tuple
head() Returns a data frame with the first n rows of a data frame (similar in spirit to the shell command head). See also tail()
_hidden_attrs frozenset
hist()
iat pandas.core.indexing._iAtIndexer
id pandas.core.series.Series
idxmax()
idxmin()
iloc[] Returns the rows by numerical position. Compare with loc[]
index pandas.core.indexes.range.RangeIndex
_indexed_same()
infer_objects()
info() Some internal information about the data frame (number of columns, index type, memory usage etc.) Compare with describe()
_info_axis pandas.core.indexes.base.Index
_info_axis_name str
_info_axis_number int
_info_repr()
_init_mgr()
_inplace_method()
insert()
_internal_names list
_internal_names_set set
interpolate()
_is_copy NoneType
_iset_item()
isetitem()
_iset_item_mgr()
_iset_not_inplace()
_is_homogeneous_type bool
_is_label_or_level_reference()
_is_label_reference()
_is_level_reference()
_is_mixed_type bool
_is_view bool
_item_cache dict
items()
_iter_column_arrays()
iteritems()
iterrows()
itertuples()
_ixs()
join()
_join_compat()
keys()
kurt()
kurtosis()
last()
last_valid_index()
le()
loc[] Select rows by their index. Compare with iloc[].
_logical_func()
_logical_method()
lookup()
lt()
mad()
mask()
_maybe_cache_changed()
_maybe_update_cacher()
melt() The opposite is pivot()
memory_usage()
merge()
_metadata list
_mgr pandas.core.internals.managers.BlockManager
_min_count_stat_function()
mod()
mode()
mul()
multiply()
ndim int. Compare with shape and size
ne()
_needs_reindex_multi()
nlargest()
nsmallest()
nunique()
pad()
pct_change()
pipe()
pivot() The opposite is melt()
pivot_table()
plot() Plots (the Series objects in the DataFrame?) or the DataFrame itslf on the backend specified by the plotting.backend option (whose default is matplotlib).
pop()
pow()
prod()
product()
_protect_consolidate()
quantile()
query()
radd()
rank()
rdiv()
_reduce()
_reduce_axis1()
reindex()
_reindex_axes()
_reindex_columns()
_reindex_index()
reindex_like()
_reindex_multi()
_reindex_with_indexers()
_rename()
rename()
rename_axis()
reorder_levels()
replace()
_replace_columnwise()
_repr_data_resource_()
_repr_fits_horizontal_()
_repr_fits_vertical_()
resample()
_reset_cache()
_reset_cacher()
reset_index()
rfloordiv()
rmod()
rmul()
rolling()
round()
rpow()
rsub()
rtruediv()
sample()
_sanitize_column()
select_dtypes()
sem()
_series dict
_set_axis()
set_axis()
_set_axis_name()
_set_axis_nocheck()
set_flags()
set_index()
_set_is_copy()
_set_item()
_setitem_array()
_setitem_frame()
_set_item_frame_value()
_set_item_mgr()
_setitem_slice()
_set_value()
shape A tuple which describes the data frame's dimensionality. Comapre with size and ndim
shift()
size numpy.int64. Compare with shape and ndim.
skew()
_slice()
slice_shift()
sort_index()
sort_values()
squeeze()
stack()
_stat_axis pandas.core.indexes.range.RangeIndex
_stat_axis_name str
_stat_axis_number int
_stat_function()
_stat_function_ddof()
std()
style Returns a Styler object (which has a _repr_html_() method, which makes it poosible for the data frame to be rendered in a Jupyter Notebook)
sub()
subtract()
swapaxes()
swaplevel()
T pandas.core.frame.DataFrame
tail() Returns a data frame with the last n rows of a data frame (similar in spirit to the shell command tail). See also head()
_take()
take()
_take_with_is_copy()
_to_dict_of_blocks()
to_timestamp()
to_xarray()
transform()
transpose()
truediv()
truncate()
txt pandas.core.series.Series
_typ str
tz_convert()
tz_localize()
unstack()
update()
_update_inplace()
_validate_dtype()
value_counts()
_values numpy.ndarray
values numpy.ndarray
var()
_where()
where()
xs()

Index