Technical Preliminaries

# Enable immediate usage of Python script files
%reload_ext autoreload
%autoreload 2

# Plot figures inline
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import sys
sys.path
['/Users/itacdonev/Documents/PROJECTS/AstroAI/astroai/notebooks',
 '/Users/itacdonev/anaconda3/envs/astro/lib/python37.zip',
 '/Users/itacdonev/anaconda3/envs/astro/lib/python3.7',
 '/Users/itacdonev/anaconda3/envs/astro/lib/python3.7/lib-dynload',
 '',
 '/Users/itacdonev/.local/lib/python3.7/site-packages',
 '/Users/itacdonev/anaconda3/envs/astro/lib/python3.7/site-packages',
 '/Users/itacdonev/anaconda3/envs/astro/lib/python3.7/site-packages/IPython/extensions',
 '/Users/itacdonev/.ipython']
sys.path.append('/Users/itacdonev/Documents/PROJECTS/AstroAI/astroai')
from external.light_curve import binning
# IMPORT PACKAGES
#------------------------------
import gc

import pandas as pd
import numpy as np 
import random

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

from termcolor import colored, cprint

from astropy.io import fits # Import fits files
from astropy.table import Table # Converting to tidy data tables

import tensorflow as tf
tf.enable_eager_execution()
#------------------------------
# Source: Ita Cirovic Donev
import utils

# Source: Chris Shallue
from external.light_curve import kepler_io
# Define the main data path
DATA_PATH = '../data/'

Introduction

In this notebook we will go over the Kepler mission data, which we will later structure and use to train ML and DL models. Goals of this notebook:

  • what to download
  • how to download
  • understanding what is in the data
  • plotting the data

What to Download?

Since we want to build a model using a supervised learning algorithm, we need a set of labeled data. To detect potential planets we will use the transit method, which we covered in the previous notebook. Hence, our training data consists of the light curves of stars together with their class labels.

Therefore, we need to acquire two sets of data: labeled data and the light curves (time series).

  1. The labeled TCE data, namely the DR24 TCE table, which we will obtain from the NASA Exoplanet Archive
  2. Light curves of the stars corresponding to the DR24 table, which can be obtained from the Mikulski Archive (MAST)

All the data is saved in the folder ../data/.

DR24 TCE data

The complete description of available data columns can be found in Data Columns in the Kepler TCE Table. From all the available columns, we will download the following:

  • kepid - Target identification number, as listed in the Kepler Input Catalog (KIC). The KIC was derived from a ground-based imaging survey of the Kepler field conducted prior to launch. The survey's purpose was to identify stars for the Kepler exoplanet survey by magnitude and color. The full catalog of 13 million sources can be searched at the MAST archive. The subset of 4 million targets found upon the Kepler CCDs can be searched via the Kepler Target Search form. The Kepler ID is unique to a target and there is only one Kepler ID per target.
  • tce_plnt_num - Planet number
  • tce_period - Orbital period (days): The interval between consecutive planetary transits.
  • tce_time0bk - Transit Epoch (BJD - 2,454,833.0): The time corresponding to the center of the first detected transit in Barycentric Julian Day (BJD) minus a constant offset of 2,454,833.0 days. The offset corresponds to 12:00 on Jan 1, 2009 UTC.
  • tce_duration - Transit Duration (hrs): The duration of the observed transits. Duration is measured from first contact between the planet and star until last contact. Contact times are typically computed from a best-fit model produced by a Mandel-Agol (2002) model fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris.
  • av_training_set - Autovetter Training Set Label: If the TCE was included in the training set, the training label encodes what is believed to be the "true" classification, and takes a value of either PC, AFP or NTP. The TCEs in the UNKNOWN class sample are marked UNK. Training labels are given a value of NULL for TCEs not included in the training set. For more detail about how the training set is constructed, see Autovetter Planet Candidate Catalog for Q1-Q17 Data Release 24 (KSCI-19091).
    • PC: planet candidate
    • AFP: astrophysical false positive
    • NTP: non-transiting phenomenon
    • UNK: unknown
  • av_pred_class - Autovetter Predicted Classification: Predicted classifications, which are the "optimum MAP classifications." Values are either PC, AFP, or NTP.

There are two options to download the data: manually or using the API. For a manual download, go to the NASA Exoplanet Archive and select the appropriate columns. If you would like to use the API, run the following in your terminal:

python -m data.get_tce_data --output_file=dr24_tce_labels.csv
./get_tce.sh

where the output_file argument specifies the desired file name. The default location for the saved file is ../data/. Note that this is run from the project root.
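For reference, the same table can also be fetched directly over HTTP. Below is a minimal sketch; the table name q1_q17_dr24_tce and the column list are assumptions based on the Exoplanet Archive API documentation, so verify them before relying on this:

# Hedged sketch: fetch the DR24 TCE table via the Exoplanet Archive API.
# Table and column names are assumptions; check the API docs before use.
import requests

url = ('https://exoplanetarchive.ipac.caltech.edu/cgi-bin/nstedAPI/nph-nstedAPI'
       '?table=q1_q17_dr24_tce'
       '&select=kepid,tce_plnt_num,tce_period,tce_time0bk,tce_duration,'
       'av_training_set,av_pred_class'
       '&format=csv')
with open('../data/dr24_tce_labels.csv', 'wb') as fh:
    fh.write(requests.get(url).content)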

Light curves for DR24 TCE data

The Mikulski archive contains all of the Kepler mission data, which, as you can imagine, is huge. Since we only have labels for a subset of the data, we will download just that subset, i.e. the light curves corresponding to the DR24 table. The download process closely follows the code by Shallue: we use the light_curve package and its corresponding py scripts. The download script is generated by generate_download_script.py from the light_curve package and executed by running the following commands:

sh get_data.sh
./get_kepler.sh

About 90 GB of data is downloaded into the ../data/kepler/ folder. Note that the download can take several days.

Examining the Data

Let's see what we have downloaded.

# List all files and folders in DATA_PATH
utils.get_files(DATA_PATH)
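utils.get_files is a small project helper; its actual implementation may differ, but a minimal sketch of such a directory listing could look like:

# Hypothetical sketch of a recursive directory listing (the real
# utils.get_files may differ).
import os

def list_files(path):
    for root, _, files in os.walk(path):
        print(root, '->', sorted(files)[:5])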

Target Labels

# Read the data for the target labels
df_y = pd.read_csv(f'{DATA_PATH}dr24_tce_labels.csv')
print(colored(f'Number of IDs: {df_y.shape[0]}','blue'))
df_y.head()

The column av_training_set provides the target labels. Let's see how many labels we have:

df_y['av_training_set'].value_counts()

There are 3600 planet candidates in the training set. We can create a binary target that is 1 for the PC label and 0 for the rest.

# Create binary target
df_y['target'] = np.where(df_y['av_training_set'] == 'PC', 1, 0)

In every classification problem we need to check the level of imbalance of the dataset, i.e. the proportion of 1s in the whole dataset. In our example this is equal to:

utils.check_target(df_y, 'target')
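utils.check_target is another project helper; a minimal sketch of such an imbalance check (hypothetical implementation, the real helper may differ) could be:

# Hypothetical sketch of a class-imbalance check.
def check_target(df, target_col):
    n_pos = int(df[target_col].sum())
    print(f'{n_pos} positives out of {len(df)} rows '
          f'({n_pos / len(df):.2%} of the dataset)')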

Light Curves

Now, let's see the Kepler data on light curves.

# Enter which Kepler ID to plot
KEPLER_ID = 11442793 #Kepler-90

KEPLER_DATA_DIR = f'{DATA_PATH}kepler/'
LABEL = df_y[df_y.kepid == KEPLER_ID]['av_training_set'].iloc[0] # Extract the target label

# Default plotting style
plt.style.use('ggplot')
# Get file names for the KEPLER_ID
file_names = kepler_io.kepler_filenames(KEPLER_DATA_DIR, KEPLER_ID)
assert file_names, f'Failed to find .fits files in {KEPLER_DATA_DIR}'
file_names

The file names have the following format (taken from light_curve/kepler_io.py):

${kep_id:0:4}/${kep_id}/kplr${kep_id}-${quarter_prefix}_${type}.fits

where:

  • kep_id is the Kepler id left-padded with zeros to length 9;
  • quarter_prefix is the filename quarter prefix;
  • type is one of "llc" (long cadence light curve) or "slc" (short cadence light curve).
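For example, for Kepler-90 (KIC 11442793) the path is constructed as follows; the quarter prefix below is just an illustrative value, the real prefixes are listed in kepler_io.py:

# Illustrative path construction; the quarter prefix is an example value.
kep_id = f'{11442793:09d}'        # left-pad to 9 digits -> '011442793'
quarter_prefix = '2009166043257'  # example long-cadence quarter prefix
print(f'{kep_id[0:4]}/{kep_id}/kplr{kep_id}-{quarter_prefix}_llc.fits')
# -> 0114/011442793/kplr011442793-2009166043257_llc.fits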

Before plotting the light curves, let's examine the *.fits files. To read these files we need the astropy package. We take just one *.fits file from file_names and extract the information contained in it.

# Take the first fits file from file_names
f = file_names[0]
fits.info(f)

As we can see there are three HDUs (Header Data Units):

  • No. 0 (Primary): This HDU contains metadata related to the entire file.
  • No. 1 (Light curve): This HDU contains a binary table that holds data such as flux measurements and times. We will extract information from here when we define the parameters for the light curve plot.
  • No. 2 (Aperture): This HDU contains the image extension with data collected from the aperture. We will also use this to display a bitmask plot that visually represents the optimal aperture used to create the SAP_FLUX column in HDU 1.

For a more detailed description of FITS headers please refer to Section 2.1.3 of the Kepler Archive Manual.
Reference: MAST Notebook examples
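As a quick sanity check we can peek at a few keywords in the primary HDU; keyword names such as OBJECT, KEPLERID and QUARTER follow the Kepler Archive Manual:

# Peek at file-level metadata in HDU 0 (keyword names per the Kepler
# Archive Manual; .get() returns None if a keyword is absent).
with fits.open(file_names[0]) as hdulist:
    primary = hdulist[0].header
    print(primary.get('OBJECT'), primary.get('KEPLERID'), primary.get('QUARTER'))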

Now, let's see how to extract information from each specific HDU.

with fits.open(file_names[0]) as hdulist:
    HDU_LC = hdulist['LIGHTCURVE'].header

print(f'Number of header cards: {len(HDU_LC)}')
print('-'*30)
print(repr(HDU_LC[:25])) # Show the first 25 header cards

We can also see the actual values of each column:

with fits.open(f) as hdulist:
    LC_table = hdulist['LIGHTCURVE'].data
    
LC_table = Table(LC_table)
LC_table[:5] # Show first 5 rows

Consulting the Kepler Archive Manual (Section 2.3.1.) we need information on the following columns:

  • TIME = The time at the mid-point of the cadence in BKJD. Convert to BJD using the formula BJD_i = TIME_i + BJDREFI + BJDREFF, where BJDREFI and BJDREFF are keywords in the header (see the snippet after this list).

  • SAP_FLUX = The flux in units of electrons per second contained in the optimal aperture pixels collected by the spacecraft. This light curve is the output of the PA module in the SOC pipeline.

  • PDCSAP_FLUX = The flux contained in the optimal aperture in electrons per second after the PDC module has applied its cotrending algorithm to the PA light curve. To better understand how PDC manipulated the light curve, read Section 2.3.1.2 and see the PDCSAPFL keyword in the header.
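The BKJD-to-BJD conversion mentioned above can be done directly with the header keywords:

# Convert BKJD timestamps to BJD using the BJDREFI/BJDREFF header keywords.
with fits.open(file_names[0]) as hdulist:
    lc_hdr = hdulist['LIGHTCURVE'].header
    bjd = hdulist['LIGHTCURVE'].data['TIME'] + lc_hdr['BJDREFI'] + lc_hdr['BJDREFF']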

Each of the .fits files contains data for a specific quarter. We need to combine the data from all files to plot the complete time series. So let's define flux and time as the two lists to which we will append the timestamps and the corresponding flux, i.e. the brightness of the star. Then, going through all the file_names, we use the astropy package to extract the time and flux data. Note that we will take the column PDCSAP_FLUX. We will end up with lists whose length equals the number of .fits files, i.e. 14.

flux = []
time = []

# Extract data for time and flux
for f in file_names:
    with fits.open(f) as hdulist:
        flux.append(hdulist['LIGHTCURVE'].data['PDCSAP_FLUX'])
        time.append(hdulist['LIGHTCURVE'].data['TIME'])
        
# Print number of observations for each quarter period
for i in range(len(flux)):
    print(f'N = {len(flux[i])}')

Let's see what we have in the flux and time objects:

nan_flux = 0
for i in flux:
    # Count the NaN values across all quarter periods
    nan_flux += (np.isnan(i)).sum()
print(f'There are {nan_flux} NaN values in the time series.')

As we can see there are NaN values, which we need to clean up before any further analysis. We will use the numpy function logical_and, which computes the truth value of x1 AND x2 element-wise; in our case x1 is the finiteness mask of time and x2 is the finiteness mask of flux. The result is an array like [True, False, ..., True] that is True wherever both the time and the flux are finite and False otherwise. We then keep only the positions of the original series where the logical check returned True. This shortens the series, but since those positions held NaN values, no observations are actually lost.
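As a toy illustration of the masking:

# Toy example: keep only positions where both arrays are finite.
t = np.array([0., 1., 2., 3.])
f = np.array([10., np.nan, 12., 13.])
mask = np.logical_and(np.isfinite(t), np.isfinite(f))
print(t[mask], f[mask])  # [0. 2. 3.] [10. 12. 13.]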

# Remove non-finite values from the time series array
for i, (t, f) in enumerate(zip(time, flux)):
    t_f_finite = np.logical_and(np.isfinite(t), np.isfinite(f))
    time[i] = t[t_f_finite]
    flux[i] = f[t_f_finite]
    
    # Print number of observations for each quarter period
    print(f'N = {len(flux[i])}')

Now that we have the full dataset, i.e. without any non-finite values, let's plot the time series for the selected Kepler ID. To make the figures easy to adjust, let's define some plotting arguments upfront, like color and opacity.

color = '#371F72'
alpha = 0.8
plt.figure(figsize=(15,6))
plt.plot(np.concatenate(time), np.concatenate(flux), '.', c = color, alpha = alpha)
plt.xlabel('Time (days)')
plt.ylabel('Brightness')
plt.show()

We can see that the time series for each quarter period is on a different scale. The differences arise because the Kepler telescope rolls every quarter to keep its solar panels facing the Sun, so each star is observed by a different CCD with a slightly different sensitivity. Since we want one continuous time series, we need to rescale all the quarter periods. We achieve this by dividing the flux in each quarter period by that period's median.

# Divide values in each quarter period by that period's median value
for f in flux:
    f /= np.median(f)

Now let's plot the complete time series again. As we can see, the series is centered at 1 and much easier to read. Notice the two planets in the figure?

plt.figure(figsize=(15,6))
plt.plot(np.concatenate(time), np.concatenate(flux), '.', c = color, alpha = alpha)
plt.xlabel('Time (days)')
plt.ylabel('Brightness')
plt.show()

Going back to the quarter periods, let's choose one and plot only that period for a more detailed view. To better illustrate the structure of light curves we choose the third quarter period, where a clear decrease in brightness can be spotted.

q = 3 # Select the quarter period to plot
plt.figure(figsize=(15,6))
plt.plot(time[q], flux[q], '.', c = color, alpha = alpha)
plt.xlabel('Time (days)')
plt.ylabel('Brightness')
plt.show()

Plotting light curves (the shorter way)

Let's repeat the extraction and plotting for our Kepler ID, this time using the helper functions get_light_curves and plot_light_curve from utils.

all_time, all_flux = utils.get_light_curves(KEPLER_DATA_DIR, KEPLER_ID)
utils.plot_light_curve(all_time, all_flux, KEPLER_ID, LABEL, color, alpha, quarter = 3)
utils.plot_light_curve(all_time, all_flux, KEPLER_ID, LABEL, color, alpha)

Combining the data extraction and plotting into one function, view_kepler, we can use a single line to get the light curves:

utils.view_kepler(KEPLER_DATA_DIR, KEPLER_ID, LABEL, color, alpha)

Now let's view the Kepler data by choosing 5 Kepler IDs at random. We sample row indices from the range of the TCE labels dataset, i.e. from the total number of entries, and then, for each chosen index, extract the ID and the label from the kepid and av_training_set columns respectively.

n = 5 # Number of Kepler IDs to plot
# Randomly choose n row indices
row_indices = range(len(df_y['kepid']))
rnd_ids = random.choices(row_indices, k = n)

# Plot light curves for each Kepler ID
for i in rnd_ids:
    kep_id = df_y.loc[i]['kepid']
    label = df_y.loc[i]['av_training_set']
    
    utils.view_kepler(KEPLER_DATA_DIR, kep_id, label, color, alpha)

Aperture Data

Similarly, we can also explore the APERTURE HDU. The process of reading the data is the same as for the light curves; we just select the HDU as follows:

with fits.open(file_names[0]) as hdulist:
    HDU_APT = hdulist['APERTURE'].header

print(f'Number of header cards: {len(HDU_APT)}')
print('-'*30)
print(repr(HDU_APT)) # Show all header cards
# Aperture data
with fits.open(file_names[3]) as hdulist:
    apt_data = hdulist['APERTURE'].data
print(apt_data)
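The aperture image is a bitmask. Per the Kepler Archive Manual, bit 1 (value 1) flags pixels that were collected by the spacecraft and bit 2 (value 2) flags pixels in the optimal aperture; as a quick check:

# Count pixels in the optimal aperture (bit 2 of the bitmask, per the
# Kepler Archive Manual).
in_optimal = (apt_data & 2).astype(bool)
print(f'{in_optimal.sum()} of {apt_data.size} pixels are in the optimal aperture')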
nrow = 4
ncol = 4
fig = plt.figure(figsize = (20,20))
fig.suptitle(f'Kepler Apertures for Kepler ID = {KEPLER_ID}')

for i in range(len(file_names)):
    with fits.open(file_names[i]) as hdulist:
        # The QUARTER keyword in the primary header gives the quarter number
        quarter = hdulist[0].header.get('QUARTER', i)
        apt_data = hdulist['APERTURE'].data

    # Plot the aperture bitmask image for each file
    ax = fig.add_subplot(nrow, ncol, i + 1)
    img = ax.imshow(apt_data, cmap = plt.cm.YlGnBu_r)
    ax.set_title(f'Quarter period: {quarter}')
    fig.colorbar(img, ax = ax)
plt.show()

TFRecord View - Downloaded preprocessed data

import os
tf_path = os.path.join(DATA_PATH, 'tfrecord')
tf_path = [os.path.join(tf_path, f) for f in os.listdir(tf_path)] # File names already carry the .tfrecord extension
tf_path
raw_data = tf.data.TFRecordDataset('../data/tfrecord/train-00007-of-00008.tfrecord')
raw_data
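Since eager execution is enabled, we can peek directly at one serialized record; a minimal sketch:

# Inspect the feature names of the first record (eager execution lets us
# iterate the dataset directly).
for raw_record in raw_data.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    print(sorted(example.features.feature.keys()))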
KEPLER_ID = 11442793  # Kepler-90
TFRECORD_DIR = "../data/tfrecord/"
# Find the serialized tf.train.Example for a given (kepid, planet number) pair
def find_tce(kepid, tce_plnt_num, filenames):
    for filename in filenames:
        for record in tf.python_io.tf_record_iterator(filename):
            ex = tf.train.Example.FromString(record)
            if (ex.features.feature["kepid"].int64_list.value[0] == kepid and
                ex.features.feature["tce_plnt_num"].int64_list.value[0] == tce_plnt_num):
                print("Found {}_{} in file {}".format(kepid, tce_plnt_num, filename))
                return ex
    raise ValueError("{}_{} not found in files: {}".format(kepid, tce_plnt_num, filenames))
filenames = tf.gfile.Glob(os.path.join(TFRECORD_DIR, "*"))
assert filenames, "No files found in {}".format(TFRECORD_DIR)
ex = find_tce(KEPLER_ID, 1, filenames)
ex.features.feature['kepid']
global_view = np.array(ex.features.feature["global_view"].float_list.value)
local_view = np.array(ex.features.feature["local_view"].float_list.value)
fig, axes = plt.subplots(1, 2, figsize=(20, 6))
axes[0].plot(global_view, ".")
axes[0].set_title("Global view")
axes[1].plot(local_view, ".")
axes[1].set_title("Local view")
plt.show()