Understanding Kepler Data
Project: Applying DL methods in exoplanet search
- Technical Preliminaries
- INTRODUCTION
- Examining the Data
- TFRecord View - Downloaded preprocessed data
# Enable immediate usage of Python script files
%reload_ext autoreload
%autoreload 2
# Plot figures inline
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import sys
#sys.path.append('/Users/itacdonev/Documents/PROJECTS/AstroAI')
#sys.path.append('/Users/itacdonev/Documents/PROJECTS/AstroAI/code')
sys.path
sys.path.append('/Users/itacdonev/Documents/PROJECTS/AstroAI/astroai')
from external.light_curve import binning
# IMORT PACKAGES
#------------------------------
import gc
import pandas as pd
import numpy as np
import random
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
from termcolor import colored, cprint
from astropy.io import fits # Import fits files
from astropy.table import Table # Converting to tidy data tables
import tensorflow as tf
tf.enable_eager_execution()
#------------------------------
# Source: Ita Cirovic Donev
import utils
# Source: Chris Shaulle
from external.light_curve import kepler_io
# Define the main data path
DATA_PATH = '../data/'
What to Download?
Given that we want to build a model using the supervised learning algorithm we need a set of labeled data. To detect the potential planet we will be using the transit method. From previous notebook we understood what the transit method is. Hence, our training data is based on the light curves of a star and its labeled class.
Therefore, we need to acquire two sets of data: labeled data and the light curves (time series).
- The labeled TCE data, namely the DR24 TCE table we will obtain from the (NASA Exoplanet Archive)
- Light curves of the stars corresponding to the DR24 table we can obtained from the (Mikulski Archive)
All the data is saved in the folder ../data/
.
DR24 TCE data
The complete description of available data columns can be found in Data Columns in the Kepler TCE Table. From all the available columns, we will download the following:
-
kepid
- Target identification number, as listed in the Kepler Input Catalog (KIC). The KIC was derived from a ground-based imaging survey of the Kepler field conducted prior to launch. The survey's purpose was to identify stars for the Kepler exoplanet survey by magnitude and color. The full catalog of 13 million sources can be searched at the MAST archive. The subset of 4 million targets found upon the Kepler CCDs can be searched via the Kepler Target Search form. The Kepler ID is unique to a target and there is only one Kepler ID per target. -
tce_plnt_num
- Planet number -
tce_period
- Orbital period (days): The interval between consecutive planetary transits. -
tce_time0bk
- Transit Epoch (BJD - 2,454,833.0): The time corresponding to the center of the first detected transit in Barycentric Julian Day (BJD) minus a constant offset of 2,454,833.0 days. The offset corresponds to 12:00 on Jan 1, 2009 UTC. -
tce_duration
- Transit Duration (hrs): The duration of the observed transits. Duration is measured from first contact between the planet and star until last contact. Contact times are typically computed from a best-fit model produced by a Mandel-Agol (2002) model fit to a multi-quarter Kepler light curve, assuming a linear orbital ephemeris. -
av_training_set
- Autovetter Training Set Label: If the TCE was included in the training set, the training label encodes what is believed to be the "true" classification, and takes a value of either PC, AFP or NTP. The TCEs in the UNKNOWN class sample are marked UNK. Training labels are given a value of NULL for TCEs not included in the training set. For more detail about how the training set is constructed, see Autovetter Planet Candidate Catalog for Q1-Q17 Data Release 24 (KSCI-19091).-
PC
: planet candidate -
AFP
: astrophysical false positive -
NTP
: non-transiting phenomenon -
UNK
: unknown
-
-
av_pred_class
- Autovetter Predicted Classification: Predicted classifications, which are the "optimum MAP classifications." Values are either PC, AFP, or NTP.
There are two options to download the data: manually or using the API. For manual download go to NASA Exoplanet Archive and select the appropriate columns. If you would like to use the API just run the following in your terminal:
python -m data.get_tce_data --output_file=dr24_tce_labels.csv
./get_tce.sh
where for the argument output_file
enter the desired file name. The default location for the saved file is ../data/
. Note that this is run from the main root of the project.
Light curves for DR24 TCE data
The Mikulski archive contains all the Kepler mission data, which as you can image is huge. Since we only have labels for a subset of the dataset, we will just download that specific subset, i.e. the DR24. The process of downloading the is closely followed with the code by Shaulle. We will use the light_curve
package and its corresponding py scripts. The light curves are downloaded using the generate_download_script.py
from the light_curve
package by Schaulle using the get_data.sh
by running the following commands:
sh get_data.sh
./get_kepler.sh
The data of 90GB is downloaded in the ../data/kepler/
folder. Note that the download takes several days.
# List all files and folder in DATA_PATH
utils.get_files(DATA_PATH)
# Read the data for the target labels
df_y = pd.read_csv(f'{DATA_PATH}dr24_tce_labels.csv')
print(colored(f'Number of IDs: {df_y.shape[0]}','blue'))
df_y.head()
The column av_training_set
provides the target labels. Let's see how many labels we have:
df_y['av_training_set'].value_counts()
There are 3600 planet candidates in the training set. We can create a binary target with 1 only for PC
target label and 0 for the remaining.
# Create binary target
df_y['target'] = np.where(df_y['av_training_set'] == 'PC', 1, 0)
In every classification problem we need to check the level of imbalance of the dataset, i.e. the proportion of 1s to the whole dataset. In our example this is equal to:
utils.check_target(df_y, 'target')
Now, let's see the Kepler data on light curves.
# Enter which Kepler ID to plot
KEPLER_ID = 11442793 #Kepler-90
KEPLER_DATA_DIR = f'{DATA_PATH}kepler/'
LABEL = df_y[df_y.kepid == KEPLER_ID]['av_training_set'].iloc[0,] # Extract target label
# Default ploting style
plt.style.use('ggplot')
# Get file names for the KEPLER_ID
file_names = kepler_io.kepler_filenames(KEPLER_DATA_DIR, KEPLER_ID)
assert file_names, f'Failed to find .fits files in {KEPLER_DATA_DIR}'
file_names
The file names have the following format (taken from the light_curve.kepler_io.py
):
${kep_id:0:4}/${kep_id}/kplr${kep_id}-${quarter_prefix}_${type}.fits
where:
-
kep_id
is the Kepler id left-padded with zeros to length 9; -
quarter_prefix
is the filename quarter prefix; -
type
is one of "llc" (long cadence light curve) or "slc" (short cadence light curve).
Before plotting the light curves let's examine the *.fits
files. To read these files we will need the astropy
package. Taking just one *.fits
file from file_names
and extracting the information contained in the file.
# Take the first fits file from file_names
f = file_names[0]
fits.info(f)
As we can see there are 3 main HUDs or header data units:
- No. 0 (Primary): This HDU (Header Data Units) contains meta-data related to the entire file.
- No. 1 (Light curve): This HDU contains a binary table that holds data like flux measurements and times. We will extract information from here when we define the parameters for the light curve plot.
- No. 2 (Aperture): This HDU contains the image extension with data collected from the aperture. We will also use this to display a bitmask plot that visually represents the optimal aperture used to create the SAP_FLUX column in HDU1.
For more detailed description of FITS headers please refer to Section 2.1.3. in the Kepler Archive Manual.
Reference: MAST Notebook examples
Now, let's see how to extract information from each specific HDU.
with fits.open(file_names[0]) as hdulist:
HDU_LC = hdulist['LIGHTCURVE'].header
print(f'Number of columns: {len(HDU_LC)}')
print('-'*30)
print(repr(HDU_LC[:25])) # Show first 25 column information
We can also see the actual values of each column:
with fits.open(f) as hdulist:
LC_table = hdulist['LIGHTCURVE'].data
LC_table = Table(LC_table)
LC_table[:5] # Show first 5 rows
Consulting the Kepler Archive Manual (Section 2.3.1.) we need information on the following columns:
-
TIME = The time at the mid-point of the cadence in BKJD. Convert to BJD using the following formula
BJD_i = TIME_i + BJDREFI + BJDREFF
where BJDREFI and BJDREFF are keywords in the header. -
SAP_FLUX = The flux in units of electrons per second contained in the optimal aperture pixels collected by the spacecraft. This light curve is the output of the PA module in the SOC pipeline.
-
PDCSAP_FLUX = The flux contained in the optimal aperture in electrons per second after the PDC module has applied its cotrending algorithm to the PA light curve. To better understand how PDC manipulated the light curve, read Section 2.3.1.2 and see the PDCSAPFL keyword in the header.
Each of the .fits files contains data per specific quarter. We need to combine all from each file to plot the complete time series. So lets define flux
and time
to be the two variables where we will append timestamps and the corresponding flux or the brigthness of the star. Then going through all the file_names
using the astropy
package we extract data for time and the flux. Note that we will take the column PDCSAP_FLUX
. We will get an array with length equal to the number of .fits files, i.e. 14.
flux = []
time = []
# Extract data for time and flux
for f in file_names:
with fits.open(f) as hdulist:
flux.append(hdulist['LIGHTCURVE'].data['PDCSAP_FLUX'])
time.append(hdulist['LIGHTCURVE'].data['TIME'])
# Print number of observations for each quarter period
for i in range(len(flux)):
print(f'N = {len(flux[i])}')
Let's see what we have in the flux
and time
objects:
nan_flux = 0
for i in flux:
# Sum the number of non-finite values from all quarter periods
nan_flux += (np.isnan(i)).sum()
print(f'There are {nan_flux} non-finite values in the timeseries.')
As we can see there are nan
values, which we need to clean up before any further analysis. We will use the numpy
method logical_and
which computes the truth value of x1 AND x2 element-wise, where in our case x1 is time
and x2 is flux
. As the result we will obtain an array with values [True, False, ... , True]
depending on whether both the time
and flux
for that particular value in the array is a finite value, with True
for both finite and False
otherwise. Then we will only select the values of the original time series where we obtained True
value from the logical check. Hence, we will shorten the time series, but in essence since there were nan
values, we didn't have them in the first place. So all is good!
# Remove non-finite values from the time series array
for i, (t, f) in enumerate(zip(time, flux)):
t_f_finite = np.logical_and(np.isfinite(t), np.isfinite(f))
time[i] = t[t_f_finite]
flux[i] = f[t_f_finite]
# Print number of observations for each quarter period
print(f'N = {len(flux[i])}')
Now that we have the full dataset, i.e. without any non-finite values, let's plot the time series for the selected Kepler ID. For easier readjustments of the figures, let's define some plotting arguments upfront, like color and opacity.
color = '#371F72'
alpha = 0.8
plt.figure(figsize=(15,6))
plt.plot(np.concatenate(time), np.concatenate(flux), '.', c = color, alpha = alpha)
plt.xlabel('Time (days)')
plt.ylabel('Brigthness')
plt.show()
We can see that the time series for each quarter period is on different scale. (REVIEW) The differences arise due to the repositioning of the Kepler telescope for the purposes of optimizing the solar panels and accumulating solar energy. Since we want a one time series we need to rescale all the quarter periods. We will achieve this by dividing the time series with the median of each quarter period. First let's try to plot just one quarter period to see the change in brigthness more clearly.
# Divide values in each quarter period by that period's median value
for f in flux:
f /= np.median(f)
Now let's plot the complete time series again. As we can see the time series is centered at 1 and much easier to read.. Notice the two planets in the figure?
plt.figure(figsize=(15,6))
plt.plot(np.concatenate(time), np.concatenate(flux), '.', c = color, alpha = alpha)
plt.xlabel('Time (days)')
plt.ylabel('Brigthness')
plt.show()
Going back to the quarter periods, let's choose one and plot only that time period for more detailed viewing capabilities. To illustrate the structure of light curves better we choose the third quarter period where a clear decrease in brigthness ca be spotted.
q = 3 # Select the quarter period to plot
plt.figure(figsize=(15,6))
plt.plot(time[q], flux[q], '.', c = color, alpha = alpha)
plt.xlabel('Time (days)')
plt.ylabel('Brigthness')
plt.show()
all_time, all_flux = utils.get_light_curves(KEPLER_DATA_DIR, KEPLER_ID)
utils.plot_light_curve(all_time, all_flux, KEPLER_ID, LABEL, color, alpha, quarter = 3)
utils.plot_light_curve(all_time, all_flux, KEPLER_ID, LABEL, color, alpha)
Combining the data extraction and ploting into one function view_kepler
we can use the following one line to get the light curves:
utils.view_kepler(KEPLER_DATA_DIR, KEPLER_ID, LABEL, color, alpha)
Now let's radomly view the Kepler data by chosing, at random, 5 Kepler IDs for which to plot the time series. We will choose the index value of the length of the data set i.e. from the total number of data entries in the TCE labels dataset. Then using the choosen index we extract the ID and the label from kepid
and av_training_set
columns respectively.
n = 5 # Number of Kepler ID to plot
# Randomly choose n number of IDs
list = range(len(df_y['kepid']))
rnd_ids = random.choices(list, k = n)
# Plot light curves for each Kepler ID
for i in rnd_ids:
kep_id = df_y.loc[i]['kepid']
label = df_y.loc[i]['av_training_set']
utils.view_kepler(KEPLER_DATA_DIR, kep_id, label, color, alpha)
Similarly, we can also explore the APERTURE
HDU. The process of reading the data is the same as for the light curves we just select the HDU as follows:
with fits.open(file_names[0]) as hdulist:
HDU_APT = hdulist['APERTURE'].header
print(f'Number of columns: {len(HDU_APT)}')
print('-'*30)
print(repr(HDU_APT)) # Show first 25 column information
# Aperture data
with fits.open(file_names[3]) as hdulist:
apt_data = hdulist['APERTURE'].data
print(apt_data)
nrow = 4
ncol = 4
#fig, ax = plt.subplots(nrows = nrow, ncols = ncol, figsize = (3,3))
fig = plt.figure(figsize = (20,20))
#fig.subplots_adjust(hspace = 0.05)
fig.suptitle(f'Kepler Apertures for Kepler ID = {KEPLER_ID}')
for i in range(len(file_names)):
with fits.open(file_names[i]) as hdulist:
apt_data = hdulist['APERTURE'].data
# Plot each image
#plt.subplot(2,7, sharex = 'col', sharey = 'row')
ax = fig.add_subplot(nrow, ncol, (i+1))
img = ax.imshow(apt_data, cmap = plt.cm.YlGnBu_r)
plt.title(f'Quarter period:')
fig.colorbar(img, ax = ax)
plt.show()
import os
tf_path = os.path.join(DATA_PATH, 'tfrecord')
tf_path = [os.path.join(tf_path,f,'.tfrecord') for f in os.listdir(tf_path)]
tf_path
raw_data = tf.data.TFRecordDataset('../data/tfrecord/train-00007-of-00008.tfrecord')
raw_data
KEPLER_ID = 11442793 # Kepler-90
TFRECORD_DIR = "../data/tfrecord/"
def find_tce(kepid, tce_plnt_num, filenames):
for filename in filenames:
for record in tf.python_io.tf_record_iterator(filename):
ex = tf.train.Example.FromString(record)
if (ex.features.feature["kepid"].int64_list.value[0] == kepid and
ex.features.feature["tce_plnt_num"].int64_list.value[0] == tce_plnt_num):
print("Found {}_{} in file {}".format(kepid, tce_plnt_num, filename))
return ex
raise ValueError("{}_{} not found in files: {}".format(kepid, tce_plnt_num, filenames))
filenames = tf.gfile.Glob(os.path.join(TFRECORD_DIR, "*"))
assert filenames, "No files found in {}".format(TFRECORD_DIR)
ex = find_tce(KEPLER_ID, 1, filenames)
ex.features.feature['kepid']
global_view = np.array(ex.features.feature["global_view"].float_list.value)
local_view = np.array(ex.features.feature["local_view"].float_list.value)
fig, axes = plt.subplots(1, 2, figsize=(20, 6))
axes[0].plot(global_view, ".")
axes[1].plot(local_view, ".")
plt.show()