Financial Data Project #1: Yield Curves Modeling and Correlation Structures in Financial Markets¶

In fixed income and equity markets, understanding how interest rates evolve across maturities and how assest returns move together is essential for tisk management, pricing, and portfolio diversification. This project is part of the MSc. in Financial Engineering program at WorldQuant University. Aimed to explores three fundamental quantitative finance problems:

  1. How can we model the term structure of interest rates?
  2. How are movements across maturities correlated?
  3. Can similar factor structures explain stock market co-movements?

Together, these form an end-to-end workflow connecting fixed-income theory with empirical equity analysis.

Methodology¶

Data sources:

  • Vietnam HNX Treasury Bonds - Spot rate data (11 maturities, 3M-20Y, Oct 2025)
  • Reserve Bank of Australia (RBA) - Daily yields (1Y-10Y) over six months
  • State Street XLK ETF - Daily prices of 30 largest technology stocks (Apr - Oct 2025)

Techniques:

  • Yield Curve Modelling: Nelson–Siegel (parametric) vs Cubic Spline (non-parametric)
  • Factor Analysis: PCA and Scree plots on yield changes and stock returns
  • Decomposition: Singular Value Decomposition (SVD) to verify PCA results
In [3]:
#Setup environment
import os
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

!pip install -q ipykernel nbconvert scikit-learn openpyxl nelson_siegel_svensson
from nelson_siegel_svensson.calibrate import calibrate_ns_ols
from scipy.interpolate import CubicSpline
from sklearn.metrics import mean_squared_error

1. Yield Curve Modeling¶

We select the HNX government securities from our country, Vietnam, and use the official HNX data export (HNX, “Yield Curve”) in Excel format for a specific settlement date. The file contains maturities (in Vietnamese), spot rates (continuous), par yields, and spot rates (annual).

Then, we load the Excel file, extract the table, and create both:

  • a string tenor column for plotting (e.g., “3M”, “1Y”) and
  • a numeric maturity in years for modeling (e.g., 0.25, 1, 2, …).

We use Spot Rate Annual for modeling.

In [5]:
notebook_dir = os.getcwd()
file_path = os.path.join(notebook_dir, "Bond_market_data_01_10_2025.xlsx")

df = pd.read_excel(file_path, header=None, skiprows = 4, usecols = [0,2,4,6])
df.head(10)
Out[5]:
0 2 4 6
0 Kỳ hạn còn lại Spot rate liên tục (%) Par Yield (%) Spot rate theo năm (%)
1 3 tháng 2.737066 NaN 2.774868
2 6 tháng 2.759724 NaN 2.798157
3 9 tháng 2.782112 NaN 2.821175
4 1 năm 2.804233 2.843921 2.843921
5 2 năm 2.890047 2.930938 2.932214
6 3 năm 2.971636 3.012858 3.01623
7 5 năm 3.122353 3.161771 3.17161
8 7 năm 3.256809 3.291521 3.310423
9 10 năm 3.428918 3.452486 3.488383

This data is in Vietnamese and has different formatting, so I need to clean and transform it before modelling:

In [7]:
#Convert tenor labels to English
def convert_vn_tenor_to_str(text):
    """
    Convert Vietnamese tenor labels (e.g. '3 tháng', '1 năm')
    into English-style labels ('3M', '1Y', etc.).
    """
    text = str(text).strip().lower()
    if "tháng" in text:
        number = int("".join(filter(str.isdigit, text)))
        return f"{number}M"
    elif "năm" in text:
        number = int("".join(filter(str.isdigit, text)))
        return f"{number}Y"
    else:
        return text

#Convert tenor to years for NS modelling
def tenor_to_years(s):
    s = str(s).upper().strip()
    if s.endswith("M"):
        return float(s[:-1]) / 12.0
    if s.endswith("Y"):
        return float(s[:-1])
    return np.nan

#Convert Numeric text to float
def to_float(x):
    try:
        return float(str(x).replace(",", "."))
    except:
        return None
In [8]:
df.columns = ['Maturity', 'Spotrate_Continuous', 'ParYield', 'Spotrate_Annual']
df["Maturity"] = df["Maturity"].apply(convert_vn_tenor_to_str) #Convert tenor names to English tenors

for col in ["ParYield"]:
    df[col] = df[col].map(to_float)

#Drop unecessary columns
df.drop(columns = ['ParYield', 'Spotrate_Continuous'],inplace = True)
df = df.rename(columns={"Years": "Maturity"})
df = df.drop(index=0).reset_index(drop=True)
df
Out[8]:
Maturity Spotrate_Annual
0 3M 2.774868
1 6M 2.798157
2 9M 2.821175
3 1Y 2.843921
4 2Y 2.932214
5 3Y 3.01623
6 5Y 3.17161
7 7Y 3.310423
8 10Y 3.488383
9 15Y 3.707359
10 20Y 3.834327

df.describe() #Exploratory Data Analysis

Exploratory Data Analysis:

The dataset contains 11 observations of annualised spot rates for Vietnam Treasury bonds across various maturities. The mean annual spot rate is approximately 3.15%, with a standard deviation of 0.38%, indicating moderate variation in yields across maturities.

The minimum rate is 2.77%, while the maximum reaches 3.83%, showing an upward-sloping yield curve pattern where longer maturities tend to have higher yields. The median rate (3.02%) aligns closely with the mean, suggesting a fairly symmetric distribution without extreme outliers.

Overall, the data reflects a typical, gradually increasing yield structure, consistent with a normal yield curve observed in stable economic conditions.

Now, let's plot the actual yield curve:

In [12]:
plt.figure(figsize = (9,5))

plt.plot(df['Maturity'], df['Spotrate_Annual'], marker = "o", linewidth = 1.8)

for i, (x, y) in enumerate(zip(df["Maturity"], df["Spotrate_Annual"])):
    offset = 0.05 #Format labels
    if i == len(df) - 1:  # last label
        plt.text(x, y -0.1, f"{y:.2f}", ha="left", fontsize=11, color="black")
    else:
        plt.text(x, y + 0.06, f"{y:.2f}", ha="center", fontsize=11, color="black")

#Naming title and axes
plt.xlabel("Maturity (Years)")
plt.ylabel("Yield [%]")
plt.title("Figure 1. Vietnam Treasury Bond (HNX) Yield Curve at 01-10-2025")
plt.tight_layout(pad=2)
plt.grid(True, alpha = 0.3)
plt.show()
No description has been provided for this image

After visualisation (Figure 1), the yield curve exhibits a normal upward-sloping shape, common for yield curves:

  • Short-term yields (3M-1Y) are relatively low, around 2.8%, reflecting lower compensation for short-term lending.
  • Medium-term yields (2Y-7Y) gradually increase toward 3.3%, suggesting moderate expectations of future economic growth or inflation.
  • Long-term yields (10Y-20Y) reach around 3.8%, indicating that investors demand a higher return for locking funds over extended periods.

Yield Curve Modeling using the Nelson-Siegel model¶

The Nelson-Siegel model is a popular method used to show how yields change with bond maturities:

$$y(t)=\beta_{0}+\beta_{1}\left( \frac{1-e^{-\lambda t}}{\lambda t} \right)+\beta_{2}\left( \frac{1-e^{^{-\lambda t}}}{\lambda t}-e^{-\lambda t} \right)+\epsilon$$

With:

  • $y(t)$: yield at maturity $t$
  • $\beta_{0}, \beta_{1}, \beta_{2}$, and $\tau$: Model parameters, they represent the level (long-term yield component), the slope (short-term yield component), the curvature (medium-term hump), and the decay rate (scale parameter) of the model, respectively
  • $\epsilon$: residual/error term

In this project, the Nelson–Siegel model was chosen because its parameters give us information about economic insights, meaning long-run rate expectations, short-term policy stance, and medium-term market sentiment. While splines or high-order regressions could perfectly interpolate observed points, they lack the economic insight and extrapolation reliability that Nelson–Siegel provides.

As a result, this model not only reproduces the observed upward-sloping curve with high accuracy but also enables a deeper understanding of how interest rate factors evolve across maturities.

In [16]:
from scipy.optimize import least_squares

df = df.dropna(subset=["Maturity", "Spotrate_Annual"])
df["Maturity_Years"] = df["Maturity"].map(tenor_to_years)
df = df.dropna(subset=["Maturity_Years"])

x = df["Maturity_Years"].astype(float).to_numpy()
y = df["Spotrate_Annual"].astype(float).to_numpy()

# Nelson-Siegel function
def nelson_siegel(maturity, beta0, beta1, beta2, tau):
    t = maturity / tau
    e = np.exp(-t)
    term1 = (1 - e) / t
    return beta0 + beta1 * term1 + beta2 * (term1 - e)

# Residuals
def residuals(params):
    return nelson_siegel(x, *params) - y

initial_guess = [3.0, -1.0, 0.5, 1.0]
bounds = ([0, -10, -10, 0.1], [10, 10, 10, 10])

# Fit
result = least_squares(residuals, x0=initial_guess, bounds=bounds)

beta0, beta1, beta2, tau = result.x
print(f"β0={beta0:.4f}, β1={beta1:.4f}, β2={beta2:.4f}, τ={tau:.4f}")
β0=4.3350, β1=-1.5796, β2=-0.8619, τ=4.1588

Using this model, we estimate the following parameters: $\beta_{0} = 4.3350$, $\beta_{1} = -1.5796$, $\beta_{2} = -0.8619$ and $\tau = 4.1588$.

Now we can fit the model and plot the fitting curve:

In [18]:
x_fit = np.linspace(min(x), max(x), 200)
y_fit = nelson_siegel(x_fit, beta0, beta1, beta2, tau)

plt.figure(figsize=(9,5))
plt.scatter(x, y*100, color="red", label="Actual Yields")
plt.plot(x_fit, y_fit*100, color="blue", label="Nelson-Siegel Fit")
plt.xlabel("Maturity (Years)")
plt.ylabel("Yield [%]")
plt.title("Figure 2. Vietnam Treasury Bond Yield Curve at 01-10-2025 - Nelson-Siegel Fit")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
No description has been provided for this image

The Nelson-Siegel model successfully estimated the parameters governing the shape of the Vietnam Treasury bond yield curve as of 01-10-2025 with:

$\beta_{0} = 4.3350$, $\beta_{1} = -1.5796$, $\beta_{2} = -0.8619$ and $\tau = 4.1588$.

  • $\beta_{0}$ (Level Factor) represents the long-term yield level toward which the curve converges as maturity increases. Here, the long-run interest rate stabilises around 4.33%, indicating a moderate expected rate environment.
  • $\beta_{1}$ (Slope Factor) determines the short-term steepness of the yield curve. A negative slope implies that short-term yields are lower than long-term yields, forming an upward-sloping curve; this is consistent with normal market conditions where investors demand higher yields for longer maturities.
  • $\beta_{2}$ (Curvature Factor) controls the hump shape or curvature of the yield curve at medium-term maturities. The negative value suggests that mid-term yields are slightly flatter, meaning the curve rises smoothly without a strong hump.
  • $\tau$ (Decay Rate) determines the maturity point where the curve transitions from steep to flat. A $\tau$ value ~4 implies that the curve flattens around four years, after which long-term yields stabilise.

Cubic Spline Fitting of Yield Curve¶

For comparison, we fit a Cubic Spline model (k = 3) to interpolate the yield curve across maturities. The maturities (0.25-20 years) are treated as the knot points, and the spline estimates a smooth function y(t)=f(t) that passes through all observed yields.

The fitted spline (red line) produces a smooth and continuous curve connecting the discrete yield observations. This model captures the overall upward-sloping pattern of the Vietnamese yield curve at 01-10-2025, while allowing for minor local curvature between maturities.

In [21]:
from scipy.interpolate import make_interp_spline

x = np.array([0.25, 0.5, 0.75, 1, 2, 3, 5, 7, 10, 15, 20])  # Maturities
y = df['Spotrate_Annual'].values  # Yields

# Fit a cubic spline
spline = make_interp_spline(x, y, k=3)
x_smooth = np.linspace(x.min(), x.max(), 200)
y_smooth = spline(x_smooth)

#Plot the fitted curve
plt.figure(figsize=(8,5))
plt.plot(x, y, 'o', label="Observed Yields")
plt.plot(x_smooth, y_smooth, '-', label="Cubic Spline Fit", color='red')
plt.xlabel("Maturity (Years)")
plt.ylabel("Yield [%]")
plt.title("Figure 3. Vietnam Treasury Yield Curve at 01-10-2025 - Cubic Spline Fit")
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
No description has been provided for this image
In [22]:
#Extract coefficients
coefficients = spline.c
knots = spline.t

print(knots)
print(coefficients)
[ 0.25  0.25  0.25  0.25  0.75  1.    2.    3.    5.    7.   10.   20.
 20.   20.   20.  ]
[2.77486769 2.79048449 2.81363746 2.86671314 2.93292727 3.04493812
 3.1743708  3.33624889 3.63250271 3.7790624  3.83432704]

For the Cubic Spline model, the curve is defined by a series of local polynomial segments between maturity knots. The fitted spline includes 15 knot points (from 0.25 to 20 years) and the corresponding spline coefficients ($\alpha_{1}-\alpha_{11}$) representing the fitted yield levels at each maturity. For example: $\alpha_{1}$ = 2.7749 (0.25Y), $\alpha_{5}$ = 2.9329 (0.75Y), $\alpha_{9}$ = 3.6325 (5Y), $\alpha_{11}$ = 3.8343 (20Y).

Comparison between Cubic Spline and NS¶

In [25]:
from sklearn.metrics import mean_squared_error

rmse_ns = np.sqrt(mean_squared_error(y, nelson_siegel(x, beta0, beta1, beta2, tau)))
rmse_spline = np.sqrt(mean_squared_error(y, spline(x)))

print(f"NS RMSE: {rmse_ns:.6f},  Spline RMSE: {rmse_spline:.6f}")
NS RMSE: 0.003136,  Spline RMSE: 0.000000
  1. Fit comparison:

The Cubic Spline model achieves an RMSE of nearly zero, meaning it perfectly fits all observed yield points. This happens because the spline is an interpolating function, it passes exactly through every data point, minimising residual errors to almost nothing.
In contrast, the Nelson-Siegel model has a slightly higher RMSE (0.0031), which is expected. Since it is a parametric model, it smooths the data based on its functional form rather than perfectly interpolating each observation. This allows it to capture the general trend and curvature of the yield curve without overfitting small fluctuations.

  1. Interpretation:

The Nelson-Siegel model provides a smooth and interpretable fit to the yield curve, where each parameter $\beta_{0}, \beta_{1}, \beta_{2}$ and $\tau$ represents an economic aspect such as long-term level, short-term slope, and medium-term curvature. In contrast, the Cubic Spline model offers a near-perfect mathematical fit by interpolating through all data points, resulting in a lower RMSE but limited economic interpretability. Thus, while the spline excels in precision, the Nelson-Siegel model is preferred for understanding and analysing the underlying term structure of interest rates.

2. Exploiting Correlation¶

In this section, we aim to understand how yields move together across different maturities, which is crucial for both fixed-income investors and policymakers. A change in monetary policy or macroeconomic outlook often shifts the yield curve, as bond yields are not indepedent.

By studying the correlation structure of these movements, we can uncover the latent factors that drive most of the variation in interest rates.

This analysis focuses on daily yield changes for Australian government bonds with maturities from 1 to 10 years.
Instead of treating each maturity as a separate variable, we apply Principal Component Analysis (PCA) to identify the dominant underlying forces influencing yield dynamics.

PCA helps us reduce a complex, multi-dimensional dataset into a few orthogonal factors, typically interpreted as:

  • Level: parallel shifts of the whole curve,
  • Slope: steepening or flattening of short vs. long maturities,
  • Curvature: twists around the middle segment of the curve.

To assess how many of these factors are meaningful, we use a Scree Plot, which displays the proportion of variance explained by each principal component.

We used data from the Reserve Bank of Australia (RBA, F2: Australian Government Securities Yields). Five maturities were selected to represent the Australian government securities yield curve - 1-year, 2-year, 3-year, 5-year, and 10-year bonds, covering a six-month period.

In [29]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

#online link
#rba = pd.read_csv("https://www.rba.gov.au/statistics/tables/csv/f2-data.csv", skiprows=10, encoding='ISO-8859-1')

notebook_dir = os.getcwd()
file_path = os.path.join(notebook_dir, "f2-data.csv")

rba = pd.read_csv(file_path, skiprows=10, encoding='ISO-8859-1', low_memory=False)

cols = ["Series ID", "FCMYGBAGID", "FCMYGBAG2D", "FCMYGBAG3D", "FCMYGBAG5D", "FCMYGBAG10D"]
rba = rba[cols].dropna()

rba.columns = ["Date", "1Y", "2Y", "3Y", "5Y", "10Y"]
rba["Date"] = pd.to_datetime(rba["Date"])
rba = rba.set_index("Date")

yields = rba[rba.index >= "2025-04-01"]

head_rows = yields.head(5)
tail_rows = yields.tail(5)


display(pd.concat([head_rows, pd.DataFrame([["..."] * yields.shape[1]], columns=yields.columns), tail_rows]))
1Y 2Y 3Y 5Y 10Y
2025-04-01 00:00:00 2.22 3.665 3.69 3.842 4.391
2025-04-02 00:00:00 2.21 3.689 3.713 3.861 4.397
2025-04-03 00:00:00 2.061 3.527 3.549 3.697 4.242
2025-04-04 00:00:00 2.078 3.394 3.417 3.593 4.199
2025-04-07 00:00:00 2.092 3.252 3.272 3.47 4.081
0 ... ... ... ... ...
2025-09-25 00:00:00 2.049 3.491 3.54 3.738 4.356
2025-09-26 00:00:00 2.049 3.521 3.577 3.773 4.395
2025-09-29 00:00:00 2.004 3.475 3.529 3.724 4.342
2025-09-30 00:00:00 1.97 3.487 3.538 3.722 4.307
2025-10-01 00:00:00 1.996 3.519 3.569 3.756 4.357

To analyse movements in the yield curve, we compute the daily yield changes instead of using raw yield levels. This is done by taking the first difference of yields across consecutive dates using the .diff() command.

In [31]:
#Compute daily yield changes
yield_changes = yields.diff(axis = 0).dropna()

#Preview
yield_changes
Out[31]:
1Y 2Y 3Y 5Y 10Y
Date
2025-04-02 -0.010 0.024 0.023 0.019 0.006
2025-04-03 -0.149 -0.162 -0.164 -0.164 -0.155
2025-04-04 0.017 -0.133 -0.132 -0.104 -0.043
2025-04-07 0.014 -0.142 -0.145 -0.123 -0.118
2025-04-08 0.140 0.060 0.065 0.085 0.141
... ... ... ... ... ...
2025-09-25 0.038 0.043 0.039 0.043 0.059
2025-09-26 0.000 0.030 0.037 0.035 0.039
2025-09-29 -0.045 -0.046 -0.048 -0.049 -0.053
2025-09-30 -0.034 0.012 0.009 -0.002 -0.035
2025-10-01 0.026 0.032 0.031 0.034 0.050

127 rows × 5 columns

Then, the PCA was employed using the covariance matrix, as all yield changes are measured on the same scale, making standardisation unnecessary.

In [33]:
#Initiate the PCA model and fit the yield_changes
pca = PCA()
pca.fit(yield_changes)

#Display the variance contribution
explained_var = pca.explained_variance_ratio_
print("Explained Variance Ratio for each component:")
for i, var in enumerate(explained_var, start=1):
   print(f"Component {i}: {var:.4f} ({var*100:.2f}%)")
Explained Variance Ratio for each component:
Component 1: 0.8299 (82.99%)
Component 2: 0.1557 (15.57%)
Component 3: 0.0130 (1.30%)
Component 4: 0.0011 (0.11%)
Component 5: 0.0003 (0.03%)

Based on the results in (g), the first principal component (PC1) explains approximately 83% of the total variance, representing a parallel shift (level movement) of the entire yield curve. PC2 (around 15.6%) captures slope changes, the steepening or flattening of the curve. PC3 (about 1.4%) reflects curvature or twisting effects, while the remaining components contribute minimal explanatory power.

This pattern confirms that most yield movements are driven by a single dominant factor, consistent with the theoretical structure of government bond markets. To better visualize this variance contribution, let's look at the scree plot:

In [35]:
#Plotting the Scree Plot
plt.figure(figsize=(6,4))
plt.plot(range(1, len(explained_var)+1), explained_var, 'o-', linewidth=2)
plt.title("Scree Plot - Australian Government Bonds")
plt.xlabel("Principal Component")
plt.ylabel("Variance Explained")
plt.grid(True)
plt.show()
No description has been provided for this image

The government bond data scree plot shows that the first component dominates, explaining about 83% of the total variance. This pattern confirms that most yield movements are driven primarily by a single dominant factor (the level component), consistent with the three-factor theoretical structure of government bond markets, where slope and curvature play smaller secondary roles.

3. Empirical Analysis of ETFs¶

After exploring interest rate dynamics in the bond market, we extend the analysis to the equity domain by examining the correlation structure of assets within the State Street XLK Technology ETF.
This ETF tracks the performance of major U.S. technology companies such as Apple, Microsoft, and NVIDIA, providing a representative sample of the broader tech sector. The 30 largest holdings’ tickers were collected from State Street Global Advisors as of 1 October 2025.

Equity returns, like yields, tend to move together due to systematic market forces, such as macroeconomic conditions, sector-wide shocks, or investor sentiment. However, not all stocks react equally: some are more sensitive to common risk factors, while others are driven by firm-specific characteristics.

To uncover these underlying relationships, we apply PCA to the daily return matrix of the ETF’s top constituents. To help us quantify the total variation in stock returns explained by systemic factors, identify the dominant drivers of co-movement across assets, and differentiate between broad market effects and idiosyncratic noise.

Through a Scree Plot, we can assess how many meaningful components capture most of the ETF’s return behavior. Typically, a strong first component indicates a single market-wide factor dominating the entire sector, while smaller components reflect more specific sub-sector dynamics or company-level effects. Therefore, we can reveal the core economic forces driving asset co-movement by dimensionality reduction and factor extraction, bridging fixed income and equity markets.

In [38]:
#Import data
notebook_dir = os.getcwd()
file_path = os.path.join(notebook_dir, "holdings-daily-us-en-xlk.xlsx")

df = pd.read_excel(file_path, skiprows=4)
print("Columns in file:", df.columns.tolist())

df_top30 = df.sort_values(by='Weight', ascending=False).head(30).reset_index(drop=True)

print("Top 30 Holdings of XLK:")
display(df_top30[['Ticker', 'Name', 'Weight']])

tickers = df_top30['Ticker'].tolist()
print(tickers)
Columns in file: ['Name', 'Ticker', 'Identifier', 'SEDOL', 'Weight', 'Sector', 'Shares Held', 'Local Currency']
Top 30 Holdings of XLK:
Ticker Name Weight
0 NVDA NVIDIA CORP 14.824341
1 MSFT MICROSOFT CORP 12.330501
2 AAPL APPLE INC 12.273676
3 AVGO BROADCOM INC 5.116163
4 PLTR PALANTIR TECHNOLOGIES INC A 3.819037
5 ORCL ORACLE CORP 3.720152
6 AMD ADVANCED MICRO DEVICES 2.472807
7 CSCO CISCO SYSTEMS INC 2.413904
8 IBM INTL BUSINESS MACHINES CORP 2.397773
9 CRM SALESFORCE INC 2.050192
10 MU MICRON TECHNOLOGY INC 1.844963
11 INTU INTUIT INC 1.707569
12 NOW SERVICENOW INC 1.699435
13 LRCX LAM RESEARCH CORP 1.670124
14 APP APPLOVIN CORP CLASS A 1.661486
15 QCOM QUALCOMM INC 1.634886
16 AMAT APPLIED MATERIALS INC 1.610849
17 TXN TEXAS INSTRUMENTS INC 1.488057
18 INTC INTEL CORP 1.465687
19 ACN ACCENTURE PLC CL A 1.366281
20 APH AMPHENOL CORP CL A 1.354537
21 KLAC KLA CORP 1.349615
22 ADBE ADOBE INC 1.338528
23 ANET ARISTA NETWORKS INC 1.336609
24 PANW PALO ALTO NETWORKS INC 1.255049
25 CRWD CROWDSTRIKE HOLDINGS INC A 1.111665
26 ADI ANALOG DEVICES INC 1.076651
27 CDNS CADENCE DESIGN SYS INC 0.849445
28 SNPS SYNOPSYS INC 0.782688
29 MSI MOTOROLA SOLUTIONS INC 0.670122
['NVDA', 'MSFT', 'AAPL', 'AVGO', 'PLTR', 'ORCL', 'AMD', 'CSCO', 'IBM', 'CRM', 'MU', 'INTU', 'NOW', 'LRCX', 'APP', 'QCOM', 'AMAT', 'TXN', 'INTC', 'ACN', 'APH', 'KLAC', 'ADBE', 'ANET', 'PANW', 'CRWD', 'ADI', 'CDNS', 'SNPS', 'MSI']

Daily closing prices for the 30 constituent stocks of the XLK ETF were collected from Yahoo Finance using the yfinance Python library over a six-month period, based on the tickets identified in part a, covering the period from April to October 2025.

In [40]:
#Import relevant packages
import pandas as pd
import yfinance as yf
from datetime import datetime

#Specify the date range
end_date = "2025-10-01"
start_date = "2025-04-01" 

#Download the data
data = yf.download(tickers, start=start_date, end=end_date, interval='1d', auto_adjust=True)['Close']

data = data.dropna(how='all')

print("6-month close prices (last 5 days):")
display(data.tail())
[*********************100%***********************]  30 of 30 completed
6-month close prices (last 5 days):

Ticker AAPL ACN ADBE ADI AMAT AMD ANET APH APP AVGO ... MSI MU NOW NVDA ORCL PANW PLTR QCOM SNPS TXN
Date
2025-09-24 252.309998 239.080002 353.269989 248.610001 201.440002 160.880005 142.639999 123.129997 641.919983 339.309998 ... 455.130005 161.608795 933.369995 176.970001 307.925629 200.699997 179.559998 173.550003 468.089996 182.808289
2025-09-25 256.869995 232.559998 354.160004 247.529999 199.600006 161.270004 143.059998 122.330002 639.909973 336.100006 ... 455.730011 156.731857 918.609985 177.690002 290.825287 202.210007 179.119995 169.679993 487.200012 180.429520
2025-09-26 255.460007 238.970001 360.369995 247.559998 203.919998 159.460007 142.500000 122.599998 669.859985 334.529999 ... 456.519989 157.171570 936.000000 178.190002 282.968933 202.369995 177.570007 169.199997 487.760010 182.917328
2025-09-29 254.429993 247.000000 359.420013 244.789993 204.949997 161.360001 143.369995 121.010002 712.359985 327.899994 ... 454.179993 163.797424 940.849976 181.850006 282.270172 203.960007 178.860001 165.300003 481.609985 181.608994
2025-09-30 254.630005 246.600006 352.750000 245.699997 204.740005 161.789993 145.710007 123.750000 718.539978 329.910004 ... 457.290009 167.215286 920.280029 186.580002 280.752777 203.619995 182.419998 166.360001 493.390015 182.104568

5 rows × 30 columns

In [41]:
#EDA on the raw data
summary_stats = data.describe().T.round(2)

print("Summary Statistics of 30 XLK Holdings (6-Month Close Prices):")
display(summary_stats)
Summary Statistics of 30 XLK Holdings (6-Month Close Prices):
count mean std min 25% 50% 75% max
Ticker
AAPL 126.0 213.89 17.29 172.00 201.15 209.85 227.61 256.87
ACN 126.0 281.41 27.31 232.56 255.38 283.81 304.78 321.60
ADBE 126.0 371.83 23.33 333.65 352.88 367.12 384.71 420.68
ADI 126.0 223.54 23.42 163.21 213.67 229.14 244.05 254.62
AMAT 126.0 169.63 18.70 126.23 157.19 168.39 184.95 204.95
AMD 126.0 135.54 29.81 78.21 111.06 138.47 161.34 184.42
ANET 126.0 107.98 24.93 64.37 90.48 101.53 135.47 153.04
APH 126.0 95.80 17.26 58.90 85.38 97.80 109.31 125.40
APP 126.0 393.76 111.76 219.37 336.17 365.33 438.64 718.54
AVGO 126.0 262.04 55.16 145.70 229.00 270.52 299.23 368.94
CDNS 126.0 317.82 33.91 231.64 297.58 318.71 348.99 373.37
CRM 126.0 259.04 14.05 231.26 246.59 259.80 268.85 290.18
CRWD 126.0 444.34 41.59 321.63 423.56 444.21 476.27 514.10
CSCO 126.0 64.28 4.67 52.55 62.29 66.56 67.74 71.36
IBM 126.0 258.15 18.99 218.09 242.27 256.92 278.15 292.80
INTC 126.0 22.50 3.20 18.13 20.32 21.73 23.64 35.50
INTU 126.0 697.32 67.47 541.40 654.18 697.43 761.52 805.92
KLAC 126.0 842.31 119.53 573.90 762.05 876.60 917.60 1078.60
LRCX 126.0 92.55 17.31 58.83 82.13 96.55 100.56 133.90
MSFT 126.0 472.03 49.66 353.33 451.48 495.47 509.17 534.76
MSI 126.0 433.41 24.52 392.79 415.35 423.68 456.32 489.19
MU 126.0 110.84 25.67 64.62 94.80 114.14 123.05 168.78
NOW 126.0 938.54 77.47 721.65 900.20 952.69 1005.07 1044.69
NVDA 126.0 150.45 27.87 94.30 132.04 157.86 176.09 186.58
ORCL 126.0 208.23 54.68 122.35 157.22 220.37 244.40 327.76
PANW 126.0 188.61 13.03 152.44 181.33 192.04 198.28 208.19
PLTR 126.0 140.03 28.18 74.01 123.33 140.69 158.29 186.97
QCOM 126.0 151.98 9.63 123.21 145.91 153.42 158.34 173.55
SNPS 126.0 516.39 73.19 380.90 466.18 501.10 595.16 645.35
TXN 126.0 183.93 19.09 142.07 177.45 184.28 196.50 217.72

Daiy returns¶

The daily return of stock i on day t is computed as:

$$ r_{i,t} = \frac{P_{i,t} - P_{i,t-1}}{P_{i,t-1}} $$

where:

  • $r_{i,t}$ : daily return of stock i at time t
  • $P_{i,t}$ : closing price of stock i on day t
  • $P_{i,t-1}$ : closing price of stock i on the previous day
In [43]:
#Compute the daily returns
returns = data.pct_change().dropna().round(4)

#Preview the data
head_rows = returns.head(5)
tail_rows = returns.tail(5)

ellipsis_row = pd.DataFrame([["..."] * returns.shape[1]], columns=returns.columns, index=["..."])

display(pd.concat([head_rows, ellipsis_row, tail_rows]).T)
2025-04-02 00:00:00 2025-04-03 00:00:00 2025-04-04 00:00:00 2025-04-07 00:00:00 2025-04-08 00:00:00 ... 2025-09-24 00:00:00 2025-09-25 00:00:00 2025-09-26 00:00:00 2025-09-29 00:00:00 2025-09-30 00:00:00
Ticker
AAPL 0.0031 -0.0925 -0.0729 -0.0367 -0.0498 ... -0.0083 0.0181 -0.0055 -0.004 0.0008
ACN 0.0088 -0.047 -0.0544 -0.0012 -0.0117 ... 0.0152 -0.0273 0.0276 0.0336 -0.0016
ADBE 0.0067 -0.048 -0.0495 -0.024 -0.0021 ... -0.0235 0.0025 0.0175 -0.0026 -0.0186
ADI 0.0021 -0.0937 -0.09 0.0409 -0.0306 ... 0.0074 -0.0043 0.0001 -0.0112 0.0037
AMAT 0.0143 -0.0828 -0.0632 0.0465 -0.0293 ... 0.0028 -0.0091 0.0216 0.0051 -0.001
AMD 0.0018 -0.089 -0.0857 -0.0247 -0.0649 ... -0.0001 0.0024 -0.0112 0.0119 0.0027
ANET 0.0213 -0.1109 -0.0968 0.059 0.0195 ... -0.0101 0.0029 -0.0039 0.0061 0.0163
APH 0.0277 -0.0772 -0.057 0.0306 -0.0136 ... -0.0181 -0.0065 0.0022 -0.013 0.0226
APP 0.0272 -0.0978 -0.1626 0.0586 0.0132 ... -0.0142 -0.0031 0.0468 0.0634 0.0087
AVGO 0.0212 -0.1051 -0.0501 0.0537 0.0123 ... 0.0011 -0.0095 -0.0047 -0.0198 0.0061
CDNS 0.0238 -0.0605 -0.0644 0.004 -0.0093 ... -0.0255 -0.0165 -0.0027 -0.0045 0.0079
CRM 0.005 -0.0601 -0.0567 0.0143 -0.0009 ... 0.0054 -0.0201 0.0103 0.0069 -0.033
CRWD 0.0251 -0.0649 -0.0742 0.0085 0.0021 ... -0.0161 -0.0068 0.0176 0.0146 0.004
CSCO 0.0003 -0.0668 -0.0483 -0.0024 -0.0224 ... -0.0033 0.0079 -0.0093 0.0074 0.0103
IBM -0.0014 -0.026 -0.0658 -0.0075 -0.021 ... -0.0173 0.052 0.0102 -0.0159 0.0084
INTC -0.0032 0.0205 -0.115 -0.0141 -0.0736 ... 0.0641 0.0887 0.0444 -0.0287 -0.027
INTU 0.0116 -0.036 -0.0618 -0.0094 -0.0219 ... -0.0063 -0.003 0.0081 -0.0051 -0.017
KLAC 0.0055 -0.0953 -0.0713 0.0487 -0.0085 ... -0.0024 -0.009 0.0049 -0.0002 0.0136
LRCX 0.013 -0.116 -0.094 0.0526 -0.0314 ... -0.0254 -0.0015 0.0016 0.0215 0.0214
MSFT -0.0001 -0.0236 -0.0356 -0.0055 -0.0092 ... 0.0018 -0.0061 0.0087 0.0061 0.0065
MSI 0.0023 -0.0034 -0.0766 0.0025 -0.0208 ... -0.0331 0.0013 0.0017 -0.0051 0.0068
MU -0.0012 -0.1609 -0.1294 0.0564 -0.0414 ... -0.0282 -0.0302 0.0028 0.0422 0.0209
NOW 0.0154 -0.0606 -0.0677 0.0192 -0.0107 ... 0.0061 -0.0158 0.0189 0.0052 -0.0219
NVDA 0.0025 -0.0781 -0.0736 0.0353 -0.0137 ... -0.0082 0.0041 0.0028 0.0205 0.026
ORCL 0.0276 -0.0592 -0.0653 -0.0087 -0.0209 ... -0.0171 -0.0555 -0.027 -0.0025 -0.0054
PANW 0.0109 -0.0463 -0.0702 -0.0074 0.0006 ... -0.0125 0.0075 0.0008 0.0079 -0.0017
PLTR 0.0327 -0.044 -0.1147 0.0517 -0.0067 ... -0.0164 -0.0025 -0.0087 0.0073 0.0199
QCOM 0.0067 -0.0951 -0.0858 0.0177 -0.039 ... 0.0237 -0.0223 -0.0028 -0.023 0.0064
SNPS 0.006 -0.0474 -0.0709 -0.0186 0.0018 ... -0.0453 0.0408 0.0011 -0.0126 0.0245
TXN 0.0011 -0.0785 -0.078 0.0172 -0.0519 ... 0.0132 -0.013 0.0138 -0.0072 0.0027

Covariance Matrix¶

Now, let's compute the covariance matrix from the daily returns of the 30 XLK holdings to measure how their returns move together.

In [45]:
cov_matrix = returns.cov()

print("✅ Covariance matrix (first 5x5):")
display(cov_matrix.iloc[:10, :10])
✅ Covariance matrix (first 5x5):
Ticker AAPL ACN ADBE ADI AMAT AMD ANET APH APP AVGO
Ticker
AAPL 0.000620 0.000223 0.000258 0.000471 0.000471 0.000584 0.000403 0.000315 0.000494 0.000419
ACN 0.000223 0.000352 0.000203 0.000309 0.000288 0.000264 0.000262 0.000155 0.000353 0.000226
ADBE 0.000258 0.000203 0.000331 0.000290 0.000245 0.000282 0.000253 0.000167 0.000299 0.000238
ADI 0.000471 0.000309 0.000290 0.000776 0.000664 0.000747 0.000482 0.000406 0.000633 0.000596
AMAT 0.000471 0.000288 0.000245 0.000664 0.000961 0.000757 0.000499 0.000415 0.000624 0.000634
AMD 0.000584 0.000264 0.000282 0.000747 0.000757 0.001373 0.000468 0.000474 0.000752 0.000707
ANET 0.000403 0.000262 0.000253 0.000482 0.000499 0.000468 0.001239 0.000449 0.000842 0.000681
APH 0.000315 0.000155 0.000167 0.000406 0.000415 0.000474 0.000449 0.000453 0.000540 0.000454
APP 0.000494 0.000353 0.000299 0.000633 0.000624 0.000752 0.000842 0.000540 0.001960 0.000767
AVGO 0.000419 0.000226 0.000238 0.000596 0.000634 0.000707 0.000681 0.000454 0.000767 0.001005

The table is difficult to understand intuitively, so we can visualize it:

In [47]:
import matplotlib.pyplot as plt
import seaborn as sns

cov_matrix = returns.cov()

plt.figure(figsize=(12, 8))
sns.heatmap(cov_matrix,
            cmap='RdBu_r',        
            center=0,              
            annot=False,           
            square=True,
            cbar_kws={'label': 'Covariance'})
plt.title("Covariance Matrix of XLK 30 Holdings", fontsize=14)
plt.tight_layout()
plt.show()
No description has been provided for this image

The covariance matrix illustrates how daily returns of XLK’s 30 technology holdings move together. The darker the color, the higher co-movement between two securities. Most cells show positive covariances (in light to dark red), indicating that the majority of stocks tend to rise and fall in tandem, reflects a strong sector-wide co-movement, typical of large-cap technology firms.

A few darker red blocks along the diagonal highlight pairs or subgroups with particularly strong relationships, such as semiconductor stocks (e.g., NVIDIA, AMD, and TXN) and software giants (e.g., Microsoft and Adobe). This clustering suggests the presence of industry-specific factors in addition to the broad market trend.

Meanwhile, a handful of light or slightly blue areas represent weak or mildly negative covariances, meaning certain stocks move somewhat independently from the rest, possibly due to differing business models or diversification within the ETF.

Let's see the PCA result on daily returns:

In [49]:
#Initiate the PCA model and fit the returns
pca = PCA()
pca.fit(returns)

#Result of variance contribution
explained_var = pca.explained_variance_ratio_

print("Explained Variance Ratio for each component:")
for i, var in enumerate(explained_var, start=1):
    print(f"Component {i}: {var:.4f} ({var*100:.2f}%)")
print(f"The first 10 component explain:", round((sum(explained_var[:10])*100),2),"% of the total variance")
Explained Variance Ratio for each component:
Component 1: 0.5064 (50.64%)
Component 2: 0.1154 (11.54%)
Component 3: 0.0614 (6.14%)
Component 4: 0.0456 (4.56%)
Component 5: 0.0391 (3.91%)
Component 6: 0.0308 (3.08%)
Component 7: 0.0264 (2.64%)
Component 8: 0.0217 (2.17%)
Component 9: 0.0200 (2.00%)
Component 10: 0.0169 (1.69%)
Component 11: 0.0154 (1.54%)
Component 12: 0.0134 (1.34%)
Component 13: 0.0111 (1.11%)
Component 14: 0.0097 (0.97%)
Component 15: 0.0085 (0.85%)
Component 16: 0.0075 (0.75%)
Component 17: 0.0070 (0.70%)
Component 18: 0.0059 (0.59%)
Component 19: 0.0052 (0.52%)
Component 20: 0.0045 (0.45%)
Component 21: 0.0044 (0.44%)
Component 22: 0.0042 (0.42%)
Component 23: 0.0039 (0.39%)
Component 24: 0.0030 (0.30%)
Component 25: 0.0030 (0.30%)
Component 26: 0.0026 (0.26%)
Component 27: 0.0023 (0.23%)
Component 28: 0.0019 (0.19%)
Component 29: 0.0017 (0.17%)
Component 30: 0.0013 (0.13%)
The first 10 component explain: 88.37 % of the total variance

As we can see, the first 10 components explain 88.37% of the total variance. We can visualize this with a Scree Plot:

In [51]:
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(explained_var) + 1), explained_var * 100, 'o-', linewidth=2)
plt.title("Scree Plot - Variance Explained by Principal Components", fontsize=12)
plt.xlabel("Principal Component", fontsize=11)
plt.ylabel("Variance Explained (%)", fontsize=11)
plt.xticks(np.arange(1, len(explained_var) + 1))
plt.grid(True, alpha=0.3)
plt.show()
No description has been provided for this image

Singular Value Decomposition¶

Alternatively, we can use Singular Value Decomposition (SVD) to decompose the return matrix into orthogonal components. This is a numerically stable method t perform PCA directly on the returns matrix.

In [53]:
#Calculate X, U, S, VT matrices
X = returns - returns.mean()

U, S, VT = np.linalg.svd(X, full_matrices=False)
print("Shapes:")
print("U:", U.shape, " | S:", S.shape, " | VT:", VT.shape)

#Compute the Eigenvalues
eigenvalues = (S**2) / (len(X)-1)
print("\n First 5 singular values:")
print(S[:5])

print("\n Corresponding eigenvalues (variance explained):")
print(eigenvalues[:5])
Shapes:
U: (125, 30)  | S: (30,)  | VT: (30, 30)

 First 5 singular values:
[1.30000094 0.62049368 0.45254319 0.38993217 0.36134511]

 Corresponding eigenvalues (variance explained):
[0.01362905 0.00310494 0.00165158 0.00122619 0.00105299]

The return matrix (125 days × 30 stocks) was decomposed into three matrices:

  • U (125 × 30) represents the time-series weights of each component.
  • S (30,) contains the singular values, showing the strength of each factor.
  • Vᵀ (30 × 30) gives the loadings of each stock on the components.

The first five singular values are [1.30, 0.62, 0.45, 0.39, 0.36], indicating that the first few components dominate the structure. The corresponding eigenvalues [0.0136, 0.0031, 0.0017, 0.0012, 0.0011] confirm that most of the total variance is captured by the leading components, consistent with the PCA results.

Conclusion¶

This project explored the structure and dynamics of financial markets through three complementary analyses:
(1) modeling yield curves, (2) uncovering correlation patterns in bond yields, and (3) identifying common factors in equity returns. Together, these steps demonstrate how mathematical models and statistical tools can transform raw financial data into interpretable insights about market behavior.

In the first part, both the Nelson–Siegel model and Cubic Spline were used to fit the Vietnamese Treasury yield curve. Despite limited observations, the model effectively captured the smooth, upward-sloping term structure using only four economically meaningful parameters: level, slope, curvature, and decay. This shows its strength in both interpretability and parsimony.

Next, by analyzing Australian government bond yield changes with Principal Component Analysis (PCA), we revealed that most of the variation (≈98%) is driven by just two or three latent factors. These correspond to well-known fixed-income movements: level shifts, slope changes, and curvature twists, confirming the existence of a low-dimensional structure in bond market dynamics.

Finally, the Empirical Analysis of the XLK ETF extended the same framework to the equity market. The covariance matrix and PCA results demonstrated that technology stocks exhibit strong co-movement, dominated by a single market-wide component explaining over 50% of total variance, followed by smaller sub-sector and firm-specific effects. Alternatively, Singular Value Decomposition (SVD) confirmed this hierarchy, showing that only a few orthogonal components account for most of the market’s behavior.

Across both fixed income and equity domains, the findings emphasize a consistent principle:

Financial systems, while high-dimensional in appearance, are governed by a small number of underlying factors that capture most of their dynamics.